The first lab saw my team exploring the 2014 National Survey on Drug Use and Health and seeking to examine what the consequences of using marijuana meant. Particularly, we wanted to prove if marijuana usage was correlated with cocaine usage and trying stronger drugs, serving as a type of “gateway” drug.
The work below involves manipulating data to get levels of cocaine and marijuana usage and then plotting that data.
Our question concerns whether marijuana use is tied to cocaine use. Specifically, we want to compare whether different levels of marijuana users (or non-users) ever try cocaine or become regular cocaine users.
To address this question, we will use data from the 2014 National Survey on Drug Use and Health, which is conducted annually by the Substance Abuse and Mental Health Services Administration (SAMHSA), an agency in the U.S. Department of Health and Human Services (DHHS).
Let’s start by loading in the data. The libraries we use for loading and working with the data are tidyverse, forcats and gridExtra. Note gridExtra will have to be installed to be used.
library(tidyverse)
## ── Attaching packages ──────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0 ✔ purrr 0.2.4
## ✔ tibble 1.4.2 ✔ dplyr 0.7.8
## ✔ tidyr 0.8.1 ✔ stringr 1.3.1
## ✔ readr 1.1.1 ✔ forcats 0.3.0
## ── Conflicts ─────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(forcats)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
load("36361-0001-Data.rda")
For the purposes of this analysis, we will arbitarily define use levels as follows:
Non-User: has never tried [substance]
Rare User: has tried [substance] but has not reported using [substance] in the past 30 days
Occasional User: has used [substance] 1-3 times in the past 30 days (less than once a week)
Regular User: has used [substance] 4-12 times in the past 30 days (~1-3 times a week)
Heavy User: has used [substance] > 12 times in the past 30 days (>3 times a week)
UNKNOWN: either the user did not know or refused to answer as to whether or not they tried [substance]. This is used for bad data and or logically incorrect answers from users and we will disregard these answers.
Next we define our criteria in our code and use it to separate our data. We also remove the object da36361 to avoid retyping that variable name and subset our data using only the variables we need.
alldata <- da36361.0001
remove(da36361.0001)
# subset the data using only the variables that we need for our question
data = alldata[,c("MJEVER", "MJDAY30A", "COCEVER", "COCUS30A")]
# our criteria
occasionallow <- 1
occasionalregular <- 3
regularheavy <- 12
Now we paramaterize the marijuana user statuses using our defined criteria code. We will create a new variable called mj_user_status and based on our criteria assign values to each user.
# Code to create marijuana user statuses
data <- mutate(data, mj_user_status =
ifelse(MJEVER == "(2) No", "Non-User",
ifelse(MJEVER == "(1) Yes" & is.na(MJDAY30A), "Rare User",
ifelse(MJDAY30A >= occasionallow & MJDAY30A <= occasionalregular, "Occasional User",
ifelse(MJDAY30A > occasionalregular & MJDAY30A <= regularheavy, "Regular User", "Heavy User")))))
# Code to validate if the user did not answer if they tried marijuana or not
data <- mutate(data, mj_user_status =
ifelse(is.na(MJEVER), "UNKNOWN", mj_user_status))
Next we paramaterize the cocaine user statuses using the same method as above. We will create a new variable called coc_user_status and based on our criteria assign values to each user.
# Code to create cocaine user statuses
data <- mutate(data, coc_user_status =
ifelse(COCEVER == "(2) No", "Non-User",
ifelse(COCEVER == "(1) Yes" & is.na(COCUS30A), "Rare User",
ifelse(COCUS30A >= occasionallow & COCUS30A <= occasionalregular, "Occasional User",
ifelse(COCUS30A > occasionalregular & COCUS30A <= regularheavy, "Regular User", "Heavy User")))))
# Code to validate if the user did not answer if they tried cocaine or not
data <- mutate(data, coc_user_status =
ifelse(is.na(COCEVER), "UNKNOWN", coc_user_status))
Now we will subset our data so it only displays the use status for marijuana and cocaine. We will then count that data
data = data[,c("mj_user_status", "coc_user_status")]
options(tibble.print_max = Inf)
count(data, data$mj_user_status, data$coc_user_status)
## # A tibble: 29 x 3
## `data$mj_user_status` `data$coc_user_status` n
## <chr> <chr> <int>
## 1 Heavy User Heavy User 17
## 2 Heavy User Non-User 1480
## 3 Heavy User Occasional User 101
## 4 Heavy User Rare User 1193
## 5 Heavy User Regular User 26
## 6 Non-User Non-User 31614
## 7 Non-User Occasional User 4
## 8 Non-User Rare User 156
## 9 Non-User Regular User 3
## 10 Non-User UNKNOWN 6
## 11 Occasional User Heavy User 4
## 12 Occasional User Non-User 1188
## 13 Occasional User Occasional User 33
## 14 Occasional User Rare User 408
## 15 Occasional User Regular User 7
## 16 Occasional User UNKNOWN 1
## 17 Rare User Heavy User 8
## 18 Rare User Non-User 13463
## 19 Rare User Occasional User 54
## 20 Rare User Rare User 4145
## 21 Rare User Regular User 21
## 22 Rare User UNKNOWN 6
## 23 Regular User Heavy User 3
## 24 Regular User Non-User 856
## 25 Regular User Occasional User 33
## 26 Regular User Rare User 391
## 27 Regular User Regular User 23
## 28 UNKNOWN Non-User 17
## 29 UNKNOWN UNKNOWN 10
As you can see from the count table, there are only 40 respondents that have at least one label as “Unknown User” out of 55271 total respondents (in other words, where the value is NA). So we will remove these unknown values, effectively “NAs” on whether or not someone tried marijuana or cocaine. This should not impact the overall results as it is a minimal amount we are removing.
#assign all final values to a final variable called final_data
final_data = data
#remove unknowns
final_data = final_data[final_data$mj_user_status != "UNKNOWN",]
final_data = final_data[final_data$coc_user_status != "UNKNOWN",]
Just one more thing we need to do to make the plot more appropriate: We need to order the variables logically.
attach(final_data)
ordered_mj <- factor(mj_user_status,
levels=c("Non-User","Rare User", "Occasional User","Regular User","Heavy User"))
ordered_coc <- factor(coc_user_status, levels=c("Non-User","Rare User", "Occasional User","Regular User","Heavy User"))
Now we can plot the data as a stacked bar chart using relative frequency for each variable.
ggplot(final_data)+
geom_bar(aes(x=ordered_mj,fill=ordered_coc),position="fill")+
scale_fill_brewer(palette="YlGnBu",direction=-1,
labels=c("Non-User","Rare User", "Occasional User","Regular User","Heavy User"),
name="Cocaine User Status")+
labs(x="Marijuana User Status",y="Relative Frequency",
title="Cocaine User Status by Marijuana User Status")+
scale_x_discrete(labels=c("Non-User","Rare User", "Occasional User","Regular User","Heavy User"))+
theme_classic()
We now verify the values of the graph by counting the values based on user status.
count(final_data, final_data$mj_user_status,
final_data$coc_user_status)
## # A tibble: 24 x 3
## `final_data$mj_user_status` `final_data$coc_user_status` n
## <chr> <chr> <int>
## 1 Heavy User Heavy User 17
## 2 Heavy User Non-User 1480
## 3 Heavy User Occasional User 101
## 4 Heavy User Rare User 1193
## 5 Heavy User Regular User 26
## 6 Non-User Non-User 31614
## 7 Non-User Occasional User 4
## 8 Non-User Rare User 156
## 9 Non-User Regular User 3
## 10 Occasional User Heavy User 4
## 11 Occasional User Non-User 1188
## 12 Occasional User Occasional User 33
## 13 Occasional User Rare User 408
## 14 Occasional User Regular User 7
## 15 Rare User Heavy User 8
## 16 Rare User Non-User 13463
## 17 Rare User Occasional User 54
## 18 Rare User Rare User 4145
## 19 Rare User Regular User 21
## 20 Regular User Heavy User 3
## 21 Regular User Non-User 856
## 22 Regular User Occasional User 33
## 23 Regular User Rare User 391
## 24 Regular User Regular User 23
dim(final_data)
## [1] 55231 2
values = table(ordered_mj, ordered_coc)
values
## ordered_coc
## ordered_mj Non-User Rare User Occasional User Regular User
## Non-User 31614 156 4 3
## Rare User 13463 4145 54 21
## Occasional User 1188 408 33 7
## Regular User 856 391 33 23
## Heavy User 1480 1193 101 26
## ordered_coc
## ordered_mj Heavy User
## Non-User 0
## Rare User 8
## Occasional User 4
## Regular User 3
## Heavy User 17
colSums(values)
## Non-User Rare User Occasional User Regular User
## 48601 6293 225 80
## Heavy User
## 32
sum(colSums(values))
## [1] 55231
sum(rowSums(values))
## [1] 55231
round(prop.table(values, margin=1),2)
## ordered_coc
## ordered_mj Non-User Rare User Occasional User Regular User
## Non-User 0.99 0.00 0.00 0.00
## Rare User 0.76 0.23 0.00 0.00
## Occasional User 0.72 0.25 0.02 0.00
## Regular User 0.66 0.30 0.03 0.02
## Heavy User 0.53 0.42 0.04 0.01
## ordered_coc
## ordered_mj Heavy User
## Non-User 0.00
## Rare User 0.00
## Occasional User 0.00
## Regular User 0.00
## Heavy User 0.01
detach()
Based on the graph, the majority of users across levels of marijuana usage have never tried cocaine (99% for non-users, and 53% for regular users). As the level of marijuana usage increases, the more occurrences and intensity of cocaine usage increase as well.
There is an apparent positive correlation between marijuana use and cocaine use. Also, there is marked jump from marijuana non-users to rare users of marijuana in their cocaine use. This seems to suggest that people who have tried marijuana at least once are much more likely to have tried cocaine at least once than people who have never used marijuana at all. The increase in cocaine users across the x-axis is more gradual after that point.
While informative, the plot we created spends a lot of graphical real estate on non-users. It might be interesting or helpful to augment our plot such that the non-users aren’t shown and the y-axis is augmented to show cocaine use up until a smaller proportion (say 50%). This representation would provide a better visual summary of different levels of cocaine user statuses.