Background on Lab 01

The first lab saw my team exploring the 2014 National Survey on Drug Use and Health and seeking to examine what the consequences of using marijuana meant. Particularly, we wanted to prove if marijuana usage was correlated with cocaine usage and trying stronger drugs, serving as a type of “gateway” drug.

The work below involves manipulating data to get levels of cocaine and marijuana usage and then plotting that data.

Introduction

Our question concerns whether marijuana use is tied to cocaine use. Specifically, we want to compare whether different levels of marijuana users (or non-users) ever try cocaine or become regular cocaine users.

Data

To address this question, we will use data from the 2014 National Survey on Drug Use and Health, which is conducted annually by the Substance Abuse and Mental Health Services Administration (SAMHSA), an agency in the U.S. Department of Health and Human Services (DHHS).

Let’s start by loading in the data. The libraries we use for loading and working with the data are tidyverse, forcats and gridExtra. Note gridExtra will have to be installed to be used.

library(tidyverse)
## ── Attaching packages ──────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0     ✔ purrr   0.2.4
## ✔ tibble  1.4.2     ✔ dplyr   0.7.8
## ✔ tidyr   0.8.1     ✔ stringr 1.3.1
## ✔ readr   1.1.1     ✔ forcats 0.3.0
## ── Conflicts ─────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(forcats)
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
load("36361-0001-Data.rda")

Data Manipulation

For the purposes of this analysis, we will arbitarily define use levels as follows:

Non-User: has never tried [substance]

Rare User: has tried [substance] but has not reported using [substance] in the past 30 days

Occasional User: has used [substance] 1-3 times in the past 30 days (less than once a week)

Regular User: has used [substance] 4-12 times in the past 30 days (~1-3 times a week)

Heavy User: has used [substance] > 12 times in the past 30 days (>3 times a week)

UNKNOWN: either the user did not know or refused to answer as to whether or not they tried [substance]. This is used for bad data and or logically incorrect answers from users and we will disregard these answers.

Next we define our criteria in our code and use it to separate our data. We also remove the object da36361 to avoid retyping that variable name and subset our data using only the variables we need.

alldata <- da36361.0001
remove(da36361.0001)

# subset the data using only the variables that we need for our question
data = alldata[,c("MJEVER", "MJDAY30A", "COCEVER", "COCUS30A")]

# our criteria
occasionallow <- 1
occasionalregular <- 3
regularheavy <- 12

Now we paramaterize the marijuana user statuses using our defined criteria code. We will create a new variable called mj_user_status and based on our criteria assign values to each user.

# Code to create marijuana user statuses
data <- mutate(data, mj_user_status = 
                 ifelse(MJEVER == "(2) No", "Non-User", 
                        ifelse(MJEVER == "(1) Yes" & is.na(MJDAY30A), "Rare User", 
                               ifelse(MJDAY30A >= occasionallow & MJDAY30A <= occasionalregular, "Occasional User", 
                                      ifelse(MJDAY30A > occasionalregular & MJDAY30A <= regularheavy, "Regular User", "Heavy User")))))

# Code to validate if the user did not answer if they tried marijuana or not
data <- mutate(data, mj_user_status = 
                 ifelse(is.na(MJEVER), "UNKNOWN", mj_user_status))

Next we paramaterize the cocaine user statuses using the same method as above. We will create a new variable called coc_user_status and based on our criteria assign values to each user.

# Code to create cocaine user statuses
data <- mutate(data, coc_user_status = 
                 ifelse(COCEVER == "(2) No", "Non-User",
                        ifelse(COCEVER == "(1) Yes" & is.na(COCUS30A), "Rare User",
                               ifelse(COCUS30A >= occasionallow & COCUS30A <= occasionalregular, "Occasional User",
                                      ifelse(COCUS30A > occasionalregular & COCUS30A <= regularheavy, "Regular User", "Heavy User")))))

# Code to validate if the user did not answer if they tried cocaine or not
data <- mutate(data, coc_user_status = 
                 ifelse(is.na(COCEVER), "UNKNOWN", coc_user_status))

Now we will subset our data so it only displays the use status for marijuana and cocaine. We will then count that data

data = data[,c("mj_user_status", "coc_user_status")]

options(tibble.print_max = Inf)
count(data, data$mj_user_status, data$coc_user_status)
## # A tibble: 29 x 3
##    `data$mj_user_status` `data$coc_user_status`     n
##    <chr>                 <chr>                  <int>
##  1 Heavy User            Heavy User                17
##  2 Heavy User            Non-User                1480
##  3 Heavy User            Occasional User          101
##  4 Heavy User            Rare User               1193
##  5 Heavy User            Regular User              26
##  6 Non-User              Non-User               31614
##  7 Non-User              Occasional User            4
##  8 Non-User              Rare User                156
##  9 Non-User              Regular User               3
## 10 Non-User              UNKNOWN                    6
## 11 Occasional User       Heavy User                 4
## 12 Occasional User       Non-User                1188
## 13 Occasional User       Occasional User           33
## 14 Occasional User       Rare User                408
## 15 Occasional User       Regular User               7
## 16 Occasional User       UNKNOWN                    1
## 17 Rare User             Heavy User                 8
## 18 Rare User             Non-User               13463
## 19 Rare User             Occasional User           54
## 20 Rare User             Rare User               4145
## 21 Rare User             Regular User              21
## 22 Rare User             UNKNOWN                    6
## 23 Regular User          Heavy User                 3
## 24 Regular User          Non-User                 856
## 25 Regular User          Occasional User           33
## 26 Regular User          Rare User                391
## 27 Regular User          Regular User              23
## 28 UNKNOWN               Non-User                  17
## 29 UNKNOWN               UNKNOWN                   10

As you can see from the count table, there are only 40 respondents that have at least one label as “Unknown User” out of 55271 total respondents (in other words, where the value is NA). So we will remove these unknown values, effectively “NAs” on whether or not someone tried marijuana or cocaine. This should not impact the overall results as it is a minimal amount we are removing.

#assign all final values to a final variable called final_data
final_data = data

#remove unknowns
final_data = final_data[final_data$mj_user_status != "UNKNOWN",]
final_data = final_data[final_data$coc_user_status != "UNKNOWN",]

Just one more thing we need to do to make the plot more appropriate: We need to order the variables logically.

attach(final_data)
ordered_mj <- factor(mj_user_status, 
                     levels=c("Non-User","Rare User", "Occasional User","Regular User","Heavy User"))

ordered_coc <- factor(coc_user_status, levels=c("Non-User","Rare User", "Occasional User","Regular User","Heavy User"))

Visual Display

Now we can plot the data as a stacked bar chart using relative frequency for each variable.

ggplot(final_data)+
  geom_bar(aes(x=ordered_mj,fill=ordered_coc),position="fill")+
  scale_fill_brewer(palette="YlGnBu",direction=-1,
                    labels=c("Non-User","Rare User", "Occasional User","Regular User","Heavy User"),
                    name="Cocaine User Status")+
  labs(x="Marijuana User Status",y="Relative Frequency",
       title="Cocaine User Status by Marijuana User Status")+
  scale_x_discrete(labels=c("Non-User","Rare User", "Occasional User","Regular User","Heavy User"))+
  theme_classic()

Numerical Verification

We now verify the values of the graph by counting the values based on user status.

count(final_data, final_data$mj_user_status,
      final_data$coc_user_status)
## # A tibble: 24 x 3
##    `final_data$mj_user_status` `final_data$coc_user_status`     n
##    <chr>                       <chr>                        <int>
##  1 Heavy User                  Heavy User                      17
##  2 Heavy User                  Non-User                      1480
##  3 Heavy User                  Occasional User                101
##  4 Heavy User                  Rare User                     1193
##  5 Heavy User                  Regular User                    26
##  6 Non-User                    Non-User                     31614
##  7 Non-User                    Occasional User                  4
##  8 Non-User                    Rare User                      156
##  9 Non-User                    Regular User                     3
## 10 Occasional User             Heavy User                       4
## 11 Occasional User             Non-User                      1188
## 12 Occasional User             Occasional User                 33
## 13 Occasional User             Rare User                      408
## 14 Occasional User             Regular User                     7
## 15 Rare User                   Heavy User                       8
## 16 Rare User                   Non-User                     13463
## 17 Rare User                   Occasional User                 54
## 18 Rare User                   Rare User                     4145
## 19 Rare User                   Regular User                    21
## 20 Regular User                Heavy User                       3
## 21 Regular User                Non-User                       856
## 22 Regular User                Occasional User                 33
## 23 Regular User                Rare User                      391
## 24 Regular User                Regular User                    23
dim(final_data)
## [1] 55231     2
values = table(ordered_mj, ordered_coc)
values
##                  ordered_coc
## ordered_mj        Non-User Rare User Occasional User Regular User
##   Non-User           31614       156               4            3
##   Rare User          13463      4145              54           21
##   Occasional User     1188       408              33            7
##   Regular User         856       391              33           23
##   Heavy User          1480      1193             101           26
##                  ordered_coc
## ordered_mj        Heavy User
##   Non-User                 0
##   Rare User                8
##   Occasional User          4
##   Regular User             3
##   Heavy User              17
colSums(values)
##        Non-User       Rare User Occasional User    Regular User 
##           48601            6293             225              80 
##      Heavy User 
##              32
sum(colSums(values))
## [1] 55231
sum(rowSums(values))
## [1] 55231
round(prop.table(values, margin=1),2)
##                  ordered_coc
## ordered_mj        Non-User Rare User Occasional User Regular User
##   Non-User            0.99      0.00            0.00         0.00
##   Rare User           0.76      0.23            0.00         0.00
##   Occasional User     0.72      0.25            0.02         0.00
##   Regular User        0.66      0.30            0.03         0.02
##   Heavy User          0.53      0.42            0.04         0.01
##                  ordered_coc
## ordered_mj        Heavy User
##   Non-User              0.00
##   Rare User             0.00
##   Occasional User       0.00
##   Regular User          0.00
##   Heavy User            0.01
detach()

Results

Based on the graph, the majority of users across levels of marijuana usage have never tried cocaine (99% for non-users, and 53% for regular users). As the level of marijuana usage increases, the more occurrences and intensity of cocaine usage increase as well.

Conclusions/Discussion

There is an apparent positive correlation between marijuana use and cocaine use. Also, there is marked jump from marijuana non-users to rare users of marijuana in their cocaine use. This seems to suggest that people who have tried marijuana at least once are much more likely to have tried cocaine at least once than people who have never used marijuana at all. The increase in cocaine users across the x-axis is more gradual after that point.

While informative, the plot we created spends a lot of graphical real estate on non-users. It might be interesting or helpful to augment our plot such that the non-users aren’t shown and the y-axis is augmented to show cocaine use up until a smaller proportion (say 50%). This representation would provide a better visual summary of different levels of cocaine user statuses.