Note: Each class save this file with a new name, updating the last two digits to the class number. Then, you’ll have a record of your progress, as well as which files you turned in for grading.

Starting in Class 07, we will concatenate all our WSs together to retain the relevant information needed for subsequent classes. You will also have an opportunity to revisit previous parts to make changes or improvements, such as updating your codebook, modifying your research questions, improving tables and plots. I’ve provided an initial predicted organization of our sections and subsections using the # and ## symbols. A table of contents is automatically generated using the “toc: true” in the yaml and can headings in the table of contents are clickable to jump down to each (sub)section.

You will need to update any section symbols to be at least 3 deep (###) if you’ve copied them from an older assignment into this document.

Finally, you can delete all this header text once you don’t need to refer to it.


Research Questions

Class 03 Datasets, Codebooks, Personal Codebook

Question of interest

Dataset: National Epidemiologic Survey on Alcohol and Related Conditions (NESARC), with codebook NESARC_W1_CodeBook.pdf.

Initial thinking: (My helpful narrative description to help you get going.) While nicotine dependence is a good starting point, I need to determine what it is about nicotine dependence that I am interested in. It strikes me that friends and acquaintances that I have known through the years that became hooked on cigarettes did so across very different periods of time. Some seemed to be dependent soon after their first few experiences with smoking and others after many years of generally irregular smoking behavior. I decide that I am most interested in exploring the association between level of smoking and nicotine dependence. I add to my codebook variables reflecting smoking levels (e.g., smoking quantity and frequency).

Topic of interest: (You need this part.) I have decided to investigate the relationship between nicotine dependence and the frequency and quantity of smoking on people up to 25 years old. The association may differ by ethnicity, age, gender, and other factors.

How I did it: (My helpful narrative description to help you get going.) I look through the codebook and find some variables of interest. I searched the text with “Ctrl-F” (find) to find these variables. For each variable, I copy/paste the description here, then formatted it so it’s organized. You can choose to use a table or an outline format. I found this verbatim text format to be very easy to format. I retained the “frequency” of each response because it’s interesting to know, and because it was already in the codebook — this value is not required for your codebook.

Codebook

Dataset: NESARC
Primary association: nicotine dependence vs frequency and quantity of smoking

Key:
VarName (RenamedVarName)
  Variable description
  Data type (Continuous, Discrete, Nominal, Ordinal)
  Frequency ItemValue Description

IDNUM (ID)
  UNIQUE ID NUMBER WITH NO ALPHABETICS
  Nominal
  43093 1-43093. Unique Identification number

SEX (Sex)
  SEX
  Nominal
  18518 1. Male
  24575 2. Female

AGE (Age)
  AGE
  Continuous
  43079 18-97. Age in years
     14 98. 98 years or older

CHECK321 (SmokingStatus)
  CIGARETTE SMOKING STATUS
  Nominal
   9913 1. Smoked cigarettes in the past 12 months
   8078 2. Smoked cigarettes prior to the last 12 months
     22 9. Unknown
  25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
          * change to 3. Never

TAB12MDX (TobaccoDependence)
  NICOTINE DEPENDENCE IN THE LAST 12 MONTHS
  Nominal
  38131 0. No nicotine dependence
   4962 1. Nicotine dependence

S3AQ3B1 (SmokingFreq)
  USUAL FREQUENCY WHEN SMOKED CIGARETTES
  Ordinal
  14836 1. Every day
    460 2. 5 to 6 Day(s) a week
    687 3. 3 to 4 Day(s) a week
    747 4. 1 to 2 Day(s) a week
    409 5. 2 to 3 Day(s) a month
    772 6. Once a month or less
    102 9. Unknown
  25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
          * change to 7. Never

S3AQ3C1 (DailyCigsSmoked)
  USUAL QUANTITY WHEN SMOKED CIGARETTES
  Discrete
  17751 1-98. Cigarette(s)
    262 99. Unknown
  25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
          * change to 0 Cigarett(s)

ETHRACE2A (Ethnicity)
  IMPUTED RACE/ETHNICITY
  Nominal
  24507 1. White, Not Hispanic or Latino
   8245 2. Black, Not Hispanic or Latino
    701 3. American Indian/Alaska Native, Not Hispanic or Latino
   1332 4. Asian/Native Hawaiian/Pacific Islander, Not Hispanic or Latino
   8308 5. Hispanic or Latino

MAJORDEPLIFE (Depression)
  MAJOR DEPRESSION - LIFETIME (NON-HIERARCHICAL)
  Nominal
  35254 0. No
   7839 1. Yes

Additional variables were created from the original variables:

CREATED VARIABLES

DaySmoke
  estimated number of days per month a subject smokes
  Continuous (due to way estimated), assumes 30 days per month using SmokingFreq)
        1-30.

TotalCigsSmoked
  estimated number of cigarettes per month a subject smokes (DaysSmoke * DailyCigsSmoked)
  Continuous
        0-large.

CigsSmokedFac
  quartiles of TotalCigsSmoked
  Ordinal
        ranges for each of the four quarters

SmokingFreq3
  three levels of smoking frequency
  Ordinal
    from SmokingFreq
    1. Daily   = 1. Every day, 2. 5 to 6 Day(s) a week, 3. 3 to 4 Day(s) a week
    2. Weekly  = 4. 1 to 2 Day(s) a week, 5. 2 to 3 Day(s) a month
    3. Monthly = 6. Once a month or less
    4. Never   = 7. Never

Class 04 Citations

I will use (Beckschäfer et al. 2014; Richert 2013; Murphy 2012; Dean and ebrary, Inc 2014) for this work which was obtained using [@beck2014; @rich2013; @murp2012; @dean2014]. Plus, Beckschäfer et al. (2014) says some interesting stuff and that citation was obtained using @beck2014. For more documentation on bibliographies and citations with R Markdown, see http://rmarkdown.rstudio.com/authoring_bibliographies_and_citations.html. For general help with R Markdown, see https://www.rstudio.com/wp-content/uploads/2015/03/rmarkdown-reference.pdf.

Class 05 Research Questions

See Class 06 below.

Class 06 Literature Review

Dataset: National Epidemiologic Survey on Alcohol and Related Conditions (NESARC), with codebook NESARC_W1_CodeBook.pdf.

Research question:

  1. Is nicotine dependence [S3AQ10D] associated with smoking frequency [S3AQ3B1] and quantity [S3AQ3C1]?
    • Google scholar search: “nicotine dependence smoking frequency”
    • Citation: pdf file available for Dierker et al. (2007)
    • Interesting points: Figures 2 and 3, quantity and frequency both positively related to probability of dependence.
    • Others: Kandel and Chen (2000) and Caraballo, Novak, and Asman (2009)

You don’t need to include images in your literature review. I’m providing these tables to illustrate what these tables look like:

  1. Is nicotine dependence [S3AQ10D] associated with depression [S4AQ1]?
    • Google scholar search: “nicotine dependence depression”
    • Citation: pdf file available for Naomi Breslau (1995)
    • Interesting points: Table 2, Smoking and Nicotine Dependence both associated with Education. Table 3, Major depression associated with being nicotine dependent and Sex.
    • Others: N. Breslau et al. (1998)
  1. Is the associated between nicotine dependence [S3AQ10D] and depression [S4AQ1] different by demographics, such as Education or Sex?
    • In Question 2 we see differences by Education and Sex.

I have decided to further focus my question by examining whether the association between nicotine dependence and depression differs based on how much a person smokes. I am wondering if at low levels of smoking compared to high levels, nicotine dependence is more common among individuals with major depression than those without major depression. I add relevant depression questions/items/variables to my personal codebook as well as several demographic measures (age, gender, ethnicity, education, etc.) and any other variables I may wish to consider.

All required variables have been found and added to my personal codebook (by expanding Class 03).

Data Management

Class 07 Working With Data, Data Management

Purpose of study: smoking and depression in young adults

(Note: This is the last paragraph of the research proposal used to remind the reader of the research objectives.)

The present study will examine young adults from the National Epidemiologic Survey of Alcohol and Related Conditions (NESARC). The goals of the analysis will include 1) establishing the relationship between major depression and nicotine dependence; and 2) determining whether or not the relationship between major depression and nicotine dependence exists above and beyond smoking quantity.

Variables

Variables from NESARC that will be used include (already in codebook, above):

  • IDNUM (unique ID number with no alphabetics),
  • MAJORDEPLIFE (has subject experienced a major depression?),
  • CHECK321 (has subject smoked cigarettes in past 12 months?),
  • AGE (age of subject),
  • TAB12MDX (tobacco dependence past 12 months),
  • S3AQ3B1 (usual frequency when cigarettes smoked),
  • ETHRACE2A (ethnicity of subject),
  • SEX (gender of subject), and
  • S3AQ3C1 (usual smoking quantity of cigarettes when smoked).

Data subset columns

First, the data is placed on the search path using the PDS package.

# data analysis packages
library(tidyverse)  # Data manipulation and visualization suite
## -- Attaching packages ---------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.1.0     v purrr   0.3.0
## v tibble  2.0.1     v dplyr   0.7.8
## v tidyr   0.8.2     v stringr 1.3.1
## v readr   1.3.1     v forcats 0.3.0
## -- Conflicts ------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(forcats)    # Factor variables
library(lubridate)  # Dates
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
# data package
library(PDS)

  ## Problems installing PDS package?  Solution.
  ## If you had problems installing the PDS package, no problem; here's how to get the data:
  ## 1. Download the ".RData" file for your dataset into your ADA Folder.
  ## 2. Where I have "library(PDS)" in my code, change it to the two lines below.
  ##    Update the "filename" to the name of the file.
  ##
  ## # library(PDS)
  ## load("filename.RData")

dim(NESARC)
## [1] 43093  3008

The variables of interest are specified and a subset of the dataset including only these variables is created and stored in the data frame nesarc.sub.

# variables to include in our data subset
nesarc.sub <-
  NESARC %>%
  select(
    IDNUM
  , SEX
  , AGE
  , CHECK321
  , TAB12MDX
  , S3AQ3B1
  , S3AQ3C1
  , ETHRACE2A
  , MAJORDEPLIFE
  )

Check size and structure of data subset. Note that categorical variables are already Factor variables, but the levels do not have meaningful labels, yet.

# size of subset
dim(nesarc.sub)
## [1] 43093     9
# structure of subset
str(nesarc.sub)
## 'data.frame':    43093 obs. of  9 variables:
##  $ IDNUM       : Factor w/ 43093 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ SEX         : Factor w/ 2 levels "1","2": 1 2 2 1 1 2 1 2 2 2 ...
##  $ AGE         : num  23 28 81 18 36 34 19 84 29 18 ...
##  $ CHECK321    : Factor w/ 3 levels "1","2","9": NA NA NA NA NA NA NA NA NA NA ...
##  $ TAB12MDX    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ S3AQ3B1     : Factor w/ 7 levels "1","2","3","4",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ S3AQ3C1     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ ETHRACE2A   : Factor w/ 5 levels "1","2","3","4",..: 5 5 5 5 2 2 2 1 1 5 ...
##  $ MAJORDEPLIFE: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 2 1 ...

EXTRA: AddHealth, joining waves 1 and 2 together

Joining AddHealth waves 1 and 2 together into a single dataset can be done if you want to use variables from when the participants were both adolecents and adults.

# Original AddHealth datasets
library(PDS)
# size of datasets, to verify the data is available
dim(AddHealth)
## [1] 6504 2829
dim(addhealth_public4)
## [1] 6504  976
# First column is the unique ID
colnames(AddHealth)[1]
## [1] "AID"
colnames(addhealth_public4)[1]
## [1] "aid"
# Rename W4's unique ID variable "AID" and "aid" to "ID"
#   so both W1 and W4 have same variable name (lower/uppercase matters)
AddHealth <-
  AddHealth %>%
  rename(ID = AID)
addhealth_public4 <-
  addhealth_public4 %>%
  rename(ID = aid)
# note that the first column name has be updated
colnames(AddHealth)[1]
## [1] "ID"
colnames(addhealth_public4)[1]
## [1] "ID"
# Join two datasets by unique ID
AH14 <-
  full_join(AddHealth, addhealth_public4)
## Joining, by = "ID"
## Warning: Column `ID` joining factor and character vector, coercing into
## character vector
# now we have twice as many columns
dim(AH14)
## [1] 6504 3804
  # Removing this large object since I don't need this data for the NESARC project
  # DELETE THIS LINE if you're using AddHealth
  rm(AH14)

Renaming Variables

Rename variables NewName = OriginalName.

nesarc.sub <-
  nesarc.sub %>%
  rename(
    ID                = IDNUM
  , Sex               = SEX
  , Age               = AGE
  , SmokingStatus     = CHECK321
  , TobaccoDependence = TAB12MDX
  , SmokingFreq       = S3AQ3B1
  , DailyCigsSmoked   = S3AQ3C1
  , Ethnicity         = ETHRACE2A
  , Depression        = MAJORDEPLIFE
  )

str(nesarc.sub)
## 'data.frame':    43093 obs. of  9 variables:
##  $ ID               : Factor w/ 43093 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Sex              : Factor w/ 2 levels "1","2": 1 2 2 1 1 2 1 2 2 2 ...
##  $ Age              : num  23 28 81 18 36 34 19 84 29 18 ...
##  $ SmokingStatus    : Factor w/ 3 levels "1","2","9": NA NA NA NA NA NA NA NA NA NA ...
##  $ TobaccoDependence: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ SmokingFreq      : Factor w/ 7 levels "1","2","3","4",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ DailyCigsSmoked  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Ethnicity        : Factor w/ 5 levels "1","2","3","4",..: 5 5 5 5 2 2 2 1 1 5 ...
##  $ Depression       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 2 1 ...

The first 6 and last 6 values are given by head() and tail().

head(nesarc.sub)
##   ID Sex Age SmokingStatus TobaccoDependence SmokingFreq DailyCigsSmoked
## 1  1   1  23          <NA>                 0        <NA>              NA
## 2  2   2  28          <NA>                 0        <NA>              NA
## 3  3   2  81          <NA>                 0        <NA>              NA
## 4  4   1  18          <NA>                 0        <NA>              NA
## 5  5   1  36          <NA>                 0        <NA>              NA
## 6  6   2  34          <NA>                 0        <NA>              NA
##   Ethnicity Depression
## 1         5          0
## 2         5          0
## 3         5          0
## 4         5          0
## 5         2          0
## 6         2          0
tail(nesarc.sub)
##          ID Sex Age SmokingStatus TobaccoDependence SmokingFreq
## 43088 43088   2  18             1                 0           1
## 43089 43089   2  18          <NA>                 0        <NA>
## 43090 43090   1  19          <NA>                 0        <NA>
## 43091 43091   1  18             1                 1           6
## 43092 43092   1  29             1                 1           1
## 43093 43093   1  18          <NA>                 0        <NA>
##       DailyCigsSmoked Ethnicity Depression
## 43088               5         1          0
## 43089              NA         1          0
## 43090              NA         1          0
## 43091               3         1          0
## 43092              20         1          0
## 43093              NA         1          1

A quick summary of the variables indicates many missing values.

# summary of variables -- notice many NAs
summary(nesarc.sub)
##        ID        Sex            Age       SmokingStatus TobaccoDependence
##  1      :    1   1:18518   Min.   :18.0   1   : 9913    0:38131          
##  2      :    1   2:24575   1st Qu.:32.0   2   : 8078    1: 4962          
##  3      :    1             Median :44.0   9   :   22                     
##  4      :    1             Mean   :46.4   NA's:25080                     
##  5      :    1             3rd Qu.:59.0                                  
##  6      :    1             Max.   :98.0                                  
##  (Other):43087                                                           
##   SmokingFreq    DailyCigsSmoked Ethnicity Depression
##  1      :14836   Min.   : 1.00   1:24507   0:35254   
##  6      :  772   1st Qu.: 6.00   2: 8245   1: 7839   
##  4      :  747   Median :15.00   3:  701             
##  3      :  687   Mean   :16.81   4: 1332             
##  2      :  460   3rd Qu.:20.00   5: 8308             
##  (Other):  511   Max.   :99.00                       
##  NA's   :25080   NA's   :25080

Class 08 Subsetting data and R Programming

Coding missing values

There are two steps. The first step is to recode any existing NAs to actual values, if necessary. The method for doing this differs for numeric and categorical variables. The second step is to recode any coded missing values, such as 9s or 99s, as actual NA.

Coding NAs as meaningful “missing”

First step: the existing blank values with NA mean “never”, and “never” has a meaning different from “missing”. For each variable we need to decide what “never” means and code it appropriately.

NAs recoded as numeric

The easiest is DailyCigsSmoked where NA means 0 cigarettes. We replace NA with 0. The function is.na() identifies each observation that “is an NA”. I’ve updated the codebook to indicate that the original NA values were changed.

S3AQ3C1 (DailyCigsSmoked)
  USUAL QUANTITY WHEN SMOKED CIGARETTES
  Discrete
  17751 1-98. Cigarette(s)
    262 99. Unknown
  25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
          * change to 0 Cigarett(s)
table(nesarc.sub$DailyCigsSmoked)
## 
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
##  934  884  923  573 1070  463  269  299   49 3077   23  230   34   25  851 
##   16   17   18   19   20   21   22   23   24   25   27   28   29   30   33 
##   40   22   59    5 5366    1   10    2    7  155    2    3    3  909    1 
##   34   35   37   39   40   45   50   55   57   60   66   70   75   80   98 
##    1   30    2    1  993    8  106    2    1  241    1   12    2   47   15 
##   99 
##  262
nesarc.sub <-
  nesarc.sub %>%
  mutate(
  # replace NA with 0
    DailyCigsSmoked = replace(DailyCigsSmoked, is.na(DailyCigsSmoked), 0)
  )

table(nesarc.sub$DailyCigsSmoked)
## 
##     0     1     2     3     4     5     6     7     8     9    10    11 
## 25080   934   884   923   573  1070   463   269   299    49  3077    23 
##    12    13    14    15    16    17    18    19    20    21    22    23 
##   230    34    25   851    40    22    59     5  5366     1    10     2 
##    24    25    27    28    29    30    33    34    35    37    39    40 
##     7   155     2     3     3   909     1     1    30     2     1   993 
##    45    50    55    57    60    66    70    75    80    98    99 
##     8   106     2     1   241     1    12     2    47    15   262
NAs recoded as categorical

The next two variables require defining a new category level. Both SmokingStatus and SmokingFreq with a value of NA means “never”, so we need to create a new “Never” category and put in order with the rest of the levels. I’ve updated the codebook to indicate that the original NA values were changed.

CHECK321 (SmokingStatus)
  CIGARETTE SMOKING STATUS
  Nominal
   9913 1. Smoked cigarettes in the past 12 months
   8078 2. Smoked cigarettes prior to the last 12 months
     22 9. Unknown
  25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
          * change to 3. Never

S3AQ3B1 (SmokingFreq)
  USUAL FREQUENCY WHEN SMOKED CIGARETTES
  Ordinal
  14836 1. Every day
    460 2. 5 to 6 Day(s) a week
    687 3. 3 to 4 Day(s) a week
    747 4. 1 to 2 Day(s) a week
    409 5. 2 to 3 Day(s) a month
    772 6. Once a month or less
    102 9. Unknown
  25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
          * change to 7. Never
table(nesarc.sub$SmokingStatus  )
## 
##    1    2    9 
## 9913 8078   22
table(nesarc.sub$SmokingFreq    )
## 
##     1     2     3     4     5     6     9 
## 14836   460   687   747   409   772   102
nesarc.sub <-
  nesarc.sub %>%
  mutate(
  # first change the "factor" variables to "character" to add a new level
    SmokingStatus   = as.character(SmokingStatus)
  , SmokingFreq     = as.character(SmokingFreq  )
  # replace NA with new category values
  , SmokingStatus   = replace(SmokingStatus  , is.na(SmokingStatus), "3")
  , SmokingFreq     = replace(SmokingFreq    , is.na(SmokingFreq)  , "7")
  )

table(nesarc.sub$SmokingStatus  )
## 
##     1     2     3     9 
##  9913  8078 25080    22
table(nesarc.sub$SmokingFreq    )
## 
##     1     2     3     4     5     6     7     9 
## 14836   460   687   747   409   772 25080   102

Coding 9s and 99s as NAs

Second step: replace the coded missing values with NA, then remove the remaining records with missing values. Not all of these variables need this operation, since the previous subset commands removed missing values by chance. However, I’m going to run each as an extra precaution. Afterall, later, if we were to make different previous data decisions, it’s possible we could end up with missing values at this point that are incorrectly coded without this step!

table(nesarc.sub$SmokingStatus  )
## 
##     1     2     3     9 
##  9913  8078 25080    22
table(nesarc.sub$SmokingFreq    )
## 
##     1     2     3     4     5     6     7     9 
## 14836   460   687   747   409   772 25080   102
table(nesarc.sub$DailyCigsSmoked)
## 
##     0     1     2     3     4     5     6     7     8     9    10    11 
## 25080   934   884   923   573  1070   463   269   299    49  3077    23 
##    12    13    14    15    16    17    18    19    20    21    22    23 
##   230    34    25   851    40    22    59     5  5366     1    10     2 
##    24    25    27    28    29    30    33    34    35    37    39    40 
##     7   155     2     3     3   909     1     1    30     2     1   993 
##    45    50    55    57    60    66    70    75    80    98    99 
##     8   106     2     1   241     1    12     2    47    15   262
nesarc.sub <-
  nesarc.sub %>%
  mutate(
  # replace missing values
    SmokingStatus   = replace(SmokingStatus  , SmokingStatus   %in% c("9"), NA)
  , SmokingFreq     = replace(SmokingFreq    , SmokingFreq     %in% c("9"), NA)
  , DailyCigsSmoked = replace(DailyCigsSmoked, DailyCigsSmoked %in% c( 99), NA)
  # drop unused factor levels
  , SmokingStatus   = fct_drop(SmokingStatus)
  , SmokingFreq     = fct_drop(SmokingFreq  )
  )

table(nesarc.sub$SmokingStatus  )
## 
##     1     2     3 
##  9913  8078 25080
table(nesarc.sub$SmokingFreq    )
## 
##     1     2     3     4     5     6     7 
## 14836   460   687   747   409   772 25080
table(nesarc.sub$DailyCigsSmoked)
## 
##     0     1     2     3     4     5     6     7     8     9    10    11 
## 25080   934   884   923   573  1070   463   269   299    49  3077    23 
##    12    13    14    15    16    17    18    19    20    21    22    23 
##   230    34    25   851    40    22    59     5  5366     1    10     2 
##    24    25    27    28    29    30    33    34    35    37    39    40 
##     7   155     2     3     3   909     1     1    30     2     1   993 
##    45    50    55    57    60    66    70    75    80    98 
##     8   106     2     1   241     1    12     2    47    15
summary(nesarc.sub)
##        ID        Sex            Age       SmokingStatus TobaccoDependence
##  1      :    1   1:18518   Min.   :18.0   1   : 9913    0:38131          
##  2      :    1   2:24575   1st Qu.:32.0   2   : 8078    1: 4962          
##  3      :    1             Median :44.0   3   :25080                     
##  4      :    1             Mean   :46.4   NA's:   22                     
##  5      :    1             3rd Qu.:59.0                                  
##  6      :    1             Max.   :98.0                                  
##  (Other):43087                                                           
##   SmokingFreq    DailyCigsSmoked  Ethnicity Depression
##  7      :25080   Min.   : 0.000   1:24507   0:35254   
##  1      :14836   1st Qu.: 0.000   2: 8245   1: 7839   
##  6      :  772   Median : 0.000   3:  701             
##  4      :  747   Mean   : 6.462   4: 1332             
##  3      :  687   3rd Qu.:10.000   5: 8308             
##  (Other):  869   Max.   :98.000                       
##  NA's   :  102   NA's   :262

Finally, consider using na.omit() to remove any records with missing values if it won’t cause issues with analysis. The command na.omit() removes a row if any columns have an NA in that row. This can have drastic consequences. For example, imagine that you have 7 variables and 6 variables have complete data, but the 7th variable is about three-quarters NAs! Then na.omit() will drop three-quarters of your data, even though most of your variables are complete!

There’s only one advantage I can think of for running na.omit(), and that’s if you have a large dataset and only a few observations have a few NAs. You can’t analyze data with NAs, since any calculation involving NA results in an NA. Therefore imagine two scenarios: (1) An analysis with Variables 1 and 2, each with a few NA distributed throughout; first the observations with NA for either Variable 1 or 2 will be removed, then the analysis will proceed. (2) A similar analysis, but now with Variables 1 and 3, but Variable 3 has NA for different observations than Variable 2 — therefore, different observations will be removed before analysis. In summary, these two analyses will include different observations from the dataset. Therefore, by running na.omit() you will start with a dataset with no NAs, thus every analysis will be of exactly the same set of data (at the cost of not using all the possible data for each individual analysis).

Below, I do not run na.omit() since I’m concerned about having lots of NAs for some variables.

dim(nesarc.sub)
## [1] 43093     9
# remove rows with missing values
#nesarc.sub <- na.omit(nesarc.sub)
dim(nesarc.sub)
## [1] 43093     9
summary(nesarc.sub)
##        ID        Sex            Age       SmokingStatus TobaccoDependence
##  1      :    1   1:18518   Min.   :18.0   1   : 9913    0:38131          
##  2      :    1   2:24575   1st Qu.:32.0   2   : 8078    1: 4962          
##  3      :    1             Median :44.0   3   :25080                     
##  4      :    1             Mean   :46.4   NA's:   22                     
##  5      :    1             3rd Qu.:59.0                                  
##  6      :    1             Max.   :98.0                                  
##  (Other):43087                                                           
##   SmokingFreq    DailyCigsSmoked  Ethnicity Depression
##  7      :25080   Min.   : 0.000   1:24507   0:35254   
##  1      :14836   1st Qu.: 0.000   2: 8245   1: 7839   
##  6      :  772   Median : 0.000   3:  701             
##  4      :  747   Mean   : 6.462   4: 1332             
##  3      :  687   3rd Qu.:10.000   5: 8308             
##  (Other):  869   Max.   :98.000                       
##  NA's   :  102   NA's   :262

Now we have a dataset with no missing values.

Creating new variables

Additional variables that we created from the original variables are listed in this ammendment to the original codebook.

CREATED VARIABLES

DaySmoke
  estimated number of days per month a subject smokes
  Continuous (due to way estimated), assumes 30 days per month using SmokingFreq)
        1-30.

TotalCigsSmoked
  estimated number of cigarettes per month a subject smokes (DaysSmoke * DailyCigsSmoked)
  Continuous
        0-large.

CigsSmokedFac
  quartiles of TotalCigsSmoked
  Ordinal
        ranges for each of the four quarters

SmokingFreq3
  three levels of smoking frequency
  Ordinal
    from SmokingFreq
    1. Daily   = 1. Every day, 2. 5 to 6 Day(s) a week, 3. 3 to 4 Day(s) a week
    2. Weekly  = 4. 1 to 2 Day(s) a week, 5. 2 to 3 Day(s) a month
    3. Monthly = 6. Once a month or less
    4. Never   = 7. Never

From categories to numeric

DaysSmoke estimates the days per month a subject smokes by converting SmokingFreq (a factor with 6 levels) to a numeric variable using as.numeric() and multiplying by the midpoint of the range of SmokingFreq.

nesarc.sub <-
  nesarc.sub %>%
  mutate(
    DaysSmoke =
      case_when(
        SmokingFreq == 1 ~ 30      # 1. Every day
      , SmokingFreq == 2 ~ 4 * 5.5 # 2. 5 to 6 Day(s) a week
      , SmokingFreq == 3 ~ 4 * 3.5 # 3. 3 to 4 Day(s) a week
      , SmokingFreq == 4 ~ 4 * 1.5 # 4. 1 to 2 Day(s) a week
      , SmokingFreq == 5 ~ 2.5     # 5. 2 to 3 Day(s) a month
      , SmokingFreq == 6 ~ 1       # 6. Once a month or less
      , SmokingFreq == 7 ~ 0       # 7. Never
      )
  )

summary(nesarc.sub$DaysSmoke)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    0.00    0.00   10.96   30.00   30.00     102

From numeric to categories based on quantiles

The variable TotalCigsSmoked estimates the monthly number of cigarettes a subject smokes per month by multiplying DaysSmoke times DailyCigsSmoked (the usual quantity smoked per day).

nesarc.sub <-
  nesarc.sub %>%
  mutate(
    TotalCigsSmoked = DaysSmoke * DailyCigsSmoked
  )

The numeric variable TotalCigsSmoked is converted into a factor with four roughly equivalent numbers (quartiles) stored in each level of the factor CigsSmokedFac using the cut function. The label "Q1" indicates that the original value was in the first quarter (25%) of the ordered data, that is, they are among the smallest 25% observations; while "Q4" indicates the original value was in the last quarter (25%) of the ordered data, that is, they are among the largest 25% observations.

quantile(nesarc.sub$TotalCigsSmoked, na.rm = TRUE)
##   0%  25%  50%  75% 100% 
##    0    0    0  300 2940
nesarc.sub <-
  nesarc.sub %>%
  mutate(
    CigsSmokedFac =
      .bincode(TotalCigsSmoked
      , breaks = quantile(TotalCigsSmoked, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)
      , include.lowest = TRUE
      )
  , CigsSmokedFac = factor( CigsSmokedFac
                          , levels = c(1, 2, 3, 4)
                          , labels = c("Q1", "Q2", "Q3", "Q4")
                          )
  )

In my data, more than half of the values are 0s which is the minimum value, thus they all get grouped in “Q1”; this leaves no values for “Q2” – oh well.

summary(nesarc.sub$CigsSmokedFac)
##    Q1    Q2    Q3    Q4  NA's 
## 25080     0  8630  9086   297

From many categories to a few

The variable SmokingFreq3 collapses the SmokingFreq from 6 down to 3 categories. This is an example to show you how to do this with your own variables. Some of you will have categorical variables that have a dozen or so categories; having so many categories makes interpretation very difficulty — fewer is easier.

## SmokingFreq3 variable
# notice that I'm doing this before I've assigned the labels
#   if you do this after assigning labels (with the factor() command), then you'll want to use "as.numeric()" as below
# inspect original variable
table(nesarc.sub$SmokingFreq)
## 
##     1     2     3     4     5     6     7 
## 14836   460   687   747   409   772 25080
# look at numeric values
table(as.numeric(nesarc.sub$SmokingFreq))
## 
##     1     2     3     4     5     6     7 
## 14836   460   687   747   409   772 25080
# Note: it is sometimes easier to use the numeric "levels" of a factor variable, than the character "labels"

# initialize new variable with NAs
nesarc.sub <-
  nesarc.sub %>%
  mutate(
    SmokingFreq3 =
      fct_collapse(
        SmokingFreq
      , "1" = c("1", "2", "3")  # 1. daily
      , "2" = c("4", "5")       # 2. weekly
      , "3" = "6"               # 3. monthly
      , "4" = "7"               # 7. never
      )
  )

# new variable
table(nesarc.sub$SmokingFreq3)
## 
##     1     2     3     4 
## 15983  1156   772 25080
# we will label the new factor variable below, with the others

Review results of new variables

Look at the first 10 observations in the dataset, with special attention to the new variables.

head(nesarc.sub, 10)
##    ID Sex Age SmokingStatus TobaccoDependence SmokingFreq DailyCigsSmoked
## 1   1   1  23             3                 0           7               0
## 2   2   2  28             3                 0           7               0
## 3   3   2  81             3                 0           7               0
## 4   4   1  18             3                 0           7               0
## 5   5   1  36             3                 0           7               0
## 6   6   2  34             3                 0           7               0
## 7   7   1  19             3                 0           7               0
## 8   8   2  84             3                 0           7               0
## 9   9   2  29             3                 0           7               0
## 10 10   2  18             3                 0           7               0
##    Ethnicity Depression DaysSmoke TotalCigsSmoked CigsSmokedFac
## 1          5          0         0               0            Q1
## 2          5          0         0               0            Q1
## 3          5          0         0               0            Q1
## 4          5          0         0               0            Q1
## 5          2          0         0               0            Q1
## 6          2          0         0               0            Q1
## 7          2          1         0               0            Q1
## 8          1          0         0               0            Q1
## 9          1          1         0               0            Q1
## 10         5          0         0               0            Q1
##    SmokingFreq3
## 1             4
## 2             4
## 3             4
## 4             4
## 5             4
## 6             4
## 7             4
## 8             4
## 9             4
## 10            4

Labeling Categorical variable levels

Look at summary of dataset to see the state of the labelling before we make changes.

summary(nesarc.sub)
##        ID        Sex            Age       SmokingStatus TobaccoDependence
##  1      :    1   1:18518   Min.   :18.0   1   : 9913    0:38131          
##  2      :    1   2:24575   1st Qu.:32.0   2   : 8078    1: 4962          
##  3      :    1             Median :44.0   3   :25080                     
##  4      :    1             Mean   :46.4   NA's:   22                     
##  5      :    1             3rd Qu.:59.0                                  
##  6      :    1             Max.   :98.0                                  
##  (Other):43087                                                           
##   SmokingFreq    DailyCigsSmoked  Ethnicity Depression   DaysSmoke    
##  7      :25080   Min.   : 0.000   1:24507   0:35254    Min.   : 0.00  
##  1      :14836   1st Qu.: 0.000   2: 8245   1: 7839    1st Qu.: 0.00  
##  6      :  772   Median : 0.000   3:  701              Median : 0.00  
##  4      :  747   Mean   : 6.462   4: 1332              Mean   :10.96  
##  3      :  687   3rd Qu.:10.000   5: 8308              3rd Qu.:30.00  
##  (Other):  869   Max.   :98.000                        Max.   :30.00  
##  NA's   :  102   NA's   :262                           NA's   :102    
##  TotalCigsSmoked  CigsSmokedFac SmokingFreq3
##  Min.   :   0.0   Q1  :25080    1   :15983  
##  1st Qu.:   0.0   Q2  :    0    2   : 1156  
##  Median :   0.0   Q3  : 8630    3   :  772  
##  Mean   : 187.5   Q4  : 9086    4   :25080  
##  3rd Qu.: 300.0   NA's:  297    NA's:  102  
##  Max.   :2940.0                             
##  NA's   :297

Informative labels are given to the factor levels. The order of the levels is also rearranged for the variables SmokingFreq, TobaccoDependence, and Sex.

# Label factors with meaningful names in the same order as the codebook
nesarc.sub$Depression <-
  factor(
    nesarc.sub$Depression
  , labels = c("No Depression"
             , "Yes Depression"
    )
  )

nesarc.sub$SmokingStatus <-
  factor(
    nesarc.sub$SmokingStatus
  , labels = c("Smoked past 12 months"
             , "Smoked prior to 12 months"
             , "Never smoked"
    )
  )

# Specify the order of the levels to be different from codebook order
nesarc.sub$Sex <-
  factor(nesarc.sub$Sex
  , levels = c( 2
              , 1
              )
  , labels = c( "Female"
              , "Male"
              )
  )
# check ordering with a frequency table
table(nesarc.sub$Sex)
## 
## Female   Male 
##  24575  18518
nesarc.sub$TobaccoDependence <-
  factor(nesarc.sub$TobaccoDependence
  , levels = c( 1
              , 0
              )
  , labels = c( "Nicotine Dependence"
              , "No Nicotine Dependence"
              )
  )


# Remember the new category level from the original `NA` values.

nesarc.sub$SmokingFreq <-
  factor(nesarc.sub$SmokingFreq
  , levels = c( 7
              , 6
              , 5
              , 4
              , 3
              , 2
              , 1
              )
  , labels = c( "Never"
              , "Once a month or less"
              , "2 to 3 Days/month"
              , "1 to 2 Days/week"
              , "3 to 4 Days/week"
              , "5 to 6 Days/week"
              , "Every Day"
              )
  )


nesarc.sub$Ethnicity <-
  factor(nesarc.sub$Ethnicity
    # shorter labels are helpful for plots
  , labels = c( "Cauc"
              , "AfAm"
              , "NaAm"
              , "Asia"
              , "Hisp"
              )
  # , labels = c("Caucasian"
  #            , "African American"
  #            , "Native American"
  #            , "Asian"
  #            , "Hispanic"
  #             )
)

nesarc.sub$SmokingFreq3 <-
  factor(nesarc.sub$SmokingFreq3
  , levels = c( 1
              , 2
              , 3
              , 4
              )
  , labels = c( "Daily"
              , "Weekly"
              , "Monthly"
              , "Never"
              )
  )

summary(nesarc.sub)
##        ID            Sex             Age      
##  1      :    1   Female:24575   Min.   :18.0  
##  2      :    1   Male  :18518   1st Qu.:32.0  
##  3      :    1                  Median :44.0  
##  4      :    1                  Mean   :46.4  
##  5      :    1                  3rd Qu.:59.0  
##  6      :    1                  Max.   :98.0  
##  (Other):43087                                
##                    SmokingStatus                TobaccoDependence
##  Smoked past 12 months    : 9913   Nicotine Dependence   : 4962  
##  Smoked prior to 12 months: 8078   No Nicotine Dependence:38131  
##  Never smoked             :25080                                 
##  NA's                     :   22                                 
##                                                                  
##                                                                  
##                                                                  
##                SmokingFreq    DailyCigsSmoked  Ethnicity   
##  Never               :25080   Min.   : 0.000   Cauc:24507  
##  Every Day           :14836   1st Qu.: 0.000   AfAm: 8245  
##  Once a month or less:  772   Median : 0.000   NaAm:  701  
##  1 to 2 Days/week    :  747   Mean   : 6.462   Asia: 1332  
##  3 to 4 Days/week    :  687   3rd Qu.:10.000   Hisp: 8308  
##  (Other)             :  869   Max.   :98.000               
##  NA's                :  102   NA's   :262                  
##           Depression      DaysSmoke     TotalCigsSmoked  CigsSmokedFac
##  No Depression :35254   Min.   : 0.00   Min.   :   0.0   Q1  :25080   
##  Yes Depression: 7839   1st Qu.: 0.00   1st Qu.:   0.0   Q2  :    0   
##                         Median : 0.00   Median :   0.0   Q3  : 8630   
##                         Mean   :10.96   Mean   : 187.5   Q4  : 9086   
##                         3rd Qu.:30.00   3rd Qu.: 300.0   NA's:  297   
##                         Max.   :30.00   Max.   :2940.0                
##                         NA's   :102     NA's   :297                   
##   SmokingFreq3  
##  Daily  :15983  
##  Weekly : 1156  
##  Monthly:  772  
##  Never  :25080  
##  NA's   :  102  
##                 
## 

Data subset rows to young smokers

In practice, I would probably subset the data much earlier. In this example I waited until the end to subset so that I could illustrate all of the other steps with complete values.

In our example, we are interested in the subset of people 25 or younger who smoked in the last 12 months and this is created using the filter() function. Note that SmokingStatus == 1 is used to choose people who smoked in the past 12 months.

Notice, the missing values have been removed (since NAs are not less than or equal to numbers).

# the subset command below will not include a row that evalutes to NA for the
# specified subset.  For example:
  (NA == 1)
## [1] NA
# age and smoking subset
nesarc.sub <-
  nesarc.sub %>%
  filter(
    Age <= 25                                 # people 25 or younger
  , SmokingStatus == "Smoked past 12 months"  # smoked in the past 12 months
  )
dim(nesarc.sub)
## [1] 1706   13
summary(nesarc.sub)
##        ID           Sex           Age       
##  21     :   1   Female:850   Min.   :18.00  
##  77     :   1   Male  :856   1st Qu.:20.00  
##  103    :   1                Median :22.00  
##  122    :   1                Mean   :21.61  
##  136    :   1                3rd Qu.:23.00  
##  150    :   1                Max.   :25.00  
##  (Other):1700                               
##                    SmokingStatus               TobaccoDependence
##  Smoked past 12 months    :1706   Nicotine Dependence   :896    
##  Smoked prior to 12 months:   0   No Nicotine Dependence:810    
##  Never smoked             :   0                                 
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##                SmokingFreq   DailyCigsSmoked Ethnicity  
##  Every Day           :1320   Min.   : 1.00   Cauc:1060  
##  3 to 4 Days/week    :  91   1st Qu.: 5.00   AfAm: 211  
##  1 to 2 Days/week    :  88   Median :10.00   NaAm:  42  
##  Once a month or less:  71   Mean   :11.34   Asia:  58  
##  5 to 6 Days/week    :  68   3rd Qu.:20.00   Hisp: 335  
##  (Other)             :  65   Max.   :98.00              
##  NA's                :   3   NA's   :9                  
##           Depression     DaysSmoke     TotalCigsSmoked  CigsSmokedFac
##  No Depression :1260   Min.   : 1.00   Min.   :   1.0   Q1  :   0    
##  Yes Depression: 446   1st Qu.:30.00   1st Qu.:  90.0   Q2  :   0    
##                        Median :30.00   Median : 300.0   Q3  :1107    
##                        Mean   :25.33   Mean   : 320.5   Q4  : 590    
##                        3rd Qu.:30.00   3rd Qu.: 600.0   NA's:   9    
##                        Max.   :30.00   Max.   :2940.0                
##                        NA's   :3       NA's   :9                     
##   SmokingFreq3 
##  Daily  :1479  
##  Weekly : 153  
##  Monthly:  71  
##  Never  :   0  
##  NA's   :   3  
##                
## 

EXTRA: Working with Dates

Dates can be tricky in R. The lubridate package makes them much easier.

Below I’m going to create a variable for the person’s age in years based on the person’s date of birth and the interview date. There is an AGE variable which I’m already using, but if there wasn’t, or I wanted to be more precise about their age (such as, 24.34 years old), then this is a way to do it.

I’m going to create a separate data frame just for this example to show how to do it; if you needed these variables, they should be part of your original subset above. Below, the “C” variables (CDAY) are the interview day, month, and year, and the “DOB” variables (DOBD) are the date of birth dat, month, and year.

dat.sub <- subset(NESARC, select = c(CDAY, CMON, CYEAR, DOBD, DOBM, DOBY))
str(dat.sub)
## 'data.frame':    43093 obs. of  6 variables:
##  $ CDAY : num  14 12 23 9 18 7 1 18 9 16 ...
##  $ CMON : num  8 1 11 9 10 9 9 9 1 8 ...
##  $ CYEAR: num  2001 2002 2001 2001 2001 ...
##  $ DOBD : num  27 15 28 27 11 28 16 10 7 25 ...
##  $ DOBM : num  9 12 12 6 9 12 3 11 4 3 ...
##  $ DOBY : num  1977 1973 1919 1983 1965 ...

Combine the date fields together, then interpret them into a date object. I’ll do this for both the interview date and the date of birth. I print the head (first 6 observations) after each operation.

## interview date
# combine day, month, and year into one text string
cdate.text <- paste(dat.sub$CDAY, dat.sub$CMON, dat.sub$CYEAR)
head(cdate.text)
## [1] "14 8 2001"  "12 1 2002"  "23 11 2001" "9 9 2001"   "18 10 2001"
## [6] "7 9 2001"
# create date object by interpretting the day, month, year with dmy()
cdate.date <- dmy(cdate.text)
head(cdate.date)
## [1] "2001-08-14" "2002-01-12" "2001-11-23" "2001-09-09" "2001-10-18"
## [6] "2001-09-07"
## date of birth
# combine day, month, and year into one text string
dob.text <- paste(dat.sub$DOBD, dat.sub$DOBM, dat.sub$DOBY)
head(dob.text)
## [1] "27 9 1977"  "15 12 1973" "28 12 1919" "27 6 1983"  "11 9 1965" 
## [6] "28 12 1966"
# create date object by interpretting the day, month, year with dmy()
  # if any failed to parse, it's because a day, month, or year is out of range
  #    such as "99" for missing values or refused to answer
  # in these cases, the date is NA (missing), which is what we want
dob.date <- dmy(dob.text)
## Warning: 1094 failed to parse.
head(dob.date)
## [1] "1977-09-27" "1973-12-15" "1919-12-28" "1983-06-27" "1965-09-11"
## [6] "1966-12-28"

Now that we have the interview date and date of birth as date objects, we calculate the difference to give the person’s age at the time of the interview. Because the difference of dates is in the unit “days”, we convert to years.

# difference in days / 365.25 = difference in years
age.days <- cdate.date - dob.date
head(age.days)
## Time differences in days
## [1]  8722 10255 29916  6649 13186 12672
age.years <- as.numeric(age.days / 365.25)  # change from "difftime" to "numeric"
head(age.years)
## [1] 23.87953 28.07666 81.90554 18.20397 36.10130 34.69405
# difference between the age we calculated and the age in the original dataset
head(age.years - NESARC$AGE)
## [1] 0.87953457 0.07665982 0.90554415 0.20396988 0.10130048 0.69404517

Add the new age variable to your data frame, drop the ones you don’t need anymore, and you’re ready to go!

# add the variable
dat.sub$age.years <- age.years
str(dat.sub)
## 'data.frame':    43093 obs. of  7 variables:
##  $ CDAY     : num  14 12 23 9 18 7 1 18 9 16 ...
##  $ CMON     : num  8 1 11 9 10 9 9 9 1 8 ...
##  $ CYEAR    : num  2001 2002 2001 2001 2001 ...
##  $ DOBD     : num  27 15 28 27 11 28 16 10 7 25 ...
##  $ DOBM     : num  9 12 12 6 9 12 3 11 4 3 ...
##  $ DOBY     : num  1977 1973 1919 1983 1965 ...
##  $ age.years: num  23.9 28.1 81.9 18.2 36.1 ...
# drop the original ones you don't need anymore
#   by using subset() and select = -VAR, it excludes the VAR column
dat.sub <-
  dat.sub %>%
  select(-c(CDAY, CMON, CYEAR, DOBD, DOBM, DOBY))
str(dat.sub)
## 'data.frame':    43093 obs. of  1 variable:
##  $ age.years: num  23.9 28.1 81.9 18.2 36.1 ...

Now you’ve got an age (or other date object) you can use in your project.

Graphing

Class 11 Graphing Univariate

Categorical variables

Tables for categorical variables

Basic tables can be created using the functions table or xtabs. Frequency tables are created for the variables TobaccoDependence, CigsSmokedFac, and SmokingFreq. When only looking at one variable, table is the same as xtabs, but xtabs will make looking at two-way, three-way, and higher tables possible.

# table() produces a one-varible frequency table
table(nesarc.sub$TobaccoDependence)
## 
##    Nicotine Dependence No Nicotine Dependence 
##                    896                    810
# proportions available by passing these frequencies to prop.table()
table(nesarc.sub$TobaccoDependence) %>% prop.table()
## 
##    Nicotine Dependence No Nicotine Dependence 
##              0.5252052              0.4747948
# xtabs() can be used for one or more variables
T1 <- xtabs( ~ TobaccoDependence, data = nesarc.sub)
T1
## TobaccoDependence
##    Nicotine Dependence No Nicotine Dependence 
##                    896                    810
# proportions available with prop.table()
prop.table(T1)
## TobaccoDependence
##    Nicotine Dependence No Nicotine Dependence 
##              0.5252052              0.4747948
T2 <- xtabs( ~ SmokingFreq, data = nesarc.sub)
T2
## SmokingFreq
##                Never Once a month or less    2 to 3 Days/month 
##                    0                   71                   65 
##     1 to 2 Days/week     3 to 4 Days/week     5 to 6 Days/week 
##                   88                   91                   68 
##            Every Day 
##                 1320

(Look at the .Rmd code to see that these numbers in the next paragraph are calculated directly from the numbers summarized in the table! This way, if the data changes, the text will be correctly updated.)

In the data frame nesarc.sub, there are 896 nicotine dependent subjects and 810 subjects that are not nicotine dependent. Most of the subjects in nesarc.sub are daily smokers (68) with the remainder distributed uniformly across the first five levels of SmokingFreq.

Graphing frequency tables

The barplots are all created with the package ggplot2. The barplots start with the defaults for the geom_bar and add more detail to the plot with each graph.

library(ggplot2)
p1 <- ggplot(data = nesarc.sub, aes(x = TobaccoDependence))
p1 <- p1 + geom_bar()
p1

p2 <- ggplot(data = nesarc.sub, aes(x = SmokingFreq))
p2 <- p2 + geom_bar()
p2 <- p2 + labs(x = "", y = "Total Number", title = "Smoking Frequency for Young Smoking Adults")
p2 <- p2 + theme_bw()
p2 <- p2 + theme(axis.text.x  = element_text(angle = 90, vjust = 0, hjust = 1))
p2

Removing NAs from factor axes

When a factor variable has NAs, then NA appears as a level with frequences (see bar plots above). As of writing this, ggplot() does not have a way to remove the NAs before plotting. Therefore, we have to do it manually.

The easiest and most transparent way is to use the drop_na() function below where we only plot the non-NA values. Compare the two plots below of the same data, but the second plot had the NAs removed before plotting.

library(ggplot2)
# Original plot with NAs
p1 <- ggplot(data = nesarc.sub, aes(x = SmokingFreq))
p1 <- p1 + geom_bar()
p1 <- p1 + labs(title = "Basic plot")
p1 <- p1 + theme(axis.text.x  = element_text(angle = 90, vjust = 0, hjust = 1))
#p1

# subset excluded the NAs for the variable being plotted
# "subset() specifies the values to keep,
#   and "!is.na()" means "keep the non-NA values"
#p2 <- ggplot(data = subset(nesarc.sub, !is.na(CigsSmokedFac)), aes(x = CigsSmokedFac))
p2 <- ggplot(data = nesarc.sub %>% drop_na(SmokingFreq), aes(x = SmokingFreq))
p2 <- p2 + geom_bar()
p2 <- p2 + labs(title = "Using drop_na()")
p2 <- p2 + theme(axis.text.x  = element_text(angle = 90, vjust = 0, hjust = 1))
#p2

# grid.arrange() is a way to arrange several ggplot objects
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
grid.arrange(grobs = list(p1, p2), nrow=1, top = "Excluding NAs from factor variable in ggplot")

Final versions

library(ggplot2)
p1 <- ggplot(data = nesarc.sub, aes(x = TobaccoDependence))
p1 <- p1 + geom_bar()
p1 <- p1 + labs(x     = ""
              , y     = "Total Number"
              , title = "Nicotine Dependence for Young Smoking Adults"
              )
print(p1)

Interpretation: Among smokers 25 years and younger, there are slightly more (5% more) people who are nicotine dependent (52.5%) compared to those who are not nicotine dependent (47.5%).

p2 <- ggplot(data = nesarc.sub %>% drop_na(SmokingFreq), aes(x = SmokingFreq))
p2 <- p2 + geom_bar()
p2 <- p2 + labs(x     = ""
              , y     = "Total Number"
              , title = "Smoking Frequency for Young Smoking Adults"
              )
p2 <- p2 + theme(axis.text.x  = element_text(angle = 90, vjust = 0, hjust = 1))
print(p2)

Interpretation: Among smokers 25 years and younger, most smoke every day (77.5%) while the less frequent categories are roughly evenly distributed around 4% to 5%.

Numeric variables

Graphing numeric variables

One popular way to graph a continuous variable is to use a histogram. R has the base function hist() which can be used for this purpose. We will also use the package ggplot2 to create histograms. We start with a basic histogram of the variable TotalCigsSmoked.

hist(nesarc.sub$TotalCigsSmoked)

Experiment with the code in the next code chunk to set the binwidth = (extremely important!), decide if you need boundary = 0, add a title, and if needed the labels on the \(x\)- and \(y\)-axes.

p <- ggplot(data = nesarc.sub, aes(x = TotalCigsSmoked))
p <- p + geom_histogram(binwidth = 200)
p <- p + geom_rug()
p <- p + labs(x = "Estimated Cigarettes Smoked per Month")
p <- p + theme_bw()
p
## Warning: Removed 9 rows containing non-finite values (stat_bin).

# Use "boundary = 0" for values that can't be negative
p <- ggplot(data = nesarc.sub, aes(x = TotalCigsSmoked))
p <- p + geom_histogram(boundary = 0, binwidth = 100)
p <- p + geom_rug()
p <- p + labs(x = "Estimated Cigarettes Smoked per Month", title = "boundary at 0")
p <- p + theme_bw()
p
## Warning: Removed 9 rows containing non-finite values (stat_bin).

Creating Density Plots

A smoothed density curve can be added with the geom_density() function, where adjust = 2.0 is a fairly smooth curve. Experiment with adjust values such as 1.0, 1.5, 2.0, and 3.0 to decrease or increase the smoothness and decide what level best captures the general shape of your distribution.

p <- ggplot(data = nesarc.sub, aes(x = TotalCigsSmoked))
p <- p + geom_histogram(aes(y = ..density..), boundary = 0, binwidth = 100)
p <- p + geom_rug()
p <- p + geom_density(alpha = 0.2, fill = "gray50", adjust = 1.5)
p <- p + labs(x = "Estimated Cigarettes Smoked per Month")
p <- p + theme_bw()
p
## Warning: Removed 9 rows containing non-finite values (stat_bin).
## Warning: Removed 9 rows containing non-finite values (stat_density).

Shape, center, and spread

summary(nesarc.sub$TotalCigsSmoked)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     1.0    90.0   300.0   320.5   600.0  2940.0       9
fivenum(nesarc.sub$TotalCigsSmoked)
## [1]    1   90  300  600 2940
median(nesarc.sub$TotalCigsSmoked, na.rm = TRUE)
## [1] 300
IQR(nesarc.sub$TotalCigsSmoked, na.rm = TRUE)
## [1] 510

The “shape” of the distribution for the estimated cigarettes smoked per month is skewed to the right. The “center” (median) of the distribution is 300 and the “spread” (interquartile range, middle 50%) for the distribution is 510.

Final versions

p <- ggplot(data = nesarc.sub, aes(x = TotalCigsSmoked))
p <- p + geom_histogram(aes(y = ..density..), boundary = 0, binwidth = 100)
p <- p + geom_rug()
p <- p + geom_density(alpha = 0.2, fill = "gray50", adjust = 1.5)
p <- p + labs(x = "Estimated Cigarettes Smoked per Month"
            , y = ""
            , title = "Total cigaretts smoked for Young Smoking Adults"
            )
p <- p + theme_bw()
p
## Warning: Removed 9 rows containing non-finite values (stat_bin).
## Warning: Removed 9 rows containing non-finite values (stat_density).

Interpretation: Among smokers 25 years and younger, the number of cigarettes smoked is right skewed with a median (center) of 300 and the IQR (spread) (interquartile range, middle 50%) for the distribution is 510. There are three distinct modes (peaks) in the distribution because of how we constructed this variable from two others, and because of the common numbers of cigarettes being relative to a pack size (20). There are several outlying values greater than 1200 (representing 40 cigarettes per day) that are possibly overestimates from the people responding.

p <- ggplot(data = nesarc.sub, aes(x = DailyCigsSmoked))
p <- p + geom_histogram(aes(y = ..density..), boundary = 0, binwidth = 2)
p <- p + geom_rug()
p <- p + geom_density(alpha = 0.2, fill = "gray50", adjust = 1.5)
p <- p + labs(x = "Estimated Cigarettes Smoked per Month"
            , y = ""
            , title = "Total cigaretts smoked for Young Smoking Adults"
            )
p <- p + theme_bw()
p
## Warning: Removed 9 rows containing non-finite values (stat_bin).
## Warning: Removed 9 rows containing non-finite values (stat_density).