Note: Each class save this file with a new name, updating the last two digits to the class number. Then, you’ll have a record of your progress, as well as which files you turned in for grading.

Starting in Class 07, we will concatenate all our WSs together to retain the relevant information needed for subsequent classes. You will also have an opportunity to revisit previous parts to make changes or improvements, such as updating your codebook, modifying your research questions, improving tables and plots. I’ve provided an initial predicted organization of our sections and subsections using the # and ## symbols. A table of contents is automatically generated using the “toc: true” in the yaml and can headings in the table of contents are clickable to jump down to each (sub)section.

You will need to update any section symbols to be at least 3 deep (###) if you’ve copied them from an older assignment into this document.

Finally, you can delete all this header text once you don’t need to refer to it.


Document overview

This document is organized by Class number.

Consider your readers (grader):

Global code options

# I set some GLOBAL R chunk options here.
#   (to hide this message add "echo=FALSE" to the code chunk options)
# In particular, see the fig.height and fig.width (in inches)
#   and notes about the cache option.

knitr::opts_chunk$set(comment = NA, message = FALSE, warning = FALSE, width = 100)
knitr::opts_chunk$set(fig.align = "center", fig.height = 4, fig.width = 6)

# Note: The "cache=TRUE" option will save the computations of code chunks
#   so R it doesn't recompute everything every time you recompile.
#   This can save _tons of time_ if you're working on a small section of code
#   at the bottom of the document.
#   Code chunks will be recompiled only if they are edited.
#   The autodep=TRUE will also update dependent code chunks.
#   A folder is created with the cache in it -- if you delete the folder, then
#   all the code chunks will recompute again.
#   ** If things are working as expected, or I want to freshly compute everything,
#      I delete the *_cache folder.
knitr::opts_chunk$set(cache = TRUE) #, autodep=TRUE)  #$

Research Questions

Class 03 Datasets, Codebooks, Personal Codebook

Question of interest

Dataset: National Epidemiologic Survey on Alcohol and Related Conditions (NESARC), with codebook NESARC_W1_CodeBook.pdf.

Initial thinking: (My helpful narrative description to help you get going.) While nicotine dependence is a good starting point, I need to determine what it is about nicotine dependence that I am interested in. It strikes me that friends and acquaintances that I have known through the years that became hooked on cigarettes did so across very different periods of time. Some seemed to be dependent soon after their first few experiences with smoking and others after many years of generally irregular smoking behavior. I decide that I am most interested in exploring the association between level of smoking and nicotine dependence. I add to my codebook variables reflecting smoking levels (e.g., smoking quantity and frequency).

Topic of interest: (You need this part.) I have decided to investigate the relationship between nicotine dependence and the frequency and quantity of smoking on people up to 25 years old. The association may differ by ethnicity, age, gender, and other factors.

How I did it: (My helpful narrative description to help you get going.) I look through the codebook and find some variables of interest. I searched the text with “Ctrl-F” (find) to find these variables. For each variable, I copy/paste the description here, then formatted it so it’s organized. You can choose to use a table or an outline format. I found this verbatim text format to be very easy to format. I retained the “frequency” of each response because it’s interesting to know, and because it was already in the codebook — this value is not required for your codebook.

Codebook

Dataset: NESARC
Primary association: nicotine dependence vs frequency and quantity of smoking

Key:
VarName (RenamedVarName)
  Variable description
  Data type (Continuous, Discrete, Nominal, Ordinal)
  Frequency ItemValue Description

IDNUM (ID)
  UNIQUE ID NUMBER WITH NO ALPHABETICS
  Nominal
  43093 1-43093. Unique Identification number

SEX (Sex)
  SEX
  Nominal
  18518 1. Male
  24575 2. Female

AGE (Age)
  AGE
  Continuous
  43079 18-97. Age in years
     14 98. 98 years or older

CHECK321 (SmokingStatus)
  CIGARETTE SMOKING STATUS
  Nominal
   9913 1. Smoked cigarettes in the past 12 months
   8078 2. Smoked cigarettes prior to the last 12 months
     22 9. Unknown
  25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
          * change NA to 3. Never
          * change 9 to NA

TAB12MDX (TobaccoDependence)
  NICOTINE DEPENDENCE IN THE LAST 12 MONTHS
  Nominal
  38131 0. No nicotine dependence
   4962 1. Nicotine dependence

S3AQ3B1 (SmokingFreq)
  USUAL FREQUENCY WHEN SMOKED CIGARETTES
  Ordinal
  14836 1. Every day
    460 2. 5 to 6 Day(s) a week
    687 3. 3 to 4 Day(s) a week
    747 4. 1 to 2 Day(s) a week
    409 5. 2 to 3 Day(s) a month
    772 6. Once a month or less
    102 9. Unknown
  25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
          * change NA to 7. Never
          * change 9 to NA

S3AQ3C1 (DailyCigsSmoked)
  USUAL QUANTITY WHEN SMOKED CIGARETTES
  Discrete
  17751 1-98. Cigarette(s)
    262 99. Unknown
  25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
          * change NA to 0 Cigarett(s)
          * change 99 to NA

ETHRACE2A (Ethnicity)
  IMPUTED RACE/ETHNICITY
  Nominal
  24507 1. White, Not Hispanic or Latino
   8245 2. Black, Not Hispanic or Latino
    701 3. American Indian/Alaska Native, Not Hispanic or Latino
   1332 4. Asian/Native Hawaiian/Pacific Islander, Not Hispanic or Latino
   8308 5. Hispanic or Latino

MAJORDEPLIFE (Depression)
  MAJOR DEPRESSION - LIFETIME (NON-HIERARCHICAL)
  Nominal
  35254 0. No
   7839 1. Yes

S1Q10A (IncomePersonal)
  TOTAL PERSONAL INCOME IN LAST 12 MONTHS
  Continuous
  43093 0-3000000. Income in dollars

S1Q24FT (Height_ft)
  HEIGHT: FEET
  42363 4-7. Feet
  730 99. Unknown
          * change 99 to NA

S1Q24IN (Height_in)
  HEIGHT: INCHES
  Continuous
  3572 0. None
  38760 1-11. Inches
  761 99. Unknown
          * change 99 to NA

S1Q24LB (Weight_lb)
  WEIGHT: POUNDS
  Continuous
  41717 62-500. Pounds
  1376 999. Unknown
          * change 999 to NA

Additional variables were created from the original variables:

CREATED VARIABLES

DaySmoke
  estimated number of days per month a subject smokes
  Continuous (due to way estimated), assumes 30 days per month using SmokingFreq)
        1-30.

TotalCigsSmoked
  estimated number of cigarettes per month a subject smokes (DaysSmoke * DailyCigsSmoked)
  Continuous
        0-large.

CigsSmokedFac
  quartiles of TotalCigsSmoked
  Ordinal
        ranges for each of the four quarters

SmokingFreq3
  three levels of smoking frequency
  Ordinal
    from SmokingFreq
    1. Daily   = 1. Every day, 2. 5 to 6 Day(s) a week, 3. 3 to 4 Day(s) a week
    2. Weekly  = 4. 1 to 2 Day(s) a week, 5. 2 to 3 Day(s) a month
    3. Monthly = 6. Once a month or less
    4. Never   = 7. Never

Height_inches
  Total height in inches
  Height_ft * 12 + Height_in

Class 04 Citations

I will use (Beckschäfer et al. 2014; Richert 2013; Murphy 2012; Dean and ebrary, Inc 2014) for this work which was obtained using [@beck2014; @rich2013; @murp2012; @dean2014]. Plus, Beckschäfer et al. (2014) says some interesting stuff and that citation was obtained using @beck2014. For more documentation on bibliographies and citations with R Markdown, see http://rmarkdown.rstudio.com/authoring_bibliographies_and_citations.html. For general help with R Markdown, see https://www.rstudio.com/wp-content/uploads/2015/03/rmarkdown-reference.pdf.


Class 05 Research Questions

See Class 06 below.


Class 06 Literature Review

Dataset: National Epidemiologic Survey on Alcohol and Related Conditions (NESARC), with codebook NESARC_W1_CodeBook.pdf.

Research question:

  1. Is nicotine dependence [S3AQ10D] associated with smoking frequency [S3AQ3B1] and quantity [S3AQ3C1]?
    • Google scholar search: “nicotine dependence smoking frequency”
    • Citation: pdf file available for Dierker et al. (2007)
    • Interesting points: Figures 2 and 3, quantity and frequency both positively related to probability of dependence.
    • Others: Kandel and Chen (2000) and Caraballo, Novak, and Asman (2009)

You don’t need to include images in your literature review. I’m providing these tables to illustrate what these tables look like:

  1. Is nicotine dependence [S3AQ10D] associated with depression [S4AQ1]?
    • Google scholar search: “nicotine dependence depression”
    • Citation: pdf file available for Naomi Breslau (1995)
    • Interesting points: Table 2, Smoking and Nicotine Dependence both associated with Education. Table 3, Major depression associated with being nicotine dependent and Sex.
    • Others: N. Breslau et al. (1998)
  1. Is the associated between nicotine dependence [S3AQ10D] and depression [S4AQ1] different by demographics, such as Education or Sex?
    • In Question 2 we see differences by Education and Sex.

I have decided to further focus my question by examining whether the association between nicotine dependence and depression differs based on how much a person smokes. I am wondering if at low levels of smoking compared to high levels, nicotine dependence is more common among individuals with major depression than those without major depression. I add relevant depression questions/items/variables to my personal codebook as well as several demographic measures (age, gender, ethnicity, education, etc.) and any other variables I may wish to consider.

All required variables have been found and added to my personal codebook (by expanding Class 03).


Data Management

Class 07 Working With Data, Data Management

Purpose of study: smoking and depression in young adults

(Note: This is the last paragraph of the research proposal used to remind the reader of the research objectives.)

The present study will examine young adults from the National Epidemiologic Survey of Alcohol and Related Conditions (NESARC). The goals of the analysis will include 1) establishing the relationship between major depression and nicotine dependence; and 2) determining whether or not the relationship between major depression and nicotine dependence exists above and beyond smoking quantity.

Variables

Variables from NESARC that will be used include (already in codebook, above):

  • IDNUM (unique ID number with no alphabetics),
  • MAJORDEPLIFE (has subject experienced a major depression?),
  • CHECK321 (has subject smoked cigarettes in past 12 months?),
  • AGE (age of subject),
  • TAB12MDX (tobacco dependence past 12 months),
  • S3AQ3B1 (usual frequency when cigarettes smoked),
  • ETHRACE2A (ethnicity of subject),
  • SEX (gender of subject), and
  • S3AQ3C1 (usual smoking quantity of cigarettes when smoked).

Data subset columns

First, the data is placed on the search path using the PDS package.

# data analysis packages
library(tidyverse)  # Data manipulation and visualization suite
library(forcats)    # Factor variables
library(lubridate)  # Dates

# data package
library(PDS)

  ## Problems installing PDS package?  Solution.
  ## If you had problems installing the PDS package, no problem; here's how to get the data:
  ## 1. Download the ".RData" file for your dataset into your ADA Folder.
  ## 2. Where I have "library(PDS)" in my code, change it to the two lines below.
  ##    Update the "filename" to the name of the file.
  ##
  ## # library(PDS)
  ## load("filename.RData")

dim(NESARC)
[1] 43093  3008

The variables of interest are specified and a subset of the dataset including only these variables is created and stored in the data frame nesarc.sub.

# variables to include in our data subset
nesarc.sub <-
  NESARC %>%
  select(
    IDNUM
  , SEX
  , AGE
  , CHECK321
  , TAB12MDX
  , S3AQ3B1
  , S3AQ3C1
  , ETHRACE2A
  , MAJORDEPLIFE
  , S1Q10A
  , S1Q24FT
  , S1Q24IN
  , S1Q24LB
  )

Check size and structure of data subset. Note that categorical variables are already Factor variables, but the levels do not have meaningful labels, yet.

# size of subset
dim(nesarc.sub)
[1] 43093    13
# structure of subset
str(nesarc.sub)
'data.frame':   43093 obs. of  13 variables:
 $ IDNUM       : Factor w/ 43093 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ SEX         : Factor w/ 2 levels "1","2": 1 2 2 1 1 2 1 2 2 2 ...
 $ AGE         : num  23 28 81 18 36 34 19 84 29 18 ...
 $ CHECK321    : Factor w/ 3 levels "1","2","9": NA NA NA NA NA NA NA NA NA NA ...
 $ TAB12MDX    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ S3AQ3B1     : Factor w/ 7 levels "1","2","3","4",..: NA NA NA NA NA NA NA NA NA NA ...
 $ S3AQ3C1     : num  NA NA NA NA NA NA NA NA NA NA ...
 $ ETHRACE2A   : Factor w/ 5 levels "1","2","3","4",..: 5 5 5 5 2 2 2 1 1 5 ...
 $ MAJORDEPLIFE: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 2 1 ...
 $ S1Q10A      : num  17500 11000 6000 27000 42000 500 2500 24000 65000 4000 ...
 $ S1Q24FT     : num  5 5 5 5 6 5 5 5 5 5 ...
 $ S1Q24IN     : num  7 1 4 7 1 5 9 4 4 10 ...
 $ S1Q24LB     : num  195 127 185 180 220 140 160 120 140 150 ...

EXTRA: AddHealth, joining waves 1 and 2 together

Joining AddHealth waves 1 and 2 together into a single dataset can be done if you want to use variables from when the participants were both adolecents and adults.

# Original AddHealth datasets
library(PDS)
# size of datasets, to verify the data is available
dim(AddHealth)
[1] 6504 2829
dim(addhealth_public4)
[1] 6504  976
# First column is the unique ID
colnames(AddHealth)[1]
[1] "AID"
colnames(addhealth_public4)[1]
[1] "aid"
# Rename W4's unique ID variable "AID" and "aid" to "ID"
#   so both W1 and W4 have same variable name (lower/uppercase matters)
AddHealth <-
  AddHealth %>%
  rename(ID = AID)
addhealth_public4 <-
  addhealth_public4 %>%
  rename(ID = aid)
# note that the first column name has be updated
colnames(AddHealth)[1]
[1] "ID"
colnames(addhealth_public4)[1]
[1] "ID"
# Join two datasets by unique ID
AH14 <-
  full_join(AddHealth, addhealth_public4)
# now we have twice as many columns
dim(AH14)
[1] 6504 3804
  # Removing this large object since I don't need this data for the NESARC project
  # DELETE THIS LINE if you're using AddHealth
  rm(AH14)

Renaming Variables

Rename variables NewName = OriginalName.

nesarc.sub <-
  nesarc.sub %>%
  rename(
    ID                = IDNUM
  , Sex               = SEX
  , Age               = AGE
  , SmokingStatus     = CHECK321
  , TobaccoDependence = TAB12MDX
  , SmokingFreq       = S3AQ3B1
  , DailyCigsSmoked   = S3AQ3C1
  , Ethnicity         = ETHRACE2A
  , Depression        = MAJORDEPLIFE
  , IncomePersonal    = S1Q10A
  , Height_ft         = S1Q24FT
  , Height_in         = S1Q24IN
  , Weight_lb         = S1Q24LB
  )

str(nesarc.sub)
'data.frame':   43093 obs. of  13 variables:
 $ ID               : Factor w/ 43093 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ Sex              : Factor w/ 2 levels "1","2": 1 2 2 1 1 2 1 2 2 2 ...
 $ Age              : num  23 28 81 18 36 34 19 84 29 18 ...
 $ SmokingStatus    : Factor w/ 3 levels "1","2","9": NA NA NA NA NA NA NA NA NA NA ...
 $ TobaccoDependence: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ SmokingFreq      : Factor w/ 7 levels "1","2","3","4",..: NA NA NA NA NA NA NA NA NA NA ...
 $ DailyCigsSmoked  : num  NA NA NA NA NA NA NA NA NA NA ...
 $ Ethnicity        : Factor w/ 5 levels "1","2","3","4",..: 5 5 5 5 2 2 2 1 1 5 ...
 $ Depression       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 2 1 ...
 $ IncomePersonal   : num  17500 11000 6000 27000 42000 500 2500 24000 65000 4000 ...
 $ Height_ft        : num  5 5 5 5 6 5 5 5 5 5 ...
 $ Height_in        : num  7 1 4 7 1 5 9 4 4 10 ...
 $ Weight_lb        : num  195 127 185 180 220 140 160 120 140 150 ...

The first 6 and last 6 values are given by head() and tail().

head(nesarc.sub)
  ID Sex Age SmokingStatus TobaccoDependence SmokingFreq DailyCigsSmoked
1  1   1  23          <NA>                 0        <NA>              NA
2  2   2  28          <NA>                 0        <NA>              NA
3  3   2  81          <NA>                 0        <NA>              NA
4  4   1  18          <NA>                 0        <NA>              NA
5  5   1  36          <NA>                 0        <NA>              NA
6  6   2  34          <NA>                 0        <NA>              NA
  Ethnicity Depression IncomePersonal Height_ft Height_in Weight_lb
1         5          0          17500         5         7       195
2         5          0          11000         5         1       127
3         5          0           6000         5         4       185
4         5          0          27000         5         7       180
5         2          0          42000         6         1       220
6         2          0            500         5         5       140
tail(nesarc.sub)
         ID Sex Age SmokingStatus TobaccoDependence SmokingFreq
43088 43088   2  18             1                 0           1
43089 43089   2  18          <NA>                 0        <NA>
43090 43090   1  19          <NA>                 0        <NA>
43091 43091   1  18             1                 1           6
43092 43092   1  29             1                 1           1
43093 43093   1  18          <NA>                 0        <NA>
      DailyCigsSmoked Ethnicity Depression IncomePersonal Height_ft
43088               5         1          0           4000         5
43089              NA         1          0           4000         5
43090              NA         1          0           2500         6
43091               3         1          0           3000         5
43092              20         1          0           6500         5
43093              NA         1          1           6000         5
      Height_in Weight_lb
43088         3       120
43089         1       160
43090         1       175
43091         8       150
43092        10       140
43093        11       145

A quick summary of the variables indicates many missing values.

# summary of variables -- notice many NAs
summary(nesarc.sub)
       ID        Sex            Age       SmokingStatus TobaccoDependence
 1      :    1   1:18518   Min.   :18.0   1   : 9913    0:38131          
 2      :    1   2:24575   1st Qu.:32.0   2   : 8078    1: 4962          
 3      :    1             Median :44.0   9   :   22                     
 4      :    1             Mean   :46.4   NA's:25080                     
 5      :    1             3rd Qu.:59.0                                  
 6      :    1             Max.   :98.0                                  
 (Other):43087                                                           
  SmokingFreq    DailyCigsSmoked Ethnicity Depression IncomePersonal   
 1      :14836   Min.   : 1.00   1:24507   0:35254    Min.   :      0  
 6      :  772   1st Qu.: 6.00   2: 8245   1: 7839    1st Qu.:   8800  
 4      :  747   Median :15.00   3:  701              Median :  20000  
 3      :  687   Mean   :16.81   4: 1332              Mean   :  28185  
 2      :  460   3rd Qu.:20.00   5: 8308              3rd Qu.:  36000  
 (Other):  511   Max.   :99.00                        Max.   :3000000  
 NA's   :25080   NA's   :25080                                         
   Height_ft      Height_in        Weight_lb    
 Min.   : 4.0   Min.   : 0.000   Min.   : 62.0  
 1st Qu.: 5.0   1st Qu.: 3.000   1st Qu.:140.0  
 Median : 5.0   Median : 5.000   Median :168.0  
 Mean   : 6.7   Mean   : 6.934   Mean   :197.1  
 3rd Qu.: 5.0   3rd Qu.: 8.000   3rd Qu.:197.0  
 Max.   :99.0   Max.   :99.000   Max.   :999.0  
                                                

Class 08 Subsetting data and R Programming

Coding missing values

There are two steps. The first step is to recode any existing NAs to actual values, if necessary. The method for doing this differs for numeric and categorical variables. The second step is to recode any coded missing values, such as 9s or 99s, as actual NA.

Coding NAs as meaningful “missing”

First step: the existing blank values with NA mean “never”, and “never” has a meaning different from “missing”. For each variable we need to decide what “never” means and code it appropriately.

NAs recoded as numeric

The easiest is DailyCigsSmoked where NA means 0 cigarettes. We replace NA with 0. The function is.na() identifies each observation that “is an NA”. I’ve updated the codebook to indicate that the original NA values were changed.

S3AQ3C1 (DailyCigsSmoked)
  USUAL QUANTITY WHEN SMOKED CIGARETTES
  Discrete
  17751 1-98. Cigarette(s)
    262 99. Unknown
  25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
          * change to 0 Cigarett(s)
table(nesarc.sub$DailyCigsSmoked)

   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
 934  884  923  573 1070  463  269  299   49 3077   23  230   34   25  851 
  16   17   18   19   20   21   22   23   24   25   27   28   29   30   33 
  40   22   59    5 5366    1   10    2    7  155    2    3    3  909    1 
  34   35   37   39   40   45   50   55   57   60   66   70   75   80   98 
   1   30    2    1  993    8  106    2    1  241    1   12    2   47   15 
  99 
 262 
nesarc.sub <-
  nesarc.sub %>%
  mutate(
  # replace NA with 0
    DailyCigsSmoked = replace(DailyCigsSmoked, is.na(DailyCigsSmoked), 0)
  )

table(nesarc.sub$DailyCigsSmoked)

    0     1     2     3     4     5     6     7     8     9    10    11 
25080   934   884   923   573  1070   463   269   299    49  3077    23 
   12    13    14    15    16    17    18    19    20    21    22    23 
  230    34    25   851    40    22    59     5  5366     1    10     2 
   24    25    27    28    29    30    33    34    35    37    39    40 
    7   155     2     3     3   909     1     1    30     2     1   993 
   45    50    55    57    60    66    70    75    80    98    99 
    8   106     2     1   241     1    12     2    47    15   262 
NAs recoded as categorical

The next two variables require defining a new category level. Both SmokingStatus and SmokingFreq with a value of NA means “never”, so we need to create a new “Never” category and put in order with the rest of the levels. I’ve updated the codebook to indicate that the original NA values were changed.

CHECK321 (SmokingStatus)
  CIGARETTE SMOKING STATUS
  Nominal
   9913 1. Smoked cigarettes in the past 12 months
   8078 2. Smoked cigarettes prior to the last 12 months
     22 9. Unknown
  25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
          * change to 3. Never

S3AQ3B1 (SmokingFreq)
  USUAL FREQUENCY WHEN SMOKED CIGARETTES
  Ordinal
  14836 1. Every day
    460 2. 5 to 6 Day(s) a week
    687 3. 3 to 4 Day(s) a week
    747 4. 1 to 2 Day(s) a week
    409 5. 2 to 3 Day(s) a month
    772 6. Once a month or less
    102 9. Unknown
  25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
          * change to 7. Never
table(nesarc.sub$SmokingStatus  )

   1    2    9 
9913 8078   22 
table(nesarc.sub$SmokingFreq    )

    1     2     3     4     5     6     9 
14836   460   687   747   409   772   102 
nesarc.sub <-
  nesarc.sub %>%
  mutate(
  # first change the "factor" variables to "character" to add a new level
    SmokingStatus   = as.character(SmokingStatus)
  , SmokingFreq     = as.character(SmokingFreq  )
  # replace NA with new category values
  , SmokingStatus   = replace(SmokingStatus  , is.na(SmokingStatus), "3")
  , SmokingFreq     = replace(SmokingFreq    , is.na(SmokingFreq)  , "7")
  )

table(nesarc.sub$SmokingStatus  )

    1     2     3     9 
 9913  8078 25080    22 
table(nesarc.sub$SmokingFreq    )

    1     2     3     4     5     6     7     9 
14836   460   687   747   409   772 25080   102 

Coding 9s and 99s as NAs

Second step: replace the coded missing values with NA, then remove the remaining records with missing values. Not all of these variables need this operation, since the previous subset commands removed missing values by chance. However, I’m going to run each as an extra precaution. Afterall, later, if we were to make different previous data decisions, it’s possible we could end up with missing values at this point that are incorrectly coded without this step!

table(nesarc.sub$SmokingStatus  )

    1     2     3     9 
 9913  8078 25080    22 
table(nesarc.sub$SmokingFreq    )

    1     2     3     4     5     6     7     9 
14836   460   687   747   409   772 25080   102 
table(nesarc.sub$DailyCigsSmoked)

    0     1     2     3     4     5     6     7     8     9    10    11 
25080   934   884   923   573  1070   463   269   299    49  3077    23 
   12    13    14    15    16    17    18    19    20    21    22    23 
  230    34    25   851    40    22    59     5  5366     1    10     2 
   24    25    27    28    29    30    33    34    35    37    39    40 
    7   155     2     3     3   909     1     1    30     2     1   993 
   45    50    55    57    60    66    70    75    80    98    99 
    8   106     2     1   241     1    12     2    47    15   262 
nesarc.sub <-
  nesarc.sub %>%
  mutate(
  # replace missing values
    SmokingStatus   = replace(SmokingStatus  , SmokingStatus   %in% c("9"), NA)
  , SmokingFreq     = replace(SmokingFreq    , SmokingFreq     %in% c("9"), NA)
  , DailyCigsSmoked = replace(DailyCigsSmoked, DailyCigsSmoked %in% c( 99), NA)
  , Height_ft       = replace(Height_ft      , Height_ft       %in% c( 99), NA)
  , Height_in       = replace(Height_in      , Height_in       %in% c( 99), NA)
  , Weight_lb       = replace(Weight_lb      , Weight_lb       %in% c(999), NA)
  # drop unused factor levels
  , SmokingStatus   = fct_drop(SmokingStatus)
  , SmokingFreq     = fct_drop(SmokingFreq  )
  )

table(nesarc.sub$SmokingStatus  )

    1     2     3 
 9913  8078 25080 
table(nesarc.sub$SmokingFreq    )

    1     2     3     4     5     6     7 
14836   460   687   747   409   772 25080 
table(nesarc.sub$DailyCigsSmoked)

    0     1     2     3     4     5     6     7     8     9    10    11 
25080   934   884   923   573  1070   463   269   299    49  3077    23 
   12    13    14    15    16    17    18    19    20    21    22    23 
  230    34    25   851    40    22    59     5  5366     1    10     2 
   24    25    27    28    29    30    33    34    35    37    39    40 
    7   155     2     3     3   909     1     1    30     2     1   993 
   45    50    55    57    60    66    70    75    80    98 
    8   106     2     1   241     1    12     2    47    15 
summary(nesarc.sub)
       ID        Sex            Age       SmokingStatus TobaccoDependence
 1      :    1   1:18518   Min.   :18.0   1   : 9913    0:38131          
 2      :    1   2:24575   1st Qu.:32.0   2   : 8078    1: 4962          
 3      :    1             Median :44.0   3   :25080                     
 4      :    1             Mean   :46.4   NA's:   22                     
 5      :    1             3rd Qu.:59.0                                  
 6      :    1             Max.   :98.0                                  
 (Other):43087                                                           
  SmokingFreq    DailyCigsSmoked  Ethnicity Depression IncomePersonal   
 7      :25080   Min.   : 0.000   1:24507   0:35254    Min.   :      0  
 1      :14836   1st Qu.: 0.000   2: 8245   1: 7839    1st Qu.:   8800  
 6      :  772   Median : 0.000   3:  701              Median :  20000  
 4      :  747   Mean   : 6.462   4: 1332              Mean   :  28185  
 3      :  687   3rd Qu.:10.000   5: 8308              3rd Qu.:  36000  
 (Other):  869   Max.   :98.000                        Max.   :3000000  
 NA's   :  102   NA's   :262                                            
   Height_ft      Height_in        Weight_lb    
 Min.   :4.00   Min.   : 0.000   Min.   : 62.0  
 1st Qu.:5.00   1st Qu.: 3.000   1st Qu.:140.0  
 Median :5.00   Median : 5.000   Median :165.0  
 Mean   :5.11   Mean   : 5.279   Mean   :170.6  
 3rd Qu.:5.00   3rd Qu.: 8.000   3rd Qu.:192.0  
 Max.   :7.00   Max.   :11.000   Max.   :500.0  
 NA's   :730    NA's   :761      NA's   :1376   

Finally, consider using na.omit() to remove any records with missing values if it won’t cause issues with analysis. The command na.omit() removes a row if any columns have an NA in that row. This can have drastic consequences. For example, imagine that you have 7 variables and 6 variables have complete data, but the 7th variable is about three-quarters NAs! Then na.omit() will drop three-quarters of your data, even though most of your variables are complete!

There’s only one advantage I can think of for running na.omit(), and that’s if you have a large dataset and only a few observations have a few NAs. You can’t analyze data with NAs, since any calculation involving NA results in an NA. Therefore imagine two scenarios: (1) An analysis with Variables 1 and 2, each with a few NA distributed throughout; first the observations with NA for either Variable 1 or 2 will be removed, then the analysis will proceed. (2) A similar analysis, but now with Variables 1 and 3, but Variable 3 has NA for different observations than Variable 2 — therefore, different observations will be removed before analysis. In summary, these two analyses will include different observations from the dataset. Therefore, by running na.omit() you will start with a dataset with no NAs, thus every analysis will be of exactly the same set of data (at the cost of not using all the possible data for each individual analysis).

Below, I do not run na.omit() since I’m concerned about having lots of NAs for some variables.

dim(nesarc.sub)
[1] 43093    13
# remove rows with missing values
#nesarc.sub <- na.omit(nesarc.sub)
dim(nesarc.sub)
[1] 43093    13
summary(nesarc.sub)
       ID        Sex            Age       SmokingStatus TobaccoDependence
 1      :    1   1:18518   Min.   :18.0   1   : 9913    0:38131          
 2      :    1   2:24575   1st Qu.:32.0   2   : 8078    1: 4962          
 3      :    1             Median :44.0   3   :25080                     
 4      :    1             Mean   :46.4   NA's:   22                     
 5      :    1             3rd Qu.:59.0                                  
 6      :    1             Max.   :98.0                                  
 (Other):43087                                                           
  SmokingFreq    DailyCigsSmoked  Ethnicity Depression IncomePersonal   
 7      :25080   Min.   : 0.000   1:24507   0:35254    Min.   :      0  
 1      :14836   1st Qu.: 0.000   2: 8245   1: 7839    1st Qu.:   8800  
 6      :  772   Median : 0.000   3:  701              Median :  20000  
 4      :  747   Mean   : 6.462   4: 1332              Mean   :  28185  
 3      :  687   3rd Qu.:10.000   5: 8308              3rd Qu.:  36000  
 (Other):  869   Max.   :98.000                        Max.   :3000000  
 NA's   :  102   NA's   :262                                            
   Height_ft      Height_in        Weight_lb    
 Min.   :4.00   Min.   : 0.000   Min.   : 62.0  
 1st Qu.:5.00   1st Qu.: 3.000   1st Qu.:140.0  
 Median :5.00   Median : 5.000   Median :165.0  
 Mean   :5.11   Mean   : 5.279   Mean   :170.6  
 3rd Qu.:5.00   3rd Qu.: 8.000   3rd Qu.:192.0  
 Max.   :7.00   Max.   :11.000   Max.   :500.0  
 NA's   :730    NA's   :761      NA's   :1376   

Now we have a dataset with no missing values.

Creating new variables

Additional variables that we created from the original variables are listed in this ammendment to the original codebook.

CREATED VARIABLES

DaySmoke
  estimated number of days per month a subject smokes
  Continuous (due to way estimated), assumes 30 days per month using SmokingFreq)
        1-30.

TotalCigsSmoked
  estimated number of cigarettes per month a subject smokes (DaysSmoke * DailyCigsSmoked)
  Continuous
        0-large.

CigsSmokedFac
  quartiles of TotalCigsSmoked
  Ordinal
        ranges for each of the four quarters

SmokingFreq3
  three levels of smoking frequency
  Ordinal
    from SmokingFreq
    1. Daily   = 1. Every day, 2. 5 to 6 Day(s) a week, 3. 3 to 4 Day(s) a week
    2. Weekly  = 4. 1 to 2 Day(s) a week, 5. 2 to 3 Day(s) a month
    3. Monthly = 6. Once a month or less
    4. Never   = 7. Never

Height_inches
  Total height in inches
  Height_ft * 12 + Height_in

From categories to numeric

DaysSmoke estimates the days per month a subject smokes by converting SmokingFreq (a factor with 6 levels) to a numeric variable using as.numeric() and multiplying by the midpoint of the range of SmokingFreq.

nesarc.sub <-
  nesarc.sub %>%
  mutate(
    DaysSmoke =
      case_when(
        SmokingFreq == 1 ~ 30      # 1. Every day
      , SmokingFreq == 2 ~ 4 * 5.5 # 2. 5 to 6 Day(s) a week
      , SmokingFreq == 3 ~ 4 * 3.5 # 3. 3 to 4 Day(s) a week
      , SmokingFreq == 4 ~ 4 * 1.5 # 4. 1 to 2 Day(s) a week
      , SmokingFreq == 5 ~ 2.5     # 5. 2 to 3 Day(s) a month
      , SmokingFreq == 6 ~ 1       # 6. Once a month or less
      , SmokingFreq == 7 ~ 0       # 7. Never
      )
  )

summary(nesarc.sub$DaysSmoke)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   0.00    0.00    0.00   10.96   30.00   30.00     102 

From numeric to numeric

Height_inches is the height of a person in inches. Since the data was collected in two variables (Height_ft and Height_in) we need to add them together for our new variable.

nesarc.sub <-
  nesarc.sub %>%
  mutate(
    Height_inches = Height_ft * 12 + Height_in
  )

summary(nesarc.sub$Height_inches)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   48.0    64.0    66.0    66.6    70.0    84.0     761 

The variable TotalCigsSmoked estimates the monthly number of cigarettes a subject smokes per month by multiplying DaysSmoke times DailyCigsSmoked (the usual quantity smoked per day).

nesarc.sub <-
  nesarc.sub %>%
  mutate(
    TotalCigsSmoked = DaysSmoke * DailyCigsSmoked
  )

Log scale. I like log2 since its interpretation is that every unit is a doubling. Notice that I added 1 to the TotalCigsSmoked before computing the log. This is because log(0) is undefined, thus I add 1 so that log(0 + 1) = 0 and all values are defined.

nesarc.sub <-
  nesarc.sub %>%
  mutate(
    TotalCigsSmokedLog2 = log(TotalCigsSmoked + 1, base = 2)
  )

summary(nesarc.sub$TotalCigsSmokedLog2)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  0.000   0.000   0.000   3.284   8.234  11.522     297 

Square root scale. Square root is another transformation that spreads out values between 0 and 1 and compresses values from 1 to infinity. It can be preferred to a log transformation in some cases.

nesarc.sub <-
  nesarc.sub %>%
  mutate(
    TotalCigsSmokedSqrt = sqrt(TotalCigsSmoked)
  )

summary(nesarc.sub$TotalCigsSmokedSqrt)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  0.000   0.000   0.000   7.887  17.321  54.222     297 

From numeric to categories based on quantiles

The numeric variable TotalCigsSmoked is converted into a factor with four roughly equivalent numbers (quartiles) stored in each level of the factor CigsSmokedFac using the cut function. The label "Q1" indicates that the original value was in the first quarter (25%) of the ordered data, that is, they are among the smallest 25% observations; while "Q4" indicates the original value was in the last quarter (25%) of the ordered data, that is, they are among the largest 25% observations.

quantile(nesarc.sub$TotalCigsSmoked, na.rm = TRUE)
  0%  25%  50%  75% 100% 
   0    0    0  300 2940 
nesarc.sub <-
  nesarc.sub %>%
  mutate(
    CigsSmokedFac =
      .bincode(TotalCigsSmoked
      , breaks = quantile(TotalCigsSmoked, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)
      , include.lowest = TRUE
      )
  , CigsSmokedFac = factor( CigsSmokedFac
                          , levels = c(1, 2, 3, 4)
                          , labels = c("Q1", "Q2", "Q3", "Q4")
                          )
  )

In my data, more than half of the values are 0s which is the minimum value, thus they all get grouped in “Q1”; this leaves no values for “Q2” – oh well.

summary(nesarc.sub$CigsSmokedFac)
   Q1    Q2    Q3    Q4  NA's 
25080     0  8630  9086   297 

From many categories to a few

The variable SmokingFreq3 collapses the SmokingFreq from 6 down to 3 categories. This is an example to show you how to do this with your own variables. Some of you will have categorical variables that have a dozen or so categories; having so many categories makes interpretation very difficulty — fewer is easier.

## SmokingFreq3 variable
# notice that I'm doing this before I've assigned the labels
#   if you do this after assigning labels (with the factor() command), then you'll want to use "as.numeric()" as below
# inspect original variable
table(nesarc.sub$SmokingFreq)

    1     2     3     4     5     6     7 
14836   460   687   747   409   772 25080 
# look at numeric values
table(as.numeric(nesarc.sub$SmokingFreq))

    1     2     3     4     5     6     7 
14836   460   687   747   409   772 25080 
# Note: it is sometimes easier to use the numeric "levels" of a factor variable, than the character "labels"

# initialize new variable with NAs
nesarc.sub <-
  nesarc.sub %>%
  mutate(
    SmokingFreq3 =
      fct_collapse(
        SmokingFreq
      , "1" = c("1", "2", "3")  # 1. daily
      , "2" = c("4", "5")       # 2. weekly
      , "3" = "6"               # 3. monthly
      , "4" = "7"               # 7. never
      )
  )

# new variable
table(nesarc.sub$SmokingFreq3)

    1     2     3     4 
15983  1156   772 25080 
# we will label the new factor variable below, with the others

Review results of new variables

Look at the first 10 observations in the dataset, with special attention to the new variables.

head(nesarc.sub, 10)
   ID Sex Age SmokingStatus TobaccoDependence SmokingFreq DailyCigsSmoked
1   1   1  23             3                 0           7               0
2   2   2  28             3                 0           7               0
3   3   2  81             3                 0           7               0
4   4   1  18             3                 0           7               0
5   5   1  36             3                 0           7               0
6   6   2  34             3                 0           7               0
7   7   1  19             3                 0           7               0
8   8   2  84             3                 0           7               0
9   9   2  29             3                 0           7               0
10 10   2  18             3                 0           7               0
   Ethnicity Depression IncomePersonal Height_ft Height_in Weight_lb
1          5          0          17500         5         7       195
2          5          0          11000         5         1       127
3          5          0           6000         5         4       185
4          5          0          27000         5         7       180
5          2          0          42000         6         1       220
6          2          0            500         5         5       140
7          2          1           2500         5         9       160
8          1          0          24000         5         4       120
9          1          1          65000         5         4       140
10         5          0           4000         5        10       150
   DaysSmoke Height_inches TotalCigsSmoked TotalCigsSmokedLog2
1          0            67               0                   0
2          0            61               0                   0
3          0            64               0                   0
4          0            67               0                   0
5          0            73               0                   0
6          0            65               0                   0
7          0            69               0                   0
8          0            64               0                   0
9          0            64               0                   0
10         0            70               0                   0
   TotalCigsSmokedSqrt CigsSmokedFac SmokingFreq3
1                    0            Q1            4
2                    0            Q1            4
3                    0            Q1            4
4                    0            Q1            4
5                    0            Q1            4
6                    0            Q1            4
7                    0            Q1            4
8                    0            Q1            4
9                    0            Q1            4
10                   0            Q1            4

Labeling Categorical variable levels

Look at summary of dataset to see the state of the labelling before we make changes.

summary(nesarc.sub)
       ID        Sex            Age       SmokingStatus TobaccoDependence
 1      :    1   1:18518   Min.   :18.0   1   : 9913    0:38131          
 2      :    1   2:24575   1st Qu.:32.0   2   : 8078    1: 4962          
 3      :    1             Median :44.0   3   :25080                     
 4      :    1             Mean   :46.4   NA's:   22                     
 5      :    1             3rd Qu.:59.0                                  
 6      :    1             Max.   :98.0                                  
 (Other):43087                                                           
  SmokingFreq    DailyCigsSmoked  Ethnicity Depression IncomePersonal   
 7      :25080   Min.   : 0.000   1:24507   0:35254    Min.   :      0  
 1      :14836   1st Qu.: 0.000   2: 8245   1: 7839    1st Qu.:   8800  
 6      :  772   Median : 0.000   3:  701              Median :  20000  
 4      :  747   Mean   : 6.462   4: 1332              Mean   :  28185  
 3      :  687   3rd Qu.:10.000   5: 8308              3rd Qu.:  36000  
 (Other):  869   Max.   :98.000                        Max.   :3000000  
 NA's   :  102   NA's   :262                                            
   Height_ft      Height_in        Weight_lb       DaysSmoke    
 Min.   :4.00   Min.   : 0.000   Min.   : 62.0   Min.   : 0.00  
 1st Qu.:5.00   1st Qu.: 3.000   1st Qu.:140.0   1st Qu.: 0.00  
 Median :5.00   Median : 5.000   Median :165.0   Median : 0.00  
 Mean   :5.11   Mean   : 5.279   Mean   :170.6   Mean   :10.96  
 3rd Qu.:5.00   3rd Qu.: 8.000   3rd Qu.:192.0   3rd Qu.:30.00  
 Max.   :7.00   Max.   :11.000   Max.   :500.0   Max.   :30.00  
 NA's   :730    NA's   :761      NA's   :1376    NA's   :102    
 Height_inches  TotalCigsSmoked  TotalCigsSmokedLog2 TotalCigsSmokedSqrt
 Min.   :48.0   Min.   :   0.0   Min.   : 0.000      Min.   : 0.000     
 1st Qu.:64.0   1st Qu.:   0.0   1st Qu.: 0.000      1st Qu.: 0.000     
 Median :66.0   Median :   0.0   Median : 0.000      Median : 0.000     
 Mean   :66.6   Mean   : 187.5   Mean   : 3.284      Mean   : 7.887     
 3rd Qu.:70.0   3rd Qu.: 300.0   3rd Qu.: 8.234      3rd Qu.:17.321     
 Max.   :84.0   Max.   :2940.0   Max.   :11.522      Max.   :54.222     
 NA's   :761    NA's   :297      NA's   :297         NA's   :297        
 CigsSmokedFac SmokingFreq3
 Q1  :25080    1   :15983  
 Q2  :    0    2   : 1156  
 Q3  : 8630    3   :  772  
 Q4  : 9086    4   :25080  
 NA's:  297    NA's:  102  
                           
                           

Informative labels are given to the factor levels. The order of the levels is also rearranged for the variables SmokingFreq, TobaccoDependence, and Sex.

# Label factors with meaningful names in the same order as the codebook
nesarc.sub$Depression <-
  factor(
    nesarc.sub$Depression
  , labels = c("No Depression"
             , "Yes Depression"
    )
  )

nesarc.sub$SmokingStatus <-
  factor(
    nesarc.sub$SmokingStatus
  , labels = c("Smoked past 12 months"
             , "Smoked prior to 12 months"
             , "Never smoked"
    )
  )

# Specify the order of the levels to be different from codebook order
nesarc.sub$Sex <-
  factor(nesarc.sub$Sex
  , levels = c( 2
              , 1
              )
  , labels = c( "Female"
              , "Male"
              )
  )
# check ordering with a frequency table
table(nesarc.sub$Sex)

Female   Male 
 24575  18518 
nesarc.sub$TobaccoDependence <-
  factor(nesarc.sub$TobaccoDependence
  , levels = c( 1
              , 0
              )
  , labels = c( "Nicotine Dependence"
              , "No Nicotine Dependence"
              )
  )


# Remember the new category level from the original `NA` values.

nesarc.sub$SmokingFreq <-
  factor(nesarc.sub$SmokingFreq
  , levels = c( 7
              , 6
              , 5
              , 4
              , 3
              , 2
              , 1
              )
  , labels = c( "Never"
              , "Once a month or less"
              , "2 to 3 Days/month"
              , "1 to 2 Days/week"
              , "3 to 4 Days/week"
              , "5 to 6 Days/week"
              , "Every Day"
              )
  )


nesarc.sub$Ethnicity <-
  factor(nesarc.sub$Ethnicity
    # shorter labels are helpful for plots
  , labels = c( "Cauc"
              , "AfAm"
              , "NaAm"
              , "Asia"
              , "Hisp"
              )
  # , labels = c("Caucasian"
  #            , "African American"
  #            , "Native American"
  #            , "Asian"
  #            , "Hispanic"
  #             )
)

nesarc.sub$SmokingFreq3 <-
  factor(nesarc.sub$SmokingFreq3
  , levels = c( 1
              , 2
              , 3
              , 4
              )
  , labels = c( "Daily"
              , "Weekly"
              , "Monthly"
              , "Never"
              )
  )

summary(nesarc.sub)
       ID            Sex             Age      
 1      :    1   Female:24575   Min.   :18.0  
 2      :    1   Male  :18518   1st Qu.:32.0  
 3      :    1                  Median :44.0  
 4      :    1                  Mean   :46.4  
 5      :    1                  3rd Qu.:59.0  
 6      :    1                  Max.   :98.0  
 (Other):43087                                
                   SmokingStatus                TobaccoDependence
 Smoked past 12 months    : 9913   Nicotine Dependence   : 4962  
 Smoked prior to 12 months: 8078   No Nicotine Dependence:38131  
 Never smoked             :25080                                 
 NA's                     :   22                                 
                                                                 
                                                                 
                                                                 
               SmokingFreq    DailyCigsSmoked  Ethnicity   
 Never               :25080   Min.   : 0.000   Cauc:24507  
 Every Day           :14836   1st Qu.: 0.000   AfAm: 8245  
 Once a month or less:  772   Median : 0.000   NaAm:  701  
 1 to 2 Days/week    :  747   Mean   : 6.462   Asia: 1332  
 3 to 4 Days/week    :  687   3rd Qu.:10.000   Hisp: 8308  
 (Other)             :  869   Max.   :98.000               
 NA's                :  102   NA's   :262                  
          Depression    IncomePersonal      Height_ft      Height_in     
 No Depression :35254   Min.   :      0   Min.   :4.00   Min.   : 0.000  
 Yes Depression: 7839   1st Qu.:   8800   1st Qu.:5.00   1st Qu.: 3.000  
                        Median :  20000   Median :5.00   Median : 5.000  
                        Mean   :  28185   Mean   :5.11   Mean   : 5.279  
                        3rd Qu.:  36000   3rd Qu.:5.00   3rd Qu.: 8.000  
                        Max.   :3000000   Max.   :7.00   Max.   :11.000  
                                          NA's   :730    NA's   :761     
   Weight_lb       DaysSmoke     Height_inches  TotalCigsSmoked 
 Min.   : 62.0   Min.   : 0.00   Min.   :48.0   Min.   :   0.0  
 1st Qu.:140.0   1st Qu.: 0.00   1st Qu.:64.0   1st Qu.:   0.0  
 Median :165.0   Median : 0.00   Median :66.0   Median :   0.0  
 Mean   :170.6   Mean   :10.96   Mean   :66.6   Mean   : 187.5  
 3rd Qu.:192.0   3rd Qu.:30.00   3rd Qu.:70.0   3rd Qu.: 300.0  
 Max.   :500.0   Max.   :30.00   Max.   :84.0   Max.   :2940.0  
 NA's   :1376    NA's   :102     NA's   :761    NA's   :297     
 TotalCigsSmokedLog2 TotalCigsSmokedSqrt CigsSmokedFac  SmokingFreq3  
 Min.   : 0.000      Min.   : 0.000      Q1  :25080    Daily  :15983  
 1st Qu.: 0.000      1st Qu.: 0.000      Q2  :    0    Weekly : 1156  
 Median : 0.000      Median : 0.000      Q3  : 8630    Monthly:  772  
 Mean   : 3.284      Mean   : 7.887      Q4  : 9086    Never  :25080  
 3rd Qu.: 8.234      3rd Qu.:17.321      NA's:  297    NA's   :  102  
 Max.   :11.522      Max.   :54.222                                   
 NA's   :297         NA's   :297                                      

Data subset rows to young smokers

In practice, I would probably subset the data much earlier. In this example I waited until the end to subset so that I could illustrate all of the other steps with complete values.

In our example, we are interested in the subset of people 25 or younger who smoked in the last 12 months and this is created using the filter() function. Note that SmokingStatus == 1 is used to choose people who smoked in the past 12 months.

Notice, the missing values have been removed (since NAs are not less than or equal to numbers).

# the subset command below will not include a row that evalutes to NA for the
# specified subset.  For example:
  (NA == 1)
[1] NA
# age and smoking subset
nesarc.sub <-
  nesarc.sub %>%
  filter(
    Age <= 25                                 # people 25 or younger
  , SmokingStatus == "Smoked past 12 months"  # smoked in the past 12 months
  )
dim(nesarc.sub)
[1] 1706   20
summary(nesarc.sub)
       ID           Sex           Age       
 21     :   1   Female:850   Min.   :18.00  
 77     :   1   Male  :856   1st Qu.:20.00  
 103    :   1                Median :22.00  
 122    :   1                Mean   :21.61  
 136    :   1                3rd Qu.:23.00  
 150    :   1                Max.   :25.00  
 (Other):1700                               
                   SmokingStatus               TobaccoDependence
 Smoked past 12 months    :1706   Nicotine Dependence   :896    
 Smoked prior to 12 months:   0   No Nicotine Dependence:810    
 Never smoked             :   0                                 
                                                                
                                                                
                                                                
                                                                
               SmokingFreq   DailyCigsSmoked Ethnicity  
 Every Day           :1320   Min.   : 1.00   Cauc:1060  
 3 to 4 Days/week    :  91   1st Qu.: 5.00   AfAm: 211  
 1 to 2 Days/week    :  88   Median :10.00   NaAm:  42  
 Once a month or less:  71   Mean   :11.34   Asia:  58  
 5 to 6 Days/week    :  68   3rd Qu.:20.00   Hisp: 335  
 (Other)             :  65   Max.   :98.00              
 NA's                :   3   NA's   :9                  
          Depression   IncomePersonal     Height_ft       Height_in     
 No Depression :1260   Min.   :     0   Min.   :4.000   Min.   : 0.000  
 Yes Depression: 446   1st Qu.:  4000   1st Qu.:5.000   1st Qu.: 2.000  
                       Median : 11000   Median :5.000   Median : 5.000  
                       Mean   : 13298   Mean   :5.196   Mean   : 5.425  
                       3rd Qu.: 20000   3rd Qu.:5.000   3rd Qu.: 8.000  
                       Max.   :130000   Max.   :6.000   Max.   :11.000  
                                        NA's   :14      NA's   :14      
   Weight_lb       DaysSmoke     Height_inches   TotalCigsSmoked 
 Min.   : 85.0   Min.   : 1.00   Min.   :58.00   Min.   :   1.0  
 1st Qu.:135.5   1st Qu.:30.00   1st Qu.:65.00   1st Qu.:  90.0  
 Median :160.0   Median :30.00   Median :68.00   Median : 300.0  
 Mean   :165.8   Mean   :25.33   Mean   :67.78   Mean   : 320.5  
 3rd Qu.:185.0   3rd Qu.:30.00   3rd Qu.:71.00   3rd Qu.: 600.0  
 Max.   :390.0   Max.   :30.00   Max.   :81.00   Max.   :2940.0  
 NA's   :27      NA's   :3       NA's   :14      NA's   :9       
 TotalCigsSmokedLog2 TotalCigsSmokedSqrt CigsSmokedFac  SmokingFreq3 
 Min.   : 1.000      Min.   : 1.000      Q1  :   0     Daily  :1479  
 1st Qu.: 6.508      1st Qu.: 9.487      Q2  :   0     Weekly : 153  
 Median : 8.234      Median :17.321      Q3  :1107     Monthly:  71  
 Mean   : 7.448      Mean   :15.987      Q4  : 590     Never  :   0  
 3rd Qu.: 9.231      3rd Qu.:24.495      NA's:   9     NA's   :   3  
 Max.   :11.522      Max.   :54.222                                  
 NA's   :9           NA's   :9                                       

EXTRA: Working with Dates

Dates can be tricky in R. The lubridate package makes them much easier.

Below I’m going to create a variable for the person’s age in years based on the person’s date of birth and the interview date. There is an AGE variable which I’m already using, but if there wasn’t, or I wanted to be more precise about their age (such as, 24.34 years old), then this is a way to do it.

I’m going to create a separate data frame just for this example to show how to do it; if you needed these variables, they should be part of your original subset above. Below, the “C” variables (CDAY) are the interview day, month, and year, and the “DOB” variables (DOBD) are the date of birth dat, month, and year.

dat.sub <- subset(NESARC, select = c(CDAY, CMON, CYEAR, DOBD, DOBM, DOBY))
str(dat.sub)
'data.frame':   43093 obs. of  6 variables:
 $ CDAY : num  14 12 23 9 18 7 1 18 9 16 ...
 $ CMON : num  8 1 11 9 10 9 9 9 1 8 ...
 $ CYEAR: num  2001 2002 2001 2001 2001 ...
 $ DOBD : num  27 15 28 27 11 28 16 10 7 25 ...
 $ DOBM : num  9 12 12 6 9 12 3 11 4 3 ...
 $ DOBY : num  1977 1973 1919 1983 1965 ...

Combine the date fields together, then interpret them into a date object. I’ll do this for both the interview date and the date of birth. I print the head (first 6 observations) after each operation.

## interview date
# combine day, month, and year into one text string
cdate.text <- paste(dat.sub$CDAY, dat.sub$CMON, dat.sub$CYEAR)
head(cdate.text)
[1] "14 8 2001"  "12 1 2002"  "23 11 2001" "9 9 2001"   "18 10 2001"
[6] "7 9 2001"  
# create date object by interpretting the day, month, year with dmy()
cdate.date <- dmy(cdate.text)
head(cdate.date)
[1] "2001-08-14" "2002-01-12" "2001-11-23" "2001-09-09" "2001-10-18"
[6] "2001-09-07"
## date of birth
# combine day, month, and year into one text string
dob.text <- paste(dat.sub$DOBD, dat.sub$DOBM, dat.sub$DOBY)
head(dob.text)
[1] "27 9 1977"  "15 12 1973" "28 12 1919" "27 6 1983"  "11 9 1965" 
[6] "28 12 1966"
# create date object by interpretting the day, month, year with dmy()
  # if any failed to parse, it's because a day, month, or year is out of range
  #    such as "99" for missing values or refused to answer
  # in these cases, the date is NA (missing), which is what we want
dob.date <- dmy(dob.text)
head(dob.date)
[1] "1977-09-27" "1973-12-15" "1919-12-28" "1983-06-27" "1965-09-11"
[6] "1966-12-28"

Now that we have the interview date and date of birth as date objects, we calculate the difference to give the person’s age at the time of the interview. Because the difference of dates is in the unit “days”, we convert to years.

# difference in days / 365.25 = difference in years
age.days <- cdate.date - dob.date
head(age.days)
Time differences in days
[1]  8722 10255 29916  6649 13186 12672
age.years <- as.numeric(age.days / 365.25)  # change from "difftime" to "numeric"
head(age.years)
[1] 23.87953 28.07666 81.90554 18.20397 36.10130 34.69405
# difference between the age we calculated and the age in the original dataset
head(age.years - NESARC$AGE)
[1] 0.87953457 0.07665982 0.90554415 0.20396988 0.10130048 0.69404517

Add the new age variable to your data frame, drop the ones you don’t need anymore, and you’re ready to go!

# add the variable
dat.sub$age.years <- age.years
str(dat.sub)
'data.frame':   43093 obs. of  7 variables:
 $ CDAY     : num  14 12 23 9 18 7 1 18 9 16 ...
 $ CMON     : num  8 1 11 9 10 9 9 9 1 8 ...
 $ CYEAR    : num  2001 2002 2001 2001 2001 ...
 $ DOBD     : num  27 15 28 27 11 28 16 10 7 25 ...
 $ DOBM     : num  9 12 12 6 9 12 3 11 4 3 ...
 $ DOBY     : num  1977 1973 1919 1983 1965 ...
 $ age.years: num  23.9 28.1 81.9 18.2 36.1 ...
# drop the original ones you don't need anymore
#   by using subset() and select = -VAR, it excludes the VAR column
dat.sub <-
  dat.sub %>%
  select(-c(CDAY, CMON, CYEAR, DOBD, DOBM, DOBY))
str(dat.sub)
'data.frame':   43093 obs. of  1 variable:
 $ age.years: num  23.9 28.1 81.9 18.2 36.1 ...

Now you’ve got an age (or other date object) you can use in your project.


Graphing

Class 11 Graphing Univariate

Categorical variables

Tables for categorical variables

Basic tables can be created using the functions table or xtabs. Frequency tables are created for the variables TobaccoDependence, CigsSmokedFac, and SmokingFreq. When only looking at one variable, table is the same as xtabs, but xtabs will make looking at two-way, three-way, and higher tables possible.

# table() produces a one-varible frequency table
table(nesarc.sub$TobaccoDependence)

   Nicotine Dependence No Nicotine Dependence 
                   896                    810 
# proportions available by passing these frequencies to prop.table()
table(nesarc.sub$TobaccoDependence) %>% prop.table()

   Nicotine Dependence No Nicotine Dependence 
             0.5252052              0.4747948 
# xtabs() can be used for one or more variables
T1 <- xtabs( ~ TobaccoDependence, data = nesarc.sub)
T1
TobaccoDependence
   Nicotine Dependence No Nicotine Dependence 
                   896                    810 
# proportions available with prop.table()
prop.table(T1)
TobaccoDependence
   Nicotine Dependence No Nicotine Dependence 
             0.5252052              0.4747948 
T2 <- xtabs( ~ SmokingFreq, data = nesarc.sub)
T2
SmokingFreq
               Never Once a month or less    2 to 3 Days/month 
                   0                   71                   65 
    1 to 2 Days/week     3 to 4 Days/week     5 to 6 Days/week 
                  88                   91                   68 
           Every Day 
                1320 

(Look at the .Rmd code to see that these numbers in the next paragraph are calculated directly from the numbers summarized in the table! This way, if the data changes, the text will be correctly updated.)

In the data frame nesarc.sub, there are 896 nicotine dependent subjects and 810 subjects that are not nicotine dependent. Most of the subjects in nesarc.sub are daily smokers (68) with the remainder distributed uniformly across the first five levels of SmokingFreq.

Graphing frequency tables

The barplots are all created with the package ggplot2. The barplots start with the defaults for the geom_bar and add more detail to the plot with each graph.

library(ggplot2)
p1 <- ggplot(data = nesarc.sub, aes(x = TobaccoDependence))
p1 <- p1 + geom_bar()
p1

p2 <- ggplot(data = nesarc.sub, aes(x = SmokingFreq))
p2 <- p2 + geom_bar()
p2 <- p2 + labs(x = "", y = "Total Number", title = "Smoking Frequency for Young Smoking Adults")
p2 <- p2 + theme_bw()
p2 <- p2 + theme(axis.text.x  = element_text(angle = 90, vjust = 0, hjust = 1))
p2

Removing NAs from factor axes

When a factor variable has NAs, then NA appears as a level with frequences (see bar plots above). As of writing this, ggplot() does not have a way to remove the NAs before plotting. Therefore, we have to do it manually.

The easiest and most transparent way is to use the drop_na() function below where we only plot the non-NA values. Compare the two plots below of the same data, but the second plot had the NAs removed before plotting.

library(ggplot2)
# Original plot with NAs
p1 <- ggplot(data = nesarc.sub, aes(x = SmokingFreq))
p1 <- p1 + geom_bar()
p1 <- p1 + labs(title = "Basic plot")
p1 <- p1 + theme(axis.text.x  = element_text(angle = 90, vjust = 0, hjust = 1))
#p1

# subset excluded the NAs for the variable being plotted
# "subset() specifies the values to keep,
#   and "!is.na()" means "keep the non-NA values"
#p2 <- ggplot(data = subset(nesarc.sub, !is.na(CigsSmokedFac)), aes(x = CigsSmokedFac))
p2 <- ggplot(data = nesarc.sub %>% drop_na(SmokingFreq), aes(x = SmokingFreq))
p2 <- p2 + geom_bar()
p2 <- p2 + labs(title = "Using drop_na()")
p2 <- p2 + theme(axis.text.x  = element_text(angle = 90, vjust = 0, hjust = 1))
#p2

# grid.arrange() is a way to arrange several ggplot objects
library(gridExtra)
grid.arrange(grobs = list(p1, p2), nrow=1, top = "Excluding NAs from factor variable in ggplot")

Final versions

library(ggplot2)
p1 <- ggplot(data = nesarc.sub, aes(x = TobaccoDependence))
p1 <- p1 + geom_bar()
p1 <- p1 + labs(x     = ""
              , y     = "Total Number"
              , title = "Nicotine Dependence for Young Smoking Adults"
              )
print(p1)

Interpretation: Among smokers 25 years and younger, there are slightly more (5% more) people who are nicotine dependent (52.5%) compared to those who are not nicotine dependent (47.5%).

p2 <- ggplot(data = nesarc.sub %>% drop_na(SmokingFreq), aes(x = SmokingFreq))
p2 <- p2 + geom_bar()
p2 <- p2 + labs(x     = ""
              , y     = "Total Number"
              , title = "Smoking Frequency for Young Smoking Adults"
              )
p2 <- p2 + theme(axis.text.x  = element_text(angle = 90, vjust = 0, hjust = 1))
print(p2)

Interpretation: Among smokers 25 years and younger, most smoke every day (77.5%) while the less frequent categories are roughly evenly distributed around 4% to 5%.

Numeric variables

Graphing numeric variables

One popular way to graph a continuous variable is to use a histogram. R has the base function hist() which can be used for this purpose. We will also use the package ggplot2 to create histograms. We start with a basic histogram of the variable TotalCigsSmoked.

hist(nesarc.sub$TotalCigsSmoked)

Experiment with the code in the next code chunk to set the binwidth = (extremely important!), decide if you need boundary = 0, add a title, and if needed the labels on the \(x\)- and \(y\)-axes.

p <- ggplot(data = nesarc.sub, aes(x = TotalCigsSmoked))
p <- p + geom_histogram(binwidth = 200)
p <- p + geom_rug()
p <- p + labs(x = "Estimated Cigarettes Smoked per Month")
p <- p + theme_bw()
p

# Use "boundary = 0" for values that can't be negative
p <- ggplot(data = nesarc.sub, aes(x = TotalCigsSmoked))
p <- p + geom_histogram(boundary = 0, binwidth = 100)
p <- p + geom_rug()
p <- p + labs(x = "Estimated Cigarettes Smoked per Month", title = "boundary at 0")
p <- p + theme_bw()
p

Creating Density Plots

A smoothed density curve can be added with the geom_density() function, where adjust = 2.0 is a fairly smooth curve. Experiment with adjust values such as 1.0, 1.5, 2.0, and 3.0 to decrease or increase the smoothness and decide what level best captures the general shape of your distribution.

p <- ggplot(data = nesarc.sub, aes(x = TotalCigsSmoked))
p <- p + geom_histogram(aes(y = ..density..), boundary = 0, binwidth = 100)
p <- p + geom_rug()
p <- p + geom_density(alpha = 0.2, fill = "gray50", adjust = 1.5)
p <- p + labs(x = "Estimated Cigarettes Smoked per Month")
p <- p + theme_bw()
p

Shape, center, and spread

summary(nesarc.sub$TotalCigsSmoked)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
    1.0    90.0   300.0   320.5   600.0  2940.0       9 
fivenum(nesarc.sub$TotalCigsSmoked)
[1]    1   90  300  600 2940
median(nesarc.sub$TotalCigsSmoked, na.rm = TRUE)
[1] 300
IQR(nesarc.sub$TotalCigsSmoked, na.rm = TRUE)
[1] 510

The “shape” of the distribution for the estimated cigarettes smoked per month is skewed to the right. The “center” (median) of the distribution is 300 and the “spread” (interquartile range, middle 50%) for the distribution is 510.

Final versions

p <- ggplot(data = nesarc.sub, aes(x = TotalCigsSmoked))
p <- p + geom_histogram(aes(y = ..density..), boundary = 0, binwidth = 100)
p <- p + geom_rug()
p <- p + geom_density(alpha = 0.2, fill = "gray50", adjust = 1.5)
p <- p + labs(x = "Estimated Cigarettes Smoked per Month"
            , y = ""
            , title = "Total cigaretts smoked for Young Smoking Adults"
            )
p <- p + theme_bw()
p

Interpretation: Among smokers 25 years and younger, the number of cigarettes smoked is right skewed with a median (center) of 300 and the IQR (spread) (interquartile range, middle 50%) for the distribution is 510. There are three distinct modes (peaks) in the distribution because of how we constructed this variable from two others, and because of the common numbers of cigarettes being relative to a pack size (20). There are several outlying values greater than 1200 (representing 40 cigarettes per day) that are possibly overestimates from the people responding.

p <- ggplot(data = nesarc.sub, aes(x = DailyCigsSmoked))
p <- p + geom_histogram(aes(y = ..density..), boundary = 0, binwidth = 2)
p <- p + geom_rug()
p <- p + geom_density(alpha = 0.2, fill = "gray50", adjust = 1.5)
p <- p + labs(x = "Estimated Cigarettes Smoked per Month"
            , y = ""
            , title = "Total cigaretts smoked for Young Smoking Adults"
            )
p <- p + theme_bw()
p

Interpretation: Among smokers 25 years and younger, most smoke less than 10 cigarettes per day with another peak at 20 (1 pack per day), another small peak at 40 (2 packs per day), and some extreme outlying values at 60, 80, and 100 (3, 4, and 5 packs per day).


Class 13 Graphing Bivariate

Scatter plot (for regression): x = numerical, y = numerical

Research question: With my questions, I don’t have a meaningful question to ask that will relate two numeric variable.

An exploratory question whether there is a relationship between a person’s age (Age) and total number of cigaretts smoked (TotalCigsSmoked). Because age is an integer varible, I jitter the points, and I reduce the opacity (increase the transparancy) with alpha = 1/4 to see the density of points better (1/4 means that it takes 4 overlayed points to be fully opaque).

Interpretation: I added a linear regression smoother to see whether there was any linear trend with age. The horizontal line suggests that there isn’t a (linear) relationship here.

The first plot has all the points in their original locations, but they end up stacking on top of each other so you can’t tell how many points are there.

library(ggplot2)
p <- ggplot(nesarc.sub, aes(x = Age, y = TotalCigsSmoked))
p <- p + geom_point(alpha = 1/4)
p <- p + stat_smooth(method = lm)
p <- p + labs(title = "Total Cigarettes Smoked vs Age")
print(p)