1 Document overview

This document is organized by Week and Class number. The in-class assignments are indicated by the Tuesday and Thursday Class numbers. Each week’s homework is often a combination of Tuesday and Thursday, with a small extension. Therefore, “fleshing out” the Tuesday and Thursday sections with a little addition is often sufficient for your homework assignment; that is, you won’t need a separate “homework” section for a week but just extend the in-class assignments. Rarely, the homework assignment is different from the in-class assignments and requires it’s own section in this document.

Consider your readers (graders):

1.1 Global code options

# I set some GLOBAL R chunk options here.
#   (to hide this message add "echo=FALSE" to the code chunk options)
# In particular, see the fig.height and fig.width (in inches)
#   and notes about the cache option.

knitr::opts_chunk$set(comment = NA, message = FALSE, warning = FALSE, width = 100)
knitr::opts_chunk$set(fig.align = "center", fig.height = 4, fig.width = 6)

# Note: The "cache=TRUE" option will save the computations of code chunks
#   so R it doesn't recompute everything every time you recompile.
#   This can save _tons of time_ if you're working on a small section of code
#   at the bottom of the document.
#   Code chunks will be recompiled only if they are edited.
#   The autodep=TRUE will also update dependent code chunks.
#   A folder is created with the cache in it -- if you delete the folder, then
#   all the code chunks will recompute again.
#   ** If things are working as expected, or I want to freshly compute everything,
#      I delete the *_cache folder.
knitr::opts_chunk$set(cache = TRUE) #, autodep=TRUE)

1.2 Document

1.2.1 Naming

Note: Each class save this file with a new name, updating the last two digits to the class number. Then, you’ll have a record of your progress, as well as which files you turned in for grading.

  • ADA1_HW_ALL_NESARC_Project_05.Rmd
  • ADA1_HW_ALL_NESARC_Project_06.Rmd
  • ADA1_HW_ALL_NESARC_Project_07.Rmd

A version that I prefer is to use a date using Year-Month-Day, YYYYMMDD:

  • ADA1_HW_ALL_NESARC_Project_20190903.Rmd
  • ADA1_HW_ALL_NESARC_Project_20190905.Rmd
  • ADA1_HW_ALL_NESARC_Project_20190910.Rmd

1.2.2 Structure

Starting in Week03, we will concatenate all our Homework assignments together to retain the relevant information needed for subsequent classes. You will also have an opportunity to revisit previous parts to make changes or improvements, such as updating your codebook, modifying your research questions, improving tables and plots. I’ve provided an initial predicted organization of our sections and subsections using the # and ## symbols. A table of contents is automatically generated using the “toc: true” in the yaml and can headings in the table of contents are clickable to jump down to each (sub)section.


2 Research Questions

2.1 Week01: Personal Codebook

2.1.1 Class 02 Rmd, codebook

Dataset: National Epidemiologic Survey on Alcohol and Related Conditions (NESARC), with codebook wv1codebook.pdf.

Initial thinking: While nicotine dependence is a good starting point, I need to determine what it is about nicotine dependence that I am interested in. It strikes me that friends and acquaintances that I have known through the years that became hooked on cigarettes did so across very different periods of time. Some seemed to be dependent soon after their first few experiences with smoking and others after many years of generally irregular smoking behavior. I decide that I am most interested in exploring the association between level of smoking and nicotine dependence. I add to my codebook variables reflecting smoking levels (e.g., smoking quantity and frequency).

Topic of interest: Is there an association between nicotine dependence and the frequency and quantity of smoking on people up to 25 years old? The association may differ by ethnicity, age, gender, and other factors (though we won’t be able to test these additional associations until next semester in ADA2).

How I did it: I look through the codebook wv1codebook.pdf and find some variables of interest. I searched the text with “Ctrl-F” (find) to find these variables. For each variable, I copy/paste the description here, then formatted so it’s organized. You can choose to use a table or an outline format. I found this text format to be very easy to format. I retained the “frequency” of each response because it’s interesting to know, and because it was already in the codebook — this value is not required for your codebook.

2.1.2 Codebook

Dataset: NESARC
Primary association: nicotine dependence vs frequency and quantity of smoking

Key:
RenamedVarName
  VarName original in dataset
  Variable description
  Data type (Continuous, Discrete, Nominal, Ordinal)
  Frequency ItemValue Description

ID
  IDNUM
  UNIQUE ID NUMBER WITH NO ALPHABETICS
  Nominal
  43093 1-43093. Unique Identification number

Sex
  SEX
  SEX
  Nominal
  18518 1. Male
  24575 2. Female

Age
  AGE
  AGE
  Continuous
  43079 18-97. Age in years
     14 98. 98 years or older

SmokingStatus
  CHECK321
  CIGARETTE SMOKING STATUS
  Nominal
   9913 1. Smoked cigarettes in the past 12 months
   8078 2. Smoked cigarettes prior to the last 12 months
     22 9. Unknown
  25080 BL. NA, never or unknown if ever smoked 100+ cigarettes

TobaccoDependence
  TAB12MDX
  NICOTINE DEPENDENCE IN THE LAST 12 MONTHS
  Nominal
  38131 0. No nicotine dependence
   4962 1. Nicotine dependence

SmokingFreq
  S3AQ3B1
  USUAL FREQUENCY WHEN SMOKED CIGARETTES
  Ordinal
  14836 1. Every day
    460 2. 5 to 6 Day(s) a week
    687 3. 3 to 4 Day(s) a week
    747 4. 1 to 2 Day(s) a week
    409 5. 2 to 3 Day(s) a month
    772 6. Once a month or less
    102 9. Unknown
  25080 BL. NA, never or unknown if ever smoked 100+ cigarettes

DailyCigsSmoked
  S3AQ3C1
  USUAL QUANTITY WHEN SMOKED CIGARETTES
  Discrete
  17751 1-98. Cigarette(s)
    262 99. Unknown
  25080 BL. NA, never or unknown if ever smoked 100+ cigarettes

Ethnicity
  ETHRACE2A
  IMPUTED RACE/ETHNICITY
  Nominal
  24507 1. White, Not Hispanic or Latino
   8245 2. Black, Not Hispanic or Latino
    701 3. American Indian/Alaska Native, Not Hispanic or Latino
   1332 4. Asian/Native Hawaiian/Pacific Islander, Not Hispanic or Latino
   8308 5. Hispanic or Latino

Depression
  MAJORDEPLIFE
  MAJOR DEPRESSION - LIFETIME (NON-HIERARCHICAL)
  Nominal
  35254 0. No
   7839 1. Yes

IncomePersonal
  S1Q10A
  TOTAL PERSONAL INCOME IN LAST 12 MONTHS
  Continuous
  43093 0-3000000. Income in dollars

Height_ft
  S1Q24FT
  HEIGHT: FEET
  42363 4-7. Feet
  730 99. Unknown
          * change 99 to NA

Height_in
  S1Q24IN
  HEIGHT: INCHES
  Continuous
  3572 0. None
  38760 1-11. Inches
  761 99. Unknown
          * change 99 to NA

Weight_lb
  S1Q24LB
  WEIGHT: POUNDS
  Continuous
  41717 62-500. Pounds
  1376 999. Unknown
          * change 999 to NA

Additional variables were created from the original variables:

CREATED VARIABLES

DaySmoke
  estimated number of days per month a subject smokes
  Continuous (due to way estimated), assumes 30 days per month using SmokingFreq)
  1-30.

TotalCigsSmoked
  estimated number of cigarettes per month a subject smokes (DaysSmoke * DailyCigsSmoked)
  Continuous
  0-large.

CigsSmokedFac
  quartiles of TotalCigsSmoked
  Ordinal
  ranges for each of the four quarters

SmokingFreq3
  three levels of smoking frequency
  Ordinal
    from SmokingFreq
    1. daily   = 1. Every day, 2. 5 to 6 Day(s) a week, 3. 3 to 4 Day(s) a week
    2. weekly  = 4. 1 to 2 Day(s) a week, 5. 2 to 3 Day(s) a month
    3. monthly = 6. Once a month or less
    4. Never   = 7. Never

Height_inches
  Total height in inches
  Height_ft * 12 + Height_in

2.2 Week02: Literature Review

2.2.1 Tuesday ———

2.2.2 Class 03 Research questions

Dataset: National Epidemiologic Survey on Alcohol and Related Conditions (NESARC), with codebook NESARC_W1_CodeBook.pdf.

Research question:

  1. Is nicotine dependence [S3AQ10D] associated with smoking frequency [S3AQ3B1] and quantity [S3AQ3C1]?
  2. Is nicotine dependence [S3AQ10D] associated with depression [S4AQ1]?
  3. Is the associated between nicotine dependence [S3AQ10D] and depression [S4AQ1] different by demographics, such as Education or Sex?

2.2.3 Thursday ———

2.2.4 Class 04 Citations and Literature review

2.2.5 Citations

  • Beckschäfer et al. (2014) said an interesting thing about X and Y and I will try that relationship with my variables A and B.
  • Richert (2013) said something similar, but it was with adults instead of adolescents.
  • There were other citations that are related and I want to read them more carefully (Murphy 2012; Dean and ebrary, Inc 2014; Beckschäfer et al. 2014)

For more documentation on bibliographies and citations with R Markdown, see http://rmarkdown.rstudio.com/authoring_bibliographies_and_citations.html. For general help with R Markdown, see https://www.rstudio.com/wp-content/uploads/2015/03/rmarkdown-reference.pdf.

Citations to appear in parentheses are cited in brackets like this (Beckschäfer et al. 2014; Richert 2013; Murphy 2012; Dean and ebrary, Inc 2014). A citation you want to refer to in text, such as Beckschäfer et al. (2014), doesn’t have the brackets.

2.2.6 Week 02 Homework Literature review

Dataset: National Epidemiologic Survey on Alcohol and Related Conditions (NESARC), with codebook NESARC_W1_CodeBook.pdf.

Research question:

  1. Is nicotine dependence [S3AQ10D] associated with smoking frequency [S3AQ3B1] and quantity [S3AQ3C1]?
    • Google scholar search: “nicotine dependence smoking frequency”
    • Citation: pdf file available for Dierker et al. (2007)
    • Interesting points: Figures 2 and 3, quantity and frequency both positively related to probability of dependence.
    • Others: Kandel and Chen (2000) and Caraballo, Novak, and Asman (2009)

You don’t need to include images in your literature review. I’m providing these figures and tables to illustrate what they look like.

  1. Is nicotine dependence [S3AQ10D] associated with depression [S4AQ1]?
    • Google scholar search: “nicotine dependence depression”
    • Citation: pdf file available for Breslau (1995)
    • Interesting points: Table 2, Smoking and Nicotine Dependence both associated with Education. Table 3, Major depression associated with being nicotine dependent and Sex.
    • Others: Breslau et al. (1998)

  1. Is the associated between nicotine dependence [S3AQ10D] and depression [S4AQ1] different by demographics, such as Education or Sex?
    • In Question 2 we see differences by Education and Sex.

I have decided to further focus my question by examining whether the association between nicotine dependence and depression differs based on how much a person smokes. I am wondering if at low levels of smoking compared to high levels, nicotine dependence is more common among individuals with major depression than those without major depression. I add relevant depression questions/items/variables to my personal codebook as well as several demographic measures (age, gender, ethnicity, education, etc.) and any other variables I may wish to consider.

All required variables have been found and added to my personal codebook (by expanding Class 03).

Citations are at the bottom of this document in the “References” section.


3 Data Management

3.1 Week03: Data Subset, Univariate Summaries And Plots

3.1.1 Background

3.1.1.1 Purpose of study: smoking and depression in young adults

(Note: This is the last paragraph of the research proposal used to remind the reader of the research objectives.)

The present study will examine young adults from the National Epidemiologic Survey of Alcohol and Related Conditions (NESARC). The goals of the analysis will include 1) establishing the relationship between major depression and nicotine dependence; and 2) determining whether or not the relationship between major depression and nicotine dependence exists above and beyond smoking quantity.

3.1.1.2 Variables

Variables from NESARC that will be used include (already in codebook, above):

  • IDNUM (unique ID number with no alphabetics),
  • MAJORDEPLIFE (has subject experienced a major depression?),
  • CHECK321 (has subject smoked cigarettes in past 12 months?),
  • AGE (age of subject),
  • TAB12MDX (tobacco dependence past 12 months),
  • S3AQ3B1 (usual frequency when cigarettes smoked),
  • ETHRACE2A (ethnicity of subject),
  • SEX (gender of subject), and
  • S3AQ3C1 (usual smoking quantity of cigarettes when smoked).

3.1.2 Tuesday ———

3.1.3 Data subset

First, the data is placed on the search path.

# data analysis packages
library(tidyverse)  # Data manipulation and visualization suite
library(forcats)    # Factor variables
library(lubridate)  # Dates

  ## 1. Download the ".RData" file for your dataset into your ADA Folder.
  ## 2. Use the load() statement for the dataset you want to use.
  ##
  ## load("AddHealth.RData")
  ## load("addhealth_public4.RData")
  ## load("NESARC.RData")

# read data
load("NESARC.RData")

dim(NESARC)
[1] 43093  3008

The variables of interest are specified and a subset of the dataset including only these variables is created and stored in the data frame nesarc_sub.

# variables to include in our data subset
nesarc_sub <-
  NESARC %>%
  select(
    IDNUM
  , SEX
  , AGE
  , CHECK321
  , TAB12MDX
  , S3AQ3B1
  , S3AQ3C1
  , ETHRACE2A
  , MAJORDEPLIFE
  , S1Q10A
  , S1Q24FT
  , S1Q24IN
  , S1Q24LB
  )

Check size and structure of data subset. Note that categorical variables are already Factor variables, but the levels do not have meaningful labels, yet.

# size of subset
dim(nesarc_sub)
[1] 43093    13
# structure of subset
str(nesarc_sub)
'data.frame':   43093 obs. of  13 variables:
 $ IDNUM       : Factor w/ 43093 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ SEX         : Factor w/ 2 levels "1","2": 1 2 2 1 1 2 1 2 2 2 ...
 $ AGE         : num  23 28 81 18 36 34 19 84 29 18 ...
 $ CHECK321    : Factor w/ 3 levels "1","2","9": NA NA NA NA NA NA NA NA NA NA ...
 $ TAB12MDX    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ S3AQ3B1     : Factor w/ 7 levels "1","2","3","4",..: NA NA NA NA NA NA NA NA NA NA ...
 $ S3AQ3C1     : num  NA NA NA NA NA NA NA NA NA NA ...
 $ ETHRACE2A   : Factor w/ 5 levels "1","2","3","4",..: 5 5 5 5 2 2 2 1 1 5 ...
 $ MAJORDEPLIFE: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 2 1 ...
 $ S1Q10A      : num  17500 11000 6000 27000 42000 500 2500 24000 65000 4000 ...
 $ S1Q24FT     : num  5 5 5 5 6 5 5 5 5 5 ...
 $ S1Q24IN     : num  7 1 4 7 1 5 9 4 4 10 ...
 $ S1Q24LB     : num  195 127 185 180 220 140 160 120 140 150 ...

3.1.3.1 EXTRA: AddHealth, joining waves 1 and 2 together

Joining AddHealth waves 1 and 2 together into a single dataset can be done if you want to use variables from when the participants were both adolecents and adults.

load("AddHealth.RData")
load("addhealth_public4.RData")
# size of datasets, to verify the data is available
dim(AddHealth)
[1] 6504 2829
dim(addhealth_public4)
[1] 6504  976
# First column is the unique ID
colnames(AddHealth)[1]
[1] "AID"
colnames(addhealth_public4)[1]
[1] "aid"
# Rename W4's unique ID variable "AID" and "aid" to "ID"
#   so both W1 and W4 have same variable name (lower/uppercase matters)
AddHealth <-
  AddHealth %>%
  dplyr::rename(ID = AID)
addhealth_public4 <-
  addhealth_public4 %>%
  dplyr::rename(ID = aid)
# note that the first column name has be updated
colnames(AddHealth)[1]
[1] "ID"
colnames(addhealth_public4)[1]
[1] "ID"
# Join two datasets by unique ID
AH14 <-
  full_join(AddHealth, addhealth_public4)
# now we have twice as many columns
dim(AH14)
[1] 6504 3804
  # remove the original data and use AH14 going forward
  #rm(AddHealth)
  #rm(addhealth_public4)

  # I don't need this data for the NESARC project, so I am removing this large object.
  # DELETE THIS rm() LINE if you're using AddHealth
  #rm(AH14)

3.1.4 Renaming Variables

Rename variables NewName = OriginalName.

nesarc_sub <-
  nesarc_sub %>%
  dplyr::rename(
    ID                = IDNUM
  , Sex               = SEX
  , Age               = AGE
  , SmokingStatus     = CHECK321
  , TobaccoDependence = TAB12MDX
  , SmokingFreq       = S3AQ3B1
  , DailyCigsSmoked   = S3AQ3C1
  , Ethnicity         = ETHRACE2A
  , Depression        = MAJORDEPLIFE
  , IncomePersonal    = S1Q10A
  , Height_ft         = S1Q24FT
  , Height_in         = S1Q24IN
  , Weight_lb         = S1Q24LB
  )

str(nesarc_sub)
'data.frame':   43093 obs. of  13 variables:
 $ ID               : Factor w/ 43093 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ Sex              : Factor w/ 2 levels "1","2": 1 2 2 1 1 2 1 2 2 2 ...
 $ Age              : num  23 28 81 18 36 34 19 84 29 18 ...
 $ SmokingStatus    : Factor w/ 3 levels "1","2","9": NA NA NA NA NA NA NA NA NA NA ...
 $ TobaccoDependence: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ SmokingFreq      : Factor w/ 7 levels "1","2","3","4",..: NA NA NA NA NA NA NA NA NA NA ...
 $ DailyCigsSmoked  : num  NA NA NA NA NA NA NA NA NA NA ...
 $ Ethnicity        : Factor w/ 5 levels "1","2","3","4",..: 5 5 5 5 2 2 2 1 1 5 ...
 $ Depression       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 2 1 ...
 $ IncomePersonal   : num  17500 11000 6000 27000 42000 500 2500 24000 65000 4000 ...
 $ Height_ft        : num  5 5 5 5 6 5 5 5 5 5 ...
 $ Height_in        : num  7 1 4 7 1 5 9 4 4 10 ...
 $ Weight_lb        : num  195 127 185 180 220 140 160 120 140 150 ...

The first 6 and last 6 values are given by head() and tail().

head(nesarc_sub)
  ID Sex Age SmokingStatus TobaccoDependence SmokingFreq DailyCigsSmoked
1  1   1  23          <NA>                 0        <NA>              NA
2  2   2  28          <NA>                 0        <NA>              NA
3  3   2  81          <NA>                 0        <NA>              NA
4  4   1  18          <NA>                 0        <NA>              NA
5  5   1  36          <NA>                 0        <NA>              NA
6  6   2  34          <NA>                 0        <NA>              NA
  Ethnicity Depression IncomePersonal Height_ft Height_in Weight_lb
1         5          0          17500         5         7       195
2         5          0          11000         5         1       127
3         5          0           6000         5         4       185
4         5          0          27000         5         7       180
5         2          0          42000         6         1       220
6         2          0            500         5         5       140
tail(nesarc_sub)
         ID Sex Age SmokingStatus TobaccoDependence SmokingFreq
43088 43088   2  18             1                 0           1
43089 43089   2  18          <NA>                 0        <NA>
43090 43090   1  19          <NA>                 0        <NA>
43091 43091   1  18             1                 1           6
43092 43092   1  29             1                 1           1
43093 43093   1  18          <NA>                 0        <NA>
      DailyCigsSmoked Ethnicity Depression IncomePersonal Height_ft
43088               5         1          0           4000         5
43089              NA         1          0           4000         5
43090              NA         1          0           2500         6
43091               3         1          0           3000         5
43092              20         1          0           6500         5
43093              NA         1          1           6000         5
      Height_in Weight_lb
43088         3       120
43089         1       160
43090         1       175
43091         8       150
43092        10       140
43093        11       145

A quick summary of the variables indicates many missing values.

# summary of variables -- notice many NAs
summary(nesarc_sub)
       ID        Sex            Age       SmokingStatus TobaccoDependence
 1      :    1   1:18518   Min.   :18.0   1   : 9913    0:38131          
 2      :    1   2:24575   1st Qu.:32.0   2   : 8078    1: 4962          
 3      :    1             Median :44.0   9   :   22                     
 4      :    1             Mean   :46.4   NA's:25080                     
 5      :    1             3rd Qu.:59.0                                  
 6      :    1             Max.   :98.0                                  
 (Other):43087                                                           
  SmokingFreq    DailyCigsSmoked Ethnicity Depression IncomePersonal   
 1      :14836   Min.   : 1.00   1:24507   0:35254    Min.   :      0  
 6      :  772   1st Qu.: 6.00   2: 8245   1: 7839    1st Qu.:   8800  
 4      :  747   Median :15.00   3:  701              Median :  20000  
 3      :  687   Mean   :16.81   4: 1332              Mean   :  28185  
 2      :  460   3rd Qu.:20.00   5: 8308              3rd Qu.:  36000  
 (Other):  511   Max.   :99.00                        Max.   :3000000  
 NA's   :25080   NA's   :25080                                         
   Height_ft      Height_in        Weight_lb    
 Min.   : 4.0   Min.   : 0.000   Min.   : 62.0  
 1st Qu.: 5.0   1st Qu.: 3.000   1st Qu.:140.0  
 Median : 5.0   Median : 5.000   Median :168.0  
 Mean   : 6.7   Mean   : 6.934   Mean   :197.1  
 3rd Qu.: 5.0   3rd Qu.: 8.000   3rd Qu.:197.0  
 Max.   :99.0   Max.   :99.000   Max.   :999.0  
                                                

3.1.5 Coding missing values

There are two steps. The first step is to recode any existing NAs to actual values, if necessary. The method for doing this differs for numeric and categorical variables. The second step is to recode any coded missing values, such as 9s or 99s, as actual NA.

3.1.5.1 Coding NAs as meaningful “missing”

First step: the existing blank values with NA mean “never”, and “never” has a meaning different from “missing”. For each variable we need to decide what “never” means and code it appropriately.

3.1.5.1.1 NAs recoded as numeric

The easiest is DailyCigsSmoked where NA means 0 cigarettes. We replace NA with 0. The function is.na() identifies each observation that “is an NA”. I’ve updated the codebook to indicate that the original NA values were changed.

S3AQ3C1 (DailyCigsSmoked)
  USUAL QUANTITY WHEN SMOKED CIGARETTES
  Discrete
  17751 1-98. Cigarette(s)
    262 99. Unknown
  25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
          * change to 0 Cigarett(s)
table(nesarc_sub$DailyCigsSmoked)

   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
 934  884  923  573 1070  463  269  299   49 3077   23  230   34   25  851 
  16   17   18   19   20   21   22   23   24   25   27   28   29   30   33 
  40   22   59    5 5366    1   10    2    7  155    2    3    3  909    1 
  34   35   37   39   40   45   50   55   57   60   66   70   75   80   98 
   1   30    2    1  993    8  106    2    1  241    1   12    2   47   15 
  99 
 262 
nesarc_sub <-
  nesarc_sub %>%
  replace_na(
  # replace NA with new numeric value
    list(
      DailyCigsSmoked = 0
    )
  )

table(nesarc_sub$DailyCigsSmoked)

    0     1     2     3     4     5     6     7     8     9    10    11 
25080   934   884   923   573  1070   463   269   299    49  3077    23 
   12    13    14    15    16    17    18    19    20    21    22    23 
  230    34    25   851    40    22    59     5  5366     1    10     2 
   24    25    27    28    29    30    33    34    35    37    39    40 
    7   155     2     3     3   909     1     1    30     2     1   993 
   45    50    55    57    60    66    70    75    80    98    99 
    8   106     2     1   241     1    12     2    47    15   262 
3.1.5.1.2 NAs recoded as categorical

The next two variables require defining a new category level. Both SmokingStatus and SmokingFreq with a value of NA means “never”, so we need to create a new “Never” category and put in order with the rest of the levels. I’ve updated the codebook to indicate that the original NA values were changed.

CHECK321 (SmokingStatus)
  CIGARETTE SMOKING STATUS
  Nominal
   9913 1. Smoked cigarettes in the past 12 months
   8078 2. Smoked cigarettes prior to the last 12 months
     22 9. Unknown
  25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
          * change to 3. Never

S3AQ3B1 (SmokingFreq)
  USUAL FREQUENCY WHEN SMOKED CIGARETTES
  Ordinal
  14836 1. Every day
    460 2. 5 to 6 Day(s) a week
    687 3. 3 to 4 Day(s) a week
    747 4. 1 to 2 Day(s) a week
    409 5. 2 to 3 Day(s) a month
    772 6. Once a month or less
    102 9. Unknown
  25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
          * change to 7. Never
table(nesarc_sub$SmokingStatus  )

   1    2    9 
9913 8078   22 
table(nesarc_sub$SmokingFreq    )

    1     2     3     4     5     6     9 
14836   460   687   747   409   772   102 
nesarc_sub <-
  nesarc_sub %>%
  mutate(
  # first change the "factor" variables to "character" to add a new level
    SmokingStatus   = as.character(SmokingStatus)
  , SmokingFreq     = as.character(SmokingFreq  )
  ) %>%
  replace_na(
  # replace NA with new category values
    list(
      SmokingStatus = "3"
    , SmokingFreq   = "7"
    )
  )

table(nesarc_sub$SmokingStatus  )

    1     2     3     9 
 9913  8078 25080    22 
table(nesarc_sub$SmokingFreq    )

    1     2     3     4     5     6     7     9 
14836   460   687   747   409   772 25080   102 

3.1.5.2 Coding 9s and 99s as NAs

Second step: replace the coded missing values with NA, then remove the remaining records with missing values. Not all of these variables need this operation, since the previous subset commands removed missing values by chance. However, I’m going to run each as an extra precaution. Afterall, later, if we were to make different previous data decisions, it’s possible we could end up with missing values at this point that are incorrectly coded without this step!

table(nesarc_sub$SmokingStatus  )

    1     2     3     9 
 9913  8078 25080    22 
table(nesarc_sub$SmokingFreq    )

    1     2     3     4     5     6     7     9 
14836   460   687   747   409   772 25080   102 
table(nesarc_sub$DailyCigsSmoked)

    0     1     2     3     4     5     6     7     8     9    10    11 
25080   934   884   923   573  1070   463   269   299    49  3077    23 
   12    13    14    15    16    17    18    19    20    21    22    23 
  230    34    25   851    40    22    59     5  5366     1    10     2 
   24    25    27    28    29    30    33    34    35    37    39    40 
    7   155     2     3     3   909     1     1    30     2     1   993 
   45    50    55    57    60    66    70    75    80    98    99 
    8   106     2     1   241     1    12     2    47    15   262 
nesarc_sub <-
  nesarc_sub %>%
  mutate(
  # replace missing values
    SmokingStatus   = replace(SmokingStatus  , SmokingStatus   %in% c("9"), NA)
  , SmokingFreq     = replace(SmokingFreq    , SmokingFreq     %in% c("9"), NA)
  , DailyCigsSmoked = replace(DailyCigsSmoked, DailyCigsSmoked %in% c( 99), NA)
  , Height_ft       = replace(Height_ft      , Height_ft       %in% c( 99), NA)
  , Height_in       = replace(Height_in      , Height_in       %in% c( 99), NA)
  , Weight_lb       = replace(Weight_lb      , Weight_lb       %in% c(999), NA)
  # drop unused factor levels
  , SmokingStatus   = fct_drop(SmokingStatus)
  , SmokingFreq     = fct_drop(SmokingFreq  )
  )

table(nesarc_sub$SmokingStatus  )

    1     2     3 
 9913  8078 25080 
table(nesarc_sub$SmokingFreq    )

    1     2     3     4     5     6     7 
14836   460   687   747   409   772 25080 
table(nesarc_sub$DailyCigsSmoked)

    0     1     2     3     4     5     6     7     8     9    10    11 
25080   934   884   923   573  1070   463   269   299    49  3077    23 
   12    13    14    15    16    17    18    19    20    21    22    23 
  230    34    25   851    40    22    59     5  5366     1    10     2 
   24    25    27    28    29    30    33    34    35    37    39    40 
    7   155     2     3     3   909     1     1    30     2     1   993 
   45    50    55    57    60    66    70    75    80    98 
    8   106     2     1   241     1    12     2    47    15 
summary(nesarc_sub)
       ID        Sex            Age       SmokingStatus TobaccoDependence
 1      :    1   1:18518   Min.   :18.0   1   : 9913    0:38131          
 2      :    1   2:24575   1st Qu.:32.0   2   : 8078    1: 4962          
 3      :    1             Median :44.0   3   :25080                     
 4      :    1             Mean   :46.4   NA's:   22                     
 5      :    1             3rd Qu.:59.0                                  
 6      :    1             Max.   :98.0                                  
 (Other):43087                                                           
  SmokingFreq    DailyCigsSmoked  Ethnicity Depression IncomePersonal   
 7      :25080   Min.   : 0.000   1:24507   0:35254    Min.   :      0  
 1      :14836   1st Qu.: 0.000   2: 8245   1: 7839    1st Qu.:   8800  
 6      :  772   Median : 0.000   3:  701              Median :  20000  
 4      :  747   Mean   : 6.462   4: 1332              Mean   :  28185  
 3      :  687   3rd Qu.:10.000   5: 8308              3rd Qu.:  36000  
 (Other):  869   Max.   :98.000                        Max.   :3000000  
 NA's   :  102   NA's   :262                                            
   Height_ft      Height_in        Weight_lb    
 Min.   :4.00   Min.   : 0.000   Min.   : 62.0  
 1st Qu.:5.00   1st Qu.: 3.000   1st Qu.:140.0  
 Median :5.00   Median : 5.000   Median :165.0  
 Mean   :5.11   Mean   : 5.279   Mean   :170.6  
 3rd Qu.:5.00   3rd Qu.: 8.000   3rd Qu.:192.0  
 Max.   :7.00   Max.   :11.000   Max.   :500.0  
 NA's   :730    NA's   :761      NA's   :1376   

Finally, consider using na.omit() to remove any records with missing values if it won’t cause issues with analysis. The command na.omit() removes a row if any columns have an NA in that row. This can have drastic consequences. For example, imagine that you have 7 variables and 6 variables have complete data, but the 7th variable is about three-quarters NAs! Then na.omit() will drop three-quarters of your data, even though most of your variables are complete!

There’s only one advantage I can think of for running na.omit(), and that’s if you have a large dataset and only a few observations have a few NAs. You can’t analyze data with NAs, since any calculation involving NA results in an NA. Therefore imagine two scenarios: (1) An analysis with Variables 1 and 2, each with a few NA distributed throughout; first the observations with NA for either Variable 1 or 2 will be removed, then the analysis will proceed. (2) A similar analysis, but now with Variables 1 and 3, but Variable 3 has NA for different observations than Variable 2 — therefore, different observations will be removed before analysis. In summary, these two analyses will include different observations from the dataset. Therefore, by running na.omit() you will start with a dataset with no NAs, thus every analysis will be of exactly the same set of data (at the cost of not using all the possible data for each individual analysis).

Below, I do not run na.omit() since I’m concerned about having lots of NAs for some variables.

dim(nesarc_sub)
[1] 43093    13
# remove rows with missing values
#nesarc_sub <- na.omit(nesarc_sub)
dim(nesarc_sub)
[1] 43093    13
summary(nesarc_sub)
       ID        Sex            Age       SmokingStatus TobaccoDependence
 1      :    1   1:18518   Min.   :18.0   1   : 9913    0:38131          
 2      :    1   2:24575   1st Qu.:32.0   2   : 8078    1: 4962          
 3      :    1             Median :44.0   3   :25080                     
 4      :    1             Mean   :46.4   NA's:   22                     
 5      :    1             3rd Qu.:59.0                                  
 6      :    1             Max.   :98.0                                  
 (Other):43087                                                           
  SmokingFreq    DailyCigsSmoked  Ethnicity Depression IncomePersonal   
 7      :25080   Min.   : 0.000   1:24507   0:35254    Min.   :      0  
 1      :14836   1st Qu.: 0.000   2: 8245   1: 7839    1st Qu.:   8800  
 6      :  772   Median : 0.000   3:  701              Median :  20000  
 4      :  747   Mean   : 6.462   4: 1332              Mean   :  28185  
 3      :  687   3rd Qu.:10.000   5: 8308              3rd Qu.:  36000  
 (Other):  869   Max.   :98.000                        Max.   :3000000  
 NA's   :  102   NA's   :262                                            
   Height_ft      Height_in        Weight_lb    
 Min.   :4.00   Min.   : 0.000   Min.   : 62.0  
 1st Qu.:5.00   1st Qu.: 3.000   1st Qu.:140.0  
 Median :5.00   Median : 5.000   Median :165.0  
 Mean   :5.11   Mean   : 5.279   Mean   :170.6  
 3rd Qu.:5.00   3rd Qu.: 8.000   3rd Qu.:192.0  
 Max.   :7.00   Max.   :11.000   Max.   :500.0  
 NA's   :730    NA's   :761      NA's   :1376   

Now we have a dataset with no missing values.

3.1.6 Creating new variables

Additional variables that we created from the original variables are listed in this ammendment to the original codebook.

CREATED VARIABLES

DaySmoke
  estimated number of days per month a subject smokes
  Continuous (due to way estimated), assumes 30 days per month using SmokingFreq)
        1-30.

TotalCigsSmoked
  estimated number of cigarettes per month a subject smokes (DaysSmoke * DailyCigsSmoked)
  Continuous
        0-large.

CigsSmokedFac
  quartiles of TotalCigsSmoked
  Ordinal
        ranges for each of the four quarters

SmokingFreq3
  three levels of smoking frequency
  Ordinal
    from SmokingFreq
    1. Daily   = 1. Every day, 2. 5 to 6 Day(s) a week, 3. 3 to 4 Day(s) a week
    2. Weekly  = 4. 1 to 2 Day(s) a week, 5. 2 to 3 Day(s) a month
    3. Monthly = 6. Once a month or less
    4. Never   = 7. Never

Height_inches
  Total height in inches
  Height_ft * 12 + Height_in

3.1.6.1 From categories to numeric

DaysSmoke estimates the days per month a subject smokes by converting SmokingFreq (a factor with 6 levels) to a numeric variable using as.numeric() and multiplying by the midpoint of the range of SmokingFreq.

nesarc_sub <-
  nesarc_sub %>%
  mutate(
    DaysSmoke =
      case_when(
        SmokingFreq == 1 ~ 30      # 1. Every day
      , SmokingFreq == 2 ~ 4 * 5.5 # 2. 5 to 6 Day(s) a week
      , SmokingFreq == 3 ~ 4 * 3.5 # 3. 3 to 4 Day(s) a week
      , SmokingFreq == 4 ~ 4 * 1.5 # 4. 1 to 2 Day(s) a week
      , SmokingFreq == 5 ~ 2.5     # 5. 2 to 3 Day(s) a month
      , SmokingFreq == 6 ~ 1       # 6. Once a month or less
      , SmokingFreq == 7 ~ 0       # 7. Never
      )
  )

summary(nesarc_sub$DaysSmoke)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   0.00    0.00    0.00   10.96   30.00   30.00     102 
3.1.6.1.1 AddHealth4 Income example

Use the AddHealth Wave 4 income variable to convert from categorical to numerical.

Note that there’s also variable H4EC2, which is already a numeric income variable.

# create "Income4" from the Section 12: Economics section.
AH14 <-
  AH14 %>%
  mutate(
    Income4 = h4ec1
  )

table(AH14$Income4)

   1    2    3    4    5    6    7    8    9   10   11   12   96   98 
 137  120  173  167  237  269  505  578 1154  698  476  247   51  302 
# create a numeric version of this categorical variable
AH14 <-
  AH14 %>%
  mutate(
    Income4_num =
      case_when(
        Income4 ==  1 ~   2500     #  less than $5,000
      , Income4 ==  2 ~   7500     #  $5,000 to $9,999
      , Income4 ==  3 ~  12500     #  $10,000 to $14,999
      , Income4 ==  4 ~  17500     #  $15,000 to $19,999
      , Income4 ==  5 ~  22500     #  $20,000 to $24,999
      , Income4 ==  6 ~  27500     #  $25,000 to $29,999
      , Income4 ==  7 ~  35000     #  $30,000 to $39,999
      , Income4 ==  8 ~  45000     #  $40,000 to $49,999
      , Income4 ==  9 ~  62500     #  $50,000 to $74,999
      , Income4 == 10 ~  87500     #  $75,000 to $99,999
      , Income4 == 11 ~ 125000     #  $100,000 to $149,999
      , Income4 == 12 ~ 175000     #  $150,000 or more
      ## These two would have been coded as NA previously
      ## You can't code them as NA in this case_when() function.
      #, Income4 == 96 ~            #  refused
      #, Income4 == 98 ~            #  don't know
      )
  )

table(AH14$Income4_num)

  2500   7500  12500  17500  22500  27500  35000  45000  62500  87500 
   137    120    173    167    237    269    505    578   1154    698 
125000 175000 
   476    247 

3.1.6.2 From numeric to numeric

Height_inches is the height of a person in inches. Since the data was collected in two variables (Height_ft and Height_in) we need to add them together for our new variable.

nesarc_sub <-
  nesarc_sub %>%
  mutate(
    Height_inches = Height_ft * 12 + Height_in
  )

summary(nesarc_sub$Height_inches)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   48.0    64.0    66.0    66.6    70.0    84.0     761 

The variable TotalCigsSmoked estimates the monthly number of cigarettes a subject smokes per month by multiplying DaysSmoke times DailyCigsSmoked (the usual quantity smoked per day).

nesarc_sub <-
  nesarc_sub %>%
  mutate(
    TotalCigsSmoked = DaysSmoke * DailyCigsSmoked
  )

Log scale. I like log2 since its interpretation is that every unit is a doubling. Notice that I added 1 to the TotalCigsSmoked before computing the log. This is because log(0) is undefined, thus I add 1 so that log(0 + 1) = 0 and all values are defined.

nesarc_sub <-
  nesarc_sub %>%
  mutate(
    TotalCigsSmokedLog2 = log(TotalCigsSmoked + 1, base = 2)
  )

summary(nesarc_sub$TotalCigsSmokedLog2)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  0.000   0.000   0.000   3.284   8.234  11.522     297 

Square root scale. Square root is another transformation that spreads out values between 0 and 1 and compresses values from 1 to infinity. It can be preferred to a log transformation in some cases.

nesarc_sub <-
  nesarc_sub %>%
  mutate(
    TotalCigsSmokedSqrt = sqrt(TotalCigsSmoked)
  )

summary(nesarc_sub$TotalCigsSmokedSqrt)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  0.000   0.000   0.000   7.887  17.321  54.222     297 

3.1.6.3 From numeric to categories based on quantiles

The numeric variable TotalCigsSmoked is converted into a factor with four roughly equivalent numbers (quartiles) stored in each level of the factor CigsSmokedFac using the cut function. The label "Q1" indicates that the original value was in the first quarter (25%) of the ordered data, that is, they are among the smallest 25% observations; while "Q4" indicates the original value was in the last quarter (25%) of the ordered data, that is, they are among the largest 25% observations.

quantile(nesarc_sub$TotalCigsSmoked, na.rm = TRUE)
  0%  25%  50%  75% 100% 
   0    0    0  300 2940 
nesarc_sub <-
  nesarc_sub %>%
  mutate(
    CigsSmokedFac =
      .bincode(TotalCigsSmoked
      , breaks = quantile(TotalCigsSmoked, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)
      , include.lowest = TRUE
      )
  , CigsSmokedFac = factor( CigsSmokedFac
                          , levels = c(1, 2, 3, 4)
                          , labels = c("Q1", "Q2", "Q3", "Q4")
                          )
  )

In my data, more than half of the values are 0s which is the minimum value, thus they all get grouped in “Q1”; this leaves no values for “Q2” – oh well.

summary(nesarc_sub$CigsSmokedFac)
   Q1    Q2    Q3    Q4  NA's 
25080     0  8630  9086   297 

3.1.6.4 From many categories to a few

The variable SmokingFreq3 collapses the SmokingFreq from 6 down to 3 categories. This is an example to show you how to do this with your own variables. Some of you will have categorical variables that have a dozen or so categories; having so many categories makes interpretation very difficulty — fewer is easier.

## SmokingFreq3 variable
# notice that I'm doing this before I've assigned the labels
#   if you do this after assigning labels (with the factor() command), then you'll want to use "as.numeric()" as below
# inspect original variable
table(nesarc_sub$SmokingFreq)

    1     2     3     4     5     6     7 
14836   460   687   747   409   772 25080 
# look at numeric values
table(as.numeric(nesarc_sub$SmokingFreq))

    1     2     3     4     5     6     7 
14836   460   687   747   409   772 25080 
# Note: it is sometimes easier to use the numeric "levels" of a factor variable, than the character "labels"

# initialize new variable with NAs
nesarc_sub <-
  nesarc_sub %>%
  mutate(
    SmokingFreq3 =
      fct_collapse(
        SmokingFreq
      , "1" = c("1", "2", "3")  # 1. daily
      , "2" = c("4", "5")       # 2. weekly
      , "3" = "6"               # 3. monthly
      , "4" = "7"               # 7. never
      )
  )

# new variable
table(nesarc_sub$SmokingFreq3)

    1     2     3     4 
15983  1156   772 25080 
# we will label the new factor variable below, with the others

3.1.6.5 Review results of new variables

Look at the first 10 observations in the dataset, with special attention to the new variables.

head(nesarc_sub, 10)
   ID Sex Age SmokingStatus TobaccoDependence SmokingFreq DailyCigsSmoked
1   1   1  23             3                 0           7               0
2   2   2  28             3                 0           7               0
3   3   2  81             3                 0           7               0
4   4   1  18             3                 0           7               0
5   5   1  36             3                 0           7               0
6   6   2  34             3                 0           7               0
7   7   1  19             3                 0           7               0
8   8   2  84             3                 0           7               0
9   9   2  29             3                 0           7               0
10 10   2  18             3                 0           7               0
   Ethnicity Depression IncomePersonal Height_ft Height_in Weight_lb
1          5          0          17500         5         7       195
2          5          0          11000         5         1       127
3          5          0           6000         5         4       185
4          5          0          27000         5         7       180
5          2          0          42000         6         1       220
6          2          0            500         5         5       140
7          2          1           2500         5         9       160
8          1          0          24000         5         4       120
9          1          1          65000         5         4       140
10         5          0           4000         5        10       150
   DaysSmoke Height_inches TotalCigsSmoked TotalCigsSmokedLog2
1          0            67               0                   0
2          0            61               0                   0
3          0            64               0                   0
4          0            67               0                   0
5          0            73               0                   0
6          0            65               0                   0
7          0            69               0                   0
8          0            64               0                   0
9          0            64               0                   0
10         0            70               0                   0
   TotalCigsSmokedSqrt CigsSmokedFac SmokingFreq3
1                    0            Q1            4
2                    0            Q1            4
3                    0            Q1            4
4                    0            Q1            4
5                    0            Q1            4
6                    0            Q1            4
7                    0            Q1            4
8                    0            Q1            4
9                    0            Q1            4
10                   0            Q1            4

3.1.7 Labeling Categorical variable levels

Look at summary of dataset to see the state of the labelling before we make changes.

summary(nesarc_sub)
       ID        Sex            Age       SmokingStatus TobaccoDependence
 1      :    1   1:18518   Min.   :18.0   1   : 9913    0:38131          
 2      :    1   2:24575   1st Qu.:32.0   2   : 8078    1: 4962          
 3      :    1             Median :44.0   3   :25080                     
 4      :    1             Mean   :46.4   NA's:   22                     
 5      :    1             3rd Qu.:59.0                                  
 6      :    1             Max.   :98.0                                  
 (Other):43087                                                           
  SmokingFreq    DailyCigsSmoked  Ethnicity Depression IncomePersonal   
 7      :25080   Min.   : 0.000   1:24507   0:35254    Min.   :      0  
 1      :14836   1st Qu.: 0.000   2: 8245   1: 7839    1st Qu.:   8800  
 6      :  772   Median : 0.000   3:  701              Median :  20000  
 4      :  747   Mean   : 6.462   4: 1332              Mean   :  28185  
 3      :  687   3rd Qu.:10.000   5: 8308              3rd Qu.:  36000  
 (Other):  869   Max.   :98.000                        Max.   :3000000  
 NA's   :  102   NA's   :262                                            
   Height_ft      Height_in        Weight_lb       DaysSmoke    
 Min.   :4.00   Min.   : 0.000   Min.   : 62.0   Min.   : 0.00  
 1st Qu.:5.00   1st Qu.: 3.000   1st Qu.:140.0   1st Qu.: 0.00  
 Median :5.00   Median : 5.000   Median :165.0   Median : 0.00  
 Mean   :5.11   Mean   : 5.279   Mean   :170.6   Mean   :10.96  
 3rd Qu.:5.00   3rd Qu.: 8.000   3rd Qu.:192.0   3rd Qu.:30.00  
 Max.   :7.00   Max.   :11.000   Max.   :500.0   Max.   :30.00  
 NA's   :730    NA's   :761      NA's   :1376    NA's   :102    
 Height_inches  TotalCigsSmoked  TotalCigsSmokedLog2 TotalCigsSmokedSqrt
 Min.   :48.0   Min.   :   0.0   Min.   : 0.000      Min.   : 0.000     
 1st Qu.:64.0   1st Qu.:   0.0   1st Qu.: 0.000      1st Qu.: 0.000     
 Median :66.0   Median :   0.0   Median : 0.000      Median : 0.000     
 Mean   :66.6   Mean   : 187.5   Mean   : 3.284      Mean   : 7.887     
 3rd Qu.:70.0   3rd Qu.: 300.0   3rd Qu.: 8.234      3rd Qu.:17.321     
 Max.   :84.0   Max.   :2940.0   Max.   :11.522      Max.   :54.222     
 NA's   :761    NA's   :297      NA's   :297         NA's   :297        
 CigsSmokedFac SmokingFreq3
 Q1  :25080    1   :15983  
 Q2  :    0    2   : 1156  
 Q3  : 8630    3   :  772  
 Q4  : 9086    4   :25080  
 NA's:  297    NA's:  102  
                           
                           

Informative labels are given to the factor levels. The order of the levels is also rearranged for the variables SmokingFreq, TobaccoDependence, and Sex.

# Label factors with meaningful names in the same order as the codebook
nesarc_sub$Depression <-
  factor(
    nesarc_sub$Depression
  , labels = c("No Depression"
             , "Yes Depression"
    )
  )

nesarc_sub$SmokingStatus <-
  factor(
    nesarc_sub$SmokingStatus
  , labels = c("Smoked past 12 months"
             , "Smoked prior to 12 months"
             , "Never smoked"
    )
  )

# Specify the order of the levels to be different from codebook order
nesarc_sub$Sex <-
  factor(nesarc_sub$Sex
  , levels = c( 2
              , 1
              )
  , labels = c( "Female"
              , "Male"
              )
  )
# check ordering with a frequency table
table(nesarc_sub$Sex)

Female   Male 
 24575  18518 
nesarc_sub$TobaccoDependence <-
  factor(nesarc_sub$TobaccoDependence
  , levels = c( 1
              , 0
              )
  , labels = c( "Nicotine Dependence"
              , "No Nicotine Dependence"
              )
  )


# Remember to include the new category level from the original `NA` values.

nesarc_sub$SmokingFreq <-
  factor(nesarc_sub$SmokingFreq
  , levels = c( 7
              , 6
              , 5
              , 4
              , 3
              , 2
              , 1
              )
  , labels = c( "Never"
              , "Once a month or less"
              , "2 to 3 Days/month"
              , "1 to 2 Days/week"
              , "3 to 4 Days/week"
              , "5 to 6 Days/week"
              , "Every Day"
              )
  )


nesarc_sub$Ethnicity <-
  factor(nesarc_sub$Ethnicity
    # shorter labels are helpful for plots
  , labels = c( "Cauc"
              , "AfAm"
              , "NaAm"
              , "Asia"
              , "Hisp"
              )
  # , labels = c("Caucasian"
  #            , "African American"
  #            , "Native American"
  #            , "Asian"
  #            , "Hispanic"
  #             )
)

nesarc_sub$SmokingFreq3 <-
  factor(nesarc_sub$SmokingFreq3
  , levels = c( 1
              , 2
              , 3
              , 4
              )
  , labels = c( "Daily"
              , "Weekly"
              , "Monthly"
              , "Never"
              )
  )

summary(nesarc_sub)
       ID            Sex             Age      
 1      :    1   Female:24575   Min.   :18.0  
 2      :    1   Male  :18518   1st Qu.:32.0  
 3      :    1                  Median :44.0  
 4      :    1                  Mean   :46.4  
 5      :    1                  3rd Qu.:59.0  
 6      :    1                  Max.   :98.0  
 (Other):43087                                
                   SmokingStatus                TobaccoDependence
 Smoked past 12 months    : 9913   Nicotine Dependence   : 4962  
 Smoked prior to 12 months: 8078   No Nicotine Dependence:38131  
 Never smoked             :25080                                 
 NA's                     :   22                                 
                                                                 
                                                                 
                                                                 
               SmokingFreq    DailyCigsSmoked  Ethnicity   
 Never               :25080   Min.   : 0.000   Cauc:24507  
 Every Day           :14836   1st Qu.: 0.000   AfAm: 8245  
 Once a month or less:  772   Median : 0.000   NaAm:  701  
 1 to 2 Days/week    :  747   Mean   : 6.462   Asia: 1332  
 3 to 4 Days/week    :  687   3rd Qu.:10.000   Hisp: 8308  
 (Other)             :  869   Max.   :98.000               
 NA's                :  102   NA's   :262                  
          Depression    IncomePersonal      Height_ft      Height_in     
 No Depression :35254   Min.   :      0   Min.   :4.00   Min.   : 0.000  
 Yes Depression: 7839   1st Qu.:   8800   1st Qu.:5.00   1st Qu.: 3.000  
                        Median :  20000   Median :5.00   Median : 5.000  
                        Mean   :  28185   Mean   :5.11   Mean   : 5.279  
                        3rd Qu.:  36000   3rd Qu.:5.00   3rd Qu.: 8.000  
                        Max.   :3000000   Max.   :7.00   Max.   :11.000  
                                          NA's   :730    NA's   :761     
   Weight_lb       DaysSmoke     Height_inches  TotalCigsSmoked 
 Min.   : 62.0   Min.   : 0.00   Min.   :48.0   Min.   :   0.0  
 1st Qu.:140.0   1st Qu.: 0.00   1st Qu.:64.0   1st Qu.:   0.0  
 Median :165.0   Median : 0.00   Median :66.0   Median :   0.0  
 Mean   :170.6   Mean   :10.96   Mean   :66.6   Mean   : 187.5  
 3rd Qu.:192.0   3rd Qu.:30.00   3rd Qu.:70.0   3rd Qu.: 300.0  
 Max.   :500.0   Max.   :30.00   Max.   :84.0   Max.   :2940.0  
 NA's   :1376    NA's   :102     NA's   :761    NA's   :297     
 TotalCigsSmokedLog2 TotalCigsSmokedSqrt CigsSmokedFac  SmokingFreq3  
 Min.   : 0.000      Min.   : 0.000      Q1  :25080    Daily  :15983  
 1st Qu.: 0.000      1st Qu.: 0.000      Q2  :    0    Weekly : 1156  
 Median : 0.000      Median : 0.000      Q3  : 8630    Monthly:  772  
 Mean   : 3.284      Mean   : 7.887      Q4  : 9086    Never  :25080  
 3rd Qu.: 8.234      3rd Qu.:17.321      NA's:  297    NA's   :  102  
 Max.   :11.522      Max.   :54.222                                   
 NA's   :297         NA's   :297                                      

3.1.8 Data subset rows to young smokers

In practice, I would probably subset the data much earlier. In this example I waited until the end to subset so that I could illustrate all of the other steps with complete values.

In our example, we are interested in the subset of people 25 or younger who smoked in the last 12 months and this is created using the filter() function. Note that SmokingStatus == "Smoked past 12 months" is used to choose people who smoked in the past 12 months.

Notice, the missing values have been removed (since NAs are not less than or equal to numbers).

# the subset command below will not include a row that evalutes to NA for the
# specified subset.  For example:
  (NA == 1)
[1] NA
# age and smoking subset
nesarc_sub <-
  nesarc_sub %>%
  filter(
    Age <= 25                                 # people 25 or younger
  , SmokingStatus == "Smoked past 12 months"  # smoked in the past 12 months
  )
dim(nesarc_sub)
[1] 1706   20
summary(nesarc_sub)
       ID           Sex           Age       
 21     :   1   Female:850   Min.   :18.00  
 77     :   1   Male  :856   1st Qu.:20.00  
 103    :   1                Median :22.00  
 122    :   1                Mean   :21.61  
 136    :   1                3rd Qu.:23.00  
 150    :   1                Max.   :25.00  
 (Other):1700                               
                   SmokingStatus               TobaccoDependence
 Smoked past 12 months    :1706   Nicotine Dependence   :896    
 Smoked prior to 12 months:   0   No Nicotine Dependence:810    
 Never smoked             :   0                                 
                                                                
                                                                
                                                                
                                                                
               SmokingFreq   DailyCigsSmoked Ethnicity  
 Every Day           :1320   Min.   : 1.00   Cauc:1060  
 3 to 4 Days/week    :  91   1st Qu.: 5.00   AfAm: 211  
 1 to 2 Days/week    :  88   Median :10.00   NaAm:  42  
 Once a month or less:  71   Mean   :11.34   Asia:  58  
 5 to 6 Days/week    :  68   3rd Qu.:20.00   Hisp: 335  
 (Other)             :  65   Max.   :98.00              
 NA's                :   3   NA's   :9                  
          Depression   IncomePersonal     Height_ft       Height_in     
 No Depression :1260   Min.   :     0   Min.   :4.000   Min.   : 0.000  
 Yes Depression: 446   1st Qu.:  4000   1st Qu.:5.000   1st Qu.: 2.000  
                       Median : 11000   Median :5.000   Median : 5.000  
                       Mean   : 13298   Mean   :5.196   Mean   : 5.425  
                       3rd Qu.: 20000   3rd Qu.:5.000   3rd Qu.: 8.000  
                       Max.   :130000   Max.   :6.000   Max.   :11.000  
                                        NA's   :14      NA's   :14      
   Weight_lb       DaysSmoke     Height_inches   TotalCigsSmoked 
 Min.   : 85.0   Min.   : 1.00   Min.   :58.00   Min.   :   1.0  
 1st Qu.:135.5   1st Qu.:30.00   1st Qu.:65.00   1st Qu.:  90.0  
 Median :160.0   Median :30.00   Median :68.00   Median : 300.0  
 Mean   :165.8   Mean   :25.33   Mean   :67.78   Mean   : 320.5  
 3rd Qu.:185.0   3rd Qu.:30.00   3rd Qu.:71.00   3rd Qu.: 600.0  
 Max.   :390.0   Max.   :30.00   Max.   :81.00   Max.   :2940.0  
 NA's   :27      NA's   :3       NA's   :14      NA's   :9       
 TotalCigsSmokedLog2 TotalCigsSmokedSqrt CigsSmokedFac  SmokingFreq3 
 Min.   : 1.000      Min.   : 1.000      Q1  :   0     Daily  :1479  
 1st Qu.: 6.508      1st Qu.: 9.487      Q2  :   0     Weekly : 153  
 Median : 8.234      Median :17.321      Q3  :1107     Monthly:  71  
 Mean   : 7.448      Mean   :15.987      Q4  : 590     Never  :   0  
 3rd Qu.: 9.231      3rd Qu.:24.495      NA's:   9     NA's   :   3  
 Max.   :11.522      Max.   :54.222                                  
 NA's   :9           NA's   :9                                       

3.1.9 EXTRA: Working with Dates

3.1.9.1 NESARC age example

Dates can be tricky in R. The lubridate package makes them much easier.

Below I’m going to create a variable for the person’s age in years based on the difference between the person’s date of birth and the interview date. There is an AGE variable which I’m already using, but if there wasn’t, or I wanted to be more precise about their age (such as, 24.34 years old), then this is a way to do it.

I’m going to create a separate data frame just for this example to show how to do it; if you needed these variables, they should be part of your original subset above. Below, the “C” variables (CDAY) are the interview day, month, and year, and the “DOB” variables (DOBD) are the date of birth day, month, and year.

dat_sub <-
  NESARC %>%
  select(
    CDAY
  , CMON
  , CYEAR
  , DOBD
  , DOBM
  , DOBY
  )
str(dat_sub)
'data.frame':   43093 obs. of  6 variables:
 $ CDAY : num  14 12 23 9 18 7 1 18 9 16 ...
 $ CMON : num  8 1 11 9 10 9 9 9 1 8 ...
 $ CYEAR: num  2001 2002 2001 2001 2001 ...
 $ DOBD : num  27 15 28 27 11 28 16 10 7 25 ...
 $ DOBM : num  9 12 12 6 9 12 3 11 4 3 ...
 $ DOBY : num  1977 1973 1919 1983 1965 ...

Combine the date fields together, then interpret them into a date object. I’ll do this for both the interview date and the date of birth. I print the head (first 6 observations) after each operation.

## interview date
# combine day, month, and year into one text string
cdate_text <- paste(dat_sub$CDAY, dat_sub$CMON, dat_sub$CYEAR)
head(cdate_text)
[1] "14 8 2001"  "12 1 2002"  "23 11 2001" "9 9 2001"   "18 10 2001"
[6] "7 9 2001"  
# create date object by interpretting the day, month, year with dmy()
cdate_date <- dmy(cdate_text)
head(cdate_date)
[1] "2001-08-14" "2002-01-12" "2001-11-23" "2001-09-09" "2001-10-18"
[6] "2001-09-07"
## date of birth
# combine day, month, and year into one text string
dob_text <- paste(dat_sub$DOBD, dat_sub$DOBM, dat_sub$DOBY)
head(dob_text)
[1] "27 9 1977"  "15 12 1973" "28 12 1919" "27 6 1983"  "11 9 1965" 
[6] "28 12 1966"
# create date object by interpretting the day, month, year with dmy()
  # if any failed to parse, it's because a day, month, or year is out of range
  #    such as "99" for missing values or refused to answer
  # in these cases, the date is NA (missing), which is what we want
dob_date <- dmy(dob_text)
head(dob_date)
[1] "1977-09-27" "1973-12-15" "1919-12-28" "1983-06-27" "1965-09-11"
[6] "1966-12-28"

Now that we have the interview date and date of birth as date objects, we calculate the difference to give the person’s age at the time of the interview. Because the difference of dates is in the unit “days”, we convert to years.

# difference in days / 365.25 = difference in years
age_days <- cdate_date - dob_date
head(age_days)
Time differences in days
[1]  8722 10255 29916  6649 13186 12672
age_years <- as.numeric(age_days / 365.25)  # change from "difftime" to "numeric"
head(age_years)
[1] 23.87953 28.07666 81.90554 18.20397 36.10130 34.69405
# difference between the age we calculated and the age in the original dataset
head(age_years - NESARC$AGE)
[1] 0.87953457 0.07665982 0.90554415 0.20396988 0.10130048 0.69404517

Add the new age variable to your data frame, drop the ones you don’t need anymore, and you’re ready to go!

# add the variable
dat_sub$age_years <- age_years
str(dat_sub)
'data.frame':   43093 obs. of  7 variables:
 $ CDAY     : num  14 12 23 9 18 7 1 18 9 16 ...
 $ CMON     : num  8 1 11 9 10 9 9 9 1 8 ...
 $ CYEAR    : num  2001 2002 2001 2001 2001 ...
 $ DOBD     : num  27 15 28 27 11 28 16 10 7 25 ...
 $ DOBM     : num  9 12 12 6 9 12 3 11 4 3 ...
 $ DOBY     : num  1977 1973 1919 1983 1965 ...
 $ age_years: num  23.9 28.1 81.9 18.2 36.1 ...
# drop the original ones you don't need anymore
#   by using subset() and select = -VAR, it excludes the VAR column
dat_sub <-
  dat_sub %>%
  select(-c(CDAY, CMON, CYEAR, DOBD, DOBM, DOBY))
str(dat_sub)
'data.frame':   43093 obs. of  1 variable:
 $ age_years: num  23.9 28.1 81.9 18.2 36.1 ...

Now you’ve got an age (or other date object) you can use in your project.

3.1.9.2 AddHealth4 age example

Dates can be tricky in R. The lubridate package makes them much easier.

Below I’m going to create a variable for the person’s age in years based on the difference between the person’s date of birth and the interview date. There is an AGE variable which I’m already using, but if there wasn’t, or I wanted to be more precise about their age (such as, 24.34 years old), then this is a way to do it.

I’m going to create a separate data frame just for this example to show how to do it; if you needed these variables, they should be part of your original subset above. Below, the “i” variables (iday4) are the interview day, month, and year, and the date of birth variables (H4OD1MandH4OD1Y`) are the date of birth month and year (there is no day, so we’ll use day = 15 for the middle of the month).

Combine the date fields together, then interpret them into a date object. I’ll do this for both the interview date and the date of birth. I print the head (first 6 observations) after each operation.

AH14 <-
  AH14 %>%
  mutate(
    idate = paste(iday4, imonth4, iyear4)
  , dob   = paste(15, h4od1m, h4od1y)
  )

head(select(AH14, idate, dob))
      idate        dob
1  NA NA NA   15 NA NA
2  6 5 2008 15 11 1976
3  NA NA NA   15 NA NA
4 22 5 2008  15 1 1976
5  NA NA NA   15 NA NA
6  NA NA NA   15 NA NA
# create date object by interpretting the day, month, year with dmy()
  # if any failed to parse, it's because a day, month, or year is out of range
  #    such as "99" for missing values or refused to answer
  # in these cases, the date is NA (missing), which is what we want
AH14 <-
  AH14 %>%
  mutate(
    idate = dmy(idate)
  , dob   = dmy(dob)
  )

head(select(AH14, idate, dob))
       idate        dob
1       <NA>       <NA>
2 2008-05-06 1976-11-15
3       <NA>       <NA>
4 2008-05-22 1976-01-15
5       <NA>       <NA>
6       <NA>       <NA>

Now that we have the interview date and date of birth as date objects, we calculate the difference to give the person’s age at the time of the interview. Because the difference of dates is in the unit “days”, we convert to years.

# difference in days / 365.25 = difference in years
AH14 <-
  AH14 %>%
  mutate(
    age_days  = idate - dob
  , age_years = as.numeric(age_days / 365.25)  # change from "difftime" to "numeric"
  )

head(select(AH14, idate, dob, age_days, age_years))
       idate        dob   age_days age_years
1       <NA>       <NA>    NA days        NA
2 2008-05-06 1976-11-15 11495 days  31.47159
3       <NA>       <NA>    NA days        NA
4 2008-05-22 1976-01-15 11816 days  32.35044
5       <NA>       <NA>    NA days        NA
6       <NA>       <NA>    NA days        NA
summary(AH14$age_years)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  24.43   27.48   28.92   28.89   30.36   33.94    1390 

Now you’ve got an age (or other date object) you can use in your project.

3.1.9.3 Converting 12-hour to 24-hour time

Let’s use a time from Add Health “Wave IV Section 7: Sleep Patterns” about what time people wake up. There are three fields for hour, minute, and am/pm, and we want to convert these three fields into a single time number.

Now I'm going to ask you about your sleep patterns.

1H. On the days you go to work, school or similar activities,
    what time do you usually wake up?
  H4SP1H num [Hour]   96=refused, 98=don't know
  H4SP1M num [Minute] 96=refused, 98=don't know
  H4SP1T cat [am/pm]   1=am, 2=pm, 6=refused, 8=don't know

There are two steps to address.

  1. First, “PM” adds 12 hours.
  2. Second, 12 PM stays as 12 and 12 AM becomes 00.
AH14 <-
  AH14 %>%
  mutate(
    # name the hour, minute, and am_pm variables
    wake_hour = h4sp1h %>% as.numeric()
  , wake_min  = h4sp1m %>% as.numeric()
  , wake_ampm = h4sp1t %>% as.numeric()
    # code missing values labels as NAs
  , wake_hour = replace(wake_hour, wake_hour %in% c(96, 98), NA)
  , wake_min  = replace(wake_min , wake_min  %in% c(96, 98), NA)
  , wake_ampm = replace(wake_ampm, wake_ampm %in% c( 6,  8), NA)
    # 1=am, 2=pm
  , wake_ampm = case_when(
                  wake_ampm == 1 ~ "AM"
                , wake_ampm == 2 ~ "PM"
                )
    # convert hour to 24-hour time
  , wake_hour24 = case_when(
                    # hour is 1-11 and AM, then hour is 1-11
                    (wake_hour %in% 1:11) & (wake_ampm == "AM") ~ wake_hour
                    # hour is 1-11 and PM, then hour adds 12 for 13-23
                  , (wake_hour %in% 1:11) & (wake_ampm == "PM") ~ wake_hour + 12
                    # hour is 12 and AM, then hour is 0 (midnight)
                  , (wake_hour == 12    ) & (wake_ampm == "AM") ~ 0
                    # hour is 12 and PM, then hour is 12 (noon)
                  , (wake_hour == 12    ) & (wake_ampm == "PM") ~ 12
                )
    # convert minutes from 0-59 to fractions of an hour
  , wake_min_frac = wake_min / 60
    # add the 24-hour hour and the fraction minute to get a value from 0 to 23.99
  , wake_time = wake_hour24 + wake_min_frac
  )

# Here's an example of the new variables
# This looks correct :)
AH14 %>%
  select(wake_hour, wake_min, wake_ampm, wake_time) %>%
  na.omit() %>%
  arrange(wake_time) %>%
  slice(seq(1, n(), length = 10)  )
   wake_hour wake_min wake_ampm wake_time
1         12        0        AM      0.00
2          5        0        AM      5.00
3          5       30        AM      5.50
4          6        0        AM      6.00
5          6       30        AM      6.50
6          6       45        AM      6.75
7          7        0        AM      7.00
8          7       30        AM      7.50
9          9        0        AM      9.00
10        11       30        PM     23.50

Below we see what we expect, that most people wake between 5 and 8 AM.

library(ggplot2)
p <- ggplot(data = AH14, aes(x = wake_time))
p <- p + geom_histogram(boundary = 0, binwidth = 0.5)
p <- p + geom_rug()
p <- p + labs(title = "AddHealth Wave 4 Wake time")
p <- p + theme_bw()
print(p)

3.2 Data is complete

At this point, your dataset should be complete, coded, formatted, and ready for analysis.

3.2.1 Plot entire dataset, show missing values

Plot the data subset to show the extent of missing values.

ggplot_missing <- function(dat) {
  # https://github.com/njtierney/neato/blob/master/R/ggplot_missing.R

  # A function that plots missing data in ggplot. For a more updated version,
  #   see `vis_miss` from visdat - github.com/njtierney/visdat
  # dat is a data frame
  # return a ggplot of the missing data.

  dat2 <-
    dat %>%
    is.na()

  # create a column indicating which rows all have data (no missing)
  NO_MISSING <-
    !(rowSums(!dat2) == ncol(dat2))
  dat2 <-
    cbind(dat2, NO_MISSING) %>%
    reshape2::melt()

  p <- ggplot2::ggplot(data = dat2, aes(x = Var2, y = Var1))
  p <- p +