1 Document overview

This document is organized by Class number. The worksheet assignments are indicated by the Tuesday and Thursday Class numbers.

Consider your readers (graders):

1.1 Global code options

# I set some GLOBAL R chunk options here.
#   (to hide this message add "echo=FALSE" to the code chunk options)
# In particular, see the fig.height and fig.width (in inches)
#   and notes about the cache option.

knitr::opts_chunk$set(comment = NA, message = FALSE, warning = FALSE, width = 100)
knitr::opts_chunk$set(fig.align = "center", fig.height = 4, fig.width = 6)

# Note: The "cache=TRUE" option will save the computations of code chunks
#   so R it doesn't recompute everything every time you recompile.
#   This can save _tons of time_ if you're working on a small section of code
#   at the bottom of the document.
#   Code chunks will be recompiled only if they are edited.
#   The autodep=TRUE will also update dependent code chunks.
#   A folder is created with the cache in it -- if you delete the folder, then
#   all the code chunks will recompute again.
#   ** If things are not working as expected, or I want to freshly compute everything,
#      I delete the *_cache folder.
knitr::opts_chunk$set(cache = FALSE) #, autodep=TRUE)  #$

1.2 Document

1.2.1 Naming

Note: Each class save this file with a new name, updating the last two digits to the class number. Then, you’ll have a record of your progress, as well as which files you turned in for grading.

  • ADA1_ALL_05.Rmd
  • ADA1_ALL_06.Rmd
  • ADA1_ALL_07.Rmd

A version that I prefer is to use a date using Year-Month-Day, YYYYMMDD:

  • ADA1_ALL_20200903.Rmd
  • ADA1_ALL_20200905.Rmd
  • ADA1_ALL_20200910.Rmd

1.2.2 Structure

We will include all of our assignments together in this document to retain the relevant information needed for subsequent assignments since our analysis is cumulative. You will also have an opportunity to revisit previous parts to make changes or improvements, such as updating your codebook, recoding variables, and improving tables and plots. I’ve provided an initial predicted organization of our sections and subsections using the # and ## symbols. A table of contents is automatically generated using the “toc: true” in the yaml and can headings in the table of contents are clickable to jump down to each (sub)section.

1.2.3 Classes not appearing in this document

These class assignments are in their own documents: 09, 11, 13, 14, …


2 Research Questions

2.1 Class 04, Personal Codebook

Rubric

  1. (1 p) Is there a topic of interest?

  2. (2 p) Are the variables relevant to a set of research questions?

  3. (4 p) Are there at least 2 categorical and 2 numerical variables (at least 4 “data” variables)?

    • 1 categorical variable with only 2 levels
    • 1 categorical variable with at least 3 levels
    • 2 numerical variables with many possible unique values
    • More variables are welcome and you’re likely to add to this later in the semester
  4. (3 p) For each variable, is there a variable description, a data type, and coded value descriptions?

  5. Compile this Rmd file to an html, print/save to pdf, and upload to UNM Learn.

2.1.1 Topic and research questions

Topic:

Is there an association between nicotine dependence and the frequency and quantity of smoking in adults? The association may differ by ethnicity, age, gender, and other factors (though we won’t be able to test these additional associations until next semester in ADA2).

Research questions:

  1. Is nicotine dependence [S3AQ10D] associated with smoking frequency [S3AQ3B1] and quantity [S3AQ3C1]?

  2. Is nicotine dependence [S3AQ10D] associated with depression [S4AQ1]?

  3. Is the associated between nicotine dependence [S3AQ10D] and depression [S4AQ1] different by demographics, such as Education or Sex?

3 Codebook

National Epidemiologic Survey on Alcohol and Related Conditions-III (NESARC-III)

Dataset: NESARC
Primary association: nicotine dependence vs frequency and quantity of smoking

Key:
RenamedVarName
  VarName original in dataset
  Variable description
  Data type (Continuous, Discrete, Nominal, Ordinal)
  Frequency ItemValue Description

ID
  IDNUM
  UNIQUE ID NUMBER WITH NO ALPHABETICS
  Nominal
  43093 1-43093. Unique Identification number

Sex
  SEX
  SEX
  Nominal
  18518 1. Male
  24575 2. Female

Age
  AGE
  AGE
  Continuous
  43079 18-97. Age in years
     14 98. 98 years or older

SmokingStatus
  CHECK321
  CIGARETTE SMOKING STATUS
  Nominal
   9913 1. Smoked cigarettes in the past 12 months
   8078 2. Smoked cigarettes prior to the last 12 months
     22 9. Unknown
  25080 BL. NA, never or unknown if ever smoked 100+ cigarettes

NicotineDependence
  TAB12MDX
  NICOTINE DEPENDENCE IN THE LAST 12 MONTHS
  Nominal
  38131 0. No nicotine dependence
   4962 1. Nicotine dependence

SmokingFreq
  S3AQ3B1
  USUAL FREQUENCY WHEN SMOKED CIGARETTES
  Ordinal
  14836 1. Every day
    460 2. 5 to 6 Day(s) a week
    687 3. 3 to 4 Day(s) a week
    747 4. 1 to 2 Day(s) a week
    409 5. 2 to 3 Day(s) a month
    772 6. Once a month or less
    102 9. Unknown
  25080 BL. NA, never or unknown if ever smoked 100+ cigarettes

DailyCigsSmoked
  S3AQ3C1
  USUAL QUANTITY WHEN SMOKED CIGARETTES
  Discrete
  17751 1-98. Cigarette(s)
    262 99. Unknown
  25080 BL. NA, never or unknown if ever smoked 100+ cigarettes

Ethnicity
  ETHRACE2A
  IMPUTED RACE/ETHNICITY
  Nominal
  24507 1. White, Not Hispanic or Latino
   8245 2. Black, Not Hispanic or Latino
    701 3. American Indian/Alaska Native, Not Hispanic or Latino
   1332 4. Asian/Native Hawaiian/Pacific Islander, Not Hispanic or Latino
   8308 5. Hispanic or Latino

Depression
  MAJORDEPLIFE
  MAJOR DEPRESSION - LIFETIME (NON-HIERARCHICAL)
  Nominal
  35254 0. No
   7839 1. Yes

IncomePersonal
  S1Q10A
  TOTAL PERSONAL INCOME IN LAST 12 MONTHS
  Continuous
  43093 0-3000000. Income in dollars

Height_ft
  S1Q24FT
  HEIGHT: FEET
  42363 4-7. Feet
  730 99. Unknown
          * change 99 to NA

Height_in
  S1Q24IN
  HEIGHT: INCHES
  Continuous
  3572 0. None
  38760 1-11. Inches
  761 99. Unknown
          * change 99 to NA

Weight_lb
  S1Q24LB
  WEIGHT: POUNDS
  Continuous
  41717 62-500. Pounds
  1376 999. Unknown
          * change 999 to NA

Additional variables were created from the original variables:

CREATED VARIABLES

DaysSmoke
  estimated number of days per month a subject smokes
  Continuous (due to way estimated), assumes 30 days per month using SmokingFreq)
  1-30.

TotalCigsSmoked
  estimated number of cigarettes per month a subject smokes (DaysSmoke * DailyCigsSmoked)
  Continuous
  0-large.

CigsSmokedFac
  quartiles of TotalCigsSmoked
  Ordinal
  ranges for each of the four quarters

SmokingFreq3
  three levels of smoking frequency
  Ordinal
    from SmokingFreq
    1. daily   = 1. Every day, 2. 5 to 6 Day(s) a week, 3. 3 to 4 Day(s) a week
    2. weekly  = 4. 1 to 2 Day(s) a week, 5. 2 to 3 Day(s) a month
    3. monthly = 6. Once a month or less
    4. Never   = 7. Never

Height_inches
  Total height in inches
  Height_ft * 12 + Height_in

4 Data Management

4.1 Class 05, Data subset and numerical summaries

Rubric

  1. (4 p) The data are loaded and a data.frame subset is created by selecting only the variables in the personal codebook.

    • Scroll down to sections labeled “(Class 05)”.
  2. (1 p) Output confirms the subset is correct (e.g., using dim() and str()).

  3. (3 p) Rename your variables to descriptive names (e.g., from “S3AQ3B1” to “SmokingFreq”).

    • Scroll down to sections labeled “(Class 05)”.
  4. (2 p) Provide numerical summaries for all variables (e.g., using summary()).

    • Scroll down to sections labeled “(Class 05)”.

4.1.1 Data subset (Class 05)

First, the data is placed on the search path.

# data analysis packages
library(tidyverse)  # Data manipulation and visualization suite
library(lubridate)  # Dates

source("ada_functions.R")

  ## 1. Download the ".RData" file for your dataset into your ADA Folder.
  ## 2. Use the load() statement for the dataset you want to use.
# read data example
load("NESARC.RData")

dim(NESARC)
[1] 43093  3008
# variables to include in our data subset
nesarc_sub <-
  NESARC %>%
  dplyr::select(
    IDNUM
  , SEX
  , AGE
  , CHECK321
  , TAB12MDX
  , S3AQ3B1
  , S3AQ3C1
  , ETHRACE2A
  , MAJORDEPLIFE
  , S1Q10A
  , S1Q24FT
  , S1Q24IN
  , S1Q24LB
  )

# Check size and structure of data subset.
# Note that categorical variables are already `Factor` variables, but the levels do not have meaningful labels, yet.
# size of subset
dim(nesarc_sub)
[1] 43093    13
# structure of subset
str(nesarc_sub)
'data.frame':   43093 obs. of  13 variables:
 $ IDNUM       : Factor w/ 43093 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ SEX         : Factor w/ 2 levels "1","2": 1 2 2 1 1 2 1 2 2 2 ...
 $ AGE         : num  23 28 81 18 36 34 19 84 29 18 ...
 $ CHECK321    : Factor w/ 3 levels "1","2","9": NA NA NA NA NA NA NA NA NA NA ...
 $ TAB12MDX    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ S3AQ3B1     : Factor w/ 7 levels "1","2","3","4",..: NA NA NA NA NA NA NA NA NA NA ...
 $ S3AQ3C1     : num  NA NA NA NA NA NA NA NA NA NA ...
 $ ETHRACE2A   : Factor w/ 5 levels "1","2","3","4",..: 5 5 5 5 2 2 2 1 1 5 ...
 $ MAJORDEPLIFE: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 2 1 ...
 $ S1Q10A      : num  17500 11000 6000 27000 42000 500 2500 24000 65000 4000 ...
 $ S1Q24FT     : num  5 5 5 5 6 5 5 5 5 5 ...
 $ S1Q24IN     : num  7 1 4 7 1 5 9 4 4 10 ...
 $ S1Q24LB     : num  195 127 185 180 220 140 160 120 140 150 ...

4.1.2 Renaming Variables (Class 05)

nesarc_sub <-
  nesarc_sub %>%
  dplyr::rename(
    ID                = IDNUM
  , Sex               = SEX
  , Age               = AGE
  , SmokingStatus     = CHECK321
  , NicotineDependence = TAB12MDX
  , SmokingFreq       = S3AQ3B1
  , DailyCigsSmoked   = S3AQ3C1
  , Ethnicity         = ETHRACE2A
  , Depression        = MAJORDEPLIFE
  , IncomePersonal    = S1Q10A
  , Height_ft         = S1Q24FT
  , Height_in         = S1Q24IN
  , Weight_lb         = S1Q24LB
  )

str(nesarc_sub)
'data.frame':   43093 obs. of  13 variables:
 $ ID                : Factor w/ 43093 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ Sex               : Factor w/ 2 levels "1","2": 1 2 2 1 1 2 1 2 2 2 ...
 $ Age               : num  23 28 81 18 36 34 19 84 29 18 ...
 $ SmokingStatus     : Factor w/ 3 levels "1","2","9": NA NA NA NA NA NA NA NA NA NA ...
 $ NicotineDependence: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ SmokingFreq       : Factor w/ 7 levels "1","2","3","4",..: NA NA NA NA NA NA NA NA NA NA ...
 $ DailyCigsSmoked   : num  NA NA NA NA NA NA NA NA NA NA ...
 $ Ethnicity         : Factor w/ 5 levels "1","2","3","4",..: 5 5 5 5 2 2 2 1 1 5 ...
 $ Depression        : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 2 1 ...
 $ IncomePersonal    : num  17500 11000 6000 27000 42000 500 2500 24000 65000 4000 ...
 $ Height_ft         : num  5 5 5 5 6 5 5 5 5 5 ...
 $ Height_in         : num  7 1 4 7 1 5 9 4 4 10 ...
 $ Weight_lb         : num  195 127 185 180 220 140 160 120 140 150 ...

4.1.3 Coding missing values (Class 06)

There are two steps. The first step is to recode any existing NAs to actual values, if necessary. The method for doing this differs for numeric and categorical variables. The second step is to recode any coded missing values, such as 9s or 99s, as actual NA.

4.1.3.1 Coding NAs as meaningful “missing”

First step: the existing blank values with NA mean “never”, and “never” has a meaning different from “missing”. For each variable we need to decide what “never” means and code it appropriately.

4.1.3.1.1 NAs recoded as numeric

The easiest is DailyCigsSmoked where NA means 0 cigarettes. We replace NA with 0. The function is.na() identifies each observation that “is an NA”. I’ve updated the codebook to indicate that the original NA values were changed.

S3AQ3C1 (DailyCigsSmoked)
  USUAL QUANTITY WHEN SMOKED CIGARETTES
  Discrete
  17751 1-98. Cigarette(s)
    262 99. Unknown
  25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
          * change to 0 Cigarett(s)
table(nesarc_sub$DailyCigsSmoked)

   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20   21   22   23   24   25   27   28   29   30   33   34   35   37   39   40 
 934  884  923  573 1070  463  269  299   49 3077   23  230   34   25  851   40   22   59    5 5366    1   10    2    7  155    2    3    3  909    1    1   30    2    1  993 
  45   50   55   57   60   66   70   75   80   98   99 
   8  106    2    1  241    1   12    2   47   15  262 
nesarc_sub <-
  nesarc_sub %>%
  replace_na(
  # replace NA with new numeric value
    list(
      DailyCigsSmoked = 0
    )
  )

table(nesarc_sub$DailyCigsSmoked)

    0     1     2     3     4     5     6     7     8     9    10    11    12    13    14    15    16    17    18    19    20    21    22    23    24    25    27    28    29 
25080   934   884   923   573  1070   463   269   299    49  3077    23   230    34    25   851    40    22    59     5  5366     1    10     2     7   155     2     3     3 
   30    33    34    35    37    39    40    45    50    55    57    60    66    70    75    80    98    99 
  909     1     1    30     2     1   993     8   106     2     1   241     1    12     2    47    15   262 
4.1.3.1.2 NAs recoded as categorical

The next two variables require defining a new category level. Both SmokingStatus and SmokingFreq with a value of NA means “never”, so we need to create a new “Never” category and put in order with the rest of the levels. I’ve updated the codebook to indicate that the original NA values were changed.

CHECK321 (SmokingStatus)
  CIGARETTE SMOKING STATUS
  Nominal
   9913 1. Smoked cigarettes in the past 12 months
   8078 2. Smoked cigarettes prior to the last 12 months
     22 9. Unknown
  25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
          * change to 3. Never

S3AQ3B1 (SmokingFreq)
  USUAL FREQUENCY WHEN SMOKED CIGARETTES
  Ordinal
  14836 1. Every day
    460 2. 5 to 6 Day(s) a week
    687 3. 3 to 4 Day(s) a week
    747 4. 1 to 2 Day(s) a week
    409 5. 2 to 3 Day(s) a month
    772 6. Once a month or less
    102 9. Unknown
  25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
          * change to 7. Never
table(nesarc_sub$SmokingStatus  )

   1    2    9 
9913 8078   22 
table(nesarc_sub$SmokingFreq    )

    1     2     3     4     5     6     9 
14836   460   687   747   409   772   102 
nesarc_sub <-
  nesarc_sub %>%
  mutate(
  # first change the "factor" variables to "character" to add a new level
    SmokingStatus   = as.character(SmokingStatus)
  , SmokingFreq     = as.character(SmokingFreq  )
  ) %>%
  replace_na(
  # replace NA with new category values
    list(
      SmokingStatus = "3"
    , SmokingFreq   = "7"
    )
  )

table(nesarc_sub$SmokingStatus  )

    1     2     3     9 
 9913  8078 25080    22 
table(nesarc_sub$SmokingFreq    )

    1     2     3     4     5     6     7     9 
14836   460   687   747   409   772 25080   102 

4.1.3.2 Coding 9s and 99s as NAs

Second step: replace the coded missing values with NA, then remove the remaining records with missing values. Not all of these variables need this operation, since the previous subset commands removed missing values by chance. However, I’m going to run each as an extra precaution. Afterall, later, if we were to make different previous data decisions, it’s possible we could end up with missing values at this point that are incorrectly coded without this step!

table(nesarc_sub$SmokingStatus  )

    1     2     3     9 
 9913  8078 25080    22 
table(nesarc_sub$SmokingFreq    )

    1     2     3     4     5     6     7     9 
14836   460   687   747   409   772 25080   102 
table(nesarc_sub$DailyCigsSmoked)

    0     1     2     3     4     5     6     7     8     9    10    11    12    13    14    15    16    17    18    19    20    21    22    23    24    25    27    28    29 
25080   934   884   923   573  1070   463   269   299    49  3077    23   230    34    25   851    40    22    59     5  5366     1    10     2     7   155     2     3     3 
   30    33    34    35    37    39    40    45    50    55    57    60    66    70    75    80    98    99 
  909     1     1    30     2     1   993     8   106     2     1   241     1    12     2    47    15   262 
nesarc_sub <-
  nesarc_sub %>%
  mutate(
  # replace missing values
    SmokingStatus   = replace(SmokingStatus  , SmokingStatus   %in% c("9"), NA)
  , SmokingFreq     = replace(SmokingFreq    , SmokingFreq     %in% c("9"), NA)
  , DailyCigsSmoked = replace(DailyCigsSmoked, DailyCigsSmoked %in% c( 99), NA)
  , Height_ft       = replace(Height_ft      , Height_ft       %in% c( 99), NA)
  , Height_in       = replace(Height_in      , Height_in       %in% c( 99), NA)
  , Weight_lb       = replace(Weight_lb      , Weight_lb       %in% c(999), NA)
  # drop unused factor levels
  , SmokingStatus   = fct_drop(SmokingStatus)
  , SmokingFreq     = fct_drop(SmokingFreq  )
  )

table(nesarc_sub$SmokingStatus  )

    1     2     3 
 9913  8078 25080 
table(nesarc_sub$SmokingFreq    )

    1     2     3     4     5     6     7 
14836   460   687   747   409   772 25080 
table(nesarc_sub$DailyCigsSmoked)

    0     1     2     3     4     5     6     7     8     9    10    11    12    13    14    15    16    17    18    19    20    21    22    23    24    25    27    28    29 
25080   934   884   923   573  1070   463   269   299    49  3077    23   230    34    25   851    40    22    59     5  5366     1    10     2     7   155     2     3     3 
   30    33    34    35    37    39    40    45    50    55    57    60    66    70    75    80    98 
  909     1     1    30     2     1   993     8   106     2     1   241     1    12     2    47    15 
summary(nesarc_sub)
       ID        Sex            Age       SmokingStatus NicotineDependence  SmokingFreq    DailyCigsSmoked  Ethnicity Depression IncomePersonal      Height_ft   
 1      :    1   1:18518   Min.   :18.0   1   : 9913    0:38131            7      :25080   Min.   : 0.000   1:24507   0:35254    Min.   :      0   Min.   :4.00  
 2      :    1   2:24575   1st Qu.:32.0   2   : 8078    1: 4962            1      :14836   1st Qu.: 0.000   2: 8245   1: 7839    1st Qu.:   8800   1st Qu.:5.00  
 3      :    1             Median :44.0   3   :25080                       6      :  772   Median : 0.000   3:  701              Median :  20000   Median :5.00  
 4      :    1             Mean   :46.4   NA's:   22                       4      :  747   Mean   : 6.462   4: 1332              Mean   :  28185   Mean   :5.11  
 5      :    1             3rd Qu.:59.0                                    3      :  687   3rd Qu.:10.000   5: 8308              3rd Qu.:  36000   3rd Qu.:5.00  
 6      :    1             Max.   :98.0                                    (Other):  869   Max.   :98.000                        Max.   :3000000   Max.   :7.00  
 (Other):43087                                                             NA's   :  102   NA's   :262                                             NA's   :730   
   Height_in        Weight_lb    
 Min.   : 0.000   Min.   : 62.0  
 1st Qu.: 3.000   1st Qu.:140.0  
 Median : 5.000   Median :165.0  
 Mean   : 5.279   Mean   :170.6  
 3rd Qu.: 8.000   3rd Qu.:192.0  
 Max.   :11.000   Max.   :500.0  
 NA's   :761      NA's   :1376   

4.1.3.3 Removing rows with missing values (don’t do this)

Finally, consider using na.omit() to remove any records with missing values if it won’t cause issues with analysis. The command na.omit() removes a row if any columns have an NA in that row. This can have drastic consequences. For example, imagine that you have 7 variables and 6 variables have complete data, but the 7th variable is about three-quarters NAs! Then na.omit() will drop three-quarters of your data, even though most of your variables are complete!

There’s only one advantage I can think of for running na.omit(), and that’s if you have a large dataset and only a few observations have a few NAs. You can’t analyze data with NAs, since any calculation involving NA results in an NA. Therefore imagine two scenarios: (1) An analysis with Variables 1 and 2, each with a few NA distributed throughout; first the observations with NA for either Variable 1 or 2 will be removed, then the analysis will proceed. (2) A similar analysis, but now with Variables 1 and 3, but Variable 3 has NA for different observations than Variable 2 — therefore, different observations will be removed before analysis. In summary, these two analyses will include different observations from the dataset. Therefore, by running na.omit() you will start with a dataset with no NAs, thus every analysis will be of exactly the same set of data (at the cost of not using all the possible data for each individual analysis).

Below, I do not run na.omit() since I’m concerned about having lots of NAs for some variables.

dim(nesarc_sub)
[1] 43093    13
# remove rows with missing values
#nesarc_sub <- na.omit(nesarc_sub)
dim(nesarc_sub)
[1] 43093    13
summary(nesarc_sub)
       ID        Sex            Age       SmokingStatus NicotineDependence  SmokingFreq    DailyCigsSmoked  Ethnicity Depression IncomePersonal      Height_ft   
 1      :    1   1:18518   Min.   :18.0   1   : 9913    0:38131            7      :25080   Min.   : 0.000   1:24507   0:35254    Min.   :      0   Min.   :4.00  
 2      :    1   2:24575   1st Qu.:32.0   2   : 8078    1: 4962            1      :14836   1st Qu.: 0.000   2: 8245   1: 7839    1st Qu.:   8800   1st Qu.:5.00  
 3      :    1             Median :44.0   3   :25080                       6      :  772   Median : 0.000   3:  701              Median :  20000   Median :5.00  
 4      :    1             Mean   :46.4   NA's:   22                       4      :  747   Mean   : 6.462   4: 1332              Mean   :  28185   Mean   :5.11  
 5      :    1             3rd Qu.:59.0                                    3      :  687   3rd Qu.:10.000   5: 8308              3rd Qu.:  36000   3rd Qu.:5.00  
 6      :    1             Max.   :98.0                                    (Other):  869   Max.   :98.000                        Max.   :3000000   Max.   :7.00  
 (Other):43087                                                             NA's   :  102   NA's   :262                                             NA's   :730   
   Height_in        Weight_lb    
 Min.   : 0.000   Min.   : 62.0  
 1st Qu.: 3.000   1st Qu.:140.0  
 Median : 5.000   Median :165.0  
 Mean   : 5.279   Mean   :170.6  
 3rd Qu.: 8.000   3rd Qu.:192.0  
 Max.   :11.000   Max.   :500.0  
 NA's   :761      NA's   :1376   

Now we (would) have a dataset with no missing values.

4.1.4 Creating new variables (Class 06+)

4.1.4.1 From categories to numeric

DaysSmoke estimates the days per month a subject smokes by converting SmokingFreq (a factor with 6 levels) to a numeric variable using as.numeric() and multiplying by the midpoint of the range of SmokingFreq.

nesarc_sub <-
  nesarc_sub %>%
  mutate(
    DaysSmoke =
      case_when(
        SmokingFreq == 1 ~ 30      # 1. Every day
      , SmokingFreq == 2 ~ 4 * 5.5 # 2. 5 to 6 Day(s) a week
      , SmokingFreq == 3 ~ 4 * 3.5 # 3. 3 to 4 Day(s) a week
      , SmokingFreq == 4 ~ 4 * 1.5 # 4. 1 to 2 Day(s) a week
      , SmokingFreq == 5 ~ 2.5     # 5. 2 to 3 Day(s) a month
      , SmokingFreq == 6 ~ 1       # 6. Once a month or less
      , SmokingFreq == 7 ~ 0       # 7. Never
      )
  )

summary(nesarc_sub$DaysSmoke)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   0.00    0.00    0.00   10.96   30.00   30.00     102 

4.1.4.2 From numeric to numeric

Height_inches is the height of a person in inches. Since the data was collected in two variables (Height_ft and Height_in) we need to add them together for our new variable.

nesarc_sub <-
  nesarc_sub %>%
  mutate(
    Height_inches = Height_ft * 12 + Height_in
  )

summary(nesarc_sub$Height_inches)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   48.0    64.0    66.0    66.6    70.0    84.0     761 

The variable TotalCigsSmoked estimates the monthly number of cigarettes a subject smokes per month by multiplying DaysSmoke times DailyCigsSmoked (the usual quantity smoked per day).

nesarc_sub <-
  nesarc_sub %>%
  mutate(
    DaysSmoke_log2       = log2(DaysSmoke)
  , TotalCigsSmoked      = DaysSmoke * DailyCigsSmoked
  , TotalCigsSmoked_log2 = ifelse(TotalCigsSmoked > 0, log2(TotalCigsSmoked), NA)
  )

summary(nesarc_sub$TotalCigsSmoked)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
    0.0     0.0     0.0   187.5   300.0  2940.0     297 

4.1.4.3 From numeric to categories based on quantiles

nesarc_sub <-
  nesarc_sub %>%
  mutate(
    CigsSmokedFac  =
      case_when(
        TotalCigsSmoked == 0                              ~ "No smoking (0)"
      , (TotalCigsSmoked >  0) & (TotalCigsSmoked <=   6) ~ "Lowest smoking (1-6)"
      , (TotalCigsSmoked >  6) & (TotalCigsSmoked <=  30) ~ "Low smoking (7-30)"
      , (TotalCigsSmoked > 30) & (TotalCigsSmoked <= 300) ~ "Medium smoking (31-300)"
      , (TotalCigsSmoked > 300)                           ~ "High smoking (301+)"
      )
  , CigsSmokedFac =
      factor(
        CigsSmokedFac
      , levels = c(
            "No smoking (0)"
          , "Lowest smoking (1-6)"
          , "Low smoking (7-30)"
          , "Medium smoking (31-300)"
          , "High smoking (301+)"
          )
      )
  )

summary(nesarc_sub$TotalCigsSmoked)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
    0.0     0.0     0.0   187.5   300.0  2940.0     297 
summary(nesarc_sub$CigsSmokedFac)
         No smoking (0)    Lowest smoking (1-6)      Low smoking (7-30) Medium smoking (31-300)     High smoking (301+)                    NA's 
                  25080                    1045                    1129                    6456                    9086                     297 
table(nesarc_sub$CigsSmokedFac)

         No smoking (0)    Lowest smoking (1-6)      Low smoking (7-30) Medium smoking (31-300)     High smoking (301+) 
                  25080                    1045                    1129                    6456                    9086 

4.1.4.4 From many categories to a few

The variable SmokingFreq3 collapses the SmokingFreq from 6 down to 3 categories. This is an example to show you how to do this with your own variables. Some of you will have categorical variables that have a dozen or so categories; having so many categories makes interpretation very difficulty — fewer is easier.

## SmokingFreq3 variable
# notice that I'm doing this before I've assigned the labels
#   if you do this after assigning labels (with the factor() command), then you'll want to use "as.numeric()" as below
# inspect original variable
table(nesarc_sub$SmokingFreq)

    1     2     3     4     5     6     7 
14836   460   687   747   409   772 25080 
# Note: it is sometimes easier to use the numeric "levels" of a factor variable, than the character "labels"

# initialize new variable with NAs
nesarc_sub <-
  nesarc_sub %>%
  mutate(
    SmokingFreq3 =
      fct_collapse(
        SmokingFreq
      , "1" = c("1", "2", "3")  # 1. daily
      , "2" = c("4", "5")       # 2. weekly
      , "3" = "6"               # 3. monthly
      , "4" = "7"               # 7. never
      )
  )

# new variable
table(nesarc_sub$SmokingFreq3)

    1     2     3     4 
15983  1156   772 25080 
# we will label the new factor variable below, with the others

4.1.4.5 Working with Dates

How to calculate the duration between two dates.

(You probably won’t need to use this for your project.)

Dates can be tricky in R. The lubridate package makes them much easier.

Below I’m going to create a variable for the person’s age in years based on the difference between the person’s date of birth and the interview date. There is an AGE variable which I’m already using, but if there wasn’t, or I wanted to be more precise about their age (such as, 24.34 years old), then this is a way to do it.

I’m going to create a separate data frame just for this example to show how to do it; if you needed these variables, they should be part of your original subset above. Below, the “C” variables (CDAY) are the interview day, month, and year, and the “DOB” variables (DOBD) are the date of birth day, month, and year.

dat_sub <-
  NESARC %>%
  dplyr::select(
    AGE
  , CDAY
  , CMON
  , CYEAR
  , DOBD
  , DOBM
  , DOBY
  )
str(dat_sub)
'data.frame':   43093 obs. of  7 variables:
 $ AGE  : num  23 28 81 18 36 34 19 84 29 18 ...
 $ CDAY : num  14 12 23 9 18 7 1 18 9 16 ...
 $ CMON : num  8 1 11 9 10 9 9 9 1 8 ...
 $ CYEAR: num  2001 2002 2001 2001 2001 ...
 $ DOBD : num  27 15 28 27 11 28 16 10 7 25 ...
 $ DOBM : num  9 12 12 6 9 12 3 11 4 3 ...
 $ DOBY : num  1977 1973 1919 1983 1965 ...

Combine the date fields together, then convert them into a date object. I’ll do this for both the interview date and the date of birth. I print the head (first 6 observations) after all the operations are complete.

dat_sub <-
  dat_sub %>%
  mutate(
    ## interview date
    # combine day, month, and year into one text string
    cdate_text = paste(dat_sub$CDAY, dat_sub$CMON, dat_sub$CYEAR)
    # create date object by interpretting the day, month, year with dmy()
  , cdate_date = dmy(cdate_text)

    ## date of birth
    # combine day, month, and year into one text string
  , dob_text = paste(dat_sub$DOBD, dat_sub$DOBM, dat_sub$DOBY)
    # create date object by interpretting the day, month, year with dmy()
      # if any failed to parse, it's because a day, month, or year is out of range
      #    such as "99" for missing values or refused to answer
      # in these cases, the date is NA (missing), which is what we want
  , dob_date = dmy(dob_text)
  )
head(dat_sub)
  AGE CDAY CMON CYEAR DOBD DOBM DOBY cdate_text cdate_date   dob_text   dob_date
1  23   14    8  2001   27    9 1977  14 8 2001 2001-08-14  27 9 1977 1977-09-27
2  28   12    1  2002   15   12 1973  12 1 2002 2002-01-12 15 12 1973 1973-12-15
3  81   23   11  2001   28   12 1919 23 11 2001 2001-11-23 28 12 1919 1919-12-28
4  18    9    9  2001   27    6 1983   9 9 2001 2001-09-09  27 6 1983 1983-06-27
5  36   18   10  2001   11    9 1965 18 10 2001 2001-10-18  11 9 1965 1965-09-11
6  34    7    9  2001   28   12 1966   7 9 2001 2001-09-07 28 12 1966 1966-12-28

Now that we have the interview date and date of birth as date objects, we calculate the difference to give the person’s age at the time of the interview. Because the difference of dates is in the unit “days”, we convert to years.

dat_sub <-
  dat_sub %>%
  mutate(
    age_days  = cdate_date - dob_date
  , age_years = as.numeric(age_days / 365.25)  # change from "difftime" to "numeric"
    # difference between the age we calculated and the age in the original dataset
  , age_diff  = age_years - AGE
  )
head(dat_sub)
  AGE CDAY CMON CYEAR DOBD DOBM DOBY cdate_text cdate_date   dob_text   dob_date   age_days age_years   age_diff
1  23   14    8  2001   27    9 1977  14 8 2001 2001-08-14  27 9 1977 1977-09-27  8722 days  23.87953 0.87953457
2  28   12    1  2002   15   12 1973  12 1 2002 2002-01-12 15 12 1973 1973-12-15 10255 days  28.07666 0.07665982
3  81   23   11  2001   28   12 1919 23 11 2001 2001-11-23 28 12 1919 1919-12-28 29916 days  81.90554 0.90554415
4  18    9    9  2001   27    6 1983   9 9 2001 2001-09-09  27 6 1983 1983-06-27  6649 days  18.20397 0.20396988
5  36   18   10  2001   11    9 1965 18 10 2001 2001-10-18  11 9 1965 1965-09-11 13186 days  36.10130 0.10130048
6  34    7    9  2001   28   12 1966   7 9 2001 2001-09-07 28 12 1966 1966-12-28 12672 days  34.69405 0.69404517

Finally, keep new age variable in your data frame, drop (unselect) the variables you don’t need anymore, and you’re ready to go!

# drop the original ones you don't need anymore
#   by select(-VAR), it excludes the VAR column
dat_sub <-
  dat_sub %>%
  select(
    -c(CDAY, CMON, CYEAR, DOBD, DOBM, DOBY)
  # -c(cdate_text, cdate_date, dob_text, dob_date)
  , -c(age_days, age_diff)
  )
str(dat_sub)
'data.frame':   43093 obs. of  6 variables:
 $ AGE       : num  23 28 81 18 36 34 19 84 29 18 ...
 $ cdate_text: chr  "14 8 2001" "12 1 2002" "23 11 2001" "9 9 2001" ...
 $ cdate_date: Date, format: "2001-08-14" "2002-01-12" "2001-11-23" "2001-09-09" ...
 $ dob_text  : chr  "27 9 1977" "15 12 1973" "28 12 1919" "27 6 1983" ...
 $ dob_date  : Date, format: "1977-09-27" "1973-12-15" "1919-12-28" "1983-06-27" ...
 $ age_years : num  23.9 28.1 81.9 18.2 36.1 ...

Now you’ve got an age (or other date object) you can use in your project.

(For this example, I’m going to delete the dat_sub object since it was just for this demonstration.

rm(dat_sub)

4.1.4.6 Review results of new variables

summary(nesarc_sub)
       ID        Sex            Age       SmokingStatus NicotineDependence  SmokingFreq    DailyCigsSmoked  Ethnicity Depression IncomePersonal      Height_ft   
 1      :    1   1:18518   Min.   :18.0   1   : 9913    0:38131            7      :25080   Min.   : 0.000   1:24507   0:35254    Min.   :      0   Min.   :4.00  
 2      :    1   2:24575   1st Qu.:32.0   2   : 8078    1: 4962            1      :14836   1st Qu.: 0.000   2: 8245   1: 7839    1st Qu.:   8800   1st Qu.:5.00  
 3      :    1             Median :44.0   3   :25080                       6      :  772   Median : 0.000   3:  701              Median :  20000   Median :5.00  
 4      :    1             Mean   :46.4   NA's:   22                       4      :  747   Mean   : 6.462   4: 1332              Mean   :  28185   Mean   :5.11  
 5      :    1             3rd Qu.:59.0                                    3      :  687   3rd Qu.:10.000   5: 8308              3rd Qu.:  36000   3rd Qu.:5.00  
 6      :    1             Max.   :98.0                                    (Other):  869   Max.   :98.000                        Max.   :3000000   Max.   :7.00  
 (Other):43087                                                             NA's   :  102   NA's   :262                                             NA's   :730   
   Height_in        Weight_lb       DaysSmoke     Height_inches  DaysSmoke_log2  TotalCigsSmoked  TotalCigsSmoked_log2                 CigsSmokedFac   SmokingFreq3
 Min.   : 0.000   Min.   : 62.0   Min.   : 0.00   Min.   :48.0   Min.   : -Inf   Min.   :   0.0   Min.   : 0.000       No smoking (0)         :25080   1   :15983  
 1st Qu.: 3.000   1st Qu.:140.0   1st Qu.: 0.00   1st Qu.:64.0   1st Qu.: -Inf   1st Qu.:   0.0   1st Qu.: 7.229       Lowest smoking (1-6)   : 1045   2   : 1156  
 Median : 5.000   Median :165.0   Median : 0.00   Median :66.0   Median : -Inf   Median :   0.0   Median : 8.492       Low smoking (7-30)     : 1129   3   :  772  
 Mean   : 5.279   Mean   :170.6   Mean   :10.96   Mean   :66.6   Mean   : -Inf   Mean   : 187.5   Mean   : 7.888       Medium smoking (31-300): 6456   4   :25080  
 3rd Qu.: 8.000   3rd Qu.:192.0   3rd Qu.:30.00   3rd Qu.:70.0   3rd Qu.:4.907   3rd Qu.: 300.0   3rd Qu.: 9.229       High smoking (301+)    : 9086   NA's:  102  
 Max.   :11.000   Max.   :500.0   Max.   :30.00   Max.   :84.0   Max.   :4.907   Max.   :2940.0   Max.   :11.522       NA's                   :  297               
 NA's   :761      NA's   :1376    NA's   :102     NA's   :761    NA's   :102     NA's   :297      NA's   :25377                                                    

4.1.5 Labeling Categorical variable levels (Class 06)

Look at summary of dataset to see the state of the labeling before we make changes.

summary(nesarc_sub)
       ID        Sex            Age       SmokingStatus NicotineDependence  SmokingFreq    DailyCigsSmoked  Ethnicity Depression IncomePersonal      Height_ft   
 1      :    1   1:18518   Min.   :18.0   1   : 9913    0:38131            7      :25080   Min.   : 0.000   1:24507   0:35254    Min.   :      0   Min.   :4.00  
 2      :    1   2:24575   1st Qu.:32.0   2   : 8078    1: 4962            1      :14836   1st Qu.: 0.000   2: 8245   1: 7839    1st Qu.:   8800   1st Qu.:5.00  
 3      :    1             Median :44.0   3   :25080                       6      :  772   Median : 0.000   3:  701              Median :  20000   Median :5.00  
 4      :    1             Mean   :46.4   NA's:   22                       4      :  747   Mean   : 6.462   4: 1332              Mean   :  28185   Mean   :5.11  
 5      :    1             3rd Qu.:59.0                                    3      :  687   3rd Qu.:10.000   5: 8308              3rd Qu.:  36000   3rd Qu.:5.00  
 6      :    1             Max.   :98.0                                    (Other):  869   Max.   :98.000                        Max.   :3000000   Max.   :7.00  
 (Other):43087                                                             NA's   :  102   NA's   :262                                             NA's   :730   
   Height_in        Weight_lb       DaysSmoke     Height_inches  DaysSmoke_log2  TotalCigsSmoked  TotalCigsSmoked_log2                 CigsSmokedFac   SmokingFreq3
 Min.   : 0.000   Min.   : 62.0   Min.   : 0.00   Min.   :48.0   Min.   : -Inf   Min.   :   0.0   Min.   : 0.000       No smoking (0)         :25080   1   :15983  
 1st Qu.: 3.000   1st Qu.:140.0   1st Qu.: 0.00   1st Qu.:64.0   1st Qu.: -Inf   1st Qu.:   0.0   1st Qu.: 7.229       Lowest smoking (1-6)   : 1045   2   : 1156  
 Median : 5.000   Median :165.0   Median : 0.00   Median :66.0   Median : -Inf   Median :   0.0   Median : 8.492       Low smoking (7-30)     : 1129   3   :  772  
 Mean   : 5.279   Mean   :170.6   Mean   :10.96   Mean   :66.6   Mean   : -Inf   Mean   : 187.5   Mean   : 7.888       Medium smoking (31-300): 6456   4   :25080  
 3rd Qu.: 8.000   3rd Qu.:192.0   3rd Qu.:30.00   3rd Qu.:70.0   3rd Qu.:4.907   3rd Qu.: 300.0   3rd Qu.: 9.229       High smoking (301+)    : 9086   NA's:  102  
 Max.   :11.000   Max.   :500.0   Max.   :30.00   Max.   :84.0   Max.   :4.907   Max.   :2940.0   Max.   :11.522       NA's                   :  297               
 NA's   :761      NA's   :1376    NA's   :102     NA's   :761    NA's   :102     NA's   :297      NA's   :25377                                                    

Informative labels are given to the factor levels. The order of the levels is also rearranged for the variables SmokingFreq, NicotineDependence, and Sex.

# Label factors with meaningful names in the same order as the codebook
nesarc_sub$Depression <-
  factor(
    nesarc_sub$Depression
  , labels = c("No Depression"
             , "Yes Depression"
    )
  )

nesarc_sub$SmokingStatus <-
  factor(
    nesarc_sub$SmokingStatus
  , labels = c("Smoked past 12 months"
             , "Smoked prior to 12 months"
             , "Never smoked"
    )
  )

# Specify the order of the levels to be different from codebook order
nesarc_sub$Sex <-
  factor(nesarc_sub$Sex
  , levels = c( 2
              , 1
              )
  , labels = c( "Female"
              , "Male"
              )
  )
# check ordering with a frequency table
table(nesarc_sub$Sex)

Female   Male 
 24575  18518 
nesarc_sub$NicotineDependence <-
  factor(nesarc_sub$NicotineDependence
  , levels = c( 1
              , 0
              )
  , labels = c( "Nicotine Dependence"
              , "No Nicotine Dependence"
              )
  )


# Remember to include the new category level from the original `NA` values.

nesarc_sub$SmokingFreq <-
  factor(nesarc_sub$SmokingFreq
  , levels = c( 7
              , 6
              , 5
              , 4
              , 3
              , 2
              , 1
              )
  , labels = c( "Never"
              , "Once a month or less"
              , "2 to 3 Days/month"
              , "1 to 2 Days/week"
              , "3 to 4 Days/week"
              , "5 to 6 Days/week"
              , "Every Day"
              )
  )


nesarc_sub$Ethnicity <-
  factor(nesarc_sub$Ethnicity
    # shorter labels are helpful for plots
  , labels = c( "Cauc"
              , "AfAm"
              , "NaAm"
              , "Asia"
              , "Hisp"
              )
  # , labels = c("Caucasian"
  #            , "African American"
  #            , "Native American"
  #            , "Asian"
  #            , "Hispanic"
  #             )
)

nesarc_sub$SmokingFreq3 <-
  factor(nesarc_sub$SmokingFreq3
  , levels = c( 4
              , 3
              , 2
              , 1
              )
  , labels = c( "Never"
              , "Monthly"
              , "Weekly"
              , "Daily"
              )
  )


summary(nesarc_sub)
       ID            Sex             Age                         SmokingStatus                NicotineDependence               SmokingFreq    DailyCigsSmoked  Ethnicity   
 1      :    1   Female:24575   Min.   :18.0   Smoked past 12 months    : 9913   Nicotine Dependence   : 4962    Never               :25080   Min.   : 0.000   Cauc:24507  
 2      :    1   Male  :18518   1st Qu.:32.0   Smoked prior to 12 months: 8078   No Nicotine Dependence:38131    Every Day           :14836   1st Qu.: 0.000   AfAm: 8245  
 3      :    1                  Median :44.0   Never smoked             :25080                                   Once a month or less:  772   Median : 0.000   NaAm:  701  
 4      :    1                  Mean   :46.4   NA's                     :   22                                   1 to 2 Days/week    :  747   Mean   : 6.462   Asia: 1332  
 5      :    1                  3rd Qu.:59.0                                                                     3 to 4 Days/week    :  687   3rd Qu.:10.000   Hisp: 8308  
 6      :    1                  Max.   :98.0                                                                     (Other)             :  869   Max.   :98.000               
 (Other):43087                                                                                                   NA's                :  102   NA's   :262                  
          Depression    IncomePersonal      Height_ft      Height_in        Weight_lb       DaysSmoke     Height_inches  DaysSmoke_log2  TotalCigsSmoked  TotalCigsSmoked_log2
 No Depression :35254   Min.   :      0   Min.   :4.00   Min.   : 0.000   Min.   : 62.0   Min.   : 0.00   Min.   :48.0   Min.   : -Inf   Min.   :   0.0   Min.   : 0.000      
 Yes Depression: 7839   1st Qu.:   8800   1st Qu.:5.00   1st Qu.: 3.000   1st Qu.:140.0   1st Qu.: 0.00   1st Qu.:64.0   1st Qu.: -Inf   1st Qu.:   0.0   1st Qu.: 7.229      
                        Median :  20000   Median :5.00   Median : 5.000   Median :165.0   Median : 0.00   Median :66.0   Median : -Inf   Median :   0.0   Median : 8.492      
                        Mean   :  28185   Mean   :5.11   Mean   : 5.279   Mean   :170.6   Mean   :10.96   Mean   :66.6   Mean   : -Inf   Mean   : 187.5   Mean   : 7.888      
                        3rd Qu.:  36000   3rd Qu.:5.00   3rd Qu.: 8.000   3rd Qu.:192.0   3rd Qu.:30.00   3rd Qu.:70.0   3rd Qu.:4.907   3rd Qu.: 300.0   3rd Qu.: 9.229      
                        Max.   :3000000   Max.   :7.00   Max.   :11.000   Max.   :500.0   Max.   :30.00   Max.   :84.0   Max.   :4.907   Max.   :2940.0   Max.   :11.522      
                                          NA's   :730    NA's   :761      NA's   :1376    NA's   :102     NA's   :761    NA's   :102     NA's   :297      NA's   :25377       
                 CigsSmokedFac    SmokingFreq3  
 No smoking (0)         :25080   Never  :25080  
 Lowest smoking (1-6)   : 1045   Monthly:  772  
 Low smoking (7-30)     : 1129   Weekly : 1156  
 Medium smoking (31-300): 6456   Daily  :15983  
 High smoking (301+)    : 9086   NA's   :  102  
 NA's                   :  297                  
                                                

4.1.6 Data subset rows

4.2 Data is complete (Class 06)

4.2.1 Plot entire dataset, show missing values

{r}
ggplot_missing(nesarc_sub)

4.2.2 Numerical summaries to assess correctness (Class 05)

summary(nesarc_sub)
       ID            Sex             Age                         SmokingStatus                NicotineDependence               SmokingFreq    DailyCigsSmoked  Ethnicity   
 1      :    1   Female:24575   Min.   :18.0   Smoked past 12 months    : 9913   Nicotine Dependence   : 4962    Never               :25080   Min.   : 0.000   Cauc:24507  
 2      :    1   Male  :18518   1st Qu.:32.0   Smoked prior to 12 months: 8078   No Nicotine Dependence:38131    Every Day           :14836   1st Qu.: 0.000   AfAm: 8245  
 3      :    1                  Median :44.0   Never smoked             :25080                                   Once a month or less:  772   Median : 0.000   NaAm:  701  
 4      :    1                  Mean   :46.4   NA's                     :   22                                   1 to 2 Days/week    :  747   Mean   : 6.462   Asia: 1332  
 5      :    1                  3rd Qu.:59.0                                                                     3 to 4 Days/week    :  687   3rd Qu.:10.000   Hisp: 8308  
 6      :    1                  Max.   :98.0                                                                     (Other)             :  869   Max.   :98.000               
 (Other):43087                                                                                                   NA's                :  102   NA's   :262                  
          Depression    IncomePersonal      Height_ft      Height_in        Weight_lb       DaysSmoke     Height_inches  DaysSmoke_log2  TotalCigsSmoked  TotalCigsSmoked_log2
 No Depression :35254   Min.   :      0   Min.   :4.00   Min.   : 0.000   Min.   : 62.0   Min.   : 0.00   Min.   :48.0   Min.   : -Inf   Min.   :   0.0   Min.   : 0.000      
 Yes Depression: 7839   1st Qu.:   8800   1st Qu.:5.00   1st Qu.: 3.000   1st Qu.:140.0   1st Qu.: 0.00   1st Qu.:64.0   1st Qu.: -Inf   1st Qu.:   0.0   1st Qu.: 7.229      
                        Median :  20000   Median :5.00   Median : 5.000   Median :165.0   Median : 0.00   Median :66.0   Median : -Inf   Median :   0.0   Median : 8.492      
                        Mean   :  28185   Mean   :5.11   Mean   : 5.279   Mean   :170.6   Mean   :10.96   Mean   :66.6   Mean   : -Inf   Mean   : 187.5   Mean   : 7.888      
                        3rd Qu.:  36000   3rd Qu.:5.00   3rd Qu.: 8.000   3rd Qu.:192.0   3rd Qu.:30.00   3rd Qu.:70.0   3rd Qu.:4.907   3rd Qu.: 300.0   3rd Qu.: 9.229      
                        Max.   :3000000   Max.   :7.00   Max.   :11.000   Max.   :500.0   Max.   :30.00   Max.   :84.0   Max.   :4.907   Max.   :2940.0   Max.   :11.522      
                                          NA's   :730    NA's   :761      NA's   :1376    NA's   :102     NA's   :761    NA's   :102     NA's   :297      NA's   :25377       
                 CigsSmokedFac    SmokingFreq3  
 No smoking (0)         :25080   Never  :25080  
 Lowest smoking (1-6)   : 1045   Monthly:  772  
 Low smoking (7-30)     : 1129   Weekly : 1156  
 Medium smoking (31-300): 6456   Daily  :15983  
 High smoking (301+)    : 9086   NA's   :  102  
 NA's                   :  297                  
                                                

5 Graphing and Tabulating

5.1 Class 06, Plotting univariate

Rubric

  1. (3 p) For one categorical variable, a barplot is plotted with axis labels and a title. Interpret the plot: describe the relationship between categories you observe.

  2. (3 p) For one numerical variable, a histogram or boxplot is plotted with axis labels and a title. Interpret the plot: describe the distribution (shape, center, spread, outliers).

  3. (2 p) Code missing variables, remove records with missing values, indicate with R output that this was done correctly (e.g., str(), dim(), summary()).

    • Scroll up to sections labeled “(Class 06)”.
  4. (2 p) Label levels of factor variables.

    • Scroll up to sections labeled “(Class 06)”.

5.2 Categorical variables

5.2.1 Tables for categorical variables

# table() produces a one-variable frequency table
table(nesarc_sub$NicotineDependence)

   Nicotine Dependence No Nicotine Dependence 
                  4962                  38131 
# proportions available by passing these frequencies to prop.table()
table(nesarc_sub$NicotineDependence) %>% prop.table()

   Nicotine Dependence No Nicotine Dependence 
             0.1151463              0.8848537 
tab_nicdep <-
  nesarc_sub %>%
  group_by(
    NicotineDependence
  ) %>%
  summarize(
    n = n()
  #, .groups = "drop_last"
  ) %>%
  mutate(
    prop = round(n / sum(n), 3)
  ) %>%
  #arrange(
  #  desc(n)
  #) %>%
  ungroup()

tab_nicdep
# A tibble: 2 x 3
  NicotineDependence         n  prop
  <fct>                  <int> <dbl>
1 Nicotine Dependence     4962 0.115
2 No Nicotine Dependence 38131 0.885

5.2.2 Graphing frequency tables

library(ggplot2)
p1 <- ggplot(data = nesarc_sub, aes(x = NicotineDependence))
p1 <- p1 + geom_bar()
p1 <- p1 + labs(x     = "Nicotine Dependence"
              , y     = "Total Number"
              , title = "Nicotine Dependence for Adults"
              )
print(p1)

Interpretation: Among adults (n = 43093), there are about 8 times more people (88%) who are not nicotine dependent compared to those who are (12%).

5.3 Numeric variables

5.3.1 Graphing numeric variables

p <- ggplot(data = nesarc_sub, aes(x = DailyCigsSmoked))
p <- p + geom_histogram(aes(y = ..density..), boundary = 0, binwidth = 5)
#p <- p + geom_rug()
p <- p + geom_density(alpha = 0.2, fill = "gray50", adjust = 4)
p <- p + labs(x = "Estimated Cigarettes Smoked per Month"
            , y = ""
            , title = "Monthly cigaretts smoked for Young Smoking Adults"
            )
p <- p + theme_bw()
print(p)

summary(nesarc_sub$DailyCigsSmoked)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  0.000   0.000   0.000   6.462  10.000  98.000     262 
fivenum(nesarc_sub$DailyCigsSmoked)
[1]  0  0  0 10 98
median(nesarc_sub$DailyCigsSmoked, na.rm = TRUE)
[1] 0
IQR(nesarc_sub$DailyCigsSmoked, na.rm = TRUE)
[1] 10

Interpretation: Among smokers, the number of cigarettes smoked is right skewed (shape) with a median (center) of 15 and the IQR (spread) (interquartile range, middle 50%) for the distribution is 14. There are two distinct modes (peaks) in the distribution, the first between 0 and 10 cigarettes (half pack) and the second at 20 (full pack). There are several outlying values (outliers) greater than 40 (representing 40 cigarettes per day) that are possibly overestimates from the people responding.


5.4 Class 07, Plotting bivariate, numeric response

Rubric

  1. Each of the following (2 p for plot, 2 p for labeled axes and title, 1 p for interpretation):

    1. Scatter plot (for regression): \(x\) = numerical, \(y\) = numerical, include axis labels and a title. Interpret the plot: describe the relationship.

    2. Box plots (for ANOVA): \(x\) = categorical, \(y\) = numerical, include axis labels and a title. Interpret the plot: describe the relationship.

5.4.1 Scatter plot (for regression): x = numerical, y = numerical

Interpretation: The relationship between Age and TotalCigsSmoked, among people who smoke, is very weakly positive, but does not resemble a straight line. TotalCigsSmoked is right-skewed which results in many extreme values above the regression line.

{r}
library(ggplot2)
p <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = Age, y = TotalCigsSmoked))
#p <- p + geom_point(alpha = 1/10)
p <- p + geom_jitter(width = 0.25, height = 5, alpha = 1/10)
p <- p + geom_smooth()
p <- p + stat_smooth(method = lm, colour = "red")
p <- p + labs(
            title = "Total Cigarettes Smoked vs Age"
          #, x =
          , y = "Total Cigarettes Smoked per month"
          , caption = "Key: Blue line is GAM smoother, Red line is simple linear regression.\nFiltered to include only smokers (cig > 0)."
          )
print(p)

An example of height and weight.

{r}
library(ggplot2)
p <- ggplot(nesarc_sub, aes(x = Height_inches, y = Weight_lb))
p <- p + geom_jitter(height = 0.5, alpha = 1/10)
p <- p + stat_smooth(method = lm)
p <- p + labs(title = "Weight vs Height")
print(p)

5.4.2 Box plots (for ANOVA): x = categorical, y = numerical

Interpretation: For NESARC participants who smoke (cig > 0), those who are depressed have a slightly higher mean number of cigarettes per month (red diamond) than those who are not depressed. Total cigarettes smoked per month is right skewed so there are many people who smoke very little but few people who smoke a lot, which increases the mean beyond the median.

{r}
library(ggplot2)
p <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = Depression, y = TotalCigsSmoked))
p <- p + geom_boxplot(width = 0.5, alpha = 0.5)
p <- p + geom_jitter(position = position_jitter(width = 0.1), alpha = 1/100)
# diamond at mean for each group
p <- p + stat_summary(fun.y = mean, geom = "point", shape = 18, size = 6,
                      colour = "red", alpha = 0.8)
p <- p + labs(
            title = "Total Cigarettes Smoked by Depression"
          , x = "Depression status"
          , y = "Total Cigarettes Smoked per month"
          , caption = "Red diamond is the mean.\nFiltered to include only smokers (cig > 0)."
          )
print(p)

5.5 Class 08, Plotting bivariate, categorical response

Rubric

  1. Each of the following (2 p for plot, 2 p for labeled axes and title, 1 p for interpretation):

    1. Mosaic plot or bivariate bar plots (for contingency tables): \(x\) = categorical, \(y\) = categorical, include axis labels and a title. Interpret the plot: describe the relationship.

    2. Logistic scatter plot (for logistic regression): \(x\) = numerical, \(y\) = categorical (binary), include axis labels and a title. Interpret the plot: describe the relationship.

5.5.1 Mosaic plot or bivariate bar plots (for contingency tables): x = categorical, y = categorical

tab_dep_nic <-
  nesarc_sub %>%
  group_by(
    Depression, NicotineDependence
  ) %>%
  summarize(
    n = n()
  ) %>%
  mutate(
    prop = round(n / sum(n), 3)
  ) %>%
  ungroup()

tab_dep_nic
# A tibble: 4 x 4
  Depression     NicotineDependence         n  prop
  <fct>          <fct>                  <int> <dbl>
1 No Depression  Nicotine Dependence     3193 0.091
2 No Depression  No Nicotine Dependence 32061 0.909
3 Yes Depression Nicotine Dependence     1769 0.226
4 Yes Depression No Nicotine Dependence  6070 0.774
tab_nic_dep <-
  nesarc_sub %>%
  group_by(
    NicotineDependence, Depression
  ) %>%
  summarize(
    n = n()
  ) %>%
  mutate(
    prop = round(n / sum(n), 3)
  ) %>%
  ungroup()

tab_nic_dep
# A tibble: 4 x 4
  NicotineDependence     Depression         n  prop
  <fct>                  <fct>          <int> <dbl>
1 Nicotine Dependence    No Depression   3193 0.643
2 Nicotine Dependence    Yes Depression  1769 0.357
3 No Nicotine Dependence No Depression  32061 0.841
4 No Nicotine Dependence Yes Depression  6070 0.159

Reversing color order: It is sometimes preferable to reverse the order of your color-filled variable to read the more interesting proportions from the bottom of the plot up the proportion scale. Use the function forcats::fct_rev() on your fill= variable.

library(ggplot2)
p <- ggplot(data = nesarc_sub, aes(x = Depression, fill = forcats::fct_rev(NicotineDependence)))
p <- p + geom_bar(position = "fill")
p <- p + theme_bw()
p <- p + labs(x = "Depression status"
           , y = "Proportion"
           , title = "Proportion of young adult smokers with and without\nnicotine dependence by depression status"
           )
# the legend title can be modified, if desired (try this line)
p <- p + scale_fill_discrete(name = "Nicotine Dependence\nStatus")
print(p)

Interpretation: When a person has depression, they are more than twice as likely to have nicotine dependence (22.6%) than those without depression (9.1%).

5.5.2 Logistic scatter plot (for logistic regression): x = numerical, y = categorical (binary)

table(nesarc_sub$NicotineDependence)

   Nicotine Dependence No Nicotine Dependence 
                  4962                  38131 
# if "Nicotine Dependence", then code it as a 1, otherwise code as a 0
nesarc_sub$NicotineDependence01 <- ifelse(nesarc_sub$NicotineDependence == "Nicotine Dependence", 1, 0)
# check that the frequencies are correct before and after
table(nesarc_sub$NicotineDependence01)

    0     1 
38131  4962 
#  #  Note, sometimes it is necessary to make your variable a character type first.
#  #  If the code above doesn't produce both 0s and 1s, then try this:
#  nesarc_sub$NicotineDependence01 <- ifelse(as.character(nesarc_sub$NicotineDependence) == "Nicotine Dependence", 1, 0)
{r, fig.height = 4, fig.width = 8}
library(ggplot2)
p <- ggplot(nesarc_sub, aes(x = TotalCigsSmoked, y = NicotineDependence01))
p <- p + theme_bw()
p <- p + geom_jitter(position = position_jitter(height = 0.1), alpha = 1/4)

# overlay a sigmodal curve to indicate direction of relationship
  # As of ggplot2 2.1.0, need to use this binomial_smooth() function:
  binomial_smooth <- function(...) {
    geom_smooth(method = "glm", method.args = list(family = "binomial"), ...)
  }
p <- p + binomial_smooth()    # old way: stat_smooth(method="glm", family="binomial", se=FALSE)

p <- p + scale_y_continuous(breaks = seq(0, 1, by=0.2), minor_breaks = seq(0, 1, by=0.1))

p <- p + labs(x = "Estimated Cigarettes Smoked per Month"
            , y = "Nicotine Dependence (1=Yes, 0=No)"
            , title = "Likelihood of Nicotine Dependence\nincreases with number of cigarettes smoked"
            )
print(p)

Interpretation: The probability of someone being tobacco dependent increases as the number of cigarettes smoked per month increases. The logistic curve model probably doesn’t apply at (or near) 0 cigarettes.

6 Statistical methods

6.1 Class 09, Simple linear regression (separate worksheet)

6.2 Class 10, Simple linear regression

Rubric

  1. With your previous (or new) bivariate scatter plot, add a regression line.
    • (2 p) plot with regression line,
    • (1 p) label axes and title.
  2. Use lm() to fit the linear regression and interpret slope and \(R^2\) (R-squared) values.
    • (2 p) lm summary() table is presented,
    • (2 p) slope is interpreted with respect to a per-unit increase of the \(x\) variable in the context of the variables in the plot,
    • (2 p) \(R^2\) is interpretted in a sentence.
  3. (1 p) Interpret the intercept. Does it make sense in the context of your study?

6.2.1 1. Scatter plot, add a regression line.

{r}
library(ggplot2)
p <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = Age, y = TotalCigsSmoked))
#p <- p + geom_point(alpha = 1/10)
p <- p + geom_jitter(width = 0.25, height = 5, alpha = 1/10)
#p <- p + geom_smooth()
p <- p + stat_smooth(method = lm, fullrange = TRUE)
p <- p + labs(
            title = "Total Cigarettes Smoked vs Age"
          #, x =
          , y = "Total Cigarettes Smoked per month"
          , caption = "Key: Blue line is simple linear regression.\nFiltered to include only smokers (cig > 0)."
          )
p <- p + xlim(c(0, NA))
print(p)

6.2.2 2. Fit the linear regression, interpret slope and \(R^2\) (R-squared) values

# fit the simple linear regression model
lm_TCS_Age <- 
  lm(
    formula = TotalCigsSmoked ~ Age
  , data = nesarc_sub
  )
# use summary() to parameters estimates (slope, intercept) and other summaries
summary(lm_TCS_Age)

Call:
lm(formula = TotalCigsSmoked ~ Age, data = nesarc_sub)

Residuals:
   Min     1Q Median     3Q    Max 
-332.1 -189.4 -139.1  106.3 2828.9 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 57.92792    4.35989   13.29   <2e-16 ***
Age          2.79797    0.08762   31.93   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 329.2 on 42794 degrees of freedom
  (297 observations deleted due to missingness)
Multiple R-squared:  0.02327,   Adjusted R-squared:  0.02325 
F-statistic:  1020 on 1 and 42794 DF,  p-value: < 2.2e-16
  • Slope: For every 1 year increase in Age, we expect an increase of 2.798 total cigarettes smoked per month.

  • \(R^2\): The proportion of variance explained by the regression model, compared to the grand mean, is \(R^2 = 0.0233\), which is very small.

6.2.3 3. Interpret the intercept. Does it make sense?

For someone 0 years old, the expected number of cigarettes smoked per month is 57.93.

This does not make sense since newborns do not smoke and because this is a large extrapolation from the data.

6.3 Class 11, Logarithm transformation (separate worksheet)

6.4 Class 12, Logarithm transformation

Rubric

  1. Try plotting the data on a logarithmic scale
    • (6 p) Each of the logarithmic relationships is plotted, axes are labeled with scale.
    1. original scales
    2. \(\log(x)\)-only
    3. \(\log(y)\)-only
    4. both \(\log(x)\) and \(\log(y)\)
  2. What happened to your data when you transformed it?
    • (2 p) Describe what happened to the relationship after each log transformation (compare transformed scale to original scale; is the relationship more linear, more curved?).
    • (1 p) Choose the best scale for a linear relationship and explain why.
    • (1 p) Does your relationship benefit from a logarithmic transformation? Say why or why not.

6.4.1 1. Try plotting on log scale (original scale, \(\log(x)\)-only, \(\log(y)\)-only, both \(\log(x)\) and \(\log(y)\))

{r, fig.height = 8, fig.width = 8}
# IncomePersonal
#p <- p + scale_x_log10()
#p <- p + scale_y_log10()

library(ggplot2)
p1 <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = Age, y = TotalCigsSmoked))
p1 <- p1 + theme_bw()
p1 <- p1 + geom_point(alpha = 1/20)
p1 <- p1 + stat_smooth(method = lm)
p1 <- p1 + labs(
            title = "Original scale"
          , x = "Age"
          , y = "Total Cigarettes Smoked per month"
          )
#print(p1)

library(ggplot2)
p2 <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = Age, y = TotalCigsSmoked))
p2 <- p2 + theme_bw()
p2 <- p2 + geom_point(alpha = 1/20)
p2 <- p2 + stat_smooth(method = lm)
p2 <- p2 + scale_x_log10()
p2 <- p2 + labs(
            title = "Log(x)"
          , x = "Age"
          , y = "Total Cigarettes Smoked per month"
          )
#print(p2)

library(ggplot2)
p3 <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = Age, y = TotalCigsSmoked))
p3 <- p3 + theme_bw()
p3 <- p3 + geom_point(alpha = 1/20)
p3 <- p3 + stat_smooth(method = lm)
p3 <- p3 + scale_y_log10()
p3 <- p3 + labs(
            title = "Log(y)"
          , x = "Age"
          , y = "Total Cigarettes Smoked per month"
          )
#print(p3)

library(ggplot2)
p4 <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = Age, y = TotalCigsSmoked))
p4 <- p4 + theme_bw()
p4 <- p4 + geom_point(alpha = 1/20)
p4 <- p4 + stat_smooth(method = lm)
p4 <- p4 + scale_x_log10()
p4 <- p4 + scale_y_log10()
p4 <- p4 + labs(
            title = "Log(x) and Log(y)"
          , x = "Age"
          , y = "Total Cigarettes Smoked per month"
          )
#print(p4)


# grid.arrange() is a way to arrange several ggplot objects
library(gridExtra)
grid.arrange(grobs = list(p1, p2, p3, p4), nrow=2, top = "Total Cigarettes Smoked vs Age")

6.4.2 2. What happened to your data when you transformed it?

  • Describe what happened to the relationship after each log transformation (compare transformed scale to original scale).
  1. Original: Same
  2. log(x): no improvement, age values became left skewed
  3. log(y): some improvement, cigs smoked became left skewed, but the regression line was closer to the center of the data.
  4. log(x) and log(y): some improvement, cigs smoked became left skewed, but the regression line was closer to the center of the data; age transformation doesn’t help.
  • Choose the best scale for a linear relationship and explain why.

log(y) because the cigs smoked became slightly left skewed (better than the extreme right skewness from before), and there was no need for an age transformation.

  • Does your relationship benefit from a logarithmic transformation? Say why or why not.

A little bit since the degree of skewness vertically around the regression line was reduced with the log(y) transformation.

6.4.2.0.1 Extra plots

Here’s an example where transforming both \(x\) and \(y\) with the log makes a big improvement.

{r, fig.height = 8, fig.width = 8}
# IncomePersonal
#p <- p + scale_x_log10()
#p <- p + scale_y_log10()

library(ggplot2)
p1 <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = IncomePersonal, y = TotalCigsSmoked))
p1 <- p1 + theme_bw()
p1 <- p1 + geom_point(alpha = 1/20)
p1 <- p1 + stat_smooth(method = lm)
p1 <- p1 + labs(
            title = "Original scale"
          , x = "Personal Income"
          , y = "Total Cigarettes Smoked per month"
          )
#print(p1)

library(ggplot2)
p2 <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = IncomePersonal, y = TotalCigsSmoked))
p2 <- p2 + theme_bw()
p2 <- p2 + geom_point(alpha = 1/20)
p2 <- p2 + stat_smooth(method = lm)
p2 <- p2 + scale_x_log10()
p2 <- p2 + labs(
            title = "Log(x)"
          , x = "Personal Income"
          , y = "Total Cigarettes Smoked per month"
          )
#print(p2)

library(ggplot2)
p3 <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = IncomePersonal, y = TotalCigsSmoked))
p3 <- p3 + theme_bw()
p3 <- p3 + geom_point(alpha = 1/20)
p3 <- p3 + stat_smooth(method = lm)
p3 <- p3 + scale_y_log10()
p3 <- p3 + labs(
            title = "Log(y)"
          , x = "Personal Income"
          , y = "Total Cigarettes Smoked per month"
          )
#print(p3)

library(ggplot2)
p4 <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = IncomePersonal, y = TotalCigsSmoked))
p4 <- p4 + theme_bw()
p4 <- p4 + geom_point(alpha = 1/20)
p4 <- p4 + stat_smooth(method = lm)
p4 <- p4 + scale_x_log10()
p4 <- p4 + scale_y_log10()
p4 <- p4 + labs(
            title = "Log(x) and Log(y)"
          , x = "Personal Income"
          , y = "Total Cigarettes Smoked per month"
          )
#print(p4)


# grid.arrange() is a way to arrange several ggplot objects
library(gridExtra)
grid.arrange(grobs = list(p1, p2, p3, p4), nrow=2, top = "Total Cigarettes Smoked vs Personal Income")

6.5 Class 13, Correlation (separate worksheet)

6.6 Class 14, Categorical contingency tables (separate worksheet)

6.7 Class 15, Correlation and Categorical contingency tables

Rubric

  1. With your previous (or a new) bivariate scatter plot, calculate the correlation and interpret.
    • (1 p) plot is repeated here or the plot is referenced an easy to find from a plot above,
    • (1 p) correlation is calculated,
    • (2 p) correlation is interpreted (direction, strength of LINEAR relationship).
  2. With your previous (or a new) two- or three-variable categorical plot, calculate conditional proportions and interpret.
    • (1 p) frequency table of variables is given,
    • (2 p) conditional proportion tables are calculated of the outcome variable conditional on one or two other variables,
    • (1 p) a well-labeled plot of the proportion table is given,
    • (2 p) the conditional proportions are interpreted and compared between conditions.

6.7.1 Correlation

{r, fig.height = 5, fig.width = 5}
library(ggplot2)
p <- ggplot(nesarc_sub, aes(x = Age, y = TotalCigsSmoked_log2))
p <- p + theme_bw()
p <- p + geom_point(alpha = 1/20)
p <- p + stat_smooth(method = lm)
p <- p + labs(
            title = "Log2 Cigarettes smoked vs Age"
          , x = "Age"
          , y = "Log2 Total Cigarettes Smoked per month"
          )
print(p)
cor_A_C <-
  cor(
    nesarc_sub$Age
  , nesarc_sub$TotalCigsSmoked_log2
  , use = "complete.obs"
  )
cor_A_C
[1] 0.1375481

6.7.2 Interpretation of correlation

The correlation is 0.138, this is a positive moderately-weak linear relationship between Age and Log2 total cigarettes smoked per month, meaning that as people get older they tend slightly to increase the Log2 Total Number of Cigarettes smoked per month.

6.7.3 Contingency table

tab_S_D_N <-
  nesarc_sub %>%
  group_by(
    Sex
  , Depression
  , NicotineDependence
  ) %>%
  summarise(
    Frequency = n()
  ) %>%
  mutate(
    Proportion = round(Frequency / sum(Frequency), 3)
  ) %>%
  ungroup()

# tab_S_D_N

tab_S_D_N %>%
  knitr::kable()
Sex Depression NicotineDependence Frequency Proportion
Female No Depression Nicotine Dependence 1455 0.076
Female No Depression No Nicotine Dependence 17706 0.924
Female Yes Depression Nicotine Dependence 1117 0.206
Female Yes Depression No Nicotine Dependence 4297 0.794
Male No Depression Nicotine Dependence 1738 0.108
Male No Depression No Nicotine Dependence 14355 0.892
Male Yes Depression Nicotine Dependence 652 0.269
Male Yes Depression No Nicotine Dependence 1773 0.731
library(ggplot2)
p <- ggplot(data = tab_S_D_N, aes(x = Depression, y = Proportion, fill = forcats::fct_rev(NicotineDependence)))
p <- p + geom_hline(yintercept = c(0, 1), alpha = 1/4)
p <- p + geom_bar(stat="identity")
p <- p + theme_bw()
p <- p + labs(
    title = "Proportion of Nicotine Dependence by Depression status\nfor each Gender"
  , y = "Proportion of Nicotine Dependence"
  , fill = "Nicotine Dependence"
  )
p <- p + scale_y_continuous(limits = c(0, 1))
p <- p + facet_wrap(~ Sex)
p <- p + theme(legend.position = "bottom")
print(p)

6.7.4 Interpretation of conditional proportions

  • For Females without depression, the proportion who are nicotine dependent is 0.076, while for those with depression it is 0.206 (nearly 3 times as much).
  • For Males without depression, the proportion who are nicotine dependent is 0.108, while for those with depression it is 0.269 (nearly 3 times as much).

Males generally are more nicotine dependent than Females, but both are nearly 3 times more likely to be nicotine dependent if they are depressed compared to if they are not depressed.

6.8 Class 16, Parameter estimation (one-sample) (separate worksheet)

6.9 Class 17, Inference and Parameter estimation (one-sample)

Rubric

  1. Using a numerical variable, calculate and interpret a confidence interval for the population mean.
    • (1 p) Identify and describe the variable,
    • (1 p) use t.test() to calculate the mean and confidence interval, and
    • (1 p) interpret the confidence interval.
    • (2 p) Using plotting code from the last two classes, plot the data, estimate, and confidence interval in a single well-labeled plot.
  2. Using a two-level categorical variable, calculate and interpret a confidence interval for the population proportion.
    • (1 p) Identify and describe the variable,
    • (1 p) use binom.test() to calculate the sample proportion and confidence interval, and
    • (1 p) interpret the confidence interval.
    • (2 p) Using plotting code from the last two classes, plot the data, estimate, and confidence interval in a single well-labeled plot.

6.9.1 Numeric variable confidence interval for mean \(\mu\)

TotalCigsSmoked_log2 is the estimated number of cigarettes per month a subject smokes (DaysSmoke * DailyCigsSmoked) on the log2 scale.

# calculate mean and CI
t_result <- 
  t.test(
    nesarc_sub$TotalCigsSmoked_log2
  )
t_result

    One Sample t-test

data:  nesarc_sub$TotalCigsSmoked_log2
t = 457.94, df = 17715, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 7.854201 7.921726
sample estimates:
mean of x 
 7.887963 

Interpretation:

  • The mean of TotalCigsSmoked_log2 is 7.89 and the 95% confidence interval is (7.85, 7.92).
  • In 95% of samples, a confidence interval constructed from the data will contain the true population mean. In our case, our CI was (7.85, 7.92). It either contains the true population mean or it doesn’t, but we don’t know.
# plot data and create table
library(ggplot2)
p <- ggplot(nesarc_sub, aes(x = TotalCigsSmoked_log2))
p <- p + geom_histogram(binwidth = 0.5, alpha = 1)
#p <- p + geom_rug(alpha = 1/8)
# est
p <- p + geom_vline(aes(xintercept = as.numeric(t_result$estimate)), colour = "blue", size = 1, linetype = 3)
p <- p + geom_rect(aes(
          xmin = t_result$conf.int[1]
        , xmax = t_result$conf.int[2]
        , ymin = -150
        , ymax = 0)
                  , fill = "blue", alpha = 1)
p <- p + labs(title = "Log2 Total Cigs Smoked\nblue = est +- 95% CI")
print(p)

6.9.2 Categorical variable confidence interval for proportion \(p\)

Depression is a Yes/No variable indicating that the person has major depression (lifetime).

# proportions are of the last group_by() variable within combinations of the other variables
tab_dep_long <-
  nesarc_sub %>%
  group_by(Depression) %>%
  summarise(Frequency = n()) %>%
  mutate(Proportion = Frequency / sum(Frequency)) %>%
  ungroup()
tab_dep_long
# A tibble: 2 x 3
  Depression     Frequency Proportion
  <fct>              <int>      <dbl>
1 No Depression      35254      0.818
2 Yes Depression      7839      0.182
# prop.test() is an asymptotic (approximate) test for a binomial random variable
p_summary <-
  prop.test(
    x = tab_dep_long %>% filter(Depression == "Yes Depression") %>% pull(Frequency)
  , n = sum(tab_dep_long$Frequency)
  #, conf.level = 0.95
  )
p_summary

    1-sample proportions test with continuity correction

data:  tab_dep_long %>% filter(Depression == "Yes Depression") %>% pull(Frequency) out of sum(tab_dep_long$Frequency), null probability 0.5
X-squared = 17440, df = 1, p-value < 2.2e-16
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.1782835 0.1855912
sample estimates:
        p 
0.1819089 

Interpretation:

  • The proportion of people who have Depression is 0.182 and the 95% confidence interval is (0.178, 0.186).
  • In 95% of samples, a confidence interval constructed from the data will contain the true population proportion In our case, our CI was (0.178, 0.186). It either contains the true population proportion or it doesn’t, but we don’t know.
{r}
library(ggplot2)
p <- ggplot(data = tab_dep_long %>% filter(Depression == "Yes Depression"), aes(x = Depression, y = Proportion))
p <- p + geom_hline(yintercept = c(0, 1), alpha = 1/4)
#p <- p + geom_bar(stat = "identity")
p <- p + geom_errorbar(aes(min = p_summary$conf.int[1], max = p_summary$conf.int[2]), width=0.25)
p <- p + geom_point(colour = "red", size = 6, pch = 18)
p <- p + scale_y_continuous(limits = c(0.17, 0.19))
# swap axes to take less vertical space
#p <- p + coord_flip()
print(p)

6.10 Class 18, Hypothesis testing (one- and two-sample) (separate worksheet)

6.11 Class 19, Paired data, assumption assessment (separate worksheet)

6.12 Class 20, Hypothesis testing (one- and two-sample)

6.12.1 Mechanics of a hypothesis test (review)

  1. Set up the null and alternative hypotheses in words and notation.

    • In words: ``The population mean for [what is being studied] is different from [value of \(\mu_0\)].’’ (Note that the statement in words is in terms of the alternative hypothesis.)
    • In notation: \(H_0: \mu=\mu_0\) versus \(H_A: \mu \ne \mu_0\) (where \(\mu_0\) is specified by the context of the problem).
  2. Choose the significance level of the test, such as \(\alpha=0.05\).

  3. Compute the test statistic, such as \(t_{s} = \frac{\bar{Y}-\mu_0}{SE_{\bar{Y}}}\), where \(SE_{\bar{Y}}=s/\sqrt{n}\) is the standard error.

  4. Determine the tail(s) of the sampling distribution where the \(p\)-value from the test statistic will be calculated (for example, both tails, right tail, or left tail). (Historically, we would compare the observed test statistic, \(t_{s}\), with the critical value \(t_{\textrm{crit}}=t_{\alpha/2}\) in the direction of the alternative hypothesis from the \(t\)-distribution table with degrees of freedom \(df = n-1\).)

  5. State the conclusion in terms of the problem.

    • Reject \(H_0\) in favor of \(H_A\) if \(p\textrm{-value} < \alpha\).
    • Fail to reject \(H_0\) if \(p\textrm{-value} \ge \alpha\). (Note: We DO NOT accept \(H_0\).)
  6. Check assumptions of the test (for now we skip this).

6.12.2 What do we do about “significance”?

Adapted from Significance Magazine.

Recent calls have been made to abandon the term “statistical significance”. The American Statistical Association (ASA) issued its statement and recommendation on p-values (see the special issue of p-values for more).

In summary, the problem of “significance” is one of misuse, misunderstanding, and misinterpretation. The recommendation in this class is that it is no longer sufficient to say that a result is “statistically significant” or “non-significant” depending on whether a p-value is less than a threshold. Instead, we will be looking for wording as in the following paragraph.

“The difference between the two groups turns out to be small (8%), while the probability (\(p\)) of observing a result at least as extreme as this under the null hypothesis of no difference between the two groups is \(p = 0.003\) (that is, 0.3%). This p-value is statistically significant as it is below our pre-defined threshold (\(p < 0.05\)). However, the p-value tells us only that the 8% difference between the two groups is somewhat unlikely given our hypothesis and null model’s assumptions. More research is required, or other considerations may be needed, to conclude that the difference is of practical importance and reproducible.”

6.12.3 Two-sample \(t\)-test

Rubric

  1. Using a numerical response variable and a two-level categorical variable (or a categorical variable you can reduce to two levels), specify a two-sample \(t\)-test associated with your research questions.
    • (2 p) Specify the hypotheses in words and notation (either one- or two-sided test),
    • (0 p) use t.test() to calculate the mean, test statistic, and p-value,
    • (2 p) state the significance level, test statistic, and p-value, and
    • (2 p) state the conclusion in the context of the problem.
    • (1 p) Given your conclusion, could you have committed at Type-I or Type-II error?
    • (2 p) Provide an appropriate plot of the data and sample estimates in a well-labeled plot.
    • (1 p) Check the assumptions of the test using the bootstrap and interpret the bootstrap sampling distribution plot.
  2. Set up the null and alternative hypotheses in words and notation.
    • In words: “The population mean total cigarettes smoked (TotalCigsSmoked) is greater for people with depression than for those without depression (Depression).”
    • In notation: \(H_0: \mu_D = \mu_N\) versus \(H_A: \mu_D > \mu_N\).
t_summary_TCS_D <-
  t.test(
    TotalCigsSmoked ~ Depression
  , data = nesarc_sub
  , alternative = "less"
  )
t_summary_TCS_D

    Welch Two Sample t-test

data:  TotalCigsSmoked by Depression
t = -17.226, df = 10439, p-value < 2.2e-16
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
      -Inf -71.94239
sample estimates:
 mean in group No Depression mean in group Yes Depression 
                    173.0296                     252.5677 
  1. Choose the significance level of the test, such as \(\alpha=0.05\).

  2. Compute the test statistic, such as \(t_{s} = -17.2\).

  3. \(p\)-value is \(p = 6.85\times 10^{-66}\).

  4. State the conclusion in terms of the problem.

    • Reject \(H_0\) in favor of \(H_A\) concluding that the population mean total cigarettes smoked (TotalCigsSmoked) is greater for people with depression than for those without depression (Depression).
  • Because we rejected the Null hypothesis, we could have made a Type I error.
## If we create a summary data.frame with a similar structure as our data, then we
##   can annotate our plot with those summaries.
est_mean_TCS_D <-
  nesarc_sub %>%
  group_by(Depression) %>%
  summarise(TotalCigsSmoked = mean(TotalCigsSmoked, na.rm = TRUE)) %>%
  ungroup()
est_mean_TCS_D
# A tibble: 2 x 2
  Depression     TotalCigsSmoked
  <fct>                    <dbl>
1 No Depression             173.
2 Yes Depression            253.
{R, fig.width=4, fig.height=6}
library(ggplot2)
p <- ggplot(nesarc_sub, aes(x = Depression, y = TotalCigsSmoked))
p <- p + geom_boxplot(width = 0.5, alpha = 0.5)
p <- p + geom_jitter(position = position_jitter(width = 0.1), alpha = 1/4)
# diamond at mean for each group
p <- p + stat_summary(fun = mean, geom = "point", shape = 18, size = 6,
                      colour = "red", alpha = 0.8)
p <- p + labs(
            title = "Total Cigarettes Smoked by Depression"
          , x = "Depression status"
          , y = "Total cigarettes smoked per month"
          )
print(p)
# histogram using ggplot
p <- ggplot(nesarc_sub, aes(x = TotalCigsSmoked))
p <- p + geom_histogram(binwidth = 40)
#p <- p + geom_rug()
p <- p + geom_vline(data = est_mean_TCS_D, aes(xintercept = TotalCigsSmoked), colour = "red")
p <- p + facet_grid(Depression ~ .)
p <- p + labs(
            title = "Total Cigarettes Smoked by Depression"
          , x = "Depression status"
          , y = "Total cigarettes smoked per month"
          )
print(p)