ADA1: Erik’s cumulative project file

Model assignment: NESARC, Smoking and depression in adults

Advanced Data Analysis 1, Stat 427/527, Fall 2022, Prof. Erik Erhardt, UNM

Author

Erik Erhardt

Published

August 24, 2022


1 Document overview

Important

Please don’t let the initial size and detail of this document intimidate you. You got this!

This document is organized by Week and Class number. The worksheet assignments are indicated by the Class numbers.

Consider your readers (graders):

  • organize the document clearly (use this document as an example)
  • label minor sections under each day (use this document as an example)
  • For each thing you do, always have these three parts:
    1. Say what you’re going to do and why.
    2. Do it with code, and document your code.
    3. Interpret the results.

1.1 Document

1.1.1 Naming

Note

Each class, save this file with a new name, updating the last two digits to the class number. Then, you’ll have a record of your progress, as well as which files you turned in for grading.

  • ADA1_ALL_05.qmd
  • ADA1_ALL_06.qmd
  • ADA1_ALL_07.qmd

A version that I prefer uses a Year-Month-Day (YYYYMMDD) date:

  • ADA1_ALL_20220903.qmd
  • ADA1_ALL_20220905.qmd
  • ADA1_ALL_20220910.qmd

1.1.2 Structure

We will include all of our assignments together in this document to retain the relevant information needed for subsequent assignments, since our analysis is cumulative. You will also have an opportunity to revisit previous parts to make changes or improvements, such as updating your codebook, recoding variables, and improving tables and plots. I’ve provided an initial predicted organization of our sections and subsections using the # and ## symbols. A table of contents is automatically generated by “toc: true” in the YAML header, and the headings in the table of contents are clickable to jump down to each (sub)section.
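For reference, here is a minimal sketch of a YAML header with the toc option turned on (the title and author values are illustrative):

---
title: "ADA1: Cumulative project file"
author: "Your Name"
format:
  html:
    toc: true
---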

1.1.3 Classes not appearing in this document

Some assignments are in a separate worksheet and are indicated with “(separate worksheet)”. For these, I’ll provide a dataset I want you to analyze. Typically, you’ll then return to this document and repeat the same type of analysis with your dataset.


2 Research Questions

2.1 Class 02, Personal Codebook

Rubric

  1. (1 p) Is there a topic of interest?

  2. (2 p) Are the variables relevant to a set of research questions?

  3. (4 p) Are there at least 2 categorical and 2 numerical variables (at least 4 “data” variables)?

    • 1 categorical variable with only 2 levels
    • 1 categorical variable with at least 3 levels
    • 2 numerical variables with many possible unique values
    • More variables are welcome and you’re likely to add to this later in the semester
  4. (3 p) For each variable, is there a variable description, a data type, and coded value descriptions?

  5. Compile this qmd file to an html, print/save to pdf, and upload to UNM Canvas.

2.1.1 Topic and research questions

Topic:

Is there an association between nicotine dependence and the frequency and quantity of smoking in adults? The association may differ by ethnicity, age, gender, and other factors (though we won’t be able to test these additional associations until next semester in ADA2).

Research questions:

  1. Is nicotine dependence [TAB12MDX] associated with smoking frequency [S3AQ3B1] and quantity [S3AQ3C1]?

  2. Is nicotine dependence [TAB12MDX] associated with depression [MAJORDEPLIFE]?

  3. Is the association between nicotine dependence [TAB12MDX] and depression [MAJORDEPLIFE] different by demographics, such as Education or Sex?

3 Codebook

National Epidemiologic Survey on Alcohol and Related Conditions (NESARC)

Dataset: NESARC
Primary association: nicotine dependence vs frequency and quantity of smoking

Key:
RenamedVarName
  VarName original in dataset
  Variable description
  Data type (Continuous, Discrete, Nominal, Ordinal)
  Frequency ItemValue Description

ID
  IDNUM
  UNIQUE ID NUMBER WITH NO ALPHABETICS
  Categorical, Nominal
  43093 1-43093. Unique Identification number

Sex
  SEX
  SEX
  Categorical, Nominal
  18518 1. Male
  24575 2. Female

Age
  AGE
  AGE
  Numerical, Continuous
  43079 18-97. Age in years
     14 98. 98 years or older

SmokingStatus
  CHECK321
  CIGARETTE SMOKING STATUS
  Categorical, Nominal
   9913 1. Smoked cigarettes in the past 12 months
   8078 2. Smoked cigarettes prior to the last 12 months
     22 9. Unknown
  25080 BL. NA, never or unknown if ever smoked 100+ cigarettes

NicotineDependence
  TAB12MDX
  NICOTINE DEPENDENCE IN THE LAST 12 MONTHS
  Categorical, Nominal
  38131 0. No nicotine dependence
   4962 1. Nicotine dependence

SmokingFreq
  S3AQ3B1
  USUAL FREQUENCY WHEN SMOKED CIGARETTES
  Categorical, Ordinal
  14836 1. Every day
    460 2. 5 to 6 Day(s) a week
    687 3. 3 to 4 Day(s) a week
    747 4. 1 to 2 Day(s) a week
    409 5. 2 to 3 Day(s) a month
    772 6. Once a month or less
    102 9. Unknown
  25080 BL. NA, never or unknown if ever smoked 100+ cigarettes

DailyCigsSmoked
  S3AQ3C1
  USUAL QUANTITY WHEN SMOKED CIGARETTES
  Numerical, Discrete
  17751 1-98. Cigarette(s)
    262 99. Unknown
  25080 BL. NA, never or unknown if ever smoked 100+ cigarettes

Ethnicity
  ETHRACE2A
  IMPUTED RACE/ETHNICITY
  Categorical, Nominal
  24507 1. White, Not Hispanic or Latino
   8245 2. Black, Not Hispanic or Latino
    701 3. American Indian/Alaska Native, Not Hispanic or Latino
   1332 4. Asian/Native Hawaiian/Pacific Islander, Not Hispanic or Latino
   8308 5. Hispanic or Latino

Depression
  MAJORDEPLIFE
  MAJOR DEPRESSION - LIFETIME (NON-HIERARCHICAL)
  Categorical, Nominal
  35254 0. No
   7839 1. Yes

IncomePersonal
  S1Q10A
  TOTAL PERSONAL INCOME IN LAST 12 MONTHS
  Numerical, Continuous
  43093 0-3000000. Income in dollars

IncomePersonalCat
  S1Q10B
  TOTAL PERSONAL INCOME IN LAST 12 MONTHS: CATEGORY
  Categorical, Ordinal
  2462  0. $      0 (No personal income)
  3571  1. $      1 to $ 4,999
  3823  2. $  5,000 to $ 7,999
  2002  3. $  8,000 to $ 9,999
  3669  4. $ 10,000 to $12,999
  1634  5. $ 13,000 to $14,999
  3940  6. $ 15,000 to $19,999
  3887  7. $ 20,000 to $24,999
  3085  8. $ 25,000 to $29,999
  3003  9. $ 30,000 to $34,999
  2351 10. $ 35,000 to $39,999
  3291 11. $ 40,000 to $49,999
  2059 12. $ 50,000 to $59,999
  1328 13. $ 60,000 to $69,999
   857 14. $ 70,000 to $79,999
   521 15. $ 80,000 to $89,999
   290 16. $ 90,000 to $99,999
  1320 17. $100,000 or more

Height_ft
  S1Q24FT
  HEIGHT: FEET
  Numerical, Continuous
  42363 4-7. Feet
    730 99. Unknown
          * change 99 to NA

Height_in
  S1Q24IN
  HEIGHT: INCHES
  Numerical, Continuous
   3572 0. None
  38760 1-11. Inches
    761 99. Unknown
          * change 99 to NA

Weight_lb
  S1Q24LB
  WEIGHT: POUNDS
  Numerical, Continuous
  41717 62-500. Pounds
   1376 999. Unknown
          * change 999 to NA

Additional variables were created from the original variables:

CREATED VARIABLES

DaysSmoke
  estimated number of days per month a subject smokes
  Numerical, Continuous (due to the way it is estimated); assumes 30 days per month, based on SmokingFreq
  1-30.

TotalCigsSmoked
  estimated number of cigarettes per month a subject smokes (DaysSmoke * DailyCigsSmoked)
  Numerical, Continuous
  0-large.

TotalCigsSmoked_smokers
  estimated number of cigarettes per month a subject smokes (DaysSmoke * DailyCigsSmoked)
  Smokers only (0s replaced with NAs)
  Numerical, Continuous
  1-large.

CigsSmokedFac
  categories of TotalCigsSmoked
  Categorical, Ordinal
  five levels: No smoking (0), Lowest (1-6), Low (7-30), Medium (31-300), High (301+)

CigsSmokedFac2
  quartiles of TotalCigsSmoked
  Categorical, Ordinal
  Q1-Q4, ranges for each of the four quarters

SmokingFreq3
  smoking frequency collapsed to three levels (daily, weekly, monthly), plus never
  Categorical, Ordinal
    from SmokingFreq
    1. daily   = 1. Every day, 2. 5 to 6 Day(s) a week, 3. 3 to 4 Day(s) a week
    2. weekly  = 4. 1 to 2 Day(s) a week, 5. 2 to 3 Day(s) a month
    3. monthly = 6. Once a month or less
    4. Never   = 7. Never

Height_inches
  Total height in inches
  Numerical, Continuous
  Height_ft * 12 + Height_in

4 Data Management

4.1 Class 03, Data subset and numerical summaries

Rubric

  1. (4 p) The data are loaded and a data.frame subset is created by selecting only the variables in the personal codebook.

    • Scroll down to sections labeled “(Class 03)”.
  2. (1 p) Output confirms the subset is correct (e.g., using dim() and str()).

  3. (3 p) Rename your variables to descriptive names (e.g., from “S3AQ3B1” to “SmokingFreq”).

    • Scroll down to sections labeled “(Class 03)”.
  4. (2 p) Provide numerical summaries for all variables (e.g., using summary()).

    • Scroll down to sections labeled “(Class 03)”.

4.1.1 Data subset (Class 03)

First, read the data and see the size (dim()).

# data analysis packages
library(erikmisc)   # Helpful functions
── Attaching packages ─────────────────────────────────────── erikmisc 0.1.16 ──
✔ tibble 3.1.8     ✔ dplyr  1.0.9
── Conflicts ─────────────────────────────────────────── erikmisc_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
erikmisc, solving common complex data analysis workflows
  by Dr. Erik Barry Erhardt <erik@StatAcumen.com>
library(tidyverse)  # Data manipulation and visualization suite
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tidyr   1.2.0     ✔ stringr 1.4.1
✔ readr   2.1.2     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
library(lubridate)  # Dates

Attaching package: 'lubridate'

The following objects are masked from 'package:base':

    date, intersect, setdiff, union
ggplot2::theme_set(ggplot2::theme_bw())  # set theme_bw for all plots

  ## 1. Download the ".RData" file for your dataset into your ADA Folder.
  ## 2. Use the load() statement for the dataset you want to use.
# read data example
load("NESARC.RData")

dim(NESARC)
[1] 43093  3008

Select the variables to include in our subset.

# variables to include in our data subset
nesarc_sub <-
  NESARC %>%
  select(
    IDNUM
  , SEX
  , AGE
  , CHECK321
  , TAB12MDX
  , S3AQ3B1
  , S3AQ3C1
  , ETHRACE2A
  , MAJORDEPLIFE
  , S1Q10A
  , S1Q10B
  , S1Q24FT
  , S1Q24IN
  , S1Q24LB
  )

# Check size and structure of data subset.
# Note that categorical variables are already `Factor` variables, but the levels do not have meaningful labels, yet.
# size of subset
dim(nesarc_sub)
[1] 43093    14
# structure of subset
str(nesarc_sub)
'data.frame':   43093 obs. of  14 variables:
 $ IDNUM       : Factor w/ 43093 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ SEX         : Factor w/ 2 levels "1","2": 1 2 2 1 1 2 1 2 2 2 ...
 $ AGE         : num  23 28 81 18 36 34 19 84 29 18 ...
 $ CHECK321    : Factor w/ 3 levels "1","2","9": NA NA NA NA NA NA NA NA NA NA ...
 $ TAB12MDX    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ S3AQ3B1     : Factor w/ 7 levels "1","2","3","4",..: NA NA NA NA NA NA NA NA NA NA ...
 $ S3AQ3C1     : num  NA NA NA NA NA NA NA NA NA NA ...
 $ ETHRACE2A   : Factor w/ 5 levels "1","2","3","4",..: 5 5 5 5 2 2 2 1 1 5 ...
 $ MAJORDEPLIFE: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 2 1 ...
 $ S1Q10A      : num  17500 11000 6000 27000 42000 500 2500 24000 65000 4000 ...
 $ S1Q10B      : Factor w/ 18 levels "0","1","2","3",..: 7 5 3 9 12 2 2 8 14 2 ...
 $ S1Q24FT     : num  5 5 5 5 6 5 5 5 5 5 ...
 $ S1Q24IN     : num  7 1 4 7 1 5 9 4 4 10 ...
 $ S1Q24LB     : num  195 127 185 180 220 140 160 120 140 150 ...

4.1.2 Renaming Variables (Class 03)

nesarc_sub <-
  nesarc_sub %>%
  rename(
    ID                  = IDNUM
  , Sex                 = SEX
  , Age                 = AGE
  , SmokingStatus       = CHECK321
  , NicotineDependence  = TAB12MDX
  , SmokingFreq         = S3AQ3B1
  , DailyCigsSmoked     = S3AQ3C1
  , Ethnicity           = ETHRACE2A
  , Depression          = MAJORDEPLIFE
  , IncomePersonal      = S1Q10A
  , IncomePersonalCat   = S1Q10B
  , Height_ft           = S1Q24FT
  , Height_in           = S1Q24IN
  , Weight_lb           = S1Q24LB
  )

str(nesarc_sub)
'data.frame':   43093 obs. of  14 variables:
 $ ID                : Factor w/ 43093 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ Sex               : Factor w/ 2 levels "1","2": 1 2 2 1 1 2 1 2 2 2 ...
 $ Age               : num  23 28 81 18 36 34 19 84 29 18 ...
 $ SmokingStatus     : Factor w/ 3 levels "1","2","9": NA NA NA NA NA NA NA NA NA NA ...
 $ NicotineDependence: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ SmokingFreq       : Factor w/ 7 levels "1","2","3","4",..: NA NA NA NA NA NA NA NA NA NA ...
 $ DailyCigsSmoked   : num  NA NA NA NA NA NA NA NA NA NA ...
 $ Ethnicity         : Factor w/ 5 levels "1","2","3","4",..: 5 5 5 5 2 2 2 1 1 5 ...
 $ Depression        : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 2 1 ...
 $ IncomePersonal    : num  17500 11000 6000 27000 42000 500 2500 24000 65000 4000 ...
 $ IncomePersonalCat : Factor w/ 18 levels "0","1","2","3",..: 7 5 3 9 12 2 2 8 14 2 ...
 $ Height_ft         : num  5 5 5 5 6 5 5 5 5 5 ...
 $ Height_in         : num  7 1 4 7 1 5 9 4 4 10 ...
 $ Weight_lb         : num  195 127 185 180 220 140 160 120 140 150 ...

4.1.3 Coding missing values (Class 04)

There are two steps. The first step is to recode any existing NAs to actual values, if necessary. The method for doing this differs for numeric and categorical variables. The second step is to recode any coded missing values, such as 9s or 99s, as actual NA.

4.1.3.1 Coding NAs as meaningful “missing”

Note

Most of you will not need to code this type of missing.

First step: the existing blank values were read in as NA, but here NA means “never”, and “never” has a meaning different from “missing”. For each variable we need to decide what “never” means and code it appropriately.

4.1.3.1.1 NAs recoded as numeric

The easiest is DailyCigsSmoked, where NA means 0 cigarettes, so we replace NA with 0. In base R, the function is.na() identifies each observation that “is an NA”; below, the replacement is done with tidyr’s replace_na(). I’ve updated the codebook to indicate that the original NA values were changed.

S3AQ3C1 (DailyCigsSmoked)
  USUAL QUANTITY WHEN SMOKED CIGARETTES
  Discrete
  17751 1-98. Cigarette(s)
    262 99. Unknown
  25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
          * change to 0 Cigarette(s)
table(nesarc_sub$DailyCigsSmoked)

   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
 934  884  923  573 1070  463  269  299   49 3077   23  230   34   25  851   40 
  17   18   19   20   21   22   23   24   25   27   28   29   30   33   34   35 
  22   59    5 5366    1   10    2    7  155    2    3    3  909    1    1   30 
  37   39   40   45   50   55   57   60   66   70   75   80   98   99 
   2    1  993    8  106    2    1  241    1   12    2   47   15  262 
nesarc_sub <-
  nesarc_sub %>%
  replace_na(
  # replace NA with new numeric value
    list(
      DailyCigsSmoked = 0
    )
  )

table(nesarc_sub$DailyCigsSmoked)

    0     1     2     3     4     5     6     7     8     9    10    11    12 
25080   934   884   923   573  1070   463   269   299    49  3077    23   230 
   13    14    15    16    17    18    19    20    21    22    23    24    25 
   34    25   851    40    22    59     5  5366     1    10     2     7   155 
   27    28    29    30    33    34    35    37    39    40    45    50    55 
    2     3     3   909     1     1    30     2     1   993     8   106     2 
   57    60    66    70    75    80    98    99 
    1   241     1    12     2    47    15   262 
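For comparison, a minimal base-R sketch of the same replacement using is.na(), equivalent to the replace_na() call above:

# base-R equivalent: assign 0 wherever DailyCigsSmoked "is an NA"
nesarc_sub$DailyCigsSmoked[is.na(nesarc_sub$DailyCigsSmoked)] <- 0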
4.1.3.1.2 NAs recoded as categorical

The next two variables require defining a new category level. For both SmokingStatus and SmokingFreq, a value of NA means “never”, so we need to create a new “Never” category and put it in order with the rest of the levels. I’ve updated the codebook to indicate that the original NA values were changed.

CHECK321 (SmokingStatus)
  CIGARETTE SMOKING STATUS
  Nominal
   9913 1. Smoked cigarettes in the past 12 months
   8078 2. Smoked cigarettes prior to the last 12 months
     22 9. Unknown
  25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
          * change to 3. Never

S3AQ3B1 (SmokingFreq)
  USUAL FREQUENCY WHEN SMOKED CIGARETTES
  Ordinal
  14836 1. Every day
    460 2. 5 to 6 Day(s) a week
    687 3. 3 to 4 Day(s) a week
    747 4. 1 to 2 Day(s) a week
    409 5. 2 to 3 Day(s) a month
    772 6. Once a month or less
    102 9. Unknown
  25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
          * change to 7. Never
table(nesarc_sub$SmokingStatus  )

   1    2    9 
9913 8078   22 
table(nesarc_sub$SmokingFreq    )

    1     2     3     4     5     6     9 
14836   460   687   747   409   772   102 
nesarc_sub <-
  nesarc_sub %>%
  mutate(
  # first change the "factor" variables to "character" to add a new level
    SmokingStatus   = as.character(SmokingStatus)
  , SmokingFreq     = as.character(SmokingFreq  )
  ) %>%
  replace_na(
  # replace NA with new category values
    list(
      SmokingStatus = "3"
    , SmokingFreq   = "7"
    )
  )

table(nesarc_sub$SmokingStatus  )

    1     2     3     9 
 9913  8078 25080    22 
table(nesarc_sub$SmokingFreq    )

    1     2     3     4     5     6     7     9 
14836   460   687   747   409   772 25080   102 

4.1.3.2 Coding 9s and 99s as NAs

Note

Everyone will need to do this.

Second step: replace the coded missing values, such as 9s and 99s, with NA (whether to then remove the remaining records with missing values is discussed in the next section). Not all of these variables need this operation, since the previous subset commands happened to remove some missing values already. However, I’m going to run each as an extra precaution. After all, if we later made different data decisions upstream, we could end up with incorrectly coded missing values at this point without this step!

table(nesarc_sub$SmokingStatus  )

    1     2     3     9 
 9913  8078 25080    22 
table(nesarc_sub$SmokingFreq    )

    1     2     3     4     5     6     7     9 
14836   460   687   747   409   772 25080   102 
table(nesarc_sub$DailyCigsSmoked)

    0     1     2     3     4     5     6     7     8     9    10    11    12 
25080   934   884   923   573  1070   463   269   299    49  3077    23   230 
   13    14    15    16    17    18    19    20    21    22    23    24    25 
   34    25   851    40    22    59     5  5366     1    10     2     7   155 
   27    28    29    30    33    34    35    37    39    40    45    50    55 
    2     3     3   909     1     1    30     2     1   993     8   106     2 
   57    60    66    70    75    80    98    99 
    1   241     1    12     2    47    15   262 
nesarc_sub <-
  nesarc_sub %>%
  mutate(
  # replace missing values, factor variables use quotes, numeric no quotes
    SmokingStatus   = replace(SmokingStatus  , SmokingStatus   %in% c("9"), NA)
  , SmokingFreq     = replace(SmokingFreq    , SmokingFreq     %in% c("9"), NA)
  , DailyCigsSmoked = replace(DailyCigsSmoked, DailyCigsSmoked %in% c( 99), NA)
  , Height_ft       = replace(Height_ft      , Height_ft       %in% c( 99), NA)
  , Height_in       = replace(Height_in      , Height_in       %in% c( 99), NA)
  , Weight_lb       = replace(Weight_lb      , Weight_lb       %in% c(999), NA)
  # drop unused factor levels (not for numeric)
  , SmokingStatus   = fct_drop(SmokingStatus)
  , SmokingFreq     = fct_drop(SmokingFreq  )
  )

table(nesarc_sub$SmokingStatus  )

    1     2     3 
 9913  8078 25080 
table(nesarc_sub$SmokingFreq    )

    1     2     3     4     5     6     7 
14836   460   687   747   409   772 25080 
table(nesarc_sub$DailyCigsSmoked)

    0     1     2     3     4     5     6     7     8     9    10    11    12 
25080   934   884   923   573  1070   463   269   299    49  3077    23   230 
   13    14    15    16    17    18    19    20    21    22    23    24    25 
   34    25   851    40    22    59     5  5366     1    10     2     7   155 
   27    28    29    30    33    34    35    37    39    40    45    50    55 
    2     3     3   909     1     1    30     2     1   993     8   106     2 
   57    60    66    70    75    80    98 
    1   241     1    12     2    47    15 
summary(nesarc_sub)
       ID        Sex            Age       SmokingStatus NicotineDependence
 1      :    1   1:18518   Min.   :18.0   1   : 9913    0:38131           
 2      :    1   2:24575   1st Qu.:32.0   2   : 8078    1: 4962           
 3      :    1             Median :44.0   3   :25080                      
 4      :    1             Mean   :46.4   NA's:   22                      
 5      :    1             3rd Qu.:59.0                                   
 6      :    1             Max.   :98.0                                   
 (Other):43087                                                            
  SmokingFreq    DailyCigsSmoked  Ethnicity Depression IncomePersonal   
 7      :25080   Min.   : 0.000   1:24507   0:35254    Min.   :      0  
 1      :14836   1st Qu.: 0.000   2: 8245   1: 7839    1st Qu.:   8800  
 6      :  772   Median : 0.000   3:  701              Median :  20000  
 4      :  747   Mean   : 6.462   4: 1332              Mean   :  28185  
 3      :  687   3rd Qu.:10.000   5: 8308              3rd Qu.:  36000  
 (Other):  869   Max.   :98.000                        Max.   :3000000  
 NA's   :  102   NA's   :262                                            
 IncomePersonalCat   Height_ft      Height_in        Weight_lb    
 6      : 3940     Min.   :4.00   Min.   : 0.000   Min.   : 62.0  
 7      : 3887     1st Qu.:5.00   1st Qu.: 3.000   1st Qu.:140.0  
 2      : 3823     Median :5.00   Median : 5.000   Median :165.0  
 4      : 3669     Mean   :5.11   Mean   : 5.279   Mean   :170.6  
 1      : 3571     3rd Qu.:5.00   3rd Qu.: 8.000   3rd Qu.:192.0  
 11     : 3291     Max.   :7.00   Max.   :11.000   Max.   :500.0  
 (Other):20912     NA's   :730    NA's   :761      NA's   :1376   

4.1.3.3 Removing rows with missing values (don’t do this)

Note

Do not do this.

Finally, consider using na.omit() to remove any records with missing values if it won’t cause issues with analysis. The command na.omit() removes a row if any columns have an NA in that row. This can have drastic consequences. For example, imagine that you have 7 variables and 6 variables have complete data, but the 7th variable is about three-quarters NAs! Then na.omit() will drop three-quarters of your data, even though most of your variables are complete!
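To make the mechanics concrete, here is a tiny sketch using made-up data (not NESARC) showing that na.omit() drops a row if any column has an NA:

# toy data frame with hypothetical values, for illustration only
demo <- data.frame(x = c(1, 2, NA, 4), y = c("a", NA, "c", "d"))
na.omit(demo)  # keeps only rows 1 and 4, where every column is complete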

There’s only one advantage I can think of for running na.omit(), and that’s if you have a large dataset and only a few observations have a few NAs. You can’t analyze data with NAs, since any calculation involving NA results in an NA. Imagine two scenarios: (1) an analysis with Variables 1 and 2, each with a few NAs distributed throughout; the observations with NA for either Variable 1 or 2 are removed first, then the analysis proceeds. (2) A similar analysis, but now with Variables 1 and 3, where Variable 3 has NAs for different observations than Variable 2; therefore, different observations are removed before analysis. In summary, these two analyses will include different observations from the dataset. By running na.omit() you start with a dataset with no NAs, so every analysis uses exactly the same set of observations (at the cost of not using all the possible data for each individual analysis).

Below, I do not run na.omit() since I’m concerned about having lots of NAs for some variables.

dim(nesarc_sub)
[1] 43093    14
# remove rows with missing values
#nesarc_sub <- na.omit(nesarc_sub)
dim(nesarc_sub)
[1] 43093    14
summary(nesarc_sub)
       ID        Sex            Age       SmokingStatus NicotineDependence
 1      :    1   1:18518   Min.   :18.0   1   : 9913    0:38131           
 2      :    1   2:24575   1st Qu.:32.0   2   : 8078    1: 4962           
 3      :    1             Median :44.0   3   :25080                      
 4      :    1             Mean   :46.4   NA's:   22                      
 5      :    1             3rd Qu.:59.0                                   
 6      :    1             Max.   :98.0                                   
 (Other):43087                                                            
  SmokingFreq    DailyCigsSmoked  Ethnicity Depression IncomePersonal   
 7      :25080   Min.   : 0.000   1:24507   0:35254    Min.   :      0  
 1      :14836   1st Qu.: 0.000   2: 8245   1: 7839    1st Qu.:   8800  
 6      :  772   Median : 0.000   3:  701              Median :  20000  
 4      :  747   Mean   : 6.462   4: 1332              Mean   :  28185  
 3      :  687   3rd Qu.:10.000   5: 8308              3rd Qu.:  36000  
 (Other):  869   Max.   :98.000                        Max.   :3000000  
 NA's   :  102   NA's   :262                                            
 IncomePersonalCat   Height_ft      Height_in        Weight_lb    
 6      : 3940     Min.   :4.00   Min.   : 0.000   Min.   : 62.0  
 7      : 3887     1st Qu.:5.00   1st Qu.: 3.000   1st Qu.:140.0  
 2      : 3823     Median :5.00   Median : 5.000   Median :165.0  
 4      : 3669     Mean   :5.11   Mean   : 5.279   Mean   :170.6  
 1      : 3571     3rd Qu.:5.00   3rd Qu.: 8.000   3rd Qu.:192.0  
 11     : 3291     Max.   :7.00   Max.   :11.000   Max.   :500.0  
 (Other):20912     NA's   :730    NA's   :761      NA's   :1376   

Now we (would) have a dataset with no missing values.

4.1.4 Creating new variables (Class 04+)

4.1.4.1 From categories to numeric

Note

Most of you will not need to do this.

DaysSmoke estimates the days per month a subject smokes by mapping each level of SmokingFreq (a factor with 7 levels, including never) to the midpoint of that level’s range in days per month using case_when(), assuming 4 weeks and 30 days per month.

nesarc_sub <-
  nesarc_sub %>%
  mutate(
    DaysSmoke =
      case_when(
        SmokingFreq == 1 ~ 30      # 1. Every day
      , SmokingFreq == 2 ~ 4 * 5.5 # 2. 5 to 6 Day(s) a week
      , SmokingFreq == 3 ~ 4 * 3.5 # 3. 3 to 4 Day(s) a week
      , SmokingFreq == 4 ~ 4 * 1.5 # 4. 1 to 2 Day(s) a week
      , SmokingFreq == 5 ~ 2.5     # 5. 2 to 3 Day(s) a month
      , SmokingFreq == 6 ~ 1       # 6. Once a month or less
      , SmokingFreq == 7 ~ 0       # 7. Never
      )
  )

summary(nesarc_sub$DaysSmoke)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   0.00    0.00    0.00   10.96   30.00   30.00     102 

This shows an example with many levels. Note that this dataset already contains a numeric income variable (IncomePersonal), but this categorical situation occurs frequently enough that it is worth showing an example.

table(nesarc_sub$IncomePersonalCat)

   0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
2462 3571 3823 2002 3669 1634 3940 3887 3085 3003 2351 3291 2059 1328  857  521 
  16   17 
 290 1320 
# create a numeric version of this categorical variable
nesarc_sub <-
  nesarc_sub %>%
  mutate(
    IncomePersonalCat_num =
      case_when(
        IncomePersonalCat ==  0 ~      0     #  $      0 (No personal income)
      , IncomePersonalCat ==  1 ~   2500     #  $      1 to $ 4,999
      , IncomePersonalCat ==  2 ~   6500     #  $  5,000 to $ 7,999
      , IncomePersonalCat ==  3 ~   9000     #  $  8,000 to $ 9,999
      , IncomePersonalCat ==  4 ~  11500     #  $ 10,000 to $12,999
      , IncomePersonalCat ==  5 ~  14000     #  $ 13,000 to $14,999
      , IncomePersonalCat ==  6 ~  17500     #  $ 15,000 to $19,999
      , IncomePersonalCat ==  7 ~  22500     #  $ 20,000 to $24,999
      , IncomePersonalCat ==  8 ~  27500     #  $ 25,000 to $29,999
      , IncomePersonalCat ==  9 ~  32500     #  $ 30,000 to $34,999
      , IncomePersonalCat == 10 ~  37500     #  $ 35,000 to $39,999
      , IncomePersonalCat == 11 ~  45000     #  $ 40,000 to $49,999
      , IncomePersonalCat == 12 ~  55000     #  $ 50,000 to $59,999
      , IncomePersonalCat == 13 ~  65000     #  $ 60,000 to $69,999
      , IncomePersonalCat == 14 ~  75000     #  $ 70,000 to $79,999
      , IncomePersonalCat == 15 ~  85000     #  $ 80,000 to $89,999
      , IncomePersonalCat == 16 ~  95000     #  $ 90,000 to $99,999
      , IncomePersonalCat == 17 ~ 120000     #  $100,000 or more
      )
  )

table(nesarc_sub$IncomePersonalCat_num)

     0   2500   6500   9000  11500  14000  17500  22500  27500  32500  37500 
  2462   3571   3823   2002   3669   1634   3940   3887   3085   3003   2351 
 45000  55000  65000  75000  85000  95000 120000 
  3291   2059   1328    857    521    290   1320 

4.1.4.2 From numeric to numeric

Height_inches is the height of a person in inches. Since the data were collected in two variables (Height_ft and Height_in), we need to combine them for our new variable: convert feet to inches and add, Height_ft * 12 + Height_in.

nesarc_sub <-
  nesarc_sub %>%
  mutate(
    Height_inches = Height_ft * 12 + Height_in
  )

summary(nesarc_sub$Height_inches)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   48.0    64.0    66.0    66.6    70.0    84.0     761 
Note

Most of you will need to transform a variable to address extreme right skewness.

The variable TotalCigsSmoked estimates the number of cigarettes a subject smokes per month by multiplying DaysSmoke times DailyCigsSmoked (the usual quantity smoked per day).

Log scale. I like log2 since its interpretation is that every unit is a doubling. Because log(0) is undefined, in the code below I set values of 0 to NA before computing log2(TotalCigsSmoked); an alternative is to add 1 before taking the log, so that log(0 + 1) = 0 and all values are defined. (Note that DaysSmoke_log2 is computed without either protection, so its 0s become -Inf, which you will see in the summaries below.)
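A minimal sketch of the add-1 alternative (the variable TotalCigsSmoked_log2p1 is illustrative and not used elsewhere in this document):

# hypothetical alternative: shift by 1 so that 0 maps to log2(0 + 1) = 0
nesarc_sub <-
  nesarc_sub %>%
  mutate(
    TotalCigsSmoked_log2p1 = log2(TotalCigsSmoked + 1)
  )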

Square root scale. Square root is another transformation that spreads out values between 0 and 1 and compresses values from 1 to infinity. It can be preferred to a log transformation in some cases.

nesarc_sub <-
  nesarc_sub %>%
  mutate(
    DaysSmoke_log2       = log2(DaysSmoke)
  , TotalCigsSmoked      = DaysSmoke * DailyCigsSmoked
  , TotalCigsSmoked_log2 = ifelse(TotalCigsSmoked > 0, log2(TotalCigsSmoked), NA)
  , TotalCigsSmoked_sqrt = sqrt(TotalCigsSmoked)
  , TotalCigsSmoked_smokers      = ifelse(TotalCigsSmoked > 0, TotalCigsSmoked, NA)
  , TotalCigsSmoked_smokers_log2 = TotalCigsSmoked_log2
  )

summary(nesarc_sub$TotalCigsSmoked)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
    0.0     0.0     0.0   187.5   300.0  2940.0     297 
summary(nesarc_sub$TotalCigsSmoked_smokers)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
    1.0   150.0   360.0   453.1   600.0  2940.0   25377 

4.1.4.3 From numeric to categories

The numeric variable TotalCigsSmoked is converted into a factor with five levels.

nesarc_sub <-
  nesarc_sub %>%
  mutate(
    CigsSmokedFac  =
      case_when(
        TotalCigsSmoked == 0                              ~ "No smoking (0)"
      , (TotalCigsSmoked >  0) & (TotalCigsSmoked <=   6) ~ "Lowest smoking (1-6)"
      , (TotalCigsSmoked >  6) & (TotalCigsSmoked <=  30) ~ "Low smoking (7-30)"
      , (TotalCigsSmoked > 30) & (TotalCigsSmoked <= 300) ~ "Medium smoking (31-300)"
      , (TotalCigsSmoked > 300)                           ~ "High smoking (301+)"
      )
  , CigsSmokedFac =
      factor(
        CigsSmokedFac
      , levels = c(
            "No smoking (0)"
          , "Lowest smoking (1-6)"
          , "Low smoking (7-30)"
          , "Medium smoking (31-300)"
          , "High smoking (301+)"
          )
      )
  )

summary(nesarc_sub$TotalCigsSmoked)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
    0.0     0.0     0.0   187.5   300.0  2940.0     297 
summary(nesarc_sub$CigsSmokedFac)
         No smoking (0)    Lowest smoking (1-6)      Low smoking (7-30) 
                  25080                    1045                    1129 
Medium smoking (31-300)     High smoking (301+)                    NA's 
                   6456                    9086                     297 
table(nesarc_sub$CigsSmokedFac)

         No smoking (0)    Lowest smoking (1-6)      Low smoking (7-30) 
                  25080                    1045                    1129 
Medium smoking (31-300)     High smoking (301+) 
                   6456                    9086 
Finer points

Cutting a numeric variable into equal-sized categories.

The numeric variable TotalCigsSmoked is converted into a factor with four roughly equal-sized groups (quartiles) stored in the levels of the factor CigsSmokedFac2, using quantile() to compute the break points and .bincode() to assign each value to a bin. The label "Q1" indicates that the original value was in the first quarter (25%) of the ordered data, that is, among the smallest 25% of observations; "Q4" indicates the original value was in the last quarter (25%) of the ordered data, that is, among the largest 25% of observations.
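As an aside, here is why cut() is not used: the 0%, 25%, and 50% quantiles are all 0, and cut() requires unique breaks. The call is left commented out so compiling is not interrupted:

# cut() errors when breaks are not unique:
# cut(nesarc_sub$TotalCigsSmoked,
#     breaks = quantile(nesarc_sub$TotalCigsSmoked, na.rm = TRUE))
# Error in cut.default(...) : 'breaks' are not unique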

quantile(nesarc_sub$TotalCigsSmoked, na.rm = TRUE)
  0%  25%  50%  75% 100% 
   0    0    0  300 2940 
nesarc_sub <-
  nesarc_sub %>%
  mutate(
    CigsSmokedFac2 =
      .bincode(TotalCigsSmoked
      , breaks = quantile(TotalCigsSmoked, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)
      , include.lowest = TRUE
      )
  , CigsSmokedFac2 = factor( CigsSmokedFac2
                          , levels = c(1, 2, 3, 4)
                          , labels = c("Q1", "Q2", "Q3", "Q4")
                          )
  )

In my data, more than half of the values are 0s, which is the minimum value, thus they all get grouped in “Q1”; this leaves no values for “Q2” – oh well.

summary(nesarc_sub$CigsSmokedFac2)
   Q1    Q2    Q3    Q4  NA's 
25080     0  8630  9086   297 

4.1.4.4 From many categories to a few

The variable SmokingFreq3 collapses SmokingFreq from 7 levels down to 4 (the six smoking frequencies collapse to three: daily, weekly, and monthly; the never level is kept). This is an example to show you how to do this with your own variables. Some of you will have categorical variables that have a dozen or so categories; having so many categories makes interpretation very difficult, and fewer is easier.

## SmokingFreq3 variable
# notice that I'm doing this before I've assigned the labels
#   if you do this after assigning labels (with the factor() command), then
#   you'll collapse using the label text instead of the level numbers
# inspect original variable
table(nesarc_sub$SmokingFreq)

    1     2     3     4     5     6     7 
14836   460   687   747   409   772 25080 
# Note: it is sometimes easier to use the numeric "levels" of a factor variable, than the character "labels"

# collapse the levels into a new variable
nesarc_sub <-
  nesarc_sub %>%
  mutate(
    SmokingFreq3 =
      fct_collapse(
        SmokingFreq
      , "1" = c("1", "2", "3")  # 1. daily
      , "2" = c("4", "5")       # 2. weekly
      , "3" = "6"               # 3. monthly
      , "4" = "7"               # 7. never
      )
  )

# new variable
table(nesarc_sub$SmokingFreq3)

    1     2     3     4 
15983  1156   772 25080 
# we will label the new factor variable below, with the others

4.1.4.5 Working with Dates

Note

Most of you will not need to work with dates.

How to calculate the duration between two dates.

(You probably won’t need to use this for your project.)

Dates can be tricky in R. The lubridate package makes them much easier.

Below I’m going to create a variable for the person’s age in years based on the difference between the person’s date of birth and the interview date. There is an AGE variable, which I’m already using, but if there wasn’t, or I wanted to be more precise about their age (such as 24.34 years old), then this is a way to do it.

I’m going to create a separate data frame just for this example to show how to do it; if you needed these variables, they should be part of your original subset above. Below, the “C” variables (CDAY) are the interview day, month, and year, and the “DOB” variables (DOBD) are the date of birth day, month, and year.

dat_sub <-
  NESARC %>%
  select(
    AGE
  , CDAY
  , CMON
  , CYEAR
  , DOBD
  , DOBM
  , DOBY
  )
str(dat_sub)
'data.frame':   43093 obs. of  7 variables:
 $ AGE  : num  23 28 81 18 36 34 19 84 29 18 ...
 $ CDAY : num  14 12 23 9 18 7 1 18 9 16 ...
 $ CMON : num  8 1 11 9 10 9 9 9 1 8 ...
 $ CYEAR: num  2001 2002 2001 2001 2001 ...
 $ DOBD : num  27 15 28 27 11 28 16 10 7 25 ...
 $ DOBM : num  9 12 12 6 9 12 3 11 4 3 ...
 $ DOBY : num  1977 1973 1919 1983 1965 ...

Combine the date fields together, then convert them into a date object. I’ll do this for both the interview date and the date of birth. I print the head (first 6 observations) after all the operations are complete.

dat_sub <-
  dat_sub %>%
  mutate(
    ## interview date
    # combine day, month, and year into one text string
    cdate_text = paste(CDAY, CMON, CYEAR)
    # create date object by interpreting the day, month, year with dmy()
  , cdate_date = dmy(cdate_text)

    ## date of birth
    # combine day, month, and year into one text string
  , dob_text = paste(DOBD, DOBM, DOBY)
    # create date object by interpreting the day, month, year with dmy()
      # if any failed to parse, it's because a day, month, or year is out of range
      #    such as "99" for missing values or refused to answer
      # in these cases, the date is NA (missing), which is what we want
  , dob_date = dmy(dob_text)
  )
Warning: 1094 failed to parse.
head(dat_sub)
  AGE CDAY CMON CYEAR DOBD DOBM DOBY cdate_text cdate_date   dob_text
1  23   14    8  2001   27    9 1977  14 8 2001 2001-08-14  27 9 1977
2  28   12    1  2002   15   12 1973  12 1 2002 2002-01-12 15 12 1973
3  81   23   11  2001   28   12 1919 23 11 2001 2001-11-23 28 12 1919
4  18    9    9  2001   27    6 1983   9 9 2001 2001-09-09  27 6 1983
5  36   18   10  2001   11    9 1965 18 10 2001 2001-10-18  11 9 1965
6  34    7    9  2001   28   12 1966   7 9 2001 2001-09-07 28 12 1966
    dob_date
1 1977-09-27
2 1973-12-15
3 1919-12-28
4 1983-06-27
5 1965-09-11
6 1966-12-28

Now that we have the interview date and date of birth as date objects, we calculate the difference to give the person’s age at the time of the interview. Because the difference of dates is in the unit “days”, we convert to years.

dat_sub <-
  dat_sub %>%
  mutate(
    age_days  = cdate_date - dob_date
  , age_years = as.numeric(age_days / 365.25)  # change from "difftime" to "numeric"
    # difference between the age we calculated and the age in the original dataset
  , age_diff  = age_years - AGE
  )
head(dat_sub)
  AGE CDAY CMON CYEAR DOBD DOBM DOBY cdate_text cdate_date   dob_text
1  23   14    8  2001   27    9 1977  14 8 2001 2001-08-14  27 9 1977
2  28   12    1  2002   15   12 1973  12 1 2002 2002-01-12 15 12 1973
3  81   23   11  2001   28   12 1919 23 11 2001 2001-11-23 28 12 1919
4  18    9    9  2001   27    6 1983   9 9 2001 2001-09-09  27 6 1983
5  36   18   10  2001   11    9 1965 18 10 2001 2001-10-18  11 9 1965
6  34    7    9  2001   28   12 1966   7 9 2001 2001-09-07 28 12 1966
    dob_date   age_days age_years   age_diff
1 1977-09-27  8722 days  23.87953 0.87953457
2 1973-12-15 10255 days  28.07666 0.07665982
3 1919-12-28 29916 days  81.90554 0.90554415
4 1983-06-27  6649 days  18.20397 0.20396988
5 1965-09-11 13186 days  36.10130 0.10130048
6 1966-12-28 12672 days  34.69405 0.69404517

Finally, keep the new age variable in your data frame, drop (unselect) the variables you don’t need anymore, and you’re ready to go!

# drop the original ones you don't need anymore
#   select(-VAR) excludes the VAR column
dat_sub <-
  dat_sub %>%
  select(
    -c(CDAY, CMON, CYEAR, DOBD, DOBM, DOBY)
  # -c(cdate_text, cdate_date, dob_text, dob_date)
  , -c(age_days, age_diff)
  )
str(dat_sub)
'data.frame':   43093 obs. of  6 variables:
 $ AGE       : num  23 28 81 18 36 34 19 84 29 18 ...
 $ cdate_text: chr  "14 8 2001" "12 1 2002" "23 11 2001" "9 9 2001" ...
 $ cdate_date: Date, format: "2001-08-14" "2002-01-12" ...
 $ dob_text  : chr  "27 9 1977" "15 12 1973" "28 12 1919" "27 6 1983" ...
 $ dob_date  : Date, format: "1977-09-27" "1973-12-15" ...
 $ age_years : num  23.9 28.1 81.9 18.2 36.1 ...

Now you’ve got an age (or other date object) you can use in your project.

(For this example, I’m going to delete the dat_sub object since it was just for this demonstration.)

rm(dat_sub)

4.1.4.6 Review results of new variables

summary(nesarc_sub)
       ID        Sex            Age       SmokingStatus NicotineDependence
 1      :    1   1:18518   Min.   :18.0   1   : 9913    0:38131           
 2      :    1   2:24575   1st Qu.:32.0   2   : 8078    1: 4962           
 3      :    1             Median :44.0   3   :25080                      
 4      :    1             Mean   :46.4   NA's:   22                      
 5      :    1             3rd Qu.:59.0                                   
 6      :    1             Max.   :98.0                                   
 (Other):43087                                                            
  SmokingFreq    DailyCigsSmoked  Ethnicity Depression IncomePersonal   
 7      :25080   Min.   : 0.000   1:24507   0:35254    Min.   :      0  
 1      :14836   1st Qu.: 0.000   2: 8245   1: 7839    1st Qu.:   8800  
 6      :  772   Median : 0.000   3:  701              Median :  20000  
 4      :  747   Mean   : 6.462   4: 1332              Mean   :  28185  
 3      :  687   3rd Qu.:10.000   5: 8308              3rd Qu.:  36000  
 (Other):  869   Max.   :98.000                        Max.   :3000000  
 NA's   :  102   NA's   :262                                            
 IncomePersonalCat   Height_ft      Height_in        Weight_lb    
 6      : 3940     Min.   :4.00   Min.   : 0.000   Min.   : 62.0  
 7      : 3887     1st Qu.:5.00   1st Qu.: 3.000   1st Qu.:140.0  
 2      : 3823     Median :5.00   Median : 5.000   Median :165.0  
 4      : 3669     Mean   :5.11   Mean   : 5.279   Mean   :170.6  
 1      : 3571     3rd Qu.:5.00   3rd Qu.: 8.000   3rd Qu.:192.0  
 11     : 3291     Max.   :7.00   Max.   :11.000   Max.   :500.0  
 (Other):20912     NA's   :730    NA's   :761      NA's   :1376   
   DaysSmoke     IncomePersonalCat_num Height_inches  DaysSmoke_log2 
 Min.   : 0.00   Min.   :     0        Min.   :48.0   Min.   : -Inf  
 1st Qu.: 0.00   1st Qu.:  9000        1st Qu.:64.0   1st Qu.: -Inf  
 Median : 0.00   Median : 22500        Median :66.0   Median : -Inf  
 Mean   :10.96   Mean   : 27523        Mean   :66.6   Mean   : -Inf  
 3rd Qu.:30.00   3rd Qu.: 37500        3rd Qu.:70.0   3rd Qu.:4.907  
 Max.   :30.00   Max.   :120000        Max.   :84.0   Max.   :4.907  
 NA's   :102                           NA's   :761    NA's   :102    
 TotalCigsSmoked  TotalCigsSmoked_log2 TotalCigsSmoked_sqrt
 Min.   :   0.0   Min.   : 0.000       Min.   : 0.000      
 1st Qu.:   0.0   1st Qu.: 7.229       1st Qu.: 0.000      
 Median :   0.0   Median : 8.492       Median : 0.000      
 Mean   : 187.5   Mean   : 7.888       Mean   : 7.887      
 3rd Qu.: 300.0   3rd Qu.: 9.229       3rd Qu.:17.321      
 Max.   :2940.0   Max.   :11.522       Max.   :54.222      
 NA's   :297      NA's   :25377        NA's   :297         
 TotalCigsSmoked_smokers TotalCigsSmoked_smokers_log2
 Min.   :   1.0          Min.   : 0.000              
 1st Qu.: 150.0          1st Qu.: 7.229              
 Median : 360.0          Median : 8.492              
 Mean   : 453.1          Mean   : 7.888              
 3rd Qu.: 600.0          3rd Qu.: 9.229              
 Max.   :2940.0          Max.   :11.522              
 NA's   :25377           NA's   :25377               
                 CigsSmokedFac   CigsSmokedFac2 SmokingFreq3
 No smoking (0)         :25080   Q1  :25080     1   :15983  
 Lowest smoking (1-6)   : 1045   Q2  :    0     2   : 1156  
 Low smoking (7-30)     : 1129   Q3  : 8630     3   :  772  
 Medium smoking (31-300): 6456   Q4  : 9086     4   :25080  
 High smoking (301+)    : 9086   NA's:  297     NA's:  102  
 NA's                   :  297                              
                                                            

4.1.5 Labeling Categorical variable levels (Class 04)

Look at summary of dataset to see the state of the labeling before we make changes.

summary(nesarc_sub)
       ID        Sex            Age       SmokingStatus NicotineDependence
 1      :    1   1:18518   Min.   :18.0   1   : 9913    0:38131           
 2      :    1   2:24575   1st Qu.:32.0   2   : 8078    1: 4962           
 3      :    1             Median :44.0   3   :25080                      
 4      :    1             Mean   :46.4   NA's:   22                      
 5      :    1             3rd Qu.:59.0                                   
 6      :    1             Max.   :98.0                                   
 (Other):43087                                                            
  SmokingFreq    DailyCigsSmoked  Ethnicity Depression IncomePersonal   
 7      :25080   Min.   : 0.000   1:24507   0:35254    Min.   :      0  
 1      :14836   1st Qu.: 0.000   2: 8245   1: 7839    1st Qu.:   8800  
 6      :  772   Median : 0.000   3:  701              Median :  20000  
 4      :  747   Mean   : 6.462   4: 1332              Mean   :  28185  
 3      :  687   3rd Qu.:10.000   5: 8308              3rd Qu.:  36000  
 (Other):  869   Max.   :98.000                        Max.   :3000000  
 NA's   :  102   NA's   :262                                            
 IncomePersonalCat   Height_ft      Height_in        Weight_lb    
 6      : 3940     Min.   :4.00   Min.   : 0.000   Min.   : 62.0  
 7      : 3887     1st Qu.:5.00   1st Qu.: 3.000   1st Qu.:140.0  
 2      : 3823     Median :5.00   Median : 5.000   Median :165.0  
 4      : 3669     Mean   :5.11   Mean   : 5.279   Mean   :170.6  
 1      : 3571     3rd Qu.:5.00   3rd Qu.: 8.000   3rd Qu.:192.0  
 11     : 3291     Max.   :7.00   Max.   :11.000   Max.   :500.0  
 (Other):20912     NA's   :730    NA's   :761      NA's   :1376   
   DaysSmoke     IncomePersonalCat_num Height_inches  DaysSmoke_log2 
 Min.   : 0.00   Min.   :     0        Min.   :48.0   Min.   : -Inf  
 1st Qu.: 0.00   1st Qu.:  9000        1st Qu.:64.0   1st Qu.: -Inf  
 Median : 0.00   Median : 22500        Median :66.0   Median : -Inf  
 Mean   :10.96   Mean   : 27523        Mean   :66.6   Mean   : -Inf  
 3rd Qu.:30.00   3rd Qu.: 37500        3rd Qu.:70.0   3rd Qu.:4.907  
 Max.   :30.00   Max.   :120000        Max.   :84.0   Max.   :4.907  
 NA's   :102                           NA's   :761    NA's   :102    
 TotalCigsSmoked  TotalCigsSmoked_log2 TotalCigsSmoked_sqrt
 Min.   :   0.0   Min.   : 0.000       Min.   : 0.000      
 1st Qu.:   0.0   1st Qu.: 7.229       1st Qu.: 0.000      
 Median :   0.0   Median : 8.492       Median : 0.000      
 Mean   : 187.5   Mean   : 7.888       Mean   : 7.887      
 3rd Qu.: 300.0   3rd Qu.: 9.229       3rd Qu.:17.321      
 Max.   :2940.0   Max.   :11.522       Max.   :54.222      
 NA's   :297      NA's   :25377        NA's   :297         
 TotalCigsSmoked_smokers TotalCigsSmoked_smokers_log2
 Min.   :   1.0          Min.   : 0.000              
 1st Qu.: 150.0          1st Qu.: 7.229              
 Median : 360.0          Median : 8.492              
 Mean   : 453.1          Mean   : 7.888              
 3rd Qu.: 600.0          3rd Qu.: 9.229              
 Max.   :2940.0          Max.   :11.522              
 NA's   :25377           NA's   :25377               
                 CigsSmokedFac   CigsSmokedFac2 SmokingFreq3
 No smoking (0)         :25080   Q1  :25080     1   :15983  
 Lowest smoking (1-6)   : 1045   Q2  :    0     2   : 1156  
 Low smoking (7-30)     : 1129   Q3  : 8630     3   :  772  
 Medium smoking (31-300): 6456   Q4  : 9086     4   :25080  
 High smoking (301+)    : 9086   NA's:  297     NA's:  102  
 NA's                   :  297                              
                                                            

Informative labels are given to the factor levels. The order of the levels is also rearranged for the variables SmokingFreq, SmokingFreq3, NicotineDependence, and Sex.

# Label factors with meaningful names in the same order as the codebook
nesarc_sub$Depression <-
  factor(
    nesarc_sub$Depression
  , labels = c("No Depression"
             , "Yes Depression"
    )
  )

nesarc_sub$SmokingStatus <-
  factor(
    nesarc_sub$SmokingStatus
  , labels = c("Smoked past 12 months"
             , "Smoked prior to 12 months"
             , "Never smoked"
    )
  )

# Specify the order of the levels to be different from codebook order
nesarc_sub$Sex <-
  factor(nesarc_sub$Sex
  , levels = c( 2
              , 1
              )
  , labels = c( "Female"
              , "Male"
              )
  )
# check ordering with a frequency table
table(nesarc_sub$Sex)

Female   Male 
 24575  18518 
nesarc_sub$NicotineDependence <-
  factor(nesarc_sub$NicotineDependence
  , levels = c( 1
              , 0
              )
  , labels = c( "Nicotine Dependence"
              , "No Nicotine Dependence"
              )
  )


# Remember to include the new category level from the original `NA` values.

nesarc_sub$SmokingFreq <-
  factor(nesarc_sub$SmokingFreq
  , levels = c( 7
              , 6
              , 5
              , 4
              , 3
              , 2
              , 1
              )
  , labels = c( "Never"
              , "Once a month or less"
              , "2 to 3 Days/month"
              , "1 to 2 Days/week"
              , "3 to 4 Days/week"
              , "5 to 6 Days/week"
              , "Every Day"
              )
  )


nesarc_sub$Ethnicity <-
  factor(nesarc_sub$Ethnicity
    # shorter labels are helpful for plots
  , labels = c( "Cauc"
              , "AfAm"
              , "NaAm"
              , "Asia"
              , "Hisp"
              )
  # , labels = c("Caucasian"
  #            , "African American"
  #            , "Native American"
  #            , "Asian"
  #            , "Hispanic"
  #             )
)

nesarc_sub$SmokingFreq3 <-
  factor(nesarc_sub$SmokingFreq3
  , levels = c( 4
              , 3
              , 2
              , 1
              )
  , labels = c( "Never"
              , "Monthly"
              , "Weekly"
              , "Daily"
              )
  )


summary(nesarc_sub)
       ID            Sex             Age      
 1      :    1   Female:24575   Min.   :18.0  
 2      :    1   Male  :18518   1st Qu.:32.0  
 3      :    1                  Median :44.0  
 4      :    1                  Mean   :46.4  
 5      :    1                  3rd Qu.:59.0  
 6      :    1                  Max.   :98.0  
 (Other):43087                                
                   SmokingStatus                NicotineDependence
 Smoked past 12 months    : 9913   Nicotine Dependence   : 4962   
 Smoked prior to 12 months: 8078   No Nicotine Dependence:38131   
 Never smoked             :25080                                  
 NA's                     :   22                                  
                                                                  
                                                                  
                                                                  
               SmokingFreq    DailyCigsSmoked  Ethnicity   
 Never               :25080   Min.   : 0.000   Cauc:24507  
 Every Day           :14836   1st Qu.: 0.000   AfAm: 8245  
 Once a month or less:  772   Median : 0.000   NaAm:  701  
 1 to 2 Days/week    :  747   Mean   : 6.462   Asia: 1332  
 3 to 4 Days/week    :  687   3rd Qu.:10.000   Hisp: 8308  
 (Other)             :  869   Max.   :98.000               
 NA's                :  102   NA's   :262                  
          Depression    IncomePersonal    IncomePersonalCat   Height_ft   
 No Depression :35254   Min.   :      0   6      : 3940     Min.   :4.00  
 Yes Depression: 7839   1st Qu.:   8800   7      : 3887     1st Qu.:5.00  
                        Median :  20000   2      : 3823     Median :5.00  
                        Mean   :  28185   4      : 3669     Mean   :5.11  
                        3rd Qu.:  36000   1      : 3571     3rd Qu.:5.00  
                        Max.   :3000000   11     : 3291     Max.   :7.00  
                                          (Other):20912     NA's   :730   
   Height_in        Weight_lb       DaysSmoke     IncomePersonalCat_num
 Min.   : 0.000   Min.   : 62.0   Min.   : 0.00   Min.   :     0       
 1st Qu.: 3.000   1st Qu.:140.0   1st Qu.: 0.00   1st Qu.:  9000       
 Median : 5.000   Median :165.0   Median : 0.00   Median : 22500       
 Mean   : 5.279   Mean   :170.6   Mean   :10.96   Mean   : 27523       
 3rd Qu.: 8.000   3rd Qu.:192.0   3rd Qu.:30.00   3rd Qu.: 37500       
 Max.   :11.000   Max.   :500.0   Max.   :30.00   Max.   :120000       
 NA's   :761      NA's   :1376    NA's   :102                          
 Height_inches  DaysSmoke_log2  TotalCigsSmoked  TotalCigsSmoked_log2
 Min.   :48.0   Min.   : -Inf   Min.   :   0.0   Min.   : 0.000      
 1st Qu.:64.0   1st Qu.: -Inf   1st Qu.:   0.0   1st Qu.: 7.229      
 Median :66.0   Median : -Inf   Median :   0.0   Median : 8.492      
 Mean   :66.6   Mean   : -Inf   Mean   : 187.5   Mean   : 7.888      
 3rd Qu.:70.0   3rd Qu.:4.907   3rd Qu.: 300.0   3rd Qu.: 9.229      
 Max.   :84.0   Max.   :4.907   Max.   :2940.0   Max.   :11.522      
 NA's   :761    NA's   :102     NA's   :297      NA's   :25377       
 TotalCigsSmoked_sqrt TotalCigsSmoked_smokers TotalCigsSmoked_smokers_log2
 Min.   : 0.000       Min.   :   1.0          Min.   : 0.000              
 1st Qu.: 0.000       1st Qu.: 150.0          1st Qu.: 7.229              
 Median : 0.000       Median : 360.0          Median : 8.492              
 Mean   : 7.887       Mean   : 453.1          Mean   : 7.888              
 3rd Qu.:17.321       3rd Qu.: 600.0          3rd Qu.: 9.229              
 Max.   :54.222       Max.   :2940.0          Max.   :11.522              
 NA's   :297          NA's   :25377           NA's   :25377               
                 CigsSmokedFac   CigsSmokedFac2  SmokingFreq3  
 No smoking (0)         :25080   Q1  :25080     Never  :25080  
 Lowest smoking (1-6)   : 1045   Q2  :    0     Monthly:  772  
 Low smoking (7-30)     : 1129   Q3  : 8630     Weekly : 1156  
 Medium smoking (31-300): 6456   Q4  : 9086     Daily  :15983  
 High smoking (301+)    : 9086   NA's:  297     NA's   :  102  
 NA's                   :  297                                 
                                                               

4.1.6 Data subset rows

Note

Most of you will not need to subset your data.

Finer points

By subsetting your data, you can focus on questions related to a specific subset of the population.

Data subset rows to young smokers

Only as an example (not used in the rest of the assignment), consider restricting our attention to only young smokers.

In practice, I would probably subset the data much earlier. In this example I waited until the end to subset so that I could illustrate all of the other steps with the complete dataset. You may also want to subset into a new dataset object name to keep it separate from the full dataset.

In our example, we are interested in the subset of people 25 or younger who smoked in the last 12 months and this is created using the filter() function. Note that SmokingStatus == "Smoked past 12 months" is used to choose people who smoked in the past 12 months.

Notice that rows with missing values are dropped, since a comparison with NA evaluates to NA and filter() keeps only rows where the condition is TRUE.

# the filter command below will not include a row that evaluates to NA for the
# specified condition.  For example:
  (NA == 1)
[1] NA
# age and smoking subset
nesarc_sub_young_smokers <-
  nesarc_sub %>%
  filter(
    Age <= 25                                 # people 25 or younger
  , SmokingStatus == "Smoked past 12 months"  # smoked in the past 12 months
  )
dim(nesarc_sub_young_smokers)
[1] 1706   26
summary(nesarc_sub_young_smokers$Age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  18.00   20.00   22.00   21.61   23.00   25.00 
summary(nesarc_sub_young_smokers$SmokingStatus)
    Smoked past 12 months Smoked prior to 12 months              Never smoked 
                     1706                         0                         0 

4.2 Data is complete (Class 04)

4.2.1 Plot entire dataset, show missing values

Only keep one of these for your assignment. I think the second one is much more informative.

e_plot_missing(nesarc_sub)
Warning: Using alpha for a discrete variable is not advised.

Adding additional options can help understand the structure of the missing values.

e_plot_missing(
    dat_plot      = nesarc_sub
  , var_group     = "Sex"
  , sw_group_sort = TRUE
  , var2_sort     = "TotalCigsSmoked"
)
Warning: Using alpha for a discrete variable is not advised.

4.2.2 Numerical summaries to assess correctness (Class 03)

summary(nesarc_sub)
       ID            Sex             Age      
 1      :    1   Female:24575   Min.   :18.0  
 2      :    1   Male  :18518   1st Qu.:32.0  
 3      :    1                  Median :44.0  
 4      :    1                  Mean   :46.4  
 5      :    1                  3rd Qu.:59.0  
 6      :    1                  Max.   :98.0  
 (Other):43087                                
                   SmokingStatus                NicotineDependence
 Smoked past 12 months    : 9913   Nicotine Dependence   : 4962   
 Smoked prior to 12 months: 8078   No Nicotine Dependence:38131   
 Never smoked             :25080                                  
 NA's                     :   22                                  
                                                                  
                                                                  
                                                                  
               SmokingFreq    DailyCigsSmoked  Ethnicity   
 Never               :25080   Min.   : 0.000   Cauc:24507  
 Every Day           :14836   1st Qu.: 0.000   AfAm: 8245  
 Once a month or less:  772   Median : 0.000   NaAm:  701  
 1 to 2 Days/week    :  747   Mean   : 6.462   Asia: 1332  
 3 to 4 Days/week    :  687   3rd Qu.:10.000   Hisp: 8308  
 (Other)             :  869   Max.   :98.000               
 NA's                :  102   NA's   :262                  
          Depression    IncomePersonal    IncomePersonalCat   Height_ft   
 No Depression :35254   Min.   :      0   6      : 3940     Min.   :4.00  
 Yes Depression: 7839   1st Qu.:   8800   7      : 3887     1st Qu.:5.00  
                        Median :  20000   2      : 3823     Median :5.00  
                        Mean   :  28185   4      : 3669     Mean   :5.11  
                        3rd Qu.:  36000   1      : 3571     3rd Qu.:5.00  
                        Max.   :3000000   11     : 3291     Max.   :7.00  
                                          (Other):20912     NA's   :730   
   Height_in        Weight_lb       DaysSmoke     IncomePersonalCat_num
 Min.   : 0.000   Min.   : 62.0   Min.   : 0.00   Min.   :     0       
 1st Qu.: 3.000   1st Qu.:140.0   1st Qu.: 0.00   1st Qu.:  9000       
 Median : 5.000   Median :165.0   Median : 0.00   Median : 22500       
 Mean   : 5.279   Mean   :170.6   Mean   :10.96   Mean   : 27523       
 3rd Qu.: 8.000   3rd Qu.:192.0   3rd Qu.:30.00   3rd Qu.: 37500       
 Max.   :11.000   Max.   :500.0   Max.   :30.00   Max.   :120000       
 NA's   :761      NA's   :1376    NA's   :102                          
 Height_inches  DaysSmoke_log2  TotalCigsSmoked  TotalCigsSmoked_log2
 Min.   :48.0   Min.   : -Inf   Min.   :   0.0   Min.   : 0.000      
 1st Qu.:64.0   1st Qu.: -Inf   1st Qu.:   0.0   1st Qu.: 7.229      
 Median :66.0   Median : -Inf   Median :   0.0   Median : 8.492      
 Mean   :66.6   Mean   : -Inf   Mean   : 187.5   Mean   : 7.888      
 3rd Qu.:70.0   3rd Qu.:4.907   3rd Qu.: 300.0   3rd Qu.: 9.229      
 Max.   :84.0   Max.   :4.907   Max.   :2940.0   Max.   :11.522      
 NA's   :761    NA's   :102     NA's   :297      NA's   :25377       
 TotalCigsSmoked_sqrt TotalCigsSmoked_smokers TotalCigsSmoked_smokers_log2
 Min.   : 0.000       Min.   :   1.0          Min.   : 0.000              
 1st Qu.: 0.000       1st Qu.: 150.0          1st Qu.: 7.229              
 Median : 0.000       Median : 360.0          Median : 8.492              
 Mean   : 7.887       Mean   : 453.1          Mean   : 7.888              
 3rd Qu.:17.321       3rd Qu.: 600.0          3rd Qu.: 9.229              
 Max.   :54.222       Max.   :2940.0          Max.   :11.522              
 NA's   :297          NA's   :25377           NA's   :25377               
                 CigsSmokedFac   CigsSmokedFac2  SmokingFreq3  
 No smoking (0)         :25080   Q1  :    0     Never  :25080  
 Lowest smoking (1-6)   : 1045   Q2  :    0     Monthly:  772  
 Low smoking (7-30)     : 1129   Q3  :    0     Weekly : 1156  
 Medium smoking (31-300): 6456   Q4  :    0     Daily  :15983  
 High smoking (301+)    : 9086   NA's:43093     NA's   :  102  
 NA's                   :  297                                 
                                                               

5 Graphing and Tabulating

5.1 Class 04, Plotting univariate

Rubric

  1. (3 p) For one categorical variable, a barplot is plotted with axis labels and a title. Interpret the plot: describe the relationship between categories you observe.

  2. (3 p) For one numerical variable, a histogram or boxplot is plotted with axis labels and a title. Interpret the plot: describe the distribution (shape, center, spread, outliers).

  3. (2 p) Code missing variables, remove records with missing values, indicate with R output that this was done correctly (e.g., str(), dim(), summary()).

    • Scroll up to sections labeled “(Class 04)”.
  4. (2 p) Label levels of factor variables.

    • Scroll up to sections labeled “(Class 04)”.

5.2 Categorical variables

5.2.1 Tables for categorical variables

Basic tables can be created using the functions table() or xtabs(), though for multiple variables the group_by(), summarize(), mutate() recipe makes a nicer table with proportions.

# table() produces a one-variable frequency table
table(nesarc_sub$NicotineDependence)

   Nicotine Dependence No Nicotine Dependence 
                  4962                  38131 
# proportions available by passing these frequencies to prop.table()
table(nesarc_sub$NicotineDependence) %>% prop.table()

   Nicotine Dependence No Nicotine Dependence 
             0.1151463              0.8848537 
tab_nicdep <-
  nesarc_sub %>%
  group_by(
    NicotineDependence
  ) %>%
  summarize(
    n = n()
  #, .groups = "drop_last"
  ) %>%
  mutate(
    prop = round(n / sum(n), 3)
  ) %>%
  #arrange(
  #  desc(n)
  #) %>%
  ungroup()

tab_nicdep
# A tibble: 2 × 3
  NicotineDependence         n  prop
  <fct>                  <int> <dbl>
1 Nicotine Dependence     4962 0.115
2 No Nicotine Dependence 38131 0.885

5.2.2 Graphing frequency tables

The barplots are all created with the package ggplot2.

library(ggplot2)
p <- ggplot(data = nesarc_sub, aes(x = NicotineDependence))
p <- p + theme_bw()
p <- p + geom_bar()
p <- p + labs(
              title = "Nicotine Dependence for Adults"
            , x     = "Nicotine Dependence"
            , y     = "Total Number"
            )
print(p)

Interpretation: Among adults (n = 43093), there are about eight times as many people without nicotine dependence (88%) as with it (12%).

5.2.2.1 Removing NAs from factor axes and tables

Finer points

Remove NAs using drop_na() from categorical variables in plots and tables when the NAs are unwanted.

When a factor variable has NAs, NA appears as a level with frequencies (see the bar plot below). As of this writing, ggplot() does not have a way to remove the NAs before plotting; therefore, we have to do it manually.

The easiest and most transparent way is to use the drop_na() function, as below, so that only the non-NA values are plotted.

Note

The table() function doesn’t show the NAs unless you ask it to.

# table() produces a one-variable frequency table
table(nesarc_sub$SmokingStatus)

    Smoked past 12 months Smoked prior to 12 months              Never smoked 
                     9913                      8078                     25080 
table(nesarc_sub$SmokingStatus, useNA = "always")

    Smoked past 12 months Smoked prior to 12 months              Never smoked 
                     9913                      8078                     25080 
                     <NA> 
                       22 
tab_smoke <-
  nesarc_sub %>%
  group_by(
    Sex, SmokingStatus
  ) %>%
  summarize(
    n = n()
  #, .groups = "drop_last"
  ) %>%
  mutate(
    prop = round(n / sum(n), 3)
  ) %>%
  #arrange(
  #  desc(n)
  #) %>%
  ungroup()
`summarise()` has grouped output by 'Sex'. You can override using the `.groups`
argument.
tab_smoke
# A tibble: 8 × 4
  Sex    SmokingStatus                 n  prop
  <fct>  <fct>                     <int> <dbl>
1 Female Smoked past 12 months      5070 0.206
2 Female Smoked prior to 12 months  3861 0.157
3 Female Never smoked              15635 0.636
4 Female <NA>                          9 0    
5 Male   Smoked past 12 months      4843 0.262
6 Male   Smoked prior to 12 months  4217 0.228
7 Male   Never smoked               9445 0.51 
8 Male   <NA>                         13 0.001
tab_smoke <-
  nesarc_sub %>%
  drop_na(Sex, SmokingStatus) %>%   # NAs dropped here
  group_by(
    Sex, SmokingStatus
  ) %>%
  summarize(
    n = n()
  #, .groups = "drop_last"
  ) %>%
  mutate(
    prop = round(n / sum(n), 3)
  ) %>%
  #arrange(
  #  desc(n)
  #) %>%
  ungroup()
`summarise()` has grouped output by 'Sex'. You can override using the `.groups`
argument.
tab_smoke
# A tibble: 6 × 4
  Sex    SmokingStatus                 n  prop
  <fct>  <fct>                     <int> <dbl>
1 Female Smoked past 12 months      5070 0.206
2 Female Smoked prior to 12 months  3861 0.157
3 Female Never smoked              15635 0.636
4 Male   Smoked past 12 months      4843 0.262
5 Male   Smoked prior to 12 months  4217 0.228
6 Male   Never smoked               9445 0.51 

Compare the two plots of the same data below; the second plot has the NAs removed before plotting.

library(ggplot2)
# Original plot with NAs
p1 <- ggplot(data = nesarc_sub, aes(x = SmokingStatus))
p1 <- p1 + geom_bar()
p1 <- p1 + labs(title = "Basic plot")
p1 <- p1 + theme(axis.text.x  = element_text(angle = 90, vjust = 0, hjust = 1))
#p1

# subset excluded the NAs for the variable being plotted
# "subset() specifies the values to keep, and "!is.na()" means "keep the non-NA values"
  # p2 <- ggplot(data = subset(nesarc_sub, !is.na(CigsSmokedFac)), aes(x = CigsSmokedFac))
# tidyverse syntax uses drop_na(Var) to drop observations where Var is NA
p2 <- ggplot(data = nesarc_sub %>% drop_na(SmokingStatus), aes(x = SmokingStatus))
p2 <- p2 + geom_bar()
p2 <- p2 + labs(title = "Using drop_na()")
p2 <- p2 + theme(axis.text.x  = element_text(angle = 90, vjust = 0, hjust = 1))
#p2

p_arranged <-
  cowplot::plot_grid(
    plotlist  = list(p1, p2)   # list of plots
  , nrow      = 1              # number of rows for grid of plots
  , ncol      = NULL           # number of columns, left unspecified
  , labels    = "AUTO"         # A and B panel labels
  )

print(p_arranged)

5.3 Numeric variables

5.3.1 Graphing numeric variables

p <- ggplot(data = nesarc_sub, aes(x = DailyCigsSmoked))
p <- p + theme_bw()
p <- p + geom_histogram(aes(y = ..density..), boundary = 0, binwidth = 5)
#p <- p + geom_rug()  # this plots every point, takes too long for big data
p <- p + geom_density(alpha = 0.2, fill = "gray50", adjust = 4)
p <- p + labs(
              title = "Monthly cigaretts smoked for Young Smoking Adults"
            , x = "Estimated Cigarettes Smoked per Month"
            , y = ""
            )
print(p)
Warning: Removed 262 rows containing non-finite values (stat_bin).
Warning: Removed 262 rows containing non-finite values (stat_density).

summary(nesarc_sub$DailyCigsSmoked)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  0.000   0.000   0.000   6.462  10.000  98.000     262 
fivenum(nesarc_sub$DailyCigsSmoked)
[1]  0  0  0 10 98
median(nesarc_sub$DailyCigsSmoked, na.rm = TRUE)
[1] 0
IQR(nesarc_sub$DailyCigsSmoked, na.rm = TRUE)
[1] 10

Interpretation: Among all adults (including non-smokers), the number of cigarettes smoked per day is right skewed (shape) with a median (center) of 0 and an IQR (spread; interquartile range, middle 50%) of 10. There are two distinct modes (peaks) in the distribution, the first between 0 and 10 cigarettes (half pack) and the second at 20 (full pack). There are several outlying values (outliers) greater than 40 cigarettes per day that are possibly overestimates from the people responding.

Finer points

Experiment with the geom_histogram() code in the next code chunk to set the binwidth = for your data (extremely important!). Often, a binwidth of 1 is good if the data are frequencies and high resolution is important. Decide if you need boundary = 0. Add a title and the labels on the \(x\)- and \(y\)-axes.

A smoothed density curve can be added with the geom_density() function, where adjust = 2.0 is a fairly smooth curve. Experiment with adjust values such as 1.0, 1.5, 2.0, and 3.0 to decrease or increase the smoothness and decide what level best captures the general shape of your distribution.
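
For example, here is a minimal sketch (assuming the tidyverse and cowplot are available, as elsewhere in this document) comparing a few adjust values side by side:

# compare density smoothness for a few adjust values (illustrative sketch)
library(ggplot2)
plots_adjust <-
  lapply(
    c(1.0, 2.0, 4.0)   # candidate smoothness values
  , function(a) {
      ggplot(nesarc_sub %>% drop_na(DailyCigsSmoked), aes(x = DailyCigsSmoked)) +
        theme_bw() +
        geom_histogram(aes(y = ..density..), boundary = 0, binwidth = 5) +
        geom_density(alpha = 0.2, fill = "gray50", adjust = a) +
        labs(title = paste("adjust =", a))
    }
  )
print(cowplot::plot_grid(plotlist = plots_adjust, nrow = 1))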

Finer points

The numeric summaries can be used in the text. See the .qmd file to see the R code in the following sentences and in the interpretation.

  • The “shape” of the distribution for the estimated cigarettes smoked per day is skewed to the right.
  • The “center” (median) of the distribution is 0.
  • The “spread” (interquartile range, middle 50%) for the distribution is 10.
  • There are several “outliers”.

A second example follows. Notice that I first filter the data to smokers.

Interpretation: Among smokers, about half smoke at most 10 cigarettes per day with the mode at 20 (1 pack per day), two smaller peaks at 30 and 40 (1.5 and 2 packs per day), and some extreme outlying values at 60, 80, and 100 (3, 4, and 5 packs per day).

p <- ggplot(data = nesarc_sub %>% filter(DailyCigsSmoked > 0), aes(x = DailyCigsSmoked))
p <- p + geom_histogram(aes(y = ..density..), boundary = 0, binwidth = 2)
p <- p + geom_rug()
p <- p + geom_density(alpha = 0.2, fill = "gray50", adjust = 1.5)
p <- p + labs(x = "Estimated Cigarettes Smoked per Day"
            , y = ""
            , title = "Daily cigarettes smoked for smokers"
            )
p <- p + theme_bw()
p


5.4 Class 05-1, Plotting bivariate, numeric response

Rubric

  1. Each of the following (2 p for plot, 2 p for labeled axes and title, 1 p for interpretation):

    1. Scatter plot (for regression): \(x\) = numerical, \(y\) = numerical, include axis labels and a title. Interpret the plot: describe the relationship.

    2. Box plots (for ANOVA): \(x\) = categorical, \(y\) = numerical, include axis labels and a title. Interpret the plot: describe the relationship.

5.4.1 Scatter plot (for regression): x = numerical, y = numerical

library(ggplot2)
p <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = Age, y = TotalCigsSmoked))
p <- p + theme_bw()
#p <- p + geom_point(alpha = 1/10)
p <- p + geom_jitter(width = 0.25, height = 5, alpha = 1/10)
p <- p + geom_smooth()
p <- p + stat_smooth(method = lm, colour = "red")
p <- p + labs(
            title = "Total Cigarettes Smoked vs Age"
          #, x =
          , y = "Total Cigarettes Smoked per month"
          , caption = "Key: Blue line is GAM smoother, Red line is simple linear regression.\nFiltered to include only smokers (cig > 0)."
          )
print(p)
`geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
`geom_smooth()` using formula 'y ~ x'

Interpretation: The relationship between Age and TotalCigsSmoked, among people who smoke, is very weakly positive, but does not resemble a straight line. TotalCigsSmoked is right-skewed which results in many extreme values above the regression line.

Finer points

Improving plots and comparing by a categorical variable using facets.

Research question: My research questions don’t relate two numeric variables in a meaningful way, so the following is exploratory.

An exploratory question is whether there is a relationship between a person’s age (Age) and the total number of cigarettes smoked (TotalCigsSmoked). Because age is an integer variable, I jitter the points, and I reduce the opacity (increase the transparency) with alpha = 1/4 to see the density of points better (1/4 means that it takes 4 overlaid points to be fully opaque).

The first plot has all the points in their original locations, but they end up stacking on top of each other so you can’t tell how many points are there.

library(ggplot2)
p <- ggplot(nesarc_sub, aes(x = Age, y = TotalCigsSmoked))
p <- p + geom_point(alpha = 1/4)
p <- p + stat_smooth(method = lm)
p <- p + labs(title = "Total Cigarettes Smoked vs Age")
print(p)
`geom_smooth()` using formula 'y ~ x'
Warning: Removed 297 rows containing non-finite values (stat_smooth).
Warning: Removed 297 rows containing missing values (geom_point).

This is the same data as above, but I have added some horizontal jitter (displacement) to each point; this doesn’t affect the values on the y-axis but lets us see how many points are at each discrete age year.

library(ggplot2)
p <- ggplot(nesarc_sub, aes(x = Age, y = TotalCigsSmoked))
p <- p + geom_jitter(position = position_jitter(width = 0.2), alpha = 1/4)
p <- p + stat_smooth(method = lm)
p <- p + labs(title = "Total Cigarettes Smoked vs Age")
print(p)
`geom_smooth()` using formula 'y ~ x'
Warning: Removed 297 rows containing non-finite values (stat_smooth).
Warning: Removed 297 rows containing missing values (geom_point).

Here’s an example of height and weight with a strong positive relationship between the two numeric variables. Most variables in our dataset won’t exhibit such a nice relationship.

library(ggplot2)
p <- ggplot(nesarc_sub, aes(x = Height_inches, y = Weight_lb))
p <- p + theme_bw()
p <- p + geom_jitter(height = 0.5, alpha = 1/10)
p <- p + stat_smooth(method = lm)
p <- p + labs(title = "Weight vs Height")
print(p)
`geom_smooth()` using formula 'y ~ x'
Warning: Removed 1439 rows containing non-finite values (stat_smooth).
Warning: Removed 1439 rows containing missing values (geom_point).

If you have two groups you want to compare, you can create facets (small multiples) and show the relationship by each group. Below, I compare by sex. The slopes appear similar; later in the semester we’ll learn how to formally compare these lines.

library(ggplot2)
p <- ggplot(nesarc_sub, aes(x = Height_inches, y = Weight_lb))
p <- p + geom_point(alpha = 1/4)
p <- p + stat_smooth(method = lm)
p <- p + labs(title = "Weight vs Height")
p <- p + facet_wrap(~ Sex)
print(p)
`geom_smooth()` using formula 'y ~ x'
Warning: Removed 1439 rows containing non-finite values (stat_smooth).
Warning: Removed 1439 rows containing missing values (geom_point).

5.4.2 Box plots (for ANOVA): x = categorical, y = numerical

Research question: Is there a relationship between depression (Depression) and total number of cigarettes smoked (TotalCigsSmoked)?

library(ggplot2)
p <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = Depression, y = TotalCigsSmoked))
p <- p + theme_bw()
p <- p + geom_boxplot(width = 0.5, alpha = 0.5)
p <- p + geom_jitter(position = position_jitter(width = 0.1), alpha = 1/100)
# diamond at mean for each group
p <- p + stat_summary(fun = mean, geom = "point", shape = 18, size = 6,
                      colour = "red", alpha = 0.8)
p <- p + labs(
            title = "Total Cigarettes Smoked by Depression"
          , x = "Depression status"
          , y = "Total Cigarettes Smoked per month"
          , caption = "Red diamond is the mean.\nFiltered to include only smokers (cig > 0)."
          )
print(p)

Interpretation: For NESARC participants who smoke (cig > 0), those who are depressed have a slightly higher mean number of cigarettes per month (red diamond) than those who are not depressed. Total cigarettes smoked per month is right skewed so there are many people who smoke very little but few people who smoke a lot, which increases the mean beyond the median.

5.5 Class 05-2, Plotting bivariate, categorical response

Rubric

  1. Each of the following (2 p for plot, 2 p for labeled axes and title, 1 p for interpretation):

    1. Mosaic plot or bivariate bar plots (for contingency tables): \(x\) = categorical, \(y\) = categorical, include axis labels and a title. Interpret the plot: describe the relationship.

    2. Logistic scatter plot (for logistic regression): \(x\) = numerical, \(y\) = categorical (binary), include axis labels and a title. Interpret the plot: describe the relationship.

5.5.1 Mosaic plot or bivariate bar plots (for contingency tables): x = categorical, y = categorical

Research question: Is there a relationship between depression status (Depression) and nicotine dependence (NicotineDependence)?

tab_dep_nic <-
  nesarc_sub %>%
  group_by(
    Depression, NicotineDependence
  ) %>%
  summarize(
    n = n()
  ) %>%
  mutate(
    prop = round(n / sum(n), 3)
  ) %>%
  ungroup()
`summarise()` has grouped output by 'Depression'. You can override using the
`.groups` argument.
tab_dep_nic
# A tibble: 4 × 4
  Depression     NicotineDependence         n  prop
  <fct>          <fct>                  <int> <dbl>
1 No Depression  Nicotine Dependence     3193 0.091
2 No Depression  No Nicotine Dependence 32061 0.909
3 Yes Depression Nicotine Dependence     1769 0.226
4 Yes Depression No Nicotine Dependence  6070 0.774
tab_nic_dep <-
  nesarc_sub %>%
  group_by(
    NicotineDependence, Depression
  ) %>%
  summarize(
    n = n()
  ) %>%
  mutate(
    prop = round(n / sum(n), 3)
  ) %>%
  ungroup()
`summarise()` has grouped output by 'NicotineDependence'. You can override
using the `.groups` argument.
tab_nic_dep
# A tibble: 4 × 4
  NicotineDependence     Depression         n  prop
  <fct>                  <fct>          <int> <dbl>
1 Nicotine Dependence    No Depression   3193 0.643
2 Nicotine Dependence    Yes Depression  1769 0.357
3 No Nicotine Dependence No Depression  32061 0.841
4 No Nicotine Dependence Yes Depression  6070 0.159

Reversing color order: It is sometimes preferable to reverse the order of your color-filled variable to read the more interesting proportions from the bottom of the plot up the proportion scale. Use the function forcats::fct_rev() on your fill= variable.

library(ggplot2)
p <- ggplot(data = nesarc_sub, aes(x = Depression, fill = forcats::fct_rev(NicotineDependence)))
p <- p + theme_bw()
p <- p + geom_bar(position = "fill")
p <- p + labs(
             x = "Depression status"
           , y = "Proportion"
           , fill = "Nicotine Dependence\nStatus"  # the legend title can be modified, if desired (try this line)
           , title = "Proportion of young adult smokers with and without\nnicotine dependence by depression status"
           )
print(p)

Interpretation: When a person has depression, they are more than twice as likely to have nicotine dependence (22.6%) as those without depression (9.1%).

Finer points

We can compare by a categorical variable with facets (or “small multiples”).

Legends can be (A) removed or (B) moved and rearranged.

Research question: Is there a relationship between smoking frequency (SmokingFreq) and nicotine dependence (NicotineDependence)?

Also compare by sex.

library(ggplot2)
p <- ggplot(data = subset(nesarc_sub, !is.na(SmokingFreq)), aes(x = SmokingFreq, fill = NicotineDependence))
p <- p + geom_bar(position = "fill")
p <- p + theme_bw()
# tilt axis labels
p <- p + theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
p <- p + labs(x = "Smoking Frequency", y = "Percent")
p1 <- p

p1 <- p1 + theme(legend.position = "none") # remove legend

# also compare by sex
p <- p + facet_grid(Sex ~ .)
p <- p + theme(legend.position = "bottom") # "none"
p <- p + theme(legend.box="vertical", legend.margin=margin()) # stack vertically
p <- p + guides(fill = guide_legend(ncol = 1, byrow = FALSE)) # shape levels of one legend
p2 <- p

p_arranged <-
  cowplot::plot_grid(
    plotlist  = list(p1, p2)   # list of plots
  , nrow      = 1              # number of rows for grid of plots
  , ncol      = NULL           # number of columns, left unspecified
  , labels    = "AUTO"         # A and B panel labels
  )

print(p_arranged)

5.5.1.1 Using Mosaic Plots

Note

This section on Mosaic plots can be completely skipped.

Finer points

Mosaic plots will be used later in the course when we calculate statistics to test whether a pair of categorical variables are associated. These are helpful because they color-code each cell based on whether it provides evidence of an association. These plots are difficult to customize, so the ggplot() versions above will generally be preferred.

Interpretation: If a person has depression they are more likely to have nicotine dependence than if they do not have depression.

library(vcd)
Loading required package: grid
mosaic( ~ NicotineDependence + Depression, data = nesarc_sub)

mosaic( ~ NicotineDependence + Depression, data = nesarc_sub, shade = TRUE)

mosaic( ~ NicotineDependence + Depression + Sex, data = nesarc_sub, shade = TRUE)

With too many variables, the resolution becomes so fine that it’s difficult to understand what’s happening.

mosaic(~ NicotineDependence + Depression + Sex + CigsSmokedFac, data = nesarc_sub, shade = TRUE)

5.5.1.1.1 Customizing Mosaic Plots

There are several solutions for dealing with the messy, overlapping labels. The help line below shows the many options you have control over. Read through the comments of each code chunk, note the new options, and see the result.

library(vcd)
# help(labeling_border)

# Initial plot
mosaic( ~ NicotineDependence + Depression, data = nesarc_sub)

# Abbreviate labels, specify the number of characters (try a few)
mosaic( ~ NicotineDependence + Depression, data = nesarc_sub
      , abbreviate_labs = 4)

# Rotate and align labels
mosaic( ~ NicotineDependence + Depression, data = nesarc_sub
      , rot_labels = c(0, 90, 0, 0)
      , just_labels = "right"  # right-justifies all the labels (top and side)
      )

# Rotate and align labels only y-axis label
mosaic( ~ NicotineDependence + Depression, data = nesarc_sub
      , rot_labels = c(0, 90, 0, 0)
      , just_labels = c("center", "right") # order is opposite as in first argument
      )

# Rotate and align labels only y-axis label, and offset the variable name
mosaic( ~ NicotineDependence + Depression, data = nesarc_sub
      , rot_labels = c(0, 90, 0, 0)
      , just_labels = c("center", "right") # order is opposite as in first argument
      , offset_varnames = c(0,0,0,9)      # offset the y-axis var name
      )

Final version

Labeling this plot well is, in my opinion, a bit too hard to be worth it. Notice in my code chunk options (you’ll need to look at the .qmd file) that I’ve included fig.height = 8, fig.width = 8, which makes the size of the plot in the output 8 inches high by 8 inches wide.
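
In quarto syntax, used elsewhere in this document, the equivalent chunk options would be:

#| fig-height: 8
#| fig-width:  8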

library(vcd)

mosaic( ~ NicotineDependence + Depression, data = nesarc_sub
      , shade = TRUE
      , rot_labels = c(0, 90, 0, 0)        # order is: top, right, bottom, left
      , just_labels = c("center", "right") # order is opposite as in first argument
      , offset_varnames = c(0, 0, 0, 9)       # offset the y-axis var name
        # a big left margin makes space for the plot labels
      , margins = c(3, 3, 3, 14)           # order is: top, right, bottom, left
      , main = "Fraction of young adult smokers with and without\n nicotine dependence by depression status"
      , labeling_args = list(set_varnames = c(NicotineDependence = "Tobacco Addiction Status"
                                            , Depression = "Depression status")
                            )
      )

Interpretation: If a person has nicotine dependence they are more likely to have depression than if they do not have nicotine dependence.

5.5.2 Logistic scatter plot (for logistic regression): x = numerical, y = categorical (binary)

Research question: Frequency and quantity of smoking (TotalCigsSmoked) are markedly imperfect indices for determining an individual’s probability of exhibiting nicotine dependence (NicotineDependence) (this is true for other drugs as well).

table(nesarc_sub$NicotineDependence)

   Nicotine Dependence No Nicotine Dependence 
                  4962                  38131 
# if "Nicotine Dependence", then code it as a 1, otherwise code as a 0
nesarc_sub$NicotineDependence01 <- ifelse(nesarc_sub$NicotineDependence == "Nicotine Dependence", 1, 0)
# check that the frequencies are correct before and after
table(nesarc_sub$NicotineDependence01)

    0     1 
38131  4962 
#  #  Note, sometimes it is necessary to make your variable a character type first.
#  #  If the code above doesn't produce both 0s and 1s, then try this:
#  nesarc_sub$NicotineDependence01 <- ifelse(as.character(nesarc_sub$NicotineDependence) == "Nicotine Dependence", 1, 0)
library(ggplot2)
p <- ggplot(nesarc_sub, aes(x = TotalCigsSmoked, y = NicotineDependence01))
p <- p + theme_bw()
p <- p + geom_jitter(position = position_jitter(height = 0.1), alpha = 1/4)

# overlay a sigmoidal curve to indicate direction of relationship
  # As of ggplot2 2.1.0, need to use this binomial_smooth() function:
  binomial_smooth <- function(...) {
    geom_smooth(method = "glm", method.args = list(family = "binomial"), ...)
  }
p <- p + binomial_smooth()    # old way: stat_smooth(method="glm", family="binomial", se=FALSE)

p <- p + scale_y_continuous(breaks = seq(0, 1, by=0.2), minor_breaks = seq(0, 1, by=0.1))

p <- p + labs(x = "Estimated Cigarettes Smoked per Month"
            , y = "Nicotine Dependence (1=Yes, 0=No)"
            , title = "Likelihood of Nicotine Dependence\nincreases with number of cigarettes smoked"
            )
print(p)
`geom_smooth()` using formula 'y ~ x'
Warning: Removed 297 rows containing non-finite values (stat_smooth).
Warning: Removed 297 rows containing missing values (geom_point).

Interpretation: The probability of someone being tobacco dependent increases as the number of cigarettes smoked per month increases. The logistic curve model probably doesn’t apply at (or near) 0 cigarettes.

Finer points

While the basic plot can be made without collapsing to binary, later in the semester we will learn about logistic regression where we model the probability of “success”. Categories associated with “success” are coded as 1, and all other categories are “failure” coded as 0. Below I show how to create this binary 0/1 variable from a numeric variable and a multi-level categorical variable.

Research question: Frequency and quantity of smoking (TotalCigsSmoked) are markedly imperfect indices for determining an individual’s probability of exhibiting nicotine dependence (NicotineDependence) (this is true for other drugs as well).

library(ggplot2)
p <- ggplot(nesarc_sub, aes(x = TotalCigsSmoked, y = NicotineDependence))
p <- p + geom_jitter(position = position_jitter(height = 0.1), alpha = 1/4)
print(p)
Warning: Removed 297 rows containing missing values (geom_point).

Interpretation: The two groups overlap substantially in total cigarettes smoked, though nicotine dependence appears slightly more common at higher smoking levels.

To overlay a curve, we need to convert our response variable \(y\) to be binary 0 or 1, where we let 1 indicate a “success” in the direction of interest.

5.5.2.1 Collapsing a variable to binary

Whether you have a numeric or a categorical variable that you’d like to represent with two levels, you’ll need to convert it to a numeric binary variable (values of 0 or 1, only). The value of 1 indicates “success” or the feature you’re interested in (below, 1 = someone with tobacco dependence) and 0 is “failure” (not tobacco dependent). Note that “success” does not necessarily mean “good”.

5.5.2.1.1 Numeric to binary

It is convenient to collapse a continuous variable into a categorical variable (as with the cut() command) or further collapse a categorical variable into a binary variable.
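
As a brief sketch of cut() (with illustrative breakpoints and labels, not part of the assignment):

# cut() bins a numeric variable into ordered categories
nesarc_sub$AgeCat <-
  cut(
    nesarc_sub$Age
  , breaks         = c(18, 25, 45, 65, Inf)   # intervals (18,25], (25,45], ...
  , include.lowest = TRUE                     # include Age = 18 in the first bin
  , labels         = c("18-25", "26-45", "46-65", "66+")
  )
table(nesarc_sub$AgeCat, useNA = "ifany")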

One strategy to create a binary variable is to use the ifelse() function.

Let’s create a binary Age variable: if 21 years or older, then it is legal to purchase alcohol; otherwise it is illegal. See how the two columns below correspond as intended. I do not recommend overwriting the original column; instead, create a new variable in the data.frame.

nesarc_sub$Alcohol_binary   <- ifelse(nesarc_sub$Age >= 21, "Legal", "Illegal")
nesarc_sub$Alcohol_binary01 <- ifelse(nesarc_sub$Age >= 21,       1,         0)
str(nesarc_sub$Alcohol_binary)
 chr [1:43093] "Legal" "Legal" "Legal" "Illegal" "Legal" "Legal" "Illegal" ...
head(nesarc_sub %>% select(Age, Alcohol_binary, Alcohol_binary01), 12)
   Age Alcohol_binary Alcohol_binary01
1   23          Legal                1
2   28          Legal                1
3   81          Legal                1
4   18        Illegal                0
5   36          Legal                1
6   34          Legal                1
7   19        Illegal                0
8   84          Legal                1
9   29          Legal                1
10  18        Illegal                0
11  68          Legal                1
12  48          Legal                1
5.5.2.1.1 Multi-level categorical to binary

# Smoking frequency has 7 categories
# if at least 3 Days/week, then code as a 1, otherwise code as a 0
table(nesarc_sub$SmokingFreq)

               Never Once a month or less    2 to 3 Days/month 
               25080                  772                  409 
    1 to 2 Days/week     3 to 4 Days/week     5 to 6 Days/week 
                 747                  687                  460 
           Every Day 
               14836 
nesarc_sub$SmokingFreq01 <-
  ifelse(
    nesarc_sub$SmokingFreq
      %in%
    c("3 to 4 Days/week"
    , "5 to 6 Days/week"
    , "Every Day"
    )
  , 1 # success
  , 0 # failure
  )

# check that the frequencies are correct before and after
table(nesarc_sub$SmokingFreq01)

    0     1 
27110 15983 

5.6 Class 06, Figure arrangement, captions, cross-referencing

Rubric

  1. Reorganize your Class 05 bivariate plots above using plot_grid() and quarto, creating captions and cross-referencing them from the text.

    1. (3 p) For your numeric response plots, use cowplot::plot_grid() to create a single figure with separate plot panels.

    2. (3 p) For your categorical response plots, use quarto chunk options fig-cap, fig-subcap, and layout-ncol to create a single figure with separate plot panels.

    3. (2 p) Use captions to describe (not interpret) both sets of plots so a reader understands what is being plotted.

    4. (2 p) Use cross-referencing from the text to refer to the plots when you interpret them.

Note

Go above and reformat your plots and update your interpretations with cross-referencing.

Note

In my example, for pedagogical purposes, I reproduce the plots and illustrate the cross-referencing below. This allows me to separate for you the original creation of the plots from this step of arranging and cross-referencing them.

While I show the chunk options in the html output file, it’s best to look at the .qmd file to see them in place.

5.6.1 Plotting bivariate, numeric response, cowplot::plot_grid()

Note

Start by creating two plots and calling them p1 and p2. Notice that I create each with the variable p, then assign p to p1 (or p2) at the end.

# These two plots are copied from above.

# Scatter plot (for regression): x = numerical, y = numerical
library(ggplot2)
p <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = Age, y = TotalCigsSmoked))
p <- p + theme_bw()
#p <- p + geom_point(alpha = 1/10)
p <- p + geom_jitter(width = 0.25, height = 5, alpha = 1/10)
p <- p + geom_smooth()
p <- p + stat_smooth(method = lm, colour = "red")
p <- p + labs(
            title = "Total Cigarettes Smoked vs Age"
          #, x =
          , y = "Total Cigarettes Smoked per month"
          , caption = "Key: Blue line is GAM smoother, Red line is simple linear regression.\nFiltered to include only smokers (cig > 0)."
          )

p1 <- p


# Box plots (for ANOVA): x = categorical, y = numerical
library(ggplot2)
p <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = Depression, y = TotalCigsSmoked))
p <- p + theme_bw()
p <- p + geom_boxplot(width = 0.5, alpha = 0.5)
p <- p + geom_jitter(position = position_jitter(width = 0.1), alpha = 1/100)
# diamond at mean for each group
p <- p + stat_summary(fun = mean, geom = "point", shape = 18, size = 6,
                      colour = "red", alpha = 0.8)
p <- p + labs(
            title = "Total Cigarettes Smoked by Depression"
          , x = "Depression status"
          , y = "Total Cigarettes Smoked per month"
          , caption = "Red diamond is the mean.\nFiltered to include only smokers (cig > 0)."
          )
p2 <- p
Finer points

Look at the help for ?cowplot::plot_grid; there are many options that I don’t use below. I put p1 and p2 into a list and print the single arranged plot. In the code chunk, I specified a label and a figure caption; I also increased the size of the plot. From the text, I use the chunk label as a cross-reference; note that I have to manually specify figure panels “A” and “B”. I just copied the previous interpretations.

These are the chunk options used below; I recommend looking at the .qmd file.

#| label: fig-bivariate-plots-num
#| fig-cap: Numeric response bivariate plots. (A) Total Cigarettes Smoked vs Age.  (B) Total Cigarettes Smoked by Depression
#| fig-width:   8
#| fig-height:  4
p_arranged <-
  cowplot::plot_grid(
    plotlist  = list(p1, p2)   # list of plots
  , nrow      = 1              # number of rows for grid of plots
  , ncol      = NULL           # number of columns, left unspecified
  , labels    = "AUTO"         # A and B panel labels
  )
`geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
`geom_smooth()` using formula 'y ~ x'
print(p_arranged)

Figure 1: Numeric response bivariate plots. (A) Total Cigarettes Smoked vs Age. (B) Total Cigarettes Smoked by Depression

Interpretation:

In Figure 1 A, the relationship between Age and TotalCigsSmoked, among people who smoke, is very weakly positive, but does not resemble a straight line. TotalCigsSmoked is right-skewed which results in many extreme values above the regression line.

In Figure 1 B, for NESARC participants who smoke (cig > 0), those who are depressed have a slightly higher mean number of cigarettes per month (red diamond) than those who are not depressed. Total cigarettes smoked per month is right skewed so there are many people who smoke very little but few people who smoke a lot, which increases the mean beyond the median.

5.6.2 Plotting bivariate, categorical response, quarto chunk options

Finer points

I’ve now also added fig-subcap (figure subcaptions), which appear under each figure. They must be specified in the same order as the plots are printed, and their cross-references append a suffix of -1, -2, etc., to the chunk label. In the interpretation below, I’ve included the cross-reference parenthetically at the end of the first sentence describing each plot, which is another common style.
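
For example, with the chunk label fig-bivariate-plots-cat used below, the two subfigures are cross-referenced in the quarto text as:

See @fig-bivariate-plots-cat-1 for the bar plot and
@fig-bivariate-plots-cat-2 for the logistic scatter plot.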

These are the chunk options used below; I recommend looking at the .qmd file.

#| label: fig-bivariate-plots-cat
#| fig-cap: Categorical response bivariate plots.
#| fig-subcap:
#|   - "Proportion of young adult smokers with and without nicotine dependence by depression status"
#|   - "Likelihood of Nicotine Dependence increases with number of cigarettes smoked"
#| layout-ncol: 1  # one column for plots      # alternatively, #| layout: [45,-10, 45] puts 2 on one row with a 10% width space between
#| fig-width:   5
#| fig-height:  3
# Mosaic plot or bivariate bar plots (for contingency tables): x = categorical, y = categorical
library(ggplot2)
p <- ggplot(data = nesarc_sub, aes(x = Depression, fill = forcats::fct_rev(NicotineDependence)))
p <- p + theme_bw()
p <- p + geom_bar(position = "fill")
p <- p + labs(x = "Depression status"
           , y = "Proportion"
           , fill = "Nicotine Dependence\nStatus"  # the legend title can be modified
           , title = "Proportion of young adult smokers with and without\nnicotine dependence by depression status"
           )
print(p)

# Logistic scatter plot (for logistic regression): x = numerical, y = categorical (binary)
library(ggplot2)
p <- ggplot(nesarc_sub, aes(x = TotalCigsSmoked, y = NicotineDependence01))
p <- p + theme_bw()
p <- p + geom_jitter(position = position_jitter(height = 0.1), alpha = 1/4)

# overlay a sigmoidal curve to indicate direction of relationship
  # As of ggplot2 2.1.0, need to use this binomial_smooth() function:
  binomial_smooth <- function(...) {
    geom_smooth(method = "glm", method.args = list(family = "binomial"), ...)
  }
p <- p + binomial_smooth()    # old way: stat_smooth(method="glm", family="binomial", se=FALSE)

p <- p + scale_y_continuous(breaks = seq(0, 1, by=0.2), minor_breaks = seq(0, 1, by=0.1))

p <- p + labs(x = "Estimated Cigarettes Smoked per Month"
            , y = "Nicotine Dependence (1=Yes, 0=No)"
            , title = "Likelihood of Nicotine Dependence\nincreases with number of cigarettes smoked"
            )
print(p)
`geom_smooth()` using formula 'y ~ x'
Warning: Removed 297 rows containing non-finite values (stat_smooth).
Warning: Removed 297 rows containing missing values (geom_point).

(a) Proportion of young adult smokers with and without nicotine dependence by depression status

(b) Likelihood of Nicotine Dependence increases with number of cigarettes smoked

Figure 2: Categorical response bivariate plots.

Interpretation:

Categorical response bivariate plots are in Figure 2.

When a person has depression, they are more than twice as likely to have nicotine dependence (22.6%) as those without depression (9.1%) (Figure 2 (a)).

The probability of someone being tobacco dependent increases as the number of cigarettes smoked per month increases (Figure 2 (b)). The logistic curve model probably doesn’t apply at (or near) 0 cigarettes.


6 Statistical methods

6.1 Class 07-1, Simple linear regression (separate worksheet)

Note

Find this assignment on the course website.

6.2 Class 07-2, Simple linear regression

Rubric

  1. With your previous (or new) bivariate scatter plot, add a regression line.
    • (2 p) plot with regression line,
    • (1 p) label axes and title.
  2. Use lm() to fit the linear regression and interpret slope and \(R^2\) (R-squared) values.
    • (2 p) lm summary() table is presented,
    • (2 p) slope is interpreted with respect to a per-unit increase of the \(x\) variable in the context of the variables in the plot,
    • (2 p) \(R^2\) is interpreted in a sentence.
  3. (1 p) Interpret the intercept. Does it make sense in the context of your study?

6.2.1 1. Scatter plot, add a regression line.

library(ggplot2)
p <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = Age, y = TotalCigsSmoked))
p <- p + theme_bw()
#p <- p + geom_point(alpha = 1/10)
p <- p + geom_jitter(width = 0.25, height = 5, alpha = 1/10)
#p <- p + geom_smooth()
p <- p + stat_smooth(method = lm, fullrange = TRUE)  # adds the regression line
p <- p + labs(
            title = "Total Cigarettes Smoked vs Age"
          #, x =
          , y = "Total Cigarettes Smoked per month"
          , caption = "Key: Blue line is simple linear regression.\nFiltered to include only smokers (cig > 0)."
          )
p <- p + xlim(c(0, NA))  # extends the range of the axis down to 0 to see the "intercept"
print(p)
`geom_smooth()` using formula 'y ~ x'

6.2.2 2. Fit the linear regression, interpret slope and \(R^2\) (R-squared) values

# fit the simple linear regression model
lm_TCS_Age <-
  lm(
    formula = TotalCigsSmoked ~ Age
  , data = nesarc_sub
  )
# use summary() to get parameter estimates (slope, intercept) and other summaries
summary(lm_TCS_Age)

Call:
lm(formula = TotalCigsSmoked ~ Age, data = nesarc_sub)

Residuals:
   Min     1Q Median     3Q    Max 
-332.1 -189.4 -139.1  106.3 2828.9 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 57.92792    4.35989   13.29   <2e-16 ***
Age          2.79797    0.08762   31.93   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 329.2 on 42794 degrees of freedom
  (297 observations deleted due to missingness)
Multiple R-squared:  0.02327,   Adjusted R-squared:  0.02325 
F-statistic:  1020 on 1 and 42794 DF,  p-value: < 2.2e-16
  • Slope: For every 1 year increase in Age, we expect an increase of 2.798 total cigarettes smoked per month.

  • \(R^2\): The proportion of variance explained by the regression model, compared to the grand mean, is \(R^2 = 0.0233\), which is very small.

Note

When the \(x\) and \(y\) variables are more strongly correlated (larger \(|r|\)), \(R^2\) will be larger; for simple linear regression, \(R^2 = r^2\), so the capitalization doesn’t matter in this case.
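
A quick sketch verifying this for the model fit above:

# for simple linear regression, R^2 equals the squared correlation r^2
r <- cor(nesarc_sub$Age, nesarc_sub$TotalCigsSmoked, use = "complete.obs")
r^2                            # squared correlation
summary(lm_TCS_Age)$r.squared  # R^2 from the fitted model; should match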

6.2.3 3. Interpret the intercept. Does it make sense?

For someone 0 years old, the expected number of cigarettes smoked per month is 57.93.

This does not make sense since newborns do not smoke and because this is a large extrapolation from the data.

6.3 Class 08-1, Logarithm transformation (separate worksheet)

Note

Find this assignment on the course website.

6.4 Class 08-2, Logarithm transformation

Rubric

  1. Try plotting the data on a logarithmic scale
    • (6 p) Each of the logarithmic relationships is plotted, axes are labeled with scale.
    1. original scales
    2. \(\log(x)\)-only
    3. \(\log(y)\)-only
    4. both \(\log(x)\) and \(\log(y)\)
  2. What happened to your data when you transformed it?
    • (2 p) Describe what happened to the relationship after each log transformation (compare transformed scale to original scale; is the relationship more linear, more curved?).
    • (1 p) Choose the best scale for a linear relationship and explain why.
    • (1 p) Does your relationship benefit from a logarithmic transformation? Say why or why not.

6.4.1 1. Try plotting on log scale (original scale, \(\log(x)\)-only, \(\log(y)\)-only, both \(\log(x)\) and \(\log(y)\))

# IncomePersonal
#p <- p + scale_x_log10()
#p <- p + scale_y_log10()

library(ggplot2)
p1 <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = Age, y = TotalCigsSmoked))
p1 <- p1 + theme_bw()
p1 <- p1 + geom_point(alpha = 1/20)
p1 <- p1 + stat_smooth(method = lm)
p1 <- p1 + labs(
            title = "Original scale"
          , x = "Age"
          , y = "Total Cigarettes Smoked per month"
          )
#print(p1)

library(ggplot2)
p2 <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = Age, y = TotalCigsSmoked))
p2 <- p2 + theme_bw()
p2 <- p2 + geom_point(alpha = 1/20)
p2 <- p2 + stat_smooth(method = lm)
p2 <- p2 + scale_x_log10()          # convert the x-axis to the log10 scale
p2 <- p2 + labs(
            title = "Log(x)"
          , x = "Age"
          , y = "Total Cigarettes Smoked per month"
          )
#print(p2)

library(ggplot2)
p3 <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = Age, y = TotalCigsSmoked))
p3 <- p3 + theme_bw()
p3 <- p3 + geom_point(alpha = 1/20)
p3 <- p3 + stat_smooth(method = lm)
p3 <- p3 + scale_y_log10()          # convert the y-axis to the log10 scale
p3 <- p3 + labs(
            title = "Log(y)"
          , x = "Age"
          , y = "Total Cigarettes Smoked per month"
          )
#print(p3)

library(ggplot2)
p4 <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = Age, y = TotalCigsSmoked))
p4 <- p4 + theme_bw()
p4 <- p4 + geom_point(alpha = 1/20)
p4 <- p4 + stat_smooth(method = lm)
p4 <- p4 + scale_x_log10()          # convert the x-axis to the log10 scale
p4 <- p4 + scale_y_log10()          # convert the y-axis to the log10 scale
p4 <- p4 + labs(
            title = "Log(x) and Log(y)"
          , x = "Age"
          , y = "Total Cigarettes Smoked per month"
          )
#print(p4)


# arrange them into a single figure
p_arranged <-
  cowplot::plot_grid(
    plotlist  = list(p1, p2, p3, p4)  # list of plots
  , nrow      = 2              # number of rows for grid of plots
  , ncol      = NULL           # number of columns, left unspecified
  , labels    = "AUTO"         # A and B panel labels
  )
`geom_smooth()` using formula 'y ~ x'
`geom_smooth()` using formula 'y ~ x'
`geom_smooth()` using formula 'y ~ x'
`geom_smooth()` using formula 'y ~ x'
print(p_arranged)

6.4.2 2. What happened to your data when you transformed it?

  • Describe what happened to the relationship after each log transformation (compare transformed scale to original scale).
  1. Original: the baseline (no transformation).
  2. log(x): no improvement, age values became left skewed
  3. log(y): some improvement, cigs smoked became left skewed, but the regression line was closer to the center of the data.
  4. log(x) and log(y): some improvement, cigs smoked became left skewed, but the regression line was closer to the center of the data; age transformation doesn’t help.
  • Choose the best scale for a linear relationship and explain why.

log(y) because the cigs smoked became slightly left skewed (better than the extreme right skewness from before), and there was no need for an age transformation.

  • Does your relationship benefit from a logarithmic transformation? Say why or why not.

A little bit since the degree of skewness vertically around the regression line was reduced with the log(y) transformation.

Finer points

Notice that the interpretation is on the scale of the regression. If the \(x\) or \(y\) variable is on a different scale (such as \(\log_{10}\)), then the interpretation is on that scale, too.

The plot with \((x, \log_{10}(y))\) is an improvement over \((x,y)\) since the points are more symmetric around the regression line (they are not perfectly symmetric, but they are better). There might be a better transformation (such as square-root), or we might have to live with a non-symmetric response.
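
As a sketch, a square-root \(y\) scale can be tried the same way as the log scales above:

# square-root y-axis as an alternative transformation (sketch)
library(ggplot2)
p <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = Age, y = TotalCigsSmoked))
p <- p + theme_bw()
p <- p + geom_point(alpha = 1/20)
p <- p + stat_smooth(method = lm)
p <- p + scale_y_sqrt()   # convert the y-axis to the square-root scale
print(p)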

nesarc_sub <-
  nesarc_sub %>%
  mutate(
    TotalCigsSmokedLog10 = log10(TotalCigsSmoked)
  )

# fit the simple linear regression model
lm_TCSL10_Age <-
  lm(
    TotalCigsSmokedLog10 ~ Age
  , data = nesarc_sub %>% filter(TotalCigsSmoked > 0)
  )

# use summary() to get parameter estimates (slope, intercept) and other summaries
summary(lm_TCSL10_Age)

Call:
lm(formula = TotalCigsSmokedLog10 ~ Age, data = nesarc_sub %>% 
    filter(TotalCigsSmoked > 0))

Residuals:
    Min      1Q  Median      3Q     Max 
-2.6113 -0.1787  0.2314  0.4303  1.2553 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 2.1093341  0.0152392  138.41   <2e-16 ***
Age         0.0054558  0.0002952   18.48   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6836 on 17714 degrees of freedom
Multiple R-squared:  0.01892,   Adjusted R-squared:  0.01886 
F-statistic: 341.6 on 1 and 17714 DF,  p-value: < 2.2e-16

The slope is 0.00546. Thus, for every year increase in age, we expect an increase of 0.00546 in the log10 of total cigarettes smoked; that is, a small multiplicative increase of about 1.3% per year (since \(10^{0.00546} \approx 1.013\)).
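
A small sketch of this back-transformation:

# back-transform the log10 slope to a multiplicative effect per year of age
b1 <- coef(lm_TCSL10_Age)["Age"]
10^b1   # about 1.013, i.e., roughly 1.3% more cigarettes per year of age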

Finer points

Logarithm transformations of any base (2, \(e\), 10) modify the shape of the data in the same way. The difference is a multiplicative scaling of the axis. My favorite is base 2 because many datasets have several orders of “doubling” and it’s easy for me to think of each unit on the log 2 scale being twice as big as the previous unit. Compare this to base 10 where each unit is 10 times as large as the previous — very few datasets have such a vast range!
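
A one-line sketch showing that bases differ only by a constant rescaling:

# changing log base is a constant rescaling: log2(x) = log10(x) * log2(10)
log2(100)               # 6.644
log10(100) * log2(10)   # identical: 6.644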

6.4.2.0.1 Additional example plots
Finer points

Here’s an example where transforming both \(x\) and \(y\) with the log makes a big improvement.

# IncomePersonal
#p <- p + scale_x_log10()
#p <- p + scale_y_log10()

library(ggplot2)
p1 <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = IncomePersonal, y = TotalCigsSmoked))
p1 <- p1 + theme_bw()
p1 <- p1 + geom_point(alpha = 1/20)
p1 <- p1 + stat_smooth(method = lm)
p1 <- p1 + labs(
            title = "Original scale"
          , x = "Personal Income"
          , y = "Total Cigarettes Smoked per month"
          )
#print(p1)

library(ggplot2)
p2 <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = IncomePersonal, y = TotalCigsSmoked))
p2 <- p2 + theme_bw()
p2 <- p2 + geom_point(alpha = 1/20)
p2 <- p2 + stat_smooth(method = lm)
p2 <- p2 + scale_x_log10()
p2 <- p2 + labs(
            title = "Log(x)"
          , x = "Personal Income"
          , y = "Total Cigarettes Smoked per month"
          )
#print(p2)

library(ggplot2)
p3 <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = IncomePersonal, y = TotalCigsSmoked))
p3 <- p3 + theme_bw()
p3 <- p3 + geom_point(alpha = 1/20)
p3 <- p3 + stat_smooth(method = lm)
p3 <- p3 + scale_y_log10()
p3 <- p3 + labs(
            title = "Log(y)"
          , x = "Personal Income"
          , y = "Total Cigarettes Smoked per month"
          )
#print(p3)

library(ggplot2)
p4 <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = IncomePersonal, y = TotalCigsSmoked))
p4 <- p4 + theme_bw()
p4 <- p4 + geom_point(alpha = 1/20)
p4 <- p4 + stat_smooth(method = lm)
p4 <- p4 + scale_x_log10()
p4 <- p4 + scale_y_log10()
p4 <- p4 + labs(
            title = "Log(x) and Log(y)"
          , x = "Personal Income"
          , y = "Total Cigarettes Smoked per month"
          )
#print(p4)

p_arranged <-
  cowplot::plot_grid(
    plotlist  = list(p1, p2, p3, p4)  # list of plots
  , nrow      = 2              # number of rows for grid of plots
  , ncol      = NULL           # number of columns, left unspecified
  , labels    = "AUTO"         # A and B panel labels
  )
`geom_smooth()` using formula 'y ~ x'
Warning: Transformation introduced infinite values in continuous x-axis
Transformation introduced infinite values in continuous x-axis
`geom_smooth()` using formula 'y ~ x'
Warning: Removed 736 rows containing non-finite values (stat_smooth).
`geom_smooth()` using formula 'y ~ x'
Warning: Transformation introduced infinite values in continuous x-axis
Transformation introduced infinite values in continuous x-axis
`geom_smooth()` using formula 'y ~ x'
Warning: Removed 736 rows containing non-finite values (stat_smooth).
print(p_arranged)

### This is another way to arrange several ggplot objects
# library(gridExtra)
# grid.arrange(grobs = list(p1, p2, p3, p4), nrow=2, top = "Total Cigarettes Smoked vs Personal Income")

6.5 Class 09, Correlation (separate worksheet)

Note

Find this assignment on the course website.

6.6 Class 10, Categorical contingency tables (separate worksheet)

Note

Find this assignment on the course website.

6.7 Class 11, Correlation and Categorical contingency tables

Rubric

  1. With your previous (or a new) bivariate scatter plot, calculate the correlation and interpret.
    • (1 p) plot is repeated here or the plot is referenced and easy to find from a plot above,
    • (1 p) correlation is calculated,
    • (2 p) correlation is interpreted (direction, strength of LINEAR relationship).
  2. With your previous (or a new) two- or three-variable categorical plot, calculate conditional proportions and interpret.
    • (1 p) frequency table of variables is given,
    • (2 p) conditional proportion tables are calculated of the outcome variable conditional on one or two other variables,
    • (1 p) a well-labeled plot of the proportion table is given,
    • (2 p) the conditional proportions are interpreted and compared between conditions.

6.7.1 Correlation

library(ggplot2)
p <- ggplot(nesarc_sub, aes(x = Age, y = TotalCigsSmoked_log2))
p <- p + theme_bw()
p <- p + geom_point(alpha = 1/20)
p <- p + stat_smooth(method = lm)
p <- p + labs(
            title = "Log2 Cigarettes smoked vs Age"
          , x = "Age"
          , y = "Log2 Total Cigarettes Smoked per month"
          )
print(p)
`geom_smooth()` using formula 'y ~ x'
Warning: Removed 25377 rows containing non-finite values (stat_smooth).
Warning: Removed 25377 rows containing missing values (geom_point).

cor_A_C <-
  cor(
    nesarc_sub$Age
  , nesarc_sub$TotalCigsSmoked_log2
  , use = "complete.obs"
  )
cor_A_C
[1] 0.1375481

6.7.2 Interpretation of correlation

The correlation is 0.138: a weak positive linear relationship between Age and log2 total cigarettes smoked per month, meaning that as people get older they tend to smoke slightly more on the log2 scale.

6.7.3 Contingency table

tab_S_D_N <-
  nesarc_sub %>%
  group_by(
    Sex
  , Depression
  , NicotineDependence
  ) %>%
  summarise(
    Frequency = n()
  ) %>%
  mutate(
    Proportion = round(Frequency / sum(Frequency), 3)
  ) %>%
  ungroup()
`summarise()` has grouped output by 'Sex', 'Depression'. You can override using
the `.groups` argument.
# tab_S_D_N

tab_S_D_N %>%
  knitr::kable()
Sex     Depression      NicotineDependence       Frequency  Proportion
Female  No Depression   Nicotine Dependence           1455       0.076
Female  No Depression   No Nicotine Dependence       17706       0.924
Female  Yes Depression  Nicotine Dependence           1117       0.206
Female  Yes Depression  No Nicotine Dependence        4297       0.794
Male    No Depression   Nicotine Dependence           1738       0.108
Male    No Depression   No Nicotine Dependence       14355       0.892
Male    Yes Depression  Nicotine Dependence            652       0.269
Male    Yes Depression  No Nicotine Dependence        1773       0.731

library(ggplot2)
p <- ggplot(data = tab_S_D_N, aes(x = Depression, y = Proportion, fill = forcats::fct_rev(NicotineDependence)))
p <- p + theme_bw()
p <- p + geom_hline(yintercept = c(0, 1), alpha = 1/4)
p <- p + geom_bar(stat="identity")
p <- p + labs(
    title = "Proportion of Nicotine Dependence by Depression status\nfor each Gender"
  , y = "Proportion of Nicotine Dependence"
  , fill = "Nicotine Dependence"
  )
p <- p + scale_y_continuous(limits = c(0, 1))
p <- p + facet_wrap(~ Sex)
p <- p + theme(legend.position = "bottom")
print(p)

6.7.4 Interpretation of conditional proportions

  • For Females without depression, the proportion who are nicotine dependent is 0.076, while for those with depression it is 0.206 (nearly 3 times as much).
  • For Males without depression, the proportion who are nicotine dependent is 0.108, while for those with depression it is 0.269 (nearly 3 times as much).

Males generally are more nicotine dependent than Females, but both are nearly 3 times more likely to be nicotine dependent if they are depressed compared to if they are not depressed.
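
A small sketch of the ratio calculation behind the "nearly 3 times" statement, computed from the tab_S_D_N table above (shown only for illustration):

# ratio of nicotine-dependence proportions, Yes vs No Depression, within Sex
library(dplyr)
tab_S_D_N %>%
  filter(NicotineDependence == "Nicotine Dependence") %>%
  group_by(Sex) %>%
  summarise(ratio = Proportion[Depression == "Yes Depression"] /
                    Proportion[Depression == "No Depression"])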

6.8 Class 12-1, Parameter estimation (one-sample) (separate worksheet)

Note

Find this assignment on the course website.

6.9 Class 12-2, Inference and Parameter estimation (one-sample)

Rubric

  1. Using a numerical variable, calculate and interpret a confidence interval for the population mean.
    • (1 p) Identify and describe the variable,
    • (1 p) use t.test() to calculate the mean and confidence interval, and
    • (1 p) interpret the confidence interval.
    • (2 p) Using plotting code from the last two classes, plot the data, estimate, and confidence interval in a single well-labeled plot.
  2. Using a two-level categorical variable, calculate and interpret a confidence interval for the population proportion.
    • (1 p) Identify and describe the variable,
    • (1 p) use binom.test() to calculate the sample proportion and confidence interval, and
    • (1 p) interpret the confidence interval.
    • (2 p) Using plotting code from the last two classes, plot the data, estimate, and confidence interval in a single well-labeled plot.

6.9.1 Numeric variable confidence interval for mean \(\mu\)

TotalCigsSmoked_log2 is the estimated number of cigarettes per month a subject smokes (DaysSmoke * DailyCigsSmoked) on the log2 scale.
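
For reference, a minimal sketch of how this derived variable could have been constructed (assuming recoded DaysSmoke and DailyCigsSmoked variables exist in nesarc_sub; not necessarily the exact recoding code used earlier):

library(dplyr)
nesarc_sub <-
  nesarc_sub %>%
  mutate(
    TotalCigsSmoked      = DaysSmoke * DailyCigsSmoked  # cigarettes per month
  , TotalCigsSmoked_log2 = log2(TotalCigsSmoked)        # log2 scale; nonsmokers end up missing
  )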

# calculate mean and CI
t_result <-
  t.test(
    nesarc_sub$TotalCigsSmoked_log2
  )
t_result

    One Sample t-test

data:  nesarc_sub$TotalCigsSmoked_log2
t = 457.94, df = 17715, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 7.854201 7.921726
sample estimates:
mean of x 
 7.887963 

Interpretation:

  • The mean of TotalCigsSmoked_log2 is 7.89 and the 95% confidence interval is (7.85, 7.92).
  • In 95% of samples, a confidence interval constructed from the data will contain the true population mean. In our case, our CI was (7.85, 7.92). It either contains the true population mean or it doesn’t, but we don’t know.
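
As a check on the mechanics (a sketch, not part of the rubric), the same interval can be reproduced by hand from the sample mean, standard deviation, and size:

# reproduce the 95% t-interval by hand
x  <- nesarc_sub$TotalCigsSmoked_log2
x  <- x[is.finite(x)]                             # drop missing values
n  <- length(x)
se <- sd(x) / sqrt(n)                             # standard error of the mean
mean(x) + c(-1, 1) * qt(0.975, df = n - 1) * se   # matches the t.test() interval
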
# plot data and create table
library(ggplot2)
p <- ggplot(nesarc_sub, aes(x = TotalCigsSmoked_log2))
p <- p + theme_bw()
p <- p + geom_histogram(binwidth = 0.5, alpha = 1)
#p <- p + geom_rug(alpha = 1/8)
# est
p <- p + geom_vline(aes(xintercept = as.numeric(t_result$estimate)), colour = "blue", size = 1, linetype = 3)
p <- p + geom_rect(aes(
          xmin = t_result$conf.int[1]
        , xmax = t_result$conf.int[2]
        , ymin = -150
        , ymax = 0)
                  , fill = "blue", alpha = 1)
p <- p + labs(title = "Log2 Total Cigs Smoked\nblue = est +- 95% CI")
print(p)
Warning: Removed 25377 rows containing non-finite values (stat_bin).

6.9.2 Categorical variable confidence interval for proportion \(p\)

Depression is a Yes/No variable indicating that the person has major depression (lifetime).

# proportions are of the last group_by() variable within combinations of the other variables
tab_dep_long <-
  nesarc_sub %>%
  group_by(Depression) %>%
  summarise(Frequency = n()) %>%
  mutate(Proportion = Frequency / sum(Frequency)) %>%
  ungroup()
tab_dep_long
# A tibble: 2 × 3
  Depression     Frequency Proportion
  <fct>              <int>      <dbl>
1 No Depression      35254      0.818
2 Yes Depression      7839      0.182
# prop.test() is an asymptotic (approximate) test for a binomial random variable
p_summary <-
  prop.test(
    x = tab_dep_long %>% filter(Depression == "Yes Depression") %>% pull(Frequency)
  , n = sum(tab_dep_long$Frequency)
  #, conf.level = 0.95
  )
p_summary

    1-sample proportions test with continuity correction

data:  tab_dep_long %>% filter(Depression == "Yes Depression") %>% pull(Frequency) out of sum(tab_dep_long$Frequency), null probability 0.5
X-squared = 17440, df = 1, p-value < 2.2e-16
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.1782835 0.1855912
sample estimates:
        p 
0.1819089 

Interpretation:

  • The proportion of people who have Depression is 0.182 and the 95% confidence interval is (0.178, 0.186).
  • In 95% of samples, a confidence interval constructed from the data will contain the true population proportion. In our case, our CI was (0.178, 0.186). It either contains the true population proportion or it doesn’t, but we don’t know.
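
The rubric mentions binom.test(); as a cross-check (a sketch), the exact binomial interval can be computed from the same counts and, with a sample this large, should be nearly identical to the prop.test() interval:

# binom.test() is an exact test for a binomial random variable
binom.test(
    x = tab_dep_long %>% filter(Depression == "Yes Depression") %>% pull(Frequency)
  , n = sum(tab_dep_long$Frequency)
  )
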
library(ggplot2)
p <- ggplot(data = tab_dep_long %>% filter(Depression == "Yes Depression"), aes(x = Depression, y = Proportion))
p <- p + theme_bw()
p <- p + geom_hline(yintercept = c(0, 1), alpha = 1/4)
#p <- p + geom_bar(stat = "identity")
p <- p + geom_errorbar(aes(ymin = p_summary$conf.int[1], ymax = p_summary$conf.int[2]), width = 0.25)
p <- p + geom_point(colour = "red", size = 6, pch = 18)
p <- p + scale_y_continuous(limits = c(0.17, 0.19))
# swap axes to take less vertical space
#p <- p + coord_flip()
print(p)
Warning: Removed 2 rows containing missing values (geom_hline).

6.10 Class 13, Hypothesis testing (one- and two-sample) (separate worksheet)

Note

Find this assignment on the course website.

6.11 Class 14, Paired data, assumption assessment (separate worksheet)

Note

Find this assignment on the course website.

6.12 Class 15, Hypothesis testing (one- and two-sample)

6.12.1 Mechanics of a hypothesis test (review)

  1. Set up the null and alternative hypotheses in words and notation.

    • In words: "The population mean for [what is being studied] is different from [value of \(\mu_0\)]." (Note that the statement in words is in terms of the alternative hypothesis.)
    • In notation: \(H_0: \mu=\mu_0\) versus \(H_A: \mu \ne \mu_0\) (where \(\mu_0\) is specified by the context of the problem).
  2. Choose the significance level of the test, such as \(\alpha=0.05\).

  3. Compute the test statistic, such as \(t_{s} = \frac{\bar{Y}-\mu_0}{SE_{\bar{Y}}}\), where \(SE_{\bar{Y}}=s/\sqrt{n}\) is the standard error. (A small numerical sketch follows this list.)

  4. Determine the tail(s) of the sampling distribution where the \(p\)-value from the test statistic will be calculated (for example, both tails, right tail, or left tail). (Historically, we would compare the observed test statistic, \(t_{s}\), with the critical value \(t_{\textrm{crit}}=t_{\alpha/2}\) in the direction of the alternative hypothesis from the \(t\)-distribution table with degrees of freedom \(df = n-1\).)

  5. State the conclusion in terms of the problem.

    • Reject \(H_0\) in favor of \(H_A\) if \(p\textrm{-value} < \alpha\).
    • Fail to reject \(H_0\) if \(p\textrm{-value} \ge \alpha\). (Note: We DO NOT accept \(H_0\).)
  6. Check assumptions of the test (for now we skip this).
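
A small numerical sketch of steps 2–5 using made-up (hypothetical) data:

# hand computation of a one-sample t-test (hypothetical numbers)
y   <- c(4.2, 5.1, 3.8, 4.9, 5.4)                   # hypothetical sample
mu0 <- 4                                            # hypothesized mean
t_s <- (mean(y) - mu0) / (sd(y) / sqrt(length(y)))  # test statistic
t_s
2 * pt(-abs(t_s), df = length(y) - 1)               # two-sided p-value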

6.12.2 What do we do about “significance”?

Adapted from Significance Magazine.

Recent calls have been made to abandon the term “statistical significance”. The American Statistical Association (ASA) issued its statement and recommendation on p-values (see the special issue of p-values for more).

In summary, the problem of “significance” is one of misuse, misunderstanding, and misinterpretation. The recommendation in this class is that it is no longer sufficient to say that a result is “statistically significant” or “non-significant” depending on whether a p-value is less than a threshold. Instead, we will be looking for wording as in the following paragraph.

“The difference between the two groups turns out to be small (8%), while the probability (\(p\)) of observing a result at least as extreme as this under the null hypothesis of no difference between the two groups is \(p = 0.003\) (that is, 0.3%). This p-value is statistically significant as it is below our pre-defined threshold (\(p < 0.05\)). However, the p-value tells us only that the 8% difference between the two groups is somewhat unlikely given our hypothesis and null model’s assumptions. More research is required, or other considerations may be needed, to conclude that the difference is of practical importance and reproducible.”

6.12.3 Two-sample \(t\)-test

Rubric

  1. Using a numerical response variable and a two-level categorical variable (or a categorical variable you can reduce to two levels), specify a two-sample \(t\)-test associated with your research questions.
    • (2 p) Specify the hypotheses in words and notation (either one- or two-sided test),
    • (0 p) use t.test() to calculate the mean, test statistic, and p-value,
    • (2 p) state the significance level, test statistic, and p-value, and
    • (2 p) state the conclusion in the context of the problem.
    • (1 p) Given your conclusion, could you have committed a Type-I or Type-II error?
    • (2 p) Provide an appropriate plot of the data and sample estimates in a well-labeled plot.
    • (1 p) Check the assumptions of the test using the bootstrap and interpret the bootstrap sampling distribution plot.
  2. Set up the null and alternative hypotheses in words and notation.
    • In words: “The population mean total cigarettes smoked (TotalCigsSmoked) is greater for people with depression than for those without depression (Depression).”
    • In notation: \(H_0: \mu_D = \mu_N\) versus \(H_A: \mu_D > \mu_N\).
t_summary_TCS_D <-
  t.test(
    TotalCigsSmoked ~ Depression
    # factor levels are ordered (No Depression, Yes Depression), so the
    # one-sided H_A: mu_Yes > mu_No is specified as alternative = "less"
  , data = nesarc_sub
  , alternative = "less"
  )
t_summary_TCS_D

    Welch Two Sample t-test

data:  TotalCigsSmoked by Depression
t = -17.226, df = 10439, p-value < 2.2e-16
alternative hypothesis: true difference in means between group No Depression and group Yes Depression is less than 0
95 percent confidence interval:
      -Inf -71.94239
sample estimates:
 mean in group No Depression mean in group Yes Depression 
                    173.0296                     252.5677 
  3. Choose the significance level of the test, such as \(\alpha=0.05\).

  4. Compute the test statistic, such as \(t_{s} = -17.2\).

  5. The \(p\)-value is \(p = 6.85\times 10^{-66}\).

  6. State the conclusion in terms of the problem.

    • Reject \(H_0\) in favor of \(H_A\) concluding that the population mean total cigarettes smoked (TotalCigsSmoked) is greater for people with depression than for those without depression (Depression).
  • Because we rejected the null hypothesis, we could have committed a Type-I error.
## If we create a summary data.frame with a similar structure as our data, then we
##   can annotate our plot with those summaries.
est_mean_TCS_D <-
  nesarc_sub %>%
  group_by(Depression) %>%
  summarise(TotalCigsSmoked = mean(TotalCigsSmoked, na.rm = TRUE)) %>%
  ungroup()
est_mean_TCS_D
# A tibble: 2 × 2
  Depression     TotalCigsSmoked
  <fct>                    <dbl>
1 No Depression             173.
2 Yes Depression            253.
library(ggplot2)
p <- ggplot(nesarc_sub, aes(x = Depression, y = TotalCigsSmoked))
p <- p + theme_bw()
p <- p + geom_boxplot(width = 0.5, alpha = 0.5)
p <- p + geom_jitter(position = position_jitter(width = 0.1), alpha = 1/4)
# diamond at mean for each group
p <- p + stat_summary(fun = mean, geom = "point", shape = 18, size = 6,
                      colour = "red", alpha = 0.8)
p <- p + labs(
            title = "Total Cigarettes Smoked by Depression"
          , x = "Depression status"
          , y = "Total cigarettes smoked per month"
          )
print(p)
Warning: Removed 297 rows containing non-finite values (stat_boxplot).
Warning: Removed 297 rows containing non-finite values (stat_summary).
Warning: Removed 297 rows containing missing values (geom_point).

# histogram using ggplot
p <- ggplot(nesarc_sub, aes(x = TotalCigsSmoked))
p <- p + theme_bw()
p <- p + geom_histogram(binwidth = 40)
#p <- p + geom_rug()
p <- p + geom_vline(data = est_mean_TCS_D, aes(xintercept = TotalCigsSmoked), colour = "red")
p <- p + facet_grid(Depression ~ .)
p <- p + labs(
            title = "Total Cigarettes Smoked by Depression"
          , x = "Total cigarettes smoked per month"
          , y = "Count"
          )
print(p)
Warning: Removed 297 rows containing non-finite values (stat_bin).

e_plot_bs_two_samp_diff_dist(
    nesarc_sub %>% filter(Depression == "No Depression" ) %>% drop_na(TotalCigsSmoked) %>% pull(TotalCigsSmoked)
  , nesarc_sub %>% filter(Depression == "Yes Depression") %>% drop_na(TotalCigsSmoked) %>% pull(TotalCigsSmoked)
  , N = 1000
  )

  • The bootstrap sampling distribution of the difference in means is close to normal, so the model assumptions are met and we can rely on the decision we made in the hypothesis test above.
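
For reference, a minimal sketch of the idea behind e_plot_bs_two_samp_diff_dist() (an assumption about its internals, not the course implementation):

# minimal bootstrap of the difference in means
library(dplyr)
library(tidyr)
x_no  <- nesarc_sub %>% filter(Depression == "No Depression" ) %>% drop_na(TotalCigsSmoked) %>% pull(TotalCigsSmoked)
x_yes <- nesarc_sub %>% filter(Depression == "Yes Depression") %>% drop_na(TotalCigsSmoked) %>% pull(TotalCigsSmoked)
set.seed(76)
boot_diff <-
  replicate(1000, mean(sample(x_no , replace = TRUE)) -
                  mean(sample(x_yes, replace = TRUE)))
hist(boot_diff, main = "Bootstrap sampling distribution of the difference in means")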

6.13 Class 16, ANOVA, Pairwise comparisons (separate worksheet)

Note

Find this assignment on the course website.

6.14 Class 17, ANOVA and Assessing Assumptions

Rubric

  1. Using a numerical response variable and a categorical variable with three to five levels (or a categorical variable you can reduce to three to five levels), specify an ANOVA hypothesis associated with your research questions.
    • (1 p) Specify the ANOVA hypotheses in words and notation,
    • (1 p) plot the data in a way that is consistent with hypothesis test (comparing means, assess equal variance assumption),
    • (1 p) use aov() to calculate the hypothesis test statistic and p-value,
    • (1 p) state the significance level, test statistic, and p-value,
    • (1 p) state the conclusion in the context of the problem,
    • (2 p) assess the normality assumption of the residuals using appropriate methods (QQ-plot and Anderson-Darling test), and
    • (1 p) assess the assumption of equal variance between your groups using an appropriate test (also mention standard deviations of each group).
    • (2 p) If you rejected the ANOVA null hypothesis, perform follow-up pairwise comparisons using Tukey’s HSD to indicate which groups have statistically different means and summarize the results.

6.14.1 Hypothesis and plot

Let \(\mu_j\) = population mean total cigarettes smoked per month for the five ethnicities identified in the dataset: Caucasian, African American, Native American, Asian, and Hispanic, numbered \((j = 1, 2, 3, 4, 5)\), respectively. We wish to test \(H_0: \mu_1 = \cdots = \mu_5\) against \(H_A: \textrm{not } H_0\) (at least one pair of means differ).

A table of the ethnicities to compare is below.

summary(nesarc_sub$Ethnicity)
 Cauc  AfAm  NaAm  Asia  Hisp 
24507  8245   701  1332  8308 

Plot the data in a way that compares the means. Error bars are 95% confidence intervals of the mean.

# Plot the data using ggplot
library(ggplot2)
p <- ggplot(nesarc_sub, aes(x = Ethnicity, y = TotalCigsSmoked))
p <- p + theme_bw()
# plot a reference line for the global mean (assuming no groups)
p <- p + geom_hline(yintercept = mean(nesarc_sub$TotalCigsSmoked, na.rm = TRUE),
                    colour = "black", linetype = "dashed", size = 0.3, alpha = 0.5)
# violin and box plots of the data (box narrower to sit inside the violin)
p <- p + geom_violin(width = 0.5, alpha = 0.25)
p <- p + geom_boxplot(width = 0.25, alpha = 0.25)
# points for observed data
p <- p + geom_point(position = position_jitter(w = 0.05, h = 0), alpha = 0.2)
# diamond at mean for each group
p <- p + stat_summary(fun = mean, geom = "point", shape = 18, size = 4,
                      colour = "red", alpha = 0.8)
# confidence limits based on normal distribution
p <- p + stat_summary(fun.data = "mean_cl_normal", geom = "errorbar",
                      width = .2, colour = "red", alpha = 0.8)
p <- p + labs(title = "Total cigarettes smoked by Ethnicity")
p <- p + ylab("Total Cigs Smoked")
print(p)
Warning: Removed 297 rows containing non-finite values (stat_ydensity).
Warning: Removed 297 rows containing non-finite values (stat_boxplot).
Warning: Removed 297 rows containing non-finite values (stat_summary).
Removed 297 rows containing non-finite values (stat_summary).
Warning: Removed 297 rows containing missing values (geom_point).

6.14.2 (NOT USED) Transform the response variable to satisfy assumptions

6.14.3 ANOVA Hypothesis test

Hypothesis test

  1. Set up the null and alternative hypotheses in words and notation.

    • In words: “The population mean total cigarettes smoked is different between ethnicities.”
    • In notation: \(H_0: \mu_1 = \cdots = \mu_5\) versus \(H_A: \textrm{not } H_0\) (at least one pair of means differ).
  2. Let the significance level of the test be \(\alpha = 0.05\).

  3. Compute the test statistic.

aov_summary <-
  aov(
    TotalCigsSmoked ~ Ethnicity
  , data = nesarc_sub
  )
summary(aov_summary)
               Df    Sum Sq  Mean Sq F value Pr(>F)    
Ethnicity       4 2.106e+08 52648664   496.4 <2e-16 ***
Residuals   42791 4.538e+09   106050                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
297 observations deleted due to missingness

The \(F\)-statistic for the ANOVA is \(F = 496\).

  4. Compute the \(p\)-value from the test statistic.

The p-value for testing the null hypothesis is \(p < 2.2\times 10^{-16}\), effectively 0.

  5. State the conclusion in terms of the problem.

Because \(p < 0.05\), we reject \(H_0\) in favor of \(H_A\), concluding that the mean total cigarettes smoked differs between at least two ethnicities.

6.14.4 Check assumptions

  6. Check assumptions of the test:
    a. residuals are normal, and
    b. populations have equal variances.
  • Check whether residuals are normal.

  • Plot the residuals and assess whether they appear normal.

# Plot the data using ggplot
df_res <- data.frame(res = aov_summary$residuals)
library(ggplot2)
p <- ggplot(df_res, aes(x = res))
p <- p + theme_bw()
p <- p + geom_histogram(aes(y = ..density..), binwidth = 50)
p <- p + geom_density(colour = "blue", adjust = 5)
#p <- p + geom_rug()
p <- p + stat_function(fun = dnorm, colour = "red", args = list(mean = mean(df_res$res), sd = sd(df_res$res)))
p <- p + labs(title = "ANOVA Residuals"
            , caption = "Blue = Kernal density curve\nRed = Normal distribution")
print(p)

The plot of the residuals is very right skewed (not normal).

# QQ plot
par(mfrow=c(1,1))
#library(car)
car::qqPlot(
    aov_summary$residuals
  , las = 1
  , id = list(n = 4, cex = 1)
  , lwd = 1
  , main="QQ Plot"
  )

38173   279  1034  2883 
37909   276  1026  2863 

The QQ-plot of the residuals versus the normal quantiles curves upward away from the reference line, indicating strong right skewness (not normal).

  • A formal test of normality on the residuals tests the hypothesis \(H_0:\) The distribution is Normal vs \(H_A:\) The distribution is not Normal. We can test the distribution of the residuals.
#shapiro.test(aov_summary$residuals)
#library(nortest)
nortest::ad.test(aov_summary$residuals)

    Anderson-Darling normality test

data:  aov_summary$residuals
A = 3399.5, p-value < 2.2e-16
nortest::cvm.test(aov_summary$residuals)
Warning in nortest::cvm.test(aov_summary$residuals): p-value is smaller than
7.37e-10, cannot be computed more accurately

    Cramer-von Mises normality test

data:  aov_summary$residuals
W = 637.55, p-value = 7.37e-10

The formal normality tests of the residuals reject \(H_0\) in favor of \(H_A\), concluding that the data are not normal.

  • Check whether populations have equal variances.

  • Look at the numerical summaries below.

# calculate summaries
dat_EthCig_summary <-
  nesarc_sub %>%
  group_by(Ethnicity) %>%
  summarize(
    m = mean(TotalCigsSmoked, na.rm = TRUE)
  , s = sd(TotalCigsSmoked, na.rm = TRUE)
  , n = n()
  , .groups = "drop_last"
  ) %>%
  ungroup()

dat_EthCig_summary
# A tibble: 5 × 4
  Ethnicity     m     s     n
  <fct>     <dbl> <dbl> <int>
1 Cauc      244.   376. 24507
2 AfAm      127.   250.  8245
3 NaAm      293.   420.   701
4 Asia       76.9  194.  1332
5 Hisp       89.7  226.  8308

The standard deviations appear different between groups, in particular, the Asian group has less than half the standard deviation of the Native American group.

  • Formal tests for equal variances. We can test whether the variances are equal between our five groups. This is similar to the ANOVA hypothesis, but instead of testing means we’re testing variances. \(H_0: \sigma^2_1 = \cdots = \sigma^2_5\) versus \(H_A: \textrm{not } H_0\) (at least one pair of variances differ).
## Test equal variance
# assumes populations are normal
bartlett.test(TotalCigsSmoked ~ Ethnicity, data = nesarc_sub)

    Bartlett test of homogeneity of variances

data:  TotalCigsSmoked by Ethnicity
Bartlett's K-squared = 4464.9, df = 4, p-value < 2.2e-16
# does not assume normality, requires car package
#library(car)
car::leveneTest(TotalCigsSmoked ~ Ethnicity, data = nesarc_sub)
Levene's Test for Homogeneity of Variance (center = median)
         Df F value    Pr(>F)    
group     4  498.24 < 2.2e-16 ***
      42791                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# nonparametric test
fligner.test(TotalCigsSmoked ~ Ethnicity, data = nesarc_sub)

    Fligner-Killeen test of homogeneity of variances

data:  TotalCigsSmoked by Ethnicity
Fligner-Killeen:med chi-squared = 2160.6, df = 4, p-value < 2.2e-16

Because the data were not normal, we interpret the Levene test. The p-value is less than 0.05, therefore we reject \(H_0\) of equal variances in favor of \(H_A\) that the variances are not equal.

6.14.5 Post Hoc pairwise comparison tests

EMM plot interpretation

This EMM plot (Estimated Marginal Means, aka Least-Squares Means) is only available when conditioning on one variable. The blue bars are confidence intervals for the EMMs; don’t ever use confidence intervals for EMMs to perform comparisons – they can be very misleading. The red arrows are for the comparisons among means; the degree to which the “comparison arrows” overlap reflects as much as possible the significance of the comparison of the two estimates. If an arrow from one mean overlaps an arrow from another group, the difference is not significant, based on the adjust setting (which defaults to “tukey”).

## Contrasts
adjust_method <- c("none", "tukey", "bonferroni")[2]

library(emmeans)
emm_cont <-
  emmeans::emmeans(
    aov_summary
  , specs = "Ethnicity"
  )

# means and CIs
emm_cont %>% print()
 Ethnicity emmean    SE    df lower.CL upper.CL
 Cauc       244.1  2.09 42791    240.0    248.1
 AfAm       127.2  3.60 42791    120.2    134.3
 NaAm       293.3 12.35 42791    269.0    317.5
 Asia        76.9  8.94 42791     59.4     94.5
 Hisp        89.7  3.58 42791     82.7     96.7

Confidence level used: 0.95 
# pairwise differences
emm_cont %>% pairs(adjust = adjust_method) %>% print()
 contrast    estimate    SE    df t.ratio p.value
 Cauc - AfAm    116.8  4.16 42791  28.055  <.0001
 Cauc - NaAm    -49.2 12.53 42791  -3.927  0.0008
 Cauc - Asia    167.1  9.18 42791  18.209  <.0001
 Cauc - Hisp    154.4  4.15 42791  37.235  <.0001
 AfAm - NaAm   -166.0 12.87 42791 -12.903  <.0001
 AfAm - Asia     50.3  9.64 42791   5.218  <.0001
 AfAm - Hisp     37.5  5.08 42791   7.387  <.0001
 NaAm - Asia    216.3 15.25 42791  14.188  <.0001
 NaAm - Hisp    203.6 12.86 42791  15.827  <.0001
 Asia - Hisp    -12.8  9.63 42791  -1.325  0.6758

P value adjustment: tukey method for comparing a family of 5 estimates 
# plot of means, CIs, and comparison arrows
plot(
    emm_cont
  , comparisons = TRUE
  , adjust = adjust_method
  , horizontal = FALSE
  , ylab = "Ethnicity"
  )

All pairs of Ethnicities differ except Asian and Hispanic.

  • Cauc - AfAm: sig diff
  • Cauc - NaAm: sig diff
  • Cauc - Asia: sig diff
  • Cauc - Hisp: sig diff
  • AfAm - NaAm: sig diff
  • AfAm - Asia: sig diff
  • AfAm - Hisp: sig diff
  • NaAm - Asia: sig diff
  • NaAm - Hisp: sig diff
  • Asia - Hisp: NOT sig diff
dat_EthCig_summary %>% arrange(m)
# A tibble: 5 × 4
  Ethnicity     m     s     n
  <fct>     <dbl> <dbl> <int>
1 Asia       76.9  194.  1332
2 Hisp       89.7  226.  8308
3 AfAm      127.   250.  8245
4 Cauc      244.   376. 24507
5 NaAm      293.   420.   701

Summarize results by ordering the means and grouping pairs that do not differ.


                    Ethnicity     m     s     n
                    <fct>     <dbl> <dbl> <int>
            |     1 Asia       76.9  194.  1332
            |     2 Hisp       89.7  226.  8308
             |    3 AfAm      127.   250.  8245
              |   4 Cauc      244.   376. 24507
               |  5 NaAm      293.   420.   701
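
An automated version of this grouping is the compact letter display (a sketch; assumes the multcomp package is installed). Groups that share a letter do not have significantly different means:

# compact letter display of the Tukey-adjusted pairwise comparisons
multcomp::cld(emm_cont, Letters = letters, adjust = "tukey")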

6.15 Class 18, Nonparametric methods (separate worksheet)

Note

Find this assignment on the course website.

6.16 Class 19, Binomial and Multinomial tests (separate worksheet)

Note

Find this assignment on the course website.

6.17 Class 20-1, Two-way categorical tables (separate worksheet)

Note

Find this assignment on the course website.

6.18 Class 20-2, Simple linear regression (separate worksheet)

Note

Find this assignment on the course website.

6.19 Class 21, Two-way categorical and simple linear regression

6.19.1 Two-way categorical analysis

  1. Two-way categorical analysis.

Using two categorical variables with two to five levels each, specify a hypothesis test for homogeneity of proportions associated with your research questions.

6.19.1.1 (1 p) Specify the hypotheses in words and notation.

  • In words: “There is an association between NicotineDependence and Ethnicity.”
  • In notation: \(H_0: p(i \textrm{ and } j) = p(i)p(j)\) for all row categories \(i\) and column categories \(j\), versus \(H_A: p(i \textrm{ and } j) \ne p(i)p(j)\) for at least one row category \(i\) and column category \(j\).

6.19.1.2 (1 p) State the conclusion of the test in the context of the problem.

# Tabulate by two categorical variables:
tab_E_TD <-
  xtabs(
    ~ Ethnicity + NicotineDependence
  , data = nesarc_sub
  )
tab_E_TD
         NicotineDependence
Ethnicity Nicotine Dependence No Nicotine Dependence
     Cauc                3337                  21170
     AfAm                 836                   7409
     NaAm                 158                    543
     Asia                  95                   1237
     Hisp                 536                   7772
# row proportions (margin = 1 conditions on Ethnicity, the rows)
prop.table(
    tab_E_TD
  , margin = 1
  )
         NicotineDependence
Ethnicity Nicotine Dependence No Nicotine Dependence
     Cauc          0.13616518             0.86383482
     AfAm          0.10139478             0.89860522
     NaAm          0.22539230             0.77460770
     Asia          0.07132132             0.92867868
     Hisp          0.06451613             0.93548387
# Chi-sq test
chisq_E_TD <-
  chisq.test(
    tab_E_TD
  , correct = FALSE
  )
chisq_E_TD

    Pearson's Chi-squared test

data:  tab_E_TD
X-squared = 439.32, df = 4, p-value < 2.2e-16

The test statistic is \(X^2 = 439\).

Because the p-value \(= 8.84\times 10^{-94} < 0.05\) we reject the null hypothesis concluding that there is an association between NicotineDependence and Ethnicity.

# table of expected frequencies:
chisq_E_TD$expected
         NicotineDependence
Ethnicity Nicotine Dependence No Nicotine Dependence
     Cauc          2821.89066             21685.1093
     AfAm           949.38134              7295.6187
     NaAm            80.71756               620.2824
     Asia           153.37489              1178.6251
     Hisp           956.63556              7351.3644
# smallest expected frequency:
min(chisq_E_TD$expected)
[1] 80.71756

The model assumptions are met since the expected count for each cell is at least 5.
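
For reference, a sketch of where these expected frequencies come from: each cell is (row total × column total) / grand total.

# expected counts by hand; matches chisq_E_TD$expected
outer(rowSums(tab_E_TD), colSums(tab_E_TD)) / sum(tab_E_TD)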

6.19.1.2.1 Nicer version of contingency (chi-square) table for publication
## gtsummary tables
tab_cross <-
  nesarc_sub %>%
  gtsummary::tbl_cross(
    row          = "Ethnicity"
  , col          = "NicotineDependence"
  , label        = NULL
  , statistic    = NULL
  , digits       = NULL
  , percent      = c("none", "column", "row", "cell")[3]
  , margin       = c("column", "row")
  , missing      = c("ifany", "always", "no")[1]
  , missing_text = "Unknown"
  , margin_text  = "Total"
  ) %>%
  gtsummary::add_p()          %>%  # add p-values to the output comparing values across groups (tbl_cross options: test = c("chisq.test", "fisher.test")[2])
  gtsummary::bold_labels()    %>%  # bold variable name labels
  gtsummary::italicize_levels() %>% # italicize levels
  gtsummary::modify_caption("Nicotine dependence by Ethnicity")

tab_cross
Nicotine dependence by Ethnicity

            Nicotine Dependence   No Nicotine Dependence   Total           p-value1
Ethnicity                                                                  <0.001
  Cauc      3,337 (14%)           21,170 (86%)             24,507 (100%)
  AfAm        836 (10%)            7,409 (90%)              8,245 (100%)
  NaAm        158 (23%)              543 (77%)                701 (100%)
  Asia         95 (7.1%)           1,237 (93%)              1,332 (100%)
  Hisp        536 (6.5%)           7,772 (94%)              8,308 (100%)
Total       4,962 (12%)           38,131 (88%)             43,093 (100%)
1 Pearson's Chi-squared test

6.19.1.3 (1 p) Plot a mosaic plot of the data and Pearson residuals.

# The Pearson residuals
chisq_E_TD$residuals
         NicotineDependence
Ethnicity Nicotine Dependence No Nicotine Dependence
     Cauc            9.696820              -3.497990
     AfAm           -3.679775               1.327427
     NaAm            8.601947              -3.103031
     Asia           -4.713559               1.700350
     Hisp          -13.599806               4.905937
# The sum of the squared residuals is the chi-squared statistic:
#chisq_E_TD$residuals^2
#sum(chisq_E_TD$residuals^2)
# mosaic plot
library(vcd)
# this layout gives us the interpretation we want:
mosaic(
    tab_E_TD
  , shade  = TRUE
  , legend = TRUE
  , abbreviate_labs = 4
  , rot_labels = c(0, 90, 0, 0)
  , just_labels = c("center", "right") # order is opposite as in first argument
  , offset_varnames = c(0, 0, 0, 2)      # offset the y-axis var name
  #, margins = c(3, 3, 3, 14)           # order is: top, right, bottom, left
  , main = "Fraction of smokers with and without\n nicotine dependence by Ethnicity"
  , labeling_args = list(set_varnames =
                          c(NicotineDependence  = "Nicotine Dependence/Addiction Status"
                          , Ethnicity           = "Ethnicity")
                          )
  )

6.19.1.4 (1 p) Interpret the mosaic plot with reference to the Pearson residuals.

The primary cause of rejecting \(H_0\) is that Hispanics and Asians have less nicotine dependence than expected while Caucasians and (especially) Native Americans have higher nicotine dependence than expected.
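
For reference, a sketch of the Pearson residuals by hand, (observed − expected) / sqrt(expected), which reproduces the table above:

# Pearson residuals by hand; matches chisq_E_TD$residuals
(tab_E_TD - chisq_E_TD$expected) / sqrt(chisq_E_TD$expected)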

6.19.2 Simple linear regression

  2. Simple linear regression.

Select two numerical variables.

6.19.2.1 (1 p) Plot the data and, if required, transform the variables so a roughly linear relationship is observed. All interpretations will be done on this scale of the variables.

(See below.)

6.19.2.2 (0 p) Fit the simple linear regression model.

# fit model
#lm_fit <- lm(TotalCigsSmoked_sqrt ~ DaysSmoke, data = nesarc_sub)
lm_fit <- lm(TotalCigsSmoked_log2 ~ DaysSmoke_log2, data = nesarc_sub)
coef(lm_fit)
   (Intercept) DaysSmoke_log2 
     0.4585303      1.6635414 

Note that prediction intervals should be bounded below at 0.

library(ggplot2)
p <- ggplot(nesarc_sub, aes(x = DaysSmoke_log2, y = TotalCigsSmoked_log2))
p <- p + theme_bw()
p <- p + geom_vline(xintercept = 0, alpha = 0.25)
p <- p + geom_hline(yintercept = 0, alpha = 0.25)
# prediction bands
p <- p + geom_ribbon(aes(ymin = predict(lm_fit, data.frame(DaysSmoke_log2)
                                        , interval = "prediction", level = 0.95)[, 2],
                         ymax = predict(lm_fit, data.frame(DaysSmoke_log2)
                                        , interval = "prediction", level = 0.95)[, 3])
                  , alpha = 0.1, fill = "darkgreen")
# linear regression fit and confidence bands
p <- p + geom_smooth(method = lm, se = TRUE)
# jitter a little to uncover duplicate points
p <- p + geom_jitter(position = position_jitter(0.05), alpha = 0.01)
p <- p + labs(title = "Regression with confidence and prediction bands"
              , x = "log2 of Days smoke"
              , y = "log2 of Total cigarettes smoked"
              , caption = "Blue line is fitted regression line.\nGray band is the confidence band.\nGreen band is the prediction band.")
print(p)
`geom_smooth()` using formula 'y ~ x'
Warning: Removed 25377 rows containing non-finite values (stat_smooth).
Warning: Removed 25377 rows containing missing values (geom_point).

6.19.2.3 (1 p) Assess the residuals for lack of fit (interpret plots of residuals vs fitted and \(x\)-value).

e_plot_lm_diagostics(
    lm_fit
  , sw_plot_set = "simple"
  )

  • Residuals vs x: each group of values (at each x) is roughly symmetric and the y = 0 line passes through the center of most groups. The x = 0 group has many observations at its lowest value, which is why the fitted line at x = 0 sits below the bulk of the distribution spread above it.

6.19.2.4 (1 p) Assess the residuals for normality (interpret QQ-plot and histogram).

  • QQ-Plot: On the original scale, the pattern is non-normal with right skewness in the y-variable. After the log2 transformation of both variables, we still do not have normality (but it is much better).

6.19.2.5 (1 p) Assess the relative influence of points.

  • Cook’s distance: several points with slightly higher influence, but none with extremely large influence.
  • Cook’s distance vs Leverage: Those points with high influence also have high leverage, thus, those points that are extremes in the x-direction (least frequent or most frequent smokers) have the highest influence.

6.19.2.6 (1 p) Test whether the slope is different from zero, \(H_A: \beta_1 \ne 0\).

Though the model assumptions are not perfectly met, we will continue to interpret the model that we have.

summary(lm_fit)

Call:
lm(formula = TotalCigsSmoked_log2 ~ DaysSmoke_log2, data = nesarc_sub)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.7145 -0.4585  0.1924  0.6075  5.5859 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     0.45853    0.03447    13.3   <2e-16 ***
DaysSmoke_log2  1.66354    0.00746   223.0   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.175 on 17714 degrees of freedom
  (25377 observations deleted due to missingness)
Multiple R-squared:  0.7373,    Adjusted R-squared:  0.7373 
F-statistic: 4.973e+04 on 1 and 17714 DF,  p-value: < 2.2e-16

Because the p-value \(< 2.2\times 10^{-16}\) is less than 0.05, we reject \(H_0\) in favor of \(H_A\), concluding that the slope is different from 0.

Interpretation (extra): For every unit increase in log2 of DaysSmoke (that is, each doubling of days smoked), we expect an increase of 1.66 in the log2 of TotalCigsSmoked. On the original scale this is multiplicative: each doubling of days smoked multiplies the expected total cigarettes smoked by \(2^{1.66} \approx 3.2\).
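
A one-line sketch of that back-transformation:

# multiplicative change in expected TotalCigsSmoked per doubling of DaysSmoke
2^coef(lm_fit)["DaysSmoke_log2"]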

6.19.2.7 (1 p) Interpret the \(R^2\) value.

The regression model explains a proportion \(R^2 = 0.737\) of the variation in the response beyond that explained by the mean alone.

6.19.2.7.1 Nicer version of linear model parameter estimate table for publication
tab_fit_lm <-
  lm_fit %>%
  gtsummary::tbl_regression(
  #  label           = NULL
  #, exponentiate    = FALSE
  #, include         = everything()
  #, show_single_row = NULL
    conf.level      = 0.95
  , intercept       = TRUE
  #, estimate_fun    = NULL
  #, pvalue_fun      = NULL
  #, tidy_fun        = NULL
  #, add_estimate_to_reference_rows = FALSE
  , conf.int        = TRUE
  ) %>%
  #gtsummary::add_q()                  %>%  # add a column of q values to control for multiple comparisons
  gtsummary::bold_p(t = 0.05, q = FALSE) %>% # Bold significant p-values or q-values
  gtsummary::add_global_p()           %>%  # adds the global p-value for a categorical variables, instead of difference from intercept
  gtsummary::add_glance_source_note() %>%  # adds statistics from `broom::glance()` as source note
  #gtsummary::add_vif(statistic = c("VIF", "GVIF", "aGVIF", "df")[c(2, 4)])  %>%  # adds column of the variance inflation factors (VIF)
  gtsummary::bold_labels()    %>%  # bold variable name labels
  gtsummary::italicize_levels() %>% # italicize levels
  gtsummary::modify_caption("Table title")

tab_fit_lm
Table title

Characteristic   Beta   95% CI1      p-value
(Intercept)      0.46   0.39, 0.53   <0.001
DaysSmoke_log2   1.7    1.6, 1.7     <0.001

R² = 0.737; Adjusted R² = 0.737; Sigma = 1.18; Statistic = 49,726; p-value = <0.001; df = 1; Log-likelihood = -27,995; AIC = 55,995; BIC = 56,018; Deviance = 24,458; Residual df = 17,714; No. Obs. = 17,716
1 CI = Confidence Interval

6.20 Class 22, Logistic regression (separate worksheet)

Note

Find this assignment on the course website.

6.21 Class 23, Logistic regression

  1. Logistic regression.

6.21.1 Select a binary response and continuous explanatory/predictor variable.

6.21.2 (1 p) Plot the data.

See below.

6.21.3 (1 p) Summarize the \(\hat{p}\) values for each value of the \(x\)-variable. Also, calculate the empirical logits.

Summarize observed probability of success for each unique value of x.

dat_Nic_Cigs_sum <-
  nesarc_sub %>%
  group_by(
    TotalCigsSmoked_smokers_log2
  ) %>%
  summarize(
    Success = sum(NicotineDependence01)
  , Total   = n()
  # estimated proportion of successes for each unique value of x
  , p_hat   = Success / Total
  , .groups = "drop_last"
  ) %>%
  ungroup() %>%
  na.omit()

# empirical logits (0.5 continuity adjustment avoids log(0) when p_hat is 0 or 1)
dat_Nic_Cigs_sum <-
  dat_Nic_Cigs_sum %>%
  mutate(
    emp_logit = log((p_hat + 0.5/Total) / (1 - p_hat + 0.5/Total))
  )

dat_Nic_Cigs_sum
# A tibble: 103 × 5
   TotalCigsSmoked_smokers_log2 Success Total   p_hat emp_logit
                          <dbl>   <dbl> <int>   <dbl>     <dbl>
 1                         0         17   356 0.0478      -2.97
 2                         1          4   170 0.0235      -3.61
 3                         1.32       1   110 0.00909     -4.29
 4                         1.58       3    70 0.0429      -2.96
 5                         2          1    23 0.0435      -2.71
 6                         2.32      12   144 0.0833      -2.36
 7                         2.58       8   172 0.0465      -2.96
 8                         2.81       0     3 0           -1.95
 9                         2.91       6    64 0.0938      -2.20
10                         3          0     2 0           -1.61
# … with 93 more rows

6.21.4 (1 p) Plot the \(\hat{p}\) values vs the \(x\)-variable and plot the empirical logits vs the \(x\)-variable.

6.21.4.1 Probability/proportion scale

Plots on the probability scale should follow a sigmoidal curve (a little difficult to see here). Note that the overlayed reference curve (red) is a weighted smoothed curve (loess), not the model fit.

library(ggplot2)
p1 <- ggplot(nesarc_sub, aes(x = TotalCigsSmoked_smokers_log2, y = NicotineDependence01))
p1 <- p1 + theme_bw()
# data
p1 <- p1 + geom_jitter(width = 0.05, height = 0.025, size = 0.5, alpha = 1/10, colour = "green")
# summaries
p1 <- p1 + geom_point(data = dat_Nic_Cigs_sum, aes(x = TotalCigsSmoked_smokers_log2, y = p_hat, size = Total))
p1 <- p1 + geom_smooth(data = dat_Nic_Cigs_sum, aes(x = TotalCigsSmoked_smokers_log2, y = p_hat, weight = Total), se = FALSE, colour = "red")  # just for reference
# axes
p1 <- p1 + scale_y_continuous(breaks = seq(0, 1, by = 0.2))
p1 <- p1 + expand_limits(y = c(0, 1))
#p1 <- p1 + scale_x_continuous(breaks = seq(0, 100, by = 2))
# labels
p1 <- p1 + labs(
    title   = "Proportion of nicotine dependence by log2 Total cigarettes smoked"
  , subtitle= "Observed data, smokers only"
  , x       = "log2 Total cigarettes smoked"
  , y       = "Nicotine dependence (0/1)\nObserved probability of nicotine dependence"
  , caption = paste(
                "Green = Indicator points of nicotine dependence (1) or not (0)."
              , "Black = Observed proportions of nicotine dependence"
              , "Red = Smoothed curve to proportions"
              , sep = "\n" # separate by new lines
              )
  )
print(p1)
`geom_smooth()` using method = 'loess' and formula 'y ~ x'
Warning: Removed 25377 rows containing missing values (geom_point).

6.21.4.2 Logit scale

On the logit scale, if points follow a straight line, then we can fit a simple logistic regression model. Note that the overlayed reference straight line (blue) is from weighted least squares (not the official fitted line), and the curve (red) is a weighted smoothed curve (loess). If these two lines are similar, that suggests a straight line on the logit scale is a good fit. Both lines are weighted by the total number of observations that each point represents, so that points representing few observations don’t contribute as much as points representing many observations, thus our decision should not be heavily influenced by random deviations where there is little data.

# plot on logit scale
library(ggplot2)
p1 <- ggplot(dat_Nic_Cigs_sum, aes(x = TotalCigsSmoked_smokers_log2, y = emp_logit))
p1 <- p1 + theme_bw()

# summaries
p1 <- p1 + geom_point(aes(size = Total))
p1 <- p1 + stat_smooth(aes(weight = Total), method = "lm", se = FALSE, linetype = 3)  # just for reference
p1 <- p1 + geom_smooth(aes(weight = Total), se = FALSE, colour = "red")  # just for reference

# axes
#p1 <- p1 + scale_y_continuous(breaks = seq(-10, 10, by = 0.5))
#p1 <- p1 + scale_x_continuous(breaks = seq(0, 100, by = 2))
# labels
p1 <- p1 + labs(
    title   = "Logit of nicotine dependence by log2 Total cigarettes smoked"
  , subtitle= "Observed data, smokers only"
  , x       = "log2 Total cigarettes smoked"
  , y       = "Empirical logit of the probability of nicotine dependence"
  , caption = paste(
                "Black = Observed logit proportions of nicotine dependence"
              , "Blue = Naive LM fit of logit proportions"
              , "Red = Loess smooth curve of empirical logits"
              , sep = "\n" # separate by new lines
              )
  )
print(p1)
`geom_smooth()` using formula 'y ~ x'
`geom_smooth()` using method = 'loess' and formula 'y ~ x'

6.21.5 (1 p) Describe the logit-vs-\(x\) plot. Is it linear? If not, consider a transformation of \(x\) to improve linearity; describe the transformation you chose if you needed one.

  • Log2 of Total cigarettes smoked improved the linear relationship. A straight line describes the data well from 0 to 10, but does not fit the data when x is greater than 10.

6.21.6 (1 p) Fit the glm() model and assess the deviance lack-of-fit test.

The simple logistic regression model expresses the population proportion \(p\) of individuals with a given attribute (called the probability of success) as a function of a single predictor variable \(X\). The model assumes that \(p\) is related to \(X\) through \[ \log \left( \frac{p}{1-p} \right) = \beta_0 + \beta_1 X \] or, equivalently, as \[ p = \frac{ \exp( \beta_0 + \beta_1 X ) }{ 1 + \exp( \beta_0 + \beta_1 X ) }. \] The logistic regression model is a binary response model, where the response for each case falls into one of two exclusive and exhaustive categories, success (cases with the attribute of interest) and failure (cases without the attribute of interest).
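
A small sketch of the logit link and its inverse from the equations above (hypothetical helper functions, for illustration only):

# logit link and inverse logit (expit)
logit     <- function(p) log(p / (1 - p))
inv_logit <- function(x) exp(x) / (1 + exp(x))
inv_logit(logit(0.25))   # round-trips to 0.25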

Fit the model.

# For our summarized data (with frequencies and totals for each age)
# The left-hand side of our formula binds two columns together with cbind():
#   the columns are the number of "successes" and "failures".
# For logistic regression with logit link we specify family = binomial,
#   where logit is the default link function for the binomial family.

glm_n_c <-
  glm(
    cbind(Success, Total - Success) ~ TotalCigsSmoked_smokers_log2
  , family = binomial
  , data   = dat_Nic_Cigs_sum
  )

6.21.6.1 Deviance statistic for lack-of-fit

Unfortunately, there aren’t many model diagnostics for logistic regression. Visual inspection of the empirical logits is good when you can summarize the data as we’ve done.

One simple test for lack-of-fit uses the deviance statistic. Under the null hypothesis (that you’ll state below), the residual deviance follows a \(\chi^2\) distribution with the associated degrees-of-freedom.

Below, we calculate the deviance p-value and perform the test for lack-of-fit.

# Test residual deviance for lack-of-fit (if > 0.10, little-to-no lack-of-fit)
glm_n_c$deviance
[1] 310.0391
glm_n_c$df.residual
[1] 101
dev_p_val <- 1 - pchisq(glm_n_c$deviance, glm_n_c$df.residual)
dev_p_val
[1] 0

State the null and alternative hypotheses for lack-of-fit.

  • \(H_0:\) Model fits the data.

  • \(H_A:\) Model does not fit the data.

For your preferred model, the deviance statistic is

  • D = 310 with

  • 101 df, giving

  • p-value \(\approx 0\) (computed as 0 to machine precision).

  • Because the p-value is less than 0.05, we reject \(H_0\) in favor of \(H_A\) concluding that the model does not fit the data. However, for this assignment, we will continue with interpretation.

6.21.7 (1 p) Calculate the confidence bands around the model fit/predictions. Plot on both the logit and \(\hat{p}\) scales.

The glm() statement creates an object which we can use to create the fitted probabilities and 95% CIs for the population proportions at each observed value of log2 total cigarettes smoked. The fitted probabilities and the limits are stored in columns labeled fit_p, fit_p_lower, and fit_p_upper, respectively.

Below I create confidence bands for the model and plot the fit against the data, first on the logit scale to assess model fit and then on the probability scale for interpretation.

# predict() uses all the Load values in dataset, including appended values
fit_logit_pred <-
  predict(
    glm_n_c
  #, newdata = data.frame(TotalCigsSmoked_smokers_log2 = dat_Nic_Cigs_sum$TotalCigsSmoked_smokers_log2)
  , type   = "link"
  , se.fit = TRUE
  ) %>%
  as_tibble()

# put the fitted values in the data.frame
dat_Nic_Cigs_sum <-
  dat_Nic_Cigs_sum %>%
  mutate(
    # logit scale values
    fit_logit       = fit_logit_pred$fit
  , fit_logit_se    = fit_logit_pred$se.fit
  , fit_logit_lower = fit_logit - 1.96 * fit_logit_se
  , fit_logit_upper = fit_logit + 1.96 * fit_logit_se
  # proportion scale values
  , fit_p           = exp(fit_logit) / (1 + exp(fit_logit))
  , fit_p_lower     = exp(fit_logit_lower) / (1 + exp(fit_logit_lower))
  , fit_p_upper     = exp(fit_logit_upper) / (1 + exp(fit_logit_upper))
  )

6.21.7.1 Logit scale

# plot on logit scale
library(ggplot2)
p1 <- ggplot(dat_Nic_Cigs_sum, aes(x = TotalCigsSmoked_smokers_log2, y = emp_logit))
p1 <- p1 + theme_bw()

# MODEL FIT
# predicted curve and point-wise 95% CI
p1 <- p1 + geom_ribbon(aes(x = TotalCigsSmoked_smokers_log2, ymin = fit_logit_lower, ymax = fit_logit_upper), fill = "blue", linetype = 1, alpha = 0.2)
p1 <- p1 + geom_line(aes(x = TotalCigsSmoked_smokers_log2, y = fit_logit), colour = "blue", size = 1)
# fitted values
p1 <- p1 + geom_point(aes(y = fit_logit), colour = "blue", size = 2)

# summaries
p1 <- p1 + geom_point(aes(size = Total))

# axes
#p1 <- p1 + scale_y_continuous(breaks = seq(-10, 10, by = 0.5))
#p1 <- p1 + scale_x_continuous(breaks = seq(0, 100, by = 2))
# labels
p1 <- p1 + labs(
    title   = "Logit of nicotine dependence by log2 Total cigarettes smoked"
  , subtitle= "Logistic model fit"
  , x       = "log2 Total cigarettes smoked"
  , y       = "Logit scale of the probability of nicotine dependence"
  , caption = paste(
                "Black = Observed logit proportions of nicotine dependence given log2 Total cigarettes smoked"
              , "Blue = Logistic model fitted logit proportions"
              , sep = "\n" # separate by new lines
              )
  )
print(p1)

6.21.7.2 Probability/proportion scale

# plot on probability scale
library(ggplot2)
p1 <- ggplot(nesarc_sub, aes(x = TotalCigsSmoked_smokers_log2, y = NicotineDependence01))
p1 <- p1 + theme_bw()
# data
p1 <- p1 + geom_jitter(width = 0.05, height = 0.025, size = 0.5, alpha = 1/10, colour = "green")
# summaries
p1 <- p1 + geom_point(data = dat_Nic_Cigs_sum, aes(x = TotalCigsSmoked_smokers_log2, y = p_hat, size = Total))

# MODEL FIT
# predicted curve and point-wise 95% CI
p1 <- p1 + geom_ribbon(data = dat_Nic_Cigs_sum, aes(x = TotalCigsSmoked_smokers_log2, y = fit_p, ymin = fit_p_lower, ymax = fit_p_upper), fill = "blue", linetype = 1, alpha = 0.2)
p1 <- p1 + geom_line(data = dat_Nic_Cigs_sum, aes(x = TotalCigsSmoked_smokers_log2, y = fit_p), colour = "blue", size = 1)
# fitted values
p1 <- p1 + geom_point(data = dat_Nic_Cigs_sum, aes(y = fit_p), colour = "blue", size = 2)

# axes
p1 <- p1 + scale_y_continuous(breaks = seq(0, 1, by = 0.2))
p1 <- p1 + expand_limits(y = c(0, 1))
#p1 <- p1 + scale_x_continuous(breaks = seq(0, 100, by = 2))
# labels
p1 <- p1 + labs(
    title   = "Proportion of nicotine dependence by log2 Total cigarettes smoked"
  , subtitle= "Logistic model fit"
  , x       = "log2 Total cigarettes smoked"
  , y       = "At least one pregnancy (0/1)\nObserved probability of nicotine dependence"
  , caption = paste(
                "Green = Indicator points of nicotine dependence (1) or not (0)."
              , "Black = Observed proportions of nicotine dependence given log2 Total cigarettes smoked"
              , "Blue = Logistic model fitted proportions"
              , sep = "\n" # separate by new lines
              )
  )
print(p1)
Warning: Removed 25377 rows containing missing values (geom_point).

6.21.8 (1 p) Interpret the sign (\(+\) or \(-\)) of the slope parameter and test whether the slope is different from zero, \(H_A: \beta_1 \ne 0\).

The summary table gives MLEs and standard errors for the regression parameters. The z-value column is the parameter estimate divided by its standard error. The p-values are used to test whether the corresponding parameters of the logistic model are zero.

summary(glm_n_c)

Call:
glm(formula = cbind(Success, Total - Success) ~ TotalCigsSmoked_smokers_log2, 
    family = binomial, data = dat_Nic_Cigs_sum)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-6.9776  -1.1004  -0.6275   0.4655   3.8816  

Coefficients:
                             Estimate Std. Error z value Pr(>|z|)    
(Intercept)                  -2.94584    0.08730  -33.74   <2e-16 ***
TotalCigsSmoked_smokers_log2  0.23882    0.01014   23.55   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1020.01  on 102  degrees of freedom
Residual deviance:  310.04  on 101  degrees of freedom
AIC: 607.54

Number of Fisher Scoring iterations: 4
  # see names(summary(glm_n_c)) to find the object that has the coefficients.
  # can also use coef(glm_n_c)

Interpret the sign (\(+\) or \(-\)) of the slope.

  • Because our slope is positive, as (x) log2 Total cigarettes smoked increases, the probability of “success” (y) of nicotine dependence increases.

Hypothesis test

  • \(H_0: \beta_1 = 0\)
  • \(H_A: \beta_1 \ne 0\)
  • The p-value = \(1.41\times 10^{-122} < 0.05\), therefore we reject \(H_0\) in favor of \(H_A\) concluding that the slope is different from 0.
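
To interpret the magnitude (a sketch), exponentiate the slope to get an odds ratio; because x is on the log2 scale, this is the multiplicative change in the odds of nicotine dependence per doubling of total cigarettes smoked. It matches the odds-ratio table below.

# odds ratio per doubling of total cigarettes smoked
exp(coef(glm_n_c)["TotalCigsSmoked_smokers_log2"])
exp(confint.default(glm_n_c))   # Wald CIs on the odds-ratio scale
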
6.21.8.0.1 Nicer version of logistic linear model parameter estimate table for publication
# Logit scale uses exponentiate = FALSE
tab_fit_glm <-
  glm_n_c %>%
  gtsummary::tbl_regression(
  #  label           = NULL
    exponentiate    = FALSE              ## do not exponentiate gives logit scale
  #, include         = everything()
  #, show_single_row = NULL
  , conf.level      = 0.95
  , intercept       = TRUE
  #, estimate_fun    = NULL
  #, pvalue_fun      = NULL
  #, tidy_fun        = NULL
  #, add_estimate_to_reference_rows = FALSE
  , conf.int        = TRUE
  ) %>%
  #gtsummary::add_q()                  %>%  # add a column of q values to control for multiple comparisons
  gtsummary::bold_p(t = 0.05, q = FALSE) %>% # Bold significant p-values or q-values
  gtsummary::add_global_p()           %>%  # adds the global p-value for a categorical variables, instead of difference from intercept
  gtsummary::add_glance_source_note() %>%  # adds statistics from `broom::glance()` as source note
  #gtsummary::add_vif(statistic = c("VIF", "GVIF", "aGVIF", "df")[c(2, 4)])  %>%  # adds column of the variance inflation factors (VIF)
  gtsummary::bold_labels()    %>%  # bold variable name labels
  gtsummary::italicize_levels() %>% # italicize levels
  gtsummary::modify_caption("Logit scale results")

tab_fit_glm
Logit scale results
Characteristic                 log(OR)1   95% CI1      p-value
(Intercept)                    -2.9       -3.1, -2.8   <0.001
TotalCigsSmoked_smokers_log2   0.24       0.22, 0.26   <0.001

Null deviance = 1,020; Null df = 102; Log-likelihood = -302; AIC = 608; BIC = 613; Deviance = 310; Residual df = 101; No. Obs. = 103
1 OR = Odds Ratio, CI = Confidence Interval
# Odds ratio uses exponentiate = TRUE
tab_fit_glm <-
  glm_n_c %>%
  gtsummary::tbl_regression(
  #  label           = NULL
    exponentiate    = TRUE               ## exponentiate gives Odds Ratios (ORs)
  #, include         = everything()
  #, show_single_row = NULL
  , conf.level      = 0.95
  , intercept       = TRUE
  #, estimate_fun    = NULL
  #, pvalue_fun      = NULL
  #, tidy_fun        = NULL
  #, add_estimate_to_reference_rows = FALSE
  , conf.int        = TRUE
  ) %>%
  #gtsummary::add_q()                  %>%  # add a column of q values to control for multiple comparisons
  gtsummary::bold_p(t = 0.05, q = FALSE) %>% # Bold significant p-values or q-values
  gtsummary::add_global_p()           %>%  # adds the global p-value for a categorical variables, instead of difference from intercept
  gtsummary::add_glance_source_note() %>%  # adds statistics from `broom::glance()` as source note
  #gtsummary::add_vif(statistic = c("VIF", "GVIF", "aGVIF", "df")[c(2, 4)])  %>%  # adds column of the variance inflation factors (VIF)
  gtsummary::bold_labels()    %>%  # bold variable name labels
  gtsummary::italicize_levels() %>% # italicize levels
  gtsummary::modify_caption("Odds ratio results")

tab_fit_glm
Odds ratio results
Characteristic                 OR1    95% CI1      p-value
(Intercept)                    0.05   0.04, 0.06   <0.001
TotalCigsSmoked_smokers_log2   1.27   1.24, 1.30   <0.001

Null deviance = 1,020; Null df = 102; Log-likelihood = -302; AIC = 608; BIC = 613; Deviance = 310; Residual df = 101; No. Obs. = 103
1 OR = Odds Ratio, CI = Confidence Interval

7 Poster

7.1 Classes 24, 25, and 26: Poster Preparation

7.1.1 Class 24, Poster Preparation, research questions, data sources, analyses

See items under Class 26.

From the list in Class 26, complete Items 2, 3, 4, and 5.

7.1.2 Class 25, Poster Preparation, literature review, references, discussion, future work

See items under Class 26.

From the list in Class 26, complete Items 1, 6, 7, and 8.

Citation help is available at this page: https://quarto.org/docs/authoring/footnotes-and-citations.html

7.1.2.1 Class Literature review (detailed example)

A slightly more detailed literature review would typically be completed for a research project. Here I show what that might look like. However, in our class I’m not expecting so much detail. See the bullet points that I have used in the “Class 26 poster preparation” section.

Dataset: National Epidemiologic Survey on Alcohol and Related Conditions (NESARC), with codebook NESARC_W1_CodeBook.pdf.

Research questions:

  1. Is nicotine dependence [S3AQ10D] associated with smoking frequency [S3AQ3B1] and quantity [S3AQ3C1]?
    • Google scholar search: “nicotine dependence smoking frequency”
    • Citation: pdf file available for Dierker et al. (2007)
    • Interesting points: Figures 2 and 3, quantity and frequency both positively related to probability of dependence.
    • Others: Kandel and Chen (2000) and Caraballo, Novak, and Asman (2009)

You don’t need to include images in your literature review. I’m providing these figures and tables to illustrate what they look like.

  2. Is nicotine dependence [S3AQ10D] associated with depression [S4AQ1]?
    • Google scholar search: “nicotine dependence depression”
    • Citation: pdf file available for Naomi Breslau (1995)
    • Interesting points: Table 2, Smoking and Nicotine Dependence both associated with Education. Table 3, Major depression associated with being nicotine dependent and Sex.
    • Others: N. Breslau et al. (1998)

  3. Is the association between nicotine dependence [S3AQ10D] and depression [S4AQ1] different by demographics, such as Education or Sex?
    • In Question 2 we see differences by Education and Sex.

I have decided to further focus my question by examining whether the association between nicotine dependence and depression differs based on how much a person smokes. I am wondering whether, at low levels of smoking compared to high levels, nicotine dependence is more common among individuals with major depression than among those without. I add relevant depression questions/items/variables to my personal codebook as well as several demographic measures (age, gender, ethnicity, education, etc.) and any other variables I may wish to consider.

All required variables have been found and added to my personal codebook (by expanding Class 03).

Citations are at the bottom of this document in the “References” section.

7.1.3 Class 26, Poster Preparation, complete content

Rubric

Organize the content of your poster.

Complete the content for each of these sections:

Title: Smoking behavior is (barely) associated with major depression in young adults

  1. (Class 25) (3 p) Introduction

    • (Lit Review) 2-4 bullets describing the study, previous research.
  2. (Class 24) (2 p) Research Questions

    • (Class 02, Personal Codebook) 2 bullets, one for each research question, stated as the alternative hypothesis.

The goals of the analysis include answering these two questions:

  • Is there an association between major depression and nicotine dependence?

  • Does the prevalence of nicotine dependence differ by ethnicity?

  3. (Class 24) (2 p) Methods
  • (Class 02, Personal Codebook) Data source(s).

A sample from the first wave of the National Epidemiologic Survey on Alcohol and Related Conditions (NESARC) (US adults). We focus on young adult (18–25) smokers (43093 of 43093 respondents).

In the Introduction to the National Epidemiologic Survey on Alcohol and Related Conditions they describe the sampling design. Adult respondents were sampled from a subset of households included in the US Census 2000 Supplementary Survey. To improve information about smaller groups, Blacks and Hispanics were oversampled by about 50%, as were young adults ages 18–24 by 125%. Using the US Census as a sampling frame helps to provide a representative sample of the country. Oversampling smaller subsets of the population means that comparisons involving those subsets will be more reliable.
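To make the subsetting concrete, here is a minimal sketch, assuming the Wave 1 data are loaded as `dat_nesarc` and that `CHECK321` indicates smoking in the past 12 months (both names are assumptions; verify them against your personal codebook):

```r
library(dplyr)

# Restrict Wave 1 respondents to young adult smokers (names assumed)
dat_young_smokers <-
  dat_nesarc %>%
  filter(
    AGE >= 18
  , AGE <= 25
  , CHECK321 == 1  # smoked cigarettes in the past 12 months (assumed item)
  )
```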

  • (Class 02, Personal Codebook) Variables used.

    • Lifetime major depression (Depression, 0=No, 1=Yes) (Bridget F. Grant et al. 2003; B. F. Grant et al. 1995).

    • Total cigarettes per month (TotalCigsSmoked = days smoked times daily cigs smoked, range: 0–NA), a combination of these questions: “About how often did you usually smoke in the past year?” and “On the days that you smoked in the last year, about how many cigarettes did you usually smoke?” A sketch of this derivation follows this list.

    • Nicotine dependence (NicotineDependence, 0=No, 1=Yes) in the last 12 months.

  • (Various) Statistical methods used to answer the research questions.

    • A two-sample \(t\)-test comparing the mean total cigarettes smoked by depression status.

    • A \(\chi^2\) analysis of a two-way contingency table of nicotine dependence by ethnicity.
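A sketch of the TotalCigsSmoked derivation referenced in the variable list above; the category-to-days mapping is illustrative (assumed), so check each recoded value against the NESARC codebook:

```r
library(dplyr)

dat_young_smokers <-
  dat_young_smokers %>%
  mutate(
    # Usual days smoked per month, recoded from the frequency item S3AQ3B1
    # (category meanings assumed here; verify against the codebook)
    DaysSmokedPerMonth =
      case_when(
        S3AQ3B1 == 1 ~ 30    # every day
      , S3AQ3B1 == 2 ~ 22    # 5 to 6 days a week
      , S3AQ3B1 == 3 ~ 14    # 3 to 4 days a week
      , S3AQ3B1 == 4 ~  5    # 1 to 2 days a week
      , S3AQ3B1 == 5 ~  2.5  # 2 to 3 days a month
      , S3AQ3B1 == 6 ~  1    # once a month or less
      )
    # Total cigarettes per month = days smoked times usual cigarettes per day
  , TotalCigsSmoked = DaysSmokedPerMonth * S3AQ3C1
  )
```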

  4. (Class 24) (3 p) Results for your first research question.

    • Plot and describe the data, as well as the statistical model. This can often be done in a single plot. Examples:

      • ANOVA: A mean with CI bars is the statistical model overlaid on the data points.

      • Contingency table: A mosaic plot with colored boxes relative to contribution to Pearson \(\chi^2\) shows the data with evidence towards the alternative hypothesis.

      • Simple linear regression: A regression line is the statistical model overlaid on the data points.

      • Logistic regression: The logistic curve is the statistical model overlaid on the top/bottom histograms of the data.

    • State the conclusion of the hypothesis test and interpret it in the context of the research question.

  • “Is the population mean square-root total cigarettes smoked different for those with depression or not?”
  • \(H_0: \mu_{D} = \mu_{ND}\) versus \(H_A: \mu_{D} \ne \mu_{ND}\)
(Figure: distribution of square-root total cigarettes smoked by Depression status; 297 observations with missing values were excluded from the plot.)
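As a hedged sketch (carrying the assumed `dat_young_smokers` name from the Methods sketches above), the call behind the output below would be:

```r
# Welch two-sample t-test of square-root total cigarettes smoked by depression status
t.test(TotalCigsSmoked_sqrt ~ Depression, data = dat_young_smokers)
```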

	Welch Two Sample t-test

data:  TotalCigsSmoked_sqrt by Depression
t = -19.317, df = 10794, p-value < 2.2e-16
alternative hypothesis: true difference in means between group No Depression and group Yes Depression is not equal to 0
95 percent confidence interval:
 -3.176909 -2.591550
sample estimates:
 mean in group No Depression mean in group Yes Depression
                    7.360964                    10.245194

	Welch Two Sample t-test

data:  TotalCigsSmoked by Depression
t = -17.226, df = 10439, p-value < 2.2e-16
alternative hypothesis: true difference in means between group No Depression and group Yes Depression is not equal to 0
95 percent confidence interval:
 -88.58909 -70.48702
sample estimates:
 mean in group No Depression mean in group Yes Depression
                    173.0296                     252.5677

  • Model assumptions are met: the sampling distribution of the difference in means is normal.

  • Because \(p = 9.24\times 10^{-82} < 0.05\) (with \(t_{s} = -19.32\)), we have sufficient evidence to reject \(H_0\) at an \(\alpha=0.05\) significance level, concluding that the mean total cigarettes smoked differs by depression status.

  • The difference in means is 2.88 on the square-root scale; squaring the group means gives 54.2 and 105.0, a difference of about 51 cigarettes per month.
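The back-transformation is plain arithmetic on the group means reported in the t-test output above:

```r
# Square-root-scale group means from the Welch t-test output
mean_sqrt_no  <-  7.360964
mean_sqrt_yes <- 10.245194
mean_sqrt_yes - mean_sqrt_no      # 2.88 on the square-root scale
mean_sqrt_no^2                    # 54.18 cigarettes per month
mean_sqrt_yes^2                   # 104.96 cigarettes per month
mean_sqrt_yes^2 - mean_sqrt_no^2  # a difference of about 51 cigarettes per month
```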

  5. (Class 24) (3 p) Results for your second research question.

    • Plot and describe the data, as well as the statistical model. This can often be done in a single plot.

    • State the conclusion of the hypothesis test and interpret it in the context of the research question.

  • Is there an association between NicotineDependence and Ethnicity?
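As a hedged sketch (data-frame and variable names assumed from earlier sections; the shaded mosaic plot requires the vcd package):

```r
# Two-way contingency table of nicotine dependence by ethnicity (names assumed)
tab_nd_eth <- xtabs(~ NicotineDependence + Ethnicity, data = dat_young_smokers)

chisq_res <- chisq.test(tab_nd_eth)
chisq_res            # Pearson chi-squared test of independence
chisq_res$expected   # assumption check: all expected counts should be at least 5

# Mosaic plot with boxes shaded by Pearson residuals
vcd::mosaic(tab_nd_eth, shade = TRUE)
```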

The model assumptions are met since the expected count for each cell is at least 5.

Because the p-value \(= 8.84\times 10^{-94} < 0.05\) (\(X^2 = 439\)), we reject the null hypothesis, concluding that there is an association between NicotineDependence and Ethnicity.

The primary cause of rejecting \(H_0\) is that Hispanics have less nicotine dependence than expected by chance.

  6. (Class 25) (4 p) Discussion

    • Put the results you found for each research question in the context you provided in your introduction.

These results support the previous literature that individuals with major depression are more sensitive to the development of nicotine dependence regardless of how much they smoke. Thus, people with depression are an important population subgroup for targeted smoking intervention programs.

  7. (Class 25) (1 p) Further directions or Future work or Next steps or something else that indicates there is more to do and you’ve thought about it.

    • What do these results lead you to want to investigate?

We can extend focus from young adults to all adults, and (with multiple regression techniques we’ll learn in ADA2) assess the particular depressed subpopulations most susceptible to developing nicotine dependence.

  8. (Class 25) (2 p) References

    • By citing sources in your introduction, this section will automatically have your bibliography.

References

References are supposed to appear here, but they may appear at the end of the document.

7.2 Class 27, Poster Preparation, into poster template

7.3 Class 28, Poster Preparation, reviewed by instructor

7.4 Class 29, Poster Presentations

  • Graduate students.

7.5 Class 30, Poster Presentations

  • Undergraduate students.

[End]


References

Breslau, Naomi. 1995. “Psychiatric Comorbidity of Smoking and Nicotine Dependence.” Behavior Genetics 25 (2): 95–101.
Breslau, N., E. L. Peterson, L. R. Schultz, H. D. Chilcoat, and P. Andreski. 1998. “Major Depression and Stages of Smoking. A Longitudinal Investigation.” Archives of General Psychiatry 55 (2): 161–66.
Caraballo, Ralph S., Scott P. Novak, and Katherine Asman. 2009. “Linking Quantity and Frequency Profiles of Cigarette Smoking to the Presence of Nicotine Dependence Symptoms Among Adolescent Smokers: Findings from the 2004 National Youth Tobacco Survey.” Nicotine & Tobacco Research, January, ntn008. https://doi.org/10.1093/ntr/ntn008.
Dierker, Lisa C., Shelli Avenevoli, Abbie Goldberg, and Meyer Glantz. 2004. “Defining Subgroups of Adolescents at Risk for Experimental and Regular Smoking.” Prevention Science: The Official Journal of the Society for Prevention Research 5 (3): 169–83.
Dierker, Lisa C., Shelli Avenevoli, Kathleen R. Merikangas, Brian P. Flaherty, and Marilyn Stolar. 2001. “Association Between Psychiatric Disorders and the Progression of Tobacco Use Behaviors.” Journal of the American Academy of Child & Adolescent Psychiatry 40 (10): 1159–67. https://doi.org/10.1097/00004583-200110000-00009.
Dierker, Lisa C., Eric Donny, Stephen Tiffany, Suzanne M. Colby, Nicholas Perrine, and Richard R. Clayton. 2007. “The Association Between Cigarette Smoking and DSM-IV Nicotine Dependence Among First Year College Students.” Drug and Alcohol Dependence 86 (2–3): 106–14. https://doi.org/10.1016/j.drugalcdep.2006.05.025.
Grant, B. F., T. C. Harford, D. A. Dawson, P. S. Chou, and R. P. Pickering. 1995. “The Alcohol Use Disorder and Associated Disabilities Interview Schedule (AUDADIS): Reliability of Alcohol and Drug Modules in a General Population Sample.” Drug and Alcohol Dependence 39 (1): 37–44.
Grant, Bridget F., Deborah A. Dawson, Frederick S. Stinson, Patricia S. Chou, Ward Kay, and Roger Pickering. 2003. “The Alcohol Use Disorder and Associated Disabilities Interview Schedule-IV (AUDADIS-IV): Reliability of Alcohol Consumption, Tobacco Use, Family History of Depression and Psychiatric Diagnostic Modules in a General Population Sample.” Drug and Alcohol Dependence 71 (1): 7–16.
Kandel, D. B., and K. Chen. 2000. “Extent of Smoking and Nicotine Dependence in the United States: 1991–1993.” Nicotine & Tobacco Research: Official Journal of the Society for Research on Nicotine and Tobacco 2 (3): 263–74.
Rohde, Paul, Christopher W. Kahler, Peter M. Lewinsohn, and Richard A. Brown. 2004. “Psychiatric Disorders, Familial Factors, and Cigarette Smoking: II. Associations with Progression to Daily Smoking.” Nicotine & Tobacco Research 6 (1): 119–32. https://doi.org/10.1080/14622200310001656948.
Rohde, Paul, Peter M. Lewinsohn, Richard A. Brown, Jeffrey M. Gau, and Christopher W. Kahler. 2003. “Psychiatric Disorders, Familial Factors and Cigarette Smoking: I. Associations with Smoking Initiation.” Nicotine & Tobacco Research 5 (1): 85–98. https://doi.org/10.1080/1462220031000070507.
Stanton, Warren R., John B. Lowe, and Phil A. Silva. 1995. “Antecedents of Vulnerability and Resilience to Smoking Among Adolescents.” Journal of Adolescent Health 16 (1): 71–77. https://doi.org/10.1016/1054-139X(94)00051-F.

Citation

BibTeX citation:
@online{erhardt2022,
  author = {Erik Erhardt},
  title = {ADA1: {Erik’s} Cumulative Project File},
  date = {2022-08-24},
  url = {https://statacumen.com/teaching/ada1/ada1-f22/},
  langid = {en}
}
For attribution, please cite this work as:
Erik Erhardt. 2022. “ADA1: Erik’s Cumulative Project File.” August 24, 2022. https://statacumen.com/teaching/ada1/ada1-f22/.