# ADA1: Erik’s cumulative project file

Model assignment: NESARC, Smoking and depression in adults

Advanced Data Analysis 1, Stat 427/527, Fall 2022, Prof. Erik Erhardt, UNM

Author

Erik Erhardt

Published

August 24, 2022

# 1 Document overview

Important

Please don’t let the initial size and detail of this document intimidate you. You got this!

This document is organized by Week and Class number. The worksheet assignments are indicated by the Class numbers.

• organize the document clearly (use this document as an example)
• label minor sections under each day (use this document as an example)
• For each thing you do, always have these three parts:
1. Say what you’re going to do and why.
2. Do it with code, and document your code.
3. Interpret the results.

## 1.1 Document

### 1.1.1 Naming

Note

Each class, save this file with a new name, updating the last two digits to the class number. Then you’ll have a record of your progress, as well as of which files you turned in for grading.

• ADA1_ALL_05.qmd
• ADA1_ALL_06.qmd
• ADA1_ALL_07.qmd

The version I prefer uses a date in Year-Month-Day (YYYYMMDD) format:

• ADA1_ALL_20220903.qmd
• ADA1_ALL_20220905.qmd
• ADA1_ALL_20220910.qmd
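If you like, the YYYYMMDD suffix can be generated in R rather than typed by hand. A minimal sketch (the `ADA1_ALL_` prefix is just the naming convention above; in practice you would use `Sys.Date()` instead of a fixed date):

```r
# build a class-file name with a date in YYYYMMDD form
# (a fixed date is used here so the result is reproducible; use Sys.Date() in practice)
fname <- paste0("ADA1_ALL_", format(as.Date("2022-09-03"), "%Y%m%d"), ".qmd")
fname
# [1] "ADA1_ALL_20220903.qmd"
```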

### 1.1.2 Structure

We will include all of our assignments together in this document, since our analysis is cumulative and each assignment relies on information from the previous ones. You will also have opportunities to revisit previous parts to make changes or improvements, such as updating your codebook, recoding variables, and improving tables and plots. I’ve provided an initial proposed organization of our sections and subsections using the # and ## symbols. A table of contents is generated automatically by “toc: true” in the YAML header, and the headings in the table of contents are clickable, jumping to each (sub)section.

### 1.1.3 Classes not appearing in this document

Some assignments are in a separate worksheet and are indicated with “(separate worksheet)”. For these, I’ll provide a dataset I want you to analyze. Typically, you’ll then return to this document and repeat the same type of analysis with your dataset.

# 2 Research Questions

## 2.1 Class 02, Personal Codebook

Rubric

1. (1 p) Is there a topic of interest?

2. (2 p) Are the variables relevant to a set of research questions?

3. (4 p) Are there at least 2 categorical and 2 numerical variables (at least 4 “data” variables)?

• 1 categorical variable with only 2 levels
• 1 categorical variable with at least 3 levels
• 2 numerical variables with many possible unique values
• More variables are welcome and you’re likely to add to this later in the semester
4. (3 p) For each variable, is there a variable description, a data type, and coded value descriptions?

5. Compile this qmd file to an html, print/save to pdf, and upload to UNM Canvas.

### 2.1.1 Topic and research questions

Topic:

Is there an association between nicotine dependence and the frequency and quantity of smoking in adults? The association may differ by ethnicity, age, gender, and other factors (though we won’t be able to test these additional associations until next semester in ADA2).

Research questions:

1. Is nicotine dependence [S3AQ10D] associated with smoking frequency [S3AQ3B1] and quantity [S3AQ3C1]?

2. Is nicotine dependence [S3AQ10D] associated with depression [S4AQ1]?

3. Is the association between nicotine dependence [S3AQ10D] and depression [S4AQ1] different by demographics, such as Education or Sex?

# 3 Codebook

National Epidemiologic Survey on Alcohol and Related Conditions-III (NESARC-III)

Dataset: NESARC
Primary association: nicotine dependence vs frequency and quantity of smoking

Key:
RenamedVarName
VarName original in dataset
Variable description
Data type (Continuous, Discrete, Nominal, Ordinal)
Frequency ItemValue Description

ID
IDNUM
UNIQUE ID NUMBER WITH NO ALPHABETICS
Categorical, Nominal
43093 1-43093. Unique Identification number

Sex
SEX
SEX
Categorical, Nominal
18518 1. Male
24575 2. Female

Age
AGE
AGE
Numerical, Continuous
43079 18-97. Age in years
14 98. 98 years or older

SmokingStatus
CHECK321
CIGARETTE SMOKING STATUS
Categorical, Nominal
9913 1. Smoked cigarettes in the past 12 months
8078 2. Smoked cigarettes prior to the last 12 months
22 9. Unknown
25080 BL. NA, never or unknown if ever smoked 100+ cigarettes

NicotineDependence
TAB12MDX
NICOTINE DEPENDENCE IN THE LAST 12 MONTHS
Categorical, Nominal
38131 0. No nicotine dependence
4962 1. Nicotine dependence

SmokingFreq
S3AQ3B1
USUAL FREQUENCY WHEN SMOKED CIGARETTES
Categorical, Ordinal
14836 1. Every day
460 2. 5 to 6 Day(s) a week
687 3. 3 to 4 Day(s) a week
747 4. 1 to 2 Day(s) a week
409 5. 2 to 3 Day(s) a month
772 6. Once a month or less
102 9. Unknown
25080 BL. NA, never or unknown if ever smoked 100+ cigarettes

DailyCigsSmoked
S3AQ3C1
USUAL QUANTITY WHEN SMOKED CIGARETTES
Numerical, Discrete
17751 1-98. Cigarette(s)
262 99. Unknown
25080 BL. NA, never or unknown if ever smoked 100+ cigarettes

Ethnicity
ETHRACE2A
IMPUTED RACE/ETHNICITY
Categorical, Nominal
24507 1. White, Not Hispanic or Latino
8245 2. Black, Not Hispanic or Latino
701 3. American Indian/Alaska Native, Not Hispanic or Latino
1332 4. Asian/Native Hawaiian/Pacific Islander, Not Hispanic or Latino
8308 5. Hispanic or Latino

Depression
MAJORDEPLIFE
Categorical, Nominal
35254 0. No
7839 1. Yes

IncomePersonal
S1Q10A
TOTAL PERSONAL INCOME IN LAST 12 MONTHS
Numerical, Continuous
43093 0-3000000. Income in dollars

IncomePersonalCat
S1Q10B
TOTAL PERSONAL INCOME IN LAST 12 MONTHS: CATEGORY
Categorical, Ordinal
2462  0. $0 (No personal income)
3571  1. $1 to $4,999
3823  2. $5,000 to $7,999
2002  3. $8,000 to $9,999
3669  4. $10,000 to $12,999
1634  5. $13,000 to $14,999
3940  6. $15,000 to $19,999
3887  7. $20,000 to $24,999
3085  8. $25,000 to $29,999
3003  9. $30,000 to $34,999
2351 10. $35,000 to $39,999
3291 11. $40,000 to $49,999
2059 12. $50,000 to $59,999
1328 13. $60,000 to $69,999
857 14. $70,000 to $79,999
521 15. $80,000 to $89,999
290 16. $90,000 to $99,999
1320 17. $100,000 or more

Height_ft
S1Q24FT
HEIGHT: FEET
Numerical, Continuous
42363 4-7. Feet
730 99. Unknown
* change 99 to NA

Height_in
S1Q24IN
HEIGHT: INCHES
Numerical, Continuous
3572 0. None
38760 1-11. Inches
761 99. Unknown
* change 99 to NA

Weight_lb
S1Q24LB
WEIGHT: POUNDS
Numerical, Continuous
41717 62-500. Pounds
1376 999. Unknown
* change 999 to NA

Additional variables were created from the original variables:

CREATED VARIABLES

DaysSmoke
estimated number of days per month a subject smokes
Numerical, Continuous (due to way estimated), assumes 30 days per month using SmokingFreq
1-30.

TotalCigsSmoked
estimated number of cigarettes per month a subject smokes (DaysSmoke * DailyCigsSmoked)
Numerical, Continuous
0-large.

TotalCigsSmoked_smokers
estimated number of cigarettes per month a subject smokes (DaysSmoke * DailyCigsSmoked)
Smokers, only! (replaced 0s with NAs)
Continuous
1-large.

CigsSmokedFac
quartiles of TotalCigsSmoked
Categorical, Ordinal
ranges for each of the four quarters

SmokingFreq3
collapsed smoking frequency: daily, weekly, monthly, plus never
Categorical, Ordinal
from SmokingFreq
1. daily   = 1. Every day, 2. 5 to 6 Day(s) a week, 3. 3 to 4 Day(s) a week
2. weekly  = 4. 1 to 2 Day(s) a week, 5. 2 to 3 Day(s) a month
3. monthly = 6. Once a month or less
4. Never   = 7. Never

Height_inches
Total height in inches
Numerical, Continuous
Height_ft * 12 + Height_in

# 4 Data Management

## 4.1 Class 03, Data subset and numerical summaries

Rubric

1. (4 p) The data are loaded and a data.frame subset is created by selecting only the variables in the personal codebook.

• Scroll down to sections labeled “(Class 03)”.
2. (1 p) Output confirms the subset is correct (e.g., using dim() and str()).

3. (3 p) Rename your variables to descriptive names (e.g., from “S3AQ3B1” to “SmokingFreq”).

• Scroll down to sections labeled “(Class 03)”.
4. (2 p) Provide numerical summaries for all variables (e.g., using summary()).

• Scroll down to sections labeled “(Class 03)”.

### 4.1.1 Data subset (Class 03)

First, read the data and see the size (dim()).

# data analysis packages
library(erikmisc)   # Helpful functions
── Attaching packages ─────────────────────────────────────── erikmisc 0.1.16 ──
✔ tibble 3.1.8     ✔ dplyr  1.0.9
── Conflicts ─────────────────────────────────────────── erikmisc_conflicts() ──
✖ dplyr::lag()    masks stats::lag()
erikmisc, solving common complex data analysis workflows
by Dr. Erik Barry Erhardt <erik@StatAcumen.com>
library(tidyverse)  # Data manipulation and visualization suite
── Attaching packages
───────────────────────────────────────
tidyverse 1.3.2 ──
✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tidyr   1.2.0     ✔ stringr 1.4.1
✔ readr   2.1.2     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::lag()    masks stats::lag()
library(lubridate)  # Dates

Attaching package: 'lubridate'

The following objects are masked from 'package:base':

date, intersect, setdiff, union
ggplot2::theme_set(ggplot2::theme_bw())  # set theme_bw for all plots

## 2. Use the load() statement for the dataset you want to use.

dim(NESARC)
[1] 43093  3008

Select the variables to include in our subset.

# variables to include in our data subset
nesarc_sub <-
NESARC %>%
select(
IDNUM
, SEX
, AGE
, CHECK321
, TAB12MDX
, S3AQ3B1
, S3AQ3C1
, ETHRACE2A
, MAJORDEPLIFE
, S1Q10A
, S1Q10B
, S1Q24FT
, S1Q24IN
, S1Q24LB
)

# Check size and structure of data subset.
# Note that categorical variables are already Factor variables, but the levels do not have meaningful labels, yet.
# size of subset
dim(nesarc_sub)
[1] 43093    14
# structure of subset
str(nesarc_sub)
'data.frame':   43093 obs. of  14 variables:
 $ IDNUM       : Factor w/ 43093 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ SEX         : Factor w/ 2 levels "1","2": 1 2 2 1 1 2 1 2 2 2 ...
 $ AGE         : num  23 28 81 18 36 34 19 84 29 18 ...
 $ CHECK321    : Factor w/ 3 levels "1","2","9": NA NA NA NA NA NA NA NA NA NA ...
 $ TAB12MDX    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ S3AQ3B1     : Factor w/ 7 levels "1","2","3","4",..: NA NA NA NA NA NA NA NA NA NA ...
 $ S3AQ3C1     : num  NA NA NA NA NA NA NA NA NA NA ...
 $ ETHRACE2A   : Factor w/ 5 levels "1","2","3","4",..: 5 5 5 5 2 2 2 1 1 5 ...
 $ MAJORDEPLIFE: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 2 1 ...
 $ S1Q10A      : num  17500 11000 6000 27000 42000 500 2500 24000 65000 4000 ...
 $ S1Q10B      : Factor w/ 18 levels "0","1","2","3",..: 7 5 3 9 12 2 2 8 14 2 ...
 $ S1Q24FT     : num  5 5 5 5 6 5 5 5 5 5 ...
 $ S1Q24IN     : num  7 1 4 7 1 5 9 4 4 10 ...
 $ S1Q24LB     : num  195 127 185 180 220 140 160 120 140 150 ...

### 4.1.2 Renaming Variables (Class 03)

nesarc_sub <-
nesarc_sub %>%
rename(
ID                  = IDNUM
, Sex                 = SEX
, Age                 = AGE
, SmokingStatus       = CHECK321
, NicotineDependence  = TAB12MDX
, SmokingFreq         = S3AQ3B1
, DailyCigsSmoked     = S3AQ3C1
, Ethnicity           = ETHRACE2A
, Depression          = MAJORDEPLIFE
, IncomePersonal      = S1Q10A
, IncomePersonalCat   = S1Q10B
, Height_ft           = S1Q24FT
, Height_in           = S1Q24IN
, Weight_lb           = S1Q24LB
)

str(nesarc_sub)
'data.frame':   43093 obs. of  14 variables:
 $ ID                : Factor w/ 43093 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ Sex               : Factor w/ 2 levels "1","2": 1 2 2 1 1 2 1 2 2 2 ...
 $ Age               : num  23 28 81 18 36 34 19 84 29 18 ...
 $ SmokingStatus     : Factor w/ 3 levels "1","2","9": NA NA NA NA NA NA NA NA NA NA ...
 $ NicotineDependence: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ SmokingFreq       : Factor w/ 7 levels "1","2","3","4",..: NA NA NA NA NA NA NA NA NA NA ...
 $ DailyCigsSmoked   : num  NA NA NA NA NA NA NA NA NA NA ...
 $ Ethnicity         : Factor w/ 5 levels "1","2","3","4",..: 5 5 5 5 2 2 2 1 1 5 ...
 $ Depression        : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 2 1 ...
 $ IncomePersonal    : num  17500 11000 6000 27000 42000 500 2500 24000 65000 4000 ...
 $ IncomePersonalCat : Factor w/ 18 levels "0","1","2","3",..: 7 5 3 9 12 2 2 8 14 2 ...
 $ Height_ft         : num  5 5 5 5 6 5 5 5 5 5 ...
 $ Height_in         : num  7 1 4 7 1 5 9 4 4 10 ...
 $ Weight_lb         : num  195 127 185 180 220 140 160 120 140 150 ...

### 4.1.3 Coding missing values (Class 04)

There are two steps. The first step is to recode any existing NAs to actual values, if necessary. The method for doing this differs for numeric and categorical variables. The second step is to recode any coded missing values, such as 9s or 99s, as actual NA.
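As a minimal illustration of the two steps on hypothetical toy data (base R here; the project code below uses tidyverse equivalents):

```r
# hypothetical frequency variable: NA means "never", 9 is a coded "unknown"
freq <- c(1, 2, NA, 9, NA, 3)

# step 1: recode existing NA to a meaningful value (here 7 = "never")
freq[is.na(freq)] <- 7

# step 2: recode the coded missing value 9 to a true NA
freq[freq == 9] <- NA

freq
# [1]  1  2  7 NA  7  3
```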

#### 4.1.3.1 Coding NAs as meaningful “missing”

Note

Most of you will not need to code this type of missing.

First step: the existing blank (NA) values mean “never”, and “never” has a meaning different from “missing”. For each variable we need to decide what “never” means and code it appropriately.

##### 4.1.3.1.1 NAs recoded as numeric

The easiest is DailyCigsSmoked where NA means 0 cigarettes. We replace NA with 0. The function is.na() identifies each observation that “is an NA”. I’ve updated the codebook to indicate that the original NA values were changed.

S3AQ3C1 (DailyCigsSmoked)
USUAL QUANTITY WHEN SMOKED CIGARETTES
Discrete
17751 1-98. Cigarette(s)
262 99. Unknown
25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
* change to 0 Cigarette(s)
table(nesarc_sub$DailyCigsSmoked)

    1     2     3     4     5     6     7     8     9    10    11    12    13    14    15    16
  934   884   923   573  1070   463   269   299    49  3077    23   230    34    25   851    40
   17    18    19    20    21    22    23    24    25    27    28    29    30    33    34    35
   22    59     5  5366     1    10     2     7   155     2     3     3   909     1     1    30
   37    39    40    45    50    55    57    60    66    70    75    80    98    99
    2     1   993     8   106     2     1   241     1    12     2    47    15   262

nesarc_sub <-
  nesarc_sub %>%
  replace_na(
    # replace NA with new numeric value
    list(
      DailyCigsSmoked = 0
    )
  )

table(nesarc_sub$DailyCigsSmoked)

0     1     2     3     4     5     6     7     8     9    10    11    12
25080   934   884   923   573  1070   463   269   299    49  3077    23   230
13    14    15    16    17    18    19    20    21    22    23    24    25
34    25   851    40    22    59     5  5366     1    10     2     7   155
27    28    29    30    33    34    35    37    39    40    45    50    55
2     3     3   909     1     1    30     2     1   993     8   106     2
57    60    66    70    75    80    98    99
1   241     1    12     2    47    15   262 
##### 4.1.3.1.2 NAs recoded as categorical

The next two variables require defining a new category level. Both SmokingStatus and SmokingFreq with a value of NA means “never”, so we need to create a new “Never” category and put in order with the rest of the levels. I’ve updated the codebook to indicate that the original NA values were changed.

CHECK321 (SmokingStatus)
CIGARETTE SMOKING STATUS
Nominal
9913 1. Smoked cigarettes in the past 12 months
8078 2. Smoked cigarettes prior to the last 12 months
22 9. Unknown
25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
* change to 3. Never

S3AQ3B1 (SmokingFreq)
USUAL FREQUENCY WHEN SMOKED CIGARETTES
Ordinal
14836 1. Every day
460 2. 5 to 6 Day(s) a week
687 3. 3 to 4 Day(s) a week
747 4. 1 to 2 Day(s) a week
409 5. 2 to 3 Day(s) a month
772 6. Once a month or less
102 9. Unknown
25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
* change to 7. Never
table(nesarc_sub$SmokingStatus)

    1     2     9
 9913  8078    22

table(nesarc_sub$SmokingFreq)

1     2     3     4     5     6     9
14836   460   687   747   409   772   102 
nesarc_sub <-
nesarc_sub %>%
mutate(
# first change the "factor" variables to "character" to add a new level
SmokingStatus   = as.character(SmokingStatus)
, SmokingFreq     = as.character(SmokingFreq  )
) %>%
replace_na(
# replace NA with new category values
list(
SmokingStatus = "3"
, SmokingFreq   = "7"
)
)

table(nesarc_sub$SmokingStatus)

    1     2     3     9
 9913  8078 25080    22

table(nesarc_sub$SmokingFreq)

1     2     3     4     5     6     7     9
14836   460   687   747   409   772 25080   102 

#### 4.1.3.2 Coding 9s and 99s as NAs

Note

Everyone will need to do this.

Second step: replace the coded missing values with NA, then remove the remaining records with missing values. Not all of these variables need this operation, since the previous subset commands happened to remove some missing values. However, I’m going to run each as an extra precaution. After all, if we later make different data decisions upstream, we could end up with missing values at this point that are incorrectly coded without this step!

table(nesarc_sub$SmokingStatus)

    1     2     3     9
 9913  8078 25080    22

table(nesarc_sub$SmokingFreq)

1     2     3     4     5     6     7     9
14836   460   687   747   409   772 25080   102 
table(nesarc_sub$DailyCigsSmoked)

    0     1     2     3     4     5     6     7     8     9    10    11    12
25080   934   884   923   573  1070   463   269   299    49  3077    23   230
   13    14    15    16    17    18    19    20    21    22    23    24    25
   34    25   851    40    22    59     5  5366     1    10     2     7   155
   27    28    29    30    33    34    35    37    39    40    45    50    55
    2     3     3   909     1     1    30     2     1   993     8   106     2
   57    60    66    70    75    80    98    99
    1   241     1    12     2    47    15   262

nesarc_sub <-
  nesarc_sub %>%
  mutate(
    # replace missing values, factor variables use quotes, numeric no quotes
    SmokingStatus   = replace(SmokingStatus  , SmokingStatus   %in% c("9"), NA)
  , SmokingFreq     = replace(SmokingFreq    , SmokingFreq     %in% c("9"), NA)
  , DailyCigsSmoked = replace(DailyCigsSmoked, DailyCigsSmoked %in% c( 99), NA)
  , Height_ft       = replace(Height_ft      , Height_ft       %in% c( 99), NA)
  , Height_in       = replace(Height_in      , Height_in       %in% c( 99), NA)
  , Weight_lb       = replace(Weight_lb      , Weight_lb       %in% c(999), NA)
    # drop unused factor levels (not for numeric)
  , SmokingStatus   = fct_drop(SmokingStatus)
  , SmokingFreq     = fct_drop(SmokingFreq  )
  )

table(nesarc_sub$SmokingStatus)

1     2     3
9913  8078 25080 
table(nesarc_sub$SmokingFreq)

    1     2     3     4     5     6     7
14836   460   687   747   409   772 25080

table(nesarc_sub$DailyCigsSmoked)

0     1     2     3     4     5     6     7     8     9    10    11    12
25080   934   884   923   573  1070   463   269   299    49  3077    23   230
13    14    15    16    17    18    19    20    21    22    23    24    25
34    25   851    40    22    59     5  5366     1    10     2     7   155
27    28    29    30    33    34    35    37    39    40    45    50    55
2     3     3   909     1     1    30     2     1   993     8   106     2
57    60    66    70    75    80    98
1   241     1    12     2    47    15 
summary(nesarc_sub)
       ID        Sex            Age       SmokingStatus NicotineDependence
1      :    1   1:18518   Min.   :18.0   1   : 9913    0:38131
2      :    1   2:24575   1st Qu.:32.0   2   : 8078    1: 4962
3      :    1             Median :44.0   3   :25080
4      :    1             Mean   :46.4   NA's:   22
5      :    1             3rd Qu.:59.0
6      :    1             Max.   :98.0
(Other):43087
SmokingFreq    DailyCigsSmoked  Ethnicity Depression IncomePersonal
7      :25080   Min.   : 0.000   1:24507   0:35254    Min.   :      0
1      :14836   1st Qu.: 0.000   2: 8245   1: 7839    1st Qu.:   8800
6      :  772   Median : 0.000   3:  701              Median :  20000
4      :  747   Mean   : 6.462   4: 1332              Mean   :  28185
3      :  687   3rd Qu.:10.000   5: 8308              3rd Qu.:  36000
(Other):  869   Max.   :98.000                        Max.   :3000000
NA's   :  102   NA's   :262
IncomePersonalCat   Height_ft      Height_in        Weight_lb
6      : 3940     Min.   :4.00   Min.   : 0.000   Min.   : 62.0
7      : 3887     1st Qu.:5.00   1st Qu.: 3.000   1st Qu.:140.0
2      : 3823     Median :5.00   Median : 5.000   Median :165.0
4      : 3669     Mean   :5.11   Mean   : 5.279   Mean   :170.6
1      : 3571     3rd Qu.:5.00   3rd Qu.: 8.000   3rd Qu.:192.0
11     : 3291     Max.   :7.00   Max.   :11.000   Max.   :500.0
(Other):20912     NA's   :730    NA's   :761      NA's   :1376   

#### 4.1.3.3 Removing rows with missing values (don’t do this)

Note

Do not do this.

Finally, consider using na.omit() to remove any records with missing values if it won’t cause issues with analysis. The command na.omit() removes a row if any columns have an NA in that row. This can have drastic consequences. For example, imagine that you have 7 variables and 6 variables have complete data, but the 7th variable is about three-quarters NAs! Then na.omit() will drop three-quarters of your data, even though most of your variables are complete!

There’s only one advantage I can think of for running na.omit(), and that’s if you have a large dataset and only a few observations have a few NAs. You can’t analyze data with NAs, since any calculation involving NA results in an NA. Therefore imagine two scenarios: (1) An analysis with Variables 1 and 2, each with a few NA distributed throughout; first the observations with NA for either Variable 1 or 2 will be removed, then the analysis will proceed. (2) A similar analysis, but now with Variables 1 and 3, but Variable 3 has NA for different observations than Variable 2 — therefore, different observations will be removed before analysis. In summary, these two analyses will include different observations from the dataset. Therefore, by running na.omit() you will start with a dataset with no NAs, thus every analysis will be of exactly the same set of data (at the cost of not using all the possible data for each individual analysis).
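A small base-R demonstration of the drastic-consequences point, with hypothetical data where one column is three-quarters NA:

```r
# hypothetical data: x and y are complete, z is mostly missing
d <- data.frame(
  x = 1:8
, y = letters[1:8]
, z = c(10, NA, NA, NA, 20, NA, NA, NA)
)

nrow(d)           # 8 rows before
nrow(na.omit(d))  # 2 rows after: only the rows where z is observed survive
```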

Below, I do not run na.omit() since I’m concerned about having lots of NAs for some variables.

dim(nesarc_sub)
[1] 43093    14
# remove rows with missing values
#nesarc_sub <- na.omit(nesarc_sub)
dim(nesarc_sub)
[1] 43093    14
summary(nesarc_sub)
       ID        Sex            Age       SmokingStatus NicotineDependence
1      :    1   1:18518   Min.   :18.0   1   : 9913    0:38131
2      :    1   2:24575   1st Qu.:32.0   2   : 8078    1: 4962
3      :    1             Median :44.0   3   :25080
4      :    1             Mean   :46.4   NA's:   22
5      :    1             3rd Qu.:59.0
6      :    1             Max.   :98.0
(Other):43087
SmokingFreq    DailyCigsSmoked  Ethnicity Depression IncomePersonal
7      :25080   Min.   : 0.000   1:24507   0:35254    Min.   :      0
1      :14836   1st Qu.: 0.000   2: 8245   1: 7839    1st Qu.:   8800
6      :  772   Median : 0.000   3:  701              Median :  20000
4      :  747   Mean   : 6.462   4: 1332              Mean   :  28185
3      :  687   3rd Qu.:10.000   5: 8308              3rd Qu.:  36000
(Other):  869   Max.   :98.000                        Max.   :3000000
NA's   :  102   NA's   :262
IncomePersonalCat   Height_ft      Height_in        Weight_lb
6      : 3940     Min.   :4.00   Min.   : 0.000   Min.   : 62.0
7      : 3887     1st Qu.:5.00   1st Qu.: 3.000   1st Qu.:140.0
2      : 3823     Median :5.00   Median : 5.000   Median :165.0
4      : 3669     Mean   :5.11   Mean   : 5.279   Mean   :170.6
1      : 3571     3rd Qu.:5.00   3rd Qu.: 8.000   3rd Qu.:192.0
11     : 3291     Max.   :7.00   Max.   :11.000   Max.   :500.0
(Other):20912     NA's   :730    NA's   :761      NA's   :1376   

Now we (would) have a dataset with no missing values.

### 4.1.4 Creating new variables (Class 04+)

#### 4.1.4.1 From categories to numeric

Note

Most of you will not need to do this.

DaysSmoke estimates the days per month a subject smokes by mapping each level of SmokingFreq (a factor with 7 levels, including “Never”) to the midpoint of the number of days that level represents, assuming 30 days per month.

nesarc_sub <-
nesarc_sub %>%
mutate(
DaysSmoke =
case_when(
SmokingFreq == 1 ~ 30      # 1. Every day
, SmokingFreq == 2 ~ 4 * 5.5 # 2. 5 to 6 Day(s) a week
, SmokingFreq == 3 ~ 4 * 3.5 # 3. 3 to 4 Day(s) a week
, SmokingFreq == 4 ~ 4 * 1.5 # 4. 1 to 2 Day(s) a week
, SmokingFreq == 5 ~ 2.5     # 5. 2 to 3 Day(s) a month
, SmokingFreq == 6 ~ 1       # 6. Once a month or less
, SmokingFreq == 7 ~ 0       # 7. Never
)
)

summary(nesarc_sub$DaysSmoke)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   0.00    0.00    0.00   10.96   30.00   30.00     102

The next example has many levels. Note that there is already a numeric income variable in this dataset, but this categorical situation occurs frequently enough to warrant an example.

table(nesarc_sub$IncomePersonalCat)

0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
2462 3571 3823 2002 3669 1634 3940 3887 3085 3003 2351 3291 2059 1328  857  521
16   17
290 1320 
# create a numeric version of this categorical variable
nesarc_sub <-
nesarc_sub %>%
mutate(
IncomePersonalCat_num =
case_when(
      IncomePersonalCat ==  0 ~      0 #  $0 (No personal income)
    , IncomePersonalCat ==  1 ~   2500 #  $1 to $4,999
    , IncomePersonalCat ==  2 ~   6500 #  $5,000 to $7,999
    , IncomePersonalCat ==  3 ~   9000 #  $8,000 to $9,999
    , IncomePersonalCat ==  4 ~  11500 #  $10,000 to $12,999
    , IncomePersonalCat ==  5 ~  14000 #  $13,000 to $14,999
    , IncomePersonalCat ==  6 ~  17500 #  $15,000 to $19,999
    , IncomePersonalCat ==  7 ~  22500 #  $20,000 to $24,999
    , IncomePersonalCat ==  8 ~  27500 #  $25,000 to $29,999
    , IncomePersonalCat ==  9 ~  32500 #  $30,000 to $34,999
    , IncomePersonalCat == 10 ~  37500 #  $35,000 to $39,999
    , IncomePersonalCat == 11 ~  45000 #  $40,000 to $49,999
    , IncomePersonalCat == 12 ~  55000 #  $50,000 to $59,999
    , IncomePersonalCat == 13 ~  65000 #  $60,000 to $69,999
    , IncomePersonalCat == 14 ~  75000 #  $70,000 to $79,999
    , IncomePersonalCat == 15 ~  85000 #  $80,000 to $89,999
    , IncomePersonalCat == 16 ~  95000 #  $90,000 to $99,999
    , IncomePersonalCat == 17 ~ 120000 #  $100,000 or more
)
)

table(nesarc_sub$IncomePersonalCat_num)

     0   2500   6500   9000  11500  14000  17500  22500  27500  32500  37500
  2462   3571   3823   2002   3669   1634   3940   3887   3085   3003   2351
 45000  55000  65000  75000  85000  95000 120000
  3291   2059   1328    857    521    290   1320

#### 4.1.4.2 From numeric to numeric

Height_inches is the height of a person in inches. Since the data were collected in two variables (Height_ft and Height_in), we add them together for our new variable.

nesarc_sub <-
  nesarc_sub %>%
  mutate(
    Height_inches = Height_ft * 12 + Height_in
  )

summary(nesarc_sub$Height_inches)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
48.0    64.0    66.0    66.6    70.0    84.0     761 
Note

Most of you will need to transform a variable to address extreme right skewness.

The variable TotalCigsSmoked estimates the number of cigarettes a subject smokes per month by multiplying DaysSmoke by DailyCigsSmoked (the usual quantity smoked per day).

Log scale. I like log2 since its interpretation is that every unit is a doubling. Notice that in the code below, values of 0 for TotalCigsSmoked are set to NA before computing the log, because log(0) is undefined; the logged variable is therefore defined only for smokers.

Square root scale. Square root is another transformation that spreads out values between 0 and 1 and compresses values from 1 to infinity. It can be preferred to a log transformation in some cases.
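A quick numeric check of both interpretations (base R, illustrative values only): each doubling adds exactly 1 on the log2 scale, and the square root spreads out values below 1 while compressing values above 1.

```r
# powers of two: each doubling adds exactly 1 on the log2 scale
log2(c(1, 2, 4, 8))       # 0 1 2 3

# square root: values in (0, 1) move up, large values are pulled in sharply
sqrt(c(0.25, 1, 4, 100))  # 0.5 1 2 10
```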

nesarc_sub <-
nesarc_sub %>%
mutate(
DaysSmoke_log2       = log2(DaysSmoke)
, TotalCigsSmoked      = DaysSmoke * DailyCigsSmoked
, TotalCigsSmoked_log2 = ifelse(TotalCigsSmoked > 0, log2(TotalCigsSmoked), NA)
, TotalCigsSmoked_sqrt = sqrt(TotalCigsSmoked)
, TotalCigsSmoked_smokers      = ifelse(TotalCigsSmoked > 0, TotalCigsSmoked, NA)
, TotalCigsSmoked_smokers_log2 = TotalCigsSmoked_log2
)

summary(nesarc_sub$TotalCigsSmoked)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
    0.0     0.0     0.0   187.5   300.0  2940.0     297

summary(nesarc_sub$TotalCigsSmoked_smokers)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
1.0   150.0   360.0   453.1   600.0  2940.0   25377 

#### 4.1.4.3 From numeric to categories based on quantiles

The numeric variable TotalCigsSmoked is converted into a factor with five levels.

nesarc_sub <-
nesarc_sub %>%
mutate(
CigsSmokedFac  =
case_when(
TotalCigsSmoked == 0                              ~ "No smoking (0)"
, (TotalCigsSmoked >  0) & (TotalCigsSmoked <=   6) ~ "Lowest smoking (1-6)"
, (TotalCigsSmoked >  6) & (TotalCigsSmoked <=  30) ~ "Low smoking (7-30)"
, (TotalCigsSmoked > 30) & (TotalCigsSmoked <= 300) ~ "Medium smoking (31-300)"
, (TotalCigsSmoked > 300)                           ~ "High smoking (301+)"
)
, CigsSmokedFac =
factor(
CigsSmokedFac
, levels = c(
"No smoking (0)"
, "Lowest smoking (1-6)"
, "Low smoking (7-30)"
, "Medium smoking (31-300)"
, "High smoking (301+)"
)
)
)

summary(nesarc_sub$TotalCigsSmoked)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
    0.0     0.0     0.0   187.5   300.0  2940.0     297

summary(nesarc_sub$CigsSmokedFac)
         No smoking (0)    Lowest smoking (1-6)      Low smoking (7-30)
25080                    1045                    1129
Medium smoking (31-300)     High smoking (301+)                    NA's
6456                    9086                     297 
table(nesarc_sub$CigsSmokedFac)

         No smoking (0)    Lowest smoking (1-6)      Low smoking (7-30)
                  25080                    1045                    1129
Medium smoking (31-300)     High smoking (301+)
                   6456                    9086

Finer points: cutting a numeric variable into equal-sized categories. The numeric variable TotalCigsSmoked is converted into a factor, CigsSmokedFac2, with roughly equal numbers of observations (quartiles) in each level, using the .bincode() function. The label "Q1" indicates that the original value was in the first quarter (25%) of the ordered data, that is, among the smallest 25% of observations; "Q4" indicates the original value was in the last quarter, that is, among the largest 25% of observations.

quantile(nesarc_sub$TotalCigsSmoked, na.rm = TRUE)
  0%  25%  50%  75% 100%
0    0    0  300 2940 
nesarc_sub <-
nesarc_sub %>%
mutate(
CigsSmokedFac2 =
.bincode(TotalCigsSmoked
, breaks = quantile(TotalCigsSmoked, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)
, include.lowest = TRUE
)
, CigsSmokedFac2 = factor( CigsSmokedFac2
, levels = c(1, 2, 3, 4)
, labels = c("Q1", "Q2", "Q3", "Q4")
)
)

In my data, more than half of the values are 0s which is the minimum value, thus they all get grouped in “Q1”; this leaves no values for “Q2” – oh well.

summary(nesarc_sub$CigsSmokedFac2)

   Q1    Q2    Q3    Q4  NA's
    0     0     0     0 43093

#### 4.1.4.4 From many categories to a few

The variable SmokingFreq3 collapses the 6 smoking-frequency categories of SmokingFreq down to 3 (daily, weekly, monthly), keeping "never" as its own fourth level. This is an example to show you how to do this with your own variables. Some of you will have categorical variables with a dozen or so categories; having so many categories makes interpretation very difficult, and fewer is easier.

## SmokingFreq3 variable
# notice that I'm doing this before I've assigned the labels
# if you do this after assigning labels (with the factor() command),
#   then you'll want to use "as.numeric()" as below

# inspect original variable
table(nesarc_sub$SmokingFreq)

1     2     3     4     5     6     7
14836   460   687   747   409   772 25080 
# Note: it is sometimes easier to use the numeric "levels" of a factor variable, than the character "labels"

# initialize new variable with NAs
nesarc_sub <-
nesarc_sub %>%
mutate(
SmokingFreq3 =
fct_collapse(
SmokingFreq
, "1" = c("1", "2", "3")  # 1. daily
, "2" = c("4", "5")       # 2. weekly
, "3" = "6"               # 3. monthly
, "4" = "7"               # 7. never
)
)

# new variable
table(nesarc_sub$SmokingFreq3)

    1     2     3     4
15983  1156   772 25080

# we will label the new factor variable below, with the others

#### 4.1.4.5 Working with Dates

Note

Most of you will not need to work with dates.

How to calculate the duration between two dates. (You probably won’t need to use this for your project.)

Dates can be tricky in R. The lubridate package makes them much easier. Below I’m going to create a variable for the person’s age in years based on the difference between the person’s date of birth and the interview date. There is an AGE variable which I’m already using, but if there wasn’t, or I wanted to be more precise about their age (such as 24.34 years old), then this is a way to do it. I’m going to create a separate data frame just for this example to show how to do it; if you needed these variables, they should be part of your original subset above. Below, the “C” variables (CDAY, CMON, CYEAR) are the interview day, month, and year, and the “DOB” variables (DOBD, DOBM, DOBY) are the date of birth day, month, and year.

dat_sub <-
  NESARC %>%
  select(
    AGE
  , CDAY
  , CMON
  , CYEAR
  , DOBD
  , DOBM
  , DOBY
  )

str(dat_sub)
'data.frame':   43093 obs. of  7 variables:
 $ AGE  : num  23 28 81 18 36 34 19 84 29 18 ...
 $ CDAY : num  14 12 23 9 18 7 1 18 9 16 ...
 $ CMON : num  8 1 11 9 10 9 9 9 1 8 ...
 $ CYEAR: num  2001 2002 2001 2001 2001 ...
 $ DOBD : num  27 15 28 27 11 28 16 10 7 25 ...
 $ DOBM : num  9 12 12 6 9 12 3 11 4 3 ...
 $ DOBY : num  1977 1973 1919 1983 1965 ...

Combine the date fields together, then convert them into a date object. I’ll do this for both the interview date and the date of birth. I print the head (first 6 observations) after all the operations are complete.

dat_sub <-
  dat_sub %>%
  mutate(
    ## interview date
    # combine day, month, and year into one text string
    cdate_text = paste(dat_sub$CDAY, dat_sub$CMON, dat_sub$CYEAR)
    # create date object by interpreting the day, month, year with dmy()
  , cdate_date = dmy(cdate_text)
    ## date of birth
    # combine day, month, and year into one text string
  , dob_text   = paste(dat_sub$DOBD, dat_sub$DOBM, dat_sub$DOBY)
    # create date object by interpreting the day, month, year with dmy()
    # if any failed to parse, it's because a day, month, or year is out of range,
    #   such as "99" for missing values or refused to answer;
    #   in these cases the date is NA (missing), which is what we want
  , dob_date   = dmy(dob_text)
  )
Warning: 1094 failed to parse.
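The "failed to parse" warning is expected and harmless here. A minimal base-R sketch of the same behavior, using as.Date() instead of lubridate's dmy() (hypothetical date strings, not the NESARC data):

```r
# Base-R analogue of dmy() failing on out-of-range date components:
# an invalid day/month (e.g., "99" for missing) parses to NA.
txt <- c("14 8 2001", "99 99 2001")
d <- as.Date(txt, format = "%d %m %Y")
d
is.na(d)  # the invalid date is NA, which flags missing/refused responses
```

The NA values propagate into the age calculation below, so records with unusable dates simply end up with a missing age.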
head(dat_sub)
  AGE CDAY CMON CYEAR DOBD DOBM DOBY cdate_text cdate_date   dob_text
1  23   14    8  2001   27    9 1977  14 8 2001 2001-08-14  27 9 1977
2  28   12    1  2002   15   12 1973  12 1 2002 2002-01-12 15 12 1973
3  81   23   11  2001   28   12 1919 23 11 2001 2001-11-23 28 12 1919
4  18    9    9  2001   27    6 1983   9 9 2001 2001-09-09  27 6 1983
5  36   18   10  2001   11    9 1965 18 10 2001 2001-10-18  11 9 1965
6  34    7    9  2001   28   12 1966   7 9 2001 2001-09-07 28 12 1966
dob_date
1 1977-09-27
2 1973-12-15
3 1919-12-28
4 1983-06-27
5 1965-09-11
6 1966-12-28

Now that we have the interview date and date of birth as date objects, we calculate the difference to give the person’s age at the time of the interview. Because the difference of dates is in the unit “days”, we convert to years.

dat_sub <-
dat_sub %>%
mutate(
age_days  = cdate_date - dob_date
, age_years = as.numeric(age_days / 365.25)  # change from "difftime" to "numeric"
# difference between the age we calculated and the age in the original dataset
, age_diff  = age_years - AGE
)
head(dat_sub)
  AGE CDAY CMON CYEAR DOBD DOBM DOBY cdate_text cdate_date   dob_text
1  23   14    8  2001   27    9 1977  14 8 2001 2001-08-14  27 9 1977
2  28   12    1  2002   15   12 1973  12 1 2002 2002-01-12 15 12 1973
3  81   23   11  2001   28   12 1919 23 11 2001 2001-11-23 28 12 1919
4  18    9    9  2001   27    6 1983   9 9 2001 2001-09-09  27 6 1983
5  36   18   10  2001   11    9 1965 18 10 2001 2001-10-18  11 9 1965
6  34    7    9  2001   28   12 1966   7 9 2001 2001-09-07 28 12 1966
dob_date   age_days age_years   age_diff
1 1977-09-27  8722 days  23.87953 0.87953457
2 1973-12-15 10255 days  28.07666 0.07665982
3 1919-12-28 29916 days  81.90554 0.90554415
4 1983-06-27  6649 days  18.20397 0.20396988
5 1965-09-11 13186 days  36.10130 0.10130048
6 1966-12-28 12672 days  34.69405 0.69404517
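The days-to-years conversion can be checked by hand on the first observation above (interview 2001-08-14, born 1977-09-27), using only base R:

```r
# difference of two Date objects is a "difftime" in days
d1 <- as.Date("2001-08-14")  # interview date, observation 1
d0 <- as.Date("1977-09-27")  # date of birth, observation 1
age_days  <- d1 - d0                        # difftime, units = days
age_years <- as.numeric(age_days) / 365.25  # plain numeric, in years
age_days
round(age_years, 2)
```

This reproduces the 8722 days and 23.88 years shown for row 1 of the output above.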

Finally, keep the new age variable in your data frame, drop (unselect) the variables you don't need anymore, and you're ready to go!

# drop the original ones you don't need anymore
#   by select(-VAR), it excludes the VAR column
dat_sub <-
dat_sub %>%
select(
-c(CDAY, CMON, CYEAR, DOBD, DOBM, DOBY)
# -c(cdate_text, cdate_date, dob_text, dob_date)
, -c(age_days, age_diff)
)
str(dat_sub)
'data.frame':   43093 obs. of  6 variables:
 $ AGE       : num  23 28 81 18 36 34 19 84 29 18 ...
 $ cdate_text: chr  "14 8 2001" "12 1 2002" "23 11 2001" "9 9 2001" ...
 $ cdate_date: Date, format: "2001-08-14" "2002-01-12" ...
 $ dob_text  : chr  "27 9 1977" "15 12 1973" "28 12 1919" "27 6 1983" ...
 $ dob_date  : Date, format: "1977-09-27" "1973-12-15" ...
 $ age_years : num  23.9 28.1 81.9 18.2 36.1 ...

Now you’ve got an age (or other date object) you can use in your project.

For this example, I'm going to delete the dat_sub object since it was just for this demonstration.

rm(dat_sub)

#### 4.1.4.6 Review results of new variables

summary(nesarc_sub)
       ID        Sex            Age       SmokingStatus NicotineDependence
1      :    1   1:18518   Min.   :18.0   1   : 9913    0:38131
2      :    1   2:24575   1st Qu.:32.0   2   : 8078    1: 4962
3      :    1             Median :44.0   3   :25080
4      :    1             Mean   :46.4   NA's:   22
5      :    1             3rd Qu.:59.0
6      :    1             Max.   :98.0
(Other):43087
SmokingFreq    DailyCigsSmoked  Ethnicity Depression IncomePersonal
7      :25080   Min.   : 0.000   1:24507   0:35254    Min.   :      0
1      :14836   1st Qu.: 0.000   2: 8245   1: 7839    1st Qu.:   8800
6      :  772   Median : 0.000   3:  701              Median :  20000
4      :  747   Mean   : 6.462   4: 1332              Mean   :  28185
3      :  687   3rd Qu.:10.000   5: 8308              3rd Qu.:  36000
(Other):  869   Max.   :98.000                        Max.   :3000000
NA's   :  102   NA's   :262
IncomePersonalCat   Height_ft      Height_in        Weight_lb
6      : 3940     Min.   :4.00   Min.   : 0.000   Min.   : 62.0
7      : 3887     1st Qu.:5.00   1st Qu.: 3.000   1st Qu.:140.0
2      : 3823     Median :5.00   Median : 5.000   Median :165.0
4      : 3669     Mean   :5.11   Mean   : 5.279   Mean   :170.6
1      : 3571     3rd Qu.:5.00   3rd Qu.: 8.000   3rd Qu.:192.0
11     : 3291     Max.   :7.00   Max.   :11.000   Max.   :500.0
(Other):20912     NA's   :730    NA's   :761      NA's   :1376
DaysSmoke     IncomePersonalCat_num Height_inches  DaysSmoke_log2
Min.   : 0.00   Min.   :     0        Min.   :48.0   Min.   : -Inf
1st Qu.: 0.00   1st Qu.:  9000        1st Qu.:64.0   1st Qu.: -Inf
Median : 0.00   Median : 22500        Median :66.0   Median : -Inf
Mean   :10.96   Mean   : 27523        Mean   :66.6   Mean   : -Inf
3rd Qu.:30.00   3rd Qu.: 37500        3rd Qu.:70.0   3rd Qu.:4.907
Max.   :30.00   Max.   :120000        Max.   :84.0   Max.   :4.907
NA's   :102                           NA's   :761    NA's   :102
TotalCigsSmoked  TotalCigsSmoked_log2 TotalCigsSmoked_sqrt
Min.   :   0.0   Min.   : 0.000       Min.   : 0.000
1st Qu.:   0.0   1st Qu.: 7.229       1st Qu.: 0.000
Median :   0.0   Median : 8.492       Median : 0.000
Mean   : 187.5   Mean   : 7.888       Mean   : 7.887
3rd Qu.: 300.0   3rd Qu.: 9.229       3rd Qu.:17.321
Max.   :2940.0   Max.   :11.522       Max.   :54.222
NA's   :297      NA's   :25377        NA's   :297
TotalCigsSmoked_smokers TotalCigsSmoked_smokers_log2
Min.   :   1.0          Min.   : 0.000
1st Qu.: 150.0          1st Qu.: 7.229
Median : 360.0          Median : 8.492
Mean   : 453.1          Mean   : 7.888
3rd Qu.: 600.0          3rd Qu.: 9.229
Max.   :2940.0          Max.   :11.522
NA's   :25377           NA's   :25377
CigsSmokedFac   CigsSmokedFac2 SmokingFreq3
No smoking (0)         :25080   Q1  :    0     1   :15983
Lowest smoking (1-6)   : 1045   Q2  :    0     2   : 1156
Low smoking (7-30)     : 1129   Q3  :    0     3   :  772
Medium smoking (31-300): 6456   Q4  :    0     4   :25080
High smoking (301+)    : 9086   NA's:43093     NA's:  102
NA's                   :  297


### 4.1.5 Labeling Categorical variable levels (Class 04)

Look at summary of dataset to see the state of the labeling before we make changes.

summary(nesarc_sub)
       ID        Sex            Age       SmokingStatus NicotineDependence
1      :    1   1:18518   Min.   :18.0   1   : 9913    0:38131
2      :    1   2:24575   1st Qu.:32.0   2   : 8078    1: 4962
3      :    1             Median :44.0   3   :25080
4      :    1             Mean   :46.4   NA's:   22
5      :    1             3rd Qu.:59.0
6      :    1             Max.   :98.0
(Other):43087
SmokingFreq    DailyCigsSmoked  Ethnicity Depression IncomePersonal
7      :25080   Min.   : 0.000   1:24507   0:35254    Min.   :      0
1      :14836   1st Qu.: 0.000   2: 8245   1: 7839    1st Qu.:   8800
6      :  772   Median : 0.000   3:  701              Median :  20000
4      :  747   Mean   : 6.462   4: 1332              Mean   :  28185
3      :  687   3rd Qu.:10.000   5: 8308              3rd Qu.:  36000
(Other):  869   Max.   :98.000                        Max.   :3000000
NA's   :  102   NA's   :262
IncomePersonalCat   Height_ft      Height_in        Weight_lb
6      : 3940     Min.   :4.00   Min.   : 0.000   Min.   : 62.0
7      : 3887     1st Qu.:5.00   1st Qu.: 3.000   1st Qu.:140.0
2      : 3823     Median :5.00   Median : 5.000   Median :165.0
4      : 3669     Mean   :5.11   Mean   : 5.279   Mean   :170.6
1      : 3571     3rd Qu.:5.00   3rd Qu.: 8.000   3rd Qu.:192.0
11     : 3291     Max.   :7.00   Max.   :11.000   Max.   :500.0
(Other):20912     NA's   :730    NA's   :761      NA's   :1376
DaysSmoke     IncomePersonalCat_num Height_inches  DaysSmoke_log2
Min.   : 0.00   Min.   :     0        Min.   :48.0   Min.   : -Inf
1st Qu.: 0.00   1st Qu.:  9000        1st Qu.:64.0   1st Qu.: -Inf
Median : 0.00   Median : 22500        Median :66.0   Median : -Inf
Mean   :10.96   Mean   : 27523        Mean   :66.6   Mean   : -Inf
3rd Qu.:30.00   3rd Qu.: 37500        3rd Qu.:70.0   3rd Qu.:4.907
Max.   :30.00   Max.   :120000        Max.   :84.0   Max.   :4.907
NA's   :102                           NA's   :761    NA's   :102
TotalCigsSmoked  TotalCigsSmoked_log2 TotalCigsSmoked_sqrt
Min.   :   0.0   Min.   : 0.000       Min.   : 0.000
1st Qu.:   0.0   1st Qu.: 7.229       1st Qu.: 0.000
Median :   0.0   Median : 8.492       Median : 0.000
Mean   : 187.5   Mean   : 7.888       Mean   : 7.887
3rd Qu.: 300.0   3rd Qu.: 9.229       3rd Qu.:17.321
Max.   :2940.0   Max.   :11.522       Max.   :54.222
NA's   :297      NA's   :25377        NA's   :297
TotalCigsSmoked_smokers TotalCigsSmoked_smokers_log2
Min.   :   1.0          Min.   : 0.000
1st Qu.: 150.0          1st Qu.: 7.229
Median : 360.0          Median : 8.492
Mean   : 453.1          Mean   : 7.888
3rd Qu.: 600.0          3rd Qu.: 9.229
Max.   :2940.0          Max.   :11.522
NA's   :25377           NA's   :25377
CigsSmokedFac   CigsSmokedFac2 SmokingFreq3
No smoking (0)         :25080   Q1  :    0     1   :15983
Lowest smoking (1-6)   : 1045   Q2  :    0     2   : 1156
Low smoking (7-30)     : 1129   Q3  :    0     3   :  772
Medium smoking (31-300): 6456   Q4  :    0     4   :25080
High smoking (301+)    : 9086   NA's:43093     NA's:  102
NA's                   :  297


Informative labels are given to the factor levels. The order of the levels is also rearranged for the variables SmokingFreq, NicotineDependence, and Sex.

# Label factors with meaningful names in the same order as the codebook
nesarc_sub$Depression <- factor( nesarc_sub$Depression
, labels = c("No Depression"
, "Yes Depression"
)
)

nesarc_sub$SmokingStatus <- factor( nesarc_sub$SmokingStatus
, labels = c("Smoked past 12 months"
, "Smoked prior to 12 months"
, "Never smoked"
)
)

# Specify the order of the levels to be different from codebook order
nesarc_sub$Sex <- factor(nesarc_sub$Sex
, levels = c( 2
, 1
)
, labels = c( "Female"
, "Male"
)
)
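The levels-versus-values distinction matters here: reordering the levels changes how tables and plots are arranged, not the data. A minimal self-contained sketch (a hypothetical mini factor, not the NESARC data):

```r
# Hypothetical mini example: reordering factor levels with factor(levels = ...)
x  <- factor(c("a", "b", "b", "c"))         # default level order: a, b, c
x2 <- factor(x, levels = c("c", "b", "a"))  # reversed level order
levels(x2)
# the underlying data values are unchanged; only the level order differs
as.character(x2)
```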
# check ordering with a frequency table
table(nesarc_sub$Sex)

Female   Male 
 24575  18518 

nesarc_sub$NicotineDependence <-
  factor(nesarc_sub$NicotineDependence
       , levels = c( 1
                   , 0
                   )
       , labels = c( "Nicotine Dependence"
                   , "No Nicotine Dependence"
                   )
       )

# Remember to include the new category level from the original NA values.
nesarc_sub$SmokingFreq <-
  factor(nesarc_sub$SmokingFreq
       , levels = c(7, 6, 5, 4, 3, 2, 1)
       , labels = c( "Never"
                   , "Once a month or less"
                   , "2 to 3 Days/month"
                   , "1 to 2 Days/week"
                   , "3 to 4 Days/week"
                   , "5 to 6 Days/week"
                   , "Every Day"
                   )
       )

nesarc_sub$Ethnicity <-
  factor(nesarc_sub$Ethnicity
         # shorter labels are helpful for plots
       , labels = c("Cauc", "AfAm", "NaAm", "Asia", "Hisp")
       # , labels = c("Caucasian", "African American", "Native American", "Asian", "Hispanic")
       )

nesarc_sub$SmokingFreq3 <-
  factor(nesarc_sub$SmokingFreq3
       , levels = c(4, 3, 2, 1)
       , labels = c("Never", "Monthly", "Weekly", "Daily")
       )

summary(nesarc_sub)
       ID            Sex             Age      
 1      :    1   Female:24575   Min.   :18.0  
 2      :    1   Male  :18518   1st Qu.:32.0  
 3      :    1                  Median :44.0  
 4      :    1                  Mean   :46.4  
 5      :    1                  3rd Qu.:59.0  
 6      :    1                  Max.   :98.0  
 (Other):43087                                
                 SmokingStatus              NicotineDependence
 Smoked past 12 months    : 9913   Nicotine Dependence   : 4962
 Smoked prior to 12 months: 8078   No Nicotine Dependence:38131
 Never smoked             :25080                              
 NA's                     :   22                              
              SmokingFreq    DailyCigsSmoked  Ethnicity 
 Never               :25080   Min.   : 0.000   Cauc:24507
 Every Day           :14836   1st Qu.: 0.000   AfAm: 8245
 Once a month or less:  772   Median : 0.000   NaAm:  701
 1 to 2 Days/week    :  747   Mean   : 6.462   Asia: 1332
 3 to 4 Days/week    :  687   3rd Qu.:10.000   Hisp: 8308
 (Other)             :  869   Max.   :98.000             
 NA's                :  102   NA's   :262                
          Depression    IncomePersonal    IncomePersonalCat   Height_ft   
 No Depression :35254   Min.   :      0   6      : 3940      Min.   :4.00 
 Yes Depression: 7839   1st Qu.:   8800   7      : 3887      1st Qu.:5.00 
                        Median :  20000   2      : 3823      Median :5.00 
                        Mean   :  28185   4      : 3669      Mean   :5.11 
                        3rd Qu.:  36000   1      : 3571      3rd Qu.:5.00 
                        Max.   :3000000   11     : 3291      Max.   :7.00 
                                          (Other):20912      NA's   :730  
   Height_in        Weight_lb       DaysSmoke     IncomePersonalCat_num
 Min.   : 0.000   Min.   : 62.0   Min.   : 0.00   Min.   :     0       
 1st Qu.: 3.000   1st Qu.:140.0   1st Qu.: 0.00   1st Qu.:  9000       
 Median : 5.000   Median :165.0   Median : 0.00   Median : 22500       
 Mean   : 5.279   Mean   :170.6   Mean   :10.96   Mean   : 27523       
 3rd Qu.: 8.000   3rd Qu.:192.0   3rd Qu.:30.00   3rd Qu.: 37500       
 Max.   :11.000   Max.   :500.0   Max.   :30.00   Max.   :120000       
 NA's   :761      NA's   :1376    NA's   :102                          
 Height_inches   DaysSmoke_log2  TotalCigsSmoked  TotalCigsSmoked_log2
 Min.   :48.0    Min.   : -Inf   Min.   :   0.0   Min.   : 0.000      
 1st Qu.:64.0    1st Qu.: -Inf   1st Qu.:   0.0   1st Qu.: 7.229      
 Median :66.0    Median : -Inf   Median :   0.0   Median : 8.492      
 Mean   :66.6    Mean   : -Inf   Mean   : 187.5   Mean   : 7.888      
 3rd Qu.:70.0    3rd Qu.:4.907   3rd Qu.: 300.0   3rd Qu.: 9.229      
 Max.   :84.0    Max.   :4.907   Max.   :2940.0   Max.   :11.522      
 NA's   :761     NA's   :102     NA's   :297      NA's   :25377       
 TotalCigsSmoked_sqrt TotalCigsSmoked_smokers TotalCigsSmoked_smokers_log2
 Min.   : 0.000       Min.   :   1.0          Min.   : 0.000              
 1st Qu.: 0.000       1st Qu.: 150.0          1st Qu.: 7.229              
 Median : 0.000       Median : 360.0          Median : 8.492              
 Mean   : 7.887       Mean   : 453.1          Mean   : 7.888              
 3rd Qu.:17.321       3rd Qu.: 600.0          3rd Qu.: 9.229              
 Max.   :54.222       Max.   :2940.0          Max.   :11.522              
 NA's   :297          NA's   :25377           NA's   :25377               
                 CigsSmokedFac   CigsSmokedFac2   SmokingFreq3  
 No smoking (0)         :25080   Q1  :    0      Never  :25080  
 Lowest smoking (1-6)   : 1045   Q2  :    0      Monthly:  772  
 Low smoking (7-30)     : 1129   Q3  :    0      Weekly : 1156  
 Medium smoking (31-300): 6456   Q4  :    0      Daily  :15983  
 High smoking (301+)    : 9086   NA's:43093      NA's   :  102  
 NA's                   :  297                                  

### 4.1.6 Data subset rows

Note

Most of you will not need to subset your data.

Finer points

By subsetting your data, you can focus on questions related to a specific subset of the population.

Data subset rows to young smokers

Only as an example (not used in the rest of the assignment), consider restricting our attention to only young smokers. In practice, I would probably subset the data much earlier; in this example I waited until the end to subset so that I could illustrate all of the other steps with complete values. You may also want to subset into a new dataset object to keep it separate from the full dataset.

In our example, we are interested in the subset of people 25 or younger who smoked in the last 12 months, and this subset is created using the filter() function. Note that SmokingStatus == "Smoked past 12 months" is used to choose people who smoked in the past 12 months. Notice that the missing values have been removed (since NAs are not less than or equal to numbers).

# the subset command below will not include a row that evaluates to NA for the
#   specified subset.  For example:
(NA == 1)
[1] NA

# age and smoking subset
nesarc_sub_young_smokers <-
  nesarc_sub %>%
  filter(
    Age <= 25                                 # people 25 or younger
  , SmokingStatus == "Smoked past 12 months"  # smoked in the past 12 months
  )
dim(nesarc_sub_young_smokers)
[1] 1706   26
summary(nesarc_sub_young_smokers$Age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
18.00   20.00   22.00   21.61   23.00   25.00 
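The rule that a row is dropped when its filter condition evaluates to NA can be seen in a tiny self-contained sketch (hypothetical ages, not the NESARC data):

```r
# Hypothetical mini example: rows where the condition is NA are dropped
age  <- c(22, NA, 30)
keep <- age <= 25
keep               # TRUE NA FALSE: the NA is neither TRUE nor FALSE
age[which(keep)]   # which() keeps only the TRUE positions, like filter()
```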
summary(nesarc_sub_young_smokers$SmokingStatus)

    Smoked past 12 months Smoked prior to 12 months              Never smoked 
                     1706                         0                         0 

## 4.2 Data is complete (Class 04)

### 4.2.1 Plot entire dataset, show missing values

Only keep one of these for your assignment. I think the second one is much more informative.

e_plot_missing(nesarc_sub)
Warning: Using alpha for a discrete variable is not advised.

Adding additional options can help understand the structure of the missing values.

e_plot_missing(
    dat_plot      = nesarc_sub
  , var_group     = "Sex"
  , sw_group_sort = TRUE
  , var2_sort     = "TotalCigsSmoked"
  )
Warning: Using alpha for a discrete variable is not advised.

### 4.2.2 Numerical summaries to assess correctness (Class 03)

summary(nesarc_sub)
       ID            Sex             Age      
 1      :    1   Female:24575   Min.   :18.0  
 2      :    1   Male  :18518   1st Qu.:32.0  
 3      :    1                  Median :44.0  
 4      :    1                  Mean   :46.4  
 5      :    1                  3rd Qu.:59.0  
 6      :    1                  Max.   :98.0  
 (Other):43087                                
                 SmokingStatus              NicotineDependence
 Smoked past 12 months    : 9913   Nicotine Dependence   : 4962
 Smoked prior to 12 months: 8078   No Nicotine Dependence:38131
 Never smoked             :25080                              
 NA's                     :   22                              
              SmokingFreq    DailyCigsSmoked  Ethnicity 
 Never               :25080   Min.   : 0.000   Cauc:24507
 Every Day           :14836   1st Qu.: 0.000   AfAm: 8245
 Once a month or less:  772   Median : 0.000   NaAm:  701
 1 to 2 Days/week    :  747   Mean   : 6.462   Asia: 1332
 3 to 4 Days/week    :  687   3rd Qu.:10.000   Hisp: 8308
 (Other)             :  869   Max.   :98.000             
 NA's                :  102   NA's   :262                
          Depression    IncomePersonal    IncomePersonalCat   Height_ft   
 No Depression :35254   Min.   :      0   6      : 3940      Min.   :4.00 
 Yes Depression: 7839   1st Qu.:   8800   7      : 3887      1st Qu.:5.00 
                        Median :  20000   2      : 3823      Median :5.00 
                        Mean   :  28185   4      : 3669      Mean   :5.11 
                        3rd Qu.:  36000   1      : 3571      3rd Qu.:5.00 
                        Max.   :3000000   11     : 3291      Max.   :7.00 
                                          (Other):20912      NA's   :730  
   Height_in        Weight_lb       DaysSmoke     IncomePersonalCat_num
 Min.   : 0.000   Min.   : 62.0   Min.   : 0.00   Min.   :     0       
 1st Qu.: 3.000   1st Qu.:140.0   1st Qu.: 0.00   1st Qu.:  9000       
 Median : 5.000   Median :165.0   Median : 0.00   Median : 22500       
 Mean   : 5.279   Mean   :170.6   Mean   :10.96   Mean   : 27523       
 3rd Qu.: 8.000   3rd Qu.:192.0   3rd Qu.:30.00   3rd Qu.: 37500       
 Max.   :11.000   Max.   :500.0   Max.   :30.00   Max.   :120000       
 NA's   :761      NA's   :1376    NA's   :102                          
 Height_inches   DaysSmoke_log2  TotalCigsSmoked  TotalCigsSmoked_log2
 Min.   :48.0    Min.   : -Inf   Min.   :   0.0   Min.   : 0.000      
 1st Qu.:64.0    1st Qu.: -Inf   1st Qu.:   0.0   1st Qu.: 7.229      
 Median :66.0    Median : -Inf   Median :   0.0   Median : 8.492      
 Mean   :66.6    Mean   : -Inf   Mean   : 187.5   Mean   : 7.888      
 3rd Qu.:70.0    3rd Qu.:4.907   3rd Qu.: 300.0   3rd Qu.: 9.229      
 Max.   :84.0    Max.   :4.907   Max.   :2940.0   Max.   :11.522      
 NA's   :761     NA's   :102     NA's   :297      NA's   :25377       
 TotalCigsSmoked_sqrt TotalCigsSmoked_smokers TotalCigsSmoked_smokers_log2
 Min.   : 0.000       Min.   :   1.0          Min.   : 0.000              
 1st Qu.: 0.000       1st Qu.: 150.0          1st Qu.: 7.229              
 Median : 0.000       Median : 360.0          Median : 8.492              
 Mean   : 7.887       Mean   : 453.1          Mean   : 7.888              
 3rd Qu.:17.321       3rd Qu.: 600.0          3rd Qu.: 9.229              
 Max.   :54.222       Max.   :2940.0          Max.   :11.522              
 NA's   :297          NA's   :25377           NA's   :25377               
                 CigsSmokedFac   CigsSmokedFac2   SmokingFreq3  
 No smoking (0)         :25080   Q1  :    0      Never  :25080  
 Lowest smoking (1-6)   : 1045   Q2  :    0      Monthly:  772  
 Low smoking (7-30)     : 1129   Q3  :    0      Weekly : 1156  
 Medium smoking (31-300): 6456   Q4  :    0      Daily  :15983  
 High smoking (301+)    : 9086   NA's:43093      NA's   :  102  
 NA's                   :  297                                  

# 5 Graphing and Tabulating

## 5.1 Class 04, Plotting univariate

Rubric

1. (3 p) For one categorical variable, a barplot is plotted with axis labels and a title. Interpret the plot: describe the relationship between categories you observe.
2. (3 p) For one numerical variable, a histogram or boxplot is plotted with axis labels and a title. Interpret the plot: describe the distribution (shape, center, spread, outliers).
3. (2 p) Code missing variables, remove records with missing values, indicate with R output that this was done correctly (e.g., str(), dim(), summary()).
   • Scroll up to sections labeled "(Class 04)".
4. (2 p) Label levels of factor variables.
   • Scroll up to sections labeled "(Class 04)".

## 5.2 Categorical variables

### 5.2.1 Tables for categorical variables

Basic tables can be created using the functions table() or xtabs(), though for multiple variables the group_by(), summarize(), mutate() recipe makes a nicer table with proportions.

# table() produces a one-variable frequency table
table(nesarc_sub$NicotineDependence)

Nicotine Dependence No Nicotine Dependence
4962                  38131 
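Before applying prop.table() to the NESARC table, here is a minimal self-contained sketch of what it does (hypothetical counts, not the NESARC data):

```r
# Hypothetical counts: prop.table() turns a frequency table
# into proportions that sum to 1
counts <- c(Yes = 12, No = 88)
props  <- prop.table(counts)
props
sum(props)
```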
# proportions available by passing these frequencies to prop.table()
table(nesarc_sub$NicotineDependence) %>% prop.table()

   Nicotine Dependence No Nicotine Dependence 
             0.1151463              0.8848537 

tab_nicdep <-
  nesarc_sub %>%
  group_by(
    NicotineDependence
  ) %>%
  summarize(
    n = n()
    #, .groups = "drop_last"
  ) %>%
  mutate(
    prop = round(n / sum(n), 3)
  ) %>%
  #arrange(
  #  desc(n)
  #) %>%
  ungroup()
tab_nicdep
# A tibble: 2 × 3
  NicotineDependence         n  prop
  <fct>                  <int> <dbl>
1 Nicotine Dependence     4962 0.115
2 No Nicotine Dependence 38131 0.885

### 5.2.2 Graphing frequency tables

The barplots are all created with the package ggplot2.

library(ggplot2)
p <- ggplot(data = nesarc_sub, aes(x = NicotineDependence))
p <- p + theme_bw()
p <- p + geom_bar()
p <- p + labs( title = "Nicotine Dependence for Adults"
             , x = "Nicotine Dependence"
             , y = "Total Number"
             )
print(p)

Interpretation: Among adults (n = 43093), there are about 8 times more people (88%) who are not nicotine dependent compared to those who are (12%).

#### 5.2.2.1 Removing NAs from factor axes and tables

Finer points

Remove NAs using drop_na() from categorical variables in plots and tables when the NAs are unwanted.

When a factor variable has NAs, then NA appears as a level with frequencies (see bar plot below). As of this writing, ggplot() does not have a way to remove the NAs before plotting; therefore, we have to do it manually. The easiest and most transparent way is to use the drop_na() function below, where we only plot the non-NA values.

Note

The table() function doesn't show the NAs unless you ask it to.

# table() produces a one-variable frequency table
table(nesarc_sub$SmokingStatus)

Smoked past 12 months Smoked prior to 12 months              Never smoked
9913                      8078                     25080 
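The useNA behavior of table() can be seen in a tiny self-contained sketch (a hypothetical mini factor, not the NESARC data):

```r
# Hypothetical mini example: table() hides NAs unless asked
x <- factor(c("yes", "no", NA, "yes"))
table(x)                    # NA not counted
table(x, useNA = "always")  # NA shown as its own category
```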
table(nesarc_sub$SmokingStatus, useNA = "always")

    Smoked past 12 months Smoked prior to 12 months              Never smoked 
                     9913                      8078                     25080 
                     <NA> 
                       22 

tab_smoke <-
  nesarc_sub %>%
  group_by(
    Sex, SmokingStatus
  ) %>%
  summarize(
    n = n()
    #, .groups = "drop_last"
  ) %>%
  mutate(
    prop = round(n / sum(n), 3)
  ) %>%
  #arrange(
  #  desc(n)
  #) %>%
  ungroup()
summarise() has grouped output by 'Sex'. You can override using the .groups argument.
tab_smoke
# A tibble: 8 × 4
  Sex    SmokingStatus                 n  prop
  <fct>  <fct>                     <int> <dbl>
1 Female Smoked past 12 months      5070 0.206
2 Female Smoked prior to 12 months  3861 0.157
3 Female Never smoked              15635 0.636
4 Female <NA>                          9 0    
5 Male   Smoked past 12 months      4843 0.262
6 Male   Smoked prior to 12 months  4217 0.228
7 Male   Never smoked               9445 0.51 
8 Male   <NA>                         13 0.001

tab_smoke <-
  nesarc_sub %>%
  drop_na(Sex, SmokingStatus) %>%  # NAs dropped here
  group_by(
    Sex, SmokingStatus
  ) %>%
  summarize(
    n = n()
    #, .groups = "drop_last"
  ) %>%
  mutate(
    prop = round(n / sum(n), 3)
  ) %>%
  #arrange(
  #  desc(n)
  #) %>%
  ungroup()
summarise() has grouped output by 'Sex'. You can override using the .groups argument.
tab_smoke
# A tibble: 6 × 4
  Sex    SmokingStatus                 n  prop
  <fct>  <fct>                     <int> <dbl>
1 Female Smoked past 12 months      5070 0.206
2 Female Smoked prior to 12 months  3861 0.157
3 Female Never smoked              15635 0.636
4 Male   Smoked past 12 months      4843 0.262
5 Male   Smoked prior to 12 months  4217 0.228
6 Male   Never smoked               9445 0.51 

Compare the two plots below of the same data; the second plot has the NAs removed before plotting.
library(ggplot2)
# Original plot with NAs
p1 <- ggplot(data = nesarc_sub, aes(x = SmokingStatus))
p1 <- p1 + geom_bar()
p1 <- p1 + labs(title = "Basic plot")
p1 <- p1 + theme(axis.text.x = element_text(angle = 90, vjust = 0, hjust = 1))
#p1

# subset() excludes the NAs for the variable being plotted:
#   subset() specifies the values to keep, and "!is.na()" means "keep the non-NA values"
# p2 <- ggplot(data = subset(nesarc_sub, !is.na(CigsSmokedFac)), aes(x = CigsSmokedFac))
# tidyverse syntax uses drop_na(Var) to drop observations where Var is NA
p2 <- ggplot(data = nesarc_sub %>% drop_na(SmokingStatus), aes(x = SmokingStatus))
p2 <- p2 + geom_bar()
p2 <- p2 + labs(title = "Using drop_na()")
p2 <- p2 + theme(axis.text.x = element_text(angle = 90, vjust = 0, hjust = 1))
#p2

p_arranged <-
  cowplot::plot_grid(
    plotlist = list(p1, p2)  # list of plots
  , nrow     = 1             # number of rows for grid of plots
  , ncol     = NULL          # number of columns, left unspecified
  , labels   = "AUTO"        # A and B panel labels
  )
print(p_arranged)

## 5.3 Numeric variables

### 5.3.1 Graphing numeric variables

p <- ggplot(data = nesarc_sub, aes(x = DailyCigsSmoked))
p <- p + theme_bw()
p <- p + geom_histogram(aes(y = ..density..), boundary = 0, binwidth = 5)
#p <- p + geom_rug() # this plots every point, takes too long for big data
p <- p + geom_density(alpha = 0.2, fill = "gray50", adjust = 4)
p <- p + labs( title = "Daily cigarettes smoked for adults"
             , x = "Estimated Cigarettes Smoked per Day"
             , y = ""
             )
print(p)
Warning: Removed 262 rows containing non-finite values (stat_bin).
Warning: Removed 262 rows containing non-finite values (stat_density).
summary(nesarc_sub$DailyCigsSmoked)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
0.000   0.000   0.000   6.462  10.000  98.000     262 
fivenum(nesarc_sub$DailyCigsSmoked)
[1]  0  0  0 10 98
median(nesarc_sub$DailyCigsSmoked, na.rm = TRUE)
[1] 0
IQR(nesarc_sub$DailyCigsSmoked, na.rm = TRUE)
[1] 10

Interpretation: Among all adults (most of whom do not smoke), the number of cigarettes smoked per day is right skewed (shape) with a median (center) of 0, and the IQR (spread; interquartile range, middle 50%) for the distribution is 10. There are two distinct modes (peaks) in the distribution, the first between 0 and 10 cigarettes (half pack) and the second at 20 (full pack). There are several outlying values (outliers) greater than 40 (representing 40 cigarettes per day) that are possibly overestimates from the people responding.

Finer points

Experiment with the geom_histogram() code in the next code chunk to set the binwidth = for your data (extremely important!). Often, a binwidth of 1 is good if the data are frequencies and high resolution is important. Decide if you need boundary = 0. Add a title and the labels on the x- and y-axes.

A smoothed density curve can be added with the geom_density() function, where adjust = 2.0 is a fairly smooth curve. Experiment with adjust values such as 1.0, 1.5, 2.0, and 3.0 to decrease or increase the smoothness and decide what level best captures the general shape of your distribution.

Finer points

The numeric summaries can be used in the text. See the .qmd file to see the R code in the following sentences and in the interpretation.

• The "shape" of the distribution for the estimated cigarettes smoked per day is skewed to the right.
• The "center" (median) of the distribution is
• The "spread" (interquartile range, middle 50%) for the distribution is
• There are several "outliers".

A second example follows. Notice that I filter the data to smokers, first.

Interpretation: Among smokers, about half smoke at most 10 cigarettes per day, with the mode at 20 (1 pack per day), two smaller peaks at 30 and 40 (1.5 and 2 packs per day), and some extreme outlying values at 60, 80, and 100 (3, 4, and 5 packs per day).
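One way to act on the binwidth advice is to draw the same histogram at several candidate binwidths and compare them side by side. A minimal self-contained sketch using ggplot2 (with hypothetical simulated data standing in for your variable; with your own data, replace dat and cigs with your data frame and column):

```r
library(ggplot2)

set.seed(42)
# hypothetical right-skewed data standing in for a smoking variable
dat <- data.frame(cigs = rpois(500, lambda = 8))

# draw the same histogram at several candidate binwidths, then compare
plots <-
  lapply(
    c(1, 2, 5)  # candidate binwidths to try
  , function(bw) {
      ggplot(dat, aes(x = cigs)) +
        geom_histogram(boundary = 0, binwidth = bw) +
        labs(title = paste("binwidth =", bw))
    }
  )
# arrange side by side, as with cowplot::plot_grid() elsewhere in this document
# cowplot::plot_grid(plotlist = plots, nrow = 1)
```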
p <- ggplot(data = nesarc_sub %>% filter(DailyCigsSmoked > 0), aes(x = DailyCigsSmoked))
p <- p + geom_histogram(aes(y = ..density..), boundary = 0, binwidth = 2)
p <- p + geom_rug()
p <- p + geom_density(alpha = 0.2, fill = "gray50", adjust = 1.5)
p <- p + labs( x = "Estimated Cigarettes Smoked per Day"
             , y = ""
             , title = "Daily cigarettes smoked for smoking adults"
             )
p <- p + theme_bw()
p

## 5.4 Class 05-1, Plotting bivariate, numeric response

Rubric

1. Each of the following (2 p for plot, 2 p for labeled axes and title, 1 p for interpretation):
   1. Scatter plot (for regression): x = numerical, y = numerical, include axis labels and a title. Interpret the plot: describe the relationship.
   2. Box plots (for ANOVA): x = categorical, y = numerical, include axis labels and a title. Interpret the plot: describe the relationship.

### 5.4.1 Scatter plot (for regression): x = numerical, y = numerical

library(ggplot2)
p <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = Age, y = TotalCigsSmoked))
p <- p + theme_bw()
#p <- p + geom_point(alpha = 1/10)
p <- p + geom_jitter(width = 0.25, height = 5, alpha = 1/10)
p <- p + geom_smooth()
p <- p + stat_smooth(method = lm, colour = "red")
p <- p + labs( title = "Total Cigarettes Smoked vs Age"
             #, x =
             , y = "Total Cigarettes Smoked per month"
             , caption = "Key: Blue line is GAM smoother, Red line is simple linear regression.\nFiltered to include only smokers (cig > 0)."
             )
print(p)
geom_smooth() using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
geom_smooth() using formula 'y ~ x'

Interpretation: The relationship between Age and TotalCigsSmoked, among people who smoke, is very weakly positive but does not resemble a straight line. TotalCigsSmoked is right-skewed, which results in many extreme values above the regression line.

Finer points

Improving plots and comparing by a categorical variable using facets.
Research question: With my questions, I don't have a meaningful question to ask that relates two numeric variables. An exploratory question is whether there is a relationship between a person's age (Age) and the total number of cigarettes smoked (TotalCigsSmoked).

Because age is an integer variable, I jitter the points, and I reduce the opacity (increase the transparency) with alpha = 1/4 to see the density of points better (1/4 means that it takes 4 overlaid points to be fully opaque). The first plot has all the points in their original locations, but they end up stacking on top of each other so you can't tell how many points are there.

library(ggplot2)
p <- ggplot(nesarc_sub, aes(x = Age, y = TotalCigsSmoked))
p <- p + geom_point(alpha = 1/4)
p <- p + stat_smooth(method = lm)
p <- p + labs(title = "Total Cigarettes Smoked vs Age")
print(p)
geom_smooth() using formula 'y ~ x'
Warning: Removed 297 rows containing non-finite values (stat_smooth).
Warning: Removed 297 rows containing missing values (geom_point).

This is the same data as above, but I have added some horizontal jitter (displacement) to each point, which doesn't affect their value on the y-axis but allows us to see how many points are at each discrete age year.

library(ggplot2)
p <- ggplot(nesarc_sub, aes(x = Age, y = TotalCigsSmoked))
p <- p + geom_jitter(position = position_jitter(width = 0.2), alpha = 1/4)
p <- p + stat_smooth(method = lm)
p <- p + labs(title = "Total Cigarettes Smoked vs Age")
print(p)
geom_smooth() using formula 'y ~ x'
Warning: Removed 297 rows containing non-finite values (stat_smooth).
Warning: Removed 297 rows containing missing values (geom_point).

Here's an example of height and weight with a strong positive relationship between the two numeric variables. Most variables in our dataset won't exhibit such a nice relationship.
library(ggplot2)
p <- ggplot(nesarc_sub, aes(x = Height_inches, y = Weight_lb))
p <- p + theme_bw()
p <- p + geom_jitter(height = 0.5, alpha = 1/10)
p <- p + stat_smooth(method = lm)
p <- p + labs(title = "Weight vs Height")
print(p)
geom_smooth() using formula 'y ~ x'
Warning: Removed 1439 rows containing non-finite values (stat_smooth).
Warning: Removed 1439 rows containing missing values (geom_point).

If you have two groups you want to compare, you can create facets (small multiples) and show the relationship for each group. Below, I compare by sex. The slopes appear similar; later in the semester we'll learn how to formally compare these lines.

library(ggplot2)
p <- ggplot(nesarc_sub, aes(x = Height_inches, y = Weight_lb))
p <- p + geom_point(alpha = 1/4)
p <- p + stat_smooth(method = lm)
p <- p + labs(title = "Weight vs Height")
p <- p + facet_wrap(~ Sex)
print(p)
geom_smooth() using formula 'y ~ x'
Warning: Removed 1439 rows containing non-finite values (stat_smooth).
Warning: Removed 1439 rows containing missing values (geom_point).

### 5.4.2 Box plots (for ANOVA): x = categorical, y = numerical

Research question: Is there a relationship between depression (Depression) and total number of cigarettes smoked (TotalCigsSmoked)?

library(ggplot2)
p <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = Depression, y = TotalCigsSmoked))
p <- p + theme_bw()
p <- p + geom_boxplot(width = 0.5, alpha = 0.5)
p <- p + geom_jitter(position = position_jitter(width = 0.1), alpha = 1/100)
# diamond at mean for each group
p <- p + stat_summary(fun = mean, geom = "point", shape = 18, size = 6, colour = "red", alpha = 0.8)
p <- p + labs( title = "Total Cigarettes Smoked by Depression"
             , x = "Depression status"
             , y = "Total Cigarettes Smoked per month"
             , caption = "Red diamond is the mean.\nFiltered to include only smokers (cig > 0)."
)
print(p)

Interpretation: For NESARC participants who smoke (cig > 0), those who are depressed have a slightly higher mean number of cigarettes per month (red diamond) than those who are not depressed. Total cigarettes smoked per month is right skewed, so there are many people who smoke very little but few people who smoke a lot, which increases the mean beyond the median.

## 5.5 Class 05-2, Plotting bivariate, categorical response

Rubric

1. Each of the following (2 p for plot, 2 p for labeled axes and title, 1 p for interpretation):

1. Mosaic plot or bivariate bar plots (for contingency tables): $$x$$ = categorical, $$y$$ = categorical, include axis labels and a title. Interpret the plot: describe the relationship.

2. Logistic scatter plot (for logistic regression): $$x$$ = numerical, $$y$$ = categorical (binary), include axis labels and a title. Interpret the plot: describe the relationship.

### 5.5.1 Mosaic plot or bivariate bar plots (for contingency tables): x = categorical, y = categorical

Research question: Is there a relationship between depression status (Depression) and nicotine dependence (NicotineDependence)?

tab_dep_nic <-
nesarc_sub %>%
group_by(Depression, NicotineDependence) %>%
summarize(n = n()) %>%
mutate(prop = round(n / sum(n), 3)) %>%
ungroup()
summarise() has grouped output by 'Depression'. You can override using the .groups argument.
tab_dep_nic
# A tibble: 4 × 4
  Depression     NicotineDependence         n  prop
  <fct>          <fct>                  <int> <dbl>
1 No Depression  Nicotine Dependence     3193 0.091
2 No Depression  No Nicotine Dependence 32061 0.909
3 Yes Depression Nicotine Dependence     1769 0.226
4 Yes Depression No Nicotine Dependence  6070 0.774
tab_nic_dep <-
nesarc_sub %>%
group_by(NicotineDependence, Depression) %>%
summarize(n = n()) %>%
mutate(prop = round(n / sum(n), 3)) %>%
ungroup()
summarise() has grouped output by 'NicotineDependence'. You can override using the .groups argument.
tab_nic_dep
# A tibble: 4 × 4
  NicotineDependence     Depression         n  prop
  <fct>                  <fct>          <int> <dbl>
1 Nicotine Dependence    No Depression   3193 0.643
2 Nicotine Dependence    Yes Depression  1769 0.357
3 No Nicotine Dependence No Depression  32061 0.841
4 No Nicotine Dependence Yes Depression  6070 0.159

Reversing color order: It is sometimes preferable to reverse the order of your color-filled variable so that the more interesting proportions read from the bottom of the plot up the proportion scale. Use the function forcats::fct_rev() on your fill = variable.

library(ggplot2)
p <- ggplot(data = nesarc_sub, aes(x = Depression, fill = forcats::fct_rev(NicotineDependence)))
p <- p + theme_bw()
p <- p + geom_bar(position = "fill")
p <- p + labs(
x = "Depression status"
, y = "Proportion"
, fill = "Nicotine Dependence\nStatus"  # the legend title can be modified, if desired (try this line)
, title = "Proportion of young adult smokers with and without\nnicotine dependence by depression status"
)
print(p)

Interpretation: When a person has depression, they are more than twice as likely to have nicotine dependence (22.6%) as those without depression (9.1%).

Finer points

We can compare by a categorical variable with facets (or “small multiples”). Legends can be (A) removed or (B) moved and rearranged.

Research question: Is there a relationship between smoking frequency (SmokingFreq) and nicotine dependence (NicotineDependence)? Also compare by sex.

library(ggplot2)
p <- ggplot(data = subset(nesarc_sub, !is.na(SmokingFreq)), aes(x = SmokingFreq, fill = NicotineDependence))
p <- p + geom_bar(position = "fill")
p <- p + theme_bw()
# tilt axis labels
p <- p + theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
p <- p + labs(x = "Smoking Frequency", y = "Percent")
p1 <- p
p1 <- p1 + theme(legend.position = "none")  # remove legend
# also compare by sex
p <- p + facet_grid(Sex ~ .)
p <- p + theme(legend.position = "bottom")  # "none"
p <- p + theme(legend.box = "vertical", legend.margin = margin())  # stack vertically
p <- p + guides(fill = guide_legend(ncol = 1, byrow = FALSE))  # shape levels of one legend
p2 <- p

p_arranged <-
cowplot::plot_grid(
plotlist  = list(p1, p2)   # list of plots
, nrow      = 1              # number of rows for grid of plots
, ncol      = NULL           # number of columns, left unspecified
, labels    = "AUTO"         # A and B panel labels
)
print(p_arranged)

#### 5.5.1.1 Using Mosaic Plots

Note

This section on Mosaic plots can be completely skipped.

Finer points

Mosaic plots will be used later in the course when we calculate statistics to test whether a pair of categorical variables are associated. These are helpful because they color-code each cell based on whether it provides evidence of an association. These plots are difficult to customize, so the ggplot() versions above will generally be preferred to these.

Interpretation: If a person has depression they are more likely to have nicotine dependence than if they do not have depression.

library(vcd)
Loading required package: grid
mosaic( ~ NicotineDependence + Depression, data = nesarc_sub)
mosaic( ~ NicotineDependence + Depression, data = nesarc_sub, shade = TRUE)
mosaic( ~ NicotineDependence + Depression + Sex, data = nesarc_sub, shade = TRUE)

It’s easy to look at such a fine resolution (too many variables) that it’s difficult to understand what’s happening.

mosaic( ~ NicotineDependence + Depression + Sex + CigsSmokedFac, data = nesarc_sub, shade = TRUE)

##### 5.5.1.1.1 Customizing Mosaic Plots

There are several solutions for dealing with the messy labels that overlap. The help line below will show the many options you have control over. Read through the comments of each code chunk, see the new options, and see the result.
library(vcd)
# help(labeling_border)

# Initial plot
mosaic( ~ NicotineDependence + Depression, data = nesarc_sub)

# Abbreviate labels, specify the number of characters (try a few)
mosaic( ~ NicotineDependence + Depression, data = nesarc_sub
, abbreviate_labs = 4)

# Rotate and align labels
mosaic( ~ NicotineDependence + Depression, data = nesarc_sub
, rot_labels = c(0, 90, 0, 0)
, just_labels = "right"  # right-justifies all the labels (top and side)
)

# Rotate and align only the y-axis label
mosaic( ~ NicotineDependence + Depression, data = nesarc_sub
, rot_labels = c(0, 90, 0, 0)
, just_labels = c("center", "right")  # order is opposite of the first argument
)

# Rotate and align only the y-axis label, and offset the variable name
mosaic( ~ NicotineDependence + Depression, data = nesarc_sub
, rot_labels = c(0, 90, 0, 0)
, just_labels = c("center", "right")  # order is opposite of the first argument
, offset_varnames = c(0, 0, 0, 9)  # offset the y-axis var name
)

Final version

Labeling this plot well is, in my opinion, a bit too hard to be worth it. Notice in my code chunk options (you’ll need to look at the .qmd file) that I’ve included fig.height = 8, fig.width = 8, which makes the size of the plot in the output 8 inches high by 8 inches wide.
library(vcd)
mosaic( ~ NicotineDependence + Depression, data = nesarc_sub
, shade = TRUE
, rot_labels = c(0, 90, 0, 0)         # order is: top, right, bottom, left
, just_labels = c("center", "right")  # order is opposite of the first argument
, offset_varnames = c(0, 0, 0, 9)     # offset the y-axis var name
# a big left margin makes space for the plot labels
, margins = c(3, 3, 3, 14)            # order is: top, right, bottom, left
, main = "Fraction of young adult smokers with and without\n nicotine dependence by depression status"
, labeling_args = list(set_varnames = c(NicotineDependence = "Tobacco Addiction Status"
, Depression = "Depression status"))
)

Interpretation: If a person has nicotine dependence they are more likely to have depression than if they do not have nicotine dependence.

### 5.5.2 Logistic scatter plot (for logistic regression): x = numerical, y = categorical (binary)

Research question: Frequency and quantity of smoking (TotalCigsSmoked) are markedly imperfect indices for determining an individual’s probability of exhibiting nicotine dependence (NicotineDependence) (this is true for other drugs as well).

table(nesarc_sub$NicotineDependence)

Nicotine Dependence No Nicotine Dependence
4962                  38131 
# if "Nicotine Dependence", then code it as a 1, otherwise code as a 0
nesarc_sub$NicotineDependence01 <- ifelse(nesarc_sub$NicotineDependence == "Nicotine Dependence", 1, 0)
# check that the frequencies are correct before and after
table(nesarc_sub$NicotineDependence01)

    0     1
38131  4962

# # Note, sometimes it is necessary to make your variable a character type first.
# # If the code above doesn't produce both 0s and 1s, then try this:
# nesarc_sub$NicotineDependence01 <- ifelse(as.character(nesarc_sub$NicotineDependence) == "Nicotine Dependence", 1, 0)

library(ggplot2)
p <- ggplot(nesarc_sub, aes(x = TotalCigsSmoked, y = NicotineDependence01))
p <- p + theme_bw()
p <- p + geom_jitter(position = position_jitter(height = 0.1), alpha = 1/4)
# overlay a sigmoidal curve to indicate direction of relationship
# As of ggplot2 2.1.0, need to use this binomial_smooth() function:
binomial_smooth <- function(...) {
geom_smooth(method = "glm", method.args = list(family = "binomial"), ...)
}
p <- p + binomial_smooth()    # old way: stat_smooth(method="glm", family="binomial", se=FALSE)
p <- p + scale_y_continuous(breaks = seq(0, 1, by = 0.2), minor_breaks = seq(0, 1, by = 0.1))
p <- p + labs(x = "Estimated Cigarettes Smoked per Month"
, y = "Nicotine Dependence (1=Yes, 0=No)"
, title = "Likelihood of Nicotine Dependence\nincreases with number of cigarettes smoked"
)
print(p)
geom_smooth() using formula 'y ~ x'
Warning: Removed 297 rows containing non-finite values (stat_smooth).
Warning: Removed 297 rows containing missing values (geom_point).

Interpretation: The probability of someone being tobacco dependent increases as the number of cigarettes smoked per month increases. The logistic curve model probably doesn’t apply at (or near) 0 cigarettes.

Finer points

While the basic plot can be made without collapsing to binary, later in the semester we will learn about logistic regression, where we model the probability of “success”. Categories associated with “success” are coded as 1, and all other categories are “failure”, coded as 0. Below I show how to create this binary 0/1 variable from a numeric variable and a multi-level categorical variable.

Research question:
Frequency and quantity of smoking (TotalCigsSmoked) are markedly imperfect indices for determining an individual’s probability of exhibiting nicotine dependence (NicotineDependence) (this is true for other drugs as well).

library(ggplot2)
p <- ggplot(nesarc_sub, aes(x = TotalCigsSmoked, y = NicotineDependence))
p <- p + geom_jitter(position = position_jitter(height = 0.1), alpha = 1/4)
print(p)
Warning: Removed 297 rows containing missing values (geom_point).

Interpretation: These two groups are quite balanced on the probability of tobacco dependence given the total number of cigarettes smoked, though they increase together slightly.

To overlay a curve, we need to convert our response variable $$y$$ to be binary 0 or 1, where we let 1 indicate a “success” in the direction of interest.

#### 5.5.2.1 Collapsing a variable to binary

Whether you have a numeric or categorical variable that you’d like to represent with two levels, you’ll need to convert either to a numeric binary variable (values of 0 or 1, only). The value of 1 indicates “success” or the feature you’re interested in (below, 1 = someone with tobacco dependence) and 0 is “failure” (not tobacco dependent). Note that “success” does not necessarily mean “good”.

##### 5.5.2.1.1 Numeric to binary

It is convenient to collapse a continuous variable into a categorical variable (as with the cut() command) or to further collapse a categorical variable into a binary variable. One strategy for creating a binary variable is to use the ifelse() function. Let’s create a binary Age variable: if 21 years and older, then it is legal to purchase alcohol; otherwise it is illegal. See how the two columns below correspond as we wanted them to. I do not recommend overwriting the original column; instead, create a new variable in the data.frame.

nesarc_sub$Alcohol_binary   <- ifelse(nesarc_sub$Age >= 21, "Legal", "Illegal")
nesarc_sub$Alcohol_binary01 <- ifelse(nesarc_sub$Age >= 21, 1, 0)
str(nesarc_sub$Alcohol_binary)
 chr [1:43093] "Legal" "Legal" "Legal" "Illegal" "Legal" "Legal" "Illegal" ...
head(nesarc_sub %>% select(Age, Alcohol_binary, Alcohol_binary01), 12)
   Age Alcohol_binary Alcohol_binary01
1   23          Legal                1
2   28          Legal                1
3   81          Legal                1
4   18        Illegal                0
5   36          Legal                1
6   34          Legal                1
7   19        Illegal                0
8   84          Legal                1
9   29          Legal                1
10  18        Illegal                0
11  68          Legal                1
12  48          Legal                1
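
As an aside, the cut() function mentioned above collapses a numeric variable into a factor with any number of intervals, not just two. A minimal sketch on a toy vector (hypothetical ages, not the NESARC data):

```r
# cut() turns a numeric vector into a factor of intervals
ages <- c(18, 20, 21, 35, 64, 70)
age_group <-
  cut(
    ages
  , breaks = c(0, 20, 64, Inf)         # interval endpoints: (0,20], (20,64], (64,Inf]
  , labels = c("<21", "21-64", "65+")  # one label per interval
  )
table(age_group)
# levels <21, 21-64, 65+ have counts 2, 3, 1
```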
##### 5.5.2.1.2 Multi-level categorical to binary.
# Smoking frequency has 7 categories
# if at least 3 Days/week, then code as a 1, otherwise code as a 0
table(nesarc_sub$SmokingFreq)

               Never Once a month or less    2 to 3 Days/month
               25080                  772                  409
    1 to 2 Days/week     3 to 4 Days/week     5 to 6 Days/week
                 747                  687                  460
           Every Day
               14836

nesarc_sub$SmokingFreq01 <-
ifelse(
nesarc_sub$SmokingFreq %in% c(
"3 to 4 Days/week"
, "5 to 6 Days/week"
, "Every Day"
)
, 1  # success
, 0  # failure
)
# check that the frequencies are correct before and after
table(nesarc_sub$SmokingFreq01)

0     1
27110 15983 
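
As a sanity check, the 1 count should equal the sum of the three “success” categories from the SmokingFreq table above:

```r
# 3 to 4 Days/week + 5 to 6 Days/week + Every Day
687 + 460 + 14836
# 15983, matching the table(nesarc_sub$SmokingFreq01) output
```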

## 5.6 Class 06, Figure arrangement, captions, cross-referencing

Rubric

1. Reorganize your Class 05 bivariate plots above using plot_grid() and quarto, creating captions and cross-referencing them from the text.

1. (3 p) For your numeric response plots, use cowplot::plot_grid() to create a single figure with separate plot panels.

2. (3 p) For your categorical response plots, use quarto chunk options fig-cap, fig-subcap, and layout-ncol to create a single figure with separate plot panels.

3. (2 p) Use captions to describe (not interpret) both sets of plots so a reader understands what is being plotted.

4. (2 p) Use cross-referencing from the text to refer to the plots when you interpret them.

Note

Go above and reformat your plots and update your interpretations with cross-referencing.

Note

In my example, for pedagogical purposes, I reproduce the plots and illustrate the cross-referencing below. This allows me to separate for you the original creation of the plots from this step of arranging and cross-referencing them.

While I show the chunk options in the html output file, it’s best to look at the .qmd file to see them in context.

### 5.6.1 Plotting bivariate, numeric response, cowplot::plot_grid()

Note

Start by creating two plots and call them p1 and p2. Notice that I create each with the variable p, then assign p to p1 (or p2) at the end.

# These two plots are copied from above.

# Scatter plot (for regression): x = numerical, y = numerical
library(ggplot2)
p <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = Age, y = TotalCigsSmoked))
p <- p + theme_bw()
#p <- p + geom_point(alpha = 1/10)
p <- p + geom_jitter(width = 0.25, height = 5, alpha = 1/10)
p <- p + geom_smooth()
p <- p + stat_smooth(method = lm, colour = "red")
p <- p + labs(
title = "Total Cigarettes Smoked vs Age"
#, x =
, y = "Total Cigarettes Smoked per month"
, caption = "Key: Blue line is GAM smoother, Red line is simple linear regression.\nFiltered to include only smokers (cig > 0)."
)

p1 <- p

# Box plots (for ANOVA): x = categorical, y = numerical
library(ggplot2)
p <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = Depression, y = TotalCigsSmoked))
p <- p + theme_bw()
p <- p + geom_boxplot(width = 0.5, alpha = 0.5)
p <- p + geom_jitter(position = position_jitter(width = 0.1), alpha = 1/100)
# diamond at mean for each group
p <- p + stat_summary(fun = mean, geom = "point", shape = 18, size = 6,
colour = "red", alpha = 0.8)
p <- p + labs(
title = "Total Cigarettes Smoked by Depression"
, x = "Depression status"
, y = "Total Cigarettes Smoked per month"
, caption = "Red diamond is the mean.\nFiltered to include only smokers (cig > 0)."
)
p2 <- p
Finer points

Look at the help for ?cowplot::plot_grid; there are many options that I don’t use below. I put the p1 and p2 into a list and print the one plot. In the code chunk, I specified a label and a figure caption; I also increase the size of the plot. From the text, I use the chunk label as a cross-reference; note that I have to manually specify figure panels “A” and “B”. I just copied the previous interpretations.

These are the chunk options used below, I recommend looking in the .qmd file.

#| label: fig-bivariate-plots-num
#| fig-cap: Numeric response bivariate plots. (A) Total Cigarettes Smoked vs Age.  (B) Total Cigarettes Smoked by Depression
#| fig-width:   8
#| fig-height:  4
p_arranged <-
cowplot::plot_grid(
plotlist  = list(p1, p2)   # list of plots
, nrow      = 1              # number of rows for grid of plots
, ncol      = NULL           # number of columns, left unspecified
, labels    = "AUTO"         # A and B panel labels
)
geom_smooth() using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
geom_smooth() using formula 'y ~ x'
print(p_arranged)

Interpretation:

In Figure 1 A, the relationship between Age and TotalCigsSmoked, among people who smoke, is very weakly positive, but does not resemble a straight line. TotalCigsSmoked is right-skewed which results in many extreme values above the regression line.

In Figure 1 B, for NESARC participants who smoke (cig > 0), those who are depressed have a slightly higher mean number of cigarettes per month (red diamond) than those who are not depressed. Total cigarettes smoked per month is right skewed so there are many people who smoke very little but few people who smoke a lot, which increases the mean beyond the median.

### 5.6.2 Plotting bivariate, categorical response, quarto chunk options

Finer points

I’ve now also added a fig-subcap (figure subcaptions) which appear under each figure. They must be specified in the same order as the plots are printed and their cross-reference label appends a suffix of -1, -2, etc., to the fig-cap. In the interpretation below, I’ve included the cross-reference parenthetically at the end of the first sentence describing them, which is another common style.

These are the chunk options used below, I recommend looking in the .qmd file.

#| label: fig-bivariate-plots-cat
#| fig-cap: Categorical response bivariate plots.
#| fig-subcap:
#|   - "Proportion of young adult smokers with and without nicotine dependence by depression status"
#|   - "Likelihood of Nicotine Dependence increases with number of cigarettes smoked"
#| layout-ncol: 1  # one column for plots      # alternatively, #| layout: [45,-10, 45] puts 2 on one row with a 10% width space between
#| fig-width:   5
#| fig-height:  3
# Mosaic plot or bivariate bar plots (for contingency tables): x = categorical, y = categorical
library(ggplot2)
p <- ggplot(data = nesarc_sub, aes(x = Depression, fill = forcats::fct_rev(NicotineDependence)))
p <- p + theme_bw()
p <- p + geom_bar(position = "fill")
p <- p + labs(x = "Depression status"
, y = "Proportion"
, fill = "Nicotine Dependence\nStatus"  # the legend title can be modified
, title = "Proportion of young adult smokers with and without\nnicotine dependence by depression status"
)
print(p)

# Logistic scatter plot (for logistic regression): x = numerical, y = categorical (binary)
library(ggplot2)
p <- ggplot(nesarc_sub, aes(x = TotalCigsSmoked, y = NicotineDependence01))
p <- p + theme_bw()
p <- p + geom_jitter(position = position_jitter(height = 0.1), alpha = 1/4)

# overlay a sigmodal curve to indicate direction of relationship
# As of ggplot2 2.1.0, need to use this binomial_smooth() function:
binomial_smooth <- function(...) {
geom_smooth(method = "glm", method.args = list(family = "binomial"), ...)
}
p <- p + binomial_smooth()    # old way: stat_smooth(method="glm", family="binomial", se=FALSE)

p <- p + scale_y_continuous(breaks = seq(0, 1, by=0.2), minor_breaks = seq(0, 1, by=0.1))

p <- p + labs(x = "Estimated Cigarettes Smoked per Month"
, y = "Nicotine Dependence (1=Yes, 0=No)"
, title = "Likelihood of Nicotine Dependence\nincreases with number of cigarettes smoked"
)
print(p)
geom_smooth() using formula 'y ~ x'
Warning: Removed 297 rows containing non-finite values (stat_smooth).
Warning: Removed 297 rows containing missing values (geom_point).

Interpretation:

Categorical response bivariate plots are in Figure 2.

When a person has depression, they are more than twice as likely to have nicotine dependence (22.6%) as those without depression (9.1%) (Figure 2 (a)).

The probability of someone being tobacco dependent increases as the number of cigarettes smoked per month increases (Figure 2 (b)). The logistic curve model probably doesn’t apply at (or near) 0 cigarettes.
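
The sigmoidal curve in these logistic scatter plots is the fitted logistic regression model, which we will meet formally later in the semester. As a preview (standard form, with $$\beta_0$$ and $$\beta_1$$ the intercept and slope), it models the probability of success as

$$
\Pr(y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}
$$

which always stays between 0 and 1, unlike a straight line.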

# 6 Statistical methods

## 6.1 Class 07-1, Simple linear regression (separate worksheet)

Note

Find this assignment on the course website.

## 6.2 Class 07-2, Simple linear regression

Rubric

1. With your previous (or new) bivariate scatter plot, add a regression line.
• (2 p) plot with regression line,
• (1 p) label axes and title.
2. Use lm() to fit the linear regression and interpret slope and $$R^2$$ (R-squared) values.
• (2 p) lm summary() table is presented,
• (2 p) slope is interpreted with respect to a per-unit increase of the $$x$$ variable in the context of the variables in the plot,
• (2 p) $$R^2$$ is interpreted in a sentence.
3. (1 p) Interpret the intercept. Does it make sense in the context of your study?

### 6.2.1 1. Scatter plot, add a regression line.

library(ggplot2)
p <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = Age, y = TotalCigsSmoked))
p <- p + theme_bw()
#p <- p + geom_point(alpha = 1/10)
p <- p + geom_jitter(width = 0.25, height = 5, alpha = 1/10)
#p <- p + geom_smooth()
p <- p + stat_smooth(method = lm, fullrange = TRUE)  # adds the regression line
p <- p + labs(
title = "Total Cigarettes Smoked vs Age"
#, x =
, y = "Total Cigarettes Smoked per month"
, caption = "Key: Blue line is simple linear regression.\nFiltered to include only smokers (cig > 0)."
)
p <- p + xlim(c(0, NA))  # extends the range of the axis down to 0 to see the "intercept"
print(p)
geom_smooth() using formula 'y ~ x'

### 6.2.2 2. Fit the linear regression, interpret slope and $$R^2$$ (R-squared) values

# fit the simple linear regression model
lm_TCS_Age <-
lm(
formula = TotalCigsSmoked ~ Age
, data = nesarc_sub
)
# use summary() to report parameter estimates (slope, intercept) and other summaries
summary(lm_TCS_Age)

Call:
lm(formula = TotalCigsSmoked ~ Age, data = nesarc_sub)

Residuals:
Min     1Q Median     3Q    Max
-332.1 -189.4 -139.1  106.3 2828.9

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 57.92792    4.35989   13.29   <2e-16 ***
Age          2.79797    0.08762   31.93   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 329.2 on 42794 degrees of freedom
(297 observations deleted due to missingness)
Multiple R-squared:  0.02327,   Adjusted R-squared:  0.02325
F-statistic:  1020 on 1 and 42794 DF,  p-value: < 2.2e-16
• Slope: For every 1 year increase in Age, we expect an increase of 2.798 total cigarettes smoked per month.

• $$R^2$$: The proportion of variance explained by the regression model, compared to the grand mean, is $$R^2 = 0.0233$$, which is very small.

Note

In simple linear regression, $$R^2 = r^2$$, the square of the correlation between $$x$$ and $$y$$ (so the capitalization doesn’t matter in this case); the stronger the correlation ($$r \ne 0$$), the larger $$R^2$$ will be.
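
A quick numerical check of the $$R^2 = r^2$$ identity on toy data (hypothetical values, not NESARC):

```r
set.seed(42)
x <- 1:20
y <- 2 * x + rnorm(20, sd = 5)

r  <- cor(x, y)                     # sample correlation
R2 <- summary(lm(y ~ x))$r.squared  # R-squared from the regression

all.equal(r^2, R2)  # TRUE: in simple linear regression, R^2 = r^2
```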

### 6.2.3 3. Interpret the intercept. Does it make sense?

For someone 0 years old, the expected number of cigarettes smoked per month is 57.93.

This does not make sense since newborns do not smoke and because this is a large extrapolation from the data.
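
Predictions within the range of the data are more trustworthy than the intercept. For example, the fitted value at age 40, computed from the coefficients above:

```r
b0 <- 57.92792  # intercept from the summary above
b1 <- 2.79797   # slope from the summary above
b0 + b1 * 40    # expected total cigarettes smoked per month at age 40
# approximately 169.8
```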

## 6.3 Class 08-1, Logarithm transformation (separate worksheet)

Note

Find this assignment on the course website.

## 6.4 Class 08-2, Logarithm transformation

Rubric

1. Try plotting the data on a logarithmic scale
• (6 p) Each of the logarithmic relationships is plotted, axes are labeled with scale.
1. original scales
2. $$\log(x)$$-only
3. $$\log(y)$$-only
4. both $$\log(x)$$ and $$\log(y)$$
2. What happened to your data when you transformed it?
• (2 p) Describe what happened to the relationship after each log transformation (compare transformed scale to original scale; is the relationship more linear, more curved?).
• (1 p) Choose the best scale for a linear relationship and explain why.
• (1 p) Does your relationship benefit from a logarithmic transformation? Say why or why not.

### 6.4.1 1. Try plotting on log scale (original scale, $$\log(x)$$-only, $$\log(y)$$-only, both $$\log(x)$$ and $$\log(y)$$)

# IncomePersonal
#p <- p + scale_x_log10()
#p <- p + scale_y_log10()

library(ggplot2)
p1 <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = Age, y = TotalCigsSmoked))
p1 <- p1 + theme_bw()
p1 <- p1 + geom_point(alpha = 1/20)
p1 <- p1 + stat_smooth(method = lm)
p1 <- p1 + labs(
title = "Original scale"
, x = "Age"
, y = "Total Cigarettes Smoked per month"
)
#print(p1)

library(ggplot2)
p2 <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = Age, y = TotalCigsSmoked))
p2 <- p2 + theme_bw()
p2 <- p2 + geom_point(alpha = 1/20)
p2 <- p2 + stat_smooth(method = lm)
p2 <- p2 + scale_x_log10()          # convert the x-axis to the log10 scale
p2 <- p2 + labs(
title = "Log(x)"
, x = "Age"
, y = "Total Cigarettes Smoked per month"
)
#print(p2)

library(ggplot2)
p3 <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = Age, y = TotalCigsSmoked))
p3 <- p3 + theme_bw()
p3 <- p3 + geom_point(alpha = 1/20)
p3 <- p3 + stat_smooth(method = lm)
p3 <- p3 + scale_y_log10()          # convert the y-axis to the log10 scale
p3 <- p3 + labs(
title = "Log(y)"
, x = "Age"
, y = "Total Cigarettes Smoked per month"
)
#print(p3)

library(ggplot2)
p4 <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = Age, y = TotalCigsSmoked))
p4 <- p4 + theme_bw()
p4 <- p4 + geom_point(alpha = 1/20)
p4 <- p4 + stat_smooth(method = lm)
p4 <- p4 + scale_x_log10()          # convert the x-axis to the log10 scale
p4 <- p4 + scale_y_log10()          # convert the y-axis to the log10 scale
p4 <- p4 + labs(
title = "Log(x) and Log(y)"
, x = "Age"
, y = "Total Cigarettes Smoked per month"
)
#print(p4)

# arrange them into a single figure
p_arranged <-
cowplot::plot_grid(
plotlist  = list(p1, p2, p3, p4)  # list of plots
, nrow      = 2              # number of rows for grid of plots
, ncol      = NULL           # number of columns, left unspecified
, labels    = "AUTO"         # A and B panel labels
)
geom_smooth() using formula 'y ~ x'
geom_smooth() using formula 'y ~ x'
geom_smooth() using formula 'y ~ x'
geom_smooth() using formula 'y ~ x'
print(p_arranged)

### 6.4.2 2. What happened to your data when you transformed it?

• Describe what happened to the relationship after each log transformation (compare transformed scale to original scale).
1. Original: same as before (right-skewed response).
2. log(x): no improvement; the age values became left skewed.
3. log(y): some improvement; cigarettes smoked became left skewed, but the regression line was closer to the center of the data.
4. log(x) and log(y): some improvement, as with log(y) alone; the age transformation doesn’t help.
• Choose the best scale for a linear relationship and explain why.

log(y) because the cigs smoked became slightly left skewed (better than the extreme right skewness from before), and there was no need for an age transformation.

• Does your relationship benefit from a logarithmic transformation? Say why or why not.

A little bit since the degree of skewness vertically around the regression line was reduced with the log(y) transformation.

Finer points

Notice that the interpretation is on the scale of the regression. If the $$x$$ or $$y$$ variable is on a different scale (such as $$\log_{10}$$), then the interpretation is on that scale, too.

The plot with $$(x, \log_{10}(y))$$ is an improvement over $$(x,y)$$ since the points are more symmetric around the regression line (they are not perfectly symmetric, but they are better). There might be a better transformation (such as square-root), or we might have to live with a non-symmetric response.

nesarc_sub <-
nesarc_sub %>%
mutate(
TotalCigsSmokedLog10 = log10(TotalCigsSmoked)
)

# fit the simple linear regression model
lm_TCSL10_Age <-
lm(
TotalCigsSmokedLog10 ~ Age
, data = nesarc_sub %>% filter(TotalCigsSmoked > 0)
)

# use summary() to report parameter estimates (slope, intercept) and other summaries
summary(lm_TCSL10_Age)

Call:
lm(formula = TotalCigsSmokedLog10 ~ Age, data = nesarc_sub %>%
filter(TotalCigsSmoked > 0))

Residuals:
Min      1Q  Median      3Q     Max
-2.6113 -0.1787  0.2314  0.4303  1.2553

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.1093341  0.0152392  138.41   <2e-16 ***
Age         0.0054558  0.0002952   18.48   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6836 on 17714 degrees of freedom
Multiple R-squared:  0.01892,   Adjusted R-squared:  0.01886
F-statistic: 341.6 on 1 and 17714 DF,  p-value: < 2.2e-16

The slope is 0.00546. Thus, for every year increase in age, we expect an increase of 0.00546 in the log10 of total cigarettes smoked; that is, we expect a small increase on the log scale.
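
Because the response is on the $$\log_{10}$$ scale, the slope can also be read multiplicatively on the original scale: each additional year of age multiplies the expected total cigarettes smoked by $$10^{0.00546}$$.

```r
b1 <- 0.0054558  # slope on the log10 scale (from the summary above)
10^b1            # multiplicative change per year on the original scale
# approximately 1.013, i.e., about a 1.3% increase per year of age
```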

Finer points

Logarithm transformations of any base (2, $$e$$, 10) modify the shape of the data in the same way. The difference is a multiplicative scaling of the axis. My favorite is base 2 because many datasets have several orders of “doubling” and it’s easy for me to think of each unit on the log 2 scale being twice as big as the previous unit. Compare this to base 10, where each unit is 10 times as large as the previous; very few datasets have such a vast range!
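
A small check that logs in different bases differ only by a multiplicative rescaling of the axis:

```r
x <- c(1, 2, 4, 8, 1000)

# log2(x) is just log10(x) rescaled by the constant 1 / log10(2)
all.equal(log2(x), log10(x) / log10(2))  # TRUE

log2(8)      # 3: three doublings from 1
log10(1000)  # 3: three factors of 10 from 1
```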

Finer points

Here’s an example where transforming both $$x$$ and $$y$$ with the log makes a big improvement.

# IncomePersonal
#p <- p + scale_x_log10()
#p <- p + scale_y_log10()

library(ggplot2)
p1 <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = IncomePersonal, y = TotalCigsSmoked))
p1 <- p1 + theme_bw()
p1 <- p1 + geom_point(alpha = 1/20)
p1 <- p1 + stat_smooth(method = lm)
p1 <- p1 + labs(
title = "Original scale"
, x = "Personal Income"
, y = "Total Cigarettes Smoked per month"
)
#print(p1)

library(ggplot2)
p2 <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = IncomePersonal, y = TotalCigsSmoked))
p2 <- p2 + theme_bw()
p2 <- p2 + geom_point(alpha = 1/20)
p2 <- p2 + stat_smooth(method = lm)
p2 <- p2 + scale_x_log10()
p2 <- p2 + labs(
title = "Log(x)"
, x = "Personal Income"
, y = "Total Cigarettes Smoked per month"
)
#print(p2)

library(ggplot2)
p3 <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = IncomePersonal, y = TotalCigsSmoked))
p3 <- p3 + theme_bw()
p3 <- p3 + geom_point(alpha = 1/20)
p3 <- p3 + stat_smooth(method = lm)
p3 <- p3 + scale_y_log10()
p3 <- p3 + labs(
title = "Log(y)"
, x = "Personal Income"
, y = "Total Cigarettes Smoked per month"
)
#print(p3)

library(ggplot2)
p4 <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = IncomePersonal, y = TotalCigsSmoked))
p4 <- p4 + theme_bw()
p4 <- p4 + geom_point(alpha = 1/20)
p4 <- p4 + stat_smooth(method = lm)
p4 <- p4 + scale_x_log10()
p4 <- p4 + scale_y_log10()
p4 <- p4 + labs(
title = "Log(x) and Log(y)"
, x = "Personal Income"
, y = "Total Cigarettes Smoked per month"
)
#print(p4)

p_arranged <-
cowplot::plot_grid(
plotlist  = list(p1, p2, p3, p4)  # list of plots
, nrow      = 2              # number of rows for grid of plots
, ncol      = NULL           # number of columns, left unspecified
, labels    = "AUTO"         # A and B panel labels
)
geom_smooth() using formula 'y ~ x'
Warning: Transformation introduced infinite values in continuous x-axis
Transformation introduced infinite values in continuous x-axis
geom_smooth() using formula 'y ~ x'
Warning: Removed 736 rows containing non-finite values (stat_smooth).
geom_smooth() using formula 'y ~ x'
Warning: Transformation introduced infinite values in continuous x-axis
Transformation introduced infinite values in continuous x-axis
geom_smooth() using formula 'y ~ x'
Warning: Removed 736 rows containing non-finite values (stat_smooth).
print(p_arranged)

### This is another way to arrange several ggplot objects
# library(gridExtra)
# grid.arrange(grobs = list(p1, p2, p3, p4), nrow=2, top = "Total Cigarettes Smoked vs Personal Income")

## 6.5 Class 09, Correlation (separate worksheet)

Note

Find this assignment on the course website.

## 6.6 Class 10, Categorical contingency tables (separate worksheet)

Note

Find this assignment on the course website.

## 6.7 Class 11, Correlation and Categorical contingency tables

Rubric

1. With your previous (or a new) bivariate scatter plot, calculate the correlation and interpret.
• (1 p) plot is repeated here or the plot is referenced and easy to find from a plot above,
• (1 p) correlation is calculated,
• (2 p) correlation is interpreted (direction, strength of LINEAR relationship).
2. With your previous (or a new) two- or three-variable categorical plot, calculate conditional proportions and interpret.
• (1 p) frequency table of variables is given,
• (2 p) conditional proportion tables are calculated of the outcome variable conditional on one or two other variables,
• (1 p) a well-labeled plot of the proportion table is given,
• (2 p) the conditional proportions are interpreted and compared between conditions.

### 6.7.1 Correlation

library(ggplot2)
p <- ggplot(nesarc_sub, aes(x = Age, y = TotalCigsSmoked_log2))
p <- p + theme_bw()
p <- p + geom_point(alpha = 1/20)
p <- p + stat_smooth(method = lm)
p <- p + labs(
title = "Log2 Cigarettes smoked vs Age"
, x = "Age"
, y = "Log2 Total Cigarettes Smoked per month"
)
print(p)
geom_smooth() using formula 'y ~ x'
Warning: Removed 25377 rows containing non-finite values (stat_smooth).
Warning: Removed 25377 rows containing missing values (geom_point).

cor_A_C <-
cor(
    nesarc_sub$Age
  , nesarc_sub$TotalCigsSmoked_log2
, use = "complete.obs"
)
cor_A_C
[1] 0.1375481

### 6.7.2 Interpretation of correlation

The correlation is 0.138, indicating a positive, moderately weak linear relationship between Age and Log2 total cigarettes smoked per month: as people get older, they tend to increase slightly the Log2 total number of cigarettes smoked per month.
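As a reminder of what the single number summarizes, the Pearson correlation is the standardized covariance, $$r = S_{xy}/\sqrt{S_{xx} S_{yy}}$$, and it measures only LINEAR association. A minimal sketch in Python on hypothetical toy data (not the NESARC variables):

```python
# Pearson correlation r = Sxy / sqrt(Sxx * Syy), a pure-Python sketch
# on hypothetical toy data (not the NESARC variables).
def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# A perfect increasing line gives r = 1; a perfect decreasing line gives r = -1.
assert abs(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]) - 1.0) < 1e-12
assert abs(pearson_r([1, 2, 3], [6, 4, 2]) + 1.0) < 1e-12
```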

### 6.7.3 Contingency table

tab_S_D_N <-
nesarc_sub %>%
group_by(
Sex
, Depression
, NicotineDependence
) %>%
summarise(
Frequency = n()
) %>%
mutate(
Proportion = round(Frequency / sum(Frequency), 3)
) %>%
ungroup()
summarise() has grouped output by 'Sex', 'Depression'. You can override using
the .groups argument.
# tab_S_D_N

tab_S_D_N %>%
knitr::kable()
Sex     Depression       NicotineDependence        Frequency   Proportion
Female  No Depression    Nicotine Dependence            1455        0.076
Female  No Depression    No Nicotine Dependence        17706        0.924
Female  Yes Depression   Nicotine Dependence            1117        0.206
Female  Yes Depression   No Nicotine Dependence         4297        0.794
Male    No Depression    Nicotine Dependence            1738        0.108
Male    No Depression    No Nicotine Dependence        14355        0.892
Male    Yes Depression   Nicotine Dependence             652        0.269
Male    Yes Depression   No Nicotine Dependence         1773        0.731

library(ggplot2)
p <- ggplot(data = tab_S_D_N, aes(x = Depression, y = Proportion, fill = forcats::fct_rev(NicotineDependence)))
p <- p + theme_bw()
p <- p + geom_hline(yintercept = c(0, 1), alpha = 1/4)
p <- p + geom_bar(stat="identity")
p <- p + labs(
title = "Proportion of Nicotine Dependence by Depression status\nfor each Gender"
, y = "Proportion of Nicotine Dependence"
, fill = "Nicotine Dependence"
)
p <- p + scale_y_continuous(limits = c(0, 1))
p <- p + facet_wrap(~ Sex)
p <- p + theme(legend.position = "bottom")
print(p)

### 6.7.4 Interpretation of conditional proportions

• For Females without depression, the proportion who are nicotine dependent is 0.076, while for those with depression it is 0.206 (nearly 3 times as much).
• For Males without depression, the proportion who are nicotine dependent is 0.108, while for those with depression it is 0.269 (nearly 3 times as much).

Males generally are more nicotine dependent than Females, but both are nearly 3 times more likely to be nicotine dependent if they are depressed compared to if they are not depressed.
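The conditional proportions can be re-derived from the frequency table: within each Sex-by-Depression combination, divide the nicotine-dependent count by the combination's total. An illustrative check in Python (frequencies copied from the table above; the variable names are mine):

```python
# Recompute the conditional proportions from the raw frequencies,
# conditioning on Sex and Depression (counts copied from the table above).
freq = {
    ("Female", "No Depression"):  {"dep": 1455, "no_dep": 17706},
    ("Female", "Yes Depression"): {"dep": 1117, "no_dep": 4297},
    ("Male",   "No Depression"):  {"dep": 1738, "no_dep": 14355},
    ("Male",   "Yes Depression"): {"dep": 652,  "no_dep": 1773},
}
prop = {k: round(v["dep"] / (v["dep"] + v["no_dep"]), 3) for k, v in freq.items()}

assert prop[("Female", "No Depression")]  == 0.076
assert prop[("Female", "Yes Depression")] == 0.206
assert prop[("Male",   "No Depression")]  == 0.108
assert prop[("Male",   "Yes Depression")] == 0.269
```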

## 6.8 Class 12-1, Parameter estimation (one-sample) (separate worksheet)

Note

Find this assignment on the course website.

## 6.9 Class 12-2, Inference and Parameter estimation (one-sample)

Rubric

1. Using a numerical variable, calculate and interpret a confidence interval for the population mean.
• (1 p) Identify and describe the variable,
• (1 p) use t.test() to calculate the mean and confidence interval, and
• (1 p) interpret the confidence interval.
• (2 p) Using plotting code from the last two classes, plot the data, estimate, and confidence interval in a single well-labeled plot.
2. Using a two-level categorical variable, calculate and interpret a confidence interval for the population proportion.
• (1 p) Identify and describe the variable,
• (1 p) use binom.test() to calculate the sample proportion and confidence interval, and
• (1 p) interpret the confidence interval.
• (2 p) Using plotting code from the last two classes, plot the data, estimate, and confidence interval in a single well-labeled plot.

### 6.9.1 Numeric variable confidence interval for mean $$\mu$$

TotalCigsSmoked_log2 is the estimated number of cigarettes per month a subject smokes (DaysSmoke * DailyCigsSmoked) on the log2 scale.
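Because TotalCigsSmoked is the product DaysSmoke * DailyCigsSmoked, its log2 is the sum of the two log2 components. A quick check of that identity (illustrative Python with hypothetical numbers):

```python
import math

# TotalCigsSmoked = DaysSmoke * DailyCigsSmoked, so on the log2 scale the
# product becomes a sum: log2(Days * Daily) = log2(Days) + log2(Daily).
days, daily = 30, 20   # hypothetical: a pack-a-day smoker
total_log2 = math.log2(days * daily)
assert abs(total_log2 - (math.log2(days) + math.log2(daily))) < 1e-12
```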

# calculate mean and CI
t_result <-
t.test(
    nesarc_sub$TotalCigsSmoked_log2
  )
t_result

        One Sample t-test

data:  nesarc_sub$TotalCigsSmoked_log2
t = 457.94, df = 17715, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
7.854201 7.921726
sample estimates:
mean of x
7.887963 

Interpretation:

• The mean of TotalCigsSmoked_log2 is 7.89 and the 95% confidence interval is (7.85, 7.92).
• In 95% of samples, a confidence interval constructed from the data will contain the true population mean. In our case, our CI was (7.85, 7.92). It either contains the true population mean or it doesn’t, but we don’t know.
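The pieces of the t.test() output fit together: the standard error is the mean divided by the t statistic, and mean ± 1.96 SE reproduces the reported interval (for df = 17715 the t critical value is essentially the normal 1.96). A quick arithmetic check (illustrative Python using the numbers printed above):

```python
# Reconstruct the 95% CI from the t.test() output above:
# SE = mean / t, and CI = mean +/- t_crit * SE, with t_crit ~ 1.96 for df = 17715.
mean, t_stat = 7.887963, 457.94
se = mean / t_stat                      # about 0.0172
lo, hi = mean - 1.96 * se, mean + 1.96 * se
assert abs(lo - 7.854201) < 1e-3        # reported lower limit
assert abs(hi - 7.921726) < 1e-3        # reported upper limit
```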
# plot data and create table
library(ggplot2)
p <- ggplot(nesarc_sub, aes(x = TotalCigsSmoked_log2))
p <- p + theme_bw()
p <- p + geom_histogram(binwidth = 0.5, alpha = 1)
#p <- p + geom_rug(alpha = 1/8)
# est
p <- p + geom_vline(aes(xintercept = as.numeric(t_result$estimate)), colour = "blue", size = 1, linetype = 3)
p <- p + geom_rect(aes(
           xmin = t_result$conf.int[1]
         , xmax = t_result$conf.int[2]
         , ymin = -150
         , ymax = 0)
         , fill = "blue", alpha = 1)
p <- p + labs(title = "Log2 Total Cigs Smoked\nblue = est +- 95% CI")
print(p)
Warning: Removed 25377 rows containing non-finite values (stat_bin).

### 6.9.2 Categorical variable confidence interval for proportion $$p$$

Depression is a Yes/No variable indicating that the person has major depression (lifetime).

# proportions are of the last group_by() variable within combinations of the other variables
tab_dep_long <-
  nesarc_sub %>%
  group_by(Depression) %>%
  summarise(Frequency = n()) %>%
  mutate(Proportion = Frequency / sum(Frequency)) %>%
  ungroup()
tab_dep_long
# A tibble: 2 × 3
  Depression     Frequency Proportion
  <fct>              <int>      <dbl>
1 No Depression      35254      0.818
2 Yes Depression      7839      0.182

# prop.test() is an asymptotic (approximate) test for a binomial random variable
p_summary <-
  prop.test(
    x = tab_dep_long %>% filter(Depression == "Yes Depression") %>% pull(Frequency)
  , n = sum(tab_dep_long$Frequency)
#, conf.level = 0.95
)
p_summary

1-sample proportions test with continuity correction

data:  tab_dep_long %>% filter(Depression == "Yes Depression") %>% pull(Frequency) out of sum(tab_dep_long$Frequency), null probability 0.5
X-squared = 17440, df = 1, p-value < 2.2e-16
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.1782835 0.1855912
sample estimates:
        p 
0.1819089 

Interpretation:

• The proportion of people who have Depression is 0.182 and the 95% confidence interval is (0.178, 0.186).
• In 95% of samples, a confidence interval constructed from the data will contain the true population proportion. In our case, our CI was (0.178, 0.186). It either contains the true population proportion or it doesn’t, but we don’t know.

library(ggplot2)
p <- ggplot(data = tab_dep_long %>% filter(Depression == "Yes Depression"), aes(x = Depression, y = Proportion))
p <- p + theme_bw()
p <- p + geom_hline(yintercept = c(0, 1), alpha = 1/4)
#p <- p + geom_bar(stat = "identity")
p <- p + geom_errorbar(aes(min = p_summary$conf.int[1], max = p_summary$conf.int[2]), width = 0.25)
p <- p + geom_point(colour = "red", size = 6, pch = 18)
p <- p + scale_y_continuous(limits = c(0.17, 0.19))
# swap axes to take less vertical space
#p <- p + coord_flip()
print(p)
Warning: Removed 2 rows containing missing values (geom_hline).

## 6.10 Class 13, Hypothesis testing (one- and two-sample) (separate worksheet)

Note

Find this assignment on the course website.

## 6.11 Class 14, Paired data, assumption assessment (separate worksheet)

Note

Find this assignment on the course website.

## 6.12 Class 15, Hypothesis testing (one- and two-sample)

### 6.12.1 Mechanics of a hypothesis test (review)

1. Set up the null and alternative hypotheses in words and notation.
• In words: “The population mean for [what is being studied] is different from [value of $$\mu_0$$].” (Note that the statement in words is in terms of the alternative hypothesis.)
• In notation: $$H_0: \mu=\mu_0$$ versus $$H_A: \mu \ne \mu_0$$ (where $$\mu_0$$ is specified by the context of the problem).
2. Choose the significance level of the test, such as $$\alpha=0.05$$.
3. Compute the test statistic, such as $$t_{s} = \frac{\bar{Y}-\mu_0}{SE_{\bar{Y}}}$$, where $$SE_{\bar{Y}}=s/\sqrt{n}$$ is the standard error.
4. Determine the tail(s) of the sampling distribution where the $$p$$-value from the test statistic will be calculated (for example, both tails, right tail, or left tail). (Historically, we would compare the observed test statistic, $$t_{s}$$, with the critical value $$t_{\textrm{crit}}=t_{\alpha/2}$$ in the direction of the alternative hypothesis from the $$t$$-distribution table with degrees of freedom $$df = n-1$$.)
5. State the conclusion in terms of the problem.
• Reject $$H_0$$ in favor of $$H_A$$ if $$p\textrm{-value} < \alpha$$.
• Fail to reject $$H_0$$ if $$p\textrm{-value} \ge \alpha$$. (Note: We DO NOT accept $$H_0$$.)
6. Check assumptions of the test (for now we skip this).

### 6.12.2 What do we do about “significance”?

Adapted from Significance Magazine.

Recent calls have been made to abandon the term “statistical significance”. The American Statistical Association (ASA) issued its statement and recommendation on p-values (see the special issue on p-values for more). In summary, the problem of “significance” is one of misuse, misunderstanding, and misinterpretation.

The recommendation in this class is that it is no longer sufficient to say that a result is “statistically significant” or “non-significant” depending on whether a p-value is less than a threshold. Instead, we will be looking for wording as in the following paragraph.

“The difference between the two groups turns out to be small (8%), while the probability ($$p$$) of observing a result at least as extreme as this under the null hypothesis of no difference between the two groups is $$p = 0.003$$ (that is, 0.3%).
This p-value is statistically significant as it is below our pre-defined threshold ($$p < 0.05$$). However, the p-value tells us only that the 8% difference between the two groups is somewhat unlikely given our hypothesis and null model’s assumptions. More research is required, or other considerations may be needed, to conclude that the difference is of practical importance and reproducible.”

### 6.12.3 Two-sample $$t$$-test

Rubric

1. Using a numerical response variable and a two-level categorical variable (or a categorical variable you can reduce to two levels), specify a two-sample $$t$$-test associated with your research questions.
• (2 p) Specify the hypotheses in words and notation (either one- or two-sided test),
• (0 p) use t.test() to calculate the mean, test statistic, and p-value,
• (2 p) state the significance level, test statistic, and p-value, and
• (2 p) state the conclusion in the context of the problem.
• (1 p) Given your conclusion, could you have committed a Type-I or Type-II error?
• (2 p) Provide an appropriate plot of the data and sample estimates in a well-labeled plot.
• (1 p) Check the assumptions of the test using the bootstrap and interpret the bootstrap sampling distribution plot.
2. Set up the null and alternative hypotheses in words and notation.
• In words: “The population mean total cigarettes smoked (TotalCigsSmoked) is greater for people with depression than for those without depression (Depression).”
• In notation: $$H_0: \mu_D = \mu_N$$ versus $$H_A: \mu_D > \mu_N$$.
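A note on the direction of the one-sided test: because R orders the Depression factor as (No Depression, Yes Depression), the t.test() call that follows uses alternative = "less" to test $$\mu_N - \mu_D < 0$$, which is equivalent to $$H_A: \mu_D > \mu_N$$. A minimal sketch in Python with hypothetical toy numbers (not the NESARC data) showing the sign convention of a Welch $$t$$ statistic:

```python
import statistics

def welch_t(a, b):
    """Welch two-sample t statistic for mean(a) - mean(b)."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / (va / na + vb / nb) ** 0.5

# Hypothetical toy values: the "No" group smokes less than the "Yes" group.
no_dep  = [100.0, 120.0, 150.0, 170.0]
yes_dep = [200.0, 240.0, 260.0, 300.0]

t_stat = welch_t(no_dep, yes_dep)
# A negative t supports mu_No - mu_Yes < 0, i.e., mu_D > mu_N.
assert t_stat < 0
```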
t_summary_TCS_D <-
  t.test(
    TotalCigsSmoked ~ Depression
  , data = nesarc_sub
  , alternative = "less"  # group order is (No, Yes), so "less" tests mu_N - mu_D < 0, i.e., mu_D > mu_N
  )
t_summary_TCS_D

        Welch Two Sample t-test

data:  TotalCigsSmoked by Depression
t = -17.226, df = 10439, p-value < 2.2e-16
alternative hypothesis: true difference in means between group No Depression and group Yes Depression is less than 0
95 percent confidence interval:
      -Inf -71.94239
sample estimates:
 mean in group No Depression mean in group Yes Depression 
                    173.0296                     252.5677 

1. Choose the significance level of the test, such as $$\alpha=0.05$$.
2. Compute the test statistic, such as $$t_{s} = -17.2$$.
3. $$p$$-value is $$p = 6.85\times 10^{-66}$$.
4. State the conclusion in terms of the problem.
• Reject $$H_0$$ in favor of $$H_A$$, concluding that the population mean total cigarettes smoked (TotalCigsSmoked) is greater for people with depression than for those without depression (Depression).
• Because we rejected the null hypothesis, we could have made a Type I error.

## If we create a summary data.frame with a similar structure as our data, then we
## can annotate our plot with those summaries.
est_mean_TCS_D <-
  nesarc_sub %>%
  group_by(Depression) %>%
  summarise(TotalCigsSmoked = mean(TotalCigsSmoked, na.rm = TRUE)) %>%
  ungroup()
est_mean_TCS_D
# A tibble: 2 × 2
  Depression     TotalCigsSmoked
  <fct>                    <dbl>
1 No Depression             173.
2 Yes Depression            253.

library(ggplot2)
p <- ggplot(nesarc_sub, aes(x = Depression, y = TotalCigsSmoked))
p <- p + theme_bw()
p <- p + geom_boxplot(width = 0.5, alpha = 0.5)
p <- p + geom_jitter(position = position_jitter(width = 0.1), alpha = 1/4)
# diamond at mean for each group
p <- p + stat_summary(fun = mean, geom = "point", shape = 18, size = 6, colour = "red", alpha = 0.8)
p <- p + labs(
    title = "Total Cigarettes Smoked by Depression"
  , x = "Depression status"
  , y = "Total cigarettes smoked per month"
  )
print(p)
Warning: Removed 297 rows containing non-finite values (stat_boxplot).
Warning: Removed 297 rows containing non-finite values (stat_summary).
Warning: Removed 297 rows containing missing values (geom_point).

# histogram using ggplot
p <- ggplot(nesarc_sub, aes(x = TotalCigsSmoked))
p <- p + theme_bw()
p <- p + geom_histogram(binwidth = 40)
#p <- p + geom_rug()
p <- p + geom_vline(data = est_mean_TCS_D, aes(xintercept = TotalCigsSmoked), colour = "red")
p <- p + facet_grid(Depression ~ .)
p <- p + labs(
    title = "Total Cigarettes Smoked by Depression"
  , x = "Total cigarettes smoked per month"
  , y = "Count"
  )
print(p)
Warning: Removed 297 rows containing non-finite values (stat_bin).

e_plot_bs_two_samp_diff_dist(
    nesarc_sub %>% filter(Depression == "No Depression" ) %>% drop_na(TotalCigsSmoked) %>% pull(TotalCigsSmoked)
  , nesarc_sub %>% filter(Depression == "Yes Depression") %>% drop_na(TotalCigsSmoked) %>% pull(TotalCigsSmoked)
  , N = 1000
  )

• The bootstrap sampling distribution of the difference in means is close to normal, so the model assumptions are met and we can rely on the decision we made in the hypothesis test above.

## 6.13 Class 16, ANOVA, Pairwise comparisons (separate worksheet)

Note

Find this assignment on the course website.

## 6.14 Class 17, ANOVA and Assessing Assumptions

Rubric

1. Using a numerical response variable and a categorical variable with three to five levels (or a categorical variable you can reduce to three to five levels), specify an ANOVA hypothesis associated with your research questions.
• (1 p) Specify the ANOVA hypotheses in words and notation,
• (1 p) plot the data in a way that is consistent with the hypothesis test (comparing means, assess equal variance assumption),
• (1 p) use aov() to calculate the hypothesis test statistic and p-value,
• (1 p) state the significance level, test statistic, and p-value,
• (1 p) state the conclusion in the context of the problem,
• (2 p) assess the normality assumption of the residuals using appropriate methods (QQ-plot and Anderson-Darling test), and
• (1 p) assess the assumption of equal variance between your groups using an appropriate test (also mention standard deviations of each group).
• (2 p) If you rejected the ANOVA null hypothesis, perform follow-up pairwise comparisons using Tukey’s HSD to indicate which groups have statistically different means and summarize the results.

### 6.14.1 Hypothesis and plot

Let $$\mu_j$$ = population mean number of total cigarettes smoked for the five ethnicities identified in the dataset: Caucasian, African American, Native American, Asian, and Hispanic, numbered $$(j = 1, 2, 3, 4, 5)$$, respectively. We wish to test $$H_0: \mu_1 = \cdots = \mu_5$$ against $$H_A: \textrm{not } H_0$$ (at least one pair of means differ). A table of the ethnicities to compare is below.

summary(nesarc_sub$Ethnicity)
 Cauc  AfAm  NaAm  Asia  Hisp
24507  8245   701  1332  8308 

Plot the data in a way that compares the means. Error bars are 95% confidence intervals of the mean.

# Plot the data using ggplot
library(ggplot2)
p <- ggplot(nesarc_sub, aes(x = Ethnicity, y = TotalCigsSmoked))
p <- p + theme_bw()
# plot a reference line for the global mean (assuming no groups)
p <- p + geom_hline(yintercept = mean(nesarc_sub$TotalCigsSmoked, na.rm = TRUE), colour = "black", linetype = "dashed", size = 0.3, alpha = 0.5)
# violin and boxplot behind the points and CIs
p <- p + geom_violin(width = 0.5, alpha = 0.25)
p <- p + geom_boxplot(width = 0.25, alpha = 0.25)
# points for observed data
p <- p + geom_point(position = position_jitter(w = 0.05, h = 0), alpha = 0.2)
# diamond at mean for each group
p <- p + stat_summary(fun = mean, geom = "point", shape = 18, size = 4, colour = "red", alpha = 0.8)
# confidence limits based on normal distribution
p <- p + stat_summary(fun.data = "mean_cl_normal", geom = "errorbar", width = .2, colour = "red", alpha = 0.8)
p <- p + labs(title = "Total cigarettes smoked by Ethnicity")
p <- p + ylab("Total Cigs Smoked")
print(p)
Warning: Removed 297 rows containing non-finite values (stat_ydensity).
Warning: Removed 297 rows containing non-finite values (stat_boxplot).
Warning: Removed 297 rows containing non-finite values (stat_summary).
Removed 297 rows containing non-finite values (stat_summary).
Warning: Removed 297 rows containing missing values (geom_point).

### 6.14.2 (NOT USED) Transform the response variable to satisfy assumptions

### 6.14.3 ANOVA Hypothesis test

Hypothesis test

1. Set up the null and alternative hypotheses in words and notation.
• In words: “The population mean total cigarettes smoked is different between ethnicities.”
• In notation: $$H_0: \mu_1 = \cdots = \mu_5$$ versus $$H_A: \textrm{not } H_0$$ (at least one pair of means differ).
2. Let the significance level of the test be $$\alpha = 0.05$$.
3. Compute the test statistic.

aov_summary <-
  aov(
    TotalCigsSmoked ~ Ethnicity
  , data = nesarc_sub
  )
summary(aov_summary)
               Df    Sum Sq  Mean Sq F value Pr(>F)    
Ethnicity       4 2.106e+08 52648664   496.4 <2e-16 ***
Residuals   42791 4.538e+09   106050                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
297 observations deleted due to missingness

The $$F$$-statistic for the ANOVA is $$F = 496$$.

4. Compute the $$p$$-value from the test statistic.

The p-value for testing the null hypothesis is $$p < 2.2\times 10^{-16}$$.

5. State the conclusion in terms of the problem.

Because $$p < 0.05$$, we reject $$H_0$$ in favor of $$H_A$$, concluding that the mean total cigarettes smoked is different between ethnicities.

### 6.14.4 Check assumptions

6. Check assumptions of the test.
1. Residuals are normal.
2. Populations have equal variances.

• Check whether residuals are normal.
• Plot the residuals and assess whether they appear normal.

# Plot the data using ggplot
df_res <- data.frame(res = aov_summary$residuals)
library(ggplot2)
p <- ggplot(df_res, aes(x = res))
p <- p + theme_bw()
p <- p + geom_histogram(aes(y = ..density..), binwidth = 50)
p <- p + geom_density(colour = "blue", adjust = 5)
#p <- p + geom_rug()
p <- p + stat_function(fun = dnorm, colour = "red", args = list(mean = mean(df_res$res), sd = sd(df_res$res)))
p <- p + labs(title = "ANOVA Residuals"
, caption = "Blue = Kernel density curve\nRed = Normal distribution")
print(p)

The plot of the residuals is very right skewed (not normal).

# QQ plot
par(mfrow=c(1,1))
#library(car)
car::qqPlot(
    aov_summary$residuals
  , las = 1
  , id = list(n = 4, cex = 1)
  , lwd = 1
  , main = "QQ Plot"
  )
38173   279  1034  2883 37909   276  1026  2863 

The QQ-plot of the residuals versus the normal quantiles curves strongly away from the line (right skewed, not normal).

• A formal test of normality on the residuals tests the hypothesis $$H_0:$$ The distribution is Normal vs $$H_1:$$ The distribution is not Normal. We can test the distribution of the residuals.

#shapiro.test(aov_summary$residuals)
#library(nortest)
nortest::ad.test(aov_summary$residuals)  Anderson-Darling normality test data: aov_summary$residuals
A = 3399.5, p-value < 2.2e-16
nortest::cvm.test(aov_summary$residuals) Warning in nortest::cvm.test(aov_summary$residuals): p-value is smaller than
7.37e-10, cannot be computed more accurately

Cramer-von Mises normality test

data:  aov_summary$residuals
W = 637.55, p-value = 7.37e-10

The formal normality tests of the residuals reject $$H_0$$ in favor of $$H_A$$, concluding that the data are not normal.

• Check whether populations have equal variances.
• Look at the numerical summaries below.

# calculate summaries
dat_EthCig_summary <-
  nesarc_sub %>%
  group_by(Ethnicity) %>%
  summarize(
    m = mean(TotalCigsSmoked, na.rm = TRUE)
  , s = sd(TotalCigsSmoked, na.rm = TRUE)
  , n = n()
  , .groups = "drop_last"
  ) %>%
  ungroup()
dat_EthCig_summary
# A tibble: 5 × 4
  Ethnicity     m     s     n
  <fct>     <dbl> <dbl> <int>
1 Cauc      244.   376. 24507
2 AfAm      127.   250.  8245
3 NaAm      293.   420.   701
4 Asia       76.9  194.  1332
5 Hisp       89.7  226.  8308

The standard deviations appear different between groups; in particular, the Asian group has less than half the standard deviation of the Native American group.

• Formal tests for equal variances.

We can test whether the variances are equal between our five groups. This is similar to the ANOVA hypothesis, but instead of testing means we’re testing variances. $$H_0: \sigma^2_1 = \cdots = \sigma^2_5$$ versus $$H_A: \textrm{not } H_0$$ (at least one pair of variances differ).

## Test equal variance
# assumes populations are normal
bartlett.test(TotalCigsSmoked ~ Ethnicity, data = nesarc_sub)

        Bartlett test of homogeneity of variances

data:  TotalCigsSmoked by Ethnicity
Bartlett's K-squared = 4464.9, df = 4, p-value < 2.2e-16

# does not assume normality, requires car package
#library(car)
car::leveneTest(TotalCigsSmoked ~ Ethnicity, data = nesarc_sub)
Levene's Test for Homogeneity of Variance (center = median)
         Df F value    Pr(>F)    
group     4  498.24 < 2.2e-16 ***
      42791                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# nonparametric test
fligner.test(TotalCigsSmoked ~ Ethnicity, data = nesarc_sub)

        Fligner-Killeen test of homogeneity of variances

data:  TotalCigsSmoked by Ethnicity
Fligner-Killeen:med chi-squared = 2160.6, df = 4, p-value < 2.2e-16

Because the data were not normal, we interpret the Levene test. The p-value is less than 0.05, therefore we reject $$H_0$$ of equal variances in favor of $$H_A$$ that the variances are not equal.

### 6.14.5 Post Hoc pairwise comparison tests

EMM plot interpretation

This EMM plot (Estimated Marginal Means, aka Least-Squares Means) is only available when conditioning on one variable. The blue bars are confidence intervals for the EMMs; don’t ever use confidence intervals for EMMs to perform comparisons – they can be very misleading. The red arrows are for the comparisons among means; the degree to which the “comparison arrows” overlap reflects as much as possible the significance of the comparison of the two estimates. If an arrow from one mean overlaps an arrow from another group, the difference is not significant, based on the adjust setting (which defaults to “tukey”).
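With five groups, Tukey's HSD adjusts for all pairwise comparisons simultaneously. A small back-of-the-envelope check (illustrative Python, outside the R workflow) of how many comparisons that is, and what the cruder Bonferroni per-test level would be:

```python
import math

# With 5 ethnicity groups there are C(5, 2) = 10 pairwise comparisons.
k = 5
n_pairs = math.comb(k, 2)
assert n_pairs == 10

# A Bonferroni correction would test each pair at alpha / n_pairs;
# Tukey's HSD controls the family-wise error rate less conservatively.
alpha = 0.05
bonf_alpha = alpha / n_pairs
assert abs(bonf_alpha - 0.005) < 1e-12
```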
## Contrasts
adjust_method <- c("none", "tukey", "bonferroni")[2]

library(emmeans)
emm_cont <-
  emmeans::emmeans(
    aov_summary
  , specs = "Ethnicity"
  )
# means and CIs
emm_cont %>% print()
 Ethnicity emmean    SE    df lower.CL upper.CL
 Cauc       244.1  2.09 42791    240.0    248.1
 AfAm       127.2  3.60 42791    120.2    134.3
 NaAm       293.3 12.35 42791    269.0    317.5
 Asia        76.9  8.94 42791     59.4     94.5
 Hisp        89.7  3.58 42791     82.7     96.7

Confidence level used: 0.95 

# pairwise differences
emm_cont %>% pairs(adjust = adjust_method) %>% print()
 contrast    estimate    SE    df t.ratio p.value
 Cauc - AfAm    116.8  4.16 42791  28.055  <.0001
 Cauc - NaAm    -49.2 12.53 42791  -3.927  0.0008
 Cauc - Asia    167.1  9.18 42791  18.209  <.0001
 Cauc - Hisp    154.4  4.15 42791  37.235  <.0001
 AfAm - NaAm   -166.0 12.87 42791 -12.903  <.0001
 AfAm - Asia     50.3  9.64 42791   5.218  <.0001
 AfAm - Hisp     37.5  5.08 42791   7.387  <.0001
 NaAm - Asia    216.3 15.25 42791  14.188  <.0001
 NaAm - Hisp    203.6 12.86 42791  15.827  <.0001
 Asia - Hisp    -12.8  9.63 42791  -1.325  0.6758

P value adjustment: tukey method for comparing a family of 5 estimates 

# plot of means, CIs, and comparison arrows
plot(
    emm_cont
  , comparisons = TRUE
  , adjust = adjust_method
  , horizontal = FALSE
  , ylab = "Ethnicity"
  )

All pairs of Ethnicities differ except Asian and Hispanic.

• Cauc - AfAm: sig diff
• Cauc - NaAm: sig diff
• Cauc - Asia: sig diff
• Cauc - Hisp: sig diff
• AfAm - NaAm: sig diff
• AfAm - Asia: sig diff
• AfAm - Hisp: sig diff
• NaAm - Asia: sig diff
• NaAm - Hisp: sig diff
• Asia - Hisp: NOT sig diff

dat_EthCig_summary %>% arrange(m)
# A tibble: 5 × 4
  Ethnicity     m     s     n
  <fct>     <dbl> <dbl> <int>
1 Asia       76.9  194.  1332
2 Hisp       89.7  226.  8308
3 AfAm      127.   250.  8245
4 Cauc      244.   376. 24507
5 NaAm      293.   420.   701

Summarize results by ordering the means and grouping pairs that do not differ.

    Ethnicity     m     s     n
    <fct>     <dbl> <dbl> <int>
| 1 Asia       76.9  194.  1332
| 2 Hisp       89.7  226.  8308
  3 AfAm      127.   250.  8245
  4 Cauc      244.   376. 24507
  5 NaAm      293.   420.
  701

## 6.15 Class 18, Nonparametric methods (separate worksheet)

Note

Find this assignment on the course website.

## 6.16 Class 19, Binomial and Multinomial tests (separate worksheet)

Note

Find this assignment on the course website.

## 6.17 Class 20-1, Two-way categorical tables (separate worksheet)

Note

Find this assignment on the course website.

## 6.18 Class 20-2, Simple linear regression (separate worksheet)

Note

Find this assignment on the course website.

## 6.19 Class 21, Two-way categorical and simple linear regression

### 6.19.1 Two-way categorical analysis

1. Two-way categorical analysis.

Using two categorical variables with two to five levels each, specify a hypothesis test for homogeneity of proportions associated with your research questions.

#### 6.19.1.1 (1 p) Specify the hypotheses in words and notation.

• In words: “There is an association between NicotineDependence and Ethnicity.”
• In notation: $$H_0: p(i \textrm{ and } j) = p(i)p(j)$$ for all row categories $$i$$ and column categories $$j$$, versus $$H_A: p(i \textrm{ and } j) \ne p(i)p(j)$$ for at least one row category $$i$$ and column category $$j$$.

#### 6.19.1.2 (1 p) State the conclusion of the test in the context of the problem.

# Tabulate by two categorical variables:
tab_E_TD <-
  xtabs(
    ~ Ethnicity + NicotineDependence
  , data = nesarc_sub
  )
tab_E_TD
         NicotineDependence
Ethnicity Nicotine Dependence No Nicotine Dependence
     Cauc                3337                  21170
     AfAm                 836                   7409
     NaAm                 158                    543
     Asia                  95                   1237
     Hisp                 536                   7772

# row proportions
prop.table(
    tab_E_TD
  , margin = 1
  )
         NicotineDependence
Ethnicity Nicotine Dependence No Nicotine Dependence
     Cauc          0.13616518             0.86383482
     AfAm          0.10139478             0.89860522
     NaAm          0.22539230             0.77460770
     Asia          0.07132132             0.92867868
     Hisp          0.06451613             0.93548387

# Chi-sq test
chisq_E_TD <-
  chisq.test(
    tab_E_TD
  , correct = FALSE
  )
chisq_E_TD

        Pearson's Chi-squared test

data:  tab_E_TD
X-squared = 439.32, df = 4, p-value < 2.2e-16

The test statistic is $$X^2 = 439$$.
Because the p-value $$= 8.84\times 10^{-94} < 0.05$$, we reject the null hypothesis, concluding that there is an association between NicotineDependence and Ethnicity.

# table of expected frequencies:
chisq_E_TD$expected
         NicotineDependence
Ethnicity Nicotine Dependence No Nicotine Dependence
Cauc          2821.89066             21685.1093
AfAm           949.38134              7295.6187
NaAm            80.71756               620.2824
Asia           153.37489              1178.6251
Hisp           956.63556              7351.3644
# smallest expected frequency:
min(chisq_E_TD$expected)
[1] 80.71756

The model assumptions are met since the expected count for each cell is at least 5.

##### 6.19.1.2.1 Nicer version of contingency (chi-square) table for publication

## gtsummary tables
tab_cross <-
  nesarc_sub %>%
  gtsummary::tbl_cross(
    row = "Ethnicity"
  , col = "NicotineDependence"
  , label = NULL
  , statistic = NULL
  , digits = NULL
  , percent = c("none", "column", "row", "cell")[3]
  , margin = c("column", "row")
  , missing = c("ifany", "always", "no")[1]
  , missing_text = "Unknown"
  , margin_text = "Total"
  ) %>%
  gtsummary::add_p() %>%            # add p-values to the output comparing values across groups (tbl_cross options: test = c("chisq.test", "fisher.test")[2])
  gtsummary::bold_labels() %>%      # bold variable name labels
  gtsummary::italicize_levels() %>% # italicize levels
  gtsummary::modify_caption("Nicotine dependence by Ethnicity")
tab_cross

Nicotine dependence by Ethnicity

           NicotineDependence                           Total           p-value1
           Nicotine Dependence   No Nicotine Dependence
Ethnicity                                                               <0.001
  Cauc     3,337 (14%)           21,170 (86%)           24,507 (100%)
  AfAm       836 (10%)            7,409 (90%)            8,245 (100%)
  NaAm       158 (23%)              543 (77%)              701 (100%)
  Asia        95 (7.1%)           1,237 (93%)            1,332 (100%)
  Hisp       536 (6.5%)           7,772 (94%)            8,308 (100%)
Total      4,962 (12%)           38,131 (88%)           43,093 (100%)

1 Pearson's Chi-squared test

#### 6.19.1.3 (1 p) Plot a mosaic plot of the data and Pearson residuals.

# The Pearson residuals
chisq_E_TD$residuals
         NicotineDependence
Ethnicity Nicotine Dependence No Nicotine Dependence
Cauc            9.696820              -3.497990
AfAm           -3.679775               1.327427
NaAm            8.601947              -3.103031
Asia           -4.713559               1.700350
Hisp          -13.599806               4.905937
# The sum of the squared residuals is the chi-squared statistic:
#chisq_E_TD$residuals^2 #sum(chisq_E_TD$residuals^2)
# mosaic plot
library(vcd)
# this layout gives us the interpretation we want:
mosaic(
  tab_E_TD
, legend          = TRUE
, abbreviate_labs = 4
, rot_labels      = c(0, 90, 0, 0)
, just_labels     = c("center", "right")  # order is opposite of the first argument
, offset_varnames = c(0, 0, 0, 2)         # offset the y-axis var name
#, margins        = c(3, 3, 3, 14)        # order is: top, right, bottom, left
, main            = "Fraction of smokers with and without\n nicotine dependence by Ethnicity"
, labeling_args   = list(set_varnames =
                      c(NicotineDependence = "Nicotine Dependence"
                      , Ethnicity          = "Ethnicity")
                    )
)

#### 6.19.1.4 (1 p) Interpret the mosaic plot with reference to the Pearson residuals.

The primary cause of rejecting $$H_0$$ is that Hispanics and Asians have less nicotine dependence than expected while Caucasians and (especially) Native Americans have higher nicotine dependence than expected.
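
Each Pearson residual can be checked by hand: it is $$(O - E)/\sqrt{E}$$, where the expected count is $$E = (\text{row total} \times \text{column total}) / \text{grand total}$$. A minimal sketch using the Hisp / Nicotine Dependence cell counts from the contingency table above:

```r
# Hand-computation of one Pearson residual, (O - E) / sqrt(E),
# using the Hisp / Nicotine Dependence cell from the table above
O <- 536                      # observed count
E <- 8308 * 4962 / 43093      # expected = row total * column total / grand total
pearson_resid <- (O - E) / sqrt(E)
round(pearson_resid, 2)       # matches the -13.60 reported by chisq_E_TD$residuals
```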

### 6.19.2 Simple linear regression

1. Simple linear regression.

Select two numerical variables.

(See below.)

#### 6.19.2.2 (0 p) Fit the simple linear regression model.

# fit model
#lm_fit <- lm(TotalCigsSmoked_sqrt ~ DaysSmoke, data = nesarc_sub)
lm_fit <- lm(TotalCigsSmoked_log2 ~ DaysSmoke_log2, data = nesarc_sub)
coef(lm_fit)
   (Intercept) DaysSmoke_log2 
     0.4585303      1.6635414 

Note that prediction intervals should be bounded below at 0.

library(ggplot2)
p <- ggplot(nesarc_sub, aes(x = DaysSmoke_log2, y = TotalCigsSmoked_log2))
p <- p + theme_bw()
p <- p + geom_vline(xintercept = 0, alpha = 0.25)
p <- p + geom_hline(yintercept = 0, alpha = 0.25)
# prediction bands
p <- p + geom_ribbon(
    aes(
      ymin = predict(lm_fit, data.frame(DaysSmoke_log2), interval = "prediction", level = 0.95)[, 2]
    , ymax = predict(lm_fit, data.frame(DaysSmoke_log2), interval = "prediction", level = 0.95)[, 3]
    )
  , alpha = 0.1, fill = "darkgreen"
  )
# linear regression fit and confidence bands
p <- p + geom_smooth(method = lm, se = TRUE)
# jitter a little to uncover duplicate points
p <- p + geom_jitter(position = position_jitter(0.05), alpha = 0.01)
p <- p + labs(title = "Regression with confidence and prediction bands"
, x = "log2 of Days smoke"
, y = "log2 of Total cigarettes smoked"
, caption = "Blue line is fitted regression line.\nGray band is the confidence band.\nGreen band is the prediction band.")
print(p)
geom_smooth() using formula 'y ~ x'
Warning: Removed 25377 rows containing non-finite values (stat_smooth).
Warning: Removed 25377 rows containing missing values (geom_point).

#### 6.19.2.3 (1 p) Assess the residuals for lack of fit (interpret plots of residuals vs fitted and $$x$$-value).

e_plot_lm_diagostics(
lm_fit
, sw_plot_set = "simple"
)

• Residuals vs x: the values within each group (at each x-value) are roughly symmetric and the y = 0 line passes through the center of most groups. The x = 0 group has many observations at its lowest value, which is why the fitted line at x = 0 sits below the rest of the distribution spread above it.

#### 6.19.2.4 (1 p) Assess the residuals for normality (interpret QQ-plot and histogram).

• QQ-Plot: On original scale, the pattern is non-normal with right-skewness in y-variable. After log2 transformation of both variables, we still do not have normality (but it is much better).

#### 6.19.2.5 (1 p) Assess the relative influence of points.

• Cook’s distance: several points with slightly higher influence, but none with extremely large influence.
• Cook’s distance vs Leverage: Those points with high influence also have high leverage, thus, those points that are extremes in the x-direction (least frequent or most frequent smokers) have the highest influence.

#### 6.19.2.6 (1 p) Test whether the slope is different from zero, $$H_A: \beta_1 \ne 0$$.

Though the model assumptions are not perfectly met, we will continue to interpret the model that we have.

summary(lm_fit)

Call:
lm(formula = TotalCigsSmoked_log2 ~ DaysSmoke_log2, data = nesarc_sub)

Residuals:
Min      1Q  Median      3Q     Max
-3.7145 -0.4585  0.1924  0.6075  5.5859

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)     0.45853    0.03447    13.3   <2e-16 ***
DaysSmoke_log2  1.66354    0.00746   223.0   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.175 on 17714 degrees of freedom
(25377 observations deleted due to missingness)
Multiple R-squared:  0.7373,    Adjusted R-squared:  0.7373
F-statistic: 4.973e+04 on 1 and 17714 DF,  p-value: < 2.2e-16

Because the p-value $$< 2.2\times 10^{-16} < 0.05$$, we reject $$H_0$$ in favor of $$H_A$$ concluding that the slope is different from 0.

Interpretation (extra): For every unit increase in log2 of DaysSmoke, we expect an increase of 1.66 in the log2 of TotalCigsSmoked. On the original scale, this is a large increase: doubling the days smoked multiplies the expected total cigarettes smoked by $$2^{1.66} \approx 3.2$$.
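
Since both variables are log2-transformed, the slope has a doubling interpretation; a quick check of that arithmetic using the fitted slope above:

```r
# Back-transform the log2-log2 slope: a doubling of DaysSmoke
# multiplies the expected TotalCigsSmoked by 2^slope
b1 <- 1.6635414
mult_per_doubling <- 2^b1
round(mult_per_doubling, 2)   # about 3.17: doubling days smoked roughly triples total cigarettes
```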

#### 6.19.2.7 (1 p) Interpret the $$R^2$$ value.

The regression model explains a proportion $$R^2 = 0.737$$ of the variation in the response beyond that explained by the mean value alone.

##### 6.19.2.7.1 Nicer version of linear model parameter estimate table for publication
tab_fit_lm <-
lm_fit %>%
gtsummary::tbl_regression(
#  label           = NULL
#, exponentiate    = FALSE
#, include         = everything()
#, show_single_row = NULL
conf.level      = 0.95
, intercept       = TRUE
#, estimate_fun    = NULL
#, pvalue_fun      = NULL
#, tidy_fun        = NULL
, conf.int        = TRUE
) %>%
#gtsummary::add_q()                  %>%  # add a column of q values to control for multiple comparisons
gtsummary::bold_p(t = 0.05, q = FALSE) %>% # Bold significant p-values or q-values
gtsummary::add_glance_source_note() %>%  # adds statistics from broom::glance() as source note
#gtsummary::add_vif(statistic = c("VIF", "GVIF", "aGVIF", "df")[c(2, 4)])  %>%  # adds column of the variance inflation factors (VIF)
gtsummary::bold_labels()    %>%  # bold variable name labels
gtsummary::italicize_levels() %>% # italicize levels
gtsummary::modify_caption("Table title")

tab_fit_lm
Table title
Characteristic Beta 95% CI1 p-value
(Intercept) 0.46 0.39, 0.53 <0.001
DaysSmoke_log2 1.7 1.6, 1.7 <0.001
R² = 0.737; Adjusted R² = 0.737; Sigma = 1.18; Statistic = 49,726; p-value = <0.001; df = 1; Log-likelihood = -27,995; AIC = 55,995; BIC = 56,018; Deviance = 24,458; Residual df = 17,714; No. Obs. = 17,716
1 CI = Confidence Interval

## 6.20 Class 22, Logistic regression (separate worksheet)

Note

Find this assignment on the course website.

## 6.21 Class 23, Logistic regression

1. Logistic regression.

See below.

### 6.21.3 (1 p) Summarize the $$\hat{p}$$ values for each value of the $$x$$-variable. Also, calculate the empirical logits.

Summarize observed probability of success for each unique value of x.

dat_Nic_Cigs_sum <-
nesarc_sub %>%
group_by(
TotalCigsSmoked_smokers_log2
) %>%
summarize(
Success = sum(NicotineDependence01)
, Total   = n()
# estimated proportion of successes for each unique value of x
, p_hat   = Success / Total
, .groups = "drop_last"
) %>%
ungroup() %>%
na.omit()

# empirical logits
dat_Nic_Cigs_sum <-
dat_Nic_Cigs_sum %>%
mutate(
emp_logit = log((p_hat + 0.5/Total) / (1 - p_hat + 0.5/Total))
)

dat_Nic_Cigs_sum
# A tibble: 103 × 5
TotalCigsSmoked_smokers_log2 Success Total   p_hat emp_logit
<dbl>   <dbl> <int>   <dbl>     <dbl>
1                         0         17   356 0.0478      -2.97
2                         1          4   170 0.0235      -3.61
3                         1.32       1   110 0.00909     -4.29
4                         1.58       3    70 0.0429      -2.96
5                         2          1    23 0.0435      -2.71
6                         2.32      12   144 0.0833      -2.36
7                         2.58       8   172 0.0465      -2.96
8                         2.81       0     3 0           -1.95
9                         2.91       6    64 0.0938      -2.20
10                         3          0     2 0           -1.61
# … with 93 more rows
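
The $$0.5/\text{Total}$$ terms in the empirical logit are a continuity correction that keeps the logit finite when $$\hat{p}$$ is exactly 0 or 1, where the raw logit $$\log(\hat{p}/(1-\hat{p}))$$ would be $$\pm\infty$$. For example, row 8 above has 0 successes in 3 trials:

```r
# Empirical logit with the 0.5/Total continuity correction,
# for a cell with 0 successes out of 3 trials (row 8 above)
Success <- 0
Total   <- 3
p_hat   <- Success / Total
emp_logit <- log((p_hat + 0.5 / Total) / (1 - p_hat + 0.5 / Total))
round(emp_logit, 2)   # -1.95, finite; the raw logit log(0) would be -Inf
```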

### 6.21.4 (1 p) Plot the $$\hat{p}$$ values vs the $$x$$-variable and plot the empirical logits vs the $$x$$-variable.

#### 6.21.4.1 Probability/proportion scale

Plots on the probability scale should follow a sigmoidal curve (a little difficult to see here). Note that the overlayed reference curve (red) is a weighted smoothed curve (loess), not the model fit.

library(ggplot2)
p1 <- ggplot(nesarc_sub, aes(x = TotalCigsSmoked_smokers_log2, y = NicotineDependence01))
p1 <- p1 + theme_bw()
# data
p1 <- p1 + geom_jitter(width = 0.05, height = 0.025, size = 0.5, alpha = 1/10, colour = "green")
# summaries
p1 <- p1 + geom_point(data = dat_Nic_Cigs_sum, aes(x = TotalCigsSmoked_smokers_log2, y = p_hat, size = Total))
p1 <- p1 + geom_smooth(data = dat_Nic_Cigs_sum, aes(x = TotalCigsSmoked_smokers_log2, y = p_hat, weight = Total), se = FALSE, colour = "red")  # just for reference
# axes
p1 <- p1 + scale_y_continuous(breaks = seq(0, 1, by = 0.2))
p1 <- p1 + expand_limits(y = c(0, 1))
#p1 <- p1 + scale_x_continuous(breaks = seq(0, 100, by = 2))
# labels
p1 <- p1 + labs(
title   = "Proportion of nicotine dependence by log2 Total cigarettes smoked"
, subtitle= "Observed data, smokers only"
, x       = "log2 Total cigarettes smoked"
, y       = "Nicotine dependence (0/1)\nObserved probability of nicotine dependence"
, caption = paste(
"Green = Indicator points of nicotine dependence (1) or not (0)."
, "Black = Observed proportions of nicotine dependence"
, "Red = Smoothed curve to proportions"
, sep = "\n" # separate by new lines
)
)
print(p1)
geom_smooth() using method = 'loess' and formula 'y ~ x'
Warning: Removed 25377 rows containing missing values (geom_point).

#### 6.21.4.2 Logit scale

On the logit scale, if points follow a straight line, then we can fit a simple logistic regression model. Note that the overlayed reference straight line (blue) is from weighted least squares (not the official fitted line), and the curve (red) is a weighted smoothed curve (loess). If these two lines are similar, that suggests a straight line on the logit scale is a good fit. Both lines are weighted by the total number of observations that each point represents, so that points representing few observations don’t contribute as much as points representing many observations, thus our decision should not be heavily influenced by random deviations where there is little data.

# plot on logit scale
library(ggplot2)
p1 <- ggplot(dat_Nic_Cigs_sum, aes(x = TotalCigsSmoked_smokers_log2, y = emp_logit))
p1 <- p1 + theme_bw()

# summaries
p1 <- p1 + geom_point(aes(size = Total))
p1 <- p1 + stat_smooth(aes(weight = Total), method = "lm", se = FALSE, linetype = 3)  # just for reference
p1 <- p1 + geom_smooth(aes(weight = Total), se = FALSE, colour = "red")  # just for reference

# axes
#p1 <- p1 + scale_y_continuous(breaks = seq(-10, 10, by = 0.5))
#p1 <- p1 + scale_x_continuous(breaks = seq(0, 100, by = 2))
# labels
p1 <- p1 + labs(
title   = "Logit of nicotine dependence by log2 Total cigarettes smoked"
, subtitle= "Observed data, smokers only"
, x       = "log2 Total cigarettes smoked"
, y       = "Empirical logit of the probability of nicotine dependence"
, caption = paste(
"Black = Observed logit proportions of nicotine dependence"
, "Blue = Naive LM fit of logit proportions"
, "Red = Loess smooth curve of empirical logits"
, sep = "\n" # separate by new lines
)
)
print(p1)
geom_smooth() using formula 'y ~ x'
geom_smooth() using method = 'loess' and formula 'y ~ x'

### 6.21.5 (1 p) Describe the logit-vs-$$x$$ plot. Is it linear? If not, consider a transformation of $$x$$ to improve linearity; describe the transformation you chose if you needed one.

• Log2 of Total cigarettes smoked improved the linear relationship. A straight line describes the data well from 0 to 10, but does not fit the data when x is greater than 10.

### 6.21.6 (1 p) Fit the glm() model and assess the deviance lack-of-fit test.

The simple logistic regression model expresses the population proportion $$p$$ of individuals with a given attribute (called the probability of success) as a function of a single predictor variable $$X$$. The model assumes that $$p$$ is related to $$X$$ through $\log \left( \frac{p}{1-p} \right) = \beta_0 + \beta_1 X$ or, equivalently, as $p = \frac{ \exp( \beta_0 + \beta_1 X ) }{ 1 + \exp( \beta_0 + \beta_1 X ) }.$ The logistic regression model is a binary response model, where the response for each case falls into one of two exclusive and exhaustive categories, success (cases with the attribute of interest) and failure (cases without the attribute of interest).
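
In R, these two forms of the model are connected by the built-in logit and inverse-logit functions: qlogis() computes $$\log(p/(1-p))$$ and plogis() computes $$\exp(\eta)/(1+\exp(\eta))$$, the same link pair used by glm() with family = binomial. A quick sketch confirming they invert each other:

```r
# qlogis() is the logit; plogis() is its inverse (the logistic function)
p   <- 0.3
eta <- qlogis(p)    # log(p / (1 - p))
p2  <- plogis(eta)  # exp(eta) / (1 + exp(eta))
all.equal(p, p2)    # TRUE: the two model forms are equivalent
```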

Fit the model.

# For our summarized data (with frequencies and totals for each age)
# The left-hand side of our formula binds two columns together with cbind():
#   the columns are the number of "successes" and "failures".
# For logistic regression with logit link we specify family = binomial,
#   where logit is the default link function for the binomial family.

glm_n_c <-
glm(
cbind(Success, Total - Success) ~ TotalCigsSmoked_smokers_log2
, family = binomial
, data   = dat_Nic_Cigs_sum
)

#### 6.21.6.1 Deviance statistic for lack-of-fit

Unfortunately, there aren’t many model diagnostics for logistic regression. Visual inspection of the empirical logits is good when you can summarize the data as we’ve done.

One simple test for lack-of-fit uses the deviance statistic. Under the null hypothesis (that you’ll state below), the residual deviance follows a $$\chi^2$$ distribution with the associated degrees-of-freedom.

Below, we calculate the deviance p-value and perform the test for lack-of-fit.

# Test residual deviance for lack-of-fit (if > 0.10, little-to-no lack-of-fit)
glm_n_c$deviance
[1] 310.0391
glm_n_c$df.residual
[1] 101
dev_p_val <- 1 - pchisq(glm_n_c$deviance, glm_n_c$df.residual)
dev_p_val
[1] 0
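
The printed p-value of 0 is numerical underflow in the subtraction 1 - pchisq(); the upper-tail probability is tiny but positive. Computing the tail directly with lower.tail = FALSE (a sketch using the deviance and df reported above) avoids the underflow:

```r
# Upper-tail chi-square probability computed directly,
# avoiding the underflow in 1 - pchisq()
dev_p_val <- pchisq(310.0391, df = 101, lower.tail = FALSE)
dev_p_val > 0   # TRUE: positive, though far below 0.05
```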

State the null and alternative hypotheses for lack-of-fit.

• $$H_0:$$ Model fits the data.

• $$H_A:$$ Model does not fit the data.

For your preferred model, the deviance statistic is

• D = 310 with

• 101 df, giving

• p-value $$\approx 0$$ (printed as 0 due to floating-point underflow).

• Because the p-value is less than 0.05, we reject $$H_0$$ in favor of $$H_A$$ concluding that the model does not fit the data. However, for this assignment, we will continue with interpretation.

### 6.21.7 (1 p) Calculate the confidence bands around the model fit/predictions. Plot on both the logit and $$\hat{p}$$ scales.

The glm() statement creates an object which we can use to create the fitted probabilities and 95% CIs for the population proportions at the observed values of log2 total cigarettes smoked. The fitted probabilities and the limits are stored in columns labeled fit_p, fit_p_lower, and fit_p_upper, respectively.

Below I create confidence bands for the model and plot the fit against the data, first on the logit scale to assess model fit and then on the probability scale for interpretation.

# predict() uses all the Load values in dataset, including appended values
fit_logit_pred <-
predict(
glm_n_c
#, newdata = data.frame(TotalCigsSmoked_smokers_log2 = dat_Nic_Cigs_sum$TotalCigsSmoked_smokers_log2)
, type   = "link"
, se.fit = TRUE
) %>%
as_tibble()

# put the fitted values in the data.frame
dat_Nic_Cigs_sum <-
dat_Nic_Cigs_sum %>%
mutate(
# logit scale values
fit_logit         = fit_logit_pred$fit
, fit_logit_se    = fit_logit_pred\$se.fit
, fit_logit_lower = fit_logit - 1.96 * fit_logit_se
, fit_logit_upper = fit_logit + 1.96 * fit_logit_se
# proportion scale values
, fit_p           = exp(fit_logit) / (1 + exp(fit_logit))
, fit_p_lower     = exp(fit_logit_lower) / (1 + exp(fit_logit_lower))
, fit_p_upper     = exp(fit_logit_upper) / (1 + exp(fit_logit_upper))
)

#### 6.21.7.1 Logit scale

# plot on logit scale
library(ggplot2)
p1 <- ggplot(dat_Nic_Cigs_sum, aes(x = TotalCigsSmoked_smokers_log2, y = emp_logit))
p1 <- p1 + theme_bw()

# MODEL FIT
# predicted curve and point-wise 95% CI
p1 <- p1 + geom_ribbon(aes(x = TotalCigsSmoked_smokers_log2, ymin = fit_logit_lower, ymax = fit_logit_upper), fill = "blue", linetype = 1, alpha = 0.2)
p1 <- p1 + geom_line(aes(x = TotalCigsSmoked_smokers_log2, y = fit_logit), colour = "blue", size = 1)
# fitted values
p1 <- p1 + geom_point(aes(y = fit_logit), colour = "blue", size = 2)

# summaries
p1 <- p1 + geom_point(aes(size = Total))

# axes
#p1 <- p1 + scale_y_continuous(breaks = seq(-10, 10, by = 0.5))
#p1 <- p1 + scale_x_continuous(breaks = seq(0, 100, by = 2))
# labels
p1 <- p1 + labs(
title   = "Logit of nicotine dependence by log2 Total cigarettes smoked"
, subtitle= "Logistic model fit"
, x       = "log2 Total cigarettes smoked"
, y       = "Logit scale of the probability of nicotine dependence"
, caption = paste(
"Black = Observed logit proportions of nicotine dependence given log2 Total cigarettes smoked"
, "Blue = Logistic model fitted logit proportions"
, sep = "\n" # separate by new lines
)
)
print(p1)

#### 6.21.7.2 Probability/proportion scale

# plot on probability scale
library(ggplot2)
p1 <- ggplot(nesarc_sub, aes(x = TotalCigsSmoked_smokers_log2, y = NicotineDependence01))
p1 <- p1 + theme_bw()
# data
p1 <- p1 + geom_jitter(width = 0.05, height = 0.025, size = 0.5, alpha = 1/10, colour = "green")
# summaries
p1 <- p1 + geom_point(data = dat_Nic_Cigs_sum, aes(x = TotalCigsSmoked_smokers_log2, y = p_hat, size = Total))

# MODEL FIT
# predicted curve and point-wise 95% CI
p1 <- p1 + geom_ribbon(data = dat_Nic_Cigs_sum, aes(x = TotalCigsSmoked_smokers_log2, y = fit_p, ymin = fit_p_lower, ymax = fit_p_upper), fill = "blue", linetype = 1, alpha = 0.2)
p1 <- p1 + geom_line(data = dat_Nic_Cigs_sum, aes(x = TotalCigsSmoked_smokers_log2, y = fit_p), colour = "blue", size = 1)
# fitted values
p1 <- p1 + geom_point(data = dat_Nic_Cigs_sum, aes(y = fit_p), colour = "blue", size = 2)

# axes
p1 <- p1 + scale_y_continuous(breaks = seq(0, 1, by = 0.2))
p1 <- p1 + expand_limits(y = c(0, 1))
#p1 <- p1 + scale_x_continuous(breaks = seq(0, 100, by = 2))
# labels
p1 <- p1 + labs(
title   = "Proportion of nicotine dependence by log2 Total cigarettes smoked"
, subtitle= "Logistic model fit"
, x       = "log2 Total cigarettes smoked"
, y       = "Nicotine dependence (0/1)\nObserved probability of nicotine dependence"
, caption = paste(
"Green = Indicator points of nicotine dependence (1) or not (0)."
, "Black = Observed proportions of nicotine dependence given log2 Total cigarettes smoked"
, "Blue = Logistic model fitted proportions"
, sep = "\n" # separate by new lines
)
)
print(p1)
Warning: Removed 25377 rows containing missing values (geom_point).

### 6.21.8 (1 p) Interpret the sign ($$+$$ or $$-$$) of the slope parameter and test whether the slope is different from zero, $$H_A: \beta_1 \ne 0$$.

The summary table gives MLEs and standard errors for the regression parameters. The z-value column is the parameter estimate divided by its standard error. The p-values are used to test whether the corresponding parameters of the logistic model are zero.

summary(glm_n_c)

Call:
glm(formula = cbind(Success, Total - Success) ~ TotalCigsSmoked_smokers_log2,
family = binomial, data = dat_Nic_Cigs_sum)

Deviance Residuals:
Min       1Q   Median       3Q      Max
-6.9776  -1.1004  -0.6275   0.4655   3.8816

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)                  -2.94584    0.08730  -33.74   <2e-16 ***
TotalCigsSmoked_smokers_log2  0.23882    0.01014   23.55   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 1020.01  on 102  degrees of freedom
Residual deviance:  310.04  on 101  degrees of freedom
AIC: 607.54

Number of Fisher Scoring iterations: 4
# see names(summary(glm_n_c)) to find the object that has the coefficients.
# can also use coef(glm_n_c)

Interpret the sign ($$+$$ or $$-$$) of the slope.

• Because our slope is positive, as (x) log2 Total cigarettes smoked increases, the probability of “success” (y) of nicotine dependence increases.

Hypothesis test

• $$H_0: \beta_1 = 0$$
• $$H_A: \beta_1 \ne 0$$
• The p-value = $$1.41\times 10^{-122} < 0.05$$, therefore we reject $$H_0$$ in favor of $$H_A$$ concluding that the slope is different from 0.
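
Because the predictor is on the log2 scale, exponentiating the logit-scale slope gives the multiplicative change in the odds of nicotine dependence for each doubling of total cigarettes smoked. A quick check using the estimate from summary(glm_n_c) above:

```r
# Exponentiate the logit-scale slope to get the odds ratio
# per doubling of total cigarettes smoked (x is on the log2 scale)
b1 <- 0.23882
odds_ratio <- exp(b1)
round(odds_ratio, 2)   # about 1.27: each doubling raises the odds by roughly 27%
```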
##### 6.21.8.0.1 Nicer version of logistic linear model parameter estimate table for publication
# Logit scale uses exponentiate = FALSE
tab_fit_glm <-
glm_n_c %>%
gtsummary::tbl_regression(
#  label           = NULL
exponentiate    = FALSE              ## do not exponentiate gives logit scale
#, include         = everything()
#, show_single_row = NULL
, conf.level      = 0.95
, intercept       = TRUE
#, estimate_fun    = NULL
#, pvalue_fun      = NULL
#, tidy_fun        = NULL
, conf.int        = TRUE
) %>%
#gtsummary::add_q()                  %>%  # add a column of q values to control for multiple comparisons
gtsummary::bold_p(t = 0.05, q = FALSE) %>% # Bold significant p-values or q-values
gtsummary::add_glance_source_note() %>%  # adds statistics from broom::glance() as source note
#gtsummary::add_vif(statistic = c("VIF", "GVIF", "aGVIF", "df")[c(2, 4)])  %>%  # adds column of the variance inflation factors (VIF)
gtsummary::bold_labels()    %>%  # bold variable name labels
gtsummary::italicize_levels() %>% # italicize levels
gtsummary::modify_caption("Logit scale results")

tab_fit_glm
Logit scale results
Characteristic log(OR)1 95% CI1 p-value
(Intercept) -2.9 -3.1, -2.8 <0.001
TotalCigsSmoked_smokers_log2 0.24 0.22, 0.26 <0.001
Null deviance = 1,020; Null df = 102; Log-likelihood = -302; AIC = 608; BIC = 613; Deviance = 310; Residual df = 101; No. Obs. = 103
1 OR = Odds Ratio, CI = Confidence Interval
# Odds ratio uses exponentiate = TRUE
tab_fit_glm <-
glm_n_c %>%
gtsummary::tbl_regression(
#  label           = NULL
exponentiate    = TRUE               ## exponentiate gives Odds Ratios (ORs)
#, include         = everything()
#, show_single_row = NULL
, conf.level      = 0.95
, intercept       = TRUE
#, estimate_fun    = NULL
#, pvalue_fun      = NULL
#, tidy_fun        = NULL
, conf.int        = TRUE
) %>%
#gtsummary::add_q()                  %>%  # add a column of q values to control for multiple comparisons
gtsummary::bold_p(t = 0.05, q = FALSE) %>% # Bold significant p-values or q-values
gtsummary::add_glance_source_note() %>%  # adds statistics from broom::glance() as source note
#gtsummary::add_vif(statistic = c("VIF", "GVIF", "aGVIF", "df")[c(2, 4)])  %>%  # adds column of the variance inflation factors (VIF)
gtsummary::bold_labels()    %>%  # bold variable name labels
gtsummary::italicize_levels() %>% # italicize levels
gtsummary::modify_caption("Odds ratio results")

tab_fit_glm
Odds ratio results
Characteristic OR1 95% CI1 p-value
(Intercept) 0.05 0.04, 0.06 <0.001
TotalCigsSmoked_smokers_log2 1.27 1.24, 1.30 <0.001
Null deviance = 1,020; Null df = 102; Log-likelihood = -302; AIC = 608; BIC = 613; Deviance = 310; Residual df = 101; No. Obs. = 103
1 OR = Odds Ratio, CI = Confidence Interval

# 7 Poster

## 7.1 Classes 24, 25, and 26: Poster Preparation

### 7.1.1 Class 24, Poster Preparation, research questions, data sources, analyses

See items under Class 26.

From the list in Class 26, complete Items 2, 3, 4, and 5.

### 7.1.2 Class 25, Poster Preparation, literature review, references, discussion, future work

See items under Class 26.

From the list in Class 26, complete Items 1, 6, 7, and 8.

#### 7.1.2.1 Class Literature review (detailed example)

A slightly more detailed literature review would typically be completed for a research project. Here I show what that might look like. However, in our class I’m not expecting so much detail. See the bullet points that I have used in the “Class 26 poster preparation” section.

Dataset: National Epidemiologic Survey on Alcohol and Related Conditions (NESARC), with codebook NESARC_W1_CodeBook.pdf.

Research question:

1. Is nicotine dependence [S3AQ10D] associated with smoking frequency [S3AQ3B1] and quantity [S3AQ3C1]?
• Google scholar search: “nicotine dependence smoking frequency”
• Citation: pdf file available for Dierker et al. (2007)
• Interesting points: Figures 2 and 3, quantity and frequency both positively related to probability of dependence.
• Others: Kandel and Chen (2000) and Caraballo, Novak, and Asman (2009)

You don’t need to include images in your literature review. I’m providing these figures and tables to illustrate what they look like.

1. Is nicotine dependence [S3AQ10D] associated with depression [S4AQ1]?
• Google scholar search: “nicotine dependence depression”
• Citation: pdf file available for Naomi Breslau (1995)
• Interesting points: Table 2, Smoking and Nicotine Dependence both associated with Education. Table 3, Major depression associated with being nicotine dependent and Sex.
• Others: N. Breslau et al. (1998)

1. Is the association between nicotine dependence [S3AQ10D] and depression [S4AQ1] different by demographics, such as Education or Sex?
• In Question 2 we see differences by Education and Sex.

I have decided to further focus my question by examining whether the association between nicotine dependence and depression differs based on how much a person smokes. I am wondering if at low levels of smoking compared to high levels, nicotine dependence is more common among individuals with major depression than those without major depression. I add relevant depression questions/items/variables to my personal codebook as well as several demographic measures (age, gender, ethnicity, education, etc.) and any other variables I may wish to consider.

All required variables have been found and added to my personal codebook (by expanding Class 03).

Citations are at the bottom of this document in the “References” section.

### 7.1.3 Class 26, Poster Preparation, complete content

Rubric

Organize the content of your poster.

Complete the content for each of these sections:

Title: Smoking behavior is (barely) associated with major depression in young adults

1. (Class 25) (3 p) Introduction

• (Lit Review) 2-4 bullets describing the study, previous research.
• Smoking behavior is associated with major depression.

• Longitudinal investigations have shown that depression increases risk of later smoking.

• Many subsequent studies find a role of major depression in increasing the probability and amount of smoking.

• Smoking behavior and Nicotine dependence.

• While smoking is a necessary requirement for nicotine dependence, smoking behavior is a poor predictor of developing nicotine dependence.

• Heavy smokers may not have nicotine dependence; conversely, light smokers may have nicotine dependence.

1. (Class 24) (2 p) Research Questions

• (Class 02, Personal Codebook) 2 bullets, one for each research question, stated as the alternative hypothesis.

The goals of the analysis include answering these two questions:

• Is there an association between major depression and nicotine dependence?

• Does the prevalence of nicotine dependence differ by ethnicity?

1. (Class 24) (2 p) Methods
• (Class 02, Personal Codebook) Data source(s).

A sample from the first wave of the National Epidemiologic Survey on Alcohol and Related Conditions (NESARC) (US adults). We focus on young adult (18–25) smokers (43093 of 43093 respondents).

In the Introduction to the National Epidemiologic Survey on Alcohol and Related Conditions they describe the sampling design. Adult respondents were sampled from a subset of households included in the US Census 2000 Supplementary Survey. To improve information about smaller groups, Blacks and Hispanics were oversampled by about 50%, as were young adults ages 18–24 by 125%. Using the US Census as a sampling frame helps to provide a representative sample of the country. Oversampling smaller subsets of the population means that comparisons involving those subsets will be more reliable.

• (Class 02, Personal Codebook) Variables used.

• Lifetime major depression (Depression, 0=No, 1=Yes).

• Total cigarettes per month (TotalCigsSmoked = days smoked times daily cigs smoked, range: 0–NA), a combination of these questions: “About how often did you usually smoke in the past year?” and “On the days that you smoked in the last year, about how many cigarettes did you usually smoke?”

• Nicotine dependence (NicotineDependence, 0=No, 1=Yes) in the last 12 months.

• (Various) Statistical methods used to answer the research questions.

• A two-sample $$t$$-test comparing the mean total cigarettes smoked by depression status.

• A $$\chi^2$$ analysis of a two-way contingency table of nicotine dependence by ethnicity.

1. (Class 24) (3 p) Results for your first research question.

• Plot and describe the data, as well as the statistical model. This can often be done in a single plot. Examples:

• ANOVA: A mean with CI bars is the statistical model overlayed on the data points.

• Contingency table: A mosaic plot with colored boxes relative to contribution to Pearson $$\chi^2$$ shows the data with evidence towards the alternative hypothesis.

• Simple linear regression: A regression line is the statistical model overlayed on the data points.

• Logistic regression: The logistic curve is the statistical model overlayed on the top/bottom histograms of the data.

• State the conclusion of the hypothesis test and interpret it in the context of the research question.

• “Is the population mean square-root total cigarettes smoked different for those with depression or not?”
• $$H_0: \mu_{D} = \mu_{ND}$$ versus $$H_A: \mu_{D} \ne \mu_{ND}$$
Warning: Removed 297 rows containing non-finite values (stat_bin).

Welch Two Sample t-test

data:  TotalCigsSmoked_sqrt by Depression
t = -19.317, df = 10794, p-value < 2.2e-16
alternative hypothesis: true difference in means between group No Depression and group Yes Depression is not equal to 0
95 percent confidence interval:
 -3.176909 -2.591550
sample estimates:
 mean in group No Depression mean in group Yes Depression 
                    7.360964                    10.245194 

Welch Two Sample t-test

data:  TotalCigsSmoked by Depression
t = -17.226, df = 10439, p-value < 2.2e-16
alternative hypothesis: true difference in means between group No Depression and group Yes Depression is not equal to 0
95 percent confidence interval:
 -88.58909 -70.48702
sample estimates:
 mean in group No Depression mean in group Yes Depression 
                    173.0296                     252.5677 

• Model assumptions are met; the sampling distribution of the difference in means is normal.

• Because $$p = 9.24\times 10^{-82} < 0.05$$ (with $$t_{s} = -19.32$$), we reject $$H_0$$ in favor of $$H_A$$ at an $$\alpha=0.05$$ significance level, concluding that the mean total cigarettes smoked differs by depression status.

• While statistically significant, the difference is modest: on the square-root scale it is 2.88, and back-transforming the group means gives $$7.36^2 \approx 54.2$$ and $$10.25^2 \approx 105.0$$, a difference of about 51 cigarettes per month.
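
The back-transformation can be checked directly (group means taken from the Welch t-test output on the square-root scale above):

```r
# Back-transform the square-root-scale group means to the cigarette scale
mean_no_dep  <- 7.360964^2    # about 54.2 cigarettes
mean_yes_dep <- 10.245194^2   # about 105.0 cigarettes
round(mean_yes_dep - mean_no_dep, 1)   # about 50.8
```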

1. (Class 24) (3 p) Results for your second research question.

• Plot and describe the data, as well as the statistical model. This can often be done in a single plot.

• State the conclusion of the hypothesis test and interpret it in the context of the research question.

• Is there an association between NicotineDependence and Ethnicity?

The model assumptions are met since the expected count for each cell is at least 5.

Because the p-value $$= 8.84\times 10^{-94} < 0.05$$ ($$X^2 = 439$$) we reject the null hypothesis concluding that there is an association between NicotineDependence and Ethnicity.

The primary cause of rejecting $$H_0$$ is that Hispanics have less nicotine dependence than expected by chance.

1. (Class 25) (4 p) Discussion

• Put the results you found for each research question in the context you provided in your introduction.

These results support the previous literature that individuals with major depression are more sensitive to the development of nicotine dependence regardless of how much they smoke. Thus, people with depression are an important population subgroup for targeted smoking intervention programs.

1. (Class 25) (1 p) Further directions or Future work or Next steps or something else that indicates there more to do and you’ve thought about it.

• What do these results lead you to want to investigate?

We can extend focus from young adults to all adults, and (with multiple regression techniques we’ll learn in ADA2) assess the particular depressed subpopulations most susceptible to developing nicotine dependence.

1. (Class 25) (2 p) References

• By citing sources in your introduction, this section will automatically have your bibliography.

References

References are supposed to appear here, but they may appear at the end of the document.

## 7.5 Class 30, Poster Presentations

[End]


## References

Breslau, Naomi. 1995. “Psychiatric Comorbidity of Smoking and Nicotine Dependence.” Behavior Genetics 25 (2): 95–101.
Breslau, N., E. L. Peterson, L. R. Schultz, H. D. Chilcoat, and P. Andreski. 1998. Archives of General Psychiatry 55 (2): 161–66.
Caraballo, Ralph S., Scott P. Novak, and Katherine Asman. 2009. “Linking Quantity and Frequency Profiles of Cigarette Smoking to the Presence of Nicotine Dependence Symptoms Among Adolescent Smokers: Findings from the 2004 National Youth Tobacco Survey.” Nicotine & Tobacco Research, January, ntn008. https://doi.org/10.1093/ntr/ntn008.
Dierker, Lisa C., Shelli Avenevoli, Abbie Goldberg, and Meyer Glantz. 2004. Prevention Science: The Official Journal of the Society for Prevention Research 5 (3): 169–83.
Dierker, Lisa C., Shelli Avenevoli, Kathleen R. Merikangas, Brian P. Flaherty, and Marilyn Stolar. 2001. “Association Between Psychiatric Disorders and the Progression of Tobacco Use Behaviors.” Journal of the American Academy of Child & Adolescent Psychiatry 40 (10): 1159–67. https://doi.org/10.1097/00004583-200110000-00009.
Dierker, Lisa C., Eric Donny, Stephen Tiffany, Suzanne M. Colby, Nicholas Perrine, and Richard R. Clayton. 2007. “The Association Between Cigarette Smoking and DSM-IV Nicotine Dependence Among First Year College Students.” Drug and Alcohol Dependence 86 (2–3): 106–14. https://doi.org/10.1016/j.drugalcdep.2006.05.025.
Grant, B. F., T. C. Harford, D. A. Dawson, P. S. Chou, and R. P. Pickering. 1995. Drug and Alcohol Dependence 39 (1): 37–44.
Grant, Bridget F., Deborah A. Dawson, Frederick S. Stinson, Patricia S. Chou, Ward Kay, and Roger Pickering. 2003. Drug and Alcohol Dependence 71 (1): 7–16.
Kandel, D. B., and K. Chen. 2000. Nicotine & Tobacco Research: Official Journal of the Society for Research on Nicotine and Tobacco 2 (3): 263–74.
Rohde, Paul, Christopher W. Kahler, Peter M. Lewinsohn, and Richard A. Brown. 2004. “Psychiatric Disorders, Familial Factors, and Cigarette Smoking: II. Associations with Progression to Daily Smoking.” Nicotine & Tobacco Research 6 (1): 119–32. https://doi.org/10.1080/14622200310001656948.
Rohde, Paul, Peter M. Lewinsohn, Richard A. Brown, Jeffrey M. Gau, and Christopher W. Kahler. 2003. “Psychiatric Disorders, Familial Factors and Cigarette Smoking: I. Associations with Smoking Initiation.” Nicotine & Tobacco Research 5 (1): 85–98. https://doi.org/10.1080/1462220031000070507.
Stanton, Warren R., John B. Lowe, and Phil A. Silva. 1995. “Antecedents of Vulnerability and Resilience to Smoking Among Adolescents.” Journal of Adolescent Health 16 (1): 71–77. https://doi.org/10.1016/1054-139X(94)00051-F.

## Citation

BibTeX citation:
@online{erhardt2022,
author = {Erik Erhardt},
editor = {},
title = {ADA1: {Erik’s} Cumulative Project File},
date = {2022-08-24},