# ADA1: Cumulative project file

NESARC, Smoking and depression in adults

Advanced Data Analysis 1, Stat 427/527, Fall 2022, Prof. Erik Erhardt, UNM

Author

Erik Erhardt

Published

August 13, 2022

# 1 Document overview

This document is organized by Class number. The worksheet assignments are indicated by the Tuesday and Thursday Class numbers.

• organize the document clearly (use this document as an example)
• label minor sections under each day (use this document as an example)
• For each thing you do, always have these three parts:
1. Say what you’re going to do and why.
2. Do it with code, and document your code.
3. Interpret the results.

## 1.1 Global code options

# I set some GLOBAL R chunk options here.
#   (to hide this message add "echo=FALSE" to the code chunk options)
# In particular, see the fig.height and fig.width (in inches)
#   and notes about the cache option.

knitr::opts_chunk$set(comment = NA, message = FALSE, warning = FALSE, width = 100) knitr::opts_chunk$set(fig.align = "center", fig.height = 4, fig.width = 6)

# Note: The "cache=TRUE" option will save the computations of code chunks
#   so R it doesn't recompute everything every time you recompile.
#   This can save _tons of time_ if you're working on a small section of code
#   at the bottom of the document.
#   Code chunks will be recompiled only if they are edited.
#   The autodep=TRUE will also update dependent code chunks.
#   A folder is created with the cache in it -- if you delete the folder, then
#   all the code chunks will recompute again.
#   ** If things are not working as expected, or I want to freshly compute everything,
#      I delete the *_cache folder.
knitr::opts_chunk$set(cache = FALSE) #, autodep=TRUE) #$

## 1.2 Document

### 1.2.1 Naming

Note: Each class save this file with a new name, updating the last two digits to the class number. Then, you’ll have a record of your progress, as well as which files you turned in for grading.

• ADA1_ALL_05.Rmd
• ADA1_ALL_06.Rmd
• ADA1_ALL_07.Rmd

A version that I prefer is to use a date using Year-Month-Day, YYYYMMDD:

• ADA1_ALL_20200903.Rmd
• ADA1_ALL_20200905.Rmd
• ADA1_ALL_20200910.Rmd

### 1.2.2 Structure

We will include all of our assignments together in this document to retain the relevant information needed for subsequent assignments since our analysis is cumulative. You will also have an opportunity to revisit previous parts to make changes or improvements, such as updating your codebook, recoding variables, and improving tables and plots. I’ve provided an initial predicted organization of our sections and subsections using the # and ## symbols. A table of contents is automatically generated using the “toc: true” in the yaml and can headings in the table of contents are clickable to jump down to each (sub)section.

### 1.2.3 Classes not appearing in this document

These class assignments are in their own documents: 09, 11, 13, 14, …

# 2 Research Questions

## 2.1 Class 04, Personal Codebook

Rubric

1. (1 p) Is there a topic of interest?

2. (2 p) Are the variables relevant to a set of research questions?

3. (4 p) Are there at least 2 categorical and 2 numerical variables (at least 4 “data” variables)?

• 1 categorical variable with only 2 levels
• 1 categorical variable with at least 3 levels
• 2 numerical variables with many possible unique values
• More variables are welcome and you’re likely to add to this later in the semester
4. (3 p) For each variable, is there a variable description, a data type, and coded value descriptions?

5. Compile this Rmd file to an html, print/save to pdf, and upload to UNM Learn.

### 2.1.1 Topic and research questions

Topic:

Is there an association between nicotine dependence and the frequency and quantity of smoking in adults? The association may differ by ethnicity, age, gender, and other factors (though we won’t be able to test these additional associations until next semester in ADA2).

Research questions:

1. Is nicotine dependence [S3AQ10D] associated with smoking frequency [S3AQ3B1] and quantity [S3AQ3C1]?

2. Is nicotine dependence [S3AQ10D] associated with depression [S4AQ1]?

3. Is the associated between nicotine dependence [S3AQ10D] and depression [S4AQ1] different by demographics, such as Education or Sex?

# 3 Codebook

National Epidemiologic Survey on Alcohol and Related Conditions-III (NESARC-III)

• Official site: https://www.niaaa.nih.gov/research/nesarc-iii
• Introduction: https://pubs.niaaa.nih.gov/publications/arh29-2/74-78.htm
Dataset: NESARC
Primary association: nicotine dependence vs frequency and quantity of smoking

Key:
RenamedVarName
VarName original in dataset
Variable description
Data type (Continuous, Discrete, Nominal, Ordinal)
Frequency ItemValue Description

ID
IDNUM
UNIQUE ID NUMBER WITH NO ALPHABETICS
Nominal
43093 1-43093. Unique Identification number

Sex
SEX
SEX
Nominal
18518 1. Male
24575 2. Female

Age
AGE
AGE
Continuous
43079 18-97. Age in years
14 98. 98 years or older

SmokingStatus
CHECK321
CIGARETTE SMOKING STATUS
Nominal
9913 1. Smoked cigarettes in the past 12 months
8078 2. Smoked cigarettes prior to the last 12 months
22 9. Unknown
25080 BL. NA, never or unknown if ever smoked 100+ cigarettes

NicotineDependence
TAB12MDX
NICOTINE DEPENDENCE IN THE LAST 12 MONTHS
Nominal
38131 0. No nicotine dependence
4962 1. Nicotine dependence

SmokingFreq
S3AQ3B1
USUAL FREQUENCY WHEN SMOKED CIGARETTES
Ordinal
14836 1. Every day
460 2. 5 to 6 Day(s) a week
687 3. 3 to 4 Day(s) a week
747 4. 1 to 2 Day(s) a week
409 5. 2 to 3 Day(s) a month
772 6. Once a month or less
102 9. Unknown
25080 BL. NA, never or unknown if ever smoked 100+ cigarettes

DailyCigsSmoked
S3AQ3C1
USUAL QUANTITY WHEN SMOKED CIGARETTES
Discrete
17751 1-98. Cigarette(s)
262 99. Unknown
25080 BL. NA, never or unknown if ever smoked 100+ cigarettes

Ethnicity
ETHRACE2A
IMPUTED RACE/ETHNICITY
Nominal
24507 1. White, Not Hispanic or Latino
8245 2. Black, Not Hispanic or Latino
701 3. American Indian/Alaska Native, Not Hispanic or Latino
1332 4. Asian/Native Hawaiian/Pacific Islander, Not Hispanic or Latino
8308 5. Hispanic or Latino

Depression
MAJORDEPLIFE
MAJOR DEPRESSION - LIFETIME (NON-HIERARCHICAL)
Nominal
35254 0. No
7839 1. Yes

IncomePersonal
S1Q10A
TOTAL PERSONAL INCOME IN LAST 12 MONTHS
Continuous
43093 0-3000000. Income in dollars

Height_ft
S1Q24FT
HEIGHT: FEET
42363 4-7. Feet
730 99. Unknown
* change 99 to NA

Height_in
S1Q24IN
HEIGHT: INCHES
Continuous
3572 0. None
38760 1-11. Inches
761 99. Unknown
* change 99 to NA

Weight_lb
S1Q24LB
WEIGHT: POUNDS
Continuous
41717 62-500. Pounds
1376 999. Unknown
* change 999 to NA

Additional variables were created from the original variables:

CREATED VARIABLES

DaysSmoke
estimated number of days per month a subject smokes
Continuous (due to way estimated), assumes 30 days per month using SmokingFreq)
1-30.

TotalCigsSmoked
estimated number of cigarettes per month a subject smokes (DaysSmoke * DailyCigsSmoked)
Continuous
0-large.

TotalCigsSmoked_smoked
estimated number of cigarettes per month a subject smokes (DaysSmoke * DailyCigsSmoked)
Smokers, only! (replaced 0s with NAs)
Continuous
1-large.

CigsSmokedFac
quartiles of TotalCigsSmoked
Ordinal
ranges for each of the four quarters

SmokingFreq3
three levels of smoking frequency
Ordinal
from SmokingFreq
1. daily   = 1. Every day, 2. 5 to 6 Day(s) a week, 3. 3 to 4 Day(s) a week
2. weekly  = 4. 1 to 2 Day(s) a week, 5. 2 to 3 Day(s) a month
3. monthly = 6. Once a month or less
4. Never   = 7. Never

Height_inches
Total height in inches
Height_ft * 12 + Height_in

# 4 Data Management

## 4.1 Class 05, Data subset and numerical summaries

Rubric

1. (4 p) The data are loaded and a data.frame subset is created by selecting only the variables in the personal codebook.

• Scroll down to sections labeled “(Class 05)”.
2. (1 p) Output confirms the subset is correct (e.g., using dim() and str()).

3. (3 p) Rename your variables to descriptive names (e.g., from “S3AQ3B1” to “SmokingFreq”).

• Scroll down to sections labeled “(Class 05)”.
4. (2 p) Provide numerical summaries for all variables (e.g., using summary()).

• Scroll down to sections labeled “(Class 05)”.

### 4.1.1 Data subset (Class 05)

First, the data is placed on the search path.

# data analysis packages
library(erikmisc)   # Helpful functions
library(tidyverse)  # Data manipulation and visualization suite
library(lubridate)  # Dates

## 2. Use the load() statement for the dataset you want to use.
# read data example

dim(NESARC)
[1] 43093  3008
# variables to include in our data subset
nesarc_sub <-
NESARC %>%
dplyr::select(
IDNUM
, SEX
, AGE
, CHECK321
, TAB12MDX
, S3AQ3B1
, S3AQ3C1
, ETHRACE2A
, MAJORDEPLIFE
, S1Q10A
, S1Q24FT
, S1Q24IN
, S1Q24LB
)

# Check size and structure of data subset.
# Note that categorical variables are already Factor variables, but the levels do not have meaningful labels, yet.
# size of subset
dim(nesarc_sub)
[1] 43093    13
# structure of subset
str(nesarc_sub)
'data.frame':   43093 obs. of  13 variables:
$IDNUM : Factor w/ 43093 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...$ SEX         : Factor w/ 2 levels "1","2": 1 2 2 1 1 2 1 2 2 2 ...
$AGE : num 23 28 81 18 36 34 19 84 29 18 ...$ CHECK321    : Factor w/ 3 levels "1","2","9": NA NA NA NA NA NA NA NA NA NA ...
$TAB12MDX : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...$ S3AQ3B1     : Factor w/ 7 levels "1","2","3","4",..: NA NA NA NA NA NA NA NA NA NA ...
$S3AQ3C1 : num NA NA NA NA NA NA NA NA NA NA ...$ ETHRACE2A   : Factor w/ 5 levels "1","2","3","4",..: 5 5 5 5 2 2 2 1 1 5 ...
$MAJORDEPLIFE: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 2 1 ...$ S1Q10A      : num  17500 11000 6000 27000 42000 500 2500 24000 65000 4000 ...
$S1Q24FT : num 5 5 5 5 6 5 5 5 5 5 ...$ S1Q24IN     : num  7 1 4 7 1 5 9 4 4 10 ...
$S1Q24LB : num 195 127 185 180 220 140 160 120 140 150 ... ### 4.1.2 Renaming Variables (Class 05) nesarc_sub <- nesarc_sub %>% dplyr::rename( ID = IDNUM , Sex = SEX , Age = AGE , SmokingStatus = CHECK321 , NicotineDependence = TAB12MDX , SmokingFreq = S3AQ3B1 , DailyCigsSmoked = S3AQ3C1 , Ethnicity = ETHRACE2A , Depression = MAJORDEPLIFE , IncomePersonal = S1Q10A , Height_ft = S1Q24FT , Height_in = S1Q24IN , Weight_lb = S1Q24LB ) str(nesarc_sub) 'data.frame': 43093 obs. of 13 variables:$ ID                : Factor w/ 43093 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
$Sex : Factor w/ 2 levels "1","2": 1 2 2 1 1 2 1 2 2 2 ...$ Age               : num  23 28 81 18 36 34 19 84 29 18 ...
$SmokingStatus : Factor w/ 3 levels "1","2","9": NA NA NA NA NA NA NA NA NA NA ...$ NicotineDependence: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$SmokingFreq : Factor w/ 7 levels "1","2","3","4",..: NA NA NA NA NA NA NA NA NA NA ...$ DailyCigsSmoked   : num  NA NA NA NA NA NA NA NA NA NA ...
$Ethnicity : Factor w/ 5 levels "1","2","3","4",..: 5 5 5 5 2 2 2 1 1 5 ...$ Depression        : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 2 1 ...
$IncomePersonal : num 17500 11000 6000 27000 42000 500 2500 24000 65000 4000 ...$ Height_ft         : num  5 5 5 5 6 5 5 5 5 5 ...
$Height_in : num 7 1 4 7 1 5 9 4 4 10 ...$ Weight_lb         : num  195 127 185 180 220 140 160 120 140 150 ...

### 4.1.3 Coding missing values (Class 06)

There are two steps. The first step is to recode any existing NAs to actual values, if necessary. The method for doing this differs for numeric and categorical variables. The second step is to recode any coded missing values, such as 9s or 99s, as actual NA.

#### 4.1.3.1 Coding NAs as meaningful “missing”

First step: the existing blank values with NA mean “never”, and “never” has a meaning different from “missing”. For each variable we need to decide what “never” means and code it appropriately.

##### 4.1.3.1.1NAs recoded as numeric

The easiest is DailyCigsSmoked where NA means 0 cigarettes. We replace NA with 0. The function is.na() identifies each observation that “is an NA”. I’ve updated the codebook to indicate that the original NA values were changed.

S3AQ3C1 (DailyCigsSmoked)
USUAL QUANTITY WHEN SMOKED CIGARETTES
Discrete
17751 1-98. Cigarette(s)
262 99. Unknown
25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
* change to 0 Cigarett(s)
table(nesarc_sub$DailyCigsSmoked)  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 934 884 923 573 1070 463 269 299 49 3077 23 230 34 25 851 40 17 18 19 20 21 22 23 24 25 27 28 29 30 33 34 35 22 59 5 5366 1 10 2 7 155 2 3 3 909 1 1 30 37 39 40 45 50 55 57 60 66 70 75 80 98 99 2 1 993 8 106 2 1 241 1 12 2 47 15 262  nesarc_sub <- nesarc_sub %>% replace_na( # replace NA with new numeric value list( DailyCigsSmoked = 0 ) ) table(nesarc_sub$DailyCigsSmoked)

0     1     2     3     4     5     6     7     8     9    10    11    12
25080   934   884   923   573  1070   463   269   299    49  3077    23   230
13    14    15    16    17    18    19    20    21    22    23    24    25
34    25   851    40    22    59     5  5366     1    10     2     7   155
27    28    29    30    33    34    35    37    39    40    45    50    55
2     3     3   909     1     1    30     2     1   993     8   106     2
57    60    66    70    75    80    98    99
1   241     1    12     2    47    15   262 
##### 4.1.3.1.2NAs recoded as categorical

The next two variables require defining a new category level. Both SmokingStatus and SmokingFreq with a value of NA means “never”, so we need to create a new “Never” category and put in order with the rest of the levels. I’ve updated the codebook to indicate that the original NA values were changed.

CHECK321 (SmokingStatus)
CIGARETTE SMOKING STATUS
Nominal
9913 1. Smoked cigarettes in the past 12 months
8078 2. Smoked cigarettes prior to the last 12 months
22 9. Unknown
25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
* change to 3. Never

S3AQ3B1 (SmokingFreq)
USUAL FREQUENCY WHEN SMOKED CIGARETTES
Ordinal
14836 1. Every day
460 2. 5 to 6 Day(s) a week
687 3. 3 to 4 Day(s) a week
747 4. 1 to 2 Day(s) a week
409 5. 2 to 3 Day(s) a month
772 6. Once a month or less
102 9. Unknown
25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
* change to 7. Never
table(nesarc_sub$SmokingStatus )  1 2 9 9913 8078 22  table(nesarc_sub$SmokingFreq    )

1     2     3     4     5     6     9
14836   460   687   747   409   772   102 
nesarc_sub <-
nesarc_sub %>%
mutate(
# first change the "factor" variables to "character" to add a new level
SmokingStatus   = as.character(SmokingStatus)
, SmokingFreq     = as.character(SmokingFreq  )
) %>%
replace_na(
# replace NA with new category values
list(
SmokingStatus = "3"
, SmokingFreq   = "7"
)
)

table(nesarc_sub$SmokingStatus )  1 2 3 9 9913 8078 25080 22  table(nesarc_sub$SmokingFreq    )

1     2     3     4     5     6     7     9
14836   460   687   747   409   772 25080   102 

#### 4.1.3.2 Coding 9s and 99s as NAs

Second step: replace the coded missing values with NA, then remove the remaining records with missing values. Not all of these variables need this operation, since the previous subset commands removed missing values by chance. However, I’m going to run each as an extra precaution. Afterall, later, if we were to make different previous data decisions, it’s possible we could end up with missing values at this point that are incorrectly coded without this step!

table(nesarc_sub$SmokingStatus )  1 2 3 9 9913 8078 25080 22  table(nesarc_sub$SmokingFreq    )

1     2     3     4     5     6     7     9
14836   460   687   747   409   772 25080   102 
table(nesarc_sub$DailyCigsSmoked)  0 1 2 3 4 5 6 7 8 9 10 11 12 25080 934 884 923 573 1070 463 269 299 49 3077 23 230 13 14 15 16 17 18 19 20 21 22 23 24 25 34 25 851 40 22 59 5 5366 1 10 2 7 155 27 28 29 30 33 34 35 37 39 40 45 50 55 2 3 3 909 1 1 30 2 1 993 8 106 2 57 60 66 70 75 80 98 99 1 241 1 12 2 47 15 262  nesarc_sub <- nesarc_sub %>% mutate( # replace missing values SmokingStatus = replace(SmokingStatus , SmokingStatus %in% c("9"), NA) , SmokingFreq = replace(SmokingFreq , SmokingFreq %in% c("9"), NA) , DailyCigsSmoked = replace(DailyCigsSmoked, DailyCigsSmoked %in% c( 99), NA) , Height_ft = replace(Height_ft , Height_ft %in% c( 99), NA) , Height_in = replace(Height_in , Height_in %in% c( 99), NA) , Weight_lb = replace(Weight_lb , Weight_lb %in% c(999), NA) # drop unused factor levels , SmokingStatus = fct_drop(SmokingStatus) , SmokingFreq = fct_drop(SmokingFreq ) ) table(nesarc_sub$SmokingStatus  )

1     2     3
9913  8078 25080 
table(nesarc_sub$SmokingFreq )  1 2 3 4 5 6 7 14836 460 687 747 409 772 25080  table(nesarc_sub$DailyCigsSmoked)

0     1     2     3     4     5     6     7     8     9    10    11    12
25080   934   884   923   573  1070   463   269   299    49  3077    23   230
13    14    15    16    17    18    19    20    21    22    23    24    25
34    25   851    40    22    59     5  5366     1    10     2     7   155
27    28    29    30    33    34    35    37    39    40    45    50    55
2     3     3   909     1     1    30     2     1   993     8   106     2
57    60    66    70    75    80    98
1   241     1    12     2    47    15 
summary(nesarc_sub)
       ID        Sex            Age       SmokingStatus NicotineDependence
1      :    1   1:18518   Min.   :18.0   1   : 9913    0:38131
2      :    1   2:24575   1st Qu.:32.0   2   : 8078    1: 4962
3      :    1             Median :44.0   3   :25080
4      :    1             Mean   :46.4   NA's:   22
5      :    1             3rd Qu.:59.0
6      :    1             Max.   :98.0
(Other):43087
SmokingFreq    DailyCigsSmoked  Ethnicity Depression IncomePersonal
7      :25080   Min.   : 0.000   1:24507   0:35254    Min.   :      0
1      :14836   1st Qu.: 0.000   2: 8245   1: 7839    1st Qu.:   8800
6      :  772   Median : 0.000   3:  701              Median :  20000
4      :  747   Mean   : 6.462   4: 1332              Mean   :  28185
3      :  687   3rd Qu.:10.000   5: 8308              3rd Qu.:  36000
(Other):  869   Max.   :98.000                        Max.   :3000000
NA's   :  102   NA's   :262
Height_ft      Height_in        Weight_lb
Min.   :4.00   Min.   : 0.000   Min.   : 62.0
1st Qu.:5.00   1st Qu.: 3.000   1st Qu.:140.0
Median :5.00   Median : 5.000   Median :165.0
Mean   :5.11   Mean   : 5.279   Mean   :170.6
3rd Qu.:5.00   3rd Qu.: 8.000   3rd Qu.:192.0
Max.   :7.00   Max.   :11.000   Max.   :500.0
NA's   :730    NA's   :761      NA's   :1376   

#### 4.1.3.3 Removing rows with missing values (don’t do this)

Finally, consider using na.omit() to remove any records with missing values if it won’t cause issues with analysis. The command na.omit() removes a row if any columns have an NA in that row. This can have drastic consequences. For example, imagine that you have 7 variables and 6 variables have complete data, but the 7th variable is about three-quarters NAs! Then na.omit() will drop three-quarters of your data, even though most of your variables are complete!

There’s only one advantage I can think of for running na.omit(), and that’s if you have a large dataset and only a few observations have a few NAs. You can’t analyze data with NAs, since any calculation involving NA results in an NA. Therefore imagine two scenarios: (1) An analysis with Variables 1 and 2, each with a few NA distributed throughout; first the observations with NA for either Variable 1 or 2 will be removed, then the analysis will proceed. (2) A similar analysis, but now with Variables 1 and 3, but Variable 3 has NA for different observations than Variable 2 — therefore, different observations will be removed before analysis. In summary, these two analyses will include different observations from the dataset. Therefore, by running na.omit() you will start with a dataset with no NAs, thus every analysis will be of exactly the same set of data (at the cost of not using all the possible data for each individual analysis).

Below, I do not run na.omit() since I’m concerned about having lots of NAs for some variables.

dim(nesarc_sub)
[1] 43093    13
# remove rows with missing values
#nesarc_sub <- na.omit(nesarc_sub)
dim(nesarc_sub)
[1] 43093    13
summary(nesarc_sub)
       ID        Sex            Age       SmokingStatus NicotineDependence
1      :    1   1:18518   Min.   :18.0   1   : 9913    0:38131
2      :    1   2:24575   1st Qu.:32.0   2   : 8078    1: 4962
3      :    1             Median :44.0   3   :25080
4      :    1             Mean   :46.4   NA's:   22
5      :    1             3rd Qu.:59.0
6      :    1             Max.   :98.0
(Other):43087
SmokingFreq    DailyCigsSmoked  Ethnicity Depression IncomePersonal
7      :25080   Min.   : 0.000   1:24507   0:35254    Min.   :      0
1      :14836   1st Qu.: 0.000   2: 8245   1: 7839    1st Qu.:   8800
6      :  772   Median : 0.000   3:  701              Median :  20000
4      :  747   Mean   : 6.462   4: 1332              Mean   :  28185
3      :  687   3rd Qu.:10.000   5: 8308              3rd Qu.:  36000
(Other):  869   Max.   :98.000                        Max.   :3000000
NA's   :  102   NA's   :262
Height_ft      Height_in        Weight_lb
Min.   :4.00   Min.   : 0.000   Min.   : 62.0
1st Qu.:5.00   1st Qu.: 3.000   1st Qu.:140.0
Median :5.00   Median : 5.000   Median :165.0
Mean   :5.11   Mean   : 5.279   Mean   :170.6
3rd Qu.:5.00   3rd Qu.: 8.000   3rd Qu.:192.0
Max.   :7.00   Max.   :11.000   Max.   :500.0
NA's   :730    NA's   :761      NA's   :1376   

Now we (would) have a dataset with no missing values.

### 4.1.4 Creating new variables (Class 06+)

#### 4.1.4.1 From categories to numeric

DaysSmoke estimates the days per month a subject smokes by converting SmokingFreq (a factor with 6 levels) to a numeric variable using as.numeric() and multiplying by the midpoint of the range of SmokingFreq.

nesarc_sub <-
nesarc_sub %>%
mutate(
DaysSmoke =
case_when(
SmokingFreq == 1 ~ 30      # 1. Every day
, SmokingFreq == 2 ~ 4 * 5.5 # 2. 5 to 6 Day(s) a week
, SmokingFreq == 3 ~ 4 * 3.5 # 3. 3 to 4 Day(s) a week
, SmokingFreq == 4 ~ 4 * 1.5 # 4. 1 to 2 Day(s) a week
, SmokingFreq == 5 ~ 2.5     # 5. 2 to 3 Day(s) a month
, SmokingFreq == 6 ~ 1       # 6. Once a month or less
, SmokingFreq == 7 ~ 0       # 7. Never
)
)

summary(nesarc_sub$DaysSmoke)  Min. 1st Qu. Median Mean 3rd Qu. Max. NA's 0.00 0.00 0.00 10.96 30.00 30.00 102  #### 4.1.4.2 From numeric to numeric Height_inches is the height of a person in inches. Since the data was collected in two variables (Height_ft and Height_in) we need to add them together for our new variable. nesarc_sub <- nesarc_sub %>% mutate( Height_inches = Height_ft * 12 + Height_in ) summary(nesarc_sub$Height_inches)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
48.0    64.0    66.0    66.6    70.0    84.0     761 

The variable TotalCigsSmoked estimates the monthly number of cigarettes a subject smokes per month by multiplying DaysSmoke times DailyCigsSmoked (the usual quantity smoked per day).

nesarc_sub <-
nesarc_sub %>%
mutate(
DaysSmoke_log2       = log2(DaysSmoke)
, TotalCigsSmoked      = DaysSmoke * DailyCigsSmoked
, TotalCigsSmoked_log2 = ifelse(TotalCigsSmoked > 0, log2(TotalCigsSmoked), NA)
, TotalCigsSmoked_smokers      = ifelse(TotalCigsSmoked > 0, TotalCigsSmoked, NA)
, TotalCigsSmoked_smokers_log2 = TotalCigsSmoked_log2
)

summary(nesarc_sub$TotalCigsSmoked)  Min. 1st Qu. Median Mean 3rd Qu. Max. NA's 0.0 0.0 0.0 187.5 300.0 2940.0 297  summary(nesarc_sub$TotalCigsSmoked_smokers)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
1.0   150.0   360.0   453.1   600.0  2940.0   25377 

#### 4.1.4.3 From numeric to categories based on quantiles

nesarc_sub <-
nesarc_sub %>%
mutate(
CigsSmokedFac  =
case_when(
TotalCigsSmoked == 0                              ~ "No smoking (0)"
, (TotalCigsSmoked >  0) & (TotalCigsSmoked <=   6) ~ "Lowest smoking (1-6)"
, (TotalCigsSmoked >  6) & (TotalCigsSmoked <=  30) ~ "Low smoking (7-30)"
, (TotalCigsSmoked > 30) & (TotalCigsSmoked <= 300) ~ "Medium smoking (31-300)"
, (TotalCigsSmoked > 300)                           ~ "High smoking (301+)"
)
, CigsSmokedFac =
factor(
CigsSmokedFac
, levels = c(
"No smoking (0)"
, "Lowest smoking (1-6)"
, "Low smoking (7-30)"
, "Medium smoking (31-300)"
, "High smoking (301+)"
)
)
)

summary(nesarc_sub$TotalCigsSmoked)  Min. 1st Qu. Median Mean 3rd Qu. Max. NA's 0.0 0.0 0.0 187.5 300.0 2940.0 297  summary(nesarc_sub$CigsSmokedFac)
         No smoking (0)    Lowest smoking (1-6)      Low smoking (7-30)
25080                    1045                    1129
Medium smoking (31-300)     High smoking (301+)                    NA's
6456                    9086                     297 
table(nesarc_sub$CigsSmokedFac)  No smoking (0) Lowest smoking (1-6) Low smoking (7-30) 25080 1045 1129 Medium smoking (31-300) High smoking (301+) 6456 9086  #### 4.1.4.4 From many categories to a few The variable SmokingFreq3 collapses the SmokingFreq from 6 down to 3 categories. This is an example to show you how to do this with your own variables. Some of you will have categorical variables that have a dozen or so categories; having so many categories makes interpretation very difficulty — fewer is easier. ## SmokingFreq3 variable # notice that I'm doing this before I've assigned the labels # if you do this after assigning labels (with the factor() command), then you'll want to use "as.numeric()" as below # inspect original variable table(nesarc_sub$SmokingFreq)

1     2     3     4     5     6     7
14836   460   687   747   409   772 25080 
# Note: it is sometimes easier to use the numeric "levels" of a factor variable, than the character "labels"

# initialize new variable with NAs
nesarc_sub <-
nesarc_sub %>%
mutate(
SmokingFreq3 =
fct_collapse(
SmokingFreq
, "1" = c("1", "2", "3")  # 1. daily
, "2" = c("4", "5")       # 2. weekly
, "3" = "6"               # 3. monthly
, "4" = "7"               # 7. never
)
)

# new variable
table(nesarc_sub$SmokingFreq3)  1 2 3 4 15983 1156 772 25080  # we will label the new factor variable below, with the others #### 4.1.4.5 Working with Dates How to calculate the duration between two dates. (You probably won’t need to use this for your project.) Dates can be tricky in R. The lubridate package makes them much easier. Below I’m going to create a variable for the person’s age in years based on the difference between the person’s date of birth and the interview date. There is an AGE variable which I’m already using, but if there wasn’t, or I wanted to be more precise about their age (such as, 24.34 years old), then this is a way to do it. I’m going to create a separate data frame just for this example to show how to do it; if you needed these variables, they should be part of your original subset above. Below, the “C” variables (CDAY) are the interview day, month, and year, and the “DOB” variables (DOBD) are the date of birth day, month, and year. dat_sub <- NESARC %>% dplyr::select( AGE , CDAY , CMON , CYEAR , DOBD , DOBM , DOBY ) str(dat_sub) 'data.frame': 43093 obs. of 7 variables:$ AGE  : num  23 28 81 18 36 34 19 84 29 18 ...
$CDAY : num 14 12 23 9 18 7 1 18 9 16 ...$ CMON : num  8 1 11 9 10 9 9 9 1 8 ...
$CYEAR: num 2001 2002 2001 2001 2001 ...$ DOBD : num  27 15 28 27 11 28 16 10 7 25 ...
$DOBM : num 9 12 12 6 9 12 3 11 4 3 ...$ DOBY : num  1977 1973 1919 1983 1965 ...

Combine the date fields together, then convert them into a date object. I’ll do this for both the interview date and the date of birth. I print the head (first 6 observations) after all the operations are complete.

dat_sub <-
dat_sub %>%
mutate(
## interview date
# combine day, month, and year into one text string
cdate_text = paste(dat_sub$CDAY, dat_sub$CMON, dat_sub$CYEAR) # create date object by interpretting the day, month, year with dmy() , cdate_date = dmy(cdate_text) ## date of birth # combine day, month, and year into one text string , dob_text = paste(dat_sub$DOBD, dat_sub$DOBM, dat_sub$DOBY)
# create date object by interpretting the day, month, year with dmy()
# if any failed to parse, it's because a day, month, or year is out of range
#    such as "99" for missing values or refused to answer
# in these cases, the date is NA (missing), which is what we want
, dob_date = dmy(dob_text)
)
head(dat_sub)
  AGE CDAY CMON CYEAR DOBD DOBM DOBY cdate_text cdate_date   dob_text
1  23   14    8  2001   27    9 1977  14 8 2001 2001-08-14  27 9 1977
2  28   12    1  2002   15   12 1973  12 1 2002 2002-01-12 15 12 1973
3  81   23   11  2001   28   12 1919 23 11 2001 2001-11-23 28 12 1919
4  18    9    9  2001   27    6 1983   9 9 2001 2001-09-09  27 6 1983
5  36   18   10  2001   11    9 1965 18 10 2001 2001-10-18  11 9 1965
6  34    7    9  2001   28   12 1966   7 9 2001 2001-09-07 28 12 1966
dob_date
1 1977-09-27
2 1973-12-15
3 1919-12-28
4 1983-06-27
5 1965-09-11
6 1966-12-28

Now that we have the interview date and date of birth as date objects, we calculate the difference to give the person’s age at the time of the interview. Because the difference of dates is in the unit “days”, we convert to years.

dat_sub <-
dat_sub %>%
mutate(
age_days  = cdate_date - dob_date
, age_years = as.numeric(age_days / 365.25)  # change from "difftime" to "numeric"
# difference between the age we calculated and the age in the original dataset
, age_diff  = age_years - AGE
)
head(dat_sub)
  AGE CDAY CMON CYEAR DOBD DOBM DOBY cdate_text cdate_date   dob_text
1  23   14    8  2001   27    9 1977  14 8 2001 2001-08-14  27 9 1977
2  28   12    1  2002   15   12 1973  12 1 2002 2002-01-12 15 12 1973
3  81   23   11  2001   28   12 1919 23 11 2001 2001-11-23 28 12 1919
4  18    9    9  2001   27    6 1983   9 9 2001 2001-09-09  27 6 1983
5  36   18   10  2001   11    9 1965 18 10 2001 2001-10-18  11 9 1965
6  34    7    9  2001   28   12 1966   7 9 2001 2001-09-07 28 12 1966
dob_date   age_days age_years   age_diff
1 1977-09-27  8722 days  23.87953 0.87953457
2 1973-12-15 10255 days  28.07666 0.07665982
3 1919-12-28 29916 days  81.90554 0.90554415
4 1983-06-27  6649 days  18.20397 0.20396988
5 1965-09-11 13186 days  36.10130 0.10130048
6 1966-12-28 12672 days  34.69405 0.69404517

Finally, keep new age variable in your data frame, drop (unselect) the variables you don’t need anymore, and you’re ready to go!

# drop the original ones you don't need anymore
#   by select(-VAR), it excludes the VAR column
dat_sub <-
dat_sub %>%
select(
-c(CDAY, CMON, CYEAR, DOBD, DOBM, DOBY)
# -c(cdate_text, cdate_date, dob_text, dob_date)
, -c(age_days, age_diff)
)
str(dat_sub)
'data.frame':   43093 obs. of  6 variables:
$AGE : num 23 28 81 18 36 34 19 84 29 18 ...$ cdate_text: chr  "14 8 2001" "12 1 2002" "23 11 2001" "9 9 2001" ...
$cdate_date: Date, format: "2001-08-14" "2002-01-12" ...$ dob_text  : chr  "27 9 1977" "15 12 1973" "28 12 1919" "27 6 1983" ...
$dob_date : Date, format: "1977-09-27" "1973-12-15" ...$ age_years : num  23.9 28.1 81.9 18.2 36.1 ...

Now you’ve got an age (or other date object) you can use in your project.

(For this example, I’m going to delete the dat_sub object since it was just for this demonstration.

rm(dat_sub)

#### 4.1.4.6 Review results of new variables

summary(nesarc_sub)
       ID        Sex            Age       SmokingStatus NicotineDependence
1      :    1   1:18518   Min.   :18.0   1   : 9913    0:38131
2      :    1   2:24575   1st Qu.:32.0   2   : 8078    1: 4962
3      :    1             Median :44.0   3   :25080
4      :    1             Mean   :46.4   NA's:   22
5      :    1             3rd Qu.:59.0
6      :    1             Max.   :98.0
(Other):43087
SmokingFreq    DailyCigsSmoked  Ethnicity Depression IncomePersonal
7      :25080   Min.   : 0.000   1:24507   0:35254    Min.   :      0
1      :14836   1st Qu.: 0.000   2: 8245   1: 7839    1st Qu.:   8800
6      :  772   Median : 0.000   3:  701              Median :  20000
4      :  747   Mean   : 6.462   4: 1332              Mean   :  28185
3      :  687   3rd Qu.:10.000   5: 8308              3rd Qu.:  36000
(Other):  869   Max.   :98.000                        Max.   :3000000
NA's   :  102   NA's   :262
Height_ft      Height_in        Weight_lb       DaysSmoke     Height_inches
Min.   :4.00   Min.   : 0.000   Min.   : 62.0   Min.   : 0.00   Min.   :48.0
1st Qu.:5.00   1st Qu.: 3.000   1st Qu.:140.0   1st Qu.: 0.00   1st Qu.:64.0
Median :5.00   Median : 5.000   Median :165.0   Median : 0.00   Median :66.0
Mean   :5.11   Mean   : 5.279   Mean   :170.6   Mean   :10.96   Mean   :66.6
3rd Qu.:5.00   3rd Qu.: 8.000   3rd Qu.:192.0   3rd Qu.:30.00   3rd Qu.:70.0
Max.   :7.00   Max.   :11.000   Max.   :500.0   Max.   :30.00   Max.   :84.0
NA's   :730    NA's   :761      NA's   :1376    NA's   :102     NA's   :761
DaysSmoke_log2  TotalCigsSmoked  TotalCigsSmoked_log2 TotalCigsSmoked_smokers
Min.   : -Inf   Min.   :   0.0   Min.   : 0.000       Min.   :   1.0
1st Qu.: -Inf   1st Qu.:   0.0   1st Qu.: 7.229       1st Qu.: 150.0
Median : -Inf   Median :   0.0   Median : 8.492       Median : 360.0
Mean   : -Inf   Mean   : 187.5   Mean   : 7.888       Mean   : 453.1
3rd Qu.:4.907   3rd Qu.: 300.0   3rd Qu.: 9.229       3rd Qu.: 600.0
Max.   :4.907   Max.   :2940.0   Max.   :11.522       Max.   :2940.0
NA's   :102     NA's   :297      NA's   :25377        NA's   :25377
TotalCigsSmoked_smokers_log2                 CigsSmokedFac   SmokingFreq3
Min.   : 0.000               No smoking (0)         :25080   1   :15983
1st Qu.: 7.229               Lowest smoking (1-6)   : 1045   2   : 1156
Median : 8.492               Low smoking (7-30)     : 1129   3   :  772
Mean   : 7.888               Medium smoking (31-300): 6456   4   :25080
3rd Qu.: 9.229               High smoking (301+)    : 9086   NA's:  102
Max.   :11.522               NA's                   :  297
NA's   :25377                                                            

### 4.1.5 Labeling Categorical variable levels (Class 06)

Look at summary of dataset to see the state of the labeling before we make changes.

summary(nesarc_sub)
       ID        Sex            Age       SmokingStatus NicotineDependence
1      :    1   1:18518   Min.   :18.0   1   : 9913    0:38131
2      :    1   2:24575   1st Qu.:32.0   2   : 8078    1: 4962
3      :    1             Median :44.0   3   :25080
4      :    1             Mean   :46.4   NA's:   22
5      :    1             3rd Qu.:59.0
6      :    1             Max.   :98.0
(Other):43087
SmokingFreq    DailyCigsSmoked  Ethnicity Depression IncomePersonal
7      :25080   Min.   : 0.000   1:24507   0:35254    Min.   :      0
1      :14836   1st Qu.: 0.000   2: 8245   1: 7839    1st Qu.:   8800
6      :  772   Median : 0.000   3:  701              Median :  20000
4      :  747   Mean   : 6.462   4: 1332              Mean   :  28185
3      :  687   3rd Qu.:10.000   5: 8308              3rd Qu.:  36000
(Other):  869   Max.   :98.000                        Max.   :3000000
NA's   :  102   NA's   :262
Height_ft      Height_in        Weight_lb       DaysSmoke     Height_inches
Min.   :4.00   Min.   : 0.000   Min.   : 62.0   Min.   : 0.00   Min.   :48.0
1st Qu.:5.00   1st Qu.: 3.000   1st Qu.:140.0   1st Qu.: 0.00   1st Qu.:64.0
Median :5.00   Median : 5.000   Median :165.0   Median : 0.00   Median :66.0
Mean   :5.11   Mean   : 5.279   Mean   :170.6   Mean   :10.96   Mean   :66.6
3rd Qu.:5.00   3rd Qu.: 8.000   3rd Qu.:192.0   3rd Qu.:30.00   3rd Qu.:70.0
Max.   :7.00   Max.   :11.000   Max.   :500.0   Max.   :30.00   Max.   :84.0
NA's   :730    NA's   :761      NA's   :1376    NA's   :102     NA's   :761
DaysSmoke_log2  TotalCigsSmoked  TotalCigsSmoked_log2 TotalCigsSmoked_smokers
Min.   : -Inf   Min.   :   0.0   Min.   : 0.000       Min.   :   1.0
1st Qu.: -Inf   1st Qu.:   0.0   1st Qu.: 7.229       1st Qu.: 150.0
Median : -Inf   Median :   0.0   Median : 8.492       Median : 360.0
Mean   : -Inf   Mean   : 187.5   Mean   : 7.888       Mean   : 453.1
3rd Qu.:4.907   3rd Qu.: 300.0   3rd Qu.: 9.229       3rd Qu.: 600.0
Max.   :4.907   Max.   :2940.0   Max.   :11.522       Max.   :2940.0
NA's   :102     NA's   :297      NA's   :25377        NA's   :25377
TotalCigsSmoked_smokers_log2                 CigsSmokedFac   SmokingFreq3
Min.   : 0.000               No smoking (0)         :25080   1   :15983
1st Qu.: 7.229               Lowest smoking (1-6)   : 1045   2   : 1156
Median : 8.492               Low smoking (7-30)     : 1129   3   :  772
Mean   : 7.888               Medium smoking (31-300): 6456   4   :25080
3rd Qu.: 9.229               High smoking (301+)    : 9086   NA's:  102
Max.   :11.522               NA's                   :  297
NA's   :25377                                                            

Informative labels are given to the factor levels. The order of the levels is also rearranged for the variables SmokingFreq, NicotineDependence, and Sex.

# Label factors with meaningful names in the same order as the codebook
nesarc_sub$Depression <- factor( nesarc_sub$Depression
, labels = c("No Depression"
, "Yes Depression"
)
)

nesarc_sub$SmokingStatus <- factor( nesarc_sub$SmokingStatus
, labels = c("Smoked past 12 months"
, "Smoked prior to 12 months"
, "Never smoked"
)
)

# Specify the order of the levels to be different from codebook order
nesarc_sub$Sex <- factor(nesarc_sub$Sex
, levels = c( 2
, 1
)
, labels = c( "Female"
, "Male"
)
)
# check ordering with a frequency table
table(nesarc_sub$Sex)  Female Male 24575 18518  nesarc_sub$NicotineDependence <-
factor(nesarc_sub$NicotineDependence , levels = c( 1 , 0 ) , labels = c( "Nicotine Dependence" , "No Nicotine Dependence" ) ) # Remember to include the new category level from the original NA values. nesarc_sub$SmokingFreq <-
factor(nesarc_sub$SmokingFreq , levels = c( 7 , 6 , 5 , 4 , 3 , 2 , 1 ) , labels = c( "Never" , "Once a month or less" , "2 to 3 Days/month" , "1 to 2 Days/week" , "3 to 4 Days/week" , "5 to 6 Days/week" , "Every Day" ) ) nesarc_sub$Ethnicity <-
factor(nesarc_sub$Ethnicity # shorter labels are helpful for plots , labels = c( "Cauc" , "AfAm" , "NaAm" , "Asia" , "Hisp" ) # , labels = c("Caucasian" # , "African American" # , "Native American" # , "Asian" # , "Hispanic" # ) ) nesarc_sub$SmokingFreq3 <-
factor(nesarc_sub$SmokingFreq3 , levels = c( 4 , 3 , 2 , 1 ) , labels = c( "Never" , "Monthly" , "Weekly" , "Daily" ) ) summary(nesarc_sub)  ID Sex Age 1 : 1 Female:24575 Min. :18.0 2 : 1 Male :18518 1st Qu.:32.0 3 : 1 Median :44.0 4 : 1 Mean :46.4 5 : 1 3rd Qu.:59.0 6 : 1 Max. :98.0 (Other):43087 SmokingStatus NicotineDependence Smoked past 12 months : 9913 Nicotine Dependence : 4962 Smoked prior to 12 months: 8078 No Nicotine Dependence:38131 Never smoked :25080 NA's : 22 SmokingFreq DailyCigsSmoked Ethnicity Never :25080 Min. : 0.000 Cauc:24507 Every Day :14836 1st Qu.: 0.000 AfAm: 8245 Once a month or less: 772 Median : 0.000 NaAm: 701 1 to 2 Days/week : 747 Mean : 6.462 Asia: 1332 3 to 4 Days/week : 687 3rd Qu.:10.000 Hisp: 8308 (Other) : 869 Max. :98.000 NA's : 102 NA's :262 Depression IncomePersonal Height_ft Height_in No Depression :35254 Min. : 0 Min. :4.00 Min. : 0.000 Yes Depression: 7839 1st Qu.: 8800 1st Qu.:5.00 1st Qu.: 3.000 Median : 20000 Median :5.00 Median : 5.000 Mean : 28185 Mean :5.11 Mean : 5.279 3rd Qu.: 36000 3rd Qu.:5.00 3rd Qu.: 8.000 Max. :3000000 Max. :7.00 Max. :11.000 NA's :730 NA's :761 Weight_lb DaysSmoke Height_inches DaysSmoke_log2 Min. : 62.0 Min. : 0.00 Min. :48.0 Min. : -Inf 1st Qu.:140.0 1st Qu.: 0.00 1st Qu.:64.0 1st Qu.: -Inf Median :165.0 Median : 0.00 Median :66.0 Median : -Inf Mean :170.6 Mean :10.96 Mean :66.6 Mean : -Inf 3rd Qu.:192.0 3rd Qu.:30.00 3rd Qu.:70.0 3rd Qu.:4.907 Max. :500.0 Max. :30.00 Max. :84.0 Max. :4.907 NA's :1376 NA's :102 NA's :761 NA's :102 TotalCigsSmoked TotalCigsSmoked_log2 TotalCigsSmoked_smokers Min. : 0.0 Min. : 0.000 Min. : 1.0 1st Qu.: 0.0 1st Qu.: 7.229 1st Qu.: 150.0 Median : 0.0 Median : 8.492 Median : 360.0 Mean : 187.5 Mean : 7.888 Mean : 453.1 3rd Qu.: 300.0 3rd Qu.: 9.229 3rd Qu.: 600.0 Max. :2940.0 Max. :11.522 Max. :2940.0 NA's :297 NA's :25377 NA's :25377 TotalCigsSmoked_smokers_log2 CigsSmokedFac SmokingFreq3 Min. : 0.000 No smoking (0) :25080 Never :25080 1st Qu.: 7.229 Lowest smoking (1-6) : 1045 Monthly: 772 Median : 8.492 Low smoking (7-30) : 1129 Weekly : 1156 Mean : 7.888 Medium smoking (31-300): 6456 Daily :15983 3rd Qu.: 9.229 High smoking (301+) : 9086 NA's : 102 Max. :11.522 NA's : 297 NA's :25377  ### 4.1.6 Data subset rows ## 4.2 Data is complete (Class 06) ### 4.2.1 Plot entire dataset, show missing values {r} ggplot_missing(nesarc_sub) ### 4.2.2 Numerical summaries to assess correctness (Class 05) summary(nesarc_sub)  ID Sex Age 1 : 1 Female:24575 Min. :18.0 2 : 1 Male :18518 1st Qu.:32.0 3 : 1 Median :44.0 4 : 1 Mean :46.4 5 : 1 3rd Qu.:59.0 6 : 1 Max. :98.0 (Other):43087 SmokingStatus NicotineDependence Smoked past 12 months : 9913 Nicotine Dependence : 4962 Smoked prior to 12 months: 8078 No Nicotine Dependence:38131 Never smoked :25080 NA's : 22 SmokingFreq DailyCigsSmoked Ethnicity Never :25080 Min. : 0.000 Cauc:24507 Every Day :14836 1st Qu.: 0.000 AfAm: 8245 Once a month or less: 772 Median : 0.000 NaAm: 701 1 to 2 Days/week : 747 Mean : 6.462 Asia: 1332 3 to 4 Days/week : 687 3rd Qu.:10.000 Hisp: 8308 (Other) : 869 Max. :98.000 NA's : 102 NA's :262 Depression IncomePersonal Height_ft Height_in No Depression :35254 Min. : 0 Min. :4.00 Min. : 0.000 Yes Depression: 7839 1st Qu.: 8800 1st Qu.:5.00 1st Qu.: 3.000 Median : 20000 Median :5.00 Median : 5.000 Mean : 28185 Mean :5.11 Mean : 5.279 3rd Qu.: 36000 3rd Qu.:5.00 3rd Qu.: 8.000 Max. :3000000 Max. :7.00 Max. :11.000 NA's :730 NA's :761 Weight_lb DaysSmoke Height_inches DaysSmoke_log2 Min. : 62.0 Min. : 0.00 Min. :48.0 Min. : -Inf 1st Qu.:140.0 1st Qu.: 0.00 1st Qu.:64.0 1st Qu.: -Inf Median :165.0 Median : 0.00 Median :66.0 Median : -Inf Mean :170.6 Mean :10.96 Mean :66.6 Mean : -Inf 3rd Qu.:192.0 3rd Qu.:30.00 3rd Qu.:70.0 3rd Qu.:4.907 Max. :500.0 Max. :30.00 Max. :84.0 Max. :4.907 NA's :1376 NA's :102 NA's :761 NA's :102 TotalCigsSmoked TotalCigsSmoked_log2 TotalCigsSmoked_smokers Min. : 0.0 Min. : 0.000 Min. : 1.0 1st Qu.: 0.0 1st Qu.: 7.229 1st Qu.: 150.0 Median : 0.0 Median : 8.492 Median : 360.0 Mean : 187.5 Mean : 7.888 Mean : 453.1 3rd Qu.: 300.0 3rd Qu.: 9.229 3rd Qu.: 600.0 Max. :2940.0 Max. :11.522 Max. :2940.0 NA's :297 NA's :25377 NA's :25377 TotalCigsSmoked_smokers_log2 CigsSmokedFac SmokingFreq3 Min. : 0.000 No smoking (0) :25080 Never :25080 1st Qu.: 7.229 Lowest smoking (1-6) : 1045 Monthly: 772 Median : 8.492 Low smoking (7-30) : 1129 Weekly : 1156 Mean : 7.888 Medium smoking (31-300): 6456 Daily :15983 3rd Qu.: 9.229 High smoking (301+) : 9086 NA's : 102 Max. :11.522 NA's : 297 NA's :25377  # 5 Graphing and Tabulating ## 5.1 Class 06, Plotting univariate Rubric 1. (3 p) For one categorical variable, a barplot is plotted with axis labels and a title. Interpret the plot: describe the relationship between categories you observe. 2. (3 p) For one numerical variable, a histogram or boxplot is plotted with axis labels and a title. Interpret the plot: describe the distribution (shape, center, spread, outliers). 3. (2 p) Code missing variables, remove records with missing values, indicate with R output that this was done correctly (e.g., str(), dim(), summary()). • Scroll up to sections labeled “(Class 06)”. 4. (2 p) Label levels of factor variables. • Scroll up to sections labeled “(Class 06)”. ## 5.2 Categorical variables ### 5.2.1 Tables for categorical variables # table() produces a one-variable frequency table table(nesarc_sub$NicotineDependence)

Nicotine Dependence No Nicotine Dependence
4962                  38131 
# proportions available by passing these frequencies to prop.table()
table(nesarc_sub$NicotineDependence) %>% prop.table()  Nicotine Dependence No Nicotine Dependence 0.1151463 0.8848537  tab_nicdep <- nesarc_sub %>% group_by( NicotineDependence ) %>% summarize( n = n() #, .groups = "drop_last" ) %>% mutate( prop = round(n / sum(n), 3) ) %>% #arrange( # desc(n) #) %>% ungroup() tab_nicdep # A tibble: 2 × 3 NicotineDependence n prop <fct> <int> <dbl> 1 Nicotine Dependence 4962 0.115 2 No Nicotine Dependence 38131 0.885 ### 5.2.2 Graphing frequency tables library(ggplot2) p1 <- ggplot(data = nesarc_sub, aes(x = NicotineDependence)) p1 <- p1 + geom_bar() p1 <- p1 + labs(x = "Nicotine Dependence" , y = "Total Number" , title = "Nicotine Dependence for Adults" ) print(p1) Interpretation: Among adults (n = 43093), there are about 8 times more people (88%) who are not nicotine dependent compared to those who are (12%). ## 5.3 Numeric variables ### 5.3.1 Graphing numeric variables p <- ggplot(data = nesarc_sub, aes(x = DailyCigsSmoked)) p <- p + geom_histogram(aes(y = ..density..), boundary = 0, binwidth = 5) #p <- p + geom_rug() p <- p + geom_density(alpha = 0.2, fill = "gray50", adjust = 4) p <- p + labs(x = "Estimated Cigarettes Smoked per Month" , y = "" , title = "Monthly cigaretts smoked for Young Smoking Adults" ) p <- p + theme_bw() print(p) summary(nesarc_sub$DailyCigsSmoked)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
0.000   0.000   0.000   6.462  10.000  98.000     262 
fivenum(nesarc_sub$DailyCigsSmoked) [1] 0 0 0 10 98 median(nesarc_sub$DailyCigsSmoked, na.rm = TRUE)
[1] 0
IQR(nesarc_sub$DailyCigsSmoked, na.rm = TRUE) [1] 10 Interpretation: Among smokers, the number of cigarettes smoked is right skewed (shape) with a median (center) of 15 and the IQR (spread) (interquartile range, middle 50%) for the distribution is 14. There are two distinct modes (peaks) in the distribution, the first between 0 and 10 cigarettes (half pack) and the second at 20 (full pack). There are several outlying values (outliers) greater than 40 (representing 40 cigarettes per day) that are possibly overestimates from the people responding. ## 5.4 Class 07, Plotting bivariate, numeric response Rubric 1. Each of the following (2 p for plot, 2 p for labeled axes and title, 1 p for interpretation): 1. Scatter plot (for regression): $$x$$ = numerical, $$y$$ = numerical, include axis labels and a title. Interpret the plot: describe the relationship. 2. Box plots (for ANOVA): $$x$$ = categorical, $$y$$ = numerical, include axis labels and a title. Interpret the plot: describe the relationship. ### 5.4.1 Scatter plot (for regression): x = numerical, y = numerical Interpretation: The relationship between Age and TotalCigsSmoked, among people who smoke, is very weakly positive, but does not resemble a straight line. TotalCigsSmoked is right-skewed which results in many extreme values above the regression line. {r} library(ggplot2) p <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = Age, y = TotalCigsSmoked)) #p <- p + geom_point(alpha = 1/10) p <- p + geom_jitter(width = 0.25, height = 5, alpha = 1/10) p <- p + geom_smooth() p <- p + stat_smooth(method = lm, colour = "red") p <- p + labs( title = "Total Cigarettes Smoked vs Age" #, x = , y = "Total Cigarettes Smoked per month" , caption = "Key: Blue line is GAM smoother, Red line is simple linear regression.\nFiltered to include only smokers (cig > 0)." ) print(p) An example of height and weight. {r} library(ggplot2) p <- ggplot(nesarc_sub, aes(x = Height_inches, y = Weight_lb)) p <- p + geom_jitter(height = 0.5, alpha = 1/10) p <- p + stat_smooth(method = lm) p <- p + labs(title = "Weight vs Height") print(p) ### 5.4.2 Box plots (for ANOVA): x = categorical, y = numerical Interpretation: For NESARC participants who smoke (cig > 0), those who are depressed have a slightly higher mean number of cigarettes per month (red diamond) than those who are not depressed. Total cigarettes smoked per month is right skewed so there are many people who smoke very little but few people who smoke a lot, which increases the mean beyond the median. {r} library(ggplot2) p <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = Depression, y = TotalCigsSmoked)) p <- p + geom_boxplot(width = 0.5, alpha = 0.5) p <- p + geom_jitter(position = position_jitter(width = 0.1), alpha = 1/100) # diamond at mean for each group p <- p + stat_summary(fun.y = mean, geom = "point", shape = 18, size = 6, colour = "red", alpha = 0.8) p <- p + labs( title = "Total Cigarettes Smoked by Depression" , x = "Depression status" , y = "Total Cigarettes Smoked per month" , caption = "Red diamond is the mean.\nFiltered to include only smokers (cig > 0)." ) print(p) ## 5.5 Class 08, Plotting bivariate, categorical response Rubric 1. Each of the following (2 p for plot, 2 p for labeled axes and title, 1 p for interpretation): 1. Mosaic plot or bivariate bar plots (for contingency tables): $$x$$ = categorical, $$y$$ = categorical, include axis labels and a title. Interpret the plot: describe the relationship. 2. Logistic scatter plot (for logistic regression): $$x$$ = numerical, $$y$$ = categorical (binary), include axis labels and a title. Interpret the plot: describe the relationship. ### 5.5.1 Mosaic plot or bivariate bar plots (for contingency tables): x = categorical, y = categorical tab_dep_nic <- nesarc_sub %>% group_by( Depression, NicotineDependence ) %>% summarize( n = n() ) %>% mutate( prop = round(n / sum(n), 3) ) %>% ungroup() tab_dep_nic # A tibble: 4 × 4 Depression NicotineDependence n prop <fct> <fct> <int> <dbl> 1 No Depression Nicotine Dependence 3193 0.091 2 No Depression No Nicotine Dependence 32061 0.909 3 Yes Depression Nicotine Dependence 1769 0.226 4 Yes Depression No Nicotine Dependence 6070 0.774 tab_nic_dep <- nesarc_sub %>% group_by( NicotineDependence, Depression ) %>% summarize( n = n() ) %>% mutate( prop = round(n / sum(n), 3) ) %>% ungroup() tab_nic_dep # A tibble: 4 × 4 NicotineDependence Depression n prop <fct> <fct> <int> <dbl> 1 Nicotine Dependence No Depression 3193 0.643 2 Nicotine Dependence Yes Depression 1769 0.357 3 No Nicotine Dependence No Depression 32061 0.841 4 No Nicotine Dependence Yes Depression 6070 0.159 Reversing color order: It is sometimes preferable to reverse the order of your color-filled variable to read the more interesting proportions from the bottom of the plot up the proportion scale. Use the function forcats::fct_rev() on your fill= variable. library(ggplot2) p <- ggplot(data = nesarc_sub, aes(x = Depression, fill = forcats::fct_rev(NicotineDependence))) p <- p + geom_bar(position = "fill") p <- p + theme_bw() p <- p + labs(x = "Depression status" , y = "Proportion" , title = "Proportion of young adult smokers with and without\nnicotine dependence by depression status" ) # the legend title can be modified, if desired (try this line) p <- p + scale_fill_discrete(name = "Nicotine Dependence\nStatus") print(p) Interpretation: When a person has depression, they are more than twice as likely to have nicotine dependence (22.6%) than those without depression (9.1%). ### 5.5.2 Logistic scatter plot (for logistic regression): x = numerical, y = categorical (binary) table(nesarc_sub$NicotineDependence)

Nicotine Dependence No Nicotine Dependence
4962                  38131 
# if "Nicotine Dependence", then code it as a 1, otherwise code as a 0
nesarc_sub$NicotineDependence01 <- ifelse(nesarc_sub$NicotineDependence == "Nicotine Dependence", 1, 0)
# check that the frequencies are correct before and after
table(nesarc_sub$NicotineDependence01)  0 1 38131 4962  # # Note, sometimes it is necessary to make your variable a character type first. # # If the code above doesn't produce both 0s and 1s, then try this: # nesarc_sub$NicotineDependence01 <- ifelse(as.character(nesarc_sub$NicotineDependence) == "Nicotine Dependence", 1, 0) {r, fig.height = 4, fig.width = 8} library(ggplot2) p <- ggplot(nesarc_sub, aes(x = TotalCigsSmoked, y = NicotineDependence01)) p <- p + theme_bw() p <- p + geom_jitter(position = position_jitter(height = 0.1), alpha = 1/4) # overlay a sigmodal curve to indicate direction of relationship # As of ggplot2 2.1.0, need to use this binomial_smooth() function: binomial_smooth <- function(...) { geom_smooth(method = "glm", method.args = list(family = "binomial"), ...) } p <- p + binomial_smooth() # old way: stat_smooth(method="glm", family="binomial", se=FALSE) p <- p + scale_y_continuous(breaks = seq(0, 1, by=0.2), minor_breaks = seq(0, 1, by=0.1)) p <- p + labs(x = "Estimated Cigarettes Smoked per Month" , y = "Nicotine Dependence (1=Yes, 0=No)" , title = "Likelihood of Nicotine Dependence\nincreases with number of cigarettes smoked" ) print(p) Interpretation: The probability of someone being tobacco dependent increases as the number of cigarettes smoked per month increases. The logistic curve model probably doesn’t apply at (or near) 0 cigarettes. # 6 Statistical methods ## 6.1 Class 09, Simple linear regression (separate worksheet) ## 6.2 Class 10, Simple linear regression Rubric 1. With your previous (or new) bivariate scatter plot, add a regression line. • (2 p) plot with regression line, • (1 p) label axes and title. 2. Use lm() to fit the linear regression and interpret slope and $$R^2$$ (R-squared) values. • (2 p) lm summary() table is presented, • (2 p) slope is interpreted with respect to a per-unit increase of the $$x$$ variable in the context of the variables in the plot, • (2 p) $$R^2$$ is interpretted in a sentence. 3. (1 p) Interpret the intercept. Does it make sense in the context of your study? ### 6.2.1 1. Scatter plot, add a regression line. {r} library(ggplot2) p <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = Age, y = TotalCigsSmoked)) #p <- p + geom_point(alpha = 1/10) p <- p + geom_jitter(width = 0.25, height = 5, alpha = 1/10) #p <- p + geom_smooth() p <- p + stat_smooth(method = lm, fullrange = TRUE) p <- p + labs( title = "Total Cigarettes Smoked vs Age" #, x = , y = "Total Cigarettes Smoked per month" , caption = "Key: Blue line is simple linear regression.\nFiltered to include only smokers (cig > 0)." ) p <- p + xlim(c(0, NA)) print(p) ### 6.2.2 2. Fit the linear regression, interpret slope and $$R^2$$ (R-squared) values # fit the simple linear regression model lm_TCS_Age <- lm( formula = TotalCigsSmoked ~ Age , data = nesarc_sub ) # use summary() to parameters estimates (slope, intercept) and other summaries summary(lm_TCS_Age)  Call: lm(formula = TotalCigsSmoked ~ Age, data = nesarc_sub) Residuals: Min 1Q Median 3Q Max -332.1 -189.4 -139.1 106.3 2828.9 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 57.92792 4.35989 13.29 <2e-16 *** Age 2.79797 0.08762 31.93 <2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 329.2 on 42794 degrees of freedom (297 observations deleted due to missingness) Multiple R-squared: 0.02327, Adjusted R-squared: 0.02325 F-statistic: 1020 on 1 and 42794 DF, p-value: < 2.2e-16 • Slope: For every 1 year increase in Age, we expect an increase of 2.798 total cigarettes smoked per month. • $$R^2$$: The proportion of variance explained by the regression model, compared to the grand mean, is $$R^2 = 0.0233$$, which is very small. ### 6.2.3 3. Interpret the intercept. Does it make sense? For someone 0 years old, the expected number of cigarettes smoked per month is 57.93. This does not make sense since newborns do not smoke and because this is a large extrapolation from the data. ## 6.3 Class 11, Logarithm transformation (separate worksheet) ## 6.4 Class 12, Logarithm transformation Rubric 1. Try plotting the data on a logarithmic scale • (6 p) Each of the logarithmic relationships is plotted, axes are labeled with scale. 1. original scales 2. $$\log(x)$$-only 3. $$\log(y)$$-only 4. both $$\log(x)$$ and $$\log(y)$$ 2. What happened to your data when you transformed it? • (2 p) Describe what happened to the relationship after each log transformation (compare transformed scale to original scale; is the relationship more linear, more curved?). • (1 p) Choose the best scale for a linear relationship and explain why. • (1 p) Does your relationship benefit from a logarithmic transformation? Say why or why not. ### 6.4.1 1. Try plotting on log scale (original scale, $$\log(x)$$-only, $$\log(y)$$-only, both $$\log(x)$$ and $$\log(y)$$) {r, fig.height = 8, fig.width = 8} # IncomePersonal #p <- p + scale_x_log10() #p <- p + scale_y_log10() library(ggplot2) p1 <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = Age, y = TotalCigsSmoked)) p1 <- p1 + theme_bw() p1 <- p1 + geom_point(alpha = 1/20) p1 <- p1 + stat_smooth(method = lm) p1 <- p1 + labs( title = "Original scale" , x = "Age" , y = "Total Cigarettes Smoked per month" ) #print(p1) library(ggplot2) p2 <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = Age, y = TotalCigsSmoked)) p2 <- p2 + theme_bw() p2 <- p2 + geom_point(alpha = 1/20) p2 <- p2 + stat_smooth(method = lm) p2 <- p2 + scale_x_log10() p2 <- p2 + labs( title = "Log(x)" , x = "Age" , y = "Total Cigarettes Smoked per month" ) #print(p2) library(ggplot2) p3 <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = Age, y = TotalCigsSmoked)) p3 <- p3 + theme_bw() p3 <- p3 + geom_point(alpha = 1/20) p3 <- p3 + stat_smooth(method = lm) p3 <- p3 + scale_y_log10() p3 <- p3 + labs( title = "Log(y)" , x = "Age" , y = "Total Cigarettes Smoked per month" ) #print(p3) library(ggplot2) p4 <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = Age, y = TotalCigsSmoked)) p4 <- p4 + theme_bw() p4 <- p4 + geom_point(alpha = 1/20) p4 <- p4 + stat_smooth(method = lm) p4 <- p4 + scale_x_log10() p4 <- p4 + scale_y_log10() p4 <- p4 + labs( title = "Log(x) and Log(y)" , x = "Age" , y = "Total Cigarettes Smoked per month" ) #print(p4) # grid.arrange() is a way to arrange several ggplot objects library(gridExtra) grid.arrange(grobs = list(p1, p2, p3, p4), nrow=2, top = "Total Cigarettes Smoked vs Age")  ### 6.4.2 2. What happened to your data when you transformed it? • Describe what happened to the relationship after each log transformation (compare transformed scale to original scale). 1. Original: Same 2. log(x): no improvement, age values became left skewed 3. log(y): some improvement, cigs smoked became left skewed, but the regression line was closer to the center of the data. 4. log(x) and log(y): some improvement, cigs smoked became left skewed, but the regression line was closer to the center of the data; age transformation doesn’t help. • Choose the best scale for a linear relationship and explain why. log(y) because the cigs smoked became slightly left skewed (better than the extreme right skewness from before), and there was no need for an age transformation. • Does your relationship benefit from a logarithmic transformation? Say why or why not. A little bit since the degree of skewness vertically around the regression line was reduced with the log(y) transformation. ##### 6.4.2.0.1 Extra plots Here’s an example where transforming both $$x$$ and $$y$$ with the log makes a big improvement. {r, fig.height = 8, fig.width = 8} # IncomePersonal #p <- p + scale_x_log10() #p <- p + scale_y_log10() library(ggplot2) p1 <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = IncomePersonal, y = TotalCigsSmoked)) p1 <- p1 + theme_bw() p1 <- p1 + geom_point(alpha = 1/20) p1 <- p1 + stat_smooth(method = lm) p1 <- p1 + labs( title = "Original scale" , x = "Personal Income" , y = "Total Cigarettes Smoked per month" ) #print(p1) library(ggplot2) p2 <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = IncomePersonal, y = TotalCigsSmoked)) p2 <- p2 + theme_bw() p2 <- p2 + geom_point(alpha = 1/20) p2 <- p2 + stat_smooth(method = lm) p2 <- p2 + scale_x_log10() p2 <- p2 + labs( title = "Log(x)" , x = "Personal Income" , y = "Total Cigarettes Smoked per month" ) #print(p2) library(ggplot2) p3 <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = IncomePersonal, y = TotalCigsSmoked)) p3 <- p3 + theme_bw() p3 <- p3 + geom_point(alpha = 1/20) p3 <- p3 + stat_smooth(method = lm) p3 <- p3 + scale_y_log10() p3 <- p3 + labs( title = "Log(y)" , x = "Personal Income" , y = "Total Cigarettes Smoked per month" ) #print(p3) library(ggplot2) p4 <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = IncomePersonal, y = TotalCigsSmoked)) p4 <- p4 + theme_bw() p4 <- p4 + geom_point(alpha = 1/20) p4 <- p4 + stat_smooth(method = lm) p4 <- p4 + scale_x_log10() p4 <- p4 + scale_y_log10() p4 <- p4 + labs( title = "Log(x) and Log(y)" , x = "Personal Income" , y = "Total Cigarettes Smoked per month" ) #print(p4) # grid.arrange() is a way to arrange several ggplot objects library(gridExtra) grid.arrange(grobs = list(p1, p2, p3, p4), nrow=2, top = "Total Cigarettes Smoked vs Personal Income")  ## 6.5 Class 13, Correlation (separate worksheet) ## 6.6 Class 14, Categorical contingency tables (separate worksheet) ## 6.7 Class 15, Correlation and Categorical contingency tables Rubric 1. With your previous (or a new) bivariate scatter plot, calculate the correlation and interpret. • (1 p) plot is repeated here or the plot is referenced an easy to find from a plot above, • (1 p) correlation is calculated, • (2 p) correlation is interpreted (direction, strength of LINEAR relationship). 2. With your previous (or a new) two- or three-variable categorical plot, calculate conditional proportions and interpret. • (1 p) frequency table of variables is given, • (2 p) conditional proportion tables are calculated of the outcome variable conditional on one or two other variables, • (1 p) a well-labeled plot of the proportion table is given, • (2 p) the conditional proportions are interpreted and compared between conditions. ### 6.7.1 Correlation {r, fig.height = 5, fig.width = 5} library(ggplot2) p <- ggplot(nesarc_sub, aes(x = Age, y = TotalCigsSmoked_log2)) p <- p + theme_bw() p <- p + geom_point(alpha = 1/20) p <- p + stat_smooth(method = lm) p <- p + labs( title = "Log2 Cigarettes smoked vs Age" , x = "Age" , y = "Log2 Total Cigarettes Smoked per month" ) print(p) cor_A_C <- cor( nesarc_sub$Age
, nesarc_sub\$TotalCigsSmoked_log2
, use = "complete.obs"
)
cor_A_C
[1] 0.1375481

### 6.7.2 Interpretation of correlation

The correlation is 0.138, this is a positive moderately-weak linear relationship between Age and Log2 total cigarettes smoked per month, meaning that as people get older they tend slightly to increase the Log2 Total Number of Cigarettes smoked per month.

### 6.7.3 Contingency table

tab_S_D_N <-
nesarc_sub %>%
group_by(
Sex
, Depression
, NicotineDependence
) %>%
summarise(
Frequency = n()
) %>%
mutate(
Proportion = round(Frequency / sum(Frequency), 3)
) %>%
ungroup()

# tab_S_D_N

tab_S_D_N %>%
knitr::kable()
Sex Depression NicotineDependence Frequency Proportion
Female No Depression Nicotine Dependence 1455 0.076
Female No Depression No Nicotine Dependence 17706 0.924
Female Yes Depression Nicotine Dependence 1117 0.206
Female Yes Depression No Nicotine Dependence 4297 0.794
Male No Depression Nicotine Dependence 1738 0.108
Male No Depression No Nicotine Dependence 14355 0.892
Male Yes Depression Nicotine Dependence 652 0.269
Male Yes Depression No Nicotine Dependence 1773 0.731
library(ggplot2)
p <- ggplot(data = tab_S_D_N, aes(x = Depression, y = Proportion, fill = forcats::fct_rev(NicotineDependence)))
p <- p + geom_hline(yintercept = c(0, 1), alpha = 1/4)
p <- p + geom_bar(stat="identity")
p <- p + theme_bw()
p <- p + labs(
title = "Proportion of Nicotine Dependence by Depression status\nfor each Gender"
, y = "Proportion of Nicotine Dependence"
, fill = "Nicotine Dependence"
)