- 1 Document overview
- 2 Research Questions
- 3 Codebook
- 4 Data Management
- 4.1 Class 05, Data subset and numerical summaries
- 4.2 Data is complete (Class 06)

- 5 Graphing and Tabulating
- 6 Statistical methods
- 6.1 Class 09, Simple linear regression (separate worksheet)
- 6.2 Class 10, Simple linear regression
- 6.3 Class 11, Logarithm transformation (separate worksheet)
- 6.4 Class 12, Logarithm transformation
- 6.5 Class 13, Correlation (separate worksheet)
- 6.6 Class 14, Categorical contingency tables (separate worksheet)
- 6.7 Class 15, Correlation and Categorical contingency tables
- 6.8 Class 16, Parameter estimation (one-sample) (separate worksheet)
- 6.9 Class 17, Inference and Parameter estimation (one-sample)
- 6.10 Class 18, Hypothesis testing (one- and two-sample) (separate worksheet)
- 6.11 Class 19, Paired data, assumption assessment (separate worksheet)
- 6.12 Class 20, Hypothesis testing (one- and two-sample)
- 6.13 Class 21, ANOVA, Pairwise comparisons (separate worksheet)
- 6.14 Class 22, ANOVA and Assessing Assumptions
- 6.15 Class 23, Nonparametric methods (separate worksheet)
- 6.16 Class 24, Binomial and Multinomial tests (separate worksheet)
- 6.17 Class 25, Two-way categorical tables (separate worksheet)
- 6.18 Class 26, Simple linear regression (separate worksheet)
- 6.19 Class 27, Two-way categorical and simple linear regression
- 6.19.1 Two-way categorical analysis
- 6.19.2 Simple linear regression
- 6.19.2.1 (1 p) Plot the data and, if required, transform the variables so a roughly linear relationship is observed. All interpretations will be done on this scale of the variables.
- 6.19.2.2 (0 p) Fit the simple linear regression model.
- 6.19.2.3 (1 p) Assess the residuals for lack of fit (interpret plots of residuals vs fitted and \(x\)-value).
- 6.19.2.4 (1 p) Assess the residuals for normality (interpret QQ-plot and histogram).
- 6.19.2.5 (1 p) Assess the relative influence of points.
- 6.19.2.6 (1 p) Test whether the slope is different from zero, \(H_A: \beta_1 \ne 0\).
- 6.19.2.7 (1 p) Interpret the \(R^2\) value.

- 6.20 Class 28, Logistic regression (separate worksheet)
- 6.21 Class 29, Logistic regression

This document is organized by Class number. The worksheet assignments are indicated by the Tuesday and Thursday Class numbers.

Consider your readers (graders):

- organize the document clearly (use this document as an example)
- label minor sections under each day (use this document as an example)
- For each thing you do, always have these three parts:
- Say what you’re going to do and why.
- Do it with code, and document your code.
- Interpret the results.

```
# I set some GLOBAL R chunk options here.
# (to hide this message add "echo=FALSE" to the code chunk options)
# In particular, see the fig.height and fig.width (in inches)
# and notes about the cache option.
::opts_chunk$set(comment = NA, message = FALSE, warning = FALSE, width = 100)
knitr::opts_chunk$set(fig.align = "center", fig.height = 4, fig.width = 6)
knitr
# Note: The "cache=TRUE" option will save the computations of code chunks
# so R it doesn't recompute everything every time you recompile.
# This can save _tons of time_ if you're working on a small section of code
# at the bottom of the document.
# Code chunks will be recompiled only if they are edited.
# The autodep=TRUE will also update dependent code chunks.
# A folder is created with the cache in it -- if you delete the folder, then
# all the code chunks will recompute again.
# ** If things are not working as expected, or I want to freshly compute everything,
# I delete the *_cache folder.
::opts_chunk$set(cache = FALSE) #, autodep=TRUE) #$ knitr
```

*Note: Each class save this file with a new name, updating the last two digits to the class number. Then, you’ll have a record of your progress, as well as which files you turned in for grading.*

`ADA1_ALL_05.Rmd`

`ADA1_ALL_06.Rmd`

`ADA1_ALL_07.Rmd`

…

A version that I prefer is to use a date using Year-Month-Day, `YYYYMMDD`

:

`ADA1_ALL_20200903.Rmd`

`ADA1_ALL_20200905.Rmd`

`ADA1_ALL_20200910.Rmd`

…

We will include all of our assignments together in this document to retain the relevant information needed for subsequent assignments since our analysis is cumulative. You will also have an opportunity to revisit previous parts to make changes or improvements, such as updating your codebook, recoding variables, and improving tables and plots. I’ve provided an initial predicted organization of our sections and subsections using the # and ## symbols. A table of contents is automatically generated using the “toc: true” in the yaml and can headings in the table of contents are clickable to jump down to each (sub)section.

These class assignments are in their own documents: 09, 11, 13, 14, …

**Rubric**

(1 p) Is there a topic of interest?

(2 p) Are the variables relevant to a set of research questions?

(4 p) Are there at least 2 categorical and 2 numerical variables (at least 4 “data” variables)?

- 1 categorical variable with only 2 levels
- 1 categorical variable with at least 3 levels
- 2 numerical variables with many possible unique values
- More variables are welcome and you’re likely to add to this later in the semester

(3 p) For each variable, is there a variable description, a data type, and coded value descriptions?

Compile this Rmd file to an html, print/save to pdf, and upload to UNM Learn.

**Topic:**

Is there an association between nicotine dependence and the frequency and quantity of smoking in adults? The association may differ by ethnicity, age, gender, and other factors (though we won’t be able to test these additional associations until next semester in ADA2).

**Research questions:**

Is nicotine dependence [S3AQ10D] associated with smoking frequency [S3AQ3B1] and quantity [S3AQ3C1]?

Is nicotine dependence [S3AQ10D] associated with depression [S4AQ1]?

Is the associated between nicotine dependence [S3AQ10D] and depression [S4AQ1] different by demographics, such as Education or Sex?

National Epidemiologic Survey on Alcohol and Related Conditions-III (NESARC-III)

- Codebook: https://statacumen.com/teach/ADA1/PDS_data/NESARC_W1_CodeBook.pdf
- Official site: https://www.niaaa.nih.gov/research/nesarc-iii
- Introduction: https://pubs.niaaa.nih.gov/publications/arh29-2/74-78.htm

```
Dataset: NESARC
Primary association: nicotine dependence vs frequency and quantity of smoking
Key:
RenamedVarName
VarName original in dataset
Variable description
Data type (Continuous, Discrete, Nominal, Ordinal)
Frequency ItemValue Description
ID
IDNUM
UNIQUE ID NUMBER WITH NO ALPHABETICS
Nominal
43093 1-43093. Unique Identification number
Sex
SEX
SEX
Nominal
18518 1. Male
24575 2. Female
Age
AGE
AGE
Continuous
43079 18-97. Age in years
14 98. 98 years or older
SmokingStatus
CHECK321
CIGARETTE SMOKING STATUS
Nominal
9913 1. Smoked cigarettes in the past 12 months
8078 2. Smoked cigarettes prior to the last 12 months
22 9. Unknown
25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
NicotineDependence
TAB12MDX
NICOTINE DEPENDENCE IN THE LAST 12 MONTHS
Nominal
38131 0. No nicotine dependence
4962 1. Nicotine dependence
SmokingFreq
S3AQ3B1
USUAL FREQUENCY WHEN SMOKED CIGARETTES
Ordinal
14836 1. Every day
460 2. 5 to 6 Day(s) a week
687 3. 3 to 4 Day(s) a week
747 4. 1 to 2 Day(s) a week
409 5. 2 to 3 Day(s) a month
772 6. Once a month or less
102 9. Unknown
25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
DailyCigsSmoked
S3AQ3C1
USUAL QUANTITY WHEN SMOKED CIGARETTES
Discrete
17751 1-98. Cigarette(s)
262 99. Unknown
25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
Ethnicity
ETHRACE2A
IMPUTED RACE/ETHNICITY
Nominal
24507 1. White, Not Hispanic or Latino
8245 2. Black, Not Hispanic or Latino
701 3. American Indian/Alaska Native, Not Hispanic or Latino
1332 4. Asian/Native Hawaiian/Pacific Islander, Not Hispanic or Latino
8308 5. Hispanic or Latino
Depression
MAJORDEPLIFE
MAJOR DEPRESSION - LIFETIME (NON-HIERARCHICAL)
Nominal
35254 0. No
7839 1. Yes
IncomePersonal
S1Q10A
TOTAL PERSONAL INCOME IN LAST 12 MONTHS
Continuous
43093 0-3000000. Income in dollars
Height_ft
S1Q24FT
HEIGHT: FEET
42363 4-7. Feet
730 99. Unknown
* change 99 to NA
Height_in
S1Q24IN
HEIGHT: INCHES
Continuous
3572 0. None
38760 1-11. Inches
761 99. Unknown
* change 99 to NA
Weight_lb
S1Q24LB
WEIGHT: POUNDS
Continuous
41717 62-500. Pounds
1376 999. Unknown
* change 999 to NA
```

Additional variables were created from the original variables:

```
CREATED VARIABLES
DaysSmoke
estimated number of days per month a subject smokes
Continuous (due to way estimated), assumes 30 days per month using SmokingFreq)
1-30.
TotalCigsSmoked
estimated number of cigarettes per month a subject smokes (DaysSmoke * DailyCigsSmoked)
Continuous
0-large.
CigsSmokedFac
quartiles of TotalCigsSmoked
Ordinal
ranges for each of the four quarters
SmokingFreq3
three levels of smoking frequency
Ordinal
from SmokingFreq
1. daily = 1. Every day, 2. 5 to 6 Day(s) a week, 3. 3 to 4 Day(s) a week
2. weekly = 4. 1 to 2 Day(s) a week, 5. 2 to 3 Day(s) a month
3. monthly = 6. Once a month or less
4. Never = 7. Never
Height_inches
Total height in inches
Height_ft * 12 + Height_in
```

**Rubric**

(4 p) The data are loaded and a data.frame subset is created by selecting only the variables in the personal codebook.

- Scroll down to sections labeled “(Class 05)”.

(1 p) Output confirms the subset is correct (e.g., using

`dim()`

and`str()`

).(3 p) Rename your variables to descriptive names (e.g., from “S3AQ3B1” to “SmokingFreq”).

- Scroll down to sections labeled “(Class 05)”.

(2 p) Provide numerical summaries for all variables (e.g., using

`summary()`

).- Scroll down to sections labeled “(Class 05)”.

First, the data is placed on the search path.

```
# data analysis packages
library(tidyverse) # Data manipulation and visualization suite
library(lubridate) # Dates
source("ada_functions.R")
## 1. Download the ".RData" file for your dataset into your ADA Folder.
## 2. Use the load() statement for the dataset you want to use.
# read data example
load("NESARC.RData")
dim(NESARC)
```

`[1] 43093 3008`

```
# variables to include in our data subset
<-
nesarc_sub %>%
NESARC ::select(
dplyr
IDNUM
, SEX
, AGE
, CHECK321
, TAB12MDX
, S3AQ3B1
, S3AQ3C1
, ETHRACE2A
, MAJORDEPLIFE
, S1Q10A
, S1Q24FT
, S1Q24IN
, S1Q24LB
)
# Check size and structure of data subset.
# Note that categorical variables are already `Factor` variables, but the levels do not have meaningful labels, yet.
# size of subset
dim(nesarc_sub)
```

`[1] 43093 13`

```
# structure of subset
str(nesarc_sub)
```

```
'data.frame': 43093 obs. of 13 variables:
$ IDNUM : Factor w/ 43093 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
$ SEX : Factor w/ 2 levels "1","2": 1 2 2 1 1 2 1 2 2 2 ...
$ AGE : num 23 28 81 18 36 34 19 84 29 18 ...
$ CHECK321 : Factor w/ 3 levels "1","2","9": NA NA NA NA NA NA NA NA NA NA ...
$ TAB12MDX : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ S3AQ3B1 : Factor w/ 7 levels "1","2","3","4",..: NA NA NA NA NA NA NA NA NA NA ...
$ S3AQ3C1 : num NA NA NA NA NA NA NA NA NA NA ...
$ ETHRACE2A : Factor w/ 5 levels "1","2","3","4",..: 5 5 5 5 2 2 2 1 1 5 ...
$ MAJORDEPLIFE: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 2 1 ...
$ S1Q10A : num 17500 11000 6000 27000 42000 500 2500 24000 65000 4000 ...
$ S1Q24FT : num 5 5 5 5 6 5 5 5 5 5 ...
$ S1Q24IN : num 7 1 4 7 1 5 9 4 4 10 ...
$ S1Q24LB : num 195 127 185 180 220 140 160 120 140 150 ...
```

```
<-
nesarc_sub %>%
nesarc_sub ::rename(
dplyrID = IDNUM
Sex = SEX
, Age = AGE
, SmokingStatus = CHECK321
, NicotineDependence = TAB12MDX
, SmokingFreq = S3AQ3B1
, DailyCigsSmoked = S3AQ3C1
, Ethnicity = ETHRACE2A
, Depression = MAJORDEPLIFE
, IncomePersonal = S1Q10A
, Height_ft = S1Q24FT
, Height_in = S1Q24IN
, Weight_lb = S1Q24LB
,
)
str(nesarc_sub)
```

```
'data.frame': 43093 obs. of 13 variables:
$ ID : Factor w/ 43093 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Sex : Factor w/ 2 levels "1","2": 1 2 2 1 1 2 1 2 2 2 ...
$ Age : num 23 28 81 18 36 34 19 84 29 18 ...
$ SmokingStatus : Factor w/ 3 levels "1","2","9": NA NA NA NA NA NA NA NA NA NA ...
$ NicotineDependence: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ SmokingFreq : Factor w/ 7 levels "1","2","3","4",..: NA NA NA NA NA NA NA NA NA NA ...
$ DailyCigsSmoked : num NA NA NA NA NA NA NA NA NA NA ...
$ Ethnicity : Factor w/ 5 levels "1","2","3","4",..: 5 5 5 5 2 2 2 1 1 5 ...
$ Depression : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 2 1 ...
$ IncomePersonal : num 17500 11000 6000 27000 42000 500 2500 24000 65000 4000 ...
$ Height_ft : num 5 5 5 5 6 5 5 5 5 5 ...
$ Height_in : num 7 1 4 7 1 5 9 4 4 10 ...
$ Weight_lb : num 195 127 185 180 220 140 160 120 140 150 ...
```

There are two steps. The first step is to recode any existing `NA`

s to actual values, if necessary. The method for doing this differs for numeric and categorical variables. The second step is to recode any coded missing values, such as 9s or 99s, as actual `NA`

.

`NA`

s as meaningful “missing”First step: the existing blank values with `NA`

mean “never”, and “never” has a meaning different from “missing”. For each variable we need to decide what “never” means and code it appropriately.

`NA`

s recoded as numericThe easiest is `DailyCigsSmoked`

where `NA`

means 0 cigarettes. We replace `NA`

with 0. The function `is.na()`

identifies each observation that “is an `NA`

”. I’ve updated the codebook to indicate that the original `NA`

values were changed.

```
S3AQ3C1 (DailyCigsSmoked)
USUAL QUANTITY WHEN SMOKED CIGARETTES
Discrete
17751 1-98. Cigarette(s)
262 99. Unknown
25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
* change to 0 Cigarett(s)
```

`table(nesarc_sub$DailyCigsSmoked)`

```
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 27 28 29 30 33 34 35 37 39 40
934 884 923 573 1070 463 269 299 49 3077 23 230 34 25 851 40 22 59 5 5366 1 10 2 7 155 2 3 3 909 1 1 30 2 1 993
45 50 55 57 60 66 70 75 80 98 99
8 106 2 1 241 1 12 2 47 15 262
```

```
<-
nesarc_sub %>%
nesarc_sub replace_na(
# replace NA with new numeric value
list(
DailyCigsSmoked = 0
)
)
table(nesarc_sub$DailyCigsSmoked)
```

```
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 27 28 29
25080 934 884 923 573 1070 463 269 299 49 3077 23 230 34 25 851 40 22 59 5 5366 1 10 2 7 155 2 3 3
30 33 34 35 37 39 40 45 50 55 57 60 66 70 75 80 98 99
909 1 1 30 2 1 993 8 106 2 1 241 1 12 2 47 15 262
```

`NA`

s recoded as categoricalThe next two variables require defining a new category level. Both `SmokingStatus`

and `SmokingFreq`

with a value of `NA`

means “never”, so we need to create a new “Never” category and put in order with the rest of the levels. I’ve updated the codebook to indicate that the original `NA`

values were changed.

```
CHECK321 (SmokingStatus)
CIGARETTE SMOKING STATUS
Nominal
9913 1. Smoked cigarettes in the past 12 months
8078 2. Smoked cigarettes prior to the last 12 months
22 9. Unknown
25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
* change to 3. Never
S3AQ3B1 (SmokingFreq)
USUAL FREQUENCY WHEN SMOKED CIGARETTES
Ordinal
14836 1. Every day
460 2. 5 to 6 Day(s) a week
687 3. 3 to 4 Day(s) a week
747 4. 1 to 2 Day(s) a week
409 5. 2 to 3 Day(s) a month
772 6. Once a month or less
102 9. Unknown
25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
* change to 7. Never
```

`table(nesarc_sub$SmokingStatus )`

```
1 2 9
9913 8078 22
```

`table(nesarc_sub$SmokingFreq )`

```
1 2 3 4 5 6 9
14836 460 687 747 409 772 102
```

```
<-
nesarc_sub %>%
nesarc_sub mutate(
# first change the "factor" variables to "character" to add a new level
SmokingStatus = as.character(SmokingStatus)
SmokingFreq = as.character(SmokingFreq )
, %>%
) replace_na(
# replace NA with new category values
list(
SmokingStatus = "3"
SmokingFreq = "7"
,
)
)
table(nesarc_sub$SmokingStatus )
```

```
1 2 3 9
9913 8078 25080 22
```

`table(nesarc_sub$SmokingFreq )`

```
1 2 3 4 5 6 7 9
14836 460 687 747 409 772 25080 102
```

`NA`

sSecond step: replace the coded missing values with `NA`

, then remove the remaining records with missing values. Not all of these variables need this operation, since the previous subset commands removed missing values by chance. However, I’m going to run each as an extra precaution. Afterall, later, if we were to make different previous data decisions, it’s possible we could end up with missing values at this point that are incorrectly coded without this step!

`table(nesarc_sub$SmokingStatus )`

```
1 2 3 9
9913 8078 25080 22
```

`table(nesarc_sub$SmokingFreq )`

```
1 2 3 4 5 6 7 9
14836 460 687 747 409 772 25080 102
```

`table(nesarc_sub$DailyCigsSmoked)`

```
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 27 28 29
25080 934 884 923 573 1070 463 269 299 49 3077 23 230 34 25 851 40 22 59 5 5366 1 10 2 7 155 2 3 3
30 33 34 35 37 39 40 45 50 55 57 60 66 70 75 80 98 99
909 1 1 30 2 1 993 8 106 2 1 241 1 12 2 47 15 262
```

```
<-
nesarc_sub %>%
nesarc_sub mutate(
# replace missing values
SmokingStatus = replace(SmokingStatus , SmokingStatus %in% c("9"), NA)
SmokingFreq = replace(SmokingFreq , SmokingFreq %in% c("9"), NA)
, DailyCigsSmoked = replace(DailyCigsSmoked, DailyCigsSmoked %in% c( 99), NA)
, Height_ft = replace(Height_ft , Height_ft %in% c( 99), NA)
, Height_in = replace(Height_in , Height_in %in% c( 99), NA)
, Weight_lb = replace(Weight_lb , Weight_lb %in% c(999), NA)
, # drop unused factor levels
SmokingStatus = fct_drop(SmokingStatus)
, SmokingFreq = fct_drop(SmokingFreq )
,
)
table(nesarc_sub$SmokingStatus )
```

```
1 2 3
9913 8078 25080
```

`table(nesarc_sub$SmokingFreq )`

```
1 2 3 4 5 6 7
14836 460 687 747 409 772 25080
```

`table(nesarc_sub$DailyCigsSmoked)`

```
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 27 28 29
25080 934 884 923 573 1070 463 269 299 49 3077 23 230 34 25 851 40 22 59 5 5366 1 10 2 7 155 2 3 3
30 33 34 35 37 39 40 45 50 55 57 60 66 70 75 80 98
909 1 1 30 2 1 993 8 106 2 1 241 1 12 2 47 15
```

`summary(nesarc_sub)`

```
ID Sex Age SmokingStatus NicotineDependence SmokingFreq DailyCigsSmoked Ethnicity Depression IncomePersonal Height_ft
1 : 1 1:18518 Min. :18.0 1 : 9913 0:38131 7 :25080 Min. : 0.000 1:24507 0:35254 Min. : 0 Min. :4.00
2 : 1 2:24575 1st Qu.:32.0 2 : 8078 1: 4962 1 :14836 1st Qu.: 0.000 2: 8245 1: 7839 1st Qu.: 8800 1st Qu.:5.00
3 : 1 Median :44.0 3 :25080 6 : 772 Median : 0.000 3: 701 Median : 20000 Median :5.00
4 : 1 Mean :46.4 NA's: 22 4 : 747 Mean : 6.462 4: 1332 Mean : 28185 Mean :5.11
5 : 1 3rd Qu.:59.0 3 : 687 3rd Qu.:10.000 5: 8308 3rd Qu.: 36000 3rd Qu.:5.00
6 : 1 Max. :98.0 (Other): 869 Max. :98.000 Max. :3000000 Max. :7.00
(Other):43087 NA's : 102 NA's :262 NA's :730
Height_in Weight_lb
Min. : 0.000 Min. : 62.0
1st Qu.: 3.000 1st Qu.:140.0
Median : 5.000 Median :165.0
Mean : 5.279 Mean :170.6
3rd Qu.: 8.000 3rd Qu.:192.0
Max. :11.000 Max. :500.0
NA's :761 NA's :1376
```

Finally, consider using `na.omit()`

to remove any records with missing values if it won’t cause issues with analysis. The command `na.omit()`

removes a row if any columns have an `NA`

in that row. This can have **drastic** consequences. For example, imagine that you have 7 variables and 6 variables have complete data, but the 7th variable is about three-quarters `NA`

s! Then `na.omit()`

will drop three-quarters of your data, even though most of your variables are complete!

There’s only one advantage I can think of for running `na.omit()`

, and that’s if you have a large dataset and only a few observations have a few `NA`

s. You can’t analyze data with `NA`

s, since any calculation involving `NA`

results in an `NA`

. Therefore imagine two scenarios: (1) An analysis with Variables 1 and 2, each with a few `NA`

distributed throughout; first the observations with `NA`

for either Variable 1 or 2 will be removed, then the analysis will proceed. (2) A similar analysis, but now with Variables 1 and 3, but Variable 3 has `NA`

for different observations than Variable 2 — therefore, different observations will be removed before analysis. In summary, these two analyses will include different observations from the dataset. Therefore, by running `na.omit()`

you will start with a dataset with no `NA`

s, thus every analysis will be of exactly the same set of data (at the cost of not using all the possible data for each individual analysis).

Below, I do **not run na.omit()** since I’m concerned about having lots of

`NA`

s for some variables.`dim(nesarc_sub)`

`[1] 43093 13`

```
# remove rows with missing values
#nesarc_sub <- na.omit(nesarc_sub)
dim(nesarc_sub)
```

`[1] 43093 13`

`summary(nesarc_sub)`

```
ID Sex Age SmokingStatus NicotineDependence SmokingFreq DailyCigsSmoked Ethnicity Depression IncomePersonal Height_ft
1 : 1 1:18518 Min. :18.0 1 : 9913 0:38131 7 :25080 Min. : 0.000 1:24507 0:35254 Min. : 0 Min. :4.00
2 : 1 2:24575 1st Qu.:32.0 2 : 8078 1: 4962 1 :14836 1st Qu.: 0.000 2: 8245 1: 7839 1st Qu.: 8800 1st Qu.:5.00
3 : 1 Median :44.0 3 :25080 6 : 772 Median : 0.000 3: 701 Median : 20000 Median :5.00
4 : 1 Mean :46.4 NA's: 22 4 : 747 Mean : 6.462 4: 1332 Mean : 28185 Mean :5.11
5 : 1 3rd Qu.:59.0 3 : 687 3rd Qu.:10.000 5: 8308 3rd Qu.: 36000 3rd Qu.:5.00
6 : 1 Max. :98.0 (Other): 869 Max. :98.000 Max. :3000000 Max. :7.00
(Other):43087 NA's : 102 NA's :262 NA's :730
Height_in Weight_lb
Min. : 0.000 Min. : 62.0
1st Qu.: 3.000 1st Qu.:140.0
Median : 5.000 Median :165.0
Mean : 5.279 Mean :170.6
3rd Qu.: 8.000 3rd Qu.:192.0
Max. :11.000 Max. :500.0
NA's :761 NA's :1376
```

Now we (would) have a dataset with no missing values.

`DaysSmoke`

estimates the days per month a subject smokes by converting `SmokingFreq`

(a factor with 6 levels) to a numeric variable using `as.numeric()`

and multiplying by the midpoint of the range of `SmokingFreq`

.

```
<-
nesarc_sub %>%
nesarc_sub mutate(
DaysSmoke =
case_when(
== 1 ~ 30 # 1. Every day
SmokingFreq == 2 ~ 4 * 5.5 # 2. 5 to 6 Day(s) a week
, SmokingFreq == 3 ~ 4 * 3.5 # 3. 3 to 4 Day(s) a week
, SmokingFreq == 4 ~ 4 * 1.5 # 4. 1 to 2 Day(s) a week
, SmokingFreq == 5 ~ 2.5 # 5. 2 to 3 Day(s) a month
, SmokingFreq == 6 ~ 1 # 6. Once a month or less
, SmokingFreq == 7 ~ 0 # 7. Never
, SmokingFreq
)
)
summary(nesarc_sub$DaysSmoke)
```

```
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.00 0.00 0.00 10.96 30.00 30.00 102
```

`Height_inches`

is the height of a person in inches. Since the data was collected in two variables (`Height_ft`

and `Height_in`

) we need to add them together for our new variable.

```
<-
nesarc_sub %>%
nesarc_sub mutate(
Height_inches = Height_ft * 12 + Height_in
)
summary(nesarc_sub$Height_inches)
```

```
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
48.0 64.0 66.0 66.6 70.0 84.0 761
```

The variable `TotalCigsSmoked`

estimates the monthly number of cigarettes a subject smokes per month by multiplying `DaysSmoke`

times `DailyCigsSmoked`

(the usual quantity smoked per day).

```
<-
nesarc_sub %>%
nesarc_sub mutate(
DaysSmoke_log2 = log2(DaysSmoke)
TotalCigsSmoked = DaysSmoke * DailyCigsSmoked
, TotalCigsSmoked_log2 = ifelse(TotalCigsSmoked > 0, log2(TotalCigsSmoked), NA)
,
)
summary(nesarc_sub$TotalCigsSmoked)
```

```
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.0 0.0 0.0 187.5 300.0 2940.0 297
```

```
<-
nesarc_sub %>%
nesarc_sub mutate(
CigsSmokedFac =
case_when(
== 0 ~ "No smoking (0)"
TotalCigsSmoked > 0) & (TotalCigsSmoked <= 6) ~ "Lowest smoking (1-6)"
, (TotalCigsSmoked > 6) & (TotalCigsSmoked <= 30) ~ "Low smoking (7-30)"
, (TotalCigsSmoked > 30) & (TotalCigsSmoked <= 300) ~ "Medium smoking (31-300)"
, (TotalCigsSmoked > 300) ~ "High smoking (301+)"
, (TotalCigsSmoked
)CigsSmokedFac =
, factor(
CigsSmokedFaclevels = c(
, "No smoking (0)"
"Lowest smoking (1-6)"
, "Low smoking (7-30)"
, "Medium smoking (31-300)"
, "High smoking (301+)"
,
)
)
)
summary(nesarc_sub$TotalCigsSmoked)
```

```
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.0 0.0 0.0 187.5 300.0 2940.0 297
```

`summary(nesarc_sub$CigsSmokedFac)`

```
No smoking (0) Lowest smoking (1-6) Low smoking (7-30) Medium smoking (31-300) High smoking (301+) NA's
25080 1045 1129 6456 9086 297
```

`table(nesarc_sub$CigsSmokedFac)`

```
No smoking (0) Lowest smoking (1-6) Low smoking (7-30) Medium smoking (31-300) High smoking (301+)
25080 1045 1129 6456 9086
```

The variable `SmokingFreq3`

collapses the `SmokingFreq`

from 6 down to 3 categories. This is an example to show you how to do this with your own variables. Some of you will have categorical variables that have a dozen or so categories; having so many categories makes interpretation very difficulty — fewer is easier.

```
## SmokingFreq3 variable
# notice that I'm doing this before I've assigned the labels
# if you do this after assigning labels (with the factor() command), then you'll want to use "as.numeric()" as below
# inspect original variable
table(nesarc_sub$SmokingFreq)
```

```
1 2 3 4 5 6 7
14836 460 687 747 409 772 25080
```

```
# Note: it is sometimes easier to use the numeric "levels" of a factor variable, than the character "labels"
# initialize new variable with NAs
<-
nesarc_sub %>%
nesarc_sub mutate(
SmokingFreq3 =
fct_collapse(
SmokingFreq"1" = c("1", "2", "3") # 1. daily
, "2" = c("4", "5") # 2. weekly
, "3" = "6" # 3. monthly
, "4" = "7" # 7. never
,
)
)
# new variable
table(nesarc_sub$SmokingFreq3)
```

```
1 2 3 4
15983 1156 772 25080
```

`# we will label the new factor variable below, with the others`

How to calculate the duration between two dates.

(You probably won’t need to use this for your project.)

Dates can be tricky in R. The `lubridate`

package makes them much easier.

Below I’m going to create a variable for the person’s age in years based on the difference between the person’s date of birth and the interview date. There is an `AGE`

variable which I’m already using, but if there wasn’t, or I wanted to be more precise about their age (such as, 24.34 years old), then this is a way to do it.

I’m going to create a separate data frame just for this example to show how to do it; if you needed these variables, they should be part of your original subset above. Below, the “C” variables (`CDAY`

) are the interview day, month, and year, and the “DOB” variables (`DOBD`

) are the date of birth day, month, and year.

```
<-
dat_sub %>%
NESARC ::select(
dplyr
AGE
, CDAY
, CMON
, CYEAR
, DOBD
, DOBM
, DOBY
)str(dat_sub)
```

```
'data.frame': 43093 obs. of 7 variables:
$ AGE : num 23 28 81 18 36 34 19 84 29 18 ...
$ CDAY : num 14 12 23 9 18 7 1 18 9 16 ...
$ CMON : num 8 1 11 9 10 9 9 9 1 8 ...
$ CYEAR: num 2001 2002 2001 2001 2001 ...
$ DOBD : num 27 15 28 27 11 28 16 10 7 25 ...
$ DOBM : num 9 12 12 6 9 12 3 11 4 3 ...
$ DOBY : num 1977 1973 1919 1983 1965 ...
```

Combine the date fields together, then convert them into a date object. I’ll do this for both the interview date and the date of birth. I print the head (first 6 observations) after all the operations are complete.

```
<-
dat_sub %>%
dat_sub mutate(
## interview date
# combine day, month, and year into one text string
cdate_text = paste(dat_sub$CDAY, dat_sub$CMON, dat_sub$CYEAR)
# create date object by interpretting the day, month, year with dmy()
cdate_date = dmy(cdate_text)
,
## date of birth
# combine day, month, and year into one text string
dob_text = paste(dat_sub$DOBD, dat_sub$DOBM, dat_sub$DOBY)
, # create date object by interpretting the day, month, year with dmy()
# if any failed to parse, it's because a day, month, or year is out of range
# such as "99" for missing values or refused to answer
# in these cases, the date is NA (missing), which is what we want
dob_date = dmy(dob_text)
,
)head(dat_sub)
```

```
AGE CDAY CMON CYEAR DOBD DOBM DOBY cdate_text cdate_date dob_text dob_date
1 23 14 8 2001 27 9 1977 14 8 2001 2001-08-14 27 9 1977 1977-09-27
2 28 12 1 2002 15 12 1973 12 1 2002 2002-01-12 15 12 1973 1973-12-15
3 81 23 11 2001 28 12 1919 23 11 2001 2001-11-23 28 12 1919 1919-12-28
4 18 9 9 2001 27 6 1983 9 9 2001 2001-09-09 27 6 1983 1983-06-27
5 36 18 10 2001 11 9 1965 18 10 2001 2001-10-18 11 9 1965 1965-09-11
6 34 7 9 2001 28 12 1966 7 9 2001 2001-09-07 28 12 1966 1966-12-28
```

Now that we have the interview date and date of birth as date objects, we calculate the difference to give the person’s age at the time of the interview. Because the difference of dates is in the unit “days”, we convert to years.

```
<-
dat_sub %>%
dat_sub mutate(
age_days = cdate_date - dob_date
age_years = as.numeric(age_days / 365.25) # change from "difftime" to "numeric"
, # difference between the age we calculated and the age in the original dataset
age_diff = age_years - AGE
,
)head(dat_sub)
```

```
AGE CDAY CMON CYEAR DOBD DOBM DOBY cdate_text cdate_date dob_text dob_date age_days age_years age_diff
1 23 14 8 2001 27 9 1977 14 8 2001 2001-08-14 27 9 1977 1977-09-27 8722 days 23.87953 0.87953457
2 28 12 1 2002 15 12 1973 12 1 2002 2002-01-12 15 12 1973 1973-12-15 10255 days 28.07666 0.07665982
3 81 23 11 2001 28 12 1919 23 11 2001 2001-11-23 28 12 1919 1919-12-28 29916 days 81.90554 0.90554415
4 18 9 9 2001 27 6 1983 9 9 2001 2001-09-09 27 6 1983 1983-06-27 6649 days 18.20397 0.20396988
5 36 18 10 2001 11 9 1965 18 10 2001 2001-10-18 11 9 1965 1965-09-11 13186 days 36.10130 0.10130048
6 34 7 9 2001 28 12 1966 7 9 2001 2001-09-07 28 12 1966 1966-12-28 12672 days 34.69405 0.69404517
```

Finally, keep new age variable in your data frame, drop (unselect) the variables you don’t need anymore, and you’re ready to go!

```
# drop the original ones you don't need anymore
# by select(-VAR), it excludes the VAR column
<-
dat_sub %>%
dat_sub select(
-c(CDAY, CMON, CYEAR, DOBD, DOBM, DOBY)
# -c(cdate_text, cdate_date, dob_text, dob_date)
-c(age_days, age_diff)
,
)str(dat_sub)
```

```
'data.frame': 43093 obs. of 6 variables:
$ AGE : num 23 28 81 18 36 34 19 84 29 18 ...
$ cdate_text: chr "14 8 2001" "12 1 2002" "23 11 2001" "9 9 2001" ...
$ cdate_date: Date, format: "2001-08-14" "2002-01-12" "2001-11-23" "2001-09-09" ...
$ dob_text : chr "27 9 1977" "15 12 1973" "28 12 1919" "27 6 1983" ...
$ dob_date : Date, format: "1977-09-27" "1973-12-15" "1919-12-28" "1983-06-27" ...
$ age_years : num 23.9 28.1 81.9 18.2 36.1 ...
```

Now you’ve got an age (or other date object) you can use in your project.

(For this example, I’m going to delete the `dat_sub`

object since it was just for this demonstration.

`rm(dat_sub)`

`summary(nesarc_sub)`

```
ID Sex Age SmokingStatus NicotineDependence SmokingFreq DailyCigsSmoked Ethnicity Depression IncomePersonal Height_ft
1 : 1 1:18518 Min. :18.0 1 : 9913 0:38131 7 :25080 Min. : 0.000 1:24507 0:35254 Min. : 0 Min. :4.00
2 : 1 2:24575 1st Qu.:32.0 2 : 8078 1: 4962 1 :14836 1st Qu.: 0.000 2: 8245 1: 7839 1st Qu.: 8800 1st Qu.:5.00
3 : 1 Median :44.0 3 :25080 6 : 772 Median : 0.000 3: 701 Median : 20000 Median :5.00
4 : 1 Mean :46.4 NA's: 22 4 : 747 Mean : 6.462 4: 1332 Mean : 28185 Mean :5.11
5 : 1 3rd Qu.:59.0 3 : 687 3rd Qu.:10.000 5: 8308 3rd Qu.: 36000 3rd Qu.:5.00
6 : 1 Max. :98.0 (Other): 869 Max. :98.000 Max. :3000000 Max. :7.00
(Other):43087 NA's : 102 NA's :262 NA's :730
Height_in Weight_lb DaysSmoke Height_inches DaysSmoke_log2 TotalCigsSmoked TotalCigsSmoked_log2 CigsSmokedFac SmokingFreq3
Min. : 0.000 Min. : 62.0 Min. : 0.00 Min. :48.0 Min. : -Inf Min. : 0.0 Min. : 0.000 No smoking (0) :25080 1 :15983
1st Qu.: 3.000 1st Qu.:140.0 1st Qu.: 0.00 1st Qu.:64.0 1st Qu.: -Inf 1st Qu.: 0.0 1st Qu.: 7.229 Lowest smoking (1-6) : 1045 2 : 1156
Median : 5.000 Median :165.0 Median : 0.00 Median :66.0 Median : -Inf Median : 0.0 Median : 8.492 Low smoking (7-30) : 1129 3 : 772
Mean : 5.279 Mean :170.6 Mean :10.96 Mean :66.6 Mean : -Inf Mean : 187.5 Mean : 7.888 Medium smoking (31-300): 6456 4 :25080
3rd Qu.: 8.000 3rd Qu.:192.0 3rd Qu.:30.00 3rd Qu.:70.0 3rd Qu.:4.907 3rd Qu.: 300.0 3rd Qu.: 9.229 High smoking (301+) : 9086 NA's: 102
Max. :11.000 Max. :500.0 Max. :30.00 Max. :84.0 Max. :4.907 Max. :2940.0 Max. :11.522 NA's : 297
NA's :761 NA's :1376 NA's :102 NA's :761 NA's :102 NA's :297 NA's :25377
```

Look at summary of dataset to see the state of the labeling before we make changes.

`summary(nesarc_sub)`

```
ID Sex Age SmokingStatus NicotineDependence SmokingFreq DailyCigsSmoked Ethnicity Depression IncomePersonal Height_ft
1 : 1 1:18518 Min. :18.0 1 : 9913 0:38131 7 :25080 Min. : 0.000 1:24507 0:35254 Min. : 0 Min. :4.00
2 : 1 2:24575 1st Qu.:32.0 2 : 8078 1: 4962 1 :14836 1st Qu.: 0.000 2: 8245 1: 7839 1st Qu.: 8800 1st Qu.:5.00
3 : 1 Median :44.0 3 :25080 6 : 772 Median : 0.000 3: 701 Median : 20000 Median :5.00
4 : 1 Mean :46.4 NA's: 22 4 : 747 Mean : 6.462 4: 1332 Mean : 28185 Mean :5.11
5 : 1 3rd Qu.:59.0 3 : 687 3rd Qu.:10.000 5: 8308 3rd Qu.: 36000 3rd Qu.:5.00
6 : 1 Max. :98.0 (Other): 869 Max. :98.000 Max. :3000000 Max. :7.00
(Other):43087 NA's : 102 NA's :262 NA's :730
Height_in Weight_lb DaysSmoke Height_inches DaysSmoke_log2 TotalCigsSmoked TotalCigsSmoked_log2 CigsSmokedFac SmokingFreq3
Min. : 0.000 Min. : 62.0 Min. : 0.00 Min. :48.0 Min. : -Inf Min. : 0.0 Min. : 0.000 No smoking (0) :25080 1 :15983
1st Qu.: 3.000 1st Qu.:140.0 1st Qu.: 0.00 1st Qu.:64.0 1st Qu.: -Inf 1st Qu.: 0.0 1st Qu.: 7.229 Lowest smoking (1-6) : 1045 2 : 1156
Median : 5.000 Median :165.0 Median : 0.00 Median :66.0 Median : -Inf Median : 0.0 Median : 8.492 Low smoking (7-30) : 1129 3 : 772
Mean : 5.279 Mean :170.6 Mean :10.96 Mean :66.6 Mean : -Inf Mean : 187.5 Mean : 7.888 Medium smoking (31-300): 6456 4 :25080
3rd Qu.: 8.000 3rd Qu.:192.0 3rd Qu.:30.00 3rd Qu.:70.0 3rd Qu.:4.907 3rd Qu.: 300.0 3rd Qu.: 9.229 High smoking (301+) : 9086 NA's: 102
Max. :11.000 Max. :500.0 Max. :30.00 Max. :84.0 Max. :4.907 Max. :2940.0 Max. :11.522 NA's : 297
NA's :761 NA's :1376 NA's :102 NA's :761 NA's :102 NA's :297 NA's :25377
```

Informative labels are given to the factor levels. The order of the levels is also rearranged for the variables `SmokingFreq`

, `NicotineDependence`

, and `Sex`

.

```
# Label factors with meaningful names in the same order as the codebook
$Depression <-
nesarc_subfactor(
$Depression
nesarc_sublabels = c("No Depression"
, "Yes Depression"
,
)
)
$SmokingStatus <-
nesarc_subfactor(
$SmokingStatus
nesarc_sublabels = c("Smoked past 12 months"
, "Smoked prior to 12 months"
, "Never smoked"
,
)
)
# Specify the order of the levels to be different from codebook order
$Sex <-
nesarc_subfactor(nesarc_sub$Sex
levels = c( 2
, 1
,
)labels = c( "Female"
, "Male"
,
)
)# check ordering with a frequency table
table(nesarc_sub$Sex)
```

```
Female Male
24575 18518
```

```
$NicotineDependence <-
nesarc_subfactor(nesarc_sub$NicotineDependence
levels = c( 1
, 0
,
)labels = c( "Nicotine Dependence"
, "No Nicotine Dependence"
,
)
)
# Remember to include the new category level from the original `NA` values.
$SmokingFreq <-
nesarc_subfactor(nesarc_sub$SmokingFreq
levels = c( 7
, 6
, 5
, 4
, 3
, 2
, 1
,
)labels = c( "Never"
, "Once a month or less"
, "2 to 3 Days/month"
, "1 to 2 Days/week"
, "3 to 4 Days/week"
, "5 to 6 Days/week"
, "Every Day"
,
)
)
$Ethnicity <-
nesarc_subfactor(nesarc_sub$Ethnicity
# shorter labels are helpful for plots
labels = c( "Cauc"
, "AfAm"
, "NaAm"
, "Asia"
, "Hisp"
,
)# , labels = c("Caucasian"
# , "African American"
# , "Native American"
# , "Asian"
# , "Hispanic"
# )
)
$SmokingFreq3 <-
nesarc_subfactor(nesarc_sub$SmokingFreq3
levels = c( 4
, 3
, 2
, 1
,
)labels = c( "Never"
, "Monthly"
, "Weekly"
, "Daily"
,
)
)
summary(nesarc_sub)
```

```
ID Sex Age SmokingStatus NicotineDependence SmokingFreq DailyCigsSmoked Ethnicity
1 : 1 Female:24575 Min. :18.0 Smoked past 12 months : 9913 Nicotine Dependence : 4962 Never :25080 Min. : 0.000 Cauc:24507
2 : 1 Male :18518 1st Qu.:32.0 Smoked prior to 12 months: 8078 No Nicotine Dependence:38131 Every Day :14836 1st Qu.: 0.000 AfAm: 8245
3 : 1 Median :44.0 Never smoked :25080 Once a month or less: 772 Median : 0.000 NaAm: 701
4 : 1 Mean :46.4 NA's : 22 1 to 2 Days/week : 747 Mean : 6.462 Asia: 1332
5 : 1 3rd Qu.:59.0 3 to 4 Days/week : 687 3rd Qu.:10.000 Hisp: 8308
6 : 1 Max. :98.0 (Other) : 869 Max. :98.000
(Other):43087 NA's : 102 NA's :262
Depression IncomePersonal Height_ft Height_in Weight_lb DaysSmoke Height_inches DaysSmoke_log2 TotalCigsSmoked TotalCigsSmoked_log2
No Depression :35254 Min. : 0 Min. :4.00 Min. : 0.000 Min. : 62.0 Min. : 0.00 Min. :48.0 Min. : -Inf Min. : 0.0 Min. : 0.000
Yes Depression: 7839 1st Qu.: 8800 1st Qu.:5.00 1st Qu.: 3.000 1st Qu.:140.0 1st Qu.: 0.00 1st Qu.:64.0 1st Qu.: -Inf 1st Qu.: 0.0 1st Qu.: 7.229
Median : 20000 Median :5.00 Median : 5.000 Median :165.0 Median : 0.00 Median :66.0 Median : -Inf Median : 0.0 Median : 8.492
Mean : 28185 Mean :5.11 Mean : 5.279 Mean :170.6 Mean :10.96 Mean :66.6 Mean : -Inf Mean : 187.5 Mean : 7.888
3rd Qu.: 36000 3rd Qu.:5.00 3rd Qu.: 8.000 3rd Qu.:192.0 3rd Qu.:30.00 3rd Qu.:70.0 3rd Qu.:4.907 3rd Qu.: 300.0 3rd Qu.: 9.229
Max. :3000000 Max. :7.00 Max. :11.000 Max. :500.0 Max. :30.00 Max. :84.0 Max. :4.907 Max. :2940.0 Max. :11.522
NA's :730 NA's :761 NA's :1376 NA's :102 NA's :761 NA's :102 NA's :297 NA's :25377
CigsSmokedFac SmokingFreq3
No smoking (0) :25080 Never :25080
Lowest smoking (1-6) : 1045 Monthly: 772
Low smoking (7-30) : 1129 Weekly : 1156
Medium smoking (31-300): 6456 Daily :15983
High smoking (301+) : 9086 NA's : 102
NA's : 297
```

```
{r}
ggplot_missing(nesarc_sub)
```

`summary(nesarc_sub)`

```
ID Sex Age SmokingStatus NicotineDependence SmokingFreq DailyCigsSmoked Ethnicity
1 : 1 Female:24575 Min. :18.0 Smoked past 12 months : 9913 Nicotine Dependence : 4962 Never :25080 Min. : 0.000 Cauc:24507
2 : 1 Male :18518 1st Qu.:32.0 Smoked prior to 12 months: 8078 No Nicotine Dependence:38131 Every Day :14836 1st Qu.: 0.000 AfAm: 8245
3 : 1 Median :44.0 Never smoked :25080 Once a month or less: 772 Median : 0.000 NaAm: 701
4 : 1 Mean :46.4 NA's : 22 1 to 2 Days/week : 747 Mean : 6.462 Asia: 1332
5 : 1 3rd Qu.:59.0 3 to 4 Days/week : 687 3rd Qu.:10.000 Hisp: 8308
6 : 1 Max. :98.0 (Other) : 869 Max. :98.000
(Other):43087 NA's : 102 NA's :262
Depression IncomePersonal Height_ft Height_in Weight_lb DaysSmoke Height_inches DaysSmoke_log2 TotalCigsSmoked TotalCigsSmoked_log2
No Depression :35254 Min. : 0 Min. :4.00 Min. : 0.000 Min. : 62.0 Min. : 0.00 Min. :48.0 Min. : -Inf Min. : 0.0 Min. : 0.000
Yes Depression: 7839 1st Qu.: 8800 1st Qu.:5.00 1st Qu.: 3.000 1st Qu.:140.0 1st Qu.: 0.00 1st Qu.:64.0 1st Qu.: -Inf 1st Qu.: 0.0 1st Qu.: 7.229
Median : 20000 Median :5.00 Median : 5.000 Median :165.0 Median : 0.00 Median :66.0 Median : -Inf Median : 0.0 Median : 8.492
Mean : 28185 Mean :5.11 Mean : 5.279 Mean :170.6 Mean :10.96 Mean :66.6 Mean : -Inf Mean : 187.5 Mean : 7.888
3rd Qu.: 36000 3rd Qu.:5.00 3rd Qu.: 8.000 3rd Qu.:192.0 3rd Qu.:30.00 3rd Qu.:70.0 3rd Qu.:4.907 3rd Qu.: 300.0 3rd Qu.: 9.229
Max. :3000000 Max. :7.00 Max. :11.000 Max. :500.0 Max. :30.00 Max. :84.0 Max. :4.907 Max. :2940.0 Max. :11.522
NA's :730 NA's :761 NA's :1376 NA's :102 NA's :761 NA's :102 NA's :297 NA's :25377
CigsSmokedFac SmokingFreq3
No smoking (0) :25080 Never :25080
Lowest smoking (1-6) : 1045 Monthly: 772
Low smoking (7-30) : 1129 Weekly : 1156
Medium smoking (31-300): 6456 Daily :15983
High smoking (301+) : 9086 NA's : 102
NA's : 297
```

**Rubric**

(3 p) For one categorical variable, a barplot is plotted with axis labels and a title. Interpret the plot: describe the relationship between categories you observe.

(3 p) For one numerical variable, a histogram or boxplot is plotted with axis labels and a title. Interpret the plot: describe the distribution (shape, center, spread, outliers).

(2 p) Code missing variables, remove records with missing values, indicate with R output that this was done correctly (e.g.,

`str()`

,`dim()`

,`summary()`

).- Scroll up to sections labeled “(Class 06)”.

(2 p) Label levels of factor variables.

- Scroll up to sections labeled “(Class 06)”.

```
# table() produces a one-variable frequency table
table(nesarc_sub$NicotineDependence)
```

```
Nicotine Dependence No Nicotine Dependence
4962 38131
```

```
# proportions available by passing these frequencies to prop.table()
table(nesarc_sub$NicotineDependence) %>% prop.table()
```

```
Nicotine Dependence No Nicotine Dependence
0.1151463 0.8848537
```

```
<-
tab_nicdep %>%
nesarc_sub group_by(
NicotineDependence%>%
) summarize(
n = n()
#, .groups = "drop_last"
%>%
) mutate(
prop = round(n / sum(n), 3)
%>%
) #arrange(
# desc(n)
#) %>%
ungroup()
tab_nicdep
```

```
# A tibble: 2 x 3
NicotineDependence n prop
<fct> <int> <dbl>
1 Nicotine Dependence 4962 0.115
2 No Nicotine Dependence 38131 0.885
```

```
library(ggplot2)
<- ggplot(data = nesarc_sub, aes(x = NicotineDependence))
p1 <- p1 + geom_bar()
p1 <- p1 + labs(x = "Nicotine Dependence"
p1 y = "Total Number"
, title = "Nicotine Dependence for Adults"
,
)print(p1)
```

**Interpretation:** Among adults (n = 43093), there are about 8 times more people (88%) who are not nicotine dependent compared to those who are (12%).

```
<- ggplot(data = nesarc_sub, aes(x = DailyCigsSmoked))
p <- p + geom_histogram(aes(y = ..density..), boundary = 0, binwidth = 5)
p #p <- p + geom_rug()
<- p + geom_density(alpha = 0.2, fill = "gray50", adjust = 4)
p <- p + labs(x = "Estimated Cigarettes Smoked per Month"
p y = ""
, title = "Monthly cigaretts smoked for Young Smoking Adults"
,
)<- p + theme_bw()
p print(p)
```

`summary(nesarc_sub$DailyCigsSmoked)`

```
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.000 0.000 0.000 6.462 10.000 98.000 262
```

`fivenum(nesarc_sub$DailyCigsSmoked)`

`[1] 0 0 0 10 98`

`median(nesarc_sub$DailyCigsSmoked, na.rm = TRUE)`

`[1] 0`

`IQR(nesarc_sub$DailyCigsSmoked, na.rm = TRUE)`

`[1] 10`

**Interpretation:** Among smokers, the number of cigarettes smoked is right skewed (shape) with a median (center) of 15 and the IQR (spread) (interquartile range, middle 50%) for the distribution is 14. There are two distinct modes (peaks) in the distribution, the first between 0 and 10 cigarettes (half pack) and the second at 20 (full pack). There are several outlying values (outliers) greater than 40 (representing 40 cigarettes per day) that are possibly overestimates from the people responding.

**Rubric**

Each of the following (2 p for plot, 2 p for labeled axes and title, 1 p for interpretation):

Scatter plot (for regression): \(x\) = numerical, \(y\) = numerical, include axis labels and a title. Interpret the plot: describe the relationship.

Box plots (for ANOVA): \(x\) = categorical, \(y\) = numerical, include axis labels and a title. Interpret the plot: describe the relationship.

**Interpretation:** The relationship between `Age`

and `TotalCigsSmoked`

, among people who smoke, is very weakly positive, but does not resemble a straight line. `TotalCigsSmoked`

is right-skewed which results in many extreme values above the regression line.

```
{r}
library(ggplot2)
p <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = Age, y = TotalCigsSmoked))
#p <- p + geom_point(alpha = 1/10)
p <- p + geom_jitter(width = 0.25, height = 5, alpha = 1/10)
p <- p + geom_smooth()
p <- p + stat_smooth(method = lm, colour = "red")
p <- p + labs(
title = "Total Cigarettes Smoked vs Age"
#, x =
, y = "Total Cigarettes Smoked per month"
, caption = "Key: Blue line is GAM smoother, Red line is simple linear regression.\nFiltered to include only smokers (cig > 0)."
)
print(p)
```

An example of height and weight.

```
{r}
library(ggplot2)
p <- ggplot(nesarc_sub, aes(x = Height_inches, y = Weight_lb))
p <- p + geom_jitter(height = 0.5, alpha = 1/10)
p <- p + stat_smooth(method = lm)
p <- p + labs(title = "Weight vs Height")
print(p)
```

**Interpretation:** For NESARC participants who smoke (cig > 0), those who are depressed have a slightly higher mean number of cigarettes per month (red diamond) than those who are not depressed. Total cigarettes smoked per month is right skewed so there are many people who smoke very little but few people who smoke a lot, which increases the mean beyond the median.

```
{r}
library(ggplot2)
p <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = Depression, y = TotalCigsSmoked))
p <- p + geom_boxplot(width = 0.5, alpha = 0.5)
p <- p + geom_jitter(position = position_jitter(width = 0.1), alpha = 1/100)
# diamond at mean for each group
p <- p + stat_summary(fun.y = mean, geom = "point", shape = 18, size = 6,
colour = "red", alpha = 0.8)
p <- p + labs(
title = "Total Cigarettes Smoked by Depression"
, x = "Depression status"
, y = "Total Cigarettes Smoked per month"
, caption = "Red diamond is the mean.\nFiltered to include only smokers (cig > 0)."
)
print(p)
```

**Rubric**

Each of the following (2 p for plot, 2 p for labeled axes and title, 1 p for interpretation):

Mosaic plot or bivariate bar plots (for contingency tables): \(x\) = categorical, \(y\) = categorical, include axis labels and a title. Interpret the plot: describe the relationship.

Logistic scatter plot (for logistic regression): \(x\) = numerical, \(y\) = categorical (binary), include axis labels and a title. Interpret the plot: describe the relationship.

```
<-
tab_dep_nic %>%
nesarc_sub group_by(
Depression, NicotineDependence%>%
) summarize(
n = n()
%>%
) mutate(
prop = round(n / sum(n), 3)
%>%
) ungroup()
tab_dep_nic
```

```
# A tibble: 4 x 4
Depression NicotineDependence n prop
<fct> <fct> <int> <dbl>
1 No Depression Nicotine Dependence 3193 0.091
2 No Depression No Nicotine Dependence 32061 0.909
3 Yes Depression Nicotine Dependence 1769 0.226
4 Yes Depression No Nicotine Dependence 6070 0.774
```

```
<-
tab_nic_dep %>%
nesarc_sub group_by(
NicotineDependence, Depression%>%
) summarize(
n = n()
%>%
) mutate(
prop = round(n / sum(n), 3)
%>%
) ungroup()
tab_nic_dep
```

```
# A tibble: 4 x 4
NicotineDependence Depression n prop
<fct> <fct> <int> <dbl>
1 Nicotine Dependence No Depression 3193 0.643
2 Nicotine Dependence Yes Depression 1769 0.357
3 No Nicotine Dependence No Depression 32061 0.841
4 No Nicotine Dependence Yes Depression 6070 0.159
```

**Reversing color order:** It is sometimes preferable to reverse the order of your color-filled variable to read the more interesting proportions from the bottom of the plot up the proportion scale. Use the function `forcats::fct_rev()`

on your `fill=`

variable.

```
library(ggplot2)
<- ggplot(data = nesarc_sub, aes(x = Depression, fill = forcats::fct_rev(NicotineDependence)))
p <- p + geom_bar(position = "fill")
p <- p + theme_bw()
p <- p + labs(x = "Depression status"
p y = "Proportion"
, title = "Proportion of young adult smokers with and without\nnicotine dependence by depression status"
,
)# the legend title can be modified, if desired (try this line)
<- p + scale_fill_discrete(name = "Nicotine Dependence\nStatus")
p print(p)
```

**Interpretation:** When a person has depression, they are more than twice as likely to have nicotine dependence (22.6%) than those without depression (9.1%).

`table(nesarc_sub$NicotineDependence)`

```
Nicotine Dependence No Nicotine Dependence
4962 38131
```

```
# if "Nicotine Dependence", then code it as a 1, otherwise code as a 0
$NicotineDependence01 <- ifelse(nesarc_sub$NicotineDependence == "Nicotine Dependence", 1, 0)
nesarc_sub# check that the frequencies are correct before and after
table(nesarc_sub$NicotineDependence01)
```

```
0 1
38131 4962
```

```
# # Note, sometimes it is necessary to make your variable a character type first.
# # If the code above doesn't produce both 0s and 1s, then try this:
# nesarc_sub$NicotineDependence01 <- ifelse(as.character(nesarc_sub$NicotineDependence) == "Nicotine Dependence", 1, 0)
```

```
{r, fig.height = 4, fig.width = 8}
library(ggplot2)
p <- ggplot(nesarc_sub, aes(x = TotalCigsSmoked, y = NicotineDependence01))
p <- p + theme_bw()
p <- p + geom_jitter(position = position_jitter(height = 0.1), alpha = 1/4)
# overlay a sigmodal curve to indicate direction of relationship
# As of ggplot2 2.1.0, need to use this binomial_smooth() function:
binomial_smooth <- function(...) {
geom_smooth(method = "glm", method.args = list(family = "binomial"), ...)
}
p <- p + binomial_smooth() # old way: stat_smooth(method="glm", family="binomial", se=FALSE)
p <- p + scale_y_continuous(breaks = seq(0, 1, by=0.2), minor_breaks = seq(0, 1, by=0.1))
p <- p + labs(x = "Estimated Cigarettes Smoked per Month"
, y = "Nicotine Dependence (1=Yes, 0=No)"
, title = "Likelihood of Nicotine Dependence\nincreases with number of cigarettes smoked"
)
print(p)
```

**Interpretation:** The probability of someone being tobacco dependent increases as the number of cigarettes smoked per month increases. The logistic curve model probably doesn’t apply at (or near) 0 cigarettes.

**Rubric**

- With your previous (or new) bivariate scatter plot, add a regression line.
- (2 p) plot with regression line,
- (1 p) label axes and title.

- Use
`lm()`

to fit the linear regression and interpret slope and \(R^2\) (R-squared) values.- (2 p) lm
`summary()`

table is presented, - (2 p) slope is interpreted with respect to a per-unit increase of the \(x\) variable in the context of the variables in the plot,
- (2 p) \(R^2\) is interpretted in a sentence.

- (2 p) lm
- (1 p) Interpret the intercept. Does it make sense in the context of your study?

```
{r}
library(ggplot2)
p <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = Age, y = TotalCigsSmoked))
#p <- p + geom_point(alpha = 1/10)
p <- p + geom_jitter(width = 0.25, height = 5, alpha = 1/10)
#p <- p + geom_smooth()
p <- p + stat_smooth(method = lm, fullrange = TRUE)
p <- p + labs(
title = "Total Cigarettes Smoked vs Age"
#, x =
, y = "Total Cigarettes Smoked per month"
, caption = "Key: Blue line is simple linear regression.\nFiltered to include only smokers (cig > 0)."
)
p <- p + xlim(c(0, NA))
print(p)
```

```
# fit the simple linear regression model
<-
lm_TCS_Age lm(
formula = TotalCigsSmoked ~ Age
data = nesarc_sub
,
)# use summary() to parameters estimates (slope, intercept) and other summaries
summary(lm_TCS_Age)
```

```
Call:
lm(formula = TotalCigsSmoked ~ Age, data = nesarc_sub)
Residuals:
Min 1Q Median 3Q Max
-332.1 -189.4 -139.1 106.3 2828.9
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 57.92792 4.35989 13.29 <2e-16 ***
Age 2.79797 0.08762 31.93 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 329.2 on 42794 degrees of freedom
(297 observations deleted due to missingness)
Multiple R-squared: 0.02327, Adjusted R-squared: 0.02325
F-statistic: 1020 on 1 and 42794 DF, p-value: < 2.2e-16
```

**Slope**: For every 1 year increase in Age, we expect an increase of 2.798 total cigarettes smoked per month.**\(R^2\)**: The proportion of variance explained by the regression model, compared to the grand mean, is \(R^2 = 0.0233\), which is very small.

For someone 0 years old, the expected number of cigarettes smoked per month is 57.93.

This does not make sense since newborns do not smoke and because this is a large extrapolation from the data.

**Rubric**

- Try plotting the data on a logarithmic scale
- (6 p) Each of the logarithmic relationships is plotted, axes are labeled with scale.

- original scales
- \(\log(x)\)-only
- \(\log(y)\)-only
- both \(\log(x)\) and \(\log(y)\)

- What happened to your data when you transformed it?
- (2 p) Describe what happened to the relationship after each log transformation (compare transformed scale to original scale; is the relationship more linear, more curved?).
- (1 p) Choose the best scale for a linear relationship and explain why.
- (1 p) Does your relationship benefit from a logarithmic transformation? Say why or why not.

```
{r, fig.height = 8, fig.width = 8}
# IncomePersonal
#p <- p + scale_x_log10()
#p <- p + scale_y_log10()
library(ggplot2)
p1 <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = Age, y = TotalCigsSmoked))
p1 <- p1 + theme_bw()
p1 <- p1 + geom_point(alpha = 1/20)
p1 <- p1 + stat_smooth(method = lm)
p1 <- p1 + labs(
title = "Original scale"
, x = "Age"
, y = "Total Cigarettes Smoked per month"
)
#print(p1)
library(ggplot2)
p2 <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = Age, y = TotalCigsSmoked))
p2 <- p2 + theme_bw()
p2 <- p2 + geom_point(alpha = 1/20)
p2 <- p2 + stat_smooth(method = lm)
p2 <- p2 + scale_x_log10()
p2 <- p2 + labs(
title = "Log(x)"
, x = "Age"
, y = "Total Cigarettes Smoked per month"
)
#print(p2)
library(ggplot2)
p3 <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = Age, y = TotalCigsSmoked))
p3 <- p3 + theme_bw()
p3 <- p3 + geom_point(alpha = 1/20)
p3 <- p3 + stat_smooth(method = lm)
p3 <- p3 + scale_y_log10()
p3 <- p3 + labs(
title = "Log(y)"
, x = "Age"
, y = "Total Cigarettes Smoked per month"
)
#print(p3)
library(ggplot2)
p4 <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = Age, y = TotalCigsSmoked))
p4 <- p4 + theme_bw()
p4 <- p4 + geom_point(alpha = 1/20)
p4 <- p4 + stat_smooth(method = lm)
p4 <- p4 + scale_x_log10()
p4 <- p4 + scale_y_log10()
p4 <- p4 + labs(
title = "Log(x) and Log(y)"
, x = "Age"
, y = "Total Cigarettes Smoked per month"
)
#print(p4)
# grid.arrange() is a way to arrange several ggplot objects
library(gridExtra)
grid.arrange(grobs = list(p1, p2, p3, p4), nrow=2, top = "Total Cigarettes Smoked vs Age")
```

- Describe what happened to the relationship after each log transformation (compare transformed scale to original scale).

- Original: Same
- log(x): no improvement, age values became left skewed
- log(y): some improvement, cigs smoked became left skewed, but the regression line was closer to the center of the data.
- log(x) and log(y): some improvement, cigs smoked became left skewed, but the regression line was closer to the center of the data; age transformation doesn’t help.

- Choose the best scale for a linear relationship and explain why.

log(y) because the cigs smoked became slightly left skewed (better than the extreme right skewness from before), and there was no need for an age transformation.

- Does your relationship benefit from a logarithmic transformation? Say why or why not.

A little bit since the degree of skewness vertically around the regression line was reduced with the log(y) transformation.

Here’s an example where transforming both \(x\) and \(y\) with the log makes a big improvement.

```
{r, fig.height = 8, fig.width = 8}
# IncomePersonal
#p <- p + scale_x_log10()
#p <- p + scale_y_log10()
library(ggplot2)
p1 <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = IncomePersonal, y = TotalCigsSmoked))
p1 <- p1 + theme_bw()
p1 <- p1 + geom_point(alpha = 1/20)
p1 <- p1 + stat_smooth(method = lm)
p1 <- p1 + labs(
title = "Original scale"
, x = "Personal Income"
, y = "Total Cigarettes Smoked per month"
)
#print(p1)
library(ggplot2)
p2 <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = IncomePersonal, y = TotalCigsSmoked))
p2 <- p2 + theme_bw()
p2 <- p2 + geom_point(alpha = 1/20)
p2 <- p2 + stat_smooth(method = lm)
p2 <- p2 + scale_x_log10()
p2 <- p2 + labs(
title = "Log(x)"
, x = "Personal Income"
, y = "Total Cigarettes Smoked per month"
)
#print(p2)
library(ggplot2)
p3 <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = IncomePersonal, y = TotalCigsSmoked))
p3 <- p3 + theme_bw()
p3 <- p3 + geom_point(alpha = 1/20)
p3 <- p3 + stat_smooth(method = lm)
p3 <- p3 + scale_y_log10()
p3 <- p3 + labs(
title = "Log(y)"
, x = "Personal Income"
, y = "Total Cigarettes Smoked per month"
)
#print(p3)
library(ggplot2)
p4 <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = IncomePersonal, y = TotalCigsSmoked))
p4 <- p4 + theme_bw()
p4 <- p4 + geom_point(alpha = 1/20)
p4 <- p4 + stat_smooth(method = lm)
p4 <- p4 + scale_x_log10()
p4 <- p4 + scale_y_log10()
p4 <- p4 + labs(
title = "Log(x) and Log(y)"
, x = "Personal Income"
, y = "Total Cigarettes Smoked per month"
)
#print(p4)
# grid.arrange() is a way to arrange several ggplot objects
library(gridExtra)
grid.arrange(grobs = list(p1, p2, p3, p4), nrow=2, top = "Total Cigarettes Smoked vs Personal Income")
```

**Rubric**

- With your previous (or a new) bivariate scatter plot, calculate the correlation and interpret.
- (1 p) plot is repeated here or the plot is referenced an easy to find from a plot above,
- (1 p) correlation is calculated,
- (2 p) correlation is interpreted (direction, strength of LINEAR relationship).

- With your previous (or a new) two- or three-variable categorical plot, calculate conditional proportions and interpret.
- (1 p) frequency table of variables is given,
- (2 p) conditional proportion tables are calculated of the outcome variable conditional on one or two other variables,
- (1 p) a well-labeled plot of the proportion table is given,
- (2 p) the conditional proportions are interpreted and compared between conditions.

```
{r, fig.height = 5, fig.width = 5}
library(ggplot2)
p <- ggplot(nesarc_sub, aes(x = Age, y = TotalCigsSmoked_log2))
p <- p + theme_bw()
p <- p + geom_point(alpha = 1/20)
p <- p + stat_smooth(method = lm)
p <- p + labs(
title = "Log2 Cigarettes smoked vs Age"
, x = "Age"
, y = "Log2 Total Cigarettes Smoked per month"
)
print(p)
```

```
<-
cor_A_C cor(
$Age
nesarc_sub$TotalCigsSmoked_log2
, nesarc_subuse = "complete.obs"
,
) cor_A_C
```

`[1] 0.1375481`

The correlation is 0.138, this is a positive moderately-weak linear relationship between Age and Log2 total cigarettes smoked per month, meaning that as people get older they tend slightly to increase the Log2 Total Number of Cigarettes smoked per month.

```
<-
tab_S_D_N %>%
nesarc_sub group_by(
Sex
, Depression
, NicotineDependence%>%
) summarise(
Frequency = n()
%>%
) mutate(
Proportion = round(Frequency / sum(Frequency), 3)
%>%
) ungroup()
# tab_S_D_N
%>%
tab_S_D_N ::kable() knitr
```

Sex | Depression | NicotineDependence | Frequency | Proportion |
---|---|---|---|---|

Female | No Depression | Nicotine Dependence | 1455 | 0.076 |

Female | No Depression | No Nicotine Dependence | 17706 | 0.924 |

Female | Yes Depression | Nicotine Dependence | 1117 | 0.206 |

Female | Yes Depression | No Nicotine Dependence | 4297 | 0.794 |

Male | No Depression | Nicotine Dependence | 1738 | 0.108 |

Male | No Depression | No Nicotine Dependence | 14355 | 0.892 |

Male | Yes Depression | Nicotine Dependence | 652 | 0.269 |

Male | Yes Depression | No Nicotine Dependence | 1773 | 0.731 |

```
library(ggplot2)
<- ggplot(data = tab_S_D_N, aes(x = Depression, y = Proportion, fill = forcats::fct_rev(NicotineDependence)))
p <- p + geom_hline(yintercept = c(0, 1), alpha = 1/4)
p <- p + geom_bar(stat="identity")
p <- p + theme_bw()
p <- p + labs(
p title = "Proportion of Nicotine Dependence by Depression status\nfor each Gender"
y = "Proportion of Nicotine Dependence"
, fill = "Nicotine Dependence"
,
)<- p + scale_y_continuous(limits = c(0, 1))
p <- p + facet_wrap(~ Sex)
p <- p + theme(legend.position = "bottom")
p print(p)
```

- For Females without depression, the proportion who are nicotine dependent is 0.076, while for those with depression it is 0.206 (nearly 3 times as much).
- For Males without depression, the proportion who are nicotine dependent is 0.108, while for those with depression it is 0.269 (nearly 3 times as much).

Males generally are more nicotine dependent than Females, but both are nearly 3 times more likely to be nicotine dependent if they are depressed compared to if they are not depressed.

**Rubric**

- Using a numerical variable, calculate and interpret a confidence interval for the population mean.
- (1 p) Identify and describe the variable,
- (1 p) use
`t.test()`

to calculate the mean and confidence interval, and - (1 p) interpret the confidence interval.
- (2 p) Using plotting code from the last two classes, plot the data, estimate, and confidence interval in a single well-labeled plot.

- Using a two-level categorical variable, calculate and interpret a confidence interval for the population proportion.
- (1 p) Identify and describe the variable,
- (1 p) use
`binom.test()`

to calculate the sample proportion and confidence interval, and - (1 p) interpret the confidence interval.
- (2 p) Using plotting code from the last two classes, plot the data, estimate, and confidence interval in a single well-labeled plot.

`TotalCigsSmoked_log2`

is the estimated number of cigarettes per month a subject smokes (`DaysSmoke`

* `DailyCigsSmoked`

) on the log2 scale.

```
# calculate mean and CI
<-
t_result t.test(
$TotalCigsSmoked_log2
nesarc_sub
) t_result
```

```
One Sample t-test
data: nesarc_sub$TotalCigsSmoked_log2
t = 457.94, df = 17715, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
7.854201 7.921726
sample estimates:
mean of x
7.887963
```

**Interpretation:**

- The mean of
`TotalCigsSmoked_log2`

is 7.89 and the 95% confidence interval is (7.85, 7.92). - In 95% of samples, a confidence interval constructed from the data will contain the true population mean. In our case, our CI was (7.85, 7.92). It either contains the true population mean or it doesn’t, but we don’t know.

```
# plot data and create table
library(ggplot2)
<- ggplot(nesarc_sub, aes(x = TotalCigsSmoked_log2))
p <- p + geom_histogram(binwidth = 0.5, alpha = 1)
p #p <- p + geom_rug(alpha = 1/8)
# est
<- p + geom_vline(aes(xintercept = as.numeric(t_result$estimate)), colour = "blue", size = 1, linetype = 3)
p <- p + geom_rect(aes(
p xmin = t_result$conf.int[1]
xmax = t_result$conf.int[2]
, ymin = -150
, ymax = 0)
, fill = "blue", alpha = 1)
, <- p + labs(title = "Log2 Total Cigs Smoked\nblue = est +- 95% CI")
p print(p)
```

`Depression`

is a Yes/No variable indicating that the person has major depression (lifetime).

```
# proportions are of the last group_by() variable within combinations of the other variables
<-
tab_dep_long %>%
nesarc_sub group_by(Depression) %>%
summarise(Frequency = n()) %>%
mutate(Proportion = Frequency / sum(Frequency)) %>%
ungroup()
tab_dep_long
```

```
# A tibble: 2 x 3
Depression Frequency Proportion
<fct> <int> <dbl>
1 No Depression 35254 0.818
2 Yes Depression 7839 0.182
```

```
# prop.test() is an asymptotic (approximate) test for a binomial random variable
<-
p_summary prop.test(
x = tab_dep_long %>% filter(Depression == "Yes Depression") %>% pull(Frequency)
n = sum(tab_dep_long$Frequency)
, #, conf.level = 0.95
) p_summary
```

```
1-sample proportions test with continuity correction
data: tab_dep_long %>% filter(Depression == "Yes Depression") %>% pull(Frequency) out of sum(tab_dep_long$Frequency), null probability 0.5
X-squared = 17440, df = 1, p-value < 2.2e-16
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
0.1782835 0.1855912
sample estimates:
p
0.1819089
```

**Interpretation:**

- The proportion of people who have
`Depression`

is 0.182 and the 95% confidence interval is (0.178, 0.186). - In 95% of samples, a confidence interval constructed from the data will contain the true population proportion In our case, our CI was (0.178, 0.186). It either contains the true population proportion or it doesn’t, but we don’t know.

```
{r}
library(ggplot2)
p <- ggplot(data = tab_dep_long %>% filter(Depression == "Yes Depression"), aes(x = Depression, y = Proportion))
p <- p + geom_hline(yintercept = c(0, 1), alpha = 1/4)
#p <- p + geom_bar(stat = "identity")
p <- p + geom_errorbar(aes(min = p_summary$conf.int[1], max = p_summary$conf.int[2]), width=0.25)
p <- p + geom_point(colour = "red", size = 6, pch = 18)
p <- p + scale_y_continuous(limits = c(0.17, 0.19))
# swap axes to take less vertical space
#p <- p + coord_flip()
print(p)
```

Set up the

**null and alternative hypotheses**in words and notation.- In words: ``The population mean for [what is being studied] is different from [value of \(\mu_0\)].’’ (Note that the statement in words is in terms of the alternative hypothesis.)
- In notation: \(H_0: \mu=\mu_0\) versus \(H_A: \mu \ne \mu_0\) (where \(\mu_0\) is specified by the context of the problem).

Choose the

**significance level**of the test, such as \(\alpha=0.05\).Compute the

**test statistic**, such as \(t_{s} = \frac{\bar{Y}-\mu_0}{SE_{\bar{Y}}}\), where \(SE_{\bar{Y}}=s/\sqrt{n}\) is the standard error.Determine the

**tail(s)**of the sampling distribution where the**\(p\)-value**from the test statistic will be calculated (for example, both tails, right tail, or left tail). (Historically, we would compare the observed test statistic, \(t_{s}\), with the**critical value**\(t_{\textrm{crit}}=t_{\alpha/2}\) in the direction of the alternative hypothesis from the \(t\)-distribution table with degrees of freedom \(df = n-1\).)State the

**conclusion**in terms of the problem.- Reject \(H_0\) in favor of \(H_A\) if \(p\textrm{-value} < \alpha\).
- Fail to reject \(H_0\) if \(p\textrm{-value} \ge \alpha\). (Note: We DO NOT
*accept*\(H_0\).)

**Check assumptions**of the test (for now we skip this).

Adapted from **Significance Magazine**.

Recent calls have been made to abandon the term “statistical significance”. The American Statistical Association (ASA) issued its statement and recommendation on p-values (see the special issue of p-values for more).

In summary, the problem of “significance” is one of misuse, misunderstanding, and misinterpretation. The recommendation in this class is that it is no longer sufficient to say that a result is “statistically significant” or “non-significant” depending on whether a p-value is less than a threshold. Instead, we will be looking for wording as in the following paragraph.

“The difference between the two groups turns out to be small (8%), while the probability (\(p\)) of observing a result at least as extreme as this under the null hypothesis of no difference between the two groups is \(p = 0.003\) (that is, 0.3%). This p-value is statistically significant as it is below our pre-defined threshold (\(p < 0.05\)). However, the p-value tells us only that the 8% difference between the two groups is somewhat unlikely given our hypothesis and null model’s assumptions. More research is required, or other considerations may be needed, to conclude that the difference is of practical importance and reproducible.”

**Rubric**

- Using a numerical response variable and a two-level categorical variable (or a categorical variable you can reduce to two levels), specify a two-sample \(t\)-test associated with your research questions.
- (2 p) Specify the hypotheses in words and notation (either one- or two-sided test),
- (0 p) use
`t.test()`

to calculate the mean, test statistic, and p-value, - (2 p) state the significance level, test statistic, and p-value, and
- (2 p) state the conclusion in the context of the problem.
- (1 p) Given your conclusion, could you have committed at Type-I or Type-II error?
- (2 p) Provide an appropriate plot of the data and sample estimates in a well-labeled plot.
- (1 p) Check the assumptions of the test using the bootstrap and interpret the bootstrap sampling distribution plot.

- Set up the
**null and alternative hypotheses**in words and notation.- In words: “The population mean total cigarettes smoked (
`TotalCigsSmoked`

) is greater for people with depression than for those without depression (`Depression`

).” - In notation: \(H_0: \mu_D = \mu_N\) versus \(H_A: \mu_D > \mu_N\).

- In words: “The population mean total cigarettes smoked (

```
<-
t_summary_TCS_D t.test(
~ Depression
TotalCigsSmoked data = nesarc_sub
, alternative = "less"
,
) t_summary_TCS_D
```

```
Welch Two Sample t-test
data: TotalCigsSmoked by Depression
t = -17.226, df = 10439, p-value < 2.2e-16
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
-Inf -71.94239
sample estimates:
mean in group No Depression mean in group Yes Depression
173.0296 252.5677
```

Choose the

**significance level**of the test, such as \(\alpha=0.05\).Compute the

**test statistic**, such as \(t_{s} = -17.2\).**\(p\)-value**is \(p = 6.85\times 10^{-66}\).State the

**conclusion**in terms of the problem.- Reject \(H_0\) in favor of \(H_A\) concluding that the population mean total cigarettes smoked (
`TotalCigsSmoked`

) is greater for people with depression than for those without depression (`Depression`

).

- Reject \(H_0\) in favor of \(H_A\) concluding that the population mean total cigarettes smoked (

- Because we rejected the Null hypothesis, we could have made a Type I error.

```
## If we create a summary data.frame with a similar structure as our data, then we
## can annotate our plot with those summaries.
<-
est_mean_TCS_D %>%
nesarc_sub group_by(Depression) %>%
summarise(TotalCigsSmoked = mean(TotalCigsSmoked, na.rm = TRUE)) %>%
ungroup()
est_mean_TCS_D
```

```
# A tibble: 2 x 2
Depression TotalCigsSmoked
<fct> <dbl>
1 No Depression 173.
2 Yes Depression 253.
```

```
{R, fig.width=4, fig.height=6}
library(ggplot2)
p <- ggplot(nesarc_sub, aes(x = Depression, y = TotalCigsSmoked))
p <- p + geom_boxplot(width = 0.5, alpha = 0.5)
p <- p + geom_jitter(position = position_jitter(width = 0.1), alpha = 1/4)
# diamond at mean for each group
p <- p + stat_summary(fun = mean, geom = "point", shape = 18, size = 6,
colour = "red", alpha = 0.8)
p <- p + labs(
title = "Total Cigarettes Smoked by Depression"
, x = "Depression status"
, y = "Total cigarettes smoked per month"
)
print(p)
```

```
# histogram using ggplot
<- ggplot(nesarc_sub, aes(x = TotalCigsSmoked))
p <- p + geom_histogram(binwidth = 40)
p #p <- p + geom_rug()
<- p + geom_vline(data = est_mean_TCS_D, aes(xintercept = TotalCigsSmoked), colour = "red")
p <- p + facet_grid(Depression ~ .)
p <- p + labs(
p title = "Total Cigarettes Smoked by Depression"
x = "Depression status"
, y = "Total cigarettes smoked per month"
,
)print(p)
```