```
# I set some GLOBAL R chunk options here.
# (to hide this message add "echo=FALSE" to the code chunk options)
# In particular, see the fig.height and fig.width (in inches)
# and notes about the cache option.
::opts_chunk$set(comment = NA, message = FALSE, warning = FALSE, width = 100)
knitr::opts_chunk$set(fig.align = "center", fig.height = 4, fig.width = 6)
knitr
# Note: The "cache=TRUE" option will save the computations of code chunks
# so R it doesn't recompute everything every time you recompile.
# This can save _tons of time_ if you're working on a small section of code
# at the bottom of the document.
# Code chunks will be recompiled only if they are edited.
# The autodep=TRUE will also update dependent code chunks.
# A folder is created with the cache in it -- if you delete the folder, then
# all the code chunks will recompute again.
# ** If things are not working as expected, or I want to freshly compute everything,
# I delete the *_cache folder.
::opts_chunk$set(cache = FALSE) #, autodep=TRUE) #$ knitr
```

# ADA1: Cumulative project file

NESARC, Smoking and depression in adults

Advanced Data Analysis 1, Stat 427/527, Fall 2022, Prof. Erik Erhardt, UNM

# 1 Document overview

This document is organized by Class number. The worksheet assignments are indicated by the Tuesday and Thursday Class numbers.

Consider your readers (graders):

- organize the document clearly (use this document as an example)
- label minor sections under each day (use this document as an example)
- For each thing you do, always have these three parts:
- Say what you’re going to do and why.
- Do it with code, and document your code.
- Interpret the results.

## 1.1 Global code options

## 1.2 Document

### 1.2.1 Naming

*Note: Each class save this file with a new name, updating the last two digits to the class number. Then, you’ll have a record of your progress, as well as which files you turned in for grading.*

`ADA1_ALL_05.Rmd`

`ADA1_ALL_06.Rmd`

`ADA1_ALL_07.Rmd`

…

A version that I prefer is to use a date using Year-Month-Day, `YYYYMMDD`

:

`ADA1_ALL_20200903.Rmd`

`ADA1_ALL_20200905.Rmd`

`ADA1_ALL_20200910.Rmd`

…

### 1.2.2 Structure

We will include all of our assignments together in this document to retain the relevant information needed for subsequent assignments since our analysis is cumulative. You will also have an opportunity to revisit previous parts to make changes or improvements, such as updating your codebook, recoding variables, and improving tables and plots. I’ve provided an initial predicted organization of our sections and subsections using the # and ## symbols. A table of contents is automatically generated using the “toc: true” in the yaml and can headings in the table of contents are clickable to jump down to each (sub)section.

### 1.2.3 Classes not appearing in this document

These class assignments are in their own documents: 09, 11, 13, 14, …

# 2 Research Questions

## 2.1 Class 04, Personal Codebook

**Rubric**

(1 p) Is there a topic of interest?

(2 p) Are the variables relevant to a set of research questions?

(4 p) Are there at least 2 categorical and 2 numerical variables (at least 4 “data” variables)?

- 1 categorical variable with only 2 levels
- 1 categorical variable with at least 3 levels
- 2 numerical variables with many possible unique values
- More variables are welcome and you’re likely to add to this later in the semester

(3 p) For each variable, is there a variable description, a data type, and coded value descriptions?

Compile this Rmd file to an html, print/save to pdf, and upload to UNM Learn.

### 2.1.1 Topic and research questions

**Topic:**

Is there an association between nicotine dependence and the frequency and quantity of smoking in adults? The association may differ by ethnicity, age, gender, and other factors (though we won’t be able to test these additional associations until next semester in ADA2).

**Research questions:**

Is nicotine dependence [S3AQ10D] associated with smoking frequency [S3AQ3B1] and quantity [S3AQ3C1]?

Is nicotine dependence [S3AQ10D] associated with depression [S4AQ1]?

Is the associated between nicotine dependence [S3AQ10D] and depression [S4AQ1] different by demographics, such as Education or Sex?

# 3 Codebook

National Epidemiologic Survey on Alcohol and Related Conditions-III (NESARC-III)

- Codebook: https://statacumen.com/teach/ADA1/PDS_data/NESARC_W1_CodeBook.pdf
- Official site: https://www.niaaa.nih.gov/research/nesarc-iii
- Introduction: https://pubs.niaaa.nih.gov/publications/arh29-2/74-78.htm

```
Dataset: NESARC
Primary association: nicotine dependence vs frequency and quantity of smoking
Key:
RenamedVarName
VarName original in dataset
Variable description
Data type (Continuous, Discrete, Nominal, Ordinal)
Frequency ItemValue Description
ID
IDNUM
UNIQUE ID NUMBER WITH NO ALPHABETICS
Nominal
43093 1-43093. Unique Identification number
Sex
SEX
SEX
Nominal
18518 1. Male
24575 2. Female
Age
AGE
AGE
Continuous
43079 18-97. Age in years
14 98. 98 years or older
SmokingStatus
CHECK321
CIGARETTE SMOKING STATUS
Nominal
9913 1. Smoked cigarettes in the past 12 months
8078 2. Smoked cigarettes prior to the last 12 months
22 9. Unknown
25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
NicotineDependence
TAB12MDX
NICOTINE DEPENDENCE IN THE LAST 12 MONTHS
Nominal
38131 0. No nicotine dependence
4962 1. Nicotine dependence
SmokingFreq
S3AQ3B1
USUAL FREQUENCY WHEN SMOKED CIGARETTES
Ordinal
14836 1. Every day
460 2. 5 to 6 Day(s) a week
687 3. 3 to 4 Day(s) a week
747 4. 1 to 2 Day(s) a week
409 5. 2 to 3 Day(s) a month
772 6. Once a month or less
102 9. Unknown
25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
DailyCigsSmoked
S3AQ3C1
USUAL QUANTITY WHEN SMOKED CIGARETTES
Discrete
17751 1-98. Cigarette(s)
262 99. Unknown
25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
Ethnicity
ETHRACE2A
IMPUTED RACE/ETHNICITY
Nominal
24507 1. White, Not Hispanic or Latino
8245 2. Black, Not Hispanic or Latino
701 3. American Indian/Alaska Native, Not Hispanic or Latino
1332 4. Asian/Native Hawaiian/Pacific Islander, Not Hispanic or Latino
8308 5. Hispanic or Latino
Depression
MAJORDEPLIFE
MAJOR DEPRESSION - LIFETIME (NON-HIERARCHICAL)
Nominal
35254 0. No
7839 1. Yes
IncomePersonal
S1Q10A
TOTAL PERSONAL INCOME IN LAST 12 MONTHS
Continuous
43093 0-3000000. Income in dollars
Height_ft
S1Q24FT
HEIGHT: FEET
42363 4-7. Feet
730 99. Unknown
* change 99 to NA
Height_in
S1Q24IN
HEIGHT: INCHES
Continuous
3572 0. None
38760 1-11. Inches
761 99. Unknown
* change 99 to NA
Weight_lb
S1Q24LB
WEIGHT: POUNDS
Continuous
41717 62-500. Pounds
1376 999. Unknown
* change 999 to NA
```

Additional variables were created from the original variables:

```
CREATED VARIABLES
DaysSmoke
estimated number of days per month a subject smokes
Continuous (due to way estimated), assumes 30 days per month using SmokingFreq)
1-30.
TotalCigsSmoked
estimated number of cigarettes per month a subject smokes (DaysSmoke * DailyCigsSmoked)
Continuous
0-large.
TotalCigsSmoked_smoked
estimated number of cigarettes per month a subject smokes (DaysSmoke * DailyCigsSmoked)
Smokers, only! (replaced 0s with NAs)
Continuous
1-large.
CigsSmokedFac
quartiles of TotalCigsSmoked
Ordinal
ranges for each of the four quarters
SmokingFreq3
three levels of smoking frequency
Ordinal
from SmokingFreq
1. daily = 1. Every day, 2. 5 to 6 Day(s) a week, 3. 3 to 4 Day(s) a week
2. weekly = 4. 1 to 2 Day(s) a week, 5. 2 to 3 Day(s) a month
3. monthly = 6. Once a month or less
4. Never = 7. Never
Height_inches
Total height in inches
Height_ft * 12 + Height_in
```

# 4 Data Management

## 4.1 Class 05, Data subset and numerical summaries

**Rubric**

(4 p) The data are loaded and a data.frame subset is created by selecting only the variables in the personal codebook.

- Scroll down to sections labeled “(Class 05)”.

(1 p) Output confirms the subset is correct (e.g., using

`dim()`

and`str()`

).(3 p) Rename your variables to descriptive names (e.g., from “S3AQ3B1” to “SmokingFreq”).

- Scroll down to sections labeled “(Class 05)”.

(2 p) Provide numerical summaries for all variables (e.g., using

`summary()`

).- Scroll down to sections labeled “(Class 05)”.

### 4.1.1 Data subset (Class 05)

First, the data is placed on the search path.

```
# data analysis packages
library(erikmisc) # Helpful functions
library(tidyverse) # Data manipulation and visualization suite
library(lubridate) # Dates
## 1. Download the ".RData" file for your dataset into your ADA Folder.
## 2. Use the load() statement for the dataset you want to use.
# read data example
load("NESARC.RData")
dim(NESARC)
```

`[1] 43093 3008`

```
# variables to include in our data subset
<-
nesarc_sub %>%
NESARC ::select(
dplyr
IDNUM
, SEX
, AGE
, CHECK321
, TAB12MDX
, S3AQ3B1
, S3AQ3C1
, ETHRACE2A
, MAJORDEPLIFE
, S1Q10A
, S1Q24FT
, S1Q24IN
, S1Q24LB
)
# Check size and structure of data subset.
# Note that categorical variables are already `Factor` variables, but the levels do not have meaningful labels, yet.
# size of subset
dim(nesarc_sub)
```

`[1] 43093 13`

```
# structure of subset
str(nesarc_sub)
```

```
'data.frame': 43093 obs. of 13 variables:
$ IDNUM : Factor w/ 43093 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
$ SEX : Factor w/ 2 levels "1","2": 1 2 2 1 1 2 1 2 2 2 ...
$ AGE : num 23 28 81 18 36 34 19 84 29 18 ...
$ CHECK321 : Factor w/ 3 levels "1","2","9": NA NA NA NA NA NA NA NA NA NA ...
$ TAB12MDX : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ S3AQ3B1 : Factor w/ 7 levels "1","2","3","4",..: NA NA NA NA NA NA NA NA NA NA ...
$ S3AQ3C1 : num NA NA NA NA NA NA NA NA NA NA ...
$ ETHRACE2A : Factor w/ 5 levels "1","2","3","4",..: 5 5 5 5 2 2 2 1 1 5 ...
$ MAJORDEPLIFE: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 2 1 ...
$ S1Q10A : num 17500 11000 6000 27000 42000 500 2500 24000 65000 4000 ...
$ S1Q24FT : num 5 5 5 5 6 5 5 5 5 5 ...
$ S1Q24IN : num 7 1 4 7 1 5 9 4 4 10 ...
$ S1Q24LB : num 195 127 185 180 220 140 160 120 140 150 ...
```

### 4.1.2 Renaming Variables (Class 05)

```
<-
nesarc_sub %>%
nesarc_sub ::rename(
dplyrID = IDNUM
Sex = SEX
, Age = AGE
, SmokingStatus = CHECK321
, NicotineDependence = TAB12MDX
, SmokingFreq = S3AQ3B1
, DailyCigsSmoked = S3AQ3C1
, Ethnicity = ETHRACE2A
, Depression = MAJORDEPLIFE
, IncomePersonal = S1Q10A
, Height_ft = S1Q24FT
, Height_in = S1Q24IN
, Weight_lb = S1Q24LB
,
)
str(nesarc_sub)
```

```
'data.frame': 43093 obs. of 13 variables:
$ ID : Factor w/ 43093 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Sex : Factor w/ 2 levels "1","2": 1 2 2 1 1 2 1 2 2 2 ...
$ Age : num 23 28 81 18 36 34 19 84 29 18 ...
$ SmokingStatus : Factor w/ 3 levels "1","2","9": NA NA NA NA NA NA NA NA NA NA ...
$ NicotineDependence: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ SmokingFreq : Factor w/ 7 levels "1","2","3","4",..: NA NA NA NA NA NA NA NA NA NA ...
$ DailyCigsSmoked : num NA NA NA NA NA NA NA NA NA NA ...
$ Ethnicity : Factor w/ 5 levels "1","2","3","4",..: 5 5 5 5 2 2 2 1 1 5 ...
$ Depression : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 2 1 ...
$ IncomePersonal : num 17500 11000 6000 27000 42000 500 2500 24000 65000 4000 ...
$ Height_ft : num 5 5 5 5 6 5 5 5 5 5 ...
$ Height_in : num 7 1 4 7 1 5 9 4 4 10 ...
$ Weight_lb : num 195 127 185 180 220 140 160 120 140 150 ...
```

### 4.1.3 Coding missing values (Class 06)

There are two steps. The first step is to recode any existing `NA`

s to actual values, if necessary. The method for doing this differs for numeric and categorical variables. The second step is to recode any coded missing values, such as 9s or 99s, as actual `NA`

.

#### 4.1.3.1 Coding `NA`

s as meaningful “missing”

First step: the existing blank values with `NA`

mean “never”, and “never” has a meaning different from “missing”. For each variable we need to decide what “never” means and code it appropriately.

##### 4.1.3.1.1 `NA`

s recoded as numeric

The easiest is `DailyCigsSmoked`

where `NA`

means 0 cigarettes. We replace `NA`

with 0. The function `is.na()`

identifies each observation that “is an `NA`

”. I’ve updated the codebook to indicate that the original `NA`

values were changed.

```
S3AQ3C1 (DailyCigsSmoked)
USUAL QUANTITY WHEN SMOKED CIGARETTES
Discrete
17751 1-98. Cigarette(s)
262 99. Unknown
25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
* change to 0 Cigarett(s)
```

`table(nesarc_sub$DailyCigsSmoked)`

```
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
934 884 923 573 1070 463 269 299 49 3077 23 230 34 25 851 40
17 18 19 20 21 22 23 24 25 27 28 29 30 33 34 35
22 59 5 5366 1 10 2 7 155 2 3 3 909 1 1 30
37 39 40 45 50 55 57 60 66 70 75 80 98 99
2 1 993 8 106 2 1 241 1 12 2 47 15 262
```

```
<-
nesarc_sub %>%
nesarc_sub replace_na(
# replace NA with new numeric value
list(
DailyCigsSmoked = 0
)
)
table(nesarc_sub$DailyCigsSmoked)
```

```
0 1 2 3 4 5 6 7 8 9 10 11 12
25080 934 884 923 573 1070 463 269 299 49 3077 23 230
13 14 15 16 17 18 19 20 21 22 23 24 25
34 25 851 40 22 59 5 5366 1 10 2 7 155
27 28 29 30 33 34 35 37 39 40 45 50 55
2 3 3 909 1 1 30 2 1 993 8 106 2
57 60 66 70 75 80 98 99
1 241 1 12 2 47 15 262
```

##### 4.1.3.1.2 `NA`

s recoded as categorical

The next two variables require defining a new category level. Both `SmokingStatus`

and `SmokingFreq`

with a value of `NA`

means “never”, so we need to create a new “Never” category and put in order with the rest of the levels. I’ve updated the codebook to indicate that the original `NA`

values were changed.

```
CHECK321 (SmokingStatus)
CIGARETTE SMOKING STATUS
Nominal
9913 1. Smoked cigarettes in the past 12 months
8078 2. Smoked cigarettes prior to the last 12 months
22 9. Unknown
25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
* change to 3. Never
S3AQ3B1 (SmokingFreq)
USUAL FREQUENCY WHEN SMOKED CIGARETTES
Ordinal
14836 1. Every day
460 2. 5 to 6 Day(s) a week
687 3. 3 to 4 Day(s) a week
747 4. 1 to 2 Day(s) a week
409 5. 2 to 3 Day(s) a month
772 6. Once a month or less
102 9. Unknown
25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
* change to 7. Never
```

`table(nesarc_sub$SmokingStatus )`

```
1 2 9
9913 8078 22
```

`table(nesarc_sub$SmokingFreq )`

```
1 2 3 4 5 6 9
14836 460 687 747 409 772 102
```

```
<-
nesarc_sub %>%
nesarc_sub mutate(
# first change the "factor" variables to "character" to add a new level
SmokingStatus = as.character(SmokingStatus)
SmokingFreq = as.character(SmokingFreq )
, %>%
) replace_na(
# replace NA with new category values
list(
SmokingStatus = "3"
SmokingFreq = "7"
,
)
)
table(nesarc_sub$SmokingStatus )
```

```
1 2 3 9
9913 8078 25080 22
```

`table(nesarc_sub$SmokingFreq )`

```
1 2 3 4 5 6 7 9
14836 460 687 747 409 772 25080 102
```

#### 4.1.3.2 Coding 9s and 99s as `NA`

s

Second step: replace the coded missing values with `NA`

, then remove the remaining records with missing values. Not all of these variables need this operation, since the previous subset commands removed missing values by chance. However, I’m going to run each as an extra precaution. Afterall, later, if we were to make different previous data decisions, it’s possible we could end up with missing values at this point that are incorrectly coded without this step!

`table(nesarc_sub$SmokingStatus )`

```
1 2 3 9
9913 8078 25080 22
```

`table(nesarc_sub$SmokingFreq )`

```
1 2 3 4 5 6 7 9
14836 460 687 747 409 772 25080 102
```

`table(nesarc_sub$DailyCigsSmoked)`

```
0 1 2 3 4 5 6 7 8 9 10 11 12
25080 934 884 923 573 1070 463 269 299 49 3077 23 230
13 14 15 16 17 18 19 20 21 22 23 24 25
34 25 851 40 22 59 5 5366 1 10 2 7 155
27 28 29 30 33 34 35 37 39 40 45 50 55
2 3 3 909 1 1 30 2 1 993 8 106 2
57 60 66 70 75 80 98 99
1 241 1 12 2 47 15 262
```

```
<-
nesarc_sub %>%
nesarc_sub mutate(
# replace missing values
SmokingStatus = replace(SmokingStatus , SmokingStatus %in% c("9"), NA)
SmokingFreq = replace(SmokingFreq , SmokingFreq %in% c("9"), NA)
, DailyCigsSmoked = replace(DailyCigsSmoked, DailyCigsSmoked %in% c( 99), NA)
, Height_ft = replace(Height_ft , Height_ft %in% c( 99), NA)
, Height_in = replace(Height_in , Height_in %in% c( 99), NA)
, Weight_lb = replace(Weight_lb , Weight_lb %in% c(999), NA)
, # drop unused factor levels
SmokingStatus = fct_drop(SmokingStatus)
, SmokingFreq = fct_drop(SmokingFreq )
,
)
table(nesarc_sub$SmokingStatus )
```

```
1 2 3
9913 8078 25080
```

`table(nesarc_sub$SmokingFreq )`

```
1 2 3 4 5 6 7
14836 460 687 747 409 772 25080
```

`table(nesarc_sub$DailyCigsSmoked)`

```
0 1 2 3 4 5 6 7 8 9 10 11 12
25080 934 884 923 573 1070 463 269 299 49 3077 23 230
13 14 15 16 17 18 19 20 21 22 23 24 25
34 25 851 40 22 59 5 5366 1 10 2 7 155
27 28 29 30 33 34 35 37 39 40 45 50 55
2 3 3 909 1 1 30 2 1 993 8 106 2
57 60 66 70 75 80 98
1 241 1 12 2 47 15
```

`summary(nesarc_sub)`

```
ID Sex Age SmokingStatus NicotineDependence
1 : 1 1:18518 Min. :18.0 1 : 9913 0:38131
2 : 1 2:24575 1st Qu.:32.0 2 : 8078 1: 4962
3 : 1 Median :44.0 3 :25080
4 : 1 Mean :46.4 NA's: 22
5 : 1 3rd Qu.:59.0
6 : 1 Max. :98.0
(Other):43087
SmokingFreq DailyCigsSmoked Ethnicity Depression IncomePersonal
7 :25080 Min. : 0.000 1:24507 0:35254 Min. : 0
1 :14836 1st Qu.: 0.000 2: 8245 1: 7839 1st Qu.: 8800
6 : 772 Median : 0.000 3: 701 Median : 20000
4 : 747 Mean : 6.462 4: 1332 Mean : 28185
3 : 687 3rd Qu.:10.000 5: 8308 3rd Qu.: 36000
(Other): 869 Max. :98.000 Max. :3000000
NA's : 102 NA's :262
Height_ft Height_in Weight_lb
Min. :4.00 Min. : 0.000 Min. : 62.0
1st Qu.:5.00 1st Qu.: 3.000 1st Qu.:140.0
Median :5.00 Median : 5.000 Median :165.0
Mean :5.11 Mean : 5.279 Mean :170.6
3rd Qu.:5.00 3rd Qu.: 8.000 3rd Qu.:192.0
Max. :7.00 Max. :11.000 Max. :500.0
NA's :730 NA's :761 NA's :1376
```

#### 4.1.3.3 Removing rows with missing values (don’t do this)

Finally, consider using `na.omit()`

to remove any records with missing values if it won’t cause issues with analysis. The command `na.omit()`

removes a row if any columns have an `NA`

in that row. This can have **drastic** consequences. For example, imagine that you have 7 variables and 6 variables have complete data, but the 7th variable is about three-quarters `NA`

s! Then `na.omit()`

will drop three-quarters of your data, even though most of your variables are complete!

There’s only one advantage I can think of for running `na.omit()`

, and that’s if you have a large dataset and only a few observations have a few `NA`

s. You can’t analyze data with `NA`

s, since any calculation involving `NA`

results in an `NA`

. Therefore imagine two scenarios: (1) An analysis with Variables 1 and 2, each with a few `NA`

distributed throughout; first the observations with `NA`

for either Variable 1 or 2 will be removed, then the analysis will proceed. (2) A similar analysis, but now with Variables 1 and 3, but Variable 3 has `NA`

for different observations than Variable 2 — therefore, different observations will be removed before analysis. In summary, these two analyses will include different observations from the dataset. Therefore, by running `na.omit()`

you will start with a dataset with no `NA`

s, thus every analysis will be of exactly the same set of data (at the cost of not using all the possible data for each individual analysis).

Below, I do **not run na.omit()** since I’m concerned about having lots of

`NA`

s for some variables.`dim(nesarc_sub)`

`[1] 43093 13`

```
# remove rows with missing values
#nesarc_sub <- na.omit(nesarc_sub)
dim(nesarc_sub)
```

`[1] 43093 13`

`summary(nesarc_sub)`

```
ID Sex Age SmokingStatus NicotineDependence
1 : 1 1:18518 Min. :18.0 1 : 9913 0:38131
2 : 1 2:24575 1st Qu.:32.0 2 : 8078 1: 4962
3 : 1 Median :44.0 3 :25080
4 : 1 Mean :46.4 NA's: 22
5 : 1 3rd Qu.:59.0
6 : 1 Max. :98.0
(Other):43087
SmokingFreq DailyCigsSmoked Ethnicity Depression IncomePersonal
7 :25080 Min. : 0.000 1:24507 0:35254 Min. : 0
1 :14836 1st Qu.: 0.000 2: 8245 1: 7839 1st Qu.: 8800
6 : 772 Median : 0.000 3: 701 Median : 20000
4 : 747 Mean : 6.462 4: 1332 Mean : 28185
3 : 687 3rd Qu.:10.000 5: 8308 3rd Qu.: 36000
(Other): 869 Max. :98.000 Max. :3000000
NA's : 102 NA's :262
Height_ft Height_in Weight_lb
Min. :4.00 Min. : 0.000 Min. : 62.0
1st Qu.:5.00 1st Qu.: 3.000 1st Qu.:140.0
Median :5.00 Median : 5.000 Median :165.0
Mean :5.11 Mean : 5.279 Mean :170.6
3rd Qu.:5.00 3rd Qu.: 8.000 3rd Qu.:192.0
Max. :7.00 Max. :11.000 Max. :500.0
NA's :730 NA's :761 NA's :1376
```

Now we (would) have a dataset with no missing values.

### 4.1.4 Creating new variables (Class 06+)

#### 4.1.4.1 From categories to numeric

`DaysSmoke`

estimates the days per month a subject smokes by converting `SmokingFreq`

(a factor with 6 levels) to a numeric variable using `as.numeric()`

and multiplying by the midpoint of the range of `SmokingFreq`

.

```
<-
nesarc_sub %>%
nesarc_sub mutate(
DaysSmoke =
case_when(
== 1 ~ 30 # 1. Every day
SmokingFreq == 2 ~ 4 * 5.5 # 2. 5 to 6 Day(s) a week
, SmokingFreq == 3 ~ 4 * 3.5 # 3. 3 to 4 Day(s) a week
, SmokingFreq == 4 ~ 4 * 1.5 # 4. 1 to 2 Day(s) a week
, SmokingFreq == 5 ~ 2.5 # 5. 2 to 3 Day(s) a month
, SmokingFreq == 6 ~ 1 # 6. Once a month or less
, SmokingFreq == 7 ~ 0 # 7. Never
, SmokingFreq
)
)
summary(nesarc_sub$DaysSmoke)
```

```
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.00 0.00 0.00 10.96 30.00 30.00 102
```

#### 4.1.4.2 From numeric to numeric

`Height_inches`

is the height of a person in inches. Since the data was collected in two variables (`Height_ft`

and `Height_in`

) we need to add them together for our new variable.

```
<-
nesarc_sub %>%
nesarc_sub mutate(
Height_inches = Height_ft * 12 + Height_in
)
summary(nesarc_sub$Height_inches)
```

```
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
48.0 64.0 66.0 66.6 70.0 84.0 761
```

The variable `TotalCigsSmoked`

estimates the monthly number of cigarettes a subject smokes per month by multiplying `DaysSmoke`

times `DailyCigsSmoked`

(the usual quantity smoked per day).

```
<-
nesarc_sub %>%
nesarc_sub mutate(
DaysSmoke_log2 = log2(DaysSmoke)
TotalCigsSmoked = DaysSmoke * DailyCigsSmoked
, TotalCigsSmoked_log2 = ifelse(TotalCigsSmoked > 0, log2(TotalCigsSmoked), NA)
, TotalCigsSmoked_smokers = ifelse(TotalCigsSmoked > 0, TotalCigsSmoked, NA)
, TotalCigsSmoked_smokers_log2 = TotalCigsSmoked_log2
,
)
summary(nesarc_sub$TotalCigsSmoked)
```

```
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.0 0.0 0.0 187.5 300.0 2940.0 297
```

`summary(nesarc_sub$TotalCigsSmoked_smokers)`

```
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
1.0 150.0 360.0 453.1 600.0 2940.0 25377
```

#### 4.1.4.3 From numeric to categories based on quantiles

```
<-
nesarc_sub %>%
nesarc_sub mutate(
CigsSmokedFac =
case_when(
== 0 ~ "No smoking (0)"
TotalCigsSmoked > 0) & (TotalCigsSmoked <= 6) ~ "Lowest smoking (1-6)"
, (TotalCigsSmoked > 6) & (TotalCigsSmoked <= 30) ~ "Low smoking (7-30)"
, (TotalCigsSmoked > 30) & (TotalCigsSmoked <= 300) ~ "Medium smoking (31-300)"
, (TotalCigsSmoked > 300) ~ "High smoking (301+)"
, (TotalCigsSmoked
)CigsSmokedFac =
, factor(
CigsSmokedFaclevels = c(
, "No smoking (0)"
"Lowest smoking (1-6)"
, "Low smoking (7-30)"
, "Medium smoking (31-300)"
, "High smoking (301+)"
,
)
)
)
summary(nesarc_sub$TotalCigsSmoked)
```

```
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.0 0.0 0.0 187.5 300.0 2940.0 297
```

`summary(nesarc_sub$CigsSmokedFac)`

```
No smoking (0) Lowest smoking (1-6) Low smoking (7-30)
25080 1045 1129
Medium smoking (31-300) High smoking (301+) NA's
6456 9086 297
```

`table(nesarc_sub$CigsSmokedFac)`

```
No smoking (0) Lowest smoking (1-6) Low smoking (7-30)
25080 1045 1129
Medium smoking (31-300) High smoking (301+)
6456 9086
```

#### 4.1.4.4 From many categories to a few

The variable `SmokingFreq3`

collapses the `SmokingFreq`

from 6 down to 3 categories. This is an example to show you how to do this with your own variables. Some of you will have categorical variables that have a dozen or so categories; having so many categories makes interpretation very difficulty — fewer is easier.

```
## SmokingFreq3 variable
# notice that I'm doing this before I've assigned the labels
# if you do this after assigning labels (with the factor() command), then you'll want to use "as.numeric()" as below
# inspect original variable
table(nesarc_sub$SmokingFreq)
```

```
1 2 3 4 5 6 7
14836 460 687 747 409 772 25080
```

```
# Note: it is sometimes easier to use the numeric "levels" of a factor variable, than the character "labels"
# initialize new variable with NAs
<-
nesarc_sub %>%
nesarc_sub mutate(
SmokingFreq3 =
fct_collapse(
SmokingFreq"1" = c("1", "2", "3") # 1. daily
, "2" = c("4", "5") # 2. weekly
, "3" = "6" # 3. monthly
, "4" = "7" # 7. never
,
)
)
# new variable
table(nesarc_sub$SmokingFreq3)
```

```
1 2 3 4
15983 1156 772 25080
```

`# we will label the new factor variable below, with the others`

#### 4.1.4.5 Working with Dates

How to calculate the duration between two dates.

(You probably won’t need to use this for your project.)

Dates can be tricky in R. The `lubridate`

package makes them much easier.

Below I’m going to create a variable for the person’s age in years based on the difference between the person’s date of birth and the interview date. There is an `AGE`

variable which I’m already using, but if there wasn’t, or I wanted to be more precise about their age (such as, 24.34 years old), then this is a way to do it.

I’m going to create a separate data frame just for this example to show how to do it; if you needed these variables, they should be part of your original subset above. Below, the “C” variables (`CDAY`

) are the interview day, month, and year, and the “DOB” variables (`DOBD`

) are the date of birth day, month, and year.

```
<-
dat_sub %>%
NESARC ::select(
dplyr
AGE
, CDAY
, CMON
, CYEAR
, DOBD
, DOBM
, DOBY
)str(dat_sub)
```

```
'data.frame': 43093 obs. of 7 variables:
$ AGE : num 23 28 81 18 36 34 19 84 29 18 ...
$ CDAY : num 14 12 23 9 18 7 1 18 9 16 ...
$ CMON : num 8 1 11 9 10 9 9 9 1 8 ...
$ CYEAR: num 2001 2002 2001 2001 2001 ...
$ DOBD : num 27 15 28 27 11 28 16 10 7 25 ...
$ DOBM : num 9 12 12 6 9 12 3 11 4 3 ...
$ DOBY : num 1977 1973 1919 1983 1965 ...
```

Combine the date fields together, then convert them into a date object. I’ll do this for both the interview date and the date of birth. I print the head (first 6 observations) after all the operations are complete.

```
<-
dat_sub %>%
dat_sub mutate(
## interview date
# combine day, month, and year into one text string
cdate_text = paste(dat_sub$CDAY, dat_sub$CMON, dat_sub$CYEAR)
# create date object by interpretting the day, month, year with dmy()
cdate_date = dmy(cdate_text)
,
## date of birth
# combine day, month, and year into one text string
dob_text = paste(dat_sub$DOBD, dat_sub$DOBM, dat_sub$DOBY)
, # create date object by interpretting the day, month, year with dmy()
# if any failed to parse, it's because a day, month, or year is out of range
# such as "99" for missing values or refused to answer
# in these cases, the date is NA (missing), which is what we want
dob_date = dmy(dob_text)
,
)head(dat_sub)
```

```
AGE CDAY CMON CYEAR DOBD DOBM DOBY cdate_text cdate_date dob_text
1 23 14 8 2001 27 9 1977 14 8 2001 2001-08-14 27 9 1977
2 28 12 1 2002 15 12 1973 12 1 2002 2002-01-12 15 12 1973
3 81 23 11 2001 28 12 1919 23 11 2001 2001-11-23 28 12 1919
4 18 9 9 2001 27 6 1983 9 9 2001 2001-09-09 27 6 1983
5 36 18 10 2001 11 9 1965 18 10 2001 2001-10-18 11 9 1965
6 34 7 9 2001 28 12 1966 7 9 2001 2001-09-07 28 12 1966
dob_date
1 1977-09-27
2 1973-12-15
3 1919-12-28
4 1983-06-27
5 1965-09-11
6 1966-12-28
```

Now that we have the interview date and date of birth as date objects, we calculate the difference to give the person’s age at the time of the interview. Because the difference of dates is in the unit “days”, we convert to years.

```
<-
dat_sub %>%
dat_sub mutate(
age_days = cdate_date - dob_date
age_years = as.numeric(age_days / 365.25) # change from "difftime" to "numeric"
, # difference between the age we calculated and the age in the original dataset
age_diff = age_years - AGE
,
)head(dat_sub)
```

```
AGE CDAY CMON CYEAR DOBD DOBM DOBY cdate_text cdate_date dob_text
1 23 14 8 2001 27 9 1977 14 8 2001 2001-08-14 27 9 1977
2 28 12 1 2002 15 12 1973 12 1 2002 2002-01-12 15 12 1973
3 81 23 11 2001 28 12 1919 23 11 2001 2001-11-23 28 12 1919
4 18 9 9 2001 27 6 1983 9 9 2001 2001-09-09 27 6 1983
5 36 18 10 2001 11 9 1965 18 10 2001 2001-10-18 11 9 1965
6 34 7 9 2001 28 12 1966 7 9 2001 2001-09-07 28 12 1966
dob_date age_days age_years age_diff
1 1977-09-27 8722 days 23.87953 0.87953457
2 1973-12-15 10255 days 28.07666 0.07665982
3 1919-12-28 29916 days 81.90554 0.90554415
4 1983-06-27 6649 days 18.20397 0.20396988
5 1965-09-11 13186 days 36.10130 0.10130048
6 1966-12-28 12672 days 34.69405 0.69404517
```

Finally, keep new age variable in your data frame, drop (unselect) the variables you don’t need anymore, and you’re ready to go!

```
# drop the original ones you don't need anymore
# by select(-VAR), it excludes the VAR column
<-
dat_sub %>%
dat_sub select(
-c(CDAY, CMON, CYEAR, DOBD, DOBM, DOBY)
# -c(cdate_text, cdate_date, dob_text, dob_date)
-c(age_days, age_diff)
,
)str(dat_sub)
```

```
'data.frame': 43093 obs. of 6 variables:
$ AGE : num 23 28 81 18 36 34 19 84 29 18 ...
$ cdate_text: chr "14 8 2001" "12 1 2002" "23 11 2001" "9 9 2001" ...
$ cdate_date: Date, format: "2001-08-14" "2002-01-12" ...
$ dob_text : chr "27 9 1977" "15 12 1973" "28 12 1919" "27 6 1983" ...
$ dob_date : Date, format: "1977-09-27" "1973-12-15" ...
$ age_years : num 23.9 28.1 81.9 18.2 36.1 ...
```

Now you’ve got an age (or other date object) you can use in your project.

(For this example, I’m going to delete the `dat_sub`

object since it was just for this demonstration.

`rm(dat_sub)`

#### 4.1.4.6 Review results of new variables

`summary(nesarc_sub)`

```
ID Sex Age SmokingStatus NicotineDependence
1 : 1 1:18518 Min. :18.0 1 : 9913 0:38131
2 : 1 2:24575 1st Qu.:32.0 2 : 8078 1: 4962
3 : 1 Median :44.0 3 :25080
4 : 1 Mean :46.4 NA's: 22
5 : 1 3rd Qu.:59.0
6 : 1 Max. :98.0
(Other):43087
SmokingFreq DailyCigsSmoked Ethnicity Depression IncomePersonal
7 :25080 Min. : 0.000 1:24507 0:35254 Min. : 0
1 :14836 1st Qu.: 0.000 2: 8245 1: 7839 1st Qu.: 8800
6 : 772 Median : 0.000 3: 701 Median : 20000
4 : 747 Mean : 6.462 4: 1332 Mean : 28185
3 : 687 3rd Qu.:10.000 5: 8308 3rd Qu.: 36000
(Other): 869 Max. :98.000 Max. :3000000
NA's : 102 NA's :262
Height_ft Height_in Weight_lb DaysSmoke Height_inches
Min. :4.00 Min. : 0.000 Min. : 62.0 Min. : 0.00 Min. :48.0
1st Qu.:5.00 1st Qu.: 3.000 1st Qu.:140.0 1st Qu.: 0.00 1st Qu.:64.0
Median :5.00 Median : 5.000 Median :165.0 Median : 0.00 Median :66.0
Mean :5.11 Mean : 5.279 Mean :170.6 Mean :10.96 Mean :66.6
3rd Qu.:5.00 3rd Qu.: 8.000 3rd Qu.:192.0 3rd Qu.:30.00 3rd Qu.:70.0
Max. :7.00 Max. :11.000 Max. :500.0 Max. :30.00 Max. :84.0
NA's :730 NA's :761 NA's :1376 NA's :102 NA's :761
DaysSmoke_log2 TotalCigsSmoked TotalCigsSmoked_log2 TotalCigsSmoked_smokers
Min. : -Inf Min. : 0.0 Min. : 0.000 Min. : 1.0
1st Qu.: -Inf 1st Qu.: 0.0 1st Qu.: 7.229 1st Qu.: 150.0
Median : -Inf Median : 0.0 Median : 8.492 Median : 360.0
Mean : -Inf Mean : 187.5 Mean : 7.888 Mean : 453.1
3rd Qu.:4.907 3rd Qu.: 300.0 3rd Qu.: 9.229 3rd Qu.: 600.0
Max. :4.907 Max. :2940.0 Max. :11.522 Max. :2940.0
NA's :102 NA's :297 NA's :25377 NA's :25377
TotalCigsSmoked_smokers_log2 CigsSmokedFac SmokingFreq3
Min. : 0.000 No smoking (0) :25080 1 :15983
1st Qu.: 7.229 Lowest smoking (1-6) : 1045 2 : 1156
Median : 8.492 Low smoking (7-30) : 1129 3 : 772
Mean : 7.888 Medium smoking (31-300): 6456 4 :25080
3rd Qu.: 9.229 High smoking (301+) : 9086 NA's: 102
Max. :11.522 NA's : 297
NA's :25377
```

### 4.1.5 Labeling Categorical variable levels (Class 06)

Look at summary of dataset to see the state of the labeling before we make changes.

`summary(nesarc_sub)`

```
ID Sex Age SmokingStatus NicotineDependence
1 : 1 1:18518 Min. :18.0 1 : 9913 0:38131
2 : 1 2:24575 1st Qu.:32.0 2 : 8078 1: 4962
3 : 1 Median :44.0 3 :25080
4 : 1 Mean :46.4 NA's: 22
5 : 1 3rd Qu.:59.0
6 : 1 Max. :98.0
(Other):43087
SmokingFreq DailyCigsSmoked Ethnicity Depression IncomePersonal
7 :25080 Min. : 0.000 1:24507 0:35254 Min. : 0
1 :14836 1st Qu.: 0.000 2: 8245 1: 7839 1st Qu.: 8800
6 : 772 Median : 0.000 3: 701 Median : 20000
4 : 747 Mean : 6.462 4: 1332 Mean : 28185
3 : 687 3rd Qu.:10.000 5: 8308 3rd Qu.: 36000
(Other): 869 Max. :98.000 Max. :3000000
NA's : 102 NA's :262
Height_ft Height_in Weight_lb DaysSmoke Height_inches
Min. :4.00 Min. : 0.000 Min. : 62.0 Min. : 0.00 Min. :48.0
1st Qu.:5.00 1st Qu.: 3.000 1st Qu.:140.0 1st Qu.: 0.00 1st Qu.:64.0
Median :5.00 Median : 5.000 Median :165.0 Median : 0.00 Median :66.0
Mean :5.11 Mean : 5.279 Mean :170.6 Mean :10.96 Mean :66.6
3rd Qu.:5.00 3rd Qu.: 8.000 3rd Qu.:192.0 3rd Qu.:30.00 3rd Qu.:70.0
Max. :7.00 Max. :11.000 Max. :500.0 Max. :30.00 Max. :84.0
NA's :730 NA's :761 NA's :1376 NA's :102 NA's :761
DaysSmoke_log2 TotalCigsSmoked TotalCigsSmoked_log2 TotalCigsSmoked_smokers
Min. : -Inf Min. : 0.0 Min. : 0.000 Min. : 1.0
1st Qu.: -Inf 1st Qu.: 0.0 1st Qu.: 7.229 1st Qu.: 150.0
Median : -Inf Median : 0.0 Median : 8.492 Median : 360.0
Mean : -Inf Mean : 187.5 Mean : 7.888 Mean : 453.1
3rd Qu.:4.907 3rd Qu.: 300.0 3rd Qu.: 9.229 3rd Qu.: 600.0
Max. :4.907 Max. :2940.0 Max. :11.522 Max. :2940.0
NA's :102 NA's :297 NA's :25377 NA's :25377
TotalCigsSmoked_smokers_log2 CigsSmokedFac SmokingFreq3
Min. : 0.000 No smoking (0) :25080 1 :15983
1st Qu.: 7.229 Lowest smoking (1-6) : 1045 2 : 1156
Median : 8.492 Low smoking (7-30) : 1129 3 : 772
Mean : 7.888 Medium smoking (31-300): 6456 4 :25080
3rd Qu.: 9.229 High smoking (301+) : 9086 NA's: 102
Max. :11.522 NA's : 297
NA's :25377
```

Informative labels are given to the factor levels. The order of the levels is also rearranged for the variables `SmokingFreq`

, `NicotineDependence`

, and `Sex`

.

```
# Label factors with meaningful names in the same order as the codebook
$Depression <-
nesarc_subfactor(
$Depression
nesarc_sublabels = c("No Depression"
, "Yes Depression"
,
)
)
$SmokingStatus <-
nesarc_subfactor(
$SmokingStatus
nesarc_sublabels = c("Smoked past 12 months"
, "Smoked prior to 12 months"
, "Never smoked"
,
)
)
# Specify the order of the levels to be different from codebook order
$Sex <-
nesarc_subfactor(nesarc_sub$Sex
levels = c( 2
, 1
,
)labels = c( "Female"
, "Male"
,
)
)# check ordering with a frequency table
table(nesarc_sub$Sex)
```

```
Female Male
24575 18518
```

```
$NicotineDependence <-
nesarc_subfactor(nesarc_sub$NicotineDependence
levels = c( 1
, 0
,
)labels = c( "Nicotine Dependence"
, "No Nicotine Dependence"
,
)
)
# Remember to include the new category level from the original `NA` values.
$SmokingFreq <-
nesarc_subfactor(nesarc_sub$SmokingFreq
levels = c( 7
, 6
, 5
, 4
, 3
, 2
, 1
,
)labels = c( "Never"
, "Once a month or less"
, "2 to 3 Days/month"
, "1 to 2 Days/week"
, "3 to 4 Days/week"
, "5 to 6 Days/week"
, "Every Day"
,
)
)
$Ethnicity <-
nesarc_subfactor(nesarc_sub$Ethnicity
# shorter labels are helpful for plots
labels = c( "Cauc"
, "AfAm"
, "NaAm"
, "Asia"
, "Hisp"
,
)# , labels = c("Caucasian"
# , "African American"
# , "Native American"
# , "Asian"
# , "Hispanic"
# )
)
$SmokingFreq3 <-
nesarc_subfactor(nesarc_sub$SmokingFreq3
levels = c( 4
, 3
, 2
, 1
,
)labels = c( "Never"
, "Monthly"
, "Weekly"
, "Daily"
,
)
)
summary(nesarc_sub)
```

```
ID Sex Age
1 : 1 Female:24575 Min. :18.0
2 : 1 Male :18518 1st Qu.:32.0
3 : 1 Median :44.0
4 : 1 Mean :46.4
5 : 1 3rd Qu.:59.0
6 : 1 Max. :98.0
(Other):43087
SmokingStatus NicotineDependence
Smoked past 12 months : 9913 Nicotine Dependence : 4962
Smoked prior to 12 months: 8078 No Nicotine Dependence:38131
Never smoked :25080
NA's : 22
SmokingFreq DailyCigsSmoked Ethnicity
Never :25080 Min. : 0.000 Cauc:24507
Every Day :14836 1st Qu.: 0.000 AfAm: 8245
Once a month or less: 772 Median : 0.000 NaAm: 701
1 to 2 Days/week : 747 Mean : 6.462 Asia: 1332
3 to 4 Days/week : 687 3rd Qu.:10.000 Hisp: 8308
(Other) : 869 Max. :98.000
NA's : 102 NA's :262
Depression IncomePersonal Height_ft Height_in
No Depression :35254 Min. : 0 Min. :4.00 Min. : 0.000
Yes Depression: 7839 1st Qu.: 8800 1st Qu.:5.00 1st Qu.: 3.000
Median : 20000 Median :5.00 Median : 5.000
Mean : 28185 Mean :5.11 Mean : 5.279
3rd Qu.: 36000 3rd Qu.:5.00 3rd Qu.: 8.000
Max. :3000000 Max. :7.00 Max. :11.000
NA's :730 NA's :761
Weight_lb DaysSmoke Height_inches DaysSmoke_log2
Min. : 62.0 Min. : 0.00 Min. :48.0 Min. : -Inf
1st Qu.:140.0 1st Qu.: 0.00 1st Qu.:64.0 1st Qu.: -Inf
Median :165.0 Median : 0.00 Median :66.0 Median : -Inf
Mean :170.6 Mean :10.96 Mean :66.6 Mean : -Inf
3rd Qu.:192.0 3rd Qu.:30.00 3rd Qu.:70.0 3rd Qu.:4.907
Max. :500.0 Max. :30.00 Max. :84.0 Max. :4.907
NA's :1376 NA's :102 NA's :761 NA's :102
TotalCigsSmoked TotalCigsSmoked_log2 TotalCigsSmoked_smokers
Min. : 0.0 Min. : 0.000 Min. : 1.0
1st Qu.: 0.0 1st Qu.: 7.229 1st Qu.: 150.0
Median : 0.0 Median : 8.492 Median : 360.0
Mean : 187.5 Mean : 7.888 Mean : 453.1
3rd Qu.: 300.0 3rd Qu.: 9.229 3rd Qu.: 600.0
Max. :2940.0 Max. :11.522 Max. :2940.0
NA's :297 NA's :25377 NA's :25377
TotalCigsSmoked_smokers_log2 CigsSmokedFac SmokingFreq3
Min. : 0.000 No smoking (0) :25080 Never :25080
1st Qu.: 7.229 Lowest smoking (1-6) : 1045 Monthly: 772
Median : 8.492 Low smoking (7-30) : 1129 Weekly : 1156
Mean : 7.888 Medium smoking (31-300): 6456 Daily :15983
3rd Qu.: 9.229 High smoking (301+) : 9086 NA's : 102
Max. :11.522 NA's : 297
NA's :25377
```

### 4.1.6 Data subset rows

## 4.2 Data is complete (Class 06)

### 4.2.1 Plot entire dataset, show missing values

```
{r}
ggplot_missing(nesarc_sub)
```

### 4.2.2 Numerical summaries to assess correctness (Class 05)

`summary(nesarc_sub)`

```
ID Sex Age
1 : 1 Female:24575 Min. :18.0
2 : 1 Male :18518 1st Qu.:32.0
3 : 1 Median :44.0
4 : 1 Mean :46.4
5 : 1 3rd Qu.:59.0
6 : 1 Max. :98.0
(Other):43087
SmokingStatus NicotineDependence
Smoked past 12 months : 9913 Nicotine Dependence : 4962
Smoked prior to 12 months: 8078 No Nicotine Dependence:38131
Never smoked :25080
NA's : 22
SmokingFreq DailyCigsSmoked Ethnicity
Never :25080 Min. : 0.000 Cauc:24507
Every Day :14836 1st Qu.: 0.000 AfAm: 8245
Once a month or less: 772 Median : 0.000 NaAm: 701
1 to 2 Days/week : 747 Mean : 6.462 Asia: 1332
3 to 4 Days/week : 687 3rd Qu.:10.000 Hisp: 8308
(Other) : 869 Max. :98.000
NA's : 102 NA's :262
Depression IncomePersonal Height_ft Height_in
No Depression :35254 Min. : 0 Min. :4.00 Min. : 0.000
Yes Depression: 7839 1st Qu.: 8800 1st Qu.:5.00 1st Qu.: 3.000
Median : 20000 Median :5.00 Median : 5.000
Mean : 28185 Mean :5.11 Mean : 5.279
3rd Qu.: 36000 3rd Qu.:5.00 3rd Qu.: 8.000
Max. :3000000 Max. :7.00 Max. :11.000
NA's :730 NA's :761
Weight_lb DaysSmoke Height_inches DaysSmoke_log2
Min. : 62.0 Min. : 0.00 Min. :48.0 Min. : -Inf
1st Qu.:140.0 1st Qu.: 0.00 1st Qu.:64.0 1st Qu.: -Inf
Median :165.0 Median : 0.00 Median :66.0 Median : -Inf
Mean :170.6 Mean :10.96 Mean :66.6 Mean : -Inf
3rd Qu.:192.0 3rd Qu.:30.00 3rd Qu.:70.0 3rd Qu.:4.907
Max. :500.0 Max. :30.00 Max. :84.0 Max. :4.907
NA's :1376 NA's :102 NA's :761 NA's :102
TotalCigsSmoked TotalCigsSmoked_log2 TotalCigsSmoked_smokers
Min. : 0.0 Min. : 0.000 Min. : 1.0
1st Qu.: 0.0 1st Qu.: 7.229 1st Qu.: 150.0
Median : 0.0 Median : 8.492 Median : 360.0
Mean : 187.5 Mean : 7.888 Mean : 453.1
3rd Qu.: 300.0 3rd Qu.: 9.229 3rd Qu.: 600.0
Max. :2940.0 Max. :11.522 Max. :2940.0
NA's :297 NA's :25377 NA's :25377
TotalCigsSmoked_smokers_log2 CigsSmokedFac SmokingFreq3
Min. : 0.000 No smoking (0) :25080 Never :25080
1st Qu.: 7.229 Lowest smoking (1-6) : 1045 Monthly: 772
Median : 8.492 Low smoking (7-30) : 1129 Weekly : 1156
Mean : 7.888 Medium smoking (31-300): 6456 Daily :15983
3rd Qu.: 9.229 High smoking (301+) : 9086 NA's : 102
Max. :11.522 NA's : 297
NA's :25377
```

# 5 Graphing and Tabulating

## 5.1 Class 06, Plotting univariate

**Rubric**

(3 p) For one categorical variable, a barplot is plotted with axis labels and a title. Interpret the plot: describe the relationship between categories you observe.

(3 p) For one numerical variable, a histogram or boxplot is plotted with axis labels and a title. Interpret the plot: describe the distribution (shape, center, spread, outliers).

(2 p) Code missing variables, remove records with missing values, indicate with R output that this was done correctly (e.g.,

`str()`

,`dim()`

,`summary()`

).- Scroll up to sections labeled “(Class 06)”.

(2 p) Label levels of factor variables.

- Scroll up to sections labeled “(Class 06)”.

## 5.2 Categorical variables

### 5.2.1 Tables for categorical variables

```
# table() produces a one-variable frequency table
table(nesarc_sub$NicotineDependence)
```

```
Nicotine Dependence No Nicotine Dependence
4962 38131
```

```
# proportions available by passing these frequencies to prop.table()
table(nesarc_sub$NicotineDependence) %>% prop.table()
```

```
Nicotine Dependence No Nicotine Dependence
0.1151463 0.8848537
```

```
<-
tab_nicdep %>%
nesarc_sub group_by(
NicotineDependence%>%
) summarize(
n = n()
#, .groups = "drop_last"
%>%
) mutate(
prop = round(n / sum(n), 3)
%>%
) #arrange(
# desc(n)
#) %>%
ungroup()
tab_nicdep
```

```
# A tibble: 2 × 3
NicotineDependence n prop
<fct> <int> <dbl>
1 Nicotine Dependence 4962 0.115
2 No Nicotine Dependence 38131 0.885
```

### 5.2.2 Graphing frequency tables

```
library(ggplot2)
<- ggplot(data = nesarc_sub, aes(x = NicotineDependence))
p1 <- p1 + geom_bar()
p1 <- p1 + labs(x = "Nicotine Dependence"
p1 y = "Total Number"
, title = "Nicotine Dependence for Adults"
,
)print(p1)
```

**Interpretation:** Among adults (n = 43093), there are about 8 times more people (88%) who are not nicotine dependent compared to those who are (12%).

## 5.3 Numeric variables

### 5.3.1 Graphing numeric variables

```
<- ggplot(data = nesarc_sub, aes(x = DailyCigsSmoked))
p <- p + geom_histogram(aes(y = ..density..), boundary = 0, binwidth = 5)
p #p <- p + geom_rug()
<- p + geom_density(alpha = 0.2, fill = "gray50", adjust = 4)
p <- p + labs(x = "Estimated Cigarettes Smoked per Month"
p y = ""
, title = "Monthly cigaretts smoked for Young Smoking Adults"
,
)<- p + theme_bw()
p print(p)
```

`summary(nesarc_sub$DailyCigsSmoked)`

```
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.000 0.000 0.000 6.462 10.000 98.000 262
```

`fivenum(nesarc_sub$DailyCigsSmoked)`

`[1] 0 0 0 10 98`

`median(nesarc_sub$DailyCigsSmoked, na.rm = TRUE)`

`[1] 0`

`IQR(nesarc_sub$DailyCigsSmoked, na.rm = TRUE)`

`[1] 10`

**Interpretation:** Among smokers, the number of cigarettes smoked is right skewed (shape) with a median (center) of 15 and the IQR (spread) (interquartile range, middle 50%) for the distribution is 14. There are two distinct modes (peaks) in the distribution, the first between 0 and 10 cigarettes (half pack) and the second at 20 (full pack). There are several outlying values (outliers) greater than 40 (representing 40 cigarettes per day) that are possibly overestimates from the people responding.

## 5.4 Class 07, Plotting bivariate, numeric response

**Rubric**

Each of the following (2 p for plot, 2 p for labeled axes and title, 1 p for interpretation):

Scatter plot (for regression): \(x\) = numerical, \(y\) = numerical, include axis labels and a title. Interpret the plot: describe the relationship.

Box plots (for ANOVA): \(x\) = categorical, \(y\) = numerical, include axis labels and a title. Interpret the plot: describe the relationship.

### 5.4.1 Scatter plot (for regression): x = numerical, y = numerical

**Interpretation:** The relationship between `Age`

and `TotalCigsSmoked`

, among people who smoke, is very weakly positive, but does not resemble a straight line. `TotalCigsSmoked`

is right-skewed which results in many extreme values above the regression line.

```
{r}
library(ggplot2)
p <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = Age, y = TotalCigsSmoked))
#p <- p + geom_point(alpha = 1/10)
p <- p + geom_jitter(width = 0.25, height = 5, alpha = 1/10)
p <- p + geom_smooth()
p <- p + stat_smooth(method = lm, colour = "red")
p <- p + labs(
title = "Total Cigarettes Smoked vs Age"
#, x =
, y = "Total Cigarettes Smoked per month"
, caption = "Key: Blue line is GAM smoother, Red line is simple linear regression.\nFiltered to include only smokers (cig > 0)."
)
print(p)
```

An example of height and weight.

```
{r}
library(ggplot2)
p <- ggplot(nesarc_sub, aes(x = Height_inches, y = Weight_lb))
p <- p + geom_jitter(height = 0.5, alpha = 1/10)
p <- p + stat_smooth(method = lm)
p <- p + labs(title = "Weight vs Height")
print(p)
```

### 5.4.2 Box plots (for ANOVA): x = categorical, y = numerical

**Interpretation:** For NESARC participants who smoke (cig > 0), those who are depressed have a slightly higher mean number of cigarettes per month (red diamond) than those who are not depressed. Total cigarettes smoked per month is right skewed so there are many people who smoke very little but few people who smoke a lot, which increases the mean beyond the median.

```
{r}
library(ggplot2)
p <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = Depression, y = TotalCigsSmoked))
p <- p + geom_boxplot(width = 0.5, alpha = 0.5)
p <- p + geom_jitter(position = position_jitter(width = 0.1), alpha = 1/100)
# diamond at mean for each group
p <- p + stat_summary(fun.y = mean, geom = "point", shape = 18, size = 6,
colour = "red", alpha = 0.8)
p <- p + labs(
title = "Total Cigarettes Smoked by Depression"
, x = "Depression status"
, y = "Total Cigarettes Smoked per month"
, caption = "Red diamond is the mean.\nFiltered to include only smokers (cig > 0)."
)
print(p)
```

## 5.5 Class 08, Plotting bivariate, categorical response

**Rubric**

Each of the following (2 p for plot, 2 p for labeled axes and title, 1 p for interpretation):

Mosaic plot or bivariate bar plots (for contingency tables): \(x\) = categorical, \(y\) = categorical, include axis labels and a title. Interpret the plot: describe the relationship.

Logistic scatter plot (for logistic regression): \(x\) = numerical, \(y\) = categorical (binary), include axis labels and a title. Interpret the plot: describe the relationship.

### 5.5.1 Mosaic plot or bivariate bar plots (for contingency tables): x = categorical, y = categorical

```
<-
tab_dep_nic %>%
nesarc_sub group_by(
Depression, NicotineDependence%>%
) summarize(
n = n()
%>%
) mutate(
prop = round(n / sum(n), 3)
%>%
) ungroup()
tab_dep_nic
```

```
# A tibble: 4 × 4
Depression NicotineDependence n prop
<fct> <fct> <int> <dbl>
1 No Depression Nicotine Dependence 3193 0.091
2 No Depression No Nicotine Dependence 32061 0.909
3 Yes Depression Nicotine Dependence 1769 0.226
4 Yes Depression No Nicotine Dependence 6070 0.774
```

```
<-
tab_nic_dep %>%
nesarc_sub group_by(
NicotineDependence, Depression%>%
) summarize(
n = n()
%>%
) mutate(
prop = round(n / sum(n), 3)
%>%
) ungroup()
tab_nic_dep
```

```
# A tibble: 4 × 4
NicotineDependence Depression n prop
<fct> <fct> <int> <dbl>
1 Nicotine Dependence No Depression 3193 0.643
2 Nicotine Dependence Yes Depression 1769 0.357
3 No Nicotine Dependence No Depression 32061 0.841
4 No Nicotine Dependence Yes Depression 6070 0.159
```

**Reversing color order:** It is sometimes preferable to reverse the order of your color-filled variable to read the more interesting proportions from the bottom of the plot up the proportion scale. Use the function `forcats::fct_rev()`

on your `fill=`

variable.

```
library(ggplot2)
<- ggplot(data = nesarc_sub, aes(x = Depression, fill = forcats::fct_rev(NicotineDependence)))
p <- p + geom_bar(position = "fill")
p <- p + theme_bw()
p <- p + labs(x = "Depression status"
p y = "Proportion"
, title = "Proportion of young adult smokers with and without\nnicotine dependence by depression status"
,
)# the legend title can be modified, if desired (try this line)
<- p + scale_fill_discrete(name = "Nicotine Dependence\nStatus")
p print(p)
```

**Interpretation:** When a person has depression, they are more than twice as likely to have nicotine dependence (22.6%) than those without depression (9.1%).

### 5.5.2 Logistic scatter plot (for logistic regression): x = numerical, y = categorical (binary)

`table(nesarc_sub$NicotineDependence)`

```
Nicotine Dependence No Nicotine Dependence
4962 38131
```

```
# if "Nicotine Dependence", then code it as a 1, otherwise code as a 0
$NicotineDependence01 <- ifelse(nesarc_sub$NicotineDependence == "Nicotine Dependence", 1, 0)
nesarc_sub# check that the frequencies are correct before and after
table(nesarc_sub$NicotineDependence01)
```

```
0 1
38131 4962
```

```
# # Note, sometimes it is necessary to make your variable a character type first.
# # If the code above doesn't produce both 0s and 1s, then try this:
# nesarc_sub$NicotineDependence01 <- ifelse(as.character(nesarc_sub$NicotineDependence) == "Nicotine Dependence", 1, 0)
```

```
{r, fig.height = 4, fig.width = 8}
library(ggplot2)
p <- ggplot(nesarc_sub, aes(x = TotalCigsSmoked, y = NicotineDependence01))
p <- p + theme_bw()
p <- p + geom_jitter(position = position_jitter(height = 0.1), alpha = 1/4)
# overlay a sigmodal curve to indicate direction of relationship
# As of ggplot2 2.1.0, need to use this binomial_smooth() function:
binomial_smooth <- function(...) {
geom_smooth(method = "glm", method.args = list(family = "binomial"), ...)
}
p <- p + binomial_smooth() # old way: stat_smooth(method="glm", family="binomial", se=FALSE)
p <- p + scale_y_continuous(breaks = seq(0, 1, by=0.2), minor_breaks = seq(0, 1, by=0.1))
p <- p + labs(x = "Estimated Cigarettes Smoked per Month"
, y = "Nicotine Dependence (1=Yes, 0=No)"
, title = "Likelihood of Nicotine Dependence\nincreases with number of cigarettes smoked"
)
print(p)
```

**Interpretation:** The probability of someone being tobacco dependent increases as the number of cigarettes smoked per month increases. The logistic curve model probably doesn’t apply at (or near) 0 cigarettes.

# 6 Statistical methods

## 6.1 Class 09, Simple linear regression (separate worksheet)

## 6.2 Class 10, Simple linear regression

**Rubric**

- With your previous (or new) bivariate scatter plot, add a regression line.
- (2 p) plot with regression line,
- (1 p) label axes and title.

- Use
`lm()`

to fit the linear regression and interpret slope and \(R^2\) (R-squared) values.- (2 p) lm
`summary()`

table is presented, - (2 p) slope is interpreted with respect to a per-unit increase of the \(x\) variable in the context of the variables in the plot,
- (2 p) \(R^2\) is interpretted in a sentence.

- (2 p) lm
- (1 p) Interpret the intercept. Does it make sense in the context of your study?

### 6.2.1 1. Scatter plot, add a regression line.

```
{r}
library(ggplot2)
p <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = Age, y = TotalCigsSmoked))
#p <- p + geom_point(alpha = 1/10)
p <- p + geom_jitter(width = 0.25, height = 5, alpha = 1/10)
#p <- p + geom_smooth()
p <- p + stat_smooth(method = lm, fullrange = TRUE)
p <- p + labs(
title = "Total Cigarettes Smoked vs Age"
#, x =
, y = "Total Cigarettes Smoked per month"
, caption = "Key: Blue line is simple linear regression.\nFiltered to include only smokers (cig > 0)."
)
p <- p + xlim(c(0, NA))
print(p)
```

### 6.2.2 2. Fit the linear regression, interpret slope and \(R^2\) (R-squared) values

```
# fit the simple linear regression model
<-
lm_TCS_Age lm(
formula = TotalCigsSmoked ~ Age
data = nesarc_sub
,
)# use summary() to parameters estimates (slope, intercept) and other summaries
summary(lm_TCS_Age)
```

```
Call:
lm(formula = TotalCigsSmoked ~ Age, data = nesarc_sub)
Residuals:
Min 1Q Median 3Q Max
-332.1 -189.4 -139.1 106.3 2828.9
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 57.92792 4.35989 13.29 <2e-16 ***
Age 2.79797 0.08762 31.93 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 329.2 on 42794 degrees of freedom
(297 observations deleted due to missingness)
Multiple R-squared: 0.02327, Adjusted R-squared: 0.02325
F-statistic: 1020 on 1 and 42794 DF, p-value: < 2.2e-16
```

**Slope**: For every 1 year increase in Age, we expect an increase of 2.798 total cigarettes smoked per month.**\(R^2\)**: The proportion of variance explained by the regression model, compared to the grand mean, is \(R^2 = 0.0233\), which is very small.

### 6.2.3 3. Interpret the intercept. Does it make sense?

For someone 0 years old, the expected number of cigarettes smoked per month is 57.93.

This does not make sense since newborns do not smoke and because this is a large extrapolation from the data.

## 6.3 Class 11, Logarithm transformation (separate worksheet)

## 6.4 Class 12, Logarithm transformation

**Rubric**

- Try plotting the data on a logarithmic scale
- (6 p) Each of the logarithmic relationships is plotted, axes are labeled with scale.

- original scales
- \(\log(x)\)-only
- \(\log(y)\)-only
- both \(\log(x)\) and \(\log(y)\)

- What happened to your data when you transformed it?
- (2 p) Describe what happened to the relationship after each log transformation (compare transformed scale to original scale; is the relationship more linear, more curved?).
- (1 p) Choose the best scale for a linear relationship and explain why.
- (1 p) Does your relationship benefit from a logarithmic transformation? Say why or why not.

### 6.4.1 1. Try plotting on log scale (original scale, \(\log(x)\)-only, \(\log(y)\)-only, both \(\log(x)\) and \(\log(y)\))

```
{r, fig.height = 8, fig.width = 8}
# IncomePersonal
#p <- p + scale_x_log10()
#p <- p + scale_y_log10()
library(ggplot2)
p1 <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = Age, y = TotalCigsSmoked))
p1 <- p1 + theme_bw()
p1 <- p1 + geom_point(alpha = 1/20)
p1 <- p1 + stat_smooth(method = lm)
p1 <- p1 + labs(
title = "Original scale"
, x = "Age"
, y = "Total Cigarettes Smoked per month"
)
#print(p1)
library(ggplot2)
p2 <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = Age, y = TotalCigsSmoked))
p2 <- p2 + theme_bw()
p2 <- p2 + geom_point(alpha = 1/20)
p2 <- p2 + stat_smooth(method = lm)
p2 <- p2 + scale_x_log10()
p2 <- p2 + labs(
title = "Log(x)"
, x = "Age"
, y = "Total Cigarettes Smoked per month"
)
#print(p2)
library(ggplot2)
p3 <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = Age, y = TotalCigsSmoked))
p3 <- p3 + theme_bw()
p3 <- p3 + geom_point(alpha = 1/20)
p3 <- p3 + stat_smooth(method = lm)
p3 <- p3 + scale_y_log10()
p3 <- p3 + labs(
title = "Log(y)"
, x = "Age"
, y = "Total Cigarettes Smoked per month"
)
#print(p3)
library(ggplot2)
p4 <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = Age, y = TotalCigsSmoked))
p4 <- p4 + theme_bw()
p4 <- p4 + geom_point(alpha = 1/20)
p4 <- p4 + stat_smooth(method = lm)
p4 <- p4 + scale_x_log10()
p4 <- p4 + scale_y_log10()
p4 <- p4 + labs(
title = "Log(x) and Log(y)"
, x = "Age"
, y = "Total Cigarettes Smoked per month"
)
#print(p4)
# grid.arrange() is a way to arrange several ggplot objects
library(gridExtra)
grid.arrange(grobs = list(p1, p2, p3, p4), nrow=2, top = "Total Cigarettes Smoked vs Age")
```

### 6.4.2 2. What happened to your data when you transformed it?

- Describe what happened to the relationship after each log transformation (compare transformed scale to original scale).

- Original: Same
- log(x): no improvement, age values became left skewed
- log(y): some improvement, cigs smoked became left skewed, but the regression line was closer to the center of the data.
- log(x) and log(y): some improvement, cigs smoked became left skewed, but the regression line was closer to the center of the data; age transformation doesn’t help.

- Choose the best scale for a linear relationship and explain why.

log(y) because the cigs smoked became slightly left skewed (better than the extreme right skewness from before), and there was no need for an age transformation.

- Does your relationship benefit from a logarithmic transformation? Say why or why not.

A little bit since the degree of skewness vertically around the regression line was reduced with the log(y) transformation.

##### 6.4.2.0.1 Extra plots

Here’s an example where transforming both \(x\) and \(y\) with the log makes a big improvement.

```
{r, fig.height = 8, fig.width = 8}
# IncomePersonal
#p <- p + scale_x_log10()
#p <- p + scale_y_log10()
library(ggplot2)
p1 <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = IncomePersonal, y = TotalCigsSmoked))
p1 <- p1 + theme_bw()
p1 <- p1 + geom_point(alpha = 1/20)
p1 <- p1 + stat_smooth(method = lm)
p1 <- p1 + labs(
title = "Original scale"
, x = "Personal Income"
, y = "Total Cigarettes Smoked per month"
)
#print(p1)
library(ggplot2)
p2 <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = IncomePersonal, y = TotalCigsSmoked))
p2 <- p2 + theme_bw()
p2 <- p2 + geom_point(alpha = 1/20)
p2 <- p2 + stat_smooth(method = lm)
p2 <- p2 + scale_x_log10()
p2 <- p2 + labs(
title = "Log(x)"
, x = "Personal Income"
, y = "Total Cigarettes Smoked per month"
)
#print(p2)
library(ggplot2)
p3 <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = IncomePersonal, y = TotalCigsSmoked))
p3 <- p3 + theme_bw()
p3 <- p3 + geom_point(alpha = 1/20)
p3 <- p3 + stat_smooth(method = lm)
p3 <- p3 + scale_y_log10()
p3 <- p3 + labs(
title = "Log(y)"
, x = "Personal Income"
, y = "Total Cigarettes Smoked per month"
)
#print(p3)
library(ggplot2)
p4 <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = IncomePersonal, y = TotalCigsSmoked))
p4 <- p4 + theme_bw()
p4 <- p4 + geom_point(alpha = 1/20)
p4 <- p4 + stat_smooth(method = lm)
p4 <- p4 + scale_x_log10()
p4 <- p4 + scale_y_log10()
p4 <- p4 + labs(
title = "Log(x) and Log(y)"
, x = "Personal Income"
, y = "Total Cigarettes Smoked per month"
)
#print(p4)
# grid.arrange() is a way to arrange several ggplot objects
library(gridExtra)
grid.arrange(grobs = list(p1, p2, p3, p4), nrow=2, top = "Total Cigarettes Smoked vs Personal Income")
```

## 6.5 Class 13, Correlation (separate worksheet)

## 6.6 Class 14, Categorical contingency tables (separate worksheet)

## 6.7 Class 15, Correlation and Categorical contingency tables

**Rubric**

- With your previous (or a new) bivariate scatter plot, calculate the correlation and interpret.
- (1 p) plot is repeated here or the plot is referenced an easy to find from a plot above,
- (1 p) correlation is calculated,
- (2 p) correlation is interpreted (direction, strength of LINEAR relationship).

- With your previous (or a new) two- or three-variable categorical plot, calculate conditional proportions and interpret.
- (1 p) frequency table of variables is given,
- (2 p) conditional proportion tables are calculated of the outcome variable conditional on one or two other variables,
- (1 p) a well-labeled plot of the proportion table is given,
- (2 p) the conditional proportions are interpreted and compared between conditions.

### 6.7.1 Correlation

```
{r, fig.height = 5, fig.width = 5}
library(ggplot2)
p <- ggplot(nesarc_sub, aes(x = Age, y = TotalCigsSmoked_log2))
p <- p + theme_bw()
p <- p + geom_point(alpha = 1/20)
p <- p + stat_smooth(method = lm)
p <- p + labs(
title = "Log2 Cigarettes smoked vs Age"
, x = "Age"
, y = "Log2 Total Cigarettes Smoked per month"
)
print(p)
```

```
<-
cor_A_C cor(
$Age
nesarc_sub$TotalCigsSmoked_log2
, nesarc_subuse = "complete.obs"
,
) cor_A_C
```

`[1] 0.1375481`

### 6.7.2 Interpretation of correlation

The correlation is 0.138, this is a positive moderately-weak linear relationship between Age and Log2 total cigarettes smoked per month, meaning that as people get older they tend slightly to increase the Log2 Total Number of Cigarettes smoked per month.

### 6.7.3 Contingency table

```
<-
tab_S_D_N %>%
nesarc_sub group_by(
Sex
, Depression
, NicotineDependence%>%
) summarise(
Frequency = n()
%>%
) mutate(
Proportion = round(Frequency / sum(Frequency), 3)
%>%
) ungroup()
# tab_S_D_N
%>%
tab_S_D_N ::kable() knitr
```

Sex | Depression | NicotineDependence | Frequency | Proportion |
---|---|---|---|---|

Female | No Depression | Nicotine Dependence | 1455 | 0.076 |

Female | No Depression | No Nicotine Dependence | 17706 | 0.924 |

Female | Yes Depression | Nicotine Dependence | 1117 | 0.206 |

Female | Yes Depression | No Nicotine Dependence | 4297 | 0.794 |

Male | No Depression | Nicotine Dependence | 1738 | 0.108 |

Male | No Depression | No Nicotine Dependence | 14355 | 0.892 |

Male | Yes Depression | Nicotine Dependence | 652 | 0.269 |

Male | Yes Depression | No Nicotine Dependence | 1773 | 0.731 |

```
library(ggplot2)
<- ggplot(data = tab_S_D_N, aes(x = Depression, y = Proportion, fill = forcats::fct_rev(NicotineDependence)))
p <- p + geom_hline(yintercept = c(0, 1), alpha = 1/4)
p <- p + geom_bar(stat="identity")
p <- p + theme_bw()
p <- p + labs(
p title = "Proportion of Nicotine Dependence by Depression status\nfor each Gender"
y = "Proportion of Nicotine Dependence"
, fill = "Nicotine Dependence"
,
)
```