---
title: "ADA1: Cumulative project file"
subtitle: "NESARC, Smoking and depression in adults"
author: "Erik Erhardt"
date: "`r format(Sys.time(), '%B %d, %Y')`"
output:
  html_document:
    toc: true
    number_sections: true
    toc_depth: 5
    code_folding: show
    #df_print: paged
    #df_print: kable
    #toc_float: true
    #collapsed: false
    #smooth_scroll: TRUE
    theme: cosmo #spacelab #yeti #united
    highlight: tango
  pdf_document:
    df_print: kable
    #latex_engine: xelatex
    #sansfont: IBM Plex Sans
    #classoption: landscape
fontsize: 12pt
geometry: margin=0.25in
always_allow_html: yes
---
----------------------------------------
# Document overview
This document is organized by Class number.
The worksheet assignments are indicated by the Tuesday and Thursday Class numbers.
Consider your readers (graders):
* organize the document clearly (use this document as an example)
* label minor sections under each day (use this document as an example)
* For each thing you do, always have these three parts:
1. Say what you're going to do and why.
2. Do it with code, and document your code.
3. Interpret the results.
## Global code options
```{R}
# I set some GLOBAL R chunk options here.
# (to hide this message add "echo=FALSE" to the code chunk options)
# In particular, see the fig.height and fig.width (in inches)
# and notes about the cache option.
knitr::opts_chunk$set(comment = NA, message = FALSE, warning = FALSE)
options(width = 100)  # width is an R option (console width), not a chunk option
knitr::opts_chunk$set(fig.align = "center", fig.height = 4, fig.width = 6)
# Note: The "cache=TRUE" option will save the computations of code chunks
# so R doesn't recompute everything every time you recompile.
# This can save _tons of time_ if you're working on a small section of code
# at the bottom of the document.
# Code chunks will be recompiled only if they are edited.
# The autodep=TRUE will also update dependent code chunks.
# A folder is created with the cache in it -- if you delete the folder, then
# all the code chunks will recompute again.
# ** If things are not working as expected, or I want to freshly compute everything,
# I delete the *_cache folder.
knitr::opts_chunk$set(cache = FALSE) #, autodep=TRUE) #$
```
## Document
### Naming
*Note: Each class, save this file with a new name,
updating the last two digits to the class number.
Then you'll have a record of your progress,
as well as of which files you turned in for grading.*
* `ADA1_ALL_05.Rmd`
* `ADA1_ALL_06.Rmd`
* `ADA1_ALL_07.Rmd` ...
The version I prefer uses a date in Year-Month-Day format, `YYYYMMDD`:
* `ADA1_ALL_20200903.Rmd`
* `ADA1_ALL_20200905.Rmd`
* `ADA1_ALL_20200910.Rmd` ...
### Structure
We will include all of our assignments together in this document to retain the
relevant information needed for subsequent assignments since our analysis is cumulative.
You will also have an opportunity to revisit previous parts to make changes or improvements,
such as updating your codebook, recoding variables, and improving tables and plots.
I've provided an initial planned organization of our
sections and subsections using the \# and \#\# symbols.
A table of contents is automatically generated by the "toc: true" option in the yaml header,
and the headings in the table of contents are clickable to jump down
to each (sub)section.
### Classes not appearing in this document
These class assignments are in their own documents: 09, 11, 13, 14, ...
----------------------------------------
# Research Questions
## Class 04, Personal Codebook
__Rubric__
1. (1 p) Is there a topic of interest?
2. (2 p) Are the variables relevant to a set of research questions?
3. (4 p) Are there at least 2 categorical and 2 numerical variables (at least 4 "data" variables)?
* 1 categorical variable with only 2 levels
* 1 categorical variable with at least 3 levels
* 2 numerical variables with many possible unique values
* More variables are welcome and you're likely to add to this later in the semester
4. (3 p) For each variable, is there a variable description, a data type, and coded value descriptions?
5. Compile this Rmd file to an html, print/save to pdf, and upload to UNM Learn.
### Topic and research questions
**Topic:**
Is there an association between nicotine dependence
and the frequency and quantity of smoking in adults?
The association may differ by ethnicity, age, gender, and other factors
(though we won't be able to test these additional associations until next semester in ADA2).
**Research questions:**
1. Is nicotine dependence [S3AQ10D] associated with smoking frequency [S3AQ3B1] and quantity [S3AQ3C1]?
2. Is nicotine dependence [S3AQ10D] associated with depression [S4AQ1]?
3. Is the association between nicotine dependence [S3AQ10D] and depression [S4AQ1] different by demographics, such as Education or Sex?
# Codebook
National Epidemiologic Survey on Alcohol and Related Conditions-III (NESARC-III)
* Codebook: https://statacumen.com/teach/ADA1/PDS_data/NESARC_W1_CodeBook.pdf
* Official site: https://www.niaaa.nih.gov/research/nesarc-iii
* Introduction: https://pubs.niaaa.nih.gov/publications/arh29-2/74-78.htm
```
Dataset: NESARC
Primary association: nicotine dependence vs frequency and quantity of smoking
Key:
RenamedVarName
VarName original in dataset
Variable description
Data type (Continuous, Discrete, Nominal, Ordinal)
Frequency ItemValue Description
ID
IDNUM
UNIQUE ID NUMBER WITH NO ALPHABETICS
Nominal
43093 1-43093. Unique Identification number
Sex
SEX
SEX
Nominal
18518 1. Male
24575 2. Female
Age
AGE
AGE
Continuous
43079 18-97. Age in years
14 98. 98 years or older
SmokingStatus
CHECK321
CIGARETTE SMOKING STATUS
Nominal
9913 1. Smoked cigarettes in the past 12 months
8078 2. Smoked cigarettes prior to the last 12 months
22 9. Unknown
25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
NicotineDependence
TAB12MDX
NICOTINE DEPENDENCE IN THE LAST 12 MONTHS
Nominal
38131 0. No nicotine dependence
4962 1. Nicotine dependence
SmokingFreq
S3AQ3B1
USUAL FREQUENCY WHEN SMOKED CIGARETTES
Ordinal
14836 1. Every day
460 2. 5 to 6 Day(s) a week
687 3. 3 to 4 Day(s) a week
747 4. 1 to 2 Day(s) a week
409 5. 2 to 3 Day(s) a month
772 6. Once a month or less
102 9. Unknown
25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
DailyCigsSmoked
S3AQ3C1
USUAL QUANTITY WHEN SMOKED CIGARETTES
Discrete
17751 1-98. Cigarette(s)
262 99. Unknown
25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
Ethnicity
ETHRACE2A
IMPUTED RACE/ETHNICITY
Nominal
24507 1. White, Not Hispanic or Latino
8245 2. Black, Not Hispanic or Latino
701 3. American Indian/Alaska Native, Not Hispanic or Latino
1332 4. Asian/Native Hawaiian/Pacific Islander, Not Hispanic or Latino
8308 5. Hispanic or Latino
Depression
MAJORDEPLIFE
MAJOR DEPRESSION - LIFETIME (NON-HIERARCHICAL)
Nominal
35254 0. No
7839 1. Yes
IncomePersonal
S1Q10A
TOTAL PERSONAL INCOME IN LAST 12 MONTHS
Continuous
43093 0-3000000. Income in dollars
Height_ft
S1Q24FT
HEIGHT: FEET
Continuous
42363 4-7. Feet
730 99. Unknown
* change 99 to NA
Height_in
S1Q24IN
HEIGHT: INCHES
Continuous
3572 0. None
38760 1-11. Inches
761 99. Unknown
* change 99 to NA
Weight_lb
S1Q24LB
WEIGHT: POUNDS
Continuous
41717 62-500. Pounds
1376 999. Unknown
* change 999 to NA
```
Additional variables were created from the original variables:
```
CREATED VARIABLES
DaysSmoke
estimated number of days per month a subject smokes
Continuous (due to the way it is estimated; assumes 30 days per month using SmokingFreq)
1-30.
TotalCigsSmoked
estimated number of cigarettes per month a subject smokes (DaysSmoke * DailyCigsSmoked)
Continuous
0-large.
TotalCigsSmoked_smokers
estimated number of cigarettes per month a subject smokes (DaysSmoke * DailyCigsSmoked)
Smokers only (replaced 0s with NAs)
Continuous
1-large.
CigsSmokedFac
categories of TotalCigsSmoked
Ordinal
No (0), Lowest (1-6), Low (7-30), Medium (31-300), High (301+)
SmokingFreq3
collapsed levels of smoking frequency (daily, weekly, monthly, plus never)
Ordinal
from SmokingFreq
1. daily = 1. Every day, 2. 5 to 6 Day(s) a week, 3. 3 to 4 Day(s) a week
2. weekly = 4. 1 to 2 Day(s) a week, 5. 2 to 3 Day(s) a month
3. monthly = 6. Once a month or less
4. Never = 7. Never
Height_inches
Total height in inches
Height_ft * 12 + Height_in
```
----------------------------------------
# Data Management
## Class 05, Data subset and numerical summaries
__Rubric__
1. (4 p) The data are loaded and a data.frame subset is created by selecting only the variables in the personal codebook.
* Scroll down to sections labeled "(Class 05)".
2. (1 p) Output confirms the subset is correct (e.g., using `dim()` and `str()`).
3. (3 p) Rename your variables to descriptive names (e.g., from "S3AQ3B1" to "SmokingFreq").
* Scroll down to sections labeled "(Class 05)".
4. (2 p) Provide numerical summaries for all variables (e.g., using `summary()`).
* Scroll down to sections labeled "(Class 05)".
### Data subset (Class 05)
First, the data file is loaded into the R workspace.
```{R}
# data analysis packages
library(tidyverse) # Data manipulation and visualization suite
library(lubridate) # Dates
library(erikmisc) # Helpful functions
## 1. Download the ".RData" file for your dataset into your ADA Folder.
## 2. Use the load() statement for the dataset you want to use.
# read data example
load("NESARC.RData")
dim(NESARC)
```
```{r}
# variables to include in our data subset
nesarc_sub <-
NESARC %>%
dplyr::select(
IDNUM
, SEX
, AGE
, CHECK321
, TAB12MDX
, S3AQ3B1
, S3AQ3C1
, ETHRACE2A
, MAJORDEPLIFE
, S1Q10A
, S1Q24FT
, S1Q24IN
, S1Q24LB
)
# Check size and structure of data subset.
# Note that categorical variables are already `Factor` variables, but the levels do not have meaningful labels, yet.
# size of subset
dim(nesarc_sub)
# structure of subset
str(nesarc_sub)
```
### Renaming Variables (Class 05)
```{r}
nesarc_sub <-
nesarc_sub %>%
dplyr::rename(
ID = IDNUM
, Sex = SEX
, Age = AGE
, SmokingStatus = CHECK321
, NicotineDependence = TAB12MDX
, SmokingFreq = S3AQ3B1
, DailyCigsSmoked = S3AQ3C1
, Ethnicity = ETHRACE2A
, Depression = MAJORDEPLIFE
, IncomePersonal = S1Q10A
, Height_ft = S1Q24FT
, Height_in = S1Q24IN
, Weight_lb = S1Q24LB
)
str(nesarc_sub)
```
### Coding missing values (Class 06)
There are two steps.
The first step is to recode any existing `NA`s to actual values, if necessary.
The method for doing this differs for numeric and categorical variables.
The second step is to recode any coded missing values, such as 9s or 99s, as actual `NA`.
#### Coding `NA`s as meaningful "missing"
First step: the existing blank values, read in as `NA`, mean "never",
and "never" has a meaning different from "missing".
For each variable we need to decide what "never" means
and code it appropriately.
##### `NA`s recoded as numeric
The easiest is `DailyCigsSmoked` where `NA` means 0 cigarettes.
We replace `NA` with 0.
The function `is.na()` identifies each observation that "is an `NA`".
I've updated the codebook to indicate that the original `NA` values were changed.
```
S3AQ3C1 (DailyCigsSmoked)
USUAL QUANTITY WHEN SMOKED CIGARETTES
Discrete
17751 1-98. Cigarette(s)
262 99. Unknown
25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
* change to 0 Cigarette(s)
```
```{R}
table(nesarc_sub$DailyCigsSmoked)
nesarc_sub <-
nesarc_sub %>%
replace_na(
# replace NA with new numeric value
list(
DailyCigsSmoked = 0
)
)
table(nesarc_sub$DailyCigsSmoked)
```
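As a base-R sketch of the same idea (a toy vector, not the NESARC data): `is.na()` flags the missing positions, and we assign 0 to exactly those positions. This is equivalent to the `replace_na()` step above.

```{r}
# toy example: replace NA with 0 using is.na() (base-R equivalent of replace_na())
x <- c(1, NA, 3, NA)
x[is.na(x)] <- 0  # is.na(x) is TRUE where x is missing
x                 # 1 0 3 0
```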
##### `NA`s recoded as categorical
The next two variables require defining a new category level.
For both `SmokingStatus` and `SmokingFreq`, a value of `NA` means "never",
so we need to create a new "Never" category and put in order
with the rest of the levels.
I've updated the codebook to indicate that the original `NA` values were changed.
```
CHECK321 (SmokingStatus)
CIGARETTE SMOKING STATUS
Nominal
9913 1. Smoked cigarettes in the past 12 months
8078 2. Smoked cigarettes prior to the last 12 months
22 9. Unknown
25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
* change to 3. Never
S3AQ3B1 (SmokingFreq)
USUAL FREQUENCY WHEN SMOKED CIGARETTES
Ordinal
14836 1. Every day
460 2. 5 to 6 Day(s) a week
687 3. 3 to 4 Day(s) a week
747 4. 1 to 2 Day(s) a week
409 5. 2 to 3 Day(s) a month
772 6. Once a month or less
102 9. Unknown
25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
* change to 7. Never
```
```{R}
table(nesarc_sub$SmokingStatus )
table(nesarc_sub$SmokingFreq )
nesarc_sub <-
nesarc_sub %>%
mutate(
# first change the "factor" variables to "character" to add a new level
SmokingStatus = as.character(SmokingStatus)
, SmokingFreq = as.character(SmokingFreq )
) %>%
replace_na(
# replace NA with new category values
list(
SmokingStatus = "3"
, SmokingFreq = "7"
)
)
table(nesarc_sub$SmokingStatus )
table(nesarc_sub$SmokingFreq )
```
#### Coding 9s and 99s as `NA`s
Second step: replace the coded missing values, such as 9s, 99s, and 999s, with `NA`,
then consider whether to remove the remaining records with missing values.
Not all of these variables need this operation, since the previous subset commands
happened to remove the coded missing values already.
However, I'm going to run each as an extra precaution.
After all, if we later make different upstream data decisions,
we could end up with coded missing values at this point that
would be treated as real data without this step!
```{R}
table(nesarc_sub$SmokingStatus )
table(nesarc_sub$SmokingFreq )
table(nesarc_sub$DailyCigsSmoked)
nesarc_sub <-
nesarc_sub %>%
mutate(
# replace missing values
SmokingStatus = replace(SmokingStatus , SmokingStatus %in% c("9"), NA)
, SmokingFreq = replace(SmokingFreq , SmokingFreq %in% c("9"), NA)
, DailyCigsSmoked = replace(DailyCigsSmoked, DailyCigsSmoked %in% c( 99), NA)
, Height_ft = replace(Height_ft , Height_ft %in% c( 99), NA)
, Height_in = replace(Height_in , Height_in %in% c( 99), NA)
, Weight_lb = replace(Weight_lb , Weight_lb %in% c(999), NA)
# drop unused factor levels
, SmokingStatus = fct_drop(SmokingStatus)
, SmokingFreq = fct_drop(SmokingFreq )
)
table(nesarc_sub$SmokingStatus )
table(nesarc_sub$SmokingFreq )
table(nesarc_sub$DailyCigsSmoked)
summary(nesarc_sub)
```
#### Removing rows with missing values (don't do this)
Finally, consider using `na.omit()` to remove any records with missing values
if it won't cause issues with analysis.
The command `na.omit()` removes a row if any columns have an `NA` in that row.
This can have __drastic__ consequences.
For example, imagine that you have 7 variables and 6 variables have complete data,
but the 7th variable is about three-quarters `NA`s!
Then `na.omit()` will drop three-quarters of your data,
even though most of your variables are complete!
There's only one advantage I can think of for running `na.omit()`,
and that's if you have a large dataset and only a few observations have a few `NA`s.
You can't analyze data with `NA`s,
since any calculation involving `NA` results in an `NA`.
Therefore imagine two scenarios:
(1) An analysis with Variables 1 and 2, each with a few `NA` distributed throughout;
first the observations with `NA` for either Variable 1 or 2 will be removed, then
the analysis will proceed.
(2) A similar analysis, but now with Variables 1 and 3, but Variable 3 has `NA`
for different observations than Variable 2 --- therefore, different observations
will be removed before analysis.
In summary, these two analyses will include different observations from the dataset.
Therefore, by running `na.omit()` you will start with a dataset with no `NA`s,
thus every analysis will be of exactly the same set of data
(at the cost of not using all the possible data for each individual analysis).
Below, I do __not run `na.omit()`__ since I'm concerned about having lots of `NA`s for some variables.
```{R}
dim(nesarc_sub)
# remove rows with missing values
#nesarc_sub <- na.omit(nesarc_sub)
dim(nesarc_sub)
summary(nesarc_sub)
```
Now we (would) have a dataset with no missing values.
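A toy sketch (not the NESARC data) of both points above: `NA`s propagate through calculations, and `na.omit()` drops a row if *any* column in that row is `NA`, so one mostly-missing column can discard most of the data.

```{r}
# NA propagates through calculations unless explicitly removed
mean(c(1, 2, NA))               # NA
mean(c(1, 2, NA), na.rm = TRUE) # 1.5

# one mostly-missing column makes na.omit() drop most rows
toy <- data.frame(
  x = 1:8                            # complete
, y = c(NA, 2, 3, 4, 5, 6, 7, 8)     # one NA
, z = c(1, NA, NA, NA, NA, NA, 2, 3) # mostly NA
)
nrow(toy)          # 8
nrow(na.omit(toy)) # 2 -- only rows 7 and 8 are complete in every column
```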
### Creating new variables (Class 06+)
#### From categories to numeric
`DaysSmoke` estimates the days per month a subject smokes by mapping each level of `SmokingFreq` (a factor with 7 levels) to the midpoint of the range of days that level represents, assuming 30 days per month.
```{r}
nesarc_sub <-
nesarc_sub %>%
mutate(
DaysSmoke =
case_when(
SmokingFreq == 1 ~ 30 # 1. Every day
, SmokingFreq == 2 ~ 4 * 5.5 # 2. 5 to 6 Day(s) a week
, SmokingFreq == 3 ~ 4 * 3.5 # 3. 3 to 4 Day(s) a week
, SmokingFreq == 4 ~ 4 * 1.5 # 4. 1 to 2 Day(s) a week
, SmokingFreq == 5 ~ 2.5 # 5. 2 to 3 Day(s) a month
, SmokingFreq == 6 ~ 1 # 6. Once a month or less
, SmokingFreq == 7 ~ 0 # 7. Never
)
)
summary(nesarc_sub$DaysSmoke)
```
#### From numeric to numeric
`Height_inches` is the height of a person in inches. Since the data was collected in two variables (`Height_ft` and `Height_in`) we need to add them together for our new variable.
```{r}
nesarc_sub <-
nesarc_sub %>%
mutate(
Height_inches = Height_ft * 12 + Height_in
)
summary(nesarc_sub$Height_inches)
```
The variable `TotalCigsSmoked` estimates the number of cigarettes a subject smokes per month by multiplying `DaysSmoke` by `DailyCigsSmoked` (the usual quantity smoked per day).
```{r}
nesarc_sub <-
nesarc_sub %>%
mutate(
DaysSmoke_log2 = log2(DaysSmoke)
, TotalCigsSmoked = DaysSmoke * DailyCigsSmoked
, TotalCigsSmoked_log2 = ifelse(TotalCigsSmoked > 0, log2(TotalCigsSmoked), NA)
, TotalCigsSmoked_smokers = ifelse(TotalCigsSmoked > 0, TotalCigsSmoked, NA)
, TotalCigsSmoked_smokers_log2 = TotalCigsSmoked_log2
)
summary(nesarc_sub$TotalCigsSmoked)
summary(nesarc_sub$TotalCigsSmoked_smokers)
```
#### From numeric to categories based on ranges
```{r}
nesarc_sub <-
nesarc_sub %>%
mutate(
CigsSmokedFac =
case_when(
TotalCigsSmoked == 0 ~ "No smoking (0)"
, (TotalCigsSmoked > 0) & (TotalCigsSmoked <= 6) ~ "Lowest smoking (1-6)"
, (TotalCigsSmoked > 6) & (TotalCigsSmoked <= 30) ~ "Low smoking (7-30)"
, (TotalCigsSmoked > 30) & (TotalCigsSmoked <= 300) ~ "Medium smoking (31-300)"
, (TotalCigsSmoked > 300) ~ "High smoking (301+)"
)
, CigsSmokedFac =
factor(
CigsSmokedFac
, levels = c(
"No smoking (0)"
, "Lowest smoking (1-6)"
, "Low smoking (7-30)"
, "Medium smoking (31-300)"
, "High smoking (301+)"
)
)
)
summary(nesarc_sub$TotalCigsSmoked)
summary(nesarc_sub$CigsSmokedFac)
table(nesarc_sub$CigsSmokedFac)
```
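If you instead wanted categories based on actual quantiles, a sketch using `quantile()` and `cut()` among smokers might look like the following (assuming `nesarc_sub` and `TotalCigsSmoked` as created above); the chunk above uses fixed, interpretable cutpoints instead.

```{r}
# sketch: quartile categories among smokers using quantile() + cut()
smoked <- nesarc_sub$TotalCigsSmoked[nesarc_sub$TotalCigsSmoked > 0]
breaks <- quantile(smoked, probs = seq(0, 1, by = 0.25), na.rm = TRUE)
breaks <- unique(breaks)  # cut() requires distinct breakpoints if quantiles tie
quartile_fac <- cut(smoked, breaks = breaks, include.lowest = TRUE)
table(quartile_fac)
```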
#### From many categories to a few
The variable `SmokingFreq3` collapses `SmokingFreq` from 6 categories down to 3 (plus "Never"). This is an example to show you how to do this with your own variables. Some of you will have categorical variables with a dozen or so categories; having so many categories makes interpretation very difficult, and fewer is easier.
```{r}
## SmokingFreq3 variable
# notice that I'm doing this before I've assigned the labels
# if you do this after assigning labels (with the factor() command), then you'll want to use "as.numeric()" as below
# inspect original variable
table(nesarc_sub$SmokingFreq)
# Note: it is sometimes easier to use the numeric "levels" of a factor variable, than the character "labels"
# initialize new variable with NAs
nesarc_sub <-
nesarc_sub %>%
mutate(
SmokingFreq3 =
fct_collapse(
SmokingFreq
, "1" = c("1", "2", "3") # 1. daily
, "2" = c("4", "5") # 2. weekly
, "3" = "6" # 3. monthly
, "4" = "7" # 7. never
)
)
# new variable
table(nesarc_sub$SmokingFreq3)
# we will label the new factor variable below, with the others
```
#### Working with Dates
How to calculate the duration between two dates.
(You probably won't need to use this for your project.)
Dates can be tricky in R. The `lubridate` package makes them much easier.
Below I'm going to create a variable for the person's age in years
based on the difference between the person's date of birth and the interview date.
There is an `AGE` variable which I'm already using,
but if there wasn't, or I wanted to be more precise about their age (such as, 24.34 years old),
then this is a way to do it.
I'm going to create a separate data frame just for this example to show how to do it;
if you needed these variables, they should be part of your original subset above.
Below, the "C" variables (`CDAY`) are the interview day, month, and year,
and the "DOB" variables (`DOBD`) are the date of birth day, month, and year.
```{R}
dat_sub <-
NESARC %>%
dplyr::select(
AGE
, CDAY
, CMON
, CYEAR
, DOBD
, DOBM
, DOBY
)
str(dat_sub)
```
Combine the date fields together, then convert them into a date object.
I'll do this for both the interview date and the date of birth.
I print the head (first 6 observations) after all the operations are complete.
```{R}
dat_sub <-
dat_sub %>%
mutate(
## interview date
# combine day, month, and year into one text string
cdate_text = paste(CDAY, CMON, CYEAR)  # refer to columns directly inside mutate()
# create date object by interpreting the day, month, year with dmy()
, cdate_date = dmy(cdate_text)
## date of birth
# combine day, month, and year into one text string
, dob_text = paste(DOBD, DOBM, DOBY)
# create date object by interpreting the day, month, year with dmy()
# if any failed to parse, it's because a day, month, or year is out of range
# such as "99" for missing values or refused to answer
# in these cases, the date is NA (missing), which is what we want
, dob_date = dmy(dob_text)
)
head(dat_sub)
```
Now that we have the interview date and date of birth as date objects,
we calculate the difference to give the person's age at the time of the interview.
Because the difference of dates is in the unit "days", we convert to years.
```{R}
dat_sub <-
dat_sub %>%
mutate(
age_days = cdate_date - dob_date
, age_years = as.numeric(age_days / 365.25) # change from "difftime" to "numeric"
# difference between the age we calculated and the age in the original dataset
, age_diff = age_years - AGE
)
head(dat_sub)
```
Finally,
keep the new age variable in your data frame,
drop (unselect) the variables you don't need anymore,
and you're ready to go!
```{R}
# drop the original ones you don't need anymore
# by select(-VAR), it excludes the VAR column
dat_sub <-
dat_sub %>%
select(
-c(CDAY, CMON, CYEAR, DOBD, DOBM, DOBY)
# -c(cdate_text, cdate_date, dob_text, dob_date)
, -c(age_days, age_diff)
)
str(dat_sub)
```
Now you've got an age (or other date object) you can use in your project.
(For this example, I'm going to delete the `dat_sub` object
since it was just for this demonstration.)
```{R}
rm(dat_sub)
```
#### Review results of new variables
```{r}
summary(nesarc_sub)
```
### Labeling Categorical variable levels (Class 06)
Look at summary of dataset to see the state of the labeling before we make changes.
```{R}
summary(nesarc_sub)
```
Informative labels are given to the factor levels.
The order of the levels is also rearranged for the variables
`SmokingFreq`, `NicotineDependence`, and `Sex`.
```{R}
# Label factors with meaningful names in the same order as the codebook
nesarc_sub$Depression <-
factor(
nesarc_sub$Depression
, labels = c("No Depression"
, "Yes Depression"
)
)
nesarc_sub$SmokingStatus <-
factor(
nesarc_sub$SmokingStatus
, labels = c("Smoked past 12 months"
, "Smoked prior to 12 months"
, "Never smoked"
)
)
# Specify the order of the levels to be different from codebook order
nesarc_sub$Sex <-
factor(nesarc_sub$Sex
, levels = c( 2
, 1
)
, labels = c( "Female"
, "Male"
)
)
# check ordering with a frequency table
table(nesarc_sub$Sex)
nesarc_sub$NicotineDependence <-
factor(nesarc_sub$NicotineDependence
, levels = c( 1
, 0
)
, labels = c( "Nicotine Dependence"
, "No Nicotine Dependence"
)
)
# Remember to include the new category level from the original `NA` values.
nesarc_sub$SmokingFreq <-
factor(nesarc_sub$SmokingFreq
, levels = c( 7
, 6
, 5
, 4
, 3
, 2
, 1
)
, labels = c( "Never"
, "Once a month or less"
, "2 to 3 Days/month"
, "1 to 2 Days/week"
, "3 to 4 Days/week"
, "5 to 6 Days/week"
, "Every Day"
)
)
nesarc_sub$Ethnicity <-
factor(nesarc_sub$Ethnicity
# shorter labels are helpful for plots
, labels = c( "Cauc"
, "AfAm"
, "NaAm"
, "Asia"
, "Hisp"
)
# , labels = c("Caucasian"
# , "African American"
# , "Native American"
# , "Asian"
# , "Hispanic"
# )
)
nesarc_sub$SmokingFreq3 <-
factor(nesarc_sub$SmokingFreq3
, levels = c( 4
, 3
, 2
, 1
)
, labels = c( "Never"
, "Monthly"
, "Weekly"
, "Daily"
)
)
summary(nesarc_sub)
```
### Data subset rows
## Data is complete (Class 06)
### Plot entire dataset, show missing values
```{r}
ggplot_missing(nesarc_sub)
```
### Numerical summaries to assess correctness (Class 05)
```{r}
summary(nesarc_sub)
```
--------------------------------------------------------------------------------
# Graphing and Tabulating
## Class 06, Plotting univariate
__Rubric__
1. (3 p) For one categorical variable, a barplot is plotted with axis labels and a title.
Interpret the plot: describe the relationship between categories you observe.
2. (3 p) For one numerical variable, a histogram or boxplot is plotted with axis labels and a title.
Interpret the plot: describe the distribution (shape, center, spread, outliers).
3. (2 p) Code missing variables, remove records with missing values,
indicate with R output that this was done correctly (e.g., `str()`, `dim()`, `summary()`).
* Scroll up to sections labeled "(Class 06)".
4. (2 p) Label levels of factor variables.
* Scroll up to sections labeled "(Class 06)".
## Categorical variables
### Tables for categorical variables
```{r}
# table() produces a one-variable frequency table
table(nesarc_sub$NicotineDependence)
# proportions available by passing these frequencies to prop.table()
table(nesarc_sub$NicotineDependence) %>% prop.table()
tab_nicdep <-
nesarc_sub %>%
group_by(
NicotineDependence
) %>%
summarize(
n = n()
#, .groups = "drop_last"
) %>%
mutate(
prop = round(n / sum(n), 3)
) %>%
#arrange(
# desc(n)
#) %>%
ungroup()
tab_nicdep
```
### Graphing frequency tables
```{r}
library(ggplot2)
p1 <- ggplot(data = nesarc_sub, aes(x = NicotineDependence))
p1 <- p1 + geom_bar()
p1 <- p1 + labs(x = "Nicotine Dependence"
, y = "Total Number"
, title = "Nicotine Dependence for Adults"
)
print(p1)
```
__Interpretation:__
Among adults
(n = `r table(nesarc_sub$NicotineDependence) %>% sum()`),
there are about 8 times more people (88%) who are not nicotine dependent
compared to those who are (12%).
## Numeric variables
### Graphing numeric variables
```{r}
p <- ggplot(data = nesarc_sub, aes(x = DailyCigsSmoked))
p <- p + geom_histogram(aes(y = ..density..), boundary = 0, binwidth = 5)
#p <- p + geom_rug()
p <- p + geom_density(alpha = 0.2, fill = "gray50", adjust = 4)
p <- p + labs(x = "Usual Cigarettes Smoked per Day"
, y = "Density"
, title = "Daily cigarettes smoked by adults"
)
p <- p + theme_bw()
print(p)
```
```{r}
summary(nesarc_sub$DailyCigsSmoked)
fivenum(nesarc_sub$DailyCigsSmoked)
median(nesarc_sub$DailyCigsSmoked, na.rm = TRUE)
IQR(nesarc_sub$DailyCigsSmoked, na.rm = TRUE)
```
__Interpretation:__
Among smokers, the number of cigarettes smoked per day is right skewed (shape)
with a median (center) of 15,
and the IQR (spread: interquartile range, middle 50%) of the distribution is 14.
There are two distinct modes (peaks) in the distribution,
the first between 0 and 10 cigarettes (half a pack) and the second at 20 (a full pack).
There are several outlying values (outliers) greater than 40 cigarettes per day
that are possibly overestimates from the people responding.
----------------------------------------
## Class 07, Plotting bivariate, numeric response
__Rubric__
1. Each of the following (2 p for plot, 2 p for labeled axes and title, 1 p for interpretation):
1. Scatter plot (for regression): $x$ = numerical, $y$ = numerical, include axis labels and a title.
Interpret the plot: describe the relationship.
2. Box plots (for ANOVA): $x$ = categorical, $y$ = numerical, include axis labels and a title.
Interpret the plot: describe the relationship.
### Scatter plot (for regression): x = numerical, y = numerical
__Interpretation:__
The relationship between `Age` and `TotalCigsSmoked`, among people who smoke, is very weakly positive, but does not resemble a straight line.
`TotalCigsSmoked` is right-skewed which results in many extreme values above the regression line.
```{r}
library(ggplot2)
p <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = Age, y = TotalCigsSmoked))
#p <- p + geom_point(alpha = 1/10)
p <- p + geom_jitter(width = 0.25, height = 5, alpha = 1/10)
p <- p + geom_smooth()
p <- p + stat_smooth(method = lm, colour = "red")
p <- p + labs(
title = "Total Cigarettes Smoked vs Age"
#, x =
, y = "Total Cigarettes Smoked per month"
, caption = "Key: Blue line is GAM smoother, Red line is simple linear regression.\nFiltered to include only smokers (cig > 0)."
)
print(p)
```
An example of height and weight.
```{r}
library(ggplot2)
p <- ggplot(nesarc_sub, aes(x = Height_inches, y = Weight_lb))
p <- p + geom_jitter(height = 0.5, alpha = 1/10)
p <- p + stat_smooth(method = lm)
p <- p + labs(title = "Weight vs Height")
print(p)
```
### Box plots (for ANOVA): x = categorical, y = numerical
__Interpretation:__
For NESARC participants who smoke (cig > 0), those who are depressed have a slightly higher mean number of cigarettes per month (red diamond) than those who are not depressed.
Total cigarettes smoked per month is right skewed so there are many people who smoke very little but few people who smoke a lot, which increases the mean beyond the median.
```{r}
library(ggplot2)
p <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = Depression, y = TotalCigsSmoked))
p <- p + geom_boxplot(width = 0.5, alpha = 0.5)
p <- p + geom_jitter(position = position_jitter(width = 0.1), alpha = 1/100)
# diamond at mean for each group
p <- p + stat_summary(fun = mean, geom = "point", shape = 18, size = 6,
colour = "red", alpha = 0.8)  # fun= replaces the deprecated fun.y=
p <- p + labs(
title = "Total Cigarettes Smoked by Depression"
, x = "Depression status"
, y = "Total Cigarettes Smoked per month"
, caption = "Red diamond is the mean.\nFiltered to include only smokers (cig > 0)."
)
print(p)
```
## Class 08, Plotting bivariate, categorical response
__Rubric__
1. Each of the following (2 p for plot, 2 p for labeled axes and title, 1 p for interpretation):
1. Mosaic plot or bivariate bar plots (for contingency tables): $x$ = categorical, $y$ = categorical, include axis labels and a title.
Interpret the plot: describe the relationship.
2. Logistic scatter plot (for logistic regression): $x$ = numerical, $y$ = categorical (binary), include axis labels and a title.
Interpret the plot: describe the relationship.
### Mosaic plot or bivariate bar plots (for contingency tables): x = categorical, y = categorical
```{r}
tab_dep_nic <-
nesarc_sub %>%
group_by(
Depression, NicotineDependence
) %>%
summarize(
n = n()
) %>%
mutate(
prop = round(n / sum(n), 3)
) %>%
ungroup()
tab_dep_nic
tab_nic_dep <-
nesarc_sub %>%
group_by(
NicotineDependence, Depression
) %>%
summarize(
n = n()
) %>%
mutate(
prop = round(n / sum(n), 3)
) %>%
ungroup()
tab_nic_dep
```
__Reversing color order:__ It is sometimes preferable to reverse the order of your color-filled variable to read the more interesting proportions from the bottom of the plot up the proportion scale. Use the function `forcats::fct_rev()` on your `fill=` variable.
```{r, fig.height = 3, fig.width = 5}
library(ggplot2)
p <- ggplot(data = nesarc_sub, aes(x = Depression, fill = forcats::fct_rev(NicotineDependence)))
p <- p + geom_bar(position = "fill")
p <- p + theme_bw()
p <- p + labs(x = "Depression status"
, y = "Proportion"
, title = "Proportion of adults with and without\nnicotine dependence by depression status"
)
# the legend title can be modified, if desired (try this line)
p <- p + scale_fill_discrete(name = "Nicotine Dependence\nStatus")
print(p)
```
__Interpretation:__
When a person has depression, they are more than twice as likely to have nicotine dependence (22.6%) as those without depression (9.1%).
### Logistic scatter plot (for logistic regression): x = numerical, y = categorical (binary)
```{r}
table(nesarc_sub$NicotineDependence)
# if "Nicotine Dependence", then code it as a 1, otherwise code as a 0
nesarc_sub$NicotineDependence01 <- ifelse(nesarc_sub$NicotineDependence == "Nicotine Dependence", 1, 0)
# check that the frequencies are correct before and after
table(nesarc_sub$NicotineDependence01)
# # Note, sometimes it is necessary to make your variable a character type first.
# # If the code above doesn't produce both 0s and 1s, then try this:
# nesarc_sub$NicotineDependence01 <- ifelse(as.character(nesarc_sub$NicotineDependence) == "Nicotine Dependence", 1, 0)
```
```{r, fig.height = 4, fig.width = 8}
library(ggplot2)
p <- ggplot(nesarc_sub, aes(x = TotalCigsSmoked, y = NicotineDependence01))
p <- p + theme_bw()
p <- p + geom_jitter(position = position_jitter(height = 0.1), alpha = 1/4)
# overlay a sigmoidal curve to indicate direction of relationship
# As of ggplot2 2.1.0, need to use this binomial_smooth() function:
binomial_smooth <- function(...) {
geom_smooth(method = "glm", method.args = list(family = "binomial"), ...)
}
p <- p + binomial_smooth() # old way: stat_smooth(method="glm", family="binomial", se=FALSE)
p <- p + scale_y_continuous(breaks = seq(0, 1, by=0.2), minor_breaks = seq(0, 1, by=0.1))
p <- p + labs(x = "Estimated Cigarettes Smoked per Month"
, y = "Nicotine Dependence (1=Yes, 0=No)"
, title = "Likelihood of Nicotine Dependence\nincreases with number of cigarettes smoked"
)
print(p)
```
__Interpretation:__ The probability of someone being nicotine dependent increases as the number of cigarettes smoked per month increases. The logistic curve model probably doesn't apply at (or near) 0 cigarettes.
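The increasing curve above is a logistic regression fit. As a minimal sketch of how that works, here is a simulated example (toy `x` and `y`, not the NESARC variables) where `glm()` recovers an increasing probability:

```{r}
# Logistic regression sketch on simulated data:
# P(y = 1) increases with x when the slope is positive.
set.seed(42)
x      <- runif(500, 0, 600)               # toy predictor
p_true <- plogis(-2 + 0.01 * x)            # true increasing probability
y      <- rbinom(500, size = 1, prob = p_true)
fit    <- glm(y ~ x, family = binomial)
coef(fit)  # the slope is on the log-odds scale
# predicted probabilities rise with x:
predict(fit, newdata = data.frame(x = c(0, 300, 600)), type = "response")
```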
# Statistical methods
## Class 09, Simple linear regression (separate worksheet)
## Class 10, Simple linear regression
__Rubric__
1. With your previous (or new) bivariate scatter plot, add a regression line.
* (2 p) plot with regression line,
* (1 p) label axes and title.
2. Use `lm()` to fit the linear regression and interpret slope and $R^2$ (R-squared) values.
* (2 p) lm `summary()` table is presented,
* (2 p) slope is interpreted with respect to a per-unit increase of the $x$ variable
in the context of the variables in the plot,
* (2 p) $R^2$ is interpreted in a sentence.
3. (1 p) Interpret the intercept. Does it make sense in the context of your study?
### 1. Scatter plot, add a regression line.
```{r}
library(ggplot2)
p <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = Age, y = TotalCigsSmoked))
#p <- p + geom_point(alpha = 1/10)
p <- p + geom_jitter(width = 0.25, height = 5, alpha = 1/10)
#p <- p + geom_smooth()
p <- p + stat_smooth(method = lm, fullrange = TRUE)
p <- p + labs(
title = "Total Cigarettes Smoked vs Age"
, x = "Age (years)"
, y = "Total Cigarettes Smoked per month"
, caption = "Key: Blue line is simple linear regression.\nFiltered to include only smokers (cig > 0)."
)
p <- p + xlim(c(0, NA))
print(p)
```
### 2. Fit the linear regression, interpret slope and $R^2$ (R-squared) values
```{r}
# fit the simple linear regression model
lm_TCS_Age <-
lm(
formula = TotalCigsSmoked ~ Age
# filter to smokers (cig > 0) to match the plot above
, data = nesarc_sub %>% filter(TotalCigsSmoked > 0)
)
# use summary() for parameter estimates (slope, intercept) and other summaries
summary(lm_TCS_Age)
```
* __Slope__:
For every 1 year increase in Age,
we expect an increase of `r lm_TCS_Age$coefficients[2] %>% signif(4)`
total cigarettes smoked per month.
* __$R^2$__:
The proportion of variance explained by the regression model, compared to the grand mean, is $R^2 = `r summary(lm_TCS_Age)$r.squared %>% signif(3)`$, which is very small.
### 3. Interpret the intercept. Does it make sense?
For someone 0 years old, the expected number of cigarettes smoked per month is `r lm_TCS_Age$coefficients[1] %>% signif(4)`.
This does not make sense since newborns do not smoke and because this is a large extrapolation from the data.
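The point about extrapolation can be seen directly from the model: the intercept is the fitted line evaluated at $x = 0$. A minimal sketch on toy data (not the NESARC variables):

```{r}
# The intercept equals the model's prediction at x = 0,
# which here is far outside the observed x range (extrapolation).
set.seed(11)
dat   <- data.frame(x = 20:60)
dat$y <- 100 + 2 * dat$x + rnorm(nrow(dat), sd = 10)  # toy data
fit   <- lm(y ~ x, data = dat)
coef(fit)[1]                               # the intercept
predict(fit, newdata = data.frame(x = 0))  # the same value
range(dat$x)                               # x = 0 is outside [20, 60]
```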
## Class 11, Logarithm transformation (separate worksheet)
## Class 12, Logarithm transformation
__Rubric__
1. Try plotting the data on a logarithmic scale
* (6 p) Each of the logarithmic relationships is plotted, axes are labeled with scale.
1. original scales
2. $\log(x)$-only
3. $\log(y)$-only
4. both $\log(x)$ and $\log(y)$
2. What happened to your data when you transformed it?
* (2 p) Describe what happened to the relationship after each log transformation (compare transformed scale to original scale; is the relationship more linear, more curved?).
* (1 p) Choose the best scale for a linear relationship and explain why.
* (1 p) Does your relationship benefit from a logarithmic transformation? Say why or why not.
### 1. Try plotting on log scale (original scale, $\log(x)$-only, $\log(y)$-only, both $\log(x)$ and $\log(y)$)
```{r, fig.height = 8, fig.width = 8}
#p <- p + scale_x_log10()
#p <- p + scale_y_log10()
library(ggplot2)
p1 <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = Age, y = TotalCigsSmoked))
p1 <- p1 + theme_bw()
p1 <- p1 + geom_point(alpha = 1/20)
p1 <- p1 + stat_smooth(method = lm)
p1 <- p1 + labs(
title = "Original scale"
, x = "Age"
, y = "Total Cigarettes Smoked per month"
)
#print(p1)
library(ggplot2)
p2 <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = Age, y = TotalCigsSmoked))
p2 <- p2 + theme_bw()
p2 <- p2 + geom_point(alpha = 1/20)
p2 <- p2 + stat_smooth(method = lm)
p2 <- p2 + scale_x_log10()
p2 <- p2 + labs(
title = "Log(x)"
, x = "Age"
, y = "Total Cigarettes Smoked per month"
)
#print(p2)
library(ggplot2)
p3 <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = Age, y = TotalCigsSmoked))
p3 <- p3 + theme_bw()
p3 <- p3 + geom_point(alpha = 1/20)
p3 <- p3 + stat_smooth(method = lm)
p3 <- p3 + scale_y_log10()
p3 <- p3 + labs(
title = "Log(y)"
, x = "Age"
, y = "Total Cigarettes Smoked per month"
)
#print(p3)
library(ggplot2)
p4 <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = Age, y = TotalCigsSmoked))
p4 <- p4 + theme_bw()
p4 <- p4 + geom_point(alpha = 1/20)
p4 <- p4 + stat_smooth(method = lm)
p4 <- p4 + scale_x_log10()
p4 <- p4 + scale_y_log10()
p4 <- p4 + labs(
title = "Log(x) and Log(y)"
, x = "Age"
, y = "Total Cigarettes Smoked per month"
)
#print(p4)
# grid.arrange() is a way to arrange several ggplot objects
library(gridExtra)
grid.arrange(grobs = list(p1, p2, p3, p4), nrow=2, top = "Total Cigarettes Smoked vs Age")
```
### 2. What happened to your data when you transformed it?
* Describe what happened to the relationship after each log transformation (compare transformed scale to original scale).
1. Original: Same
2. log(x): no improvement, age values became left skewed
3. log(y): some improvement, cigs smoked became left skewed, but the regression line was closer to the center of the data.
4. log(x) and log(y): some improvement, cigs smoked became left skewed, but the regression line was closer to the center of the data; age transformation doesn't help.
* Choose the best scale for a linear relationship and explain why.
log(y) because the cigs smoked became slightly left skewed (better than the extreme right skewness from before), and there was no need for an age transformation.
* Does your relationship benefit from a logarithmic transformation? Say why or why not.
A little bit since the degree of skewness vertically around the regression line was reduced with the log(y) transformation.
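The improvement from log(y) reflects a general fact: a log transformation pulls in a long right tail. A minimal sketch on simulated lognormal data (the `skewness()` helper is defined here, not from a package):

```{r}
# A log transformation pulls in a long right tail:
# lognormal data are right skewed; log() makes them roughly symmetric.
set.seed(2468)
y <- rlnorm(5000, meanlog = 4, sdlog = 1)  # right-skewed toy data
# simple moment-based skewness (0 = symmetric, > 0 = right skewed)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3
c(original = skewness(y), log_scale = skewness(log(y)))
```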
##### Extra plots
Here's an example where transforming both $x$ and $y$ with the log makes a big improvement.
```{r, fig.height = 8, fig.width = 8}
# IncomePersonal
#p <- p + scale_x_log10()
#p <- p + scale_y_log10()
library(ggplot2)
p1 <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = IncomePersonal, y = TotalCigsSmoked))
p1 <- p1 + theme_bw()
p1 <- p1 + geom_point(alpha = 1/20)
p1 <- p1 + stat_smooth(method = lm)
p1 <- p1 + labs(
title = "Original scale"
, x = "Personal Income"
, y = "Total Cigarettes Smoked per month"
)
#print(p1)
library(ggplot2)
p2 <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = IncomePersonal, y = TotalCigsSmoked))
p2 <- p2 + theme_bw()
p2 <- p2 + geom_point(alpha = 1/20)
p2 <- p2 + stat_smooth(method = lm)
p2 <- p2 + scale_x_log10()
p2 <- p2 + labs(
title = "Log(x)"
, x = "Personal Income"
, y = "Total Cigarettes Smoked per month"
)
#print(p2)
library(ggplot2)
p3 <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = IncomePersonal, y = TotalCigsSmoked))
p3 <- p3 + theme_bw()
p3 <- p3 + geom_point(alpha = 1/20)
p3 <- p3 + stat_smooth(method = lm)
p3 <- p3 + scale_y_log10()
p3 <- p3 + labs(
title = "Log(y)"
, x = "Personal Income"
, y = "Total Cigarettes Smoked per month"
)
#print(p3)
library(ggplot2)
p4 <- ggplot(nesarc_sub %>% filter(TotalCigsSmoked > 0), aes(x = IncomePersonal, y = TotalCigsSmoked))
p4 <- p4 + theme_bw()
p4 <- p4 + geom_point(alpha = 1/20)
p4 <- p4 + stat_smooth(method = lm)
p4 <- p4 + scale_x_log10()
p4 <- p4 + scale_y_log10()
p4 <- p4 + labs(
title = "Log(x) and Log(y)"
, x = "Personal Income"
, y = "Total Cigarettes Smoked per month"
)
#print(p4)
# grid.arrange() is a way to arrange several ggplot objects
library(gridExtra)
grid.arrange(grobs = list(p1, p2, p3, p4), nrow=2, top = "Total Cigarettes Smoked vs Personal Income")
```
## Class 13, Correlation (separate worksheet)
## Class 14, Categorical contingency tables (separate worksheet)
## Class 15, Correlation and Categorical contingency tables
__Rubric__
1. With your previous (or a new) bivariate scatter plot, calculate the correlation and interpret.
* (1 p) plot is repeated here, or the plot is referenced and easy to find above,
* (1 p) correlation is calculated,
* (2 p) correlation is interpreted (direction, strength of LINEAR relationship).
2. With your previous (or a new) two- or three-variable categorical plot, calculate conditional proportions and interpret.
* (1 p) frequency table of variables is given,
* (2 p) conditional proportion tables are calculated of the outcome variable conditional on one or two other variables,
* (1 p) a well-labeled plot of the proportion table is given,
* (2 p) the conditional proportions are interpreted and compared between conditions.
### Correlation
```{r, fig.height = 5, fig.width = 5}
library(ggplot2)
p <- ggplot(nesarc_sub, aes(x = Age, y = TotalCigsSmoked_log2))
p <- p + theme_bw()
p <- p + geom_point(alpha = 1/20)
p <- p + stat_smooth(method = lm)
p <- p + labs(
title = "Log2 Cigarettes smoked vs Age"
, x = "Age"
, y = "Log2 Total Cigarettes Smoked per month"
)
print(p)
```
```{R}
cor_A_C <-
cor(
nesarc_sub$Age
, nesarc_sub$TotalCigsSmoked_log2
, use = "complete.obs"
)
cor_A_C
```
### Interpretation of correlation
The correlation is `r cor_A_C %>% signif(3)`,
indicating a moderately weak positive linear relationship
between Age and log2 total cigarettes smoked per month:
as people get older, they tend to smoke slightly more
cigarettes per month on the log2 scale.
### Contingency table
```{r, results = 'asis', fig.height = 5, fig.width = 6}
tab_S_D_N <-
nesarc_sub %>%
group_by(
Sex
, Depression
, NicotineDependence
) %>%
summarise(
Frequency = n()
) %>%
mutate(
Proportion = round(Frequency / sum(Frequency), 3)
) %>%
ungroup()
# tab_S_D_N
tab_S_D_N %>%
knitr::kable()
library(ggplot2)
p <- ggplot(data = tab_S_D_N, aes(x = Depression, y = Proportion, fill = forcats::fct_rev(NicotineDependence)))
p <- p + geom_hline(yintercept = c(0, 1), alpha = 1/4)
p <- p + geom_bar(stat="identity")
p <- p + theme_bw()
p <- p + labs(
title = "Proportion of Nicotine Dependence by Depression status\nfor each Gender"
, y = "Proportion of Nicotine Dependence"
, fill = "Nicotine Dependence"
)
p <- p + scale_y_continuous(limits = c(0, 1))
p <- p + facet_wrap(~ Sex)
p <- p + theme(legend.position = "bottom")
print(p)
```
### Interpretation of conditional proportions
* For Females without depression, the proportion who are nicotine dependent is 0.076, while for those with depression it is 0.206 (nearly 3 times as much).
* For Males without depression, the proportion who are nicotine dependent is 0.108, while for those with depression it is 0.269 (nearly 3 times as much).
Males generally are more nicotine dependent than Females,
but both are nearly 3 times more likely to be nicotine dependent
if they are depressed compared to if they are not depressed.
## Class 16, Parameter estimation (one-sample) (separate worksheet)
## Class 17, Inference and Parameter estimation (one-sample)
__Rubric__
1. Using a numerical variable, calculate and interpret a confidence interval for the population mean.
* (1 p) Identify and describe the variable,
* (1 p) use `t.test()` to calculate the mean and confidence interval, and
* (1 p) interpret the confidence interval.
* (2 p) Using plotting code from the last two classes, plot the data, estimate, and confidence interval in a single well-labeled plot.
2. Using a two-level categorical variable, calculate and interpret a confidence interval for the population proportion.
* (1 p) Identify and describe the variable,
* (1 p) use `binom.test()` to calculate the sample proportion and confidence interval, and
* (1 p) interpret the confidence interval.
* (2 p) Using plotting code from the last two classes, plot the data, estimate, and confidence interval in a single well-labeled plot.
### Numeric variable confidence interval for mean $\mu$
`TotalCigsSmoked_log2` is the estimated number of cigarettes per month a subject smokes (`DaysSmoke` * `DailyCigsSmoked`) on the log2 scale.
```{r}
# calculate mean and CI
t_result <-
t.test(
nesarc_sub$TotalCigsSmoked_log2
)
t_result
```
__Interpretation:__
* The mean of `TotalCigsSmoked_log2` is `r t_result$estimate %>% signif(3)` and
the 95% confidence interval is
(`r t_result$conf.int[1] %>% signif(3)`, `r t_result$conf.int[2] %>% signif(3)`).
* In 95% of samples, a confidence interval constructed from the data
will contain the true population mean. In our case, our CI was
(`r t_result$conf.int[1] %>% signif(3)`, `r t_result$conf.int[2] %>% signif(3)`).
It either contains the true population mean or it doesn't, but we don't know.
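The coverage statement above can be checked by simulation; a minimal sketch drawing repeated samples from a known normal population (simulated, not the NESARC data):

```{r}
# Simulation: about 95% of t-based confidence intervals from
# repeated samples cover the true (known) population mean.
set.seed(76543)
mu      <- 5      # true mean (known because we simulate)
covered <-
  replicate(10000, {
    ci <- t.test(rnorm(50, mean = mu, sd = 2))$conf.int
    (ci[1] <= mu) & (mu <= ci[2])
  })
mean(covered)  # proportion of intervals covering mu, near 0.95
```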
```{r, fig.height = 4, fig.width = 8}
# plot data and create table
library(ggplot2)
p <- ggplot(nesarc_sub, aes(x = TotalCigsSmoked_log2))
p <- p + geom_histogram(binwidth = 0.5, alpha = 1)
#p <- p + geom_rug(alpha = 1/8)
# est
p <- p + geom_vline(aes(xintercept = as.numeric(t_result$estimate)), colour = "blue", size = 1, linetype = 3)
p <- p + geom_rect(aes(
xmin = t_result$conf.int[1]
, xmax = t_result$conf.int[2]
, ymin = -150
, ymax = 0)
, fill = "blue", alpha = 1)
p <- p + labs(title = "Log2 Total Cigs Smoked\nblue = est +- 95% CI")
print(p)
```
### Categorical variable confidence interval for proportion $p$
`Depression` is a Yes/No variable indicating that the person has major depression (lifetime).
```{r}
# proportions are of the last group_by() variable within combinations of the other variables
tab_dep_long <-
nesarc_sub %>%
group_by(Depression) %>%
summarise(Frequency = n()) %>%
mutate(Proportion = Frequency / sum(Frequency)) %>%
ungroup()
tab_dep_long
# prop.test() is an asymptotic (approximate) test for a binomial random variable
p_summary <-
prop.test(
x = tab_dep_long %>% filter(Depression == "Yes Depression") %>% pull(Frequency)
, n = sum(tab_dep_long$Frequency)
#, conf.level = 0.95
)
p_summary
```
__Interpretation:__
* The proportion of people who have `Depression` is `r p_summary$estimate %>% signif(3)` and
the 95% confidence interval is
(`r p_summary$conf.int[1] %>% signif(3)`, `r p_summary$conf.int[2] %>% signif(3)`).
* In 95% of samples, a confidence interval constructed from the data
will contain the true population proportion. In our case, our CI was
(`r p_summary$conf.int[1] %>% signif(3)`, `r p_summary$conf.int[2] %>% signif(3)`).
It either contains the true population proportion or it doesn't, but we don't know.
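The rubric mentions `binom.test()`; `prop.test()` above uses a normal approximation, while `binom.test()` computes an exact binomial interval. A minimal sketch comparing the two on made-up counts (not the NESARC frequencies):

```{r}
# Exact vs approximate confidence intervals for a proportion.
x_hyp <- 180   # hypothetical number of "successes"
n_hyp <- 1000  # hypothetical sample size
binom.test(x = x_hyp, n = n_hyp)$conf.int  # exact (binomial)
prop.test (x = x_hyp, n = n_hyp)$conf.int  # approximate; close for large n
```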
```{r}
library(ggplot2)
p <- ggplot(data = tab_dep_long %>% filter(Depression == "Yes Depression"), aes(x = Depression, y = Proportion))
p <- p + geom_hline(yintercept = c(0, 1), alpha = 1/4)
#p <- p + geom_bar(stat = "identity")
p <- p + geom_errorbar(aes(min = p_summary$conf.int[1], max = p_summary$conf.int[2]), width=0.25)
p <- p + geom_point(colour = "red", size = 6, pch = 18)
p <- p + labs(title = "Proportion with Depression\nestimate with 95% CI")
p <- p + scale_y_continuous(limits = c(0.17, 0.19))
# swap axes to take less vertical space
#p <- p + coord_flip()
print(p)
```
## Class 18, Hypothesis testing (one- and two-sample) (separate worksheet)
## Class 19, Paired data, assumption assessment (separate worksheet)
## Class 20, Hypothesis testing (one- and two-sample)
### Mechanics of a hypothesis test (review)
1. Set up the __null and alternative hypotheses__ in words and notation.
* In words: "The population mean for [what is being studied] is different from [value of $\mu_0$]."
(Note that the statement in words is in terms of the alternative hypothesis.)
* In notation: $H_0: \mu=\mu_0$ versus $H_A: \mu \ne \mu_0$
(where $\mu_0$ is specified by the context of the problem).
2. Choose the __significance level__ of the test, such as $\alpha=0.05$.
3. Compute the __test statistic__, such as $t_{s} = \frac{\bar{Y}-\mu_0}{SE_{\bar{Y}}}$, where $SE_{\bar{Y}}=s/\sqrt{n}$ is the standard error.
4. Determine the __tail(s)__ of the sampling distribution where the __$p$-value__ from the test statistic will be calculated
(for example, both tails, right tail, or left tail).
(Historically, we would compare the observed test statistic, $t_{s}$,
with the __critical value__ $t_{\textrm{crit}}=t_{\alpha/2}$
in the direction of the alternative hypothesis from the
$t$-distribution table with degrees of freedom $df = n-1$.)
5. State the __conclusion__ in terms of the problem.
* Reject $H_0$ in favor of $H_A$ if $p\textrm{-value} < \alpha$.
* Fail to reject $H_0$ if $p\textrm{-value} \ge \alpha$.
(Note: We DO NOT _accept_ $H_0$.)
6. __Check assumptions__ of the test (for now we skip this).
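Steps 3 and 4 can be reproduced by hand and checked against `t.test()`; a minimal sketch on toy data ($\mu_0$ chosen for illustration only):

```{r}
# One-sample t-test computed "by hand" and checked against t.test().
set.seed(123)
y   <- rnorm(30, mean = 5.5, sd = 2)     # toy sample
mu0 <- 5                                 # hypothesized mean (illustration)
n   <- length(y)
se  <- sd(y) / sqrt(n)                   # standard error of the mean
t_s <- (mean(y) - mu0) / se              # test statistic
p   <- 2 * pt(-abs(t_s), df = n - 1)     # two-sided p-value
c(t_s = t_s, p_value = p)
t.test(y, mu = mu0)  # same statistic and p-value
```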
### What do we do about "significance"?
Adapted from **[Significance Magazine](https://rss.onlinelibrary.wiley.com/doi/10.1111/j.1740-9713.2019.01295.x)**.
Recent calls have been made to abandon the term "statistical significance".
The American Statistical Association (ASA) issued its
[statement](https://www.tandfonline.com/doi/pdf/10.1080/00031305.2016.1154108) and
[recommendation](https://www.tandfonline.com/doi/full/10.1080/00031305.2019.1583913)
on p-values (see the [special issue of p-values](https://www.tandfonline.com/toc/utas20/73/sup1?nav=tocList) for more).
In summary, the problem of "significance" is one of misuse, misunderstanding, and misinterpretation.
The recommendation in this class is that it is no longer sufficient to say that a
result is "statistically significant" or "non-significant"
depending on whether a p-value is less than a threshold.
Instead, we will be looking for wording as in the following paragraph.
"The difference between the two groups turns out to be small (8%),
while the probability ($p$) of observing a result at least as extreme as this
under the null hypothesis of no difference between the two groups
is $p = 0.003$ (that is, 0.3%).
This p-value is statistically significant as it is below our pre-defined
threshold ($p < 0.05$).
However, the p-value tells us only that the 8% difference between the
two groups is somewhat unlikely given our hypothesis and null model's assumptions.
More research is required, or other considerations may be needed, to conclude
that the difference is of practical importance and reproducible."
### Two-sample $t$-test
__Rubric__
1. Using a numerical response variable and a two-level categorical variable (or a categorical variable you can reduce to two levels), specify a two-sample $t$-test associated with your research questions.
* (2 p) Specify the hypotheses in words and notation (either one- or two-sided test),
* (0 p) use `t.test()` to calculate the mean, test statistic, and p-value,
* (2 p) state the significance level, test statistic, and p-value, and
* (2 p) state the conclusion in the context of the problem.
* (1 p) Given your conclusion, could you have committed a Type-I or Type-II error?
* (2 p) Provide an appropriate plot of the data and sample estimates in a well-labeled plot.
* (1 p) Check the assumptions of the test using the bootstrap and interpret the bootstrap sampling distribution plot.
1. Set up the __null and alternative hypotheses__ in words and notation.
* In words: "The population mean total cigarettes smoked (`TotalCigsSmoked`) is greater for people with depression than for those without depression (`Depression`)."
* In notation: $H_0: \mu_D = \mu_N$ versus $H_A: \mu_D > \mu_N$.
```{r}
t_summary_TCS_D <-
t.test(
TotalCigsSmoked ~ Depression
, data = nesarc_sub
, alternative = "less"  # "less": mean of first factor level (No Depression) is less than second (Yes Depression), matching H_A
)
t_summary_TCS_D
```
2. Choose the __significance level__ of the test, such as $\alpha=0.05$.
3. Compute the __test statistic__, such as $t_{s} = `r t_summary_TCS_D$statistic %>% signif(3)`$.
4. __$p$-value__ is $p = `r t_summary_TCS_D$p.value %>% signif(3)`$.
5. State the __conclusion__ in terms of the problem.
* Reject $H_0$ in favor of $H_A$ concluding that the population mean total cigarettes smoked (`TotalCigsSmoked`) is greater for people with depression than for those without depression (`Depression`).
* Because we rejected the Null hypothesis, we could have made a Type I error.
```{r}
## If we create a summary data.frame with a similar structure as our data, then we
## can annotate our plot with those summaries.
est_mean_TCS_D <-
nesarc_sub %>%
group_by(Depression) %>%
summarise(TotalCigsSmoked = mean(TotalCigsSmoked, na.rm = TRUE)) %>%
ungroup()
est_mean_TCS_D
```
```{R, fig.width=4, fig.height=6}
library(ggplot2)
p <- ggplot(nesarc_sub, aes(x = Depression, y = TotalCigsSmoked))
p <- p + geom_boxplot(width = 0.5, alpha = 0.5)
p <- p + geom_jitter(position = position_jitter(width = 0.1), alpha = 1/4)
# diamond at mean for each group
p <- p + stat_summary(fun = mean, geom = "point", shape = 18, size = 6,
colour = "red", alpha = 0.8)
p <- p + labs(
title = "Total Cigarettes Smoked by Depression"
, x = "Depression status"
, y = "Total cigarettes smoked per month"
)
print(p)
```
```{R, fig.width=6, fig.height=4}
# histogram using ggplot
p <- ggplot(nesarc_sub, aes(x = TotalCigsSmoked))
p <- p + geom_histogram(binwidth = 40)
#p <- p + geom_rug()
p <- p + geom_vline(data = est_mean_TCS_D, aes(xintercept = TotalCigsSmoked), colour = "red")
p <- p + facet_grid(Depression ~ .)
p <- p + labs(
title = "Total Cigarettes Smoked by Depression"
, x = "Total cigarettes smoked per month"
, y = "Count"
)
print(p)
```
```{R, fig.width=6, fig.height=6}
e_plot_bs_two_samp_diff_dist(
nesarc_sub %>% filter(Depression == "No Depression") %>% drop_na(TotalCigsSmoked) %>% pull(TotalCigsSmoked)
, nesarc_sub %>% filter(Depression == "Yes Depression") %>% drop_na(TotalCigsSmoked) %>% pull(TotalCigsSmoked)
, N = 1000
)
```
* The bootstrap sampling distribution of the difference in means is close to normal, so the model assumptions are met and we can rely on the decision we made in the hypothesis test above.
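`e_plot_bs_two_samp_diff_dist()` is a course helper function; its core idea, the bootstrap sampling distribution of a difference in means, can be sketched directly (toy right-skewed samples, not the NESARC groups):

```{r}
# Bootstrap sampling distribution of the difference in two sample means.
set.seed(54321)
g1 <- rexp(200, rate = 1/100)  # toy sample 1 (right skewed)
g2 <- rexp(150, rate = 1/150)  # toy sample 2 (right skewed)
boot_diff <-
  replicate(1000, {
    mean(sample(g1, replace = TRUE)) - mean(sample(g2, replace = TRUE))
  })
# if this distribution looks normal, the normality-of-the-mean
# assumption behind the t-test is reasonable at these sample sizes
hist(boot_diff, main = "Bootstrap differences in means", xlab = "Difference")
quantile(boot_diff, c(0.025, 0.975))  # percentile bootstrap 95% CI
```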
## Class 21, ANOVA, Pairwise comparisons (separate worksheet)
## Class 22, ANOVA and Assessing Assumptions
__Rubric__
1. Using a numerical response variable and a categorical variable with three to five levels (or a categorical variable you can reduce to three to five levels), specify an ANOVA hypothesis associated with your research questions.
* (1 p) Specify the ANOVA hypotheses in words and notation,
* (1 p) plot the data in a way that is consistent with hypothesis test (comparing means, assess equal variance assumption),
* (1 p) use `aov()` to calculate the hypothesis test statistic and p-value,
* (1 p) state the significance level, test statistic, and p-value,
* (1 p) state the conclusion in the context of the problem,
* (2 p) assess the normality assumption of the residuals using appropriate methods (QQ-plot and Anderson-Darling test), and
* (1 p) assess the assumption of equal variance between your groups using an appropriate test (also mention standard deviations of each group).
* (2 p) If you rejected the ANOVA null hypothesis, perform follow-up pairwise comparisons using Tukey's HSD to indicate which groups have statistically different means and summarize the results.
### Hypothesis and plot
Let $\mu_j$ = population mean total cigarettes smoked per month for the five ethnicities identified in the dataset:
Caucasian,
African American,
Native American,
Asian, and
Hispanic,
numbered $(j = 1, 2, 3, 4, 5)$, respectively.
We wish to test $H_0: \mu_1 = \cdots = \mu_5$ against $H_A: \textrm{not } H_0$ (at least one pair of means differ).
A table of the ethnicities to compare is below.
```{R}
summary(nesarc_sub$Ethnicity)
```
Plot the data in a way that compares the means.
Error bars are 95% confidence intervals of the mean.
```{R}
# Plot the data using ggplot
library(ggplot2)
p <- ggplot(nesarc_sub, aes(x = Ethnicity, y = TotalCigsSmoked))
# plot a reference line for the global mean (assuming no groups)
p <- p + geom_hline(yintercept = mean(nesarc_sub$TotalCigsSmoked, na.rm = TRUE),
colour = "black", linetype = "dashed", size = 0.3, alpha = 0.5)
# boxplot, size=.75 to stand out behind CI
p <- p + geom_violin(width = 0.5, alpha = 0.25)
p <- p + geom_boxplot(width = 0.25, alpha = 0.25)
# points for observed data
p <- p + geom_point(position = position_jitter(w = 0.05, h = 0), alpha = 0.2)
# diamond at mean for each group
p <- p + stat_summary(fun = mean, geom = "point", shape = 18, size = 4,
colour = "red", alpha = 0.8)
# confidence limits based on normal distribution
p <- p + stat_summary(fun.data = "mean_cl_normal", geom = "errorbar",
width = .2, colour = "red", alpha = 0.8)
p <- p + labs(title = "Total cigarettes smoked by Ethnicity")
p <- p + ylab("Total Cigs Smoked")
print(p)
```
### (NOT USED) Transform the response variable to satisfy assumptions
### ANOVA Hypothesis test
__Hypothesis test__
1. Set up the __null and alternative hypotheses__ in words and notation.
* In words: "The population mean total cigarettes smoked is different between ethnicities."
* In notation: $H_0: \mu_1 = \cdots = \mu_5$ versus $H_A: \textrm{not } H_0$ (at least one pair of means differ).
2. Let the significance level of the test be $\alpha = 0.05$.
3. Compute the __test statistic__.
```{R}
aov_summary <-
aov(
TotalCigsSmoked ~ Ethnicity
, data = nesarc_sub
)
summary(aov_summary)
```
The $F$-statistic for the ANOVA is $F = `r unlist(summary(aov_summary))["F value1"] %>% signif(3)`$.
4. Compute the __$p$-value__ from the test statistic.
The p-value for testing the null hypothesis is
$p = `r unlist(summary(aov_summary))["Pr(>F)1"] %>% signif(3)`$.
5. State the __conclusion__ in terms of the problem.
Because $p = `r unlist(summary(aov_summary))["Pr(>F)1"] %>% signif(3)` < 0.05$,
we reject $H_0$ in favor of $H_1$ concluding that
the mean total cigarettes smoked is different between ethnicities.
### Check assumptions
6. __Check assumptions__ of the test.
a. Residuals are normal
b. Populations have equal variances.
* Check whether residuals are normal.
- Plot the residuals and assess whether they appear normal.
```{R, fig.height = 4, fig.width = 5}
# Plot the data using ggplot
df_res <- data.frame(res = aov_summary$residuals)
library(ggplot2)
p <- ggplot(df_res, aes(x = res))
p <- p + geom_histogram(aes(y = ..density..), binwidth = 50)
p <- p + geom_density(colour = "blue", adjust = 5)
#p <- p + geom_rug()
p <- p + stat_function(fun = dnorm, colour = "red", args = list(mean = mean(df_res$res), sd = sd(df_res$res)))
p <- p + labs(title = "ANOVA Residuals"
, caption = "Blue = Kernel density curve\nRed = Normal distribution")
print(p)
```
The plot of the residuals is very right skewed (not normal).
```{R, fig.height = 5, fig.width = 5}
# QQ plot
par(mfrow=c(1,1))
#library(car)
car::qqPlot(
aov_summary$residuals
, las = 1
, id = list(n = 4, cex = 1)
, lwd = 1
, main="QQ Plot"
)
```
The QQ-plot of the residuals versus the normal quantiles curves upward at the right tail (concave up), consistent with strong right skewness (not normal).
- A formal test of normality on the residuals tests the hypothesis
$H_0:$ The distribution is Normal vs
$H_1:$ The distribution is not Normal.
We can test the distribution of the residuals.
```{R}
#shapiro.test(aov_summary$residuals)
#library(nortest)
nortest::ad.test(aov_summary$residuals)
nortest::cvm.test(aov_summary$residuals)
```
The formal normality tests of the residuals reject $H_0$ in favor of $H_A$, concluding that the data are not normal.
* Check whether populations have equal variances.
- Look at the numerical summaries below.
```{R}
# calculate summaries
dat_EthCig_summary <-
nesarc_sub %>%
group_by(Ethnicity) %>%
summarize(
m = mean(TotalCigsSmoked, na.rm = TRUE)
, s = sd(TotalCigsSmoked, na.rm = TRUE)
, n = n()
, .groups = "drop_last"
) %>%
ungroup()
dat_EthCig_summary
```
The standard deviations appear different between groups, in particular, the Asian group has less than half the standard deviation of the Native American group.
- Formal tests for equal variances.
We can test whether the variances are equal between our five groups.
This is similar to the ANOVA hypothesis, but instead of testing means we're testing variances.
$H_0: \sigma^2_1 = \cdots = \sigma^2_5$
versus $H_A: \textrm{not } H_0$ (at least one pair of variances differ).
```{R}
## Test equal variance
# assumes populations are normal
bartlett.test(TotalCigsSmoked ~ Ethnicity, data = nesarc_sub)
# does not assume normality, requires car package
#library(car)
car::leveneTest(TotalCigsSmoked ~ Ethnicity, data = nesarc_sub)
# nonparametric test
fligner.test(TotalCigsSmoked ~ Ethnicity, data = nesarc_sub)
```
Because the data were not normal, we interpret the Levene test.
The p-value is less than 0.05, therefore we reject $H_0$ of equal variances in favor of $H_A$ that the variances are not equal.
### Post Hoc pairwise comparison tests
__EMM plot interpretation__
This __EMM plot (Estimated Marginal Means, aka Least-Squares Means)__
is only available when conditioning on one variable.
The blue bars are confidence intervals for the EMMs;
don't ever use confidence intervals for
EMMs to perform comparisons -- they can be very misleading.
The red arrows are for the comparisons among means;
the degree to which the "comparison arrows" overlap reflects as much as
possible the significance of the comparison of the two estimates.
If an arrow from one mean overlaps an arrow from
another group, the difference is not significant, based on the adjust setting
(which defaults to "tukey").
```{R, fig.height = 5, fig.width = 3}
## Contrasts
adjust_method <- c("none", "tukey", "bonferroni")[2]
library(emmeans)
emm_cont <-
emmeans::emmeans(
aov_summary
, specs = "Ethnicity"
)
# means and CIs
emm_cont %>% print()
# pairwise differences
emm_cont %>% pairs(adjust = adjust_method) %>% print()
# plot of means, CIs, and comparison arrows
plot(
emm_cont
, comparisons = TRUE
, adjust = adjust_method
, horizontal = FALSE
, ylab = "Ethnicity"
)
```
All pairs of Ethnicities differ except Asian and Hispanic.
* `Cauc - AfAm`: sig diff
* `Cauc - NaAm`: sig diff
* `Cauc - Asia`: sig diff
* `Cauc - Hisp`: sig diff
* `AfAm - NaAm`: sig diff
* `AfAm - Asia`: sig diff
* `AfAm - Hisp`: sig diff
* `NaAm - Asia`: sig diff
* `NaAm - Hisp`: sig diff
* `Asia - Hisp`: NOT sig diff
```{r}
dat_EthCig_summary %>% arrange(m)
```
Summarize results by ordering the means and grouping pairs that do not differ.
```
  Ethnicity     m     s     n
| 1 Asia      76.9  194.  1332
| 2 Hisp      89.7  226.  8308
  3 AfAm     127.   250.  8245
  4 Cauc     244.   376. 24507
  5 NaAm     293.   420.   701
```
## Class 23, Nonparametric methods (separate worksheet)
## Class 24, Binomial and Multinomial tests (separate worksheet)
## Class 25, Two-way categorical tables (separate worksheet)
## Class 26, Simple linear regression (separate worksheet)
## Class 27, Two-way categorical and simple linear regression
### Two-way categorical analysis
1. Two-way categorical analysis.
Using two categorical variables with two to five levels each, specify a hypothesis test for homogeneity of proportions associated with your research questions.
#### (1 p) Specify the hypotheses in words and notation.
* In words: "There is an association between `NicotineDependence` and `Ethnicity`."
* In notation: $H_0: p(i \textrm{ and } j) = p(i)p(j)$ for all row categories $i$ and column categories $j$, versus $H_A: p(i \textrm{ and } j) \ne p(i)p(j)$ for at least one row category $i$ and column category $j$.
#### (1 p) State the conclusion of the test in the context of the problem.
```{R}
# Tabulate by two categorical variables:
tab_E_TD <-
xtabs(
~ Ethnicity + NicotineDependence
, data = nesarc_sub
)
tab_E_TD
# column proportions
prop.table(
tab_E_TD
, margin = 1
)
```
```{R}
# Chi-sq test
chisq_E_TD <-
chisq.test(
tab_E_TD
, correct = FALSE
)
chisq_E_TD
```
The test statistic is $X^2 = `r chisq_E_TD$statistic %>% signif(3)`$.
Because the p-value $= `r chisq_E_TD$p.value %>% signif(3)` < 0.05$
we reject the null hypothesis
concluding that there is an association between
`NicotineDependence` and `Ethnicity`.
```{R}
# table of expected frequencies:
chisq_E_TD$expected
# smallest expected frequency:
min(chisq_E_TD$expected)
```
The model assumptions are met since the expected count for each cell is at least 5.
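Had any expected count been below 5, the chi-squared approximation would be suspect. Two common fallbacks, sketched here on the same table for reference (not needed for these data):

```{R, eval = FALSE}
# fallback 1: Fisher's exact test (Monte Carlo p-value for tables this large)
fisher.test(tab_E_TD, simulate.p.value = TRUE, B = 10000)
# fallback 2: chi-squared test with a Monte Carlo p-value instead of the
#   asymptotic chi-squared distribution
chisq.test(tab_E_TD, simulate.p.value = TRUE, B = 10000)
```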
#### (1 p) Plot a mosaic plot of the data and Pearson residuals.
```{R}
# The Pearson residuals
chisq_E_TD$residuals
# The sum of the squared residuals is the chi-squared statistic:
#chisq_E_TD$residuals^2
#sum(chisq_E_TD$residuals^2)
```
```{R, fig.height = 6, fig.width = 8}
# mosaic plot
library(vcd)
# this layout gives us the interpretation we want:
mosaic(
tab_E_TD
, shade = TRUE
, legend = TRUE
, abbreviate_labs = 4
, rot_labels = c(0, 90, 0, 0)
, just_labels = c("center", "right") # order is the opposite of the first argument
, offset_varnames = c(0, 0, 0, 2) # offset the y-axis var name
#, margins = c(3, 3, 3, 14) # order is: top, right, bottom, left
, main = "Fraction of smokers with and without\n nicotine dependence by Ethnicity"
, labeling_args = list(set_varnames = c(NicotineDependence = "Nicotine Dependence/Addiction Status"
, Ethnicity = "Ethnicity")
)
)
```
#### (1 p) Interpret the mosaic plot with reference to the Pearson residuals.
The primary cause of rejecting $H_0$ is that
Hispanics and Asians have less nicotine dependence than expected
while Caucasians and (especially) Native Americans have higher nicotine dependence than expected.
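The Pearson residuals above can be supplemented with standardized (adjusted) residuals, which are approximately N(0, 1) under $H_0$, so cells with absolute value greater than about 2 are the ones driving the rejection. A quick sketch using the `chisq_E_TD` object from above:

```{R}
# standardized (adjusted) residuals, approximately N(0, 1) under H0
chisq_E_TD$stdres
# flag the cells that contribute most to rejecting H0
abs(chisq_E_TD$stdres) > 2
```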
### Simple linear regression
2. Simple linear regression.
Select two numerical variables.
#### (1 p) Plot the data and, if required, transform the variables so a roughly linear relationship is observed. All interpretations will be done on this scale of the variables.
(See below.)
#### (0 p) Fit the simple linear regression model.
```{R}
# fit model
#lm_fit <- lm(TotalCigsSmokedSqrt ~ DaysSmoke, data = nesarc_sub)
lm_fit <- lm(TotalCigsSmoked_log2 ~ DaysSmoke_log2, data = nesarc_sub)
coef(lm_fit)
```
Note that prediction intervals should be bounded below at 0.
```{R, fig.height = 6, fig.width = 8}
library(ggplot2)
p <- ggplot(nesarc_sub, aes(x = DaysSmoke_log2, y = TotalCigsSmoked_log2))
p <- p + geom_vline(xintercept = 0, alpha = 0.25)
p <- p + geom_hline(yintercept = 0, alpha = 0.25)
# prediction bands
p <- p + geom_ribbon(aes(ymin = predict(lm_fit, data.frame(DaysSmoke_log2)
                          , interval = "prediction", level = 0.95)[, 2]
                       , ymax = predict(lm_fit, data.frame(DaysSmoke_log2)
                          , interval = "prediction", level = 0.95)[, 3])
                   , alpha = 0.1, fill = "darkgreen")
# linear regression fit and confidence bands
p <- p + geom_smooth(method = lm, se = TRUE)
# jitter a little to uncover duplicate points
p <- p + geom_jitter(position = position_jitter(0.05), alpha = 0.01)
p <- p + labs(title = "Regression with confidence and prediction bands"
, x = "log2 of Days smoke"
, y = "log2 of Total cigarettes smoked"
, caption = "Blue line is fitted regression line.\nGray band is the confidence band.\nGreen band is the prediction band.")
print(p)
```
#### (1 p) Assess the residuals for lack of fit (interpret plots of residuals vs fitted and $x$-value).
```{R, fig.height = 3, fig.width = 8}
e_plot_lm_diagostics(
lm_fit
, sw_plot_set = "simple"
)
```
* Residuals vs x: within each group of x-values the residuals are roughly symmetric and the y = 0 line passes through the center of most groups. The x = 0 group has many observations at its smallest value, which is why the y = 0 line (and the regression fit) sits below the bulk of that group's distribution spread above it.
#### (1 p) Assess the residuals for normality (interpret QQ-plot and histogram).
* QQ-Plot: On original scale, the pattern is non-normal with right-skewness in y-variable. After log2 transformation of both variables, we still do not have normality (but it is much better).
#### (1 p) Assess the relative influence of points.
* Cook's distance: several points with slightly higher influence, but none with extremely large influence.
* Cook's distance vs Leverage: Those points with high influence also have high leverage, thus, those points that are extremes in the x-direction (least frequent or most frequent smokers) have the highest influence.
#### (1 p) Test whether the slope is different from zero, $H_A: \beta_1 \ne 0$.
Though the model assumptions are not perfectly met, we will continue to interpret the model that we have.
```{R}
summary(lm_fit)
```
Because the p-value = $`r summary(lm_fit)$coefficients[2,4] %>% signif(3)` < 0.05$, we reject $H_0$ in favor of $H_A$ concluding that the slope is different from 0.
Interpretation (extra): For every unit increase in the log2 of `DaysSmoke`
(that is, each doubling of `DaysSmoke`), we expect an increase of
`r summary(lm_fit)$coefficients[2,1] %>% signif(3)`
in the log2 of `TotalCigsSmoked`.
On the original scale, each doubling of `DaysSmoke` multiplies the expected
`TotalCigsSmoked` by $2^{\hat{\beta}_1}$, a scary rate of increase.
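Because both variables are on the log2 scale, the slope has a multiplicative meaning on the original scale. A quick sketch, using the fitted `lm_fit` from above, of the factor by which the expected `TotalCigsSmoked` changes when `DaysSmoke` doubles:

```{R}
# slope on the log2-log2 scale
b1 <- coef(lm_fit)["DaysSmoke_log2"]
# each doubling of DaysSmoke multiplies expected TotalCigsSmoked by 2^b1
2^b1
```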
#### (1 p) Interpret the $R^2$ value.
The regression model explains the proportion
$R^2 = `r signif(summary(lm_fit)$r.squared, 3)`$
of the variation in the response
beyond what the mean value alone explains.
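As a sanity check, $R^2$ can be recomputed by hand as $1 - \textrm{SSE}/\textrm{SST}$ from the fitted model (a sketch using `lm_fit` from above):

```{R}
# response actually used in the fit (first column of the model frame,
#   after any NA rows were dropped)
y    <- model.frame(lm_fit)[[1]]
yhat <- fitted(lm_fit)
# 1 - SSE/SST should match summary(lm_fit)$r.squared
1 - sum((y - yhat)^2) / sum((y - mean(y))^2)
```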
## Class 28, Logistic regression (separate worksheet)
## Class 29, Logistic regression
1. Logistic regression.
### Select a binary response and continuous explanatory/predictor variable.
### (1 p) Plot the data.
See below.
### (1 p) Summarize the $\hat{p}$ values for each value of the $x$-variable. Also, calculate the empirical logits.
Summarize observed probability of success
for each unique value of x.
```{R}
dat_Nic_Cigs_sum <-
nesarc_sub %>%
group_by(
TotalCigsSmoked_smokers_log2
) %>%
summarize(
Success = sum(NicotineDependence01)
, Total = n()
# estimated proportion of successes for each unique value of x
, p_hat = Success / Total
, .groups = "drop_last"
) %>%
ungroup() %>%
na.omit()
# empirical logits
dat_Nic_Cigs_sum <-
dat_Nic_Cigs_sum %>%
mutate(
emp_logit = log((p_hat + 0.5/Total) / (1 - p_hat + 0.5/Total))
)
dat_Nic_Cigs_sum
```
### (1 p) Plot the $\hat{p}$ values vs the $x$-variable and plot the empirical logits vs the $x$-variable.
#### Probability/proportion scale
Plots on the probability scale should follow a sigmoidal curve
(a little difficult to see here).
Note that the overlaid reference
curve (red) is a weighted smoothed curve (loess),
not the model fit.
```{R, fig.height = 6, fig.width = 8}
library(ggplot2)
p1 <- ggplot(nesarc_sub, aes(x = TotalCigsSmoked_smokers_log2, y = NicotineDependence01))
p1 <- p1 + theme_bw()
# data
p1 <- p1 + geom_jitter(width = 0.05, height = 0.025, size = 0.5, alpha = 1/10, colour = "green")
# summaries
p1 <- p1 + geom_point(data = dat_Nic_Cigs_sum, aes(x = TotalCigsSmoked_smokers_log2, y = p_hat, size = Total))
p1 <- p1 + geom_smooth(data = dat_Nic_Cigs_sum, aes(x = TotalCigsSmoked_smokers_log2, y = p_hat, weight = Total), se = FALSE, colour = "red") # just for reference
# axes
p1 <- p1 + scale_y_continuous(breaks = seq(0, 1, by = 0.2))
p1 <- p1 + expand_limits(y = c(0, 1))
#p1 <- p1 + scale_x_continuous(breaks = seq(0, 100, by = 2))
# labels
p1 <- p1 + labs(
title = "Proportion of nicotine dependence by log2 Total cigarettes smoked"
, subtitle= "Observed data, smokers only"
, x = "log2 Total cigarettes smoked"
, y = "Nicotine dependence (0/1)\nObserved probability of nicotine dependence"
, caption = paste(
"Green = Indicator points of nicotine dependence (1) or not (0)."
, "Black = Observed proportions of nicotine dependence"
, "Red = Smoothed curve to proportions"
, sep = "\n" # separate by new lines
)
)
print(p1)
```
#### Logit scale
On the logit scale, if points follow a straight line,
then we can fit a simple logistic regression model.
Note that the overlaid reference
straight line (blue) is from weighted least squares (not the official fitted line),
and the curve (red) is a weighted smoothed curve (loess).
If these two lines are similar, that suggests a straight line on the logit scale
is a good fit.
Both lines are weighted by the total number of observations that each point represents,
so that points representing few observations don't contribute as much as
points representing many observations,
thus our decision should not be heavily influenced by random deviations where there is little data.
```{R, fig.height = 6, fig.width = 8}
# plot on logit scale
library(ggplot2)
p1 <- ggplot(dat_Nic_Cigs_sum, aes(x = TotalCigsSmoked_smokers_log2, y = emp_logit))
p1 <- p1 + theme_bw()
# summaries
p1 <- p1 + geom_point(aes(size = Total))
p1 <- p1 + stat_smooth(aes(weight = Total), method = "lm", se = FALSE, linetype = 3) # just for reference
p1 <- p1 + geom_smooth(aes(weight = Total), se = FALSE, colour = "red") # just for reference
# axes
#p1 <- p1 + scale_y_continuous(breaks = seq(-10, 10, by = 0.5))
#p1 <- p1 + scale_x_continuous(breaks = seq(0, 100, by = 2))
# labels
p1 <- p1 + labs(
title = "Logit of nicotine dependence by log2 Total cigarettes smoked"
, subtitle= "Observed data, smokers only"
, x = "log2 Total cigarettes smoked"
, y = "Empirical logit of the probability of nicotine dependence"
, caption = paste(
"Black = Observed logit proportions of nicotine dependence"
, "Blue = Naive LM fit of logit proportions"
, "Red = Loess smooth curve of empirical logits"
, sep = "\n" # separate by new lines
)
)
print(p1)
```
### (1 p) Describe the logit-vs-$x$ plot. Is it linear? If not, consider a transformation of $x$ to improve linearity; describe the transformation you chose if you needed one.
* Log2 of Total cigarettes smoked improved the linear relationship.
A straight line describes the data well from 0 to 10, but does not fit the data when x is greater than 10.
### (1 p) Fit the `glm()` model and assess the deviance lack-of-fit test.
The simple logistic regression model expresses the population proportion $p$ of
individuals with a given attribute (called the probability of success) as a
function of a single predictor variable $X$. The model assumes that $p$ is
related to $X$ through
$$
\log \left( \frac{p}{1-p} \right) = \beta_0 + \beta_1 X
$$
or, equivalently, as
$$
p = \frac{ \exp( \beta_0 + \beta_1 X ) }{ 1 + \exp( \beta_0 + \beta_1 X ) }.
$$
The logistic regression model is a __binary response model__, where the
response for each case falls into one of two exclusive and exhaustive categories,
success (cases with the attribute of interest) and
failure (cases without the attribute of interest).
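The two formulations above are the same model: base R's `plogis()` is the inverse logit $\exp(\eta)/(1+\exp(\eta))$ and `qlogis()` is the logit. A quick numerical check, with parameter values that are made up purely for illustration:

```{R}
# hypothetical parameter values, for illustration only
b0  <- -2; b1 <- 0.5; x <- 3
eta <- b0 + b1 * x                  # linear predictor, logit scale
p   <- exp(eta) / (1 + exp(eta))    # probability scale
all.equal(p, plogis(eta))           # inverse logit: TRUE
all.equal(eta, qlogis(p))           # logit:         TRUE
```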
__Fit the model.__
```{R}
# For our summarized data (with successes and totals at each value of x)
# The left-hand side of our formula binds two columns together with cbind():
# the columns are the number of "successes" and "failures".
# For logistic regression with logit link we specify family = binomial,
# where logit is the default link function for the binomial family.
glm_n_c <-
glm(
cbind(Success, Total - Success) ~ TotalCigsSmoked_smokers_log2
, family = binomial
, data = dat_Nic_Cigs_sum
)
```
#### Deviance statistic for lack-of-fit
Unfortunately, there aren't many model diagnostics for logistic regression.
Visual inspection of the empirical logits is good when you can summarize the data as we've done.
One simple test for lack-of-fit uses the deviance statistic.
Under the null hypothesis (that you'll state below),
the residual deviance follows a $\chi^2$ distribution with
the associated degrees-of-freedom.
Below, we calculate the deviance p-value and perform the test for lack-of-fit.
```{R}
# Test residual deviance for lack-of-fit (if > 0.10, little-to-no lack-of-fit)
glm_n_c$deviance
glm_n_c$df.residual
dev_p_val <- 1 - pchisq(glm_n_c$deviance, glm_n_c$df.residual)
dev_p_val
```
State the null and alternative hypotheses for lack-of-fit.
* $H_0:$ Model fits the data.
* $H_A:$ Model does not fit the data.
For this model, the deviance statistic is
* D = `r glm_n_c$deviance %>% signif(3)` with
* `r glm_n_c$df.residual` df, giving
* p-value = `r dev_p_val %>% signif(3)`.
* Because the p-value is less than 0.05, we reject $H_0$ in favor of $H_A$ concluding that the model does not fit the data. However, for this assignment, we will continue with interpretation.
### (1 p) Calculate the confidence bands around the model fit/predictions. Plot on both the logit and $\hat{p}$ scales.
The `glm()` statement creates an object which we can use to create
the fitted probabilities
and 95\% CIs for the population proportions at each observed value of log2 Total cigarettes smoked.
The fitted probabilities and the limits are stored in columns labeled
`fit_p`, `fit_p_lower`, and `fit_p_upper`, respectively.
Below I create confidence bands for the model
and plot the fit against the data,
first on the logit scale to assess model fit
and then on the probability scale for interpretation.
```{R}
# without newdata, predict() uses all x values in the dataset used to fit the model
fit_logit_pred <-
predict(
glm_n_c
#, newdata = data.frame(TotalCigsSmoked_smokers_log2 = dat_Nic_Cigs_sum$TotalCigsSmoked_smokers_log2)
, type = "link"
, se.fit = TRUE
) %>%
as_tibble()
# put the fitted values in the data.frame
dat_Nic_Cigs_sum <-
dat_Nic_Cigs_sum %>%
mutate(
# logit scale values
fit_logit = fit_logit_pred$fit
, fit_logit_se = fit_logit_pred$se.fit
, fit_logit_lower = fit_logit - 1.96 * fit_logit_se
, fit_logit_upper = fit_logit + 1.96 * fit_logit_se
# proportion scale values
, fit_p = exp(fit_logit) / (1 + exp(fit_logit))
, fit_p_lower = exp(fit_logit_lower) / (1 + exp(fit_logit_lower))
, fit_p_upper = exp(fit_logit_upper) / (1 + exp(fit_logit_upper))
)
```
#### Logit scale
```{R, fig.height = 6, fig.width = 8}
# plot on logit scale
library(ggplot2)
p1 <- ggplot(dat_Nic_Cigs_sum, aes(x = TotalCigsSmoked_smokers_log2, y = emp_logit))
p1 <- p1 + theme_bw()
# MODEL FIT
# predicted curve and point-wise 95% CI
p1 <- p1 + geom_ribbon(aes(x = TotalCigsSmoked_smokers_log2, ymin = fit_logit_lower, ymax = fit_logit_upper), fill = "blue", linetype = 1, alpha = 0.2)
p1 <- p1 + geom_line(aes(x = TotalCigsSmoked_smokers_log2, y = fit_logit), colour = "blue", size = 1)
# fitted values
p1 <- p1 + geom_point(aes(y = fit_logit), colour = "blue", size = 2)
# summaries
p1 <- p1 + geom_point(aes(size = Total))
# axes
#p1 <- p1 + scale_y_continuous(breaks = seq(-10, 10, by = 0.5))
#p1 <- p1 + scale_x_continuous(breaks = seq(0, 100, by = 2))
# labels
p1 <- p1 + labs(
title = "Logit of nicotine dependence by log2 Total cigarettes smoked"
, subtitle= "Logistic model fit"
, x = "log2 Total cigarettes smoked"
, y = "Logit scale of the probability of nicotine dependence"
, caption = paste(
"Black = Observed logit proportions of nicotine dependence given log2 Total cigarettes smoked"
, "Blue = Logistic model fitted logit proportions"
, sep = "\n" # separate by new lines
)
)
print(p1)
```
#### Probability/proportion scale
```{R, fig.height = 6, fig.width = 8}
# plot on probability scale
library(ggplot2)
p1 <- ggplot(nesarc_sub, aes(x = TotalCigsSmoked_smokers_log2, y = NicotineDependence01))
p1 <- p1 + theme_bw()
# data
p1 <- p1 + geom_jitter(width = 0.05, height = 0.025, size = 0.5, alpha = 1/10, colour = "green")
# summaries
p1 <- p1 + geom_point(data = dat_Nic_Cigs_sum, aes(x = TotalCigsSmoked_smokers_log2, y = p_hat, size = Total))
# MODEL FIT
# predicted curve and point-wise 95% CI
p1 <- p1 + geom_ribbon(data = dat_Nic_Cigs_sum, aes(x = TotalCigsSmoked_smokers_log2, y = fit_p, ymin = fit_p_lower, ymax = fit_p_upper), fill = "blue", linetype = 1, alpha = 0.2)
p1 <- p1 + geom_line(data = dat_Nic_Cigs_sum, aes(x = TotalCigsSmoked_smokers_log2, y = fit_p), colour = "blue", size = 1)
# fitted values
p1 <- p1 + geom_point(data = dat_Nic_Cigs_sum, aes(y = fit_p), colour = "blue", size = 2)
# axes
p1 <- p1 + scale_y_continuous(breaks = seq(0, 1, by = 0.2))
p1 <- p1 + expand_limits(y = c(0, 1))
#p1 <- p1 + scale_x_continuous(breaks = seq(0, 100, by = 2))
# labels
p1 <- p1 + labs(
title = "Proportion of nicotine dependence by log2 Total cigarettes smoked"
, subtitle= "Logistic model fit"
, x = "log2 Total cigarettes smoked"
, y = "Nicotine dependence (0/1)\nObserved probability of nicotine dependence"
, caption = paste(
"Green = Indicator points of nicotine dependence (1) or not (0)."
, "Black = Observed proportions of nicotine dependence given log2 Total cigarettes smoked"
, "Blue = Logistic model fitted proportions"
, sep = "\n" # separate by new lines
)
)
print(p1)
```
### (1 p) Interpret the sign ($+$ or $-$) of the slope parameter and test whether the slope is different from zero, $H_A: \beta_1 \ne 0$.
The summary table gives MLEs and standard errors for the regression parameters.
The z-value column is the parameter estimate divided by its standard error.
The p-values are used to test whether the corresponding parameters of the
logistic model are zero.
```{R}
summary(glm_n_c)
# see names(summary(glm_n_c)) to find the object that has the coefficients.
# can also use coef(glm_n_c)
```
__Interpret the sign ($+$ or $-$) of the slope.__
* Because our slope is positive, as (x) log2 Total cigarettes smoked increases, the probability of "success" (y) of nicotine dependence increases.
__Hypothesis test__
* $H_0: \beta_1 = 0$
* $H_A: \beta_1 \ne 0$
* The p-value = $`r summary(glm_n_c)$coefficients[2,4] %>% signif(3)` < 0.05$, therefore we reject $H_0$ in favor of $H_A$ concluding that the slope is different from 0.
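The slope can also be reported as an odds ratio. Since $x$ here is the log2 of total cigarettes smoked, a one-unit increase in $x$ is a doubling of cigarettes smoked, and the odds of nicotine dependence multiply by $e^{\hat{\beta}_1}$. A sketch using the fitted `glm_n_c` from above:

```{R}
b1 <- coef(glm_n_c)["TotalCigsSmoked_smokers_log2"]
# odds multiplier per doubling of total cigarettes smoked
exp(b1)
# Wald 95% CIs for the coefficients, exponentiated to the odds-ratio scale
#   (the intercept row is the baseline odds at x = 0)
exp(confint.default(glm_n_c))
```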
[End]