---
title: "ADA1: Example cumulative project file"
subtitle: "Erik's NESARC Project"
author: "Erik Erhardt"
date: "`r format(Sys.time(), '%B %d, %Y')`"
output:
  html_document:
    toc: true
    number_sections: true
    toc_depth: 5
    code_folding: show
    #df_print: paged
    #df_print: kable
    #toc_float: true
    #collapsed: false
    #smooth_scroll: TRUE
    theme: cosmo #spacelab #yeti #united #cosmo
    highlight: tango
  pdf_document:
    df_print: kable
    #latex_engine: xelatex
    #sansfont: IBM Plex Sans
    #classoption: landscape
fontsize: 12pt
geometry: margin=0.25in
always_allow_html: yes
bibliography: NESARC_NicotineDependence_LitReview.bib
---
----------------------------------------
# Document overview
This document is organized by Week and Class number.
The in-class assignments are indicated by the Tuesday and Thursday Class numbers.
Each week's homework is often a combination of Tuesday and Thursday,
with a small extension.
Therefore, "fleshing out" the Tuesday and Thursday sections with a little addition
is often sufficient for your homework assignment;
that is, you won't need a separate "homework" section for a week but
just extend the in-class assignments.
Rarely, the homework assignment is different from the in-class assignments
and requires its own section in this document.
Consider your readers (graders):
* organize the document clearly (use this document as an example)
* label minor sections under each day (use this document as an example)
* For each thing you do, always have these three parts:
    1. Say what you're going to do and why.
    2. Do it with code, and document your code.
    3. Interpret the results.
## Global code options
```{R}
# I set some GLOBAL R chunk options here.
# (to hide this message add "echo=FALSE" to the code chunk options)
# In particular, see the fig.height and fig.width (in inches)
# and notes about the cache option.
knitr::opts_chunk$set(comment = NA, message = FALSE, warning = FALSE)
options(width = 100) # console width is an R option, not a knitr chunk option
knitr::opts_chunk$set(fig.align = "center", fig.height = 4, fig.width = 6)
# Note: The "cache=TRUE" option will save the computations of code chunks
# so that R doesn't recompute everything every time you recompile.
# This can save _tons of time_ if you're working on a small section of code
# at the bottom of the document.
# Code chunks will be recompiled only if they are edited.
# The autodep=TRUE will also update dependent code chunks.
# A folder is created with the cache in it -- if you delete the folder, then
# all the code chunks will recompute again.
# ** If things aren't working as expected, or I want to freshly compute everything,
# I delete the *_cache folder.
knitr::opts_chunk$set(cache = TRUE) #, autodep=TRUE)
```
## Document
### Naming
*Note: Each class, save this file with a new name,
updating the last two digits to the class number.
Then you'll have a record of your progress,
as well as of which files you turned in for grading.*
* `ADA1_HW_ALL_NESARC_Project_05.Rmd`
* `ADA1_HW_ALL_NESARC_Project_06.Rmd`
* `ADA1_HW_ALL_NESARC_Project_07.Rmd` ...
A version that I prefer is to use a date using Year-Month-Day, `YYYYMMDD`:
* `ADA1_HW_ALL_NESARC_Project_20190903.Rmd`
* `ADA1_HW_ALL_NESARC_Project_20190905.Rmd`
* `ADA1_HW_ALL_NESARC_Project_20190910.Rmd` ...
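As an aside, a dated filename in this `YYYYMMDD` pattern can also be generated programmatically. A small sketch (the prefix string is just this project's example name):

```{R}
# build a YYYYMMDD-dated filename for today's class file
prefix <- "ADA1_HW_ALL_NESARC_Project_"
fname  <- paste0(prefix, format(Sys.Date(), "%Y%m%d"), ".Rmd")
fname
```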
### Structure
Starting in Week03, we will concatenate all our Homework assignments together to retain the
relevant information needed for subsequent classes.
You will also have opportunities to revisit previous parts to make changes or improvements,
such as updating your codebook, modifying your research questions, or improving tables and plots.
I've provided an initial predicted organization of our
sections and subsections using the \# and \#\# symbols.
A table of contents is automatically generated by `toc: true` in the YAML header,
and the headings in the table of contents are clickable links that jump down
to each (sub)section.
----------------------------------------
# Research Questions
## Week01: Personal Codebook
### Class 02 Rmd, codebook
__Dataset__: National Epidemiologic Survey on Alcohol and Related Conditions (NESARC), with codebook wv1codebook.pdf.
__Initial thinking__: _While nicotine dependence is a good starting point, I
need to determine what it is about nicotine dependence that I am interested in.
It strikes me that friends and acquaintances that I have known through the
years that became hooked on cigarettes did so across very different periods of
time. Some seemed to be dependent soon after their first few experiences with
smoking and others after many years of generally irregular smoking behavior. I
decide that I am most interested in exploring the association between level of
smoking and nicotine dependence. I add to my codebook variables reflecting
smoking levels (e.g., smoking quantity and frequency)._
__Topic of interest__:
Is there an association between nicotine dependence
and the frequency and quantity of smoking among people up to 25 years old?
The association may differ by ethnicity, age, gender, and other factors
(though we won't be able to test these additional associations until next semester in ADA2).
__How I did it__: I looked through the codebook wv1codebook.pdf and found some
variables of interest. I searched the text with "Ctrl-F" (find) to locate these
variables.
For each variable, I copied/pasted the description here, then formatted it so it's
organized. You can choose to use a table or an outline format. I found this
text format very easy to work with. I retained the "frequency" of each
response because it's interesting to know, and because it was already in the
codebook --- this value is not required for your codebook.
### Codebook
```
Dataset: NESARC
Primary association: nicotine dependence vs frequency and quantity of smoking

Key:
RenamedVarName
  VarName original in dataset
  Variable description
  Data type (Continuous, Discrete, Nominal, Ordinal)
  Frequency ItemValue Description

ID
  IDNUM
  UNIQUE ID NUMBER WITH NO ALPHABETICS
  Nominal
  43093 1-43093. Unique Identification number

Sex
  SEX
  SEX
  Nominal
  18518 1. Male
  24575 2. Female

Age
  AGE
  AGE
  Continuous
  43079 18-97. Age in years
     14 98. 98 years or older

SmokingStatus
  CHECK321
  CIGARETTE SMOKING STATUS
  Nominal
   9913 1. Smoked cigarettes in the past 12 months
   8078 2. Smoked cigarettes prior to the last 12 months
     22 9. Unknown
  25080 BL. NA, never or unknown if ever smoked 100+ cigarettes

TobaccoDependence
  TAB12MDX
  NICOTINE DEPENDENCE IN THE LAST 12 MONTHS
  Nominal
  38131 0. No nicotine dependence
   4962 1. Nicotine dependence

SmokingFreq
  S3AQ3B1
  USUAL FREQUENCY WHEN SMOKED CIGARETTES
  Ordinal
  14836 1. Every day
    460 2. 5 to 6 Day(s) a week
    687 3. 3 to 4 Day(s) a week
    747 4. 1 to 2 Day(s) a week
    409 5. 2 to 3 Day(s) a month
    772 6. Once a month or less
    102 9. Unknown
  25080 BL. NA, never or unknown if ever smoked 100+ cigarettes

DailyCigsSmoked
  S3AQ3C1
  USUAL QUANTITY WHEN SMOKED CIGARETTES
  Discrete
  17751 1-98. Cigarette(s)
    262 99. Unknown
  25080 BL. NA, never or unknown if ever smoked 100+ cigarettes

Ethnicity
  ETHRACE2A
  IMPUTED RACE/ETHNICITY
  Nominal
  24507 1. White, Not Hispanic or Latino
   8245 2. Black, Not Hispanic or Latino
    701 3. American Indian/Alaska Native, Not Hispanic or Latino
   1332 4. Asian/Native Hawaiian/Pacific Islander, Not Hispanic or Latino
   8308 5. Hispanic or Latino

Depression
  MAJORDEPLIFE
  MAJOR DEPRESSION - LIFETIME (NON-HIERARCHICAL)
  Nominal
  35254 0. No
   7839 1. Yes

IncomePersonal
  S1Q10A
  TOTAL PERSONAL INCOME IN LAST 12 MONTHS
  Continuous
  43093 0-3000000. Income in dollars

Height_ft
  S1Q24FT
  HEIGHT: FEET
  Continuous
  42363 4-7. Feet
    730 99. Unknown
  * change 99 to NA

Height_in
  S1Q24IN
  HEIGHT: INCHES
  Continuous
   3572 0. None
  38760 1-11. Inches
    761 99. Unknown
  * change 99 to NA

Weight_lb
  S1Q24LB
  WEIGHT: POUNDS
  Continuous
  41717 62-500. Pounds
   1376 999. Unknown
  * change 999 to NA
```
Additional variables were created from the original variables:
```
CREATED VARIABLES

DaysSmoke
  estimated number of days per month a subject smokes
  Continuous (due to the way it is estimated; assumes 30 days per month, derived from SmokingFreq)
  0-30.

TotalCigsSmoked
  estimated number of cigarettes per month a subject smokes (DaysSmoke * DailyCigsSmoked)
  Continuous
  0-large.

CigsSmokedFac
  quartiles of TotalCigsSmoked
  Ordinal
  ranges for each of the four quarters

SmokingFreq3
  collapsed levels of smoking frequency (four levels)
  Ordinal
  from SmokingFreq
  1. Daily   = 1. Every day, 2. 5 to 6 Day(s) a week, 3. 3 to 4 Day(s) a week
  2. Weekly  = 4. 1 to 2 Day(s) a week, 5. 2 to 3 Day(s) a month
  3. Monthly = 6. Once a month or less
  4. Never   = 7. Never

Height_inches
  Total height in inches
  Continuous
  Height_ft * 12 + Height_in
```
----------------------------------------
## Week02: Literature Review
### Tuesday ---------
### Class 03 Research questions
__Dataset__:
National Epidemiologic Survey on Alcohol and Related Conditions (NESARC),
with codebook NESARC_W1_CodeBook.pdf.
Research questions:
1. Is nicotine dependence [S3AQ10D] associated with smoking frequency [S3AQ3B1] and quantity [S3AQ3C1]?
2. Is nicotine dependence [S3AQ10D] associated with depression [S4AQ1]?
3. Is the association between nicotine dependence [S3AQ10D] and depression [S4AQ1] different by demographics, such as Education or Sex?
### Thursday ---------
### Class 04 Citations and Literature review
### Citations
* @beck2014 said an interesting thing about X and Y and I will try that relationship with my variables A and B.
* @rich2013 said something similar, but it was with adults instead of adolescents.
* There were other citations that are related and I want to read them more carefully [@murp2012; @dean2014; @beck2014]
For more documentation on bibliographies and citations with R Markdown, see
[http://rmarkdown.rstudio.com/authoring_bibliographies_and_citations.html](http://rmarkdown.rstudio.com/authoring_bibliographies_and_citations.html).
For general help with R Markdown, see
[https://www.rstudio.com/wp-content/uploads/2015/03/rmarkdown-reference.pdf](https://www.rstudio.com/wp-content/uploads/2015/03/rmarkdown-reference.pdf).
Citations to appear in parentheses are cited in brackets like this [@beck2014; @rich2013; @murp2012; @dean2014].
A citation you want to refer to in text, such as @beck2014, doesn't have the brackets.
### Week 02 Homework Literature review
__Dataset__:
National Epidemiologic Survey on Alcohol and Related Conditions (NESARC),
with codebook NESARC_W1_CodeBook.pdf.
Research questions:
1. Is nicotine dependence [S3AQ10D] associated with smoking frequency [S3AQ3B1] and quantity [S3AQ3C1]?
* Google scholar search: "nicotine dependence smoking frequency"
* Citation: pdf file available for @dierker_association_2007
* Interesting points: Figures 2 and 3, quantity and frequency both positively related to probability of dependence.
* Others: @kandel_extent_2000 and @caraballo_linking_2009
**You don't need to include images in your literature review.
I'm providing these figures and tables to illustrate what they look like.**
![](ADA1_WS_06_LiteratureReview_image3.JPG)
2. Is nicotine dependence [S3AQ10D] associated with depression [S4AQ1]?
* Google scholar search: "nicotine dependence depression"
* Citation: pdf file available for @breslau1995psychiatric
* Interesting points: Table 2, Smoking and Nicotine Dependence both associated with **Education**.
Table 3, Major depression associated with being nicotine dependent and **Sex**.
* Others: @breslau_major_1998
![](ADA1_WS_06_LiteratureReview_image1.JPG)
![](ADA1_WS_06_LiteratureReview_image2.JPG)
3. Is the association between nicotine dependence [S3AQ10D] and depression [S4AQ1] different by demographics, such as Education or Sex?
* In Question 2 we see differences by Education and Sex.
I have decided to further focus my question by examining whether the
association between nicotine dependence and depression differs based on how
much a person smokes. I am wondering if at low levels of smoking compared to
high levels, nicotine dependence is more common among individuals with major
depression than those without major depression. I add relevant depression
questions/items/variables to my personal codebook as well as several
demographic measures (age, gender, ethnicity, education, etc.) and any other variables I
may wish to consider.
All required variables have been found and added to my personal codebook (by expanding Class 03).
Citations are at the bottom of this document in the "References" section.
----------------------------------------
# Data Management
## Week03: Data Subset, Univariate Summaries And Plots
### Background
#### Purpose of study: smoking and depression in young adults
(Note: This is the last paragraph of the research proposal used to remind the
reader of the research objectives.)
The present study will examine young adults from the National Epidemiologic
Survey of Alcohol and Related Conditions (NESARC). The goals of the analysis
will include 1) establishing the relationship between major depression and
nicotine dependence; and 2) determining whether or not the relationship between
major depression and nicotine dependence exists above and beyond smoking
quantity.
#### Variables
Variables from NESARC that will be used include (already in codebook, above):
* `IDNUM` (unique ID number with no alphabetics),
* `MAJORDEPLIFE` (has subject experienced a major depression?),
* `CHECK321` (has subject smoked cigarettes in past 12 months?),
* `AGE` (age of subject),
* `TAB12MDX` (tobacco dependence past 12 months),
* `S3AQ3B1` (usual frequency when cigarettes smoked),
* `ETHRACE2A` (ethnicity of subject),
* `SEX` (gender of subject), and
* `S3AQ3C1` (usual smoking quantity of cigarettes when smoked).
### Tuesday ---------
### Data subset
First, the data is loaded into the R workspace.
```{R}
# data analysis packages
library(tidyverse) # Data manipulation and visualization suite
library(forcats) # Factor variables
library(lubridate) # Dates
## 1. Download the ".RData" file for your dataset into your ADA Folder.
## 2. Use the load() statement for the dataset you want to use.
##
## load("AddHealth.RData")
## load("addhealth_public4.RData")
## load("NESARC.RData")
# read data
load("NESARC.RData")
dim(NESARC)
```
The variables of interest are specified and a subset of the dataset including
only these variables is created and stored in the data frame `nesarc_sub`.
```{R}
# variables to include in our data subset
nesarc_sub <-
NESARC %>%
select(
IDNUM
, SEX
, AGE
, CHECK321
, TAB12MDX
, S3AQ3B1
, S3AQ3C1
, ETHRACE2A
, MAJORDEPLIFE
, S1Q10A
, S1Q24FT
, S1Q24IN
, S1Q24LB
)
```
Check size and structure of data subset.
Note that categorical variables are already `Factor` variables, but the levels do not have meaningful labels, yet.
```{R}
# size of subset
dim(nesarc_sub)
# structure of subset
str(nesarc_sub)
```
#### EXTRA: AddHealth, joining waves 1 and 4 together
Joining AddHealth waves 1 and 4 together into a single dataset
can be done if you want to use variables from when
the participants were adolescents (Wave 1) and adults (Wave 4).
```{R}
load("AddHealth.RData")
load("addhealth_public4.RData")
# size of datasets, to verify the data is available
dim(AddHealth)
dim(addhealth_public4)
# First column is the unique ID
colnames(AddHealth)[1]
colnames(addhealth_public4)[1]
# Rename W4's unique ID variable "AID" and "aid" to "ID"
# so both W1 and W4 have same variable name (lower/uppercase matters)
AddHealth <-
AddHealth %>%
dplyr::rename(ID = AID)
addhealth_public4 <-
addhealth_public4 %>%
dplyr::rename(ID = aid)
# note that the first column name has been updated
colnames(AddHealth)[1]
colnames(addhealth_public4)[1]
# Join two datasets by unique ID
AH14 <-
full_join(AddHealth, addhealth_public4)
# now we have twice as many columns
dim(AH14)
# remove the original data and use AH14 going forward
#rm(AddHealth)
#rm(addhealth_public4)
# I don't need this data for the NESARC project, so I am removing this large object.
# DELETE THIS rm() LINE if you're using AddHealth
#rm(AH14)
```
### Renaming Variables
Rename variables `NewName = OriginalName`.
```{R}
nesarc_sub <-
nesarc_sub %>%
dplyr::rename(
ID = IDNUM
, Sex = SEX
, Age = AGE
, SmokingStatus = CHECK321
, TobaccoDependence = TAB12MDX
, SmokingFreq = S3AQ3B1
, DailyCigsSmoked = S3AQ3C1
, Ethnicity = ETHRACE2A
, Depression = MAJORDEPLIFE
, IncomePersonal = S1Q10A
, Height_ft = S1Q24FT
, Height_in = S1Q24IN
, Weight_lb = S1Q24LB
)
str(nesarc_sub)
```
The first 6 and last 6 values are given by `head()` and `tail()`.
```{R}
head(nesarc_sub)
tail(nesarc_sub)
```
A quick summary of the variables indicates many missing values.
```{R}
# summary of variables -- notice many NAs
summary(nesarc_sub)
```
### Coding missing values
There are two steps.
The first step is to recode any existing `NA`s to actual values, if necessary.
The method for doing this differs for numeric and categorical variables.
The second step is to recode any coded missing values, such as 9s or 99s, as actual `NA`.
#### Coding `NA`s as meaningful "missing"
First step: in this dataset the existing `NA` values mean "never",
and "never" has a meaning different from "missing".
For each variable we need to decide what "never" means
and code it appropriately.
##### `NA`s recoded as numeric
The easiest is `DailyCigsSmoked` where `NA` means 0 cigarettes.
We replace `NA` with 0.
The function `is.na()` identifies each observation that "is an `NA`".
I've updated the codebook to indicate that the original `NA` values were changed.
```
S3AQ3C1 (DailyCigsSmoked)
  USUAL QUANTITY WHEN SMOKED CIGARETTES
  Discrete
  17751 1-98. Cigarette(s)
    262 99. Unknown
  25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
  * change to 0 Cigarette(s)
```
```{R}
table(nesarc_sub$DailyCigsSmoked)
nesarc_sub <-
nesarc_sub %>%
replace_na(
# replace NA with new numeric value
list(
DailyCigsSmoked = 0
)
)
table(nesarc_sub$DailyCigsSmoked)
```
##### `NA`s recoded as categorical
The next two variables require defining a new category level.
For both `SmokingStatus` and `SmokingFreq`, a value of `NA` means "never",
so we need to create a new "Never" category and put it in order
with the rest of the levels.
I've updated the codebook to indicate that the original `NA` values were changed.
```
CHECK321 (SmokingStatus)
  CIGARETTE SMOKING STATUS
  Nominal
   9913 1. Smoked cigarettes in the past 12 months
   8078 2. Smoked cigarettes prior to the last 12 months
     22 9. Unknown
  25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
  * change to 3. Never

S3AQ3B1 (SmokingFreq)
  USUAL FREQUENCY WHEN SMOKED CIGARETTES
  Ordinal
  14836 1. Every day
    460 2. 5 to 6 Day(s) a week
    687 3. 3 to 4 Day(s) a week
    747 4. 1 to 2 Day(s) a week
    409 5. 2 to 3 Day(s) a month
    772 6. Once a month or less
    102 9. Unknown
  25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
  * change to 7. Never
```
```{R}
table(nesarc_sub$SmokingStatus )
table(nesarc_sub$SmokingFreq )
nesarc_sub <-
nesarc_sub %>%
mutate(
# first change the "factor" variables to "character" to add a new level
SmokingStatus = as.character(SmokingStatus)
, SmokingFreq = as.character(SmokingFreq )
) %>%
replace_na(
# replace NA with new category values
list(
SmokingStatus = "3"
, SmokingFreq = "7"
)
)
table(nesarc_sub$SmokingStatus )
table(nesarc_sub$SmokingFreq )
```
#### Coding 9s and 99s as `NA`s
Second step: replace the coded missing values with `NA`,
then remove the remaining records with missing values.
Not all of these variables need this operation, since the earlier subsetting
happened to remove their coded missing values.
However, I'm going to run each one as an extra precaution.
After all, if we later make different upstream data decisions,
we could end up with missing values at this point that
are incorrectly coded without this step!
```{R}
table(nesarc_sub$SmokingStatus )
table(nesarc_sub$SmokingFreq )
table(nesarc_sub$DailyCigsSmoked)
nesarc_sub <-
nesarc_sub %>%
mutate(
# replace missing values
SmokingStatus = replace(SmokingStatus , SmokingStatus %in% c("9"), NA)
, SmokingFreq = replace(SmokingFreq , SmokingFreq %in% c("9"), NA)
, DailyCigsSmoked = replace(DailyCigsSmoked, DailyCigsSmoked %in% c( 99), NA)
, Height_ft = replace(Height_ft , Height_ft %in% c( 99), NA)
, Height_in = replace(Height_in , Height_in %in% c( 99), NA)
, Weight_lb = replace(Weight_lb , Weight_lb %in% c(999), NA)
# drop unused factor levels
, SmokingStatus = fct_drop(SmokingStatus)
, SmokingFreq = fct_drop(SmokingFreq )
)
table(nesarc_sub$SmokingStatus )
table(nesarc_sub$SmokingFreq )
table(nesarc_sub$DailyCigsSmoked)
summary(nesarc_sub)
```
Finally, consider using `na.omit()` to remove any records with missing values,
but only if it won't cause issues with your analysis.
The command `na.omit()` removes a row if any column has an `NA` in that row.
This can have __drastic__ consequences.
For example, imagine that you have 7 variables, 6 of which have complete data,
but the 7th variable is about three-quarters `NA`s!
Then `na.omit()` will drop three-quarters of your data,
even though most of your variables are complete!
There's only one advantage I can think of for running `na.omit()`,
and that's if you have a large dataset and only a few observations have a few `NA`s.
You can't analyze data with `NA`s,
since any calculation involving `NA` results in `NA`,
so observations with `NA`s are removed before each analysis.
Therefore imagine two scenarios:
(1) an analysis of Variables 1 and 2, each with a few `NA`s distributed throughout;
first the observations with `NA` for either Variable 1 or 2 are removed, then
the analysis proceeds.
(2) A similar analysis of Variables 1 and 3, where Variable 3 has `NA`s
for different observations than Variable 2; therefore, different observations
are removed before analysis.
In summary, these two analyses include different observations from the dataset.
By running `na.omit()` first, you start with a dataset with no `NA`s,
so every analysis is of exactly the same set of data
(at the cost of not using all the possible data for each individual analysis).
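As a toy sketch (made-up numbers, not NESARC data) of how drastic this can be, a single mostly-missing column causes `na.omit()` to discard most rows even though the other columns are complete:

```{R}
# toy data frame: x and y are complete, z is mostly NA
toy <- data.frame(
    x = 1:8
  , y = letters[1:8]
  , z = c(9, NA, NA, NA, NA, NA, 10, NA)
  )
dim(toy)          # 8 rows, 3 columns
dim(na.omit(toy)) # only the 2 rows with complete z remain
```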
Below, I do __not run `na.omit()`__ since I'm concerned about having lots of `NA`s for some variables.
```{R}
dim(nesarc_sub)
# remove rows with missing values
#nesarc_sub <- na.omit(nesarc_sub)
dim(nesarc_sub)
summary(nesarc_sub)
```
Since `na.omit()` was not run, the dataset still contains some missing values;
they will be excluded as needed in each analysis.
### Creating new variables
Additional variables that we created from the original variables are listed in
this amendment to the original codebook.
```
CREATED VARIABLES

DaysSmoke
  estimated number of days per month a subject smokes
  Continuous (due to the way it is estimated; assumes 30 days per month, derived from SmokingFreq)
  0-30.

TotalCigsSmoked
  estimated number of cigarettes per month a subject smokes (DaysSmoke * DailyCigsSmoked)
  Continuous
  0-large.

CigsSmokedFac
  quartiles of TotalCigsSmoked
  Ordinal
  ranges for each of the four quarters

SmokingFreq3
  collapsed levels of smoking frequency (four levels)
  Ordinal
  from SmokingFreq
  1. Daily   = 1. Every day, 2. 5 to 6 Day(s) a week, 3. 3 to 4 Day(s) a week
  2. Weekly  = 4. 1 to 2 Day(s) a week, 5. 2 to 3 Day(s) a month
  3. Monthly = 6. Once a month or less
  4. Never   = 7. Never

Height_inches
  Total height in inches
  Continuous
  Height_ft * 12 + Height_in
```
#### From categories to numeric
`DaysSmoke` estimates the number of days per month a subject smokes
by mapping each level of `SmokingFreq` (a factor with 7 levels) to a numeric value
with `case_when()`, using the midpoint of each level's range (assuming 4 weeks per month).
```{R}
nesarc_sub <-
nesarc_sub %>%
mutate(
DaysSmoke =
case_when(
SmokingFreq == 1 ~ 30 # 1. Every day
, SmokingFreq == 2 ~ 4 * 5.5 # 2. 5 to 6 Day(s) a week
, SmokingFreq == 3 ~ 4 * 3.5 # 3. 3 to 4 Day(s) a week
, SmokingFreq == 4 ~ 4 * 1.5 # 4. 1 to 2 Day(s) a week
, SmokingFreq == 5 ~ 2.5 # 5. 2 to 3 Day(s) a month
, SmokingFreq == 6 ~ 1 # 6. Once a month or less
, SmokingFreq == 7 ~ 0 # 7. Never
)
)
summary(nesarc_sub$DaysSmoke)
```
##### AddHealth4 Income example
Use the AddHealth Wave 4 income variable to convert from categorical to numerical.
Note that there's also variable `H4EC2`, which is already a numeric income variable.
```{R}
# create "Income4" from the Section 12: Economics section.
AH14 <-
AH14 %>%
mutate(
Income4 = h4ec1
)
table(AH14$Income4)
# create a numeric version of this categorical variable
AH14 <-
AH14 %>%
mutate(
Income4_num =
case_when(
Income4 == 1 ~ 2500 # less than $5,000
, Income4 == 2 ~ 7500 # $5,000 to $9,999
, Income4 == 3 ~ 12500 # $10,000 to $14,999
, Income4 == 4 ~ 17500 # $15,000 to $19,999
, Income4 == 5 ~ 22500 # $20,000 to $24,999
, Income4 == 6 ~ 27500 # $25,000 to $29,999
, Income4 == 7 ~ 35000 # $30,000 to $39,999
, Income4 == 8 ~ 45000 # $40,000 to $49,999
, Income4 == 9 ~ 62500 # $50,000 to $74,999
, Income4 == 10 ~ 87500 # $75,000 to $99,999
, Income4 == 11 ~ 125000 # $100,000 to $149,999
, Income4 == 12 ~ 175000 # $150,000 or more
              ## Values 96 (refused) and 98 (don't know) are not matched by any
              ## condition, so case_when() returns NA for them automatically.
              #, Income4 == 96 ~ # refused
              #, Income4 == 98 ~ # don't know
)
)
table(AH14$Income4_num)
```
#### From numeric to numeric
`Height_inches` is the height of a person in inches.
Since the data was collected in two variables (`Height_ft` and `Height_in`)
we need to add them together for our new variable.
```{R}
nesarc_sub <-
nesarc_sub %>%
mutate(
Height_inches = Height_ft * 12 + Height_in
)
summary(nesarc_sub$Height_inches)
```
The variable `TotalCigsSmoked` estimates the number of cigarettes
a subject smokes per month by multiplying `DaysSmoke` by `DailyCigsSmoked`
(the usual quantity smoked per day).
```{R}
nesarc_sub <-
nesarc_sub %>%
mutate(
TotalCigsSmoked = DaysSmoke * DailyCigsSmoked
)
```
**Log scale.**
I like log2 since its interpretation is that every unit is a doubling.
Notice that I added 1 to the `TotalCigsSmoked` before computing the log.
This is because log(0) is undefined, thus I add 1 so that log(0 + 1) = 0 and all values are defined.
```{R}
nesarc_sub <-
nesarc_sub %>%
mutate(
TotalCigsSmokedLog2 = log(TotalCigsSmoked + 1, base = 2)
)
summary(nesarc_sub$TotalCigsSmokedLog2)
```
**Square root scale.**
Square root is another transformation that spreads out values between 0 and 1
and compresses values from 1 to infinity.
It can be preferred to a log transformation in some cases.
```{R}
nesarc_sub <-
nesarc_sub %>%
mutate(
TotalCigsSmokedSqrt = sqrt(TotalCigsSmoked)
)
summary(nesarc_sub$TotalCigsSmokedSqrt)
```
#### From numeric to categories based on quantiles
The numeric variable `TotalCigsSmoked` is converted into a factor
with four groups of roughly equal size (quartiles) stored in the levels of the
factor `CigsSmokedFac` using the `.bincode()` function.
The label `"Q1"` indicates that the original value was in the first quarter (25%) of the ordered data,
that is, among the smallest 25% of observations;
while `"Q4"` indicates the original value was in the last quarter (25%) of the ordered data,
that is, among the largest 25% of observations.
```{R}
quantile(nesarc_sub$TotalCigsSmoked, na.rm = TRUE)
nesarc_sub <-
nesarc_sub %>%
mutate(
CigsSmokedFac =
.bincode(TotalCigsSmoked
, breaks = quantile(TotalCigsSmoked, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)
, include.lowest = TRUE
)
, CigsSmokedFac = factor( CigsSmokedFac
, levels = c(1, 2, 3, 4)
, labels = c("Q1", "Q2", "Q3", "Q4")
)
)
```
In my data, more than half of the values are 0 (the minimum), so the 0%, 25%,
and 50% quantiles coincide; all of the 0s get grouped into "Q1", which leaves
no values for "Q2" --- oh well.
```{R}
summary(nesarc_sub$CigsSmokedFac)
```
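As a toy sketch (made-up values, not the NESARC data) of why the quartile bins collapse: when over half the values are 0, the 0%, 25%, and 50% quantiles are identical, so the breaks are duplicated. `cut()` would error on duplicated breaks, but `.bincode()` tolerates them; the second bin has zero width and ends up empty.

```{R}
# toy vector: 6 of 10 values are zero
x_toy  <- c(0, 0, 0, 0, 0, 0, 2, 5, 9, 20)
br_toy <- quantile(x_toy, probs = c(0, 0.25, 0.5, 0.75, 1))
br_toy  # the 0%, 25%, and 50% quantiles are all 0
# bin 2 is a zero-width interval, so no values fall into it
table(.bincode(x_toy, breaks = br_toy, include.lowest = TRUE))
```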
#### From many categories to a few
The variable `SmokingFreq3` collapses `SmokingFreq` from 7 categories
(including the "Never" level we created) down to 4.
This is an example to show you how to do this with your own variables.
Some of you will have categorical variables with a dozen or so categories;
having so many categories makes interpretation very difficult --- fewer is easier.
```{R}
## SmokingFreq3 variable
# notice that I'm doing this before I've assigned the labels
# if you do this after assigning labels (with the factor() command), then you'll want to use "as.numeric()" as below
# inspect original variable
table(nesarc_sub$SmokingFreq)
# look at numeric values
table(as.numeric(nesarc_sub$SmokingFreq))
# Note: it is sometimes easier to use the numeric "levels" of a factor variable, than the character "labels"
# initialize new variable with NAs
nesarc_sub <-
nesarc_sub %>%
mutate(
SmokingFreq3 =
fct_collapse(
SmokingFreq
, "1" = c("1", "2", "3") # 1. daily
, "2" = c("4", "5") # 2. weekly
, "3" = "6" # 3. monthly
, "4" = "7" # 7. never
)
)
# new variable
table(nesarc_sub$SmokingFreq3)
# we will label the new factor variable below, with the others
```
#### Review results of new variables
Look at the first 10 observations in the dataset, with special attention to the
new variables.
```{R}
head(nesarc_sub, 10)
```
### Labeling Categorical variable levels
Look at a summary of the dataset to see the state of the labeling before we make changes.
```{R}
summary(nesarc_sub)
```
Informative labels are given to the factor levels.
The order of the levels is also rearranged for the variables
`SmokingFreq`, `TobaccoDependence`, and `Sex`.
```{R}
# Label factors with meaningful names in the same order as the codebook
nesarc_sub$Depression <-
factor(
nesarc_sub$Depression
, labels = c("No Depression"
, "Yes Depression"
)
)
nesarc_sub$SmokingStatus <-
factor(
nesarc_sub$SmokingStatus
, labels = c("Smoked past 12 months"
, "Smoked prior to 12 months"
, "Never smoked"
)
)
# Specify the order of the levels to be different from codebook order
nesarc_sub$Sex <-
factor(nesarc_sub$Sex
, levels = c( 2
, 1
)
, labels = c( "Female"
, "Male"
)
)
# check ordering with a frequency table
table(nesarc_sub$Sex)
nesarc_sub$TobaccoDependence <-
factor(nesarc_sub$TobaccoDependence
, levels = c( 1
, 0
)
, labels = c( "Nicotine Dependence"
, "No Nicotine Dependence"
)
)
# Remember to include the new category level from the original `NA` values.
nesarc_sub$SmokingFreq <-
factor(nesarc_sub$SmokingFreq
, levels = c( 7
, 6
, 5
, 4
, 3
, 2
, 1
)
, labels = c( "Never"
, "Once a month or less"
, "2 to 3 Days/month"
, "1 to 2 Days/week"
, "3 to 4 Days/week"
, "5 to 6 Days/week"
, "Every Day"
)
)
nesarc_sub$Ethnicity <-
factor(nesarc_sub$Ethnicity
# shorter labels are helpful for plots
, labels = c( "Cauc"
, "AfAm"
, "NaAm"
, "Asia"
, "Hisp"
)
# , labels = c("Caucasian"
# , "African American"
# , "Native American"
# , "Asian"
# , "Hispanic"
# )
)
nesarc_sub$SmokingFreq3 <-
factor(nesarc_sub$SmokingFreq3
, levels = c( 1
, 2
, 3
, 4
)
, labels = c( "Daily"
, "Weekly"
, "Monthly"
, "Never"
)
)
summary(nesarc_sub)
```
### Data subset rows to young smokers
In practice, I would probably subset the data much earlier.
In this example I waited until the end to subset so that I could illustrate
all of the other steps with complete values.
In our example,
we are interested in the _subset of people 25 or younger_
who _smoked in the last 12 months_ and this is
created using the `filter()` function.
Note that `SmokingStatus == "Smoked past 12 months"` selects people who
smoked in the past 12 months.
Rows with missing values in the filtering variables are also dropped,
since comparisons involving `NA` do not evaluate to `TRUE`.
```{R}
# the filter() command below will not include a row that evaluates to NA for the
# specified condition. For example:
(NA == 1)
# age and smoking subset
nesarc_sub <-
nesarc_sub %>%
filter(
Age <= 25 # people 25 or younger
, SmokingStatus == "Smoked past 12 months" # smoked in the past 12 months
)
dim(nesarc_sub)
summary(nesarc_sub)
```
### EXTRA: Working with Dates
#### NESARC age example
Dates can be tricky in R. The `lubridate` package makes them much easier.
Below I'm going to create a variable for the person's age in years
based on the difference between the person's date of birth and the interview date.
There is an `AGE` variable which I'm already using,
but if there wasn't, or I wanted to be more precise about their age (such as, 24.34 years old),
then this is a way to do it.
I'm going to create a separate data frame just for this example to show how to do it;
if you needed these variables, they should be part of your original subset above.
Below, the "C" variables (`CDAY`, `CMON`, `CYEAR`) are the interview day, month, and year,
and the "DOB" variables (`DOBD`, `DOBM`, `DOBY`) are the date-of-birth day, month, and year.
```{R}
dat_sub <-
NESARC %>%
select(
CDAY
, CMON
, CYEAR
, DOBD
, DOBM
, DOBY
)
str(dat_sub)
```
Combine the date fields together, then interpret them into a date object.
I'll do this for both the interview date and the date of birth.
I print the head (first 6 observations) after each operation.
```{R}
## interview date
# combine day, month, and year into one text string
cdate_text <- paste(dat_sub$CDAY, dat_sub$CMON, dat_sub$CYEAR)
head(cdate_text)
# create date object by interpreting the day, month, year with dmy()
cdate_date <- dmy(cdate_text)
head(cdate_date)
## date of birth
# combine day, month, and year into one text string
dob_text <- paste(dat_sub$DOBD, dat_sub$DOBM, dat_sub$DOBY)
head(dob_text)
# create date object by interpreting the day, month, year with dmy()
# if any failed to parse, it's because a day, month, or year is out of range
# such as "99" for missing values or refused to answer
# in these cases, the date is NA (missing), which is what we want
dob_date <- dmy(dob_text)
head(dob_date)
```
Now that we have the interview date and date of birth as date objects,
we calculate the difference to give the person's age at the time of the interview.
Because the difference of dates is in the unit "days", we convert to years.
```{R}
# difference in days / 365.25 = difference in years
age_days <- cdate_date - dob_date
head(age_days)
age_years <- as.numeric(age_days / 365.25) # change from "difftime" to "numeric"
head(age_years)
# difference between the age we calculated and the age in the original dataset
head(age_years - NESARC$AGE)
```
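As an aside, `lubridate` can also measure the elapsed time between two dates directly. This is a sketch, assuming `lubridate` is loaded (it provides the `dmy()` used above); `time_length()` on an `interval()` counts calendar years exactly instead of approximating with 365.25-day years.

```{R}
# interval() spans the two dates; time_length() measures it in calendar years,
# handling leap years exactly rather than by the 365.25 approximation
age_years_exact <- time_length(interval(dob_date, cdate_date), unit = "years")
head(age_years_exact)
# differences from the 365.25-day approximation above are tiny
head(age_years_exact - age_years)
```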
Add the new age variable to your data frame,
drop the ones you don't need anymore,
and you're ready to go!
```{R}
# add the variable
dat_sub$age_years <- age_years
str(dat_sub)
# drop the original ones you don't need anymore
# by using subset() and select = -VAR, it excludes the VAR column
dat_sub <-
dat_sub %>%
select(-c(CDAY, CMON, CYEAR, DOBD, DOBM, DOBY))
str(dat_sub)
```
Now you've got an age (or other date object) you can use in your project.
#### AddHealth4 age example
Dates can be tricky in R. The `lubridate` package makes them much easier.
Below I'm going to create a variable for the person's age in years
based on the difference between the person's date of birth and the interview date.
There is an `AGE` variable which I'm already using,
but if there wasn't, or I wanted to be more precise about their age (such as 24.34 years old),
then this is a way to do it.
I'm going to create a separate data frame just for this example to show how to do it;
if you needed these variables, they should be part of your original subset above.
Below, the "i" variables (`iday4`, `imonth4`, `iyear4`) are the interview day, month, and year,
and the date of birth variables (`H4OD1M` and `H4OD1Y`) are the date of birth month and year
(there is no day, so we'll use day = 15 for the middle of the month).
Combine the date fields together, then interpret them into a date object.
I'll do this for both the interview date and the date of birth.
I print the head (first 6 observations) after each operation.
```{R}
AH14 <-
AH14 %>%
mutate(
idate = paste(iday4, imonth4, iyear4)
, dob = paste(15, h4od1m, h4od1y)
)
head(select(AH14, idate, dob))
# create date object by interpreting the day, month, year with dmy()
# if any failed to parse, it's because a day, month, or year is out of range
# such as "99" for missing values or refused to answer
# in these cases, the date is NA (missing), which is what we want
AH14 <-
AH14 %>%
mutate(
idate = dmy(idate)
, dob = dmy(dob)
)
head(select(AH14, idate, dob))
```
Now that we have the interview date and date of birth as date objects,
we calculate the difference to give the person's age at the time of the interview.
Because the difference of dates is in the unit "days", we convert to years.
```{R}
# difference in days / 365.25 = difference in years
AH14 <-
AH14 %>%
mutate(
age_days = idate - dob
, age_years = as.numeric(age_days / 365.25) # change from "difftime" to "numeric"
)
head(select(AH14, idate, dob, age_days, age_years))
summary(AH14$age_years)
```
Now you've got an age (or other date object) you can use in your project.
#### Converting 12-hour to 24-hour time
Let's use a time from Add Health "Wave IV Section 7: Sleep Patterns" about what time people wake up.
There are three fields for hour, minute, and am/pm, and we want to convert these three fields into a single time number.
```
Now I'm going to ask you about your sleep patterns.
1H. On the days you go to work, school or similar activities,
what time do you usually wake up?
H4SP1H num [Hour] 96=refused, 98=don't know
H4SP1M num [Minute] 96=refused, 98=don't know
H4SP1T cat [am/pm] 1=am, 2=pm, 6=refused, 8=don't know
```
There are two steps to address.
1. First, "PM" adds 12 hours.
2. Second, 12 PM stays as 12 and 12 AM becomes 00.
```{r}
AH14 <-
AH14 %>%
mutate(
# name the hour, minute, and am_pm variables
wake_hour = h4sp1h %>% as.numeric()
, wake_min = h4sp1m %>% as.numeric()
, wake_ampm = h4sp1t %>% as.numeric()
# code missing values labels as NAs
, wake_hour = replace(wake_hour, wake_hour %in% c(96, 98), NA)
, wake_min = replace(wake_min , wake_min %in% c(96, 98), NA)
, wake_ampm = replace(wake_ampm, wake_ampm %in% c( 6, 8), NA)
# 1=am, 2=pm
, wake_ampm = case_when(
wake_ampm == 1 ~ "AM"
, wake_ampm == 2 ~ "PM"
)
# convert hour to 24-hour time
, wake_hour24 = case_when(
# hour is 1-11 and AM, then hour is 1-11
(wake_hour %in% 1:11) & (wake_ampm == "AM") ~ wake_hour
# hour is 1-11 and PM, then hour adds 12 for 13-23
, (wake_hour %in% 1:11) & (wake_ampm == "PM") ~ wake_hour + 12
# hour is 12 and AM, then hour is 0 (midnight)
, (wake_hour == 12 ) & (wake_ampm == "AM") ~ 0
# hour is 12 and PM, then hour is 12 (noon)
, (wake_hour == 12 ) & (wake_ampm == "PM") ~ 12
)
# convert minutes from 0-59 to fractions of an hour
, wake_min_frac = wake_min / 60
# add the 24-hour hour and the fraction minute to get a value from 0 to 23.99
, wake_time = wake_hour24 + wake_min_frac
)
# Here's an example of the new variables
# This looks correct :)
AH14 %>%
select(wake_hour, wake_min, wake_ampm, wake_time) %>%
na.omit() %>%
arrange(wake_time) %>%
slice(seq(1, n(), length = 10) )
```
Below we see what we expect, that most people wake between 5 and 8 AM.
```{R}
library(ggplot2)
p <- ggplot(data = AH14, aes(x = wake_time))
p <- p + geom_histogram(boundary = 0, binwidth = 0.5)
p <- p + geom_rug()
p <- p + labs(title = "AddHealth Wave 4 Wake time")
p <- p + theme_bw()
print(p)
```
## Data is complete
At this point, your dataset should be complete, coded, formatted, and ready for analysis.
### Plot entire dataset, show missing values
Plot the data subset to show the extent of missing values.
```{r}
ggplot_missing <- function(dat) {
# https://github.com/njtierney/neato/blob/master/R/ggplot_missing.R
# A function that plots missing data in ggplot. For a more updated version,
# see `vis_miss` from visdat - github.com/njtierney/visdat
# dat is a data frame
# return a ggplot of the missing data.
dat2 <-
dat %>%
is.na()
# create a column that is TRUE when a row has any missing value
# (TRUE draws in the "Missing" color, so fully observed rows read "Present")
ANY_MISSING <-
!(rowSums(!dat2) == ncol(dat2))
dat2 <-
cbind(dat2, ANY_MISSING) %>%
reshape2::melt()
p <- ggplot2::ggplot(data = dat2, aes(x = Var2, y = Var1))
p <- p + ggplot2::geom_raster(aes(fill = value)) # , alpha = 0.6
p <- p + ggplot2::scale_fill_grey(name = "",
labels = c("Present","Missing"))
p <- p + ggplot2::theme_minimal()
p <- p + ggplot2::theme(axis.text.x = element_text(angle=90, vjust=0, hjust=0))
p <- p + ggplot2::labs(x = "Variables in Dataset",
y = "Rows / observations")
#ggplot2::scale_y_continuous(expand = c(0,0))
p <- p + ggplot2::scale_y_reverse(expand = c(0,0), breaks = c(1, seq(0, 10000, by=20)))
return(p)
}
```
```{R, fig.height = 8, fig.width = 8}
ggplot_missing(nesarc_sub)
```
--------------------------------------------------------------------------------
# Graphing and Tabulating
## Thursday ---------
## Categorical variables
## Tables for categorical variables
Basic tables can be created using the functions `table` or `xtabs`.
Frequency tables are created for the variables
`TobaccoDependence`, `CigsSmokedFac`, and `SmokingFreq`.
When only looking at one variable, `table` is the same as `xtabs`, but
`xtabs` will make looking at two-way, three-way, and higher tables possible.
```{R}
# table() produces a one-variable frequency table
table(nesarc_sub$TobaccoDependence)
# proportions available by passing these frequencies to prop.table()
table(nesarc_sub$TobaccoDependence) %>% prop.table()
# xtabs() can be used for one or more variables
T1 <- xtabs( ~ TobaccoDependence, data = nesarc_sub)
T1
# proportions available with prop.table()
prop.table(T1)
T2 <- xtabs( ~ SmokingFreq, data = nesarc_sub)
T2
```
(Look at the .Rmd code to see that these numbers in the next paragraph are
calculated directly from the numbers summarized in the table!
This way, if the data changes, the text will be correctly updated.)
In the data frame `nesarc_sub`, there are
`r T1[1]` nicotine dependent subjects and
`r T1[2]` subjects that are not nicotine dependent.
Most of the subjects in `nesarc_sub` are daily smokers
(`r T2[7]`) with the remainder distributed roughly uniformly across the
first six levels of `SmokingFreq`.
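As a preview, `xtabs()` takes additional variables on the right-hand side of the formula to build a two-way table, and `prop.table()` with a `margin` argument then gives proportions within rows or columns. A sketch using two of the variables above:

```{R}
# two-way frequency table: dependence status by smoking frequency
T12 <- xtabs( ~ TobaccoDependence + SmokingFreq, data = nesarc_sub)
T12
# margin = 2 computes proportions within each column (each column sums to 1)
prop.table(T12, margin = 2)
```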
### Graphing frequency tables
The barplots are all created with the package `ggplot2`.
The barplots start with the defaults for the `geom_bar` and add more detail to the plot with each graph.
```{R}
library(ggplot2)
p1 <- ggplot(data = nesarc_sub, aes(x = TobaccoDependence))
p1 <- p1 + geom_bar()
p1
p2 <- ggplot(data = nesarc_sub, aes(x = SmokingFreq))
p2 <- p2 + geom_bar()
p2 <- p2 + labs(x = "", y = "Total Number", title = "Smoking Frequency for Young Smoking Adults")
p2 <- p2 + theme_bw()
p2 <- p2 + theme(axis.text.x = element_text(angle = 90, vjust = 0, hjust = 1))
p2
```
### Removing NAs from factor axes
When a factor variable has `NA`s,
then `NA` appears as a level with frequencies (see bar plots above).
As of this writing, `ggplot()` does not have a way to remove the `NA`s before plotting.
Therefore, we have to do it manually.
The easiest and most transparent way is to use the `drop_na()` function below
where we only plot the non-`NA` values.
Compare the two plots below of the same data,
but the second plot had the `NA`s removed before plotting.
```{R, fig.height = 4, fig.width = 8}
library(ggplot2)
# Original plot with NAs
p1 <- ggplot(data = nesarc_sub, aes(x = SmokingFreq))
p1 <- p1 + geom_bar()
p1 <- p1 + labs(title = "Basic plot")
p1 <- p1 + theme(axis.text.x = element_text(angle = 90, vjust = 0, hjust = 1))
#p1
# subset() excludes the NAs for the variable being plotted:
# subset() specifies the values to keep, and !is.na() means "keep the non-NA values"
# p2 <- ggplot(data = subset(nesarc_sub, !is.na(CigsSmokedFac)), aes(x = CigsSmokedFac))
# tidyverse syntax uses drop_na(Var) to drop observations where Var is NA
p2 <- ggplot(data = nesarc_sub %>% drop_na(SmokingFreq), aes(x = SmokingFreq))
p2 <- p2 + geom_bar()
p2 <- p2 + labs(title = "Using drop_na()")
p2 <- p2 + theme(axis.text.x = element_text(angle = 90, vjust = 0, hjust = 1))
#p2
# grid.arrange() is a way to arrange several ggplot objects
library(gridExtra)
grid.arrange(grobs = list(p1, p2), nrow=1, top = "Excluding NAs from factor variable in ggplot")
```
### Final versions
In your project, you only need to present your "Final version".
I'm illustrating the process above.
```{R}
library(ggplot2)
p1 <- ggplot(data = nesarc_sub, aes(x = TobaccoDependence))
p1 <- p1 + geom_bar()
p1 <- p1 + labs(x = ""
, y = "Total Number"
, title = "Nicotine Dependence for Young Smoking Adults"
)
print(p1)
```
**Interpretation:**
Among smokers 25 years and younger,
there are slightly more (`r (prop.table(T1)[1] - prop.table(T1)[2]) %>% round(3) * 100`% more)
people who are
nicotine dependent (`r prop.table(T1)[1] %>% round(3) * 100`%)
compared to those who are not nicotine dependent (`r prop.table(T1)[2] %>% round(3) * 100`%).
```{R}
p2 <- ggplot(data = nesarc_sub %>% drop_na(SmokingFreq), aes(x = SmokingFreq))
p2 <- p2 + geom_bar()
p2 <- p2 + labs(x = ""
, y = "Total Number"
, title = "Smoking Frequency for Young Smoking Adults"
)
p2 <- p2 + theme(axis.text.x = element_text(angle = 90, vjust = 0, hjust = 1))
print(p2)
```
**Interpretation:**
Among smokers 25 years and younger,
most smoke every day (`r prop.table(T2)[7] %>% round(3) * 100`%)
while the less frequent categories are roughly evenly distributed
around 4% to 5%.
## Numeric variables
## Graphing numeric variables
One popular way to graph a continuous variable is to use a histogram.
`R` has the base function `hist()` which can be used for this purpose.
We will also use the package `ggplot2` to create histograms.
We start with a basic histogram of the variable `TotalCigsSmoked`.
```{R}
hist(nesarc_sub$TotalCigsSmoked)
```
Experiment with the code in the next code chunk to set the `binwidth = ` (extremely important!),
decide if you need `boundary = 0`,
add a title, and if needed
the labels on the $x$- and $y$-axes.
```{R}
p <- ggplot(data = nesarc_sub, aes(x = TotalCigsSmoked))
p <- p + geom_histogram(binwidth = 200)
p <- p + geom_rug()
p <- p + labs(x = "Estimated Cigarettes Smoked per Month")
p <- p + theme_bw()
p
# Use "boundary = 0" for values that can't be negative
p <- ggplot(data = nesarc_sub, aes(x = TotalCigsSmoked))
p <- p + geom_histogram(boundary = 0, binwidth = 100)
p <- p + geom_rug()
p <- p + labs(x = "Estimated Cigarettes Smoked per Month", title = "boundary at 0")
p <- p + theme_bw()
p
```
### Creating Density Plots
A smoothed density curve can be added with the `geom_density()` function,
where `adjust = 2.0` is a fairly smooth curve.
Experiment with `adjust` values such as 1.0, 1.5, 2.0, and 3.0 to decrease or increase the smoothness
and decide what level best captures the general shape of your distribution.
```{R}
p <- ggplot(data = nesarc_sub, aes(x = TotalCigsSmoked))
p <- p + geom_histogram(aes(y = ..density..), boundary = 0, binwidth = 100)
p <- p + geom_rug()
p <- p + geom_density(alpha = 0.2, fill = "gray50", adjust = 1.5)
p <- p + labs(x = "Estimated Cigarettes Smoked per Month")
p <- p + theme_bw()
p
```
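One way to experiment is to overlay several `adjust` values on a single plot; a sketch (the colors here are arbitrary choices for telling the curves apart):

```{R}
p <- ggplot(data = nesarc_sub, aes(x = TotalCigsSmoked))
p <- p + geom_histogram(aes(y = ..density..), boundary = 0, binwidth = 100, fill = "gray80")
p <- p + geom_density(adjust = 1.0, colour = "red")
p <- p + geom_density(adjust = 2.0, colour = "blue")
p <- p + geom_density(adjust = 3.0, colour = "darkgreen")
p <- p + labs(x = "Estimated Cigarettes Smoked per Month"
            , title = "Density smoothness: adjust = 1 (red), 2 (blue), 3 (green)")
p <- p + theme_bw()
p
```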
### Shape, center, and spread
```{R}
summary(nesarc_sub$TotalCigsSmoked)
fivenum(nesarc_sub$TotalCigsSmoked)
median(nesarc_sub$TotalCigsSmoked, na.rm = TRUE)
IQR(nesarc_sub$TotalCigsSmoked, na.rm = TRUE)
```
The "shape" of the distribution for the estimated cigarettes smoked per month
is skewed to the right.
The "center" (median) of the distribution is
`r median(nesarc_sub$TotalCigsSmoked, na.rm = TRUE)`
and the "spread" (interquartile range, middle 50%) for the distribution is
`r IQR(nesarc_sub$TotalCigsSmoked, na.rm = TRUE)`.
### Final versions
In your project, you only need to present your "Final version".
I'm illustrating the process above.
```{R}
p <- ggplot(data = nesarc_sub, aes(x = TotalCigsSmoked))
p <- p + geom_histogram(aes(y = ..density..), boundary = 0, binwidth = 100)
p <- p + geom_rug()
p <- p + geom_density(alpha = 0.2, fill = "gray50", adjust = 1.5)
p <- p + labs(x = "Estimated Cigarettes Smoked per Month"
, y = ""
, title = "Total cigarettes smoked for Young Smoking Adults"
)
p <- p + theme_bw()
p
```
**Interpretation:**
Among smokers 25 years and younger,
the number of cigarettes smoked is right skewed
with a median (center) of `r median(nesarc_sub$TotalCigsSmoked, na.rm = TRUE)`
and the IQR (spread) (interquartile range, middle 50%) for the distribution is
`r IQR(nesarc_sub$TotalCigsSmoked, na.rm = TRUE)`.
There are three distinct modes (peaks) in the distribution
because of how we constructed this variable from two others,
and because of the common numbers of cigarettes being relative to a pack size (20).
There are several outlying values greater than 1200 (representing 40 cigarettes per day)
that are possibly overestimates from the people responding.
```{R}
p <- ggplot(data = nesarc_sub, aes(x = DailyCigsSmoked))
p <- p + geom_histogram(aes(y = ..density..), boundary = 0, binwidth = 2)
p <- p + geom_rug()
p <- p + geom_density(alpha = 0.2, fill = "gray50", adjust = 1.5)
p <- p + labs(x = "Estimated Cigarettes Smoked per Day"
, y = ""
, title = "Daily cigarettes smoked for Young Smoking Adults"
)
p <- p + theme_bw()
p
```
**Interpretation:**
Among smokers 25 years and younger,
most smoke less than 10 cigarettes per day with another peak at 20 (1 pack per day),
another small peak at 40 (2 packs per day),
and some extreme outlying values at 60, 80, and 100 (3, 4, and 5 packs per day).
----------------------------------------
## Week04: Bivariate graphs
### Tuesday ---------
### Scatter plot (for regression): x = numerical, y = numerical
Research question:
With my questions, I don't have a meaningful question to ask that relates
two numeric variables.
An exploratory question is whether there is a relationship between a person's age (`Age`)
and the total number of cigarettes smoked (`TotalCigsSmoked`).
Because age is an integer variable, I jitter the points,
and I reduce the opacity (increase the transparency) with `alpha = 1/4`
to see the density of points better (1/4 means that it takes 4 overlaid points to be fully opaque).
__Interpretation:__
I added a linear regression smoother to see whether there was any linear trend with age.
The horizontal line suggests that there isn't a (linear) relationship here.
The first plot has all the points in their original locations,
but they end up stacking on top of each other so you can't tell how many points are there.
```{R}
library(ggplot2)
p <- ggplot(nesarc_sub, aes(x = Age, y = TotalCigsSmoked))
p <- p + geom_point(alpha = 1/4)
p <- p + stat_smooth(method = lm)
p <- p + labs(title = "Total Cigarettes Smoked vs Age")
print(p)
```
This is the same data as above, but I have added some horizontal jitter (displacement)
to each point which doesn't affect their value on the y-axis
but allows us to see how many points are at each discrete age year.
```{R}
library(ggplot2)
p <- ggplot(nesarc_sub, aes(x = Age, y = TotalCigsSmoked))
p <- p + geom_jitter(position = position_jitter(width = 0.2), alpha = 1/4)
p <- p + stat_smooth(method = lm)
p <- p + labs(title = "Total Cigarettes Smoked vs Age")
print(p)
```
Here's an example where there's a strong positive relationship between the two
numeric variables.
Most variables in our dataset won't exhibit such a nice relationship.
```{R}
library(ggplot2)
p <- ggplot(nesarc_sub, aes(x = Height_inches, y = Weight_lb))
p <- p + geom_point(alpha = 1/4)
p <- p + stat_smooth(method = lm)
p <- p + labs(title = "Weight vs Height")
print(p)
```
If you have two groups you want to compare, you can create facets (small multiples)
and show the relationship by each group.
Below, I compare by sex.
The slopes appear similar;
later in the semester we'll learn how to formally compare these lines.
```{R}
library(ggplot2)
p <- ggplot(nesarc_sub, aes(x = Height_inches, y = Weight_lb))
p <- p + geom_point(alpha = 1/4)
p <- p + stat_smooth(method = lm)
p <- p + labs(title = "Weight vs Height")
p <- p + facet_wrap(~ Sex)
print(p)
```
### Box plots (for ANOVA): x = categorical, y = numerical
Research question:
1. Is there a relationship between depression (`Depression`) and total number of cigarettes smoked (`TotalCigsSmoked`)?
The $x$ variable should be a factor variable, and $y$ should be numeric.
__Interpretation:__
The boxplots indicate that the five number summary is about the same for both depression groups.
Also, the means are about the same, though slightly higher for the people with depression.
```{R}
library(ggplot2)
p <- ggplot(nesarc_sub, aes(x = Depression, y = TotalCigsSmoked))
p <- p + geom_boxplot(width = 0.5, alpha = 0.5)
p <- p + geom_jitter(position = position_jitter(width = 0.1), alpha = 1/4)
# diamond at mean for each group
p <- p + stat_summary(fun.y = mean, geom = "point", shape = 18, size = 6,
colour = "red", alpha = 0.8)
p <- p + labs(title = "Depression does not predict Total Cigarettes Smoked")
print(p)
```
### Thursday ---------
### Mosaic plot or bivariate bar plots (for contingency tables): x = categorical, y = categorical
Research question:
Is there a relationship between depression status (`Depression`) and nicotine dependence (`TobaccoDependence`)?
This strategy uses the base graphics by plotting the tables directly as bar plots.
While I'm illustrating how frequency and proportion tables can be plotted,
I do not recommend using the `barplot()` for your work.
```{R}
T1 <- xtabs(~ TobaccoDependence + Depression, data = nesarc_sub)
T1
barplot(T1)
T2 <- prop.table(T1, 2)
T2
barplot(T2)
```
The last graph needs a legend as well as a title. While it is possible to
construct a legend and title for the last graph, it is much better to use an
approach that will generate the legend automatically.
__Interpretation:__
If a person has depression they are more likely to have nicotine dependence
than if they do not have depression.
#### Categorical plots in `ggplot`
This set of plots using `ggplot()` is preferred.
Is there a relationship between depression status (`Depression`) and nicotine dependence (`TobaccoDependence`)?
__Interpretation:__
If a person has depression they are more likely to have nicotine dependence
than if they do not have depression.
```{R}
library(ggplot2)
p <- ggplot(data = nesarc_sub, aes(x = Depression, fill = TobaccoDependence))
p <- p + geom_bar(position = "fill")
p <- p + theme_bw()
p <- p + labs(x = "Depression status"
, y = "Proportion"
, title = "Fraction of young adult smokers with and without\n nicotine dependence by depression status"
)
# the legend title can be modified, if desired (try this line)
p <- p + scale_fill_discrete(name="Tobacco Addiction\nStatus")
print(p)
```
__Reversing color order:__
It is sometimes preferable to reverse the order of your color-filled variable
to read the more interesting proportions from the bottom of the plot
up the proportion scale.
Use the function `forcats::fct_rev()` on your `fill=` variable.
```{R}
library(ggplot2)
p <- ggplot(data = nesarc_sub, aes(x = Depression, fill = forcats::fct_rev(TobaccoDependence)))
p <- p + geom_bar(position = "fill")
p <- p + theme_bw()
p <- p + labs(x = "Depression status"
, y = "Proportion"
, title = "Fraction of young adult smokers with and without\n nicotine dependence by depression status"
)
# the legend title can be modified, if desired (try this line)
p <- p + scale_fill_discrete(name="Tobacco Addiction\nStatus")
print(p)
```
Does the relationship between depression status (`Depression`) and nicotine dependence (`TobaccoDependence`)
vary by sex?
__Interpretation:__
It appears that nicotine dependence is slightly higher for women with depression than for men with depression.
```{R}
library(ggplot2)
p <- ggplot(data = nesarc_sub, aes(x = Depression, fill = TobaccoDependence))
p <- p + geom_bar(position = "fill")
p <- p + theme_bw()
p <- p + labs(x = "Depression status"
, y = "Proportion"
, title = "Fraction of young adult smokers with and without\n nicotine dependence by depression status"
)
# the legend title can be modified, if desired (try this line)
p <- p + scale_fill_discrete(name="Tobacco Addiction\nStatus")
# added a row facets by Sex
p <- p + facet_grid(Sex ~ .)
# tilted the x-axis labels
p <- p + theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
print(p)
```
Research question:
Is there a relationship between smoking frequency (`SmokingFreq`) and nicotine dependence (`TobaccoDependence`)?
Also compare by sex.
```{R}
library(ggplot2)
p <- ggplot(data = subset(nesarc_sub, !is.na(SmokingFreq)), aes(x = SmokingFreq, fill = TobaccoDependence))
p <- p + geom_bar(position = "fill")
p <- p + theme_bw()
# tilt axis labels
p <- p + theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
p <- p + labs(x = "Smoking Frequency", y = "Percent")
print(p)
# also compare by sex
p <- p + facet_grid(Sex ~ .)
print(p)
```
#### Using Mosaic Plots
Mosaic plots will be used later in the course when we calculate statistics
to test whether a pair of categorical variables are associated.
These are helpful because they color-code each cell based on whether
they provide evidence of an association.
These plots are difficult to customize, so the `ggplot()` versions
above will generally be preferred to these.
__Interpretation:__
If a person has depression they are more likely to have nicotine dependence
than if they do not have depression.
```{R}
library(vcd)
mosaic( ~ TobaccoDependence + Depression, data = nesarc_sub)
mosaic( ~ TobaccoDependence + Depression, data = nesarc_sub, shade = TRUE)
mosaic( ~ TobaccoDependence + Depression + Sex, data = nesarc_sub, shade = TRUE)
```
It is easy to include so many variables (too fine a resolution)
that the plot becomes difficult to understand.
```{R, fig.width = 8, fig.height = 8}
mosaic(~ TobaccoDependence + Depression + Sex + CigsSmokedFac, data = nesarc_sub, shade = TRUE)
```
##### Customizing Mosaic Plots
There are several solutions to dealing with the messy labels that overlap.
The help line below will show the many options you have control over.
Read through the comments of each code chunk, see the new options, and see the result.
```{R, fig.height = 4, fig.width = 8}
library(vcd)
# help(labeling_border)
# Initial plot
mosaic( ~ TobaccoDependence + Depression, data = nesarc_sub)
# Abbreviate labels, specify the number of characters (try a few)
mosaic( ~ TobaccoDependence + Depression, data = nesarc_sub
, abbreviate_labs = 4)
# Rotate and align labels
mosaic( ~ TobaccoDependence + Depression, data = nesarc_sub
, rot_labels = c(0, 90, 0, 0)
, just_labels = "right" # right-justifies all the labels (top and side)
)
# Rotate and align labels only y-axis label
mosaic( ~ TobaccoDependence + Depression, data = nesarc_sub
, rot_labels = c(0, 90, 0, 0)
, just_labels = c("center", "right") # order is opposite as in first argument
)
# Rotate and align labels only y-axis label, and offset the variable name
mosaic( ~ TobaccoDependence + Depression, data = nesarc_sub
, rot_labels = c(0, 90, 0, 0)
, just_labels = c("center", "right") # order is opposite as in first argument
, offset_varnames = c(0,0,0,9) # offset the y-axis var name
)
```
__Final version__
Labeling this plot well is, in my opinion, a bit too hard to be worth it.
Notice in my code chunk options (you'll need to look at the `.Rmd` file),
that I've included `fig.height = 8, fig.width = 8` which makes the size of
the plot in the output 8 inches high by 8 inches wide.
```{R, fig.height = 8, fig.width = 8}
library(vcd)
mosaic( ~ TobaccoDependence + Depression, data = nesarc_sub
, shade = TRUE
, rot_labels = c(0, 90, 0, 0) # order is: top, right, bottom, left
, just_labels = c("center", "right") # order is opposite as in first argument
, offset_varnames = c(0, 0, 0, 9) # offset the y-axis var name
# a big left margin makes space for the plot labels
, margins = c(3, 3, 3, 14) # order is: top, right, bottom, left
, main = "Fraction of young adult smokers with and without\n nicotine dependence by depression status"
, labeling_args = list(set_varnames = c(TobaccoDependence = "Tobacco Addiction Status"
, Depression = "Depression status")
)
)
```
__Interpretation:__
If a person has nicotine dependence they are more likely to have depression
than if they do not have nicotine dependence.
### Logistic scatter plot (for logistic regression): x = numerical, y = categorical (binary)
Research question:
1. Frequency and quantity of smoking (`TotalCigsSmoked`) are markedly imperfect indices for
determining an individual's probability of exhibiting nicotine dependence (`TobaccoDependence`) (this
is true for other drugs as well).
```{R}
library(ggplot2)
p <- ggplot(nesarc_sub, aes(x = TotalCigsSmoked, y = TobaccoDependence))
p <- p + geom_jitter(position = position_jitter(height = 0.1), alpha = 1/4)
print(p)
```
__Interpretation:__
These two groups are quite balanced on the probability of tobacco dependence
given the total number of cigarettes smoked,
though they increase together slightly.
To overlay a curve, we need to convert our response variable $y$ to be
binary 0 or 1, where we let 1 indicate a "success" in the direction of interest.
#### Collapsing a variable to binary
Whether you have a numeric or categorical variable that you'd like to represent with two levels,
you'll need to convert it to a numeric binary variable (values of 0 or 1, only).
The value of 1 indicates "success" or the feature you're interested in (below, 1 = someone with tobacco dependence)
and 0 is "failure" (not tobacco dependent).
Note that "success" does not necessarily mean "good".
##### Numeric to binary
It is convenient to collapse a continuous variable into a categorical variable
(as I did in Week03 with the `cut()` command)
or further collapse a categorical variable into a binary variable.
One strategy to create a binary variable is to use the `ifelse()` function.
Let's create a binary variable from `Age`: if 21 years or older, then it is legal to purchase alcohol;
otherwise it is illegal.
See how the columns below correspond as we wanted them to.
I do not recommend overwriting the original column,
but create a new variable in the data.frame.
```{R}
nesarc_sub$Alcohol_binary <- ifelse(nesarc_sub$Age >= 21, "Legal", "Illegal")
nesarc_sub$Alcohol_binary01 <- ifelse(nesarc_sub$Age >= 21, 1, 0)
str(nesarc_sub$Alcohol_binary)
head(nesarc_sub %>% select(Age, Alcohol_binary, Alcohol_binary01), 12)
```
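The `cut()` command mentioned above can create the same two groups by specifying break points. A sketch, assuming ages in this subset run from 18 to 25 (the variable name `Alcohol_cut` is my own):

```{R}
# cut() bins a numeric variable at the given breaks;
# right = FALSE makes the intervals [18,21) and [21,26), so age 21 falls in "Legal"
nesarc_sub$Alcohol_cut <-
  cut(nesarc_sub$Age
      , breaks = c(18, 21, 26)
      , labels = c("Illegal", "Legal")
      , right  = FALSE
      )
# check the new variable against the ifelse() version above
table(nesarc_sub$Alcohol_binary, nesarc_sub$Alcohol_cut)
```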
##### Categorical to binary
Two-level categorical to binary.
```{R}
table(nesarc_sub$TobaccoDependence)
# if "Nicotine Dependence", then code it as a 1, otherwise code as a 0
nesarc_sub$TobaccoDependence01 <- ifelse(nesarc_sub$TobaccoDependence == "Nicotine Dependence", 1, 0)
# check that the frequencies are correct before and after
table(nesarc_sub$TobaccoDependence01)
# Note, sometimes it is necessary to make your variable a character type first.
# If the code above doesn't produce both 0s and 1s, then try this:
nesarc_sub$TobaccoDependence01 <- ifelse(as.character(nesarc_sub$TobaccoDependence) == "Nicotine Dependence", 1, 0)
```
Multi-level categorical to binary.
```{R}
# Smoking frequency has 7 categories
# if at least 3 Days/week, then code as a 1, otherwise code as a 0
table(nesarc_sub$SmokingFreq)
nesarc_sub$SmokingFreq01 <-
ifelse(
nesarc_sub$SmokingFreq
%in%
c("3 to 4 Days/week"
, "5 to 6 Days/week"
, "Every Day"
)
, 1 # success
, 0 # failure
)
# check that the frequencies are correct before and after
table(nesarc_sub$SmokingFreq01)
```
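An equivalent tidyverse sketch using `mutate()` with `case_when()`, which reads cleanly when there are many categories (the variable name `SmokingFreq01_alt` is my own). Note that `%in%` treats `NA` as `FALSE`, so missing smoking frequencies are coded 0 in both versions.

```{R}
nesarc_sub <-
  nesarc_sub %>%
  mutate(
    SmokingFreq01_alt = case_when(
      SmokingFreq %in% c("3 to 4 Days/week", "5 to 6 Days/week", "Every Day") ~ 1
    , TRUE                                                                    ~ 0
    )
  )
# agrees with the ifelse() version above
table(nesarc_sub$SmokingFreq01, nesarc_sub$SmokingFreq01_alt)
```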
```
#### Plot with logistic curve
```{R}
library(ggplot2)
p <- ggplot(nesarc_sub, aes(x = TotalCigsSmoked, y = TobaccoDependence01))
p <- p + geom_jitter(position = position_jitter(height = 0.1), alpha = 1/4)
# overlay a sigmoidal curve to indicate direction of relationship
# As of ggplot2 2.1.0, need to use this binomial_smooth() function:
binomial_smooth <- function(...) {
geom_smooth(method = "glm", method.args = list(family = "binomial"), ...)
}
p <- p + binomial_smooth() # old way: stat_smooth(method="glm", family="binomial", se=FALSE)
p <- p + labs(x = "Estimated Cigarettes Smoked per Month"
, y = "Tobacco Dependence (1=Yes, 0=No)"
, title = "Likelihood of Tobacco Dependence increases with number of cigs smoked"
)
print(p)
```
__Interpretation:__
The probability of someone being tobacco dependent increases as the number of cigarettes smoked per month increases.
The logistic curve model probably doesn't apply at (or near) 0 cigarettes.
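To quantify the trend the curve shows, a sketch fitting the underlying logistic regression directly with `glm()`; the smoking levels 100, 300, and 600 cigarettes per month are arbitrary illustration points:

```{R}
# fit the logistic regression that binomial_smooth() plots
glm_TD_TCS <- glm(TobaccoDependence01 ~ TotalCigsSmoked
                  , family = binomial, data = nesarc_sub)
summary(glm_TD_TCS)$coefficients
# predicted probability of dependence at selected smoking levels
predict(glm_TD_TCS
        , newdata = data.frame(TotalCigsSmoked = c(100, 300, 600))
        , type = "response")
```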
# Statistical methods: inference and testing
## Week05: Simple linear regression, logarithm transformation
### Tuesday ---------
In-class SLR intuition-building exercise and interpretation.
Do not add to this document.
Turn it in as its own document.
### Thursday ---------
### 1. Scatter plot, add a regression line.
Reproduce scatterplot from before, with a regression line.
```{R}
library(ggplot2)
p <- ggplot(nesarc_sub, aes(x = Age, y = TotalCigsSmoked))
p <- p + geom_jitter(position = position_jitter(width = 0.1), alpha = 1/4)
p <- p + stat_smooth(method = lm)
p <- p + labs(title = "Total Cigarettes Smoked vs Age")
print(p)
```
### 2. Fit the linear regression, interpret slope and $R^2$ (R-squared) values
```{R}
# fit the simple linear regression model
lm_TCS_Age <- lm(TotalCigsSmoked ~ Age, data = nesarc_sub)
# use summary() for parameter estimates (slope, intercept) and other summaries
summary(lm_TCS_Age)
```
The slope is `r signif(coef(lm_TCS_Age)[2],3)`.
Thus, for every year increase in age,
we expect an increase of `r signif(coef(lm_TCS_Age)[2],3)` in total cigarettes smoked.
The $R^2$ value is `r signif(summary(lm_TCS_Age)$r.squared, 2)`, which is effectively 0.
Compared to a model which uses only the mean total cigarettes smoked,
the model with age accounts for an additional proportion `r signif(summary(lm_TCS_Age)$r.squared, 2)` of the variance.
When the $x$ and $y$ variables are correlated ($r \ne 0$),
$R^2$ will be larger (since $R^2 = r^2$, the capitalization doesn't matter in this case).
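Since $R^2 = r^2$ in simple linear regression, we can check the identity numerically; a quick sketch with simulated data (not the NESARC variables):

```{R}
# Sketch with simulated data: in simple linear regression, R^2 from lm()
# equals the squared Pearson correlation r^2
set.seed(42)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
fit <- lm(y ~ x)
r <- cor(x, y)
c(r_squared = r^2, R_squared = summary(fit)$r.squared)
```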
### 3. Interpret the intercept. Does it make sense?
The intercept of `r signif(coef(lm_TCS_Age)[1],3)` is the predicted number of total cigarettes smoked
for a person whose age is 0.
Not only is that a gross extrapolation,
but it doesn't make sense.
### 4. Try plotting on log scale ($x$-only, $y$-only, both).
What happened to your data when you transformed it?
The plot with $(x, \log_{10}(y))$ looks much better.
That's because total cigarettes smoked (`TotalCigsSmoked`)
has a lower bound of 0 and
is right skewed (most values are low, some are extremely high).
```{R, fig.height = 4, fig.width = 8}
library(ggplot2)
p1 <- ggplot(nesarc_sub, aes(x = Age, y = TotalCigsSmoked))
p1 <- p1 + geom_jitter(position = position_jitter(width = 0.1), alpha = 1/4)
p1 <- p1 + stat_smooth(method = lm)
p1 <- p1 + labs(title = "Total Cigarettes Smoked vs Age")
#print(p1)
library(ggplot2)
p2 <- ggplot(nesarc_sub, aes(x = Age, y = TotalCigsSmoked))
p2 <- p2 + geom_jitter(position = position_jitter(width = 0.1), alpha = 1/4)
p2 <- p2 + stat_smooth(method = lm)
p2 <- p2 + scale_y_log10()
p2 <- p2 + labs(title = "log10(Total Cigarettes Smoked) vs Age")
p2 <- p2 + labs(y = "log10(TotalCigsSmoked)")
#print(p2)
library(gridExtra)
grid.arrange(grobs = list(p1, p2), nrow=1, top = "Note: log10(Total Cigarettes Smoked) improves the symmetry about the regression line")
```
The `Age` range is limited and not an interesting variable to transform in this case.
Below I provide those two plots: $(\log_{10}(x), y)$ and $(\log_{10}(x), \log_{10}(y))$.
You should try them on your data.
They will be most helpful for a right-skewed $x$ variable, such as income.
```{R, fig.height = 4, fig.width = 8}
library(ggplot2)
p1 <- ggplot(nesarc_sub, aes(x = Age, y = TotalCigsSmoked))
p1 <- p1 + geom_jitter(position = position_jitter(width = 0.002), alpha = 1/4)
p1 <- p1 + stat_smooth(method = lm)
p1 <- p1 + scale_x_log10()
p1 <- p1 + labs(title = "Total Cigarettes Smoked vs log10(Age)")
#print(p1)
library(ggplot2)
p2 <- ggplot(nesarc_sub, aes(x = Age, y = TotalCigsSmoked))
p2 <- p2 + geom_jitter(position = position_jitter(width = 0.002), alpha = 1/4)
p2 <- p2 + stat_smooth(method = lm)
p2 <- p2 + scale_x_log10()
p2 <- p2 + scale_y_log10()
p2 <- p2 + labs(title = "log10(Total Cigarettes Smoked) vs log10(Age)")
#print(p2)
library(gridExtra)
#grid.arrange(p1, p2, nrow=1, main = "log10(Age) isn't a good idea")
grid.arrange(grobs = list(p1, p2), nrow=1, top = "Note: log10(Age) isn't a good idea")
```
### 5. Does log transformation help?
The plot with $(x, \log_{10}(y))$ is an improvement over $(x,y)$
since the points are more symmetric around the regression line
(they are not perfectly symmetric, but they are better).
There might be a better transformation (such as square-root),
or we might have to live with a non-symmetric response.
```{R}
nesarc_sub$TotalCigsSmokedLog10 <- log10(nesarc_sub$TotalCigsSmoked)
# fit the simple linear regression model
lm_TCSL10_Age <- lm(TotalCigsSmokedLog10 ~ Age, data = nesarc_sub)
# use summary() for parameter estimates (slope, intercept) and other summaries
summary(lm_TCSL10_Age)
```
The slope is `r signif(coef(lm_TCSL10_Age)[2],3)`.
Thus, for every year increase in age,
we expect an increase of `r signif(coef(lm_TCSL10_Age)[2],3)` in log 10 of total cigarettes smoked;
that is, we expect a small decrease on the log scale.
Notice that the interpretation is on the scale of the regression.
If the $x$ or $y$ variable is on a different scale (such as $\log_{10}$),
then the interpretation is on that scale, too.
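One way to translate a slope on the $\log_{10}$ scale back to the original scale: a slope $b$ corresponds to a multiplicative change of $10^{b}$ in the response per unit increase in $x$. A sketch with a hypothetical slope value (illustration only, not my fitted estimate):

```{R}
# Hypothetical slope on the log10 scale (illustration only)
b <- -0.004
# multiplicative change in the response per one-year increase in age
10^b                 # slightly below 1, i.e., a small decrease per year
# expressed as a percent change per year
100 * (10^b - 1)
```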
## Week06: Correlation and Categorical contingency tables
### Tuesday ---------
### Correlation
From a previous scatterplot, calculate the correlation and interpret it.
The plot from the previous section and correlation are calculated below.
(Unfortunately, my example is not very interesting.)
```{R}
library(ggplot2)
p <- ggplot(nesarc_sub, aes(x = Age, y = TotalCigsSmoked))
p <- p + geom_jitter(position = position_jitter(width = 0.1), alpha = 1/4)
p <- p + stat_smooth(method = lm)
p <- p + scale_y_log10()
p <- p + labs(title = "log10(Total Cigarettes Smoked) vs Age")
p <- p + labs(y = "log10(TotalCigsSmoked)")
print(p)
```
```{R}
# Calculate correlation
# Because of possible missing values, the "use" command excludes NAs before calculation
cor_TL_A <- cor(nesarc_sub$TotalCigsSmokedLog10, nesarc_sub$Age, use="pairwise.complete.obs")
cor_TL_A
```
__Interpretation of correlation__
The correlation is `r signif(cor_TL_A, 3)`, which is negative (but effectively 0).
This suggests that there is effectively no linear relationship between $\log_{10}$(Total Cigarettes Smoked) and Age.
(See the Rmd file for how I included the correlation value in the text.)
__Height and weight example__
As a second example with a beautiful correlation, here is height and weight.
We should all be so lucky to be analyzing data that looks like this!
```{R}
library(ggplot2)
p <- ggplot(nesarc_sub, aes(x = Height_inches, y = Weight_lb))
p <- p + geom_point(alpha = 1/4)
p <- p + stat_smooth(method = lm)
p <- p + labs(title = "Weight vs Height")
print(p)
```
```{R}
# Calculate correlation
# Because of possible missing values, the "use" command excludes NAs before calculation
cor_H_W <- cor(nesarc_sub$Height_inches, nesarc_sub$Weight_lb, use="pairwise.complete.obs")
cor_H_W
```
__Interpretation of correlation__
The correlation is `r signif(cor_H_W, 3)`, which is positive and moderately strong.
This suggests that there is a decent linear relationship between height and weight.
### Thursday ---------
### Contingency table
From a previous multiple-variable categorical table,
calculate conditional proportions and interpret them.
Research question:
Is there a relationship between depression status (`Depression`) and nicotine dependence (`TobaccoDependence`),
and does it differ by Sex?
(We'll perform the statistical test for this in a couple weeks;
right now we're calculating, plotting, and interpreting the proportions.)
I'll modify a plot from before using code from Worksheet 12.
### TobaccoDependence by Depression and Sex
```{R}
# proportions are of the last group_by() variable within combinations of the other variables
tab_long <-
nesarc_sub %>%
group_by(Sex, Depression, TobaccoDependence) %>%
summarise(Frequency = n()) %>%
mutate(Proportion = Frequency / sum(Frequency)) %>%
ungroup()
tab_long
```
### Interpretation of conditional proportions
The proportions above show that Females without depression have a lower proportion of nicotine dependence
(`r signif(subset(tab_long
, (Sex == "Female")
& (TobaccoDependence == "Nicotine Dependence")
& (Depression == "No Depression")
, select = Proportion), 3)`)
than those with depression
(`r signif(subset(tab_long
, (Sex == "Female")
& (TobaccoDependence == "Nicotine Dependence")
& (Depression == "Yes Depression")
, select = Proportion), 3)`),
and this disparity is even greater for Males
(`r signif(subset(tab_long
, (Sex == "Male")
& (TobaccoDependence == "Nicotine Dependence")
& (Depression == "No Depression")
, select = Proportion), 3)`
vs
`r signif(subset(tab_long
, (Sex == "Male")
& (TobaccoDependence == "Nicotine Dependence")
& (Depression == "Yes Depression")
, select = Proportion), 3)`).
(See the Rmd file for how I included the values from the table in the text.)
```{R, fig.height = 6, fig.width = 8}
# lines are sometimes easier, especially when many categories along the x-axis
library(ggplot2)
p <- ggplot(data = tab_long, aes(x = Depression, y = Proportion, colour = TobaccoDependence))
p <- p + geom_hline(yintercept = c(0, 1), alpha = 1/4)
p <- p + geom_point(aes(shape = TobaccoDependence), size = 3)
p <- p + geom_line(aes(linetype = TobaccoDependence, group = TobaccoDependence), size = 1)
p <- p + scale_y_continuous(limits = c(0, 1))
p <- p + facet_grid(. ~ Sex)
p <- p + theme_bw()
p <- p + theme(legend.position = "bottom")
p <- p + labs(title = "Proportion of Tobacco Dependence by Depression and Sex")
p <- p + labs(y = "Proportion of Tobacco Dependence")
print(p)
```
## Week07: Inference and Parameter estimation (one-sample)
```{R, echo=FALSE}
## This code provides two functions for using the bootstrap to compare whether the
## sampling distribution is close to a normal distribution.
## These functions are defined here then used in later code chunks.
#### Visual comparison of whether sampling distribution is close to Normal via Bootstrap
# a function to compare the bootstrap sampling distribution with
# a normal distribution with mean and SEM estimated from the data
bs_one_samp_dist <- function(dat, N = 1e4) {
n <- length(dat);
# resample from data
sam <- matrix(sample(dat, size = N * n, replace = TRUE), ncol=N);
# draw a histogram of the means
sam_mean <- colMeans(sam);
# save par() settings
old_par <- par(no.readonly = TRUE)
# make smaller margins
par(mfrow=c(2,1), mar=c(3,2,2,1), oma=c(1,1,1,1))
# Histogram overlaid with kernel density curve
hist(dat, freq = FALSE, breaks = 6
, main = "Plot of data with smoothed density curve")
points(density(dat), type = "l")
rug(dat)
hist(sam_mean, freq = FALSE, breaks = 25
, main = "Bootstrap sampling distribution of the mean"
, xlab = paste("Data: n =", n
, ", mean =", signif(mean(dat), digits = 5)
, ", se =", signif(sd(dat)/sqrt(n), digits = 5)))
# overlay a density curve for the sample means
points(density(sam_mean), type = "l")
# overlay a normal distribution, bold and red
x <- seq(min(sam_mean), max(sam_mean), length = 1000)
points(x, dnorm(x, mean = mean(dat), sd = sd(dat)/sqrt(n))
, type = "l", lwd = 2, col = "red")
# place a rug of points under the plot
rug(sam_mean)
# restore par() settings
par(old_par)
}
#### Visual comparison of whether sampling distribution is close to Normal via Bootstrap
# a function to compare the bootstrap sampling distribution
# of the difference of means from two samples with
# a normal distribution with mean and SEM estimated from the data
bs_two_samp_diff_dist <- function(dat1, dat2, N = 1e4) {
n1 <- length(dat1);
n2 <- length(dat2);
# resample from data
sam1 <- matrix(sample(dat1, size = N * n1, replace = TRUE), ncol=N);
sam2 <- matrix(sample(dat2, size = N * n2, replace = TRUE), ncol=N);
# calculate the means and take difference between populations
sam1_mean <- colMeans(sam1);
sam2_mean <- colMeans(sam2);
diff_mean <- sam1_mean - sam2_mean;
# save par() settings
old_par <- par(no.readonly = TRUE)
# make smaller margins
par(mfrow=c(3,1), mar=c(3,2,2,1), oma=c(1,1,1,1))
# Histogram overlaid with kernel density curve
hist(dat1, freq = FALSE, breaks = 6
, main = paste("Sample 1", "\n"
, "n =", n1
, ", mean =", signif(mean(dat1), digits = 5)
, ", sd =", signif(sd(dat1), digits = 5))
, xlim = range(c(dat1, dat2)))
points(density(dat1), type = "l")
rug(dat1)
hist(dat2, freq = FALSE, breaks = 6
, main = paste("Sample 2", "\n"
, "n =", n2
, ", mean =", signif(mean(dat2), digits = 5)
, ", sd =", signif(sd(dat2), digits = 5))
, xlim = range(c(dat1, dat2)))
points(density(dat2), type = "l")
rug(dat2)
hist(diff_mean, freq = FALSE, breaks = 25
, main = paste("Bootstrap sampling distribution of the difference in means", "\n"
, "mean =", signif(mean(diff_mean), digits = 5)
, ", se =", signif(sd(diff_mean), digits = 5)))
# overlay a density curve for the sample means
points(density(diff_mean), type = "l")
# overlay a normal distribution, bold and red
x <- seq(min(diff_mean), max(diff_mean), length = 1000)
points(x, dnorm(x, mean = mean(diff_mean), sd = sd(diff_mean))
, type = "l", lwd = 2, col = "red")
# place a rug of points under the plot
rug(diff_mean)
# restore par() settings
par(old_par)
}
```
### Tuesday ---------
### Numeric variable confidence interval for mean $\mu$
Using a numerical variable,
calculate and interpret a confidence interval for the population mean.
```{R}
# calculate mean and CI
t_result <- t.test(nesarc_sub$TotalCigsSmokedLog10, conf.level = 0.95)
t_result
```
The sample mean of the $\log_{10}($Total Cigs Smoked$)$ is `r signif(t_result$estimate, 3)`.
We are 95% confident that the population mean is between
`r signif(t_result$conf.int[1], 3)` and
`r signif(t_result$conf.int[2], 3)`.
Yes, the interval can be very narrow when the sample size is large.
(Note that this does _not_ mean that there's a 95% probability that the population mean
is in this particular interval. Rather, for 95% of samples, an interval constructed
this way will contain the true population mean. We do not know whether the true mean is in our interval or not.)
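The repeated-sampling idea behind "95% confident" can be demonstrated with a quick simulation (simulated data, not the NESARC variables): over many samples, roughly 95% of the intervals cover the true mean.

```{R}
# Sketch: coverage of the 95% t-interval under repeated sampling
set.seed(1)
mu <- 10
covered <-
  replicate(1000, {
    ci <- t.test(rnorm(50, mean = mu, sd = 3), conf.level = 0.95)$conf.int
    (ci[1] <= mu) & (mu <= ci[2])
  })
mean(covered)  # close to 0.95
```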
```{R, fig.width = 7, fig.height = 7}
# plot data and create table
library(ggplot2)
p <- ggplot(nesarc_sub, aes(x = TotalCigsSmokedLog10))
p <- p + geom_histogram(binwidth = 0.2, alpha = 1)
p <- p + geom_rug(alpha = 1/8)
# est
p <- p + geom_vline(aes(xintercept = as.numeric(t_result$estimate)), colour = "blue", size = 1)
p <- p + geom_rect(aes(xmin = t_result$conf.int[1], xmax = t_result$conf.int[2], ymin = -20, ymax = 0)
, fill = "blue", alpha = 1)
p <- p + labs(title = "Log 10 Total Cigs Smoked\nblue = est +- 95% CI")
print(p)
# checking model assumptions, some slight skewness observed
bs_one_samp_dist(nesarc_sub %>% drop_na(TotalCigsSmokedLog10) %>% pull(TotalCigsSmokedLog10))
```
### Thursday ---------
### Categorical variable confidence interval for proportion $p$
Using a two-level categorical variable,
calculate and interpret a confidence interval for the population proportion.
Let's estimate the population proportion of people with Depression
for the subset of people we've sampled from.
(Because I have at least 5 "successes" and "failures",
and a large sample size ($n=`r nrow(nesarc_sub)`$),
I know that `prop.test()` and `binom.test()` will give essentially the same result.
Try it for your data. They will only differ with quite small samples,
especially if the proportions are close to 0 or 1.)
```{R}
# proportions are of the last group_by() variable within combinations of the other variables
tab_dep_long <-
nesarc_sub %>%
group_by(Depression) %>%
summarise(Frequency = n()) %>%
mutate(Proportion = Frequency / sum(Frequency)) %>%
ungroup()
tab_dep_long
# prop.test() is an asymptotic (approximate) test for a binomial random variable
p_summary <-
prop.test(
x = tab_dep_long %>% filter(Depression == "Yes Depression") %>% pull(Frequency)
, n = sum(tab_dep_long$Frequency)
, conf.level = 0.95
)
p_summary
```
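To see how close the approximate and exact intervals are, here is a sketch with hypothetical counts (illustration only, not the NESARC frequencies):

```{R}
# Hypothetical counts for illustration only
x <- 300
n <- 1700
# asymptotic interval from prop.test()
prop.test(x, n, conf.level = 0.95)$conf.int
# exact (Clopper-Pearson) interval from binom.test()
binom.test(x, n, conf.level = 0.95)$conf.int
```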
Note how to add error bars using `geom_errorbar()`.
First determine the CI bounds from the `prop.test()` previously,
then set those as limits.
```{R, fig.height = 2}
# get names of objects in p_summary
names(p_summary)
# here's the confidence interval bounds (the attribute tells us this is a 95% interval)
p_summary$conf.int
p_summary$conf.int[1]
p_summary$conf.int[2]
library(ggplot2)
p <- ggplot(data = tab_dep_long %>% filter(Depression == "Yes Depression"), aes(x = Depression, y = Proportion))
p <- p + geom_hline(yintercept = c(0, 1), alpha = 1/4)
p <- p + geom_bar(stat = "identity")
p <- p + geom_errorbar(aes(ymin = p_summary$conf.int[1], ymax = p_summary$conf.int[2]), width = 0.25)
p <- p + scale_y_continuous(limits = c(0, 1))
# swap axes to take less vertical space
p <- p + coord_flip()
print(p)
```
The sample estimate of the population proportion of people with depression
is `r signif(p_summary$estimate, 3)`.
We are 95% confident that the population proportion is between
`r signif(p_summary$conf.int[1], 3)` and
`r signif(p_summary$conf.int[2], 3)`.
Yes, the interval can be very narrow when the sample size is large.
(Note that this does _not_ mean that there's a 95% probability that the population proportion
is in this particular interval. Rather, for 95% of samples, an interval constructed
this way will contain the true population proportion. We do not know whether the true proportion is in our interval or not.)
## Week08: Hypothesis testing (one- and two-sample)
### Mechanics of a hypothesis test (review)
1. Set up the __null and alternative hypotheses__ in words and notation.
* In words: "The population mean for [what is being studied] is different from [value of $\mu_0$]."
(Note that the statement in words is in terms of the alternative hypothesis.)
* In notation: $H_0: \mu=\mu_0$ versus $H_A: \mu \ne \mu_0$
(where $\mu_0$ is specified by the context of the problem).
2. Choose the __significance level__ of the test, such as $\alpha=0.05$.
3. Compute the __test statistic__, such as $t_{s} = \frac{\bar{Y}-\mu_0}{SE_{\bar{Y}}}$, where $SE_{\bar{Y}}=s/\sqrt{n}$ is the standard error.
4. Determine the __tail(s)__ of the sampling distribution where the __$p$-value__ from the test statistic will be calculated
(for example, both tails, right tail, or left tail).
(Historically, we would compare the observed test statistic, $t_{s}$,
with the __critical value__ $t_{\textrm{crit}}=t_{\alpha/2}$
in the direction of the alternative hypothesis from the
$t$-distribution table with degrees of freedom $df = n-1$.)
5. State the __conclusion__ in terms of the problem.
* Reject $H_0$ in favor of $H_A$ if $p\textrm{-value} < \alpha$.
* Fail to reject $H_0$ if $p\textrm{-value} \ge \alpha$.
(Note: We DO NOT _accept_ $H_0$.)
6. __Check assumptions__ of the test (for now we skip this).
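Step 3 above can be sketched directly (simulated data, not the NESARC variables): compute $t_{s}$ by hand from its formula and compare with `t.test()`.

```{R}
# Sketch: the one-sample t statistic computed by hand,
# t = (ybar - mu0) / (s / sqrt(n))
set.seed(7)
y <- rnorm(40, mean = 5.3, sd = 2)
mu0 <- 5
t_by_hand <- (mean(y) - mu0) / (sd(y) / sqrt(length(y)))
c(by_hand = t_by_hand, t_test = unname(t.test(y, mu = mu0)$statistic))
```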
### What do we do about "significance"?
Adapted from **[Significance Magazine](https://rss.onlinelibrary.wiley.com/doi/10.1111/j.1740-9713.2019.01295.x)**.
Recent calls have been made to abandon the term "statistical significance".
The American Statistical Association (ASA) issued its
[statement](https://www.tandfonline.com/doi/pdf/10.1080/00031305.2016.1154108) and
[recommendation](https://www.tandfonline.com/doi/full/10.1080/00031305.2019.1583913)
on p-values (see the [special issue of p-values](https://www.tandfonline.com/toc/utas20/73/sup1?nav=tocList) for more).
In summary, the problem of "significance" is one of misuse, misunderstanding, and misinterpretation.
The recommendation in this class is that it is no longer sufficient to say that a
result is "statistically significant" or "non-significant" depending on whether a p-value is less than a threshold.
Instead, we will be looking for wording as in the following paragraph.
"The difference between the two groups turns out to be small (8%), while the
probability ($p$) of observing a result at least as extreme as this under the
null hypothesis of no difference between the two groups is $p = 0.003$ (that is,
0.3%). This p-value is statistically significant as it is below our pre-defined
threshold ($p < 0.05$). However, the p-value tells us only that the 8% difference
between the two groups is somewhat unlikely given our hypothesis and null model's
assumptions. More research is required, or other considerations may be needed,
to conclude that the difference is of practical importance and reproducible."
### Tuesday ---------
### Two-sample $t$-test: `TotalCigsSmoked` by `Depression`
1. Using a numerical response variable and a two-level categorical variable (or a categorical variable you can reduce to two levels),
specify a two-sample $t$-test associated with your research questions.
* (2 p) Specify the hypotheses in words and notation (either one- or two-sided test),
* (0 p) use `t.test()` to calculate the mean, test statistic, and p-value,
* (3 p) state the significance level, test statistic, and p-value, and
* (2 p) state the conclusion in the context of the problem.
* (1 p) Given your conclusion, could you have committed a Type-I or Type-II error?
* (2 p) Provide an appropriate plot of the data and sample estimates in a well-labelled plot.
```{R}
## If we create a summary data.frame with a similar structure as our data, then we
## can annotate our plot with those summaries.
est_mean_TCS_D <-
nesarc_sub %>%
group_by(Depression) %>%
summarise(TotalCigsSmoked = mean(TotalCigsSmoked, na.rm = TRUE)) %>%
ungroup()
library(ggplot2)
p <- ggplot(nesarc_sub, aes(x = Depression, y = TotalCigsSmoked))
p <- p + geom_boxplot(width = 0.5, alpha = 0.5)
p <- p + geom_jitter(position = position_jitter(width = 0.1), alpha = 1/4)
# diamond at mean for each group
p <- p + stat_summary(fun = mean, geom = "point", shape = 18, size = 6,
colour = "red", alpha = 0.8)
p <- p + labs(title = "Total Cigarettes Smoked by Depression")
print(p)
# histogram using ggplot
p <- ggplot(nesarc_sub, aes(x = TotalCigsSmoked))
p <- p + geom_histogram(binwidth = 40)
p <- p + geom_rug()
p <- p + geom_vline(data = est_mean_TCS_D, aes(xintercept = TotalCigsSmoked), colour = "red")
p <- p + facet_grid(Depression ~ .)
print(p)
```
```{R}
t_summary_TCS_D <- t.test(TotalCigsSmoked ~ Depression, nesarc_sub)
t_summary_TCS_D
```
1. "Is the population mean total cigarettes smoked different for those with depression or not?"
* $H_0: \mu_{D} = \mu_{ND}$ versus $H_A: \mu_{D} \ne \mu_{ND}$
2. Let $\alpha=0.05$, the significance level of the test and the Type-I error probability if the null hypothesis is true.
3. $t_{s} = `r signif(t_summary_TCS_D$statistic, 4)`$.
4. $p=`r signif(t_summary_TCS_D$p.value, 3)`$, this is the observed significance of the test.
5. Because $p=`r signif(t_summary_TCS_D$p.value, 3)` > 0.05$,
we have insufficient evidence to reject $H_0$;
we cannot conclude that the mean total cigarettes smoked differs by depression status.
Because we failed to reject $H_0$, I could have committed a Type-II error
(that there is a difference but we didn't detect it).
### Thursday -------- Fall Break
Enjoy your Fall Break!
## Week09: ANOVA and Assessing Assumptions
### Tuesday ---------
Do not add to this document.
Turn it in as its own document.
### Thursday ---------
### ANOVA: Total cigarettes smoked by Ethnicity
#### Transform the response variable to satisfy assumptions
I've decided to try to find a better transformation of the total cigarettes smoked,
`TotalCigsSmoked`,
to improve the symmetry (and normality) of the distributions
(the original scale was right skewed, and the $\log_{10}$ was left skewed).
The __Box-Cox transformation__ is a power transformation where we find a best power $\lambda$
for $y^{\lambda}$ such that the residuals from the model are most normal.
We're fitting ANOVA, so we first fit the model,
then have the `boxcox()` function estimate the power that gives us residuals that
are most "normal".
Finally, we'll create a new variable $y^{\lambda}$ for a sensible value of $\lambda$.
```{R}
# Box-Cox transformation finds the best power transformation of the response
# to make the residuals most "normal".
# A transformation close to this will help us meet our normality assumption.
# It is often preferred to choose a transformation that is a "simple" exponent,
# such as 0.5, rather than the exact estimate of the best, such as 0.4242424.
library(car)
aov_summary <- aov(TotalCigsSmoked ~ Ethnicity, data = nesarc_sub)
bc_summary <- boxCox(aov_summary
, lambda = seq(-2, 2, 0.1)
, plotit = TRUE)
opt_lambda <- bc_summary$x[which.max(bc_summary$y)]
opt_lambda
```
The best power transformation value is $\lambda = `r signif(opt_lambda, 3)`$,
so I will choose 0.5 (which is a square root).
When I reassess the power after taking the square root,
the power $\lambda = 1$ is close to the 95% interval,
so I will use this as my transformation to ease interpretation
and because the plots we see below of the data on the square-root scale
are close to normal.
```{R}
## Choose a square root transformation.
nesarc_sub$TotalCigsSmokedSqrt <- sqrt(nesarc_sub$TotalCigsSmoked)
library(car)
aov_summary <- aov(TotalCigsSmokedSqrt ~ Ethnicity, data = nesarc_sub)
bc_summary <- boxCox(aov_summary
, lambda = seq(-2, 2, 0.1)
, plotit = TRUE
)
opt_lambda <- bc_summary$x[which.max(bc_summary$y)]
opt_lambda
```
#### ANOVA Hypothesis test
A table of the ethnic groups to compare is below.
```{R}
summary(nesarc_sub$Ethnicity)
```
Let $\mu_j$ = the population mean square-root total cigarettes smoked for the five ethnicities identified in the dataset:
Caucasian,
African American,
Native American,
Asian, and
Hispanic,
numbered $(j = 1, 2, 3, 4, 5)$, respectively.
We wish to test $H_0: \mu_1=\cdots=\mu_5$ against $H_A: \textrm{not } H_0$ (at least one pair of means differ).
Plot the data in a way that compares the means.
Error bars are 95% confidence intervals of the mean.
```{R}
# Plot the data using ggplot
library(ggplot2)
p <- ggplot(nesarc_sub, aes(x = Ethnicity, y = TotalCigsSmokedSqrt))
# plot a reference line for the global mean (assuming no groups)
p <- p + geom_hline(yintercept = mean(nesarc_sub$TotalCigsSmokedSqrt),
colour = "black", linetype = "dashed", size = 0.3, alpha = 0.5)
# violin and boxplot to show the shape of the distributions
p <- p + geom_violin(width = 0.5, alpha = 0.25)
p <- p + geom_boxplot(width = 0.25, alpha = 0.25)
# points for observed data
p <- p + geom_point(position = position_jitter(w = 0.05, h = 0), alpha = 0.2)
# diamond at mean for each group
p <- p + stat_summary(fun = mean, geom = "point", shape = 18, size = 4,
colour = "red", alpha = 0.8)
# confidence limits based on normal distribution
p <- p + stat_summary(fun.data = "mean_cl_normal", geom = "errorbar",
width = .2, colour = "red", alpha = 0.8)
p <- p + labs(title = "Total cigarettes smoked by Ethnicity")
p <- p + ylab("Square-root of Total Cigs Smoked")
print(p)
```
__Hypothesis test__
1. Set up the __null and alternative hypotheses__ in words and notation.
* In words: "The population mean square-root total cigarettes smoked is different between ethnicities."
* In notation: $H_0: \mu_1=\cdots=\mu_5$ versus $H_A: \textrm{not } H_0$ (at least one pair of means differ).
2. Let the significance level of the test be $\alpha=0.05$.
3. Compute the __test statistic__.
```{R}
aov_summary <- aov(TotalCigsSmokedSqrt ~ Ethnicity, data = nesarc_sub)
summary(aov_summary)
```
The $F$-statistic for the ANOVA is $F = `r signif(unlist(summary(aov_summary))["F value1"], 3)`$.
4. Compute the __$p$-value__ from the test statistic.
The p-value for testing the null hypothesis is
$p = `r signif(unlist(summary(aov_summary))["Pr(>F)1"], 3)`$.
5. State the __conclusion__ in terms of the problem.
Because $p = `r signif(unlist(summary(aov_summary))["Pr(>F)1"], 3)` < 0.05$,
we reject $H_0$ in favor of $H_A$, concluding that
the mean square-root total cigarettes smoked is different between ethnicities.
#### Check assumptions
6. __Check assumptions__ of the test.
a. Residuals are normal
b. Populations have equal variances.
* Check whether residuals are normal.
- Plot the residuals and assess whether they appear normal.
```{R}
# Plot the data using ggplot
df_res <- data.frame(res = aov_summary$residuals)
library(ggplot2)
p <- ggplot(df_res, aes(x = res))
p <- p + geom_histogram(aes(y = ..density..), binwidth = 4)
p <- p + geom_density(colour = "blue")
p <- p + geom_rug()
p <- p + stat_function(fun = dnorm, colour = "red", args = list(mean = mean(df_res$res), sd = sd(df_res$res)))
p <- p + labs(title = "ANOVA Residuals")
print(p)
```
There are a few extreme observations and a large peak of points in the center.
The residuals are not symmetric, but they are unimodal.
- Plot the residuals versus the normal quantiles.
If the residuals are normal, then they will fall on the center line and
very few will be outside the error bands.
```{R}
# QQ plot
par(mfrow=c(1,1))
library(car)
qqPlot(aov_summary$residuals, las = 1, id = list(n = 0, cex = 1), lwd = 1, main="QQ Plot")
```
There are deviations from normality, and I think this largely has to do with rounding the number of cigarettes smoked.
Those lowest values are a group of very small numbers (1 cig - 10 cig in a month),
and because those are bounded at 0,
we're getting that the residuals are larger than expected
under the assumption that they are normal.
The points in the middle under the red line are a group of points of regular smokers who smoke a moderate amount.
- A formal test of normality on the residuals tests the hypothesis
$H_0:$ The distribution is Normal vs
$H_A:$ The distribution is not Normal.
We can test the distribution of the residuals.
Three formal tests for normality are reported below.
```{R}
shapiro.test(aov_summary$residuals)
library(nortest)
ad.test(aov_summary$residuals)
cvm.test(aov_summary$residuals)
```
All three tests reject normality.
In summary, both the plots and tests for normality suggest these residuals are not normal.
The original plots do indicate that for each ethnic group
the square-root transformation has made the data roughly symmetric, but not normal.
We have a few outliers we may want to worry about (remove and see if the results change).
Ultimately, we will probably want to use a nonparametric method to test the ANOVA hypothesis,
a method that doesn't require the normality assumption.
* Check whether populations have equal variances.
- Look at the numerical summaries below.
```{R}
# calculate summaries
nesarc_sub_summary <-
nesarc_sub %>%
select(Ethnicity, TotalCigsSmokedSqrt) %>%
group_by(Ethnicity) %>%
summarise(
m = mean(TotalCigsSmokedSqrt, na.rm = TRUE)
, s = sd(TotalCigsSmokedSqrt, na.rm = TRUE)
, n = length(TotalCigsSmokedSqrt)
) %>%
ungroup()
nesarc_sub_summary
```
The standard deviations are between `r signif(min(nesarc_sub_summary$s), 3)`
and `r signif(max(nesarc_sub_summary$s), 3)`,
which is not a wide range.
- Formal tests for equal variances.
We can test whether the variances are equal between our five ethnic groups.
This is similar to the ANOVA hypothesis, but instead of testing means we're testing variances.
$H_0: \sigma^2_1=\cdots=\sigma^2_5$
versus $H_A: \textrm{not } H_0$ (at least one pair of variances differ).
```{R}
## Test equal variance
# assumes populations are normal
bartlett.test(TotalCigsSmokedSqrt ~ Ethnicity, data = nesarc_sub)
# does not assume normality, requires car package
library(car)
leveneTest(TotalCigsSmokedSqrt ~ Ethnicity, data = nesarc_sub)
# nonparametric test
fligner.test(TotalCigsSmokedSqrt ~ Ethnicity, data = nesarc_sub)
```
Interpret the result of the appropriate test.
If normality was reasonable then use Bartlett, otherwise use Levene.
Both the Levene and Fligner tests reject equal variance.
However, we have a large sample size, so we are "over powered"
to identify small deviations from equality of variance.
While this difference probably wouldn't violate our assumption so badly as to
invalidate the test, we will still want to use the nonparametric test next week, instead.
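A natural nonparametric choice is the Kruskal-Wallis rank-sum test, the rank-based analogue of one-way ANOVA; on my data it would be `kruskal.test(TotalCigsSmokedSqrt ~ Ethnicity, data = nesarc_sub)`. A self-contained sketch on simulated groups:

```{R}
# Sketch: Kruskal-Wallis test on simulated groups (illustration only)
set.seed(3)
dat <- data.frame(
  y = c(rnorm(30, mean = 10), rnorm(30, mean = 12), rnorm(30, mean = 10)),
  g = rep(c("A", "B", "C"), each = 30)
)
kruskal.test(y ~ g, data = dat)
```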
#### Post Hoc pairwise comparison tests
7. If the ANOVA null hypothesis was rejected, then perform follow-up Post Hoc
pairwise comparison tests to determine which pairs of means are different.
(We'll continue assuming our assumptions are met,
for the sake of completing the exercise.
We understand that we'll want to use a nonparametric test, instead.)
There are several multiple comparison methods described in the notes.
Let's use Tukey's Honest Significant Difference (HSD) here to test which pairs of
populations differ.
```{R}
# sort the Ethnicity by means
arrange(nesarc_sub_summary, m)
# Tukey 95% Individual p-values
TukeyHSD(aov_summary)
```
Summarize results by ordering the means and grouping pairs that do not differ (see notes for examples).
I find it easiest to look for the large p-values (which pairs are not significantly different)
and group those first, then look at the significant pairs and make sure none are connected.
Here are two ways to do this
(vertical is useful when there are many groups,
horizontal is easier to read).
```
(numerical summaries are a nice addition)
NotSigDiff Mean Ethnicity m s n
| 12.7 Hispanic 12.7 7.64 335
|| 14.5 Asian 14.5 6.94 211
|| 14.5 African American 14.5 5.98 58
|| 15.9 Native American 15.9 7.64 42
| 17.4 Caucasian 17.4 8.16 1060
Ethnicity: Hispanic Asian African American Native American Caucasian
Mean: 12.7 14.5 14.5 15.9 17.4
-------------------------------------
--------------------------------------------
----------------------------
```
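To build the grouping display by hand, the Tukey HSD output can be pulled into a data.frame and sorted so the non-significant pairs (large adjusted p-values) are listed first. A sketch on simulated data (the group labels, means, and sample sizes are made up, not the NESARC groups):

```r
# Sketch: sort Tukey HSD pairwise results by adjusted p-value, descending,
# so pairs that do not differ come first for grouping.
set.seed(1)
dat <- data.frame(
    g = rep(c("A", "B", "C"), each = 30)
  , y = c(rnorm(30, mean = 10), rnorm(30, mean = 10.2), rnorm(30, mean = 12))
  )
hsd <- TukeyHSD(aov(y ~ g, data = dat))
tab <- as.data.frame(hsd$g)
# column 4 is the adjusted p-value
tab[order(tab[, 4], decreasing = TRUE), ]
```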
## Week10: Nonparametric methods and Binomial and multinomial proportion tests
### Tuesday ---------
### Thursday ---------
### Multinomial goodness-of-fit: `Ethnicity` at US census proportions
__Question:__
Did NESARC sample at random from the US population so that ethnicities are
proportionally represented?
To answer this question, I need to go back to the original NESARC dataset.
If I use my subset, I may have removed observations that will bias my answer to this question.
#### Observed
NB. All the results in this section depend on `nesarc_eth` and `US_eth`,
the frequency tables of the full NESARC dataset and the 2010 US Census.
```{R}
# create a temporary Ethnicity factor variable from the original dataset
nesarc_eth <-
NESARC %>%
select(ETHRACE2A) %>%
mutate(
Ethnicity = factor(ETHRACE2A
, labels = c("Caucasian"
, "African American"
, "Native American"
, "Asian"
, "Hispanic"
)
)
) %>%
select(Ethnicity) %>%
group_by(Ethnicity) %>%
summarise(obs = n()) %>%
mutate(obs_prop = obs / sum(obs)) %>%
ungroup()
nesarc_eth
```
#### Expected
The US [Ethnicity](https://en.wikipedia.org/wiki/Demographics_of_the_United_States#Race_and_ethnicity) proportions
are listed below with numbered categories consistent with NESARC.
```
ETHRACE2A (Ethnicity)
IMPUTED RACE/ETHNICITY
Nominal
24507 1. White, Not Hispanic or Latino
8245 2. Black, Not Hispanic or Latino
701 3. American Indian/Alaska Native, Not Hispanic or Latino
1332 4. Asian/Native Hawaiian/Pacific Islander, Not Hispanic or Latino
8308 5. Hispanic or Latino
https://en.wikipedia.org/wiki/Demographics_of_the_United_States#Race_and_ethnicity
The U.S. population's distribution by race and ethnicity in 2010 was as follows:
1. Non-Hispanic White 196,817,552 63.7 %
2. Non-Hispanic Black or African American 37,685,848 12.2 %
3. Non-Hispanic American Indian or Alaska Native 2,247,098 0.7 %
4. Non-Hispanic Asian 14,465,124 4.7 %
4. Non-Hispanic Native Hawaiian or other Pacific Islander 481,576 0.2 %
4. Non-Hispanic some other race 604,265 0.2 %
4. Non-Hispanic two or more races 5,966,481 1.9 %
5. Hispanic or Latino 50,477,594 16.4 %
```
Expected numbers and
combining observed and expected into a table.
```{R}
US_eth <- c(196817552
, 37685848
, 2247098
, 14465124 + 481576 + 604265 + 5966481
, 50477594
)
nesarc_eth <-
nesarc_eth %>%
mutate(
exp_prop = US_eth / sum(US_eth)
# Expected number of eth
, exp = sum(obs) * exp_prop
)
nesarc_eth
```
### Perform $\chi^2$ Goodness-of-fit test
1. Set up the __null and alternative hypotheses__ in words and notation.
* In words: "The NESARC and US Census proportions are equal for people of each Ethnicity."
* In notation:
$H_0:
p_{1} = `r round(nesarc_eth$exp_prop[1], 3)`,
p_{2} = `r round(nesarc_eth$exp_prop[2], 3)`,
p_{3} = `r round(nesarc_eth$exp_prop[3], 3)`,
p_{4} = `r round(nesarc_eth$exp_prop[4], 3)`, \textrm{and}
p_{5} = `r round(nesarc_eth$exp_prop[5], 3)`$
versus $H_A: \textrm{not } H_0$
2. Choose the __significance level__ of the test, such as $\alpha=0.05$.
```{R}
# calculate chi-square goodness-of-fit test
x_summary <- chisq.test(nesarc_eth$obs, correct = FALSE, p = nesarc_eth$exp_prop)
# print result of test
x_summary
```
3. Compute the __test statistic__.
The test statistic is $X^2 = `r signif(x_summary$statistic, 5)`$.
4. Compute the __$p$-value__.
The p-value $= `r signif(x_summary$p.value, 3)`$.
5. (1 p) State the __conclusion__ in terms of the problem.
* We reject $H_0$ in favor of $H_A$ if $p\textrm{-value} < \alpha$.
* Fail to reject $H_0$ if $p\textrm{-value} \ge \alpha$.
(Note: We DO NOT _accept_ $H_0$.)
Because the p-value $= `r signif(x_summary$p.value, 3)` < 0.05 = \alpha$,
we reject $H_0$ in favor of $H_A$
concluding that the proportions of ethnicities sampled in NESARC differs
from the US Census.
6. (1 p) __Check assumptions__ of the test (see notes).
Because there is an expected frequency of at least 5 for each category,
the assumptions are met for this test.
#### Chi-sq statistic helps us understand the deviations
```{R}
# use output in x_summary and create table
x_table <-
data.frame(
Ethnicity = nesarc_eth$Ethnicity
, obs = x_summary$observed
, exp = x_summary$expected
, res = x_summary$residuals
, chisq = x_summary$residuals^2
, stdres = x_summary$stdres
)
x_table
```
Plot observed vs expected values to help identify `Ethnicity` groups that deviate the most.
Plot contribution to chi-square values to help identify `Ethnicity` groups that deviate the most.
The term "Contribution to Chi-Square" (`chisq`) refers to the values of
$\frac{(O-E)^2}{E}$ for each category.
$\chi^2_{\textrm{s}}$ is the sum of those contributions.
These plots suggest that African Americans and Native Americans were oversampled
and Asians were undersampled.
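The contribution-to-chi-square arithmetic is easy to verify by hand; a sketch with small made-up counts and proportions (not the NESARC values):

```r
# Sketch: "contribution to chi-square" computed by hand,
# contrib_i = (O_i - E_i)^2 / E_i, and their sum is the test statistic.
obs     <- c(30, 50, 20)
p0      <- c(0.25, 0.50, 0.25)
expd    <- sum(obs) * p0          # expected counts: 25, 50, 25
contrib <- (obs - expd)^2 / expd  # contributions: 1, 0, 1
chisq_s <- sum(contrib)
chisq_s                           # 2; matches chisq.test(obs, p = p0)$statistic
```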
```{R, fig.width = 6, fig.height = 4}
library(tidyr)
x_table_obsexp <-
x_table %>%
gather(
# list the source columns to combine into two columns
obs
, exp
# key is the name of the destination column identifying each
# original column that the measurement came from
, key = "stat"
# value is the column name for values in table
, value = "value"
)
# Observed vs Expected counts
library(ggplot2)
p1 <- ggplot(x_table_obsexp, aes(x = Ethnicity, fill = stat, weight=value))
p1 <- p1 + geom_bar(position="dodge")
p1 <- p1 + labs(title = "Observed and Expected frequencies")
p1 <- p1 + xlab("Ethnicity category")
print(p1)
# Contribution to chi-sq
# pull out only the Ethnicity and chisq columns
x_table_chisq <-
x_table %>%
select(
Ethnicity
, chisq
) %>%
# reorder the Ethnicity categories to be descending relative to the chisq statistic
arrange(
-chisq
)
x_table_chisq
# reorder Ethnicity descending by chisq statistic
p2 <- ggplot(x_table_chisq, aes(x = reorder(Ethnicity, -chisq), weight = chisq))
p2 <- p2 + geom_bar()
p2 <- p2 + labs(title = "Contribution to Chi-sq statistic")
p2 <- p2 + xlab("Sorted Ethnicity")
p2 <- p2 + ylab("Chi-sq")
print(p2)
#library(gridExtra)
#grid.arrange(grobs = list(p1, p2), nrow=1, top = "")
```
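The `tidyr::gather()` call above is now superseded; the same reshape can be written with the current `pivot_longer()` interface. A sketch on a tiny made-up table (not the `x_table` values):

```r
# Sketch: reshape wide obs/exp columns to long with pivot_longer(),
# the current replacement for gather()
library(tidyr)
df <- data.frame(Ethnicity = c("A", "B"), obs = c(10, 20), exp = c(12, 18))
pivot_longer(df, cols = c(obs, exp), names_to = "stat", values_to = "value")
```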
#### Multiple Comparisons
The goodness-of-fit test suggests that at least one observed `Ethnicity` category
proportion differs from its expected proportion.
A reasonable next step in the analysis would be to __separately__ test
the five hypotheses to see which `Ethnicity` categories led to rejecting
the goodness-of-fit null hypothesis:
* $H_0: p_{1} = `r round(nesarc_eth$exp_prop[1], 3)`$
* $H_0: p_{2} = `r round(nesarc_eth$exp_prop[2], 3)`$
* $H_0: p_{3} = `r round(nesarc_eth$exp_prop[3], 3)`$
* $H_0: p_{4} = `r round(nesarc_eth$exp_prop[4], 3)`$
* $H_0: p_{5} = `r round(nesarc_eth$exp_prop[5], 3)`$
A Bonferroni comparison with a Family Error Rate $\alpha=0.05$
sets each test's significance level to $\alpha = 0.05 / 5 = `r signif(0.05/5, 3)`$,
which corresponds to `r signif(100 * (1 - 0.05/5), 4)`% two-sided CIs
for the individual proportions.
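The Bonferroni arithmetic and a single adjusted-level exact test can be sketched as follows; the counts and the census proportion here are made up for illustration, not the NESARC values.

```r
# Sketch: one exact binomial test at the Bonferroni-adjusted level
k  <- 5                              # number of tests in the family
bt <- binom.test(x = 240, n = 1000, p = 0.20
               , alternative = "two.sided"
               , conf.level = 1 - 0.05 / k)  # 99% CI
bt$p.value                           # compare to alpha = 0.05 / k = 0.01
attr(bt$conf.int, "conf.level")      # 0.99
```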
Below I perform exact binomial tests of proportion for each of the five `Ethnicity` categories
at the Bonferroni-adjusted significance level.
I save the p-values and confidence intervals in a table along with the
observed and census proportions in order to display the table below.
```{R}
b_sum <-
nesarc_eth %>%
group_by(Ethnicity) %>%
do(
b_out = binom.test(.$obs, sum(nesarc_eth$obs), p = .$exp_prop, alternative = "two.sided", conf.level = 1-0.05 / nrow(.))
) %>%
summarise(
p.value = b_out$p.value
, CI.lower = b_out$conf.int[1]
, CI.upper = b_out$conf.int[2]
) %>%
ungroup()
b_sum$Ethnicity <- nesarc_eth$Ethnicity
b_sum$Observed <- x_table$obs / sum(x_table$obs)
b_sum$exp_prop <- nesarc_eth$exp_prop
b_sum
```
All five hypothesis tests have p-values less than $\alpha = `r signif(0.05/5, 3)`$,
thus they all reject $H_0$,
concluding that each Ethnicity was not sampled
in perfect proportion with the 2010 US Census.
In a survey like this, it is typically desirable to oversample smaller groups,
such as Asians and Native Americans,
and this was not done for Asians.
```{R}
# Observed vs Expected proportions
library(ggplot2)
p <- ggplot(b_sum, aes(x = Ethnicity, y = Observed))
# observed
p <- p + geom_point(size = 4)
p <- p + geom_line(aes(group = 1))
p <- p + geom_errorbar(aes(ymin = CI.lower, ymax = CI.upper), width = 0.1)
# expected
p <- p + geom_point(aes(y = exp_prop), colour = "red", shape = "x", size = 6)
p <- p + geom_line(aes(y = exp_prop, group = 1), colour = "red")
p <- p + labs(title = "NESARC (w/99% CI) and 2010 US Census (red x) proportions")
p <- p + expand_limits(y = 0)
print(p)
```
## Week11: Two-way categorical tables and Simple linear regression, inference
### Tuesday ---------
### Two-way categorical analysis.
Using two categorical variables with two to five levels each,
specify a hypothesis test for homogeneity of proportions associated with your research questions.
#### (1 p) Specify the hypotheses in words and notation.
* In words: $H_0$: "There is no association between TobaccoDependence and Ethnicity" versus $H_A$: "There is an association between TobaccoDependence and Ethnicity."
* In notation: $H_0: p(i \textrm{ and } j) = p(i)p(j)$ versus $H_A: p(i \textrm{ and } j) \ne p(i)p(j)$, for all row categories $i$ and column categories $j$.
#### (1 p) State the conclusion of the test in the context of the problem.
```{R}
# Tabulate by two categorical variables:
tab_E_TD <- xtabs(~ Ethnicity + TobaccoDependence, data = nesarc_sub)
tab_E_TD
# column proportions
prop.table(tab_E_TD, margin = 2)
```
```{R}
chisq_E_TD <- chisq.test(tab_E_TD, correct=FALSE)
chisq_E_TD
```
The test statistic is $X^2 = `r signif(chisq_E_TD$statistic, 3)`$.
Because the p-value $= `r signif(chisq_E_TD$p.value, 3)` < 0.05$
we reject the null hypothesis
concluding that there is an association between
TobaccoDependence and Ethnicity.
```{R}
# table of expected frequencies:
chisq_E_TD$expected
# smallest expected frequency:
min(chisq_E_TD$expected)
```
The model assumptions are met since the expected count for each cell is at least 5.
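The expected frequencies above come from the independence formula $E_{ij} = R_i C_j / n$ (row total times column total over the grand total); a sketch with a small made-up 2x2 table (not the NESARC table):

```r
# Sketch: expected counts under independence, E_ij = (row total)(col total)/n
tab  <- matrix(c(30, 20, 10, 40), nrow = 2)
expd <- outer(rowSums(tab), colSums(tab)) / sum(tab)
expd  # matches chisq.test(tab, correct = FALSE)$expected
```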
#### (1 p) Plot a mosaic plot of the data and Pearson residuals.
```{R}
# The Pearson residuals
chisq_E_TD$residuals
# The sum of the squared residuals is the chi-squared statistic:
#chisq_E_TD$residuals^2
#sum(chisq_E_TD$residuals^2)
```
```{R, fig.width = 6, fig.height = 6}
# mosaic plot
library(vcd)
# this layout gives us the interpretation we want:
mosaic(tab_E_TD, shade=TRUE, legend=TRUE)
#mosaic(~ Ethnicity + TobaccoDependence, data = nesarc_sub, shade=TRUE, legend=TRUE, direction = "v")
```
#### (1 p) Interpret the mosaic plot with reference to the Pearson residuals.
The primary cause of rejecting $H_0$ is that
Hispanics have less nicotine dependence than expected.
### Thursday ---------
### Simple linear regression.
Select two numerical variables.
I don't have great variables in my dataset related to my research question for this.
I select `TotalCigsSmokedSqrt` and `DaysSmoke`.
#### (1 p) Plot the data and, if required, transform the variables so a roughly linear relationship is observed. All interpretations will be done on this scale of the variables.
See below.
#### (0 p) Fit the simple linear regression model.
```{R}
# fit model
lm_fit <- lm(TotalCigsSmokedSqrt ~ DaysSmoke, data = nesarc_sub)
coef(lm_fit)
```
Note that prediction intervals should be bounded below at 0.
```{R}
library(ggplot2)
p <- ggplot(nesarc_sub, aes(x = DaysSmoke, y = TotalCigsSmokedSqrt))
p <- p + geom_vline(xintercept = 0, alpha = 0.25)
p <- p + geom_hline(yintercept = 0, alpha = 0.25)
# prediction bands
p <- p + geom_ribbon(aes(ymin = predict(lm_fit, data.frame(DaysSmoke)
, interval = "prediction", level = 0.95)[, 2],
ymax = predict(lm_fit, data.frame(DaysSmoke)
, interval = "prediction", level = 0.95)[, 3])
, alpha=0.1, fill="darkgreen")
# linear regression fit and confidence bands
p <- p + geom_smooth(method = lm, se = TRUE)
# jitter a little to uncover duplicate points
p <- p + geom_jitter(position = position_jitter(.1), alpha = 0.75)
p <- p + labs(title = "Regression with confidence and prediction bands")
print(p)
```
#### (1 p) Assess the residuals for lack of fit (interpret plots of residuals vs fitted and $x$-value).
```{R, fig.width = 8, fig.height = 6}
# plot diagnostics
par(mfrow=c(2,3))
plot(lm_fit, which = c(1,4,6))
# residuals vs DaysSmoke
# can only plot observations that did not include NA in the lm_fit
plot(nesarc_sub %>% drop_na(TotalCigsSmokedSqrt, DaysSmoke) %>% pull(DaysSmoke), lm_fit$residuals, main="Residuals vs DaysSmoke")
# horizontal line at zero
abline(h = 0, col = "gray75")
# Normality of Residuals
library(car)
qqPlot(lm_fit$residuals, las = 1, id = list(n = 3), main="QQ Plot")
# # residuals vs order of data
# plot(lm_fit$residuals, main="Residuals vs Order of data")
# # horizontal line at zero
# abline(h = 0, col = "gray75")
hist(lm_fit$residuals, breaks=10, main="Residuals")
```
There is some nonrandom pattern.
It almost appears that there would be a shallower slope if it weren't for the 30 days/month smokers.
#### (1 p) Assess the residuals for normality (interpret QQ-plot and histogram).
Normality is not perfect, with some slight right skewness,
but probably about as good as it gets for this dataset.
The bigger problem is the mean pattern (mentioned above).
#### (1 p) Assess the relative influence of points.
There are a few points that have large Cook's D,
which are those few people who smoke 30 days/month and smoke the most
and a person who smokes few days but smokes many times more.
It may be worth seeing the effect of removing those four observations.
__Seeing the effect of removing 4 influential observations...__
```{R}
tail(sort(cooks.distance(lm_fit)))
```
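A common rule of thumb for flagging high-influence points is Cook's distance greater than $4/n$; a sketch using the built-in `cars` data (not the NESARC fit):

```r
# Sketch: flag potentially influential points with the 4/n rule of thumb
fit <- lm(dist ~ speed, data = cars)
cd  <- cooks.distance(fit)
which(cd > 4 / nrow(cars))  # indices of flagged observations
```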
As seen below, removing the 4 largest Cook's D observations
(which is a rash move -- usually we only remove one at a time
since which points are influential can change based on removing
a single observation)
does not greatly improve the model fit
or normality.
```{R}
# data with deleted points with large Cook's D
nesarc_sub_D <- nesarc_sub %>% filter(!(rownames(nesarc_sub) %in% names(tail(sort(cooks.distance(lm_fit)), 4))))
dim(nesarc_sub)
dim(nesarc_sub_D)
lm_fit_D <- lm(TotalCigsSmokedSqrt ~ DaysSmoke, data = nesarc_sub_D)
library(ggplot2)
p <- ggplot(nesarc_sub_D, aes(x = DaysSmoke, y = TotalCigsSmokedSqrt))
p <- p + geom_vline(xintercept = 0, alpha = 0.25)
p <- p + geom_hline(yintercept = 0, alpha = 0.25)
# prediction bands
p <- p + geom_ribbon(aes(ymin = predict(lm_fit_D, data.frame(DaysSmoke)
, interval = "prediction", level = 0.95)[, 2],
ymax = predict(lm_fit_D, data.frame(DaysSmoke)
, interval = "prediction", level = 0.95)[, 3])
, alpha=0.1, fill="darkgreen")
# linear regression fit and confidence bands
p <- p + geom_smooth(method = lm, se = TRUE)
# jitter a little to uncover duplicate points
p <- p + geom_jitter(position = position_jitter(.1), alpha = 0.75)
p <- p + labs(title = "Regression with confidence and prediction bands")
print(p)
```
```{R, fig.width = 8, fig.height = 6}
# plot diagnostics
par(mfrow=c(2,3))
plot(lm_fit_D, which = c(1,4,6))
# residuals vs DaysSmoke
# can only plot observations that did not include NA in the lm_fit_D
plot(nesarc_sub_D %>% drop_na(TotalCigsSmokedSqrt, DaysSmoke) %>% pull(DaysSmoke), lm_fit_D$residuals, main="Residuals vs DaysSmoke")
# horizontal line at zero
abline(h = 0, col = "gray75")
# Normality of Residuals
library(car)
qqPlot(lm_fit_D$residuals, las = 1, id = list(n = 3), main="QQ Plot")
# # residuals vs order of data
# plot(lm_fit_D$residuals, main="Residuals vs Order of data")
# # horizontal line at zero
# abline(h = 0, col = "gray75")
hist(lm_fit_D$residuals, breaks=10, main="Residuals")
```
__Continuing with the original data set ...__
#### (1 p) Test whether the slope is different from zero, $H_A: \beta_1 \ne 0$.
```{R}
summary(lm_fit)
```
First note that we have insufficient evidence to conclude that
$\beta_0$ is different from 0, which makes sense ---
if you smoke 0 days,
then 0 should be the number of cigarettes you smoke.
We have sufficient evidence to conclude the slope is different from zero.
Furthermore, for each additional day per month a person smokes,
the predicted square root of cigarettes smoked increases by about 0.6.
When this is squared back to the original scale, the increase grows quickly with days smoked.
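The size of that back-transformed increase can be sketched with illustrative round numbers (an intercept of 0 and slope of 0.6 are assumed here for arithmetic, not the fitted coefficients):

```r
# Sketch: back-transforming sqrt-scale predictions to the original scale
b0 <- 0
b1 <- 0.6
days      <- c(1, 10, 30)
pred_sqrt <- b0 + b1 * days  # sqrt(cigarettes per month) scale
pred_cigs <- pred_sqrt^2     # back on the cigarettes-per-month scale
rbind(days, pred_sqrt, pred_cigs)
```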
#### (1 p) Interpret the $R^2$ value.
The regression model explains a proportion
$R^2 = `r signif(summary(lm_fit)$r.squared, 3)`$
of the variation
in the response
beyond what the mean value alone explains.
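That interpretation follows from the definition $R^2 = 1 - SS_{\textrm{residual}} / SS_{\textrm{total}}$; a sketch verified against `summary()` on the built-in `cars` data (not the NESARC fit):

```r
# Sketch: R^2 computed by hand as 1 - SS(residual)/SS(total)
fit <- lm(dist ~ speed, data = cars)
r2  <- 1 - sum(resid(fit)^2) / sum((cars$dist - mean(cars$dist))^2)
all.equal(r2, summary(fit)$r.squared)  # TRUE
```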
## Week12: Logistic regression and Experiments vs Observational studies
### Tuesday ---------
### Logistic Regression
If I provide the solutions here, then there's nothing for you to do for the in-class assignment ...
### Thursday ---------
### Experiments and observational studies
* (2 p) Write a few sentences describing these characteristics of the data you're using for your project.
* What is the sampling design?
* What is the target population and sampling frame?
* How was nonresponse handled? Answer this both in terms of individual
nonresponse (when a respondent is lost to follow-up, if this applies
for you (e.g., AddHealth)) and question nonresponse (when a question is
skipped).
You started doing this earlier in Week07 Tuesday, when you described the sampling.
----------------------------------------
# Poster presentation
## Week13: Complete poster in HW document
### Tuesday ---------
### Thursday ---------
#### Title
Smoking behavior is (barely) associated with major depression in young adults
#### 1. __(1 p)__ Introduction
* __Smoking behavior is associated with major depression.__
- Longitudinal investigations have shown that depression increases risk
of later smoking [@breslau_major_1998; @dierker_association_2001].
- Many subsequent studies find a role of major depression in increasing
the probability and amount of smoking [@dierker_defining_2004;
@rohde_psychiatric_2004; @rohde_psychiatric_2003].
* __Smoking behavior and Nicotine dependence.__
- While smoking is a necessary requirement for nicotine dependence, smoking
behavior is a poor predictor of developing nicotine dependence
[@kandel_extent_2000; @stanton_antecedents_1995].
- Heavy smokers may not have nicotine dependence, conversely
light smokers may have nicotine dependence [@kandel_extent_2000].
#### 2. __(1 p)__ Research Questions
* (Week02 Research Questions) 2 bullets, one for each research question, stated as the alternative hypothesis.
The goals of the analysis include answering these two questions:
* Is there an association between major depression and nicotine dependence?
* Does the prevalence of nicotine dependence differ by ethnicity?
#### 3. __(1 p)__ Methods
* (Week01 Personal Codebook) Data source(s).
A sample from the first wave of the National Epidemiologic Survey on
Alcohol and Related Conditions (NESARC) (US adults).
We focus on young adult (18--25) smokers
(`r nrow(nesarc_sub)` of `r nrow(NESARC)` respondents).
In the [Introduction to the National Epidemiologic Survey on Alcohol and Related Conditions](http://pubs.niaaa.nih.gov/publications/arh29-2/74-78.htm)
they describe the sampling design.
Adult respondents were sampled from a subset of households
included in the US Census 2000 Supplementary Survey.
To improve information about smaller groups,
Blacks and Hispanics were oversampled by about 50%,
as were young adults ages 18-24 by 125%.
Using the US Census as a sampling frame helps to provide a representative sampling of the country.
Oversampling smaller subsets of the population means that comparisons involving those
subsets will be more reliable.
* (Week01 Personal Codebook) Variables used.
* Lifetime major depression (`Depression`, 0=No, 1=Yes) [@grant_alcohol_2003; @grant_alcohol_1995].
* Total cigarettes per month (`TotalCigsSmoked` = days smoked times daily cigs smoked, range: 0--`r max(nesarc_sub$TotalCigsSmoked)`),
a combination of these questions:
"About how often did you usually smoke in the past year?"
and "On the days that you smoked in the last year,
about how many cigarettes did you usually smoke?"
* Nicotine dependence (`TobaccoDependence`, 0=No, 1=Yes) in the last 12 months.
* (Various) Statistical methods used to answer the research questions.
* A two-sample $t$-test comparing the mean total cigarettes smoked by depression status.
* A $\chi^2$ analysis of a two-way contingency table of nicotine dependence by ethnicity.
#### 4. __(1 p)__ Discussion (while this would follow your results, let's put it here so you have a full column to show the results of the analysis of both research questions)
* Put the results you found for each research question in the context you provided in your introduction.
These results support the previous literature that individuals with major
depression are more sensitive to the development of nicotine dependence
regardless of how much they smoke. Thus, people with depression are an
important population subgroup for targeted smoking intervention programs.
#### 5. __(1 p)__ Further directions or Future work or Next steps or something else that indicates there more to do and you've thought about it.
* What do these results lead you to want to investigate?
We can extend focus from young adults to all adults, and (with multiple
regression techniques we'll learn in ADA2) assess the particular depressed
subpopulations most susceptible to developing nicotine dependence.
#### 6. __(1 p)__ References
* By citing sources in your introduction, this section will automatically have your bibliography.
* [Bibliography will go here -- it's currently at the bottom of the document]
#### 7. __(2 p)__ Results for your first research question.
* "Is the population mean square-root total cigarettes smoked different for those with depression or not?"
* $H_0: \mu_{D} = \mu_{ND}$ versus $H_A: \mu_{D} \ne \mu_{ND}$
```{R, echo=FALSE}
## If we create a summary data.frame with a similar structure as our data, then we
## can annotate our plot with those summaries.
est_mean_TCS_D <-
nesarc_sub %>%
group_by(Depression) %>%
summarise(
TotalCigsSmoked = mean(TotalCigsSmoked , na.rm = TRUE)
, TotalCigsSmokedSqrt = mean(TotalCigsSmokedSqrt, na.rm = TRUE)
) %>%
ungroup()
# est_mean_TCS_D
# library(ggplot2)
# p <- ggplot(nesarc_sub, aes(x = Depression, y = TotalCigsSmokedSqrt))
# p <- p + geom_boxplot(width = 0.5, alpha = 0.5)
# p <- p + geom_jitter(position = position_jitter(width = 0.1), alpha = 1/4)
# # diamond at mean for each group
# p <- p + stat_summary(fun.y = mean, geom = "point", shape = 18, size = 6,
# colour = "red", alpha = 0.8)
# p <- p + labs(title = "Square-root of Total Cigarettes Smoked by Depression Status")
# print(p)
# histogram using ggplot
p <- ggplot(nesarc_sub, aes(x = TotalCigsSmoked))
p <- p + geom_histogram(binwidth = 2)
p <- p + geom_rug(alpha = 0.1)
p <- p + geom_vline(data = est_mean_TCS_D, aes(xintercept = TotalCigsSmokedSqrt^2), colour = "red")
p <- p + facet_grid(Depression ~ .)
p <- p + scale_x_sqrt(breaks = c(0, 5, 25, 50, 100, 250, 500, 1000, 2000, 3000))
print(p)
```
```{R, echo=FALSE, results="asis"}
t_summary_TCS_D <- t.test(TotalCigsSmokedSqrt ~ Depression, nesarc_sub)
t_summary_TCS_D
#names(t_summary_TCS_D)
# checking model assumptions, looks perfect
bs_two_samp_diff_dist(
nesarc_sub %>% filter(Depression == "No Depression") %>% drop_na(TotalCigsSmokedSqrt) %>% pull(TotalCigsSmokedSqrt)
, nesarc_sub %>% filter(Depression == "Yes Depression") %>% drop_na(TotalCigsSmokedSqrt) %>% pull(TotalCigsSmokedSqrt)
)
# original scale
bs_two_samp_diff_dist(
nesarc_sub %>% filter(Depression == "No Depression") %>% drop_na(TotalCigsSmoked) %>% pull(TotalCigsSmoked)
, nesarc_sub %>% filter(Depression == "Yes Depression") %>% drop_na(TotalCigsSmoked) %>% pull(TotalCigsSmoked)
)
t_summary_TCS_DXX <- t.test(TotalCigsSmoked ~ Depression, nesarc_sub)
t_summary_TCS_DXX
```
* Model assumptions are met; the sampling distribution of the difference in means is normal.
* Because $p=`r signif(t_summary_TCS_D$p.value, 3)` > 0.05$ (with $t_{s} = `r signif(t_summary_TCS_D$statistic, 4)`$),
we have insufficient evidence to reject $H_0$ at an $\alpha=0.05$ significance level,
concluding that there is insufficient evidence that the mean total cigarettes smoked
differs by depression status.
* The difference is very small: on the square-root scale it is `r signif(diff(t_summary_TCS_D$estimate), 3)`
and on the natural scale it is `r t_summary_TCS_D$estimate^2`, a difference of
`r round(diff(t_summary_TCS_D$estimate^2))` cigarettes per month.
#### 8. __(2 p)__ Results for your second research question.
* Is there an association between TobaccoDependence and Ethnicity?
```{R, echo=FALSE}
# Tabulate by two categorical variables:
tab_E_TD <- xtabs(~ Ethnicity + TobaccoDependence, data = nesarc_sub)
#tab_E_TD
# column proportions
#prop.table(tab_E_TD, margin = 2)
```
```{R, echo=FALSE}
chisq_E_TD <- chisq.test(tab_E_TD, correct=FALSE)
#chisq_E_TD
```
```{R, echo=FALSE}
# table of expected frequencies:
#chisq_E_TD$expected
# smallest expected frequency:
#min(chisq_E_TD$expected)
```
The model assumptions are met since the expected count for each cell is at least 5.
Because the p-value $= `r signif(chisq_E_TD$p.value, 3)` < 0.05$ ($X^2 = `r signif(chisq_E_TD$statistic, 3)`$)
we reject the null hypothesis
concluding that there is an association between
TobaccoDependence and Ethnicity.
```{R, echo=FALSE}
# The Pearson residuals
#chisq_E_TD$residuals
# The sum of the squared residuals is the chi-squared statistic:
#chisq_E_TD$residuals^2
#sum(chisq_E_TD$residuals^2)
```
```{R, echo=FALSE, fig.width = 6, fig.height = 6}
# mosaic plot
library(vcd)
# this layout gives us the interpretation we want:
#mosaic(tab_E_TD, shade=TRUE, legend=TRUE)
mosaic(tab_E_TD, shade=TRUE, legend=TRUE, rot_labels = c(0, 90, 0, 0))
#, rot_varnames = c(0, 90, 0, 0)
#, labeling = labeling_border(just_labels = "right"))
#mosaic(~ Ethnicity + TobaccoDependence, data = nesarc_sub, shade=TRUE, legend=TRUE, direction = "v")
```
The primary cause of rejecting $H_0$ is that
Hispanics have less nicotine dependence than expected by chance.
## Week14: Posters, finishing up
### Tuesday ---------
### Thursday ---------
# Additional items
## Statistics summary tables
We won't get to this in our class,
but anyone in biostatistics or epidemiology will appreciate how easy it is to
produce such an informative "Demographics" table.
```{R results='asis', echo = TRUE}
library(moonBook)
out <-
moonBook::mytable(
Depression
+ TobaccoDependence
~
Age
+ SmokingFreq
+ DailyCigsSmoked
+ Ethnicity
+ Sex
+ DaysSmoke
+ TotalCigsSmoked
+ CigsSmokedFac
, data = nesarc_sub
)
myhtml(out)
```
# References (from Week02)