---
title: "S4R: Cumulative project file, Example assignment document"
subtitle: "UNM Stat 145 Special, Statistics for Research, Spring 2019, Prof Erik Erhardt"
author: "Erik Erhardt"
date: "`r format(Sys.time(), '%B %d, %Y')`"
output:
html_document:
toc: true
toc_depth: 5
code_folding: show
bibliography: NESARC_NicotineDependence_LitReview.bib
---
*Note: Each class, save this file with a new name,
updating the last two digits to the class number.
Then you'll have a record of your progress,
as well as of which files you turned in for grading.*
Starting in Class 07, we will concatenate all our worksheets (WSs) together to retain the
relevant information needed for subsequent classes.
You will also have an opportunity to revisit previous parts to make changes or improvements,
such as updating your codebook, modifying your research questions, or improving tables and plots.
I've provided an initial predicted organization of our
sections and subsections using the \# and \#\# symbols.
A table of contents is automatically generated by the "toc: true" in the yaml,
and the headings in the table of contents are clickable to jump down
to each (sub)section.
You will need to update any section symbols to be at least 3 deep (\#\#\#)
if you've copied them from an older assignment into this document.
Finally, you can delete all this header text once you don't need to refer to it.
--------------------------------------------------------------------------------
# Document overview
This document is organized by Class number.
Consider your readers (grader):
* organize the document clearly (use this document as an example)
* label minor sections under each day (use this document as an example)
* For each thing you do, always have these three parts:
1. Say what you're going to do and why.
2. Do it with code, and document your code.
3. Interpret the results.
## Global code options
```{R}
# I set some GLOBAL R chunk options here.
# (to hide this message add "echo=FALSE" to the code chunk options)
# In particular, see the fig.height and fig.width (in inches)
# and notes about the cache option.
knitr::opts_chunk$set(comment = NA, message = FALSE, warning = FALSE, width = 100)
knitr::opts_chunk$set(fig.align = "center", fig.height = 4, fig.width = 6)
# Note: The "cache=TRUE" option will save the computations of code chunks
# so R doesn't recompute everything every time you recompile.
# This can save _tons of time_ if you're working on a small section of code
# at the bottom of the document.
# Code chunks will be recompiled only if they are edited.
# The autodep=TRUE will also update dependent code chunks.
# A folder is created with the cache in it -- if you delete the folder, then
# all the code chunks will recompute again.
# ** If things aren't working as expected, or I want to freshly compute everything,
# I delete the *_cache folder.
knitr::opts_chunk$set(cache = TRUE) #, autodep=TRUE) #$
```
--------------------------------------------------------------------------------
# Research Questions
## Class 03 Datasets, Codebooks, Personal Codebook
### Question of interest
__Dataset__:
National Epidemiologic Survey on Alcohol and Related Conditions (NESARC),
with codebook NESARC_W1_CodeBook.pdf.
__Initial thinking__:
(My helpful narrative description to help you get going.)
_While nicotine dependence is a good starting point, I need to determine what
it is about nicotine dependence that I am interested in. It strikes me that
friends and acquaintances that I have known through the years that became
hooked on cigarettes did so across very different periods of time. Some seemed
to be dependent soon after their first few experiences with smoking and others
after many years of generally irregular smoking behavior. I decide that I am
most interested in exploring the association between level of smoking and
nicotine dependence. I add to my codebook variables reflecting smoking levels
(e.g., smoking quantity and frequency)._
__Topic of interest__:
(You need this part.)
I have decided to investigate the relationship between nicotine dependence and
the frequency and quantity of smoking among people up to 25 years old. The
association may differ by ethnicity, age, gender, and other factors.
__How I did it__:
(My helpful narrative description to help you get going.)
I looked through the codebook and found some variables of interest, searching
the text with "Ctrl-F" (find). For each variable, I
copied/pasted the description here, then formatted it so it's organized. You can
choose to use a table or an outline format. I found this verbatim text format
very easy to work with. I retained the "frequency" of each response because
it's interesting to know, and because it was already in the codebook --- this
value is not required for your codebook.
### Codebook
```
Dataset: NESARC
Primary association: nicotine dependence vs frequency and quantity of smoking
Key:
VarName (RenamedVarName)
Variable description
Data type (Continuous, Discrete, Nominal, Ordinal)
Frequency ItemValue Description
IDNUM (ID)
UNIQUE ID NUMBER WITH NO ALPHABETICS
Nominal
43093 1-43093. Unique Identification number
SEX (Sex)
SEX
Nominal
18518 1. Male
24575 2. Female
AGE (Age)
AGE
Continuous
43079 18-97. Age in years
14 98. 98 years or older
CHECK321 (SmokingStatus)
CIGARETTE SMOKING STATUS
Nominal
9913 1. Smoked cigarettes in the past 12 months
8078 2. Smoked cigarettes prior to the last 12 months
22 9. Unknown
25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
* change NA to 3. Never
* change 9 to NA
TAB12MDX (TobaccoDependence)
NICOTINE DEPENDENCE IN THE LAST 12 MONTHS
Nominal
38131 0. No nicotine dependence
4962 1. Nicotine dependence
S3AQ3B1 (SmokingFreq)
USUAL FREQUENCY WHEN SMOKED CIGARETTES
Ordinal
14836 1. Every day
460 2. 5 to 6 Day(s) a week
687 3. 3 to 4 Day(s) a week
747 4. 1 to 2 Day(s) a week
409 5. 2 to 3 Day(s) a month
772 6. Once a month or less
102 9. Unknown
25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
* change NA to 7. Never
* change 9 to NA
S3AQ3C1 (DailyCigsSmoked)
USUAL QUANTITY WHEN SMOKED CIGARETTES
Discrete
17751 1-98. Cigarette(s)
262 99. Unknown
25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
* change NA to 0 Cigarette(s)
* change 99 to NA
ETHRACE2A (Ethnicity)
IMPUTED RACE/ETHNICITY
Nominal
24507 1. White, Not Hispanic or Latino
8245 2. Black, Not Hispanic or Latino
701 3. American Indian/Alaska Native, Not Hispanic or Latino
1332 4. Asian/Native Hawaiian/Pacific Islander, Not Hispanic or Latino
8308 5. Hispanic or Latino
MAJORDEPLIFE (Depression)
MAJOR DEPRESSION - LIFETIME (NON-HIERARCHICAL)
Nominal
35254 0. No
7839 1. Yes
S1Q10A (IncomePersonal)
TOTAL PERSONAL INCOME IN LAST 12 MONTHS
Continuous
43093 0-3000000. Income in dollars
S1Q24FT (Height_ft)
HEIGHT: FEET
Continuous
42363 4-7. Feet
730 99. Unknown
* change 99 to NA
S1Q24IN (Height_in)
HEIGHT: INCHES
Continuous
3572 0. None
38760 1-11. Inches
761 99. Unknown
* change 99 to NA
S1Q24LB (Weight_lb)
WEIGHT: POUNDS
Continuous
41717 62-500. Pounds
1376 999. Unknown
* change 999 to NA
```
Additional variables were created from the original variables:
```
CREATED VARIABLES
DaysSmoke
estimated number of days per month a subject smokes
Continuous (due to the way it is estimated; assumes 30 days per month, derived from SmokingFreq)
0-30.
TotalCigsSmoked
estimated number of cigarettes per month a subject smokes (DaysSmoke * DailyCigsSmoked)
Continuous
0-large.
CigsSmokedFac
quartiles of TotalCigsSmoked
Ordinal
ranges for each of the four quarters
SmokingFreq3
collapsed levels of smoking frequency (three smoking categories plus Never)
Ordinal
from SmokingFreq
1. Daily = 1. Every day, 2. 5 to 6 Day(s) a week, 3. 3 to 4 Day(s) a week
2. Weekly = 4. 1 to 2 Day(s) a week, 5. 2 to 3 Day(s) a month
3. Monthly = 6. Once a month or less
4. Never = 7. Never
Height_inches
Total height in inches
Height_ft * 12 + Height_in
```
--------------------------------------------------------------------------------
## Class 04 Citations
I will use [@beck2014; @rich2013; @murp2012; @dean2014] for this work which was
obtained using `[@beck2014; @rich2013; @murp2012; @dean2014]`. Plus, @beck2014
says some interesting stuff and that citation was obtained using `@beck2014`.
For more documentation on bibliographies and citations with R Markdown, see
[http://rmarkdown.rstudio.com/authoring_bibliographies_and_citations.html](http://rmarkdown.rstudio.com/authoring_bibliographies_and_citations.html).
For general help with R Markdown, see
[https://www.rstudio.com/wp-content/uploads/2015/03/rmarkdown-reference.pdf](https://www.rstudio.com/wp-content/uploads/2015/03/rmarkdown-reference.pdf).
--------------------------------------------------------------------------------
## Class 05 Research Questions
See Class 06 below.
--------------------------------------------------------------------------------
## Class 06 Literature Review
__Dataset__:
National Epidemiologic Survey on Alcohol and Related Conditions (NESARC),
with codebook NESARC_W1_CodeBook.pdf.
Research question:
1. Is nicotine dependence [S3AQ10D] associated with smoking frequency [S3AQ3B1] and quantity [S3AQ3C1]?
* Google scholar search: "nicotine dependence smoking frequency"
* Citation: pdf file available for @dierker_association_2007
* Interesting points: Figures 2 and 3, quantity and frequency both positively related to probability of dependence.
* Others: @kandel_extent_2000 and @caraballo_linking_2009
*You don't need to include images in your literature review.
I'm pointing out these tables to illustrate what such tables look like:*
2. Is nicotine dependence [S3AQ10D] associated with depression [S4AQ1]?
* Google scholar search: "nicotine dependence depression"
* Citation: pdf file available for @breslau1995psychiatric
* Interesting points: Table 2, Smoking and Nicotine Dependence both associated with **Education**.
Table 3, Major depression associated with being nicotine dependent and **Sex**.
* Others: @breslau_major_1998
3. Is the association between nicotine dependence [S3AQ10D] and depression [S4AQ1] different by demographics, such as Education or Sex?
* In Question 2 we see differences by Education and Sex.
I have decided to further focus my question by examining whether the
association between nicotine dependence and depression differs based on how
much a person smokes. I am wondering if at low levels of smoking compared to
high levels, nicotine dependence is more common among individuals with major
depression than those without major depression. I add relevant depression
questions/items/variables to my personal codebook as well as several
demographic measures (age, gender, ethnicity, education, etc.) and any other variables I
may wish to consider.
All required variables have been found and added to my personal codebook (by expanding Class 03).
--------------------------------------------------------------------------------
# Data Management
## Class 07 Working With Data, Data Management
### Purpose of study: smoking and depression in young adults
(Note: This is the last paragraph of the research proposal used to remind the
reader of the research objectives.)
The present study will examine young adults from the National Epidemiologic
Survey of Alcohol and Related Conditions (NESARC). The goals of the analysis
will include 1) establishing the relationship between major depression and
nicotine dependence; and 2) determining whether or not the relationship between
major depression and nicotine dependence exists above and beyond smoking
quantity.
#### Variables
Variables from NESARC that will be used include (already in codebook, above):
* `IDNUM` (unique ID number with no alphabetics),
* `MAJORDEPLIFE` (has subject experienced a major depression?),
* `CHECK321` (has subject smoked cigarettes in past 12 months?),
* `AGE` (age of subject),
* `TAB12MDX` (tobacco dependence past 12 months),
* `S3AQ3B1` (usual frequency when cigarettes smoked),
* `ETHRACE2A` (ethnicity of subject),
* `SEX` (gender of subject), and
* `S3AQ3C1` (usual smoking quantity of cigarettes when smoked).
### Data subset columns
First, the data is placed on the search path using the `PDS` package.
```{R}
# data analysis packages
library(tidyverse) # Data manipulation and visualization suite
library(forcats) # Factor variables
library(lubridate) # Dates
# data package
library(PDS)
## Problems installing PDS package? Solution.
## If you had problems installing the PDS package, no problem; here's how to get the data:
## 1. Download the ".RData" file for your dataset into your ADA Folder.
## 2. Where I have "library(PDS)" in my code, change it to the two lines below.
## Update the "filename" to the name of the file.
##
## # library(PDS)
## load("filename.RData")
dim(NESARC)
```
The variables of interest are specified and a subset of the dataset including
only these variables is created and stored in the data frame `nesarc.sub`.
```{R}
# variables to include in our data subset
nesarc.sub <-
NESARC %>%
select(
IDNUM
, SEX
, AGE
, CHECK321
, TAB12MDX
, S3AQ3B1
, S3AQ3C1
, ETHRACE2A
, MAJORDEPLIFE
, S1Q10A
, S1Q24FT
, S1Q24IN
, S1Q24LB
)
```
Check size and structure of data subset.
Note that categorical variables are already `Factor` variables, but the levels do not have meaningful labels, yet.
```{R}
# size of subset
dim(nesarc.sub)
# structure of subset
str(nesarc.sub)
```
#### EXTRA: AddHealth, joining waves 1 and 2 together
Joining AddHealth waves 1 and 2 together into a single dataset
can be done if you want to use variables from when
the participants were both adolescents and adults.
```{R}
# Original AddHealth datasets
library(PDS)
# size of datasets, to verify the data is available
dim(AddHealth)
dim(addhealth_public4)
# First column is the unique ID
colnames(AddHealth)[1]
colnames(addhealth_public4)[1]
# Rename the unique ID variables "AID" (W1) and "aid" (W4) to "ID"
# so both W1 and W4 have the same variable name (lower/uppercase matters)
AddHealth <-
AddHealth %>%
rename(ID = AID)
addhealth_public4 <-
addhealth_public4 %>%
rename(ID = aid)
# note that the first column name has been updated
colnames(AddHealth)[1]
colnames(addhealth_public4)[1]
# Join two datasets by unique ID
AH14 <-
full_join(AddHealth, addhealth_public4)
# now we have twice as many columns
dim(AH14)
# Removing this large object since I don't need this data for the NESARC project
# DELETE THIS LINE if you're using AddHealth
rm(AH14)
```
### Renaming Variables
Rename variables `NewName = OriginalName`.
```{R}
nesarc.sub <-
nesarc.sub %>%
rename(
ID = IDNUM
, Sex = SEX
, Age = AGE
, SmokingStatus = CHECK321
, TobaccoDependence = TAB12MDX
, SmokingFreq = S3AQ3B1
, DailyCigsSmoked = S3AQ3C1
, Ethnicity = ETHRACE2A
, Depression = MAJORDEPLIFE
, IncomePersonal = S1Q10A
, Height_ft = S1Q24FT
, Height_in = S1Q24IN
, Weight_lb = S1Q24LB
)
str(nesarc.sub)
```
The first 6 and last 6 values are given by `head()` and `tail()`.
```{R}
head(nesarc.sub)
tail(nesarc.sub)
```
A quick summary of the variables indicates many missing values.
```{R}
# summary of variables -- notice many NAs
summary(nesarc.sub)
```
--------------------------------------------------------------------------------
## Class 08 Subsetting data and R Programming
### Coding missing values
There are two steps.
The first step is to recode any existing `NA`s to actual values, if necessary.
The method for doing this differs for numeric and categorical variables.
The second step is to recode any coded missing values, such as 9s or 99s, as actual `NA`.
#### Coding `NA`s as meaningful "missing"
First step: the existing blank values, coded as `NA`, mean "never",
and "never" has a meaning different from "missing".
For each variable we need to decide what "never" means
and code it appropriately.
##### `NA`s recoded as numeric
The easiest is `DailyCigsSmoked` where `NA` means 0 cigarettes.
We replace `NA` with 0.
The function `is.na()` identifies each observation that "is an `NA`".
I've updated the codebook to indicate that the original `NA` values were changed.
```
S3AQ3C1 (DailyCigsSmoked)
USUAL QUANTITY WHEN SMOKED CIGARETTES
Discrete
17751 1-98. Cigarette(s)
262 99. Unknown
25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
* change to 0 Cigarette(s)
```
```{R}
table(nesarc.sub$DailyCigsSmoked)
nesarc.sub <-
nesarc.sub %>%
mutate(
# replace NA with 0
DailyCigsSmoked = replace(DailyCigsSmoked, is.na(DailyCigsSmoked), 0)
)
table(nesarc.sub$DailyCigsSmoked)
```
##### `NA`s recoded as categorical
The next two variables require defining a new category level.
Both `SmokingStatus` and `SmokingFreq` with a value of `NA` means "never",
so we need to create a new "Never" category and put in order
with the rest of the levels.
I've updated the codebook to indicate that the original `NA` values were changed.
```
CHECK321 (SmokingStatus)
CIGARETTE SMOKING STATUS
Nominal
9913 1. Smoked cigarettes in the past 12 months
8078 2. Smoked cigarettes prior to the last 12 months
22 9. Unknown
25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
* change to 3. Never
S3AQ3B1 (SmokingFreq)
USUAL FREQUENCY WHEN SMOKED CIGARETTES
Ordinal
14836 1. Every day
460 2. 5 to 6 Day(s) a week
687 3. 3 to 4 Day(s) a week
747 4. 1 to 2 Day(s) a week
409 5. 2 to 3 Day(s) a month
772 6. Once a month or less
102 9. Unknown
25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
* change to 7. Never
```
```{R}
table(nesarc.sub$SmokingStatus )
table(nesarc.sub$SmokingFreq )
nesarc.sub <-
nesarc.sub %>%
mutate(
# first change the "factor" variables to "character" to add a new level
SmokingStatus = as.character(SmokingStatus)
, SmokingFreq = as.character(SmokingFreq )
# replace NA with new category values
, SmokingStatus = replace(SmokingStatus , is.na(SmokingStatus), "3")
, SmokingFreq = replace(SmokingFreq , is.na(SmokingFreq) , "7")
)
table(nesarc.sub$SmokingStatus )
table(nesarc.sub$SmokingFreq )
```
#### Coding 9s and 99s as `NA`s
Second step: replace the coded missing values with `NA`,
then remove the remaining records with missing values.
Not all of these variables need this operation, since the previous data-management
steps happened to remove some missing values already.
However, I'm going to run each as an extra precaution.
After all, if we later made different data decisions upstream,
we could end up with missing values at this point that
would be incorrectly coded without this step!
```{R}
table(nesarc.sub$SmokingStatus )
table(nesarc.sub$SmokingFreq )
table(nesarc.sub$DailyCigsSmoked)
nesarc.sub <-
nesarc.sub %>%
mutate(
# replace missing values
SmokingStatus = replace(SmokingStatus , SmokingStatus %in% c("9"), NA)
, SmokingFreq = replace(SmokingFreq , SmokingFreq %in% c("9"), NA)
, DailyCigsSmoked = replace(DailyCigsSmoked, DailyCigsSmoked %in% c( 99), NA)
, Height_ft = replace(Height_ft , Height_ft %in% c( 99), NA)
, Height_in = replace(Height_in , Height_in %in% c( 99), NA)
, Weight_lb = replace(Weight_lb , Weight_lb %in% c(999), NA)
# drop unused factor levels
, SmokingStatus = fct_drop(SmokingStatus)
, SmokingFreq = fct_drop(SmokingFreq )
)
table(nesarc.sub$SmokingStatus )
table(nesarc.sub$SmokingFreq )
table(nesarc.sub$DailyCigsSmoked)
summary(nesarc.sub)
```
Finally, consider using `na.omit()` to remove any records with missing values
if it won't cause issues with analysis.
The command `na.omit()` removes a row if any columns have an `NA` in that row.
This can have __drastic__ consequences.
For example, imagine that you have 7 variables and 6 variables have complete data,
but the 7th variable is about three-quarters `NA`s!
Then `na.omit()` will drop three-quarters of your data,
even though most of your variables are complete!
There's only one advantage I can think of for running `na.omit()`,
and that's if you have a large dataset and only a few observations have a few `NA`s.
You can't analyze data with `NA`s,
since any calculation involving `NA` results in an `NA`.
Therefore imagine two scenarios:
(1) An analysis with Variables 1 and 2, each with a few `NA` distributed throughout;
first the observations with `NA` for either Variable 1 or 2 will be removed, then
the analysis will proceed.
(2) A similar analysis, but now with Variables 1 and 3, but Variable 3 has `NA`
for different observations than Variable 2 --- therefore, different observations
will be removed before analysis.
In summary, these two analyses will include different observations from the dataset.
Therefore, by running `na.omit()` you will start with a dataset with no `NA`s,
thus every analysis will be of exactly the same set of data
(at the cost of not using all the possible data for each individual analysis).
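To make both points concrete, here is a tiny toy example (hypothetical data, not from NESARC): `NA` propagates through calculations unless explicitly removed, and `na.omit()` drops a row when *any* column in that row is `NA`.

```{R}
# toy data frame (hypothetical): column y is three-quarters NA
toy <- data.frame(x = 1:4, y = c(10, NA, NA, NA))

# any calculation involving NA results in NA
mean(toy$y)                # NA
# unless the NAs are explicitly removed first
mean(toy$y, na.rm = TRUE)  # 10

# na.omit() drops a row if ANY column in that row is NA,
# so three of the four rows disappear
nrow(na.omit(toy))         # 1
```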
Below, I do __not run `na.omit()`__ since I'm concerned about having lots of `NA`s for some variables.
```{R}
dim(nesarc.sub)
# remove rows with missing values
#nesarc.sub <- na.omit(nesarc.sub)
dim(nesarc.sub)
summary(nesarc.sub)
```
Since I did not run `na.omit()`, some missing values remain in the dataset;
they will be excluded analysis-by-analysis as needed.
### Creating new variables
Additional variables that we created from the original variables are listed in
this amendment to the original codebook.
```
CREATED VARIABLES
DaysSmoke
estimated number of days per month a subject smokes
Continuous (due to the way it is estimated; assumes 30 days per month, derived from SmokingFreq)
0-30.
TotalCigsSmoked
estimated number of cigarettes per month a subject smokes (DaysSmoke * DailyCigsSmoked)
Continuous
0-large.
CigsSmokedFac
quartiles of TotalCigsSmoked
Ordinal
ranges for each of the four quarters
SmokingFreq3
collapsed levels of smoking frequency (three smoking categories plus Never)
Ordinal
from SmokingFreq
1. Daily = 1. Every day, 2. 5 to 6 Day(s) a week, 3. 3 to 4 Day(s) a week
2. Weekly = 4. 1 to 2 Day(s) a week, 5. 2 to 3 Day(s) a month
3. Monthly = 6. Once a month or less
4. Never = 7. Never
Height_inches
Total height in inches
Height_ft * 12 + Height_in
```
#### From categories to numeric
`DaysSmoke` estimates the days per month a subject smokes
by mapping each level of `SmokingFreq` (a factor with 7 levels, including "Never")
to an estimated number of days per month (the midpoint of its range, scaled to a 30-day month)
using `case_when()`.
```{R}
nesarc.sub <-
nesarc.sub %>%
mutate(
DaysSmoke =
case_when(
SmokingFreq == 1 ~ 30 # 1. Every day
, SmokingFreq == 2 ~ 4 * 5.5 # 2. 5 to 6 Day(s) a week
, SmokingFreq == 3 ~ 4 * 3.5 # 3. 3 to 4 Day(s) a week
, SmokingFreq == 4 ~ 4 * 1.5 # 4. 1 to 2 Day(s) a week
, SmokingFreq == 5 ~ 2.5 # 5. 2 to 3 Day(s) a month
, SmokingFreq == 6 ~ 1 # 6. Once a month or less
, SmokingFreq == 7 ~ 0 # 7. Never
)
)
summary(nesarc.sub$DaysSmoke)
```
#### From numeric to numeric
`Height_inches` is the height of a person in inches.
Since the data was collected in two variables (`Height_ft` and `Height_in`)
we need to add them together for our new variable.
```{R}
nesarc.sub <-
nesarc.sub %>%
mutate(
Height_inches = Height_ft * 12 + Height_in
)
summary(nesarc.sub$Height_inches)
```
The variable `TotalCigsSmoked` estimates the number of cigarettes
a subject smokes per month by multiplying `DaysSmoke` by `DailyCigsSmoked`
(the usual quantity smoked per day).
```{R}
nesarc.sub <-
nesarc.sub %>%
mutate(
TotalCigsSmoked = DaysSmoke * DailyCigsSmoked
)
```
**Log scale.**
I like log2 since its interpretation is that every unit is a doubling.
Notice that I added 1 to the `TotalCigsSmoked` before computing the log.
This is because log(0) is undefined, thus I add 1 so that log(0 + 1) = 0 and all values are defined.
```{R}
nesarc.sub <-
nesarc.sub %>%
mutate(
TotalCigsSmokedLog2 = log(TotalCigsSmoked + 1, base = 2)
)
summary(nesarc.sub$TotalCigsSmokedLog2)
```
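A quick sanity check on the doubling interpretation (a small sketch, not part of the analysis): with the +1 offset, 0 cigarettes maps to 0, and each doubling of (cigarettes + 1) adds exactly one unit on the log2 scale.

```{R}
# 0, 1, 3, and 7 cigarettes become 1, 2, 4, and 8 after adding 1,
# so each successive doubling adds exactly one log2 unit
log(c(0, 1, 3, 7) + 1, base = 2)  # 0 1 2 3
```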
**Square root scale.**
Square root is another transformation that spreads out values between 0 and 1
and compresses values from 1 to infinity.
It can be preferred to a log transformation in some cases.
```{R}
nesarc.sub <-
nesarc.sub %>%
mutate(
TotalCigsSmokedSqrt = sqrt(TotalCigsSmoked)
)
summary(nesarc.sub$TotalCigsSmokedSqrt)
```
#### From numeric to categories based on quantiles
The numeric variable `TotalCigsSmoked` is converted into a factor,
`CigsSmokedFac`, with roughly equal numbers of observations in each of four
levels (quartiles), using the `.bincode()` function with quantile breaks.
The label `"Q1"` indicates that the original value was in the first quarter (25%) of the ordered data,
that is, among the smallest 25% of observations;
while `"Q4"` indicates the original value was in the last quarter (25%) of the ordered data,
that is, among the largest 25% of observations.
```{R}
quantile(nesarc.sub$TotalCigsSmoked, na.rm = TRUE)
nesarc.sub <-
nesarc.sub %>%
mutate(
CigsSmokedFac =
.bincode(TotalCigsSmoked
, breaks = quantile(TotalCigsSmoked, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)
, include.lowest = TRUE
)
, CigsSmokedFac = factor( CigsSmokedFac
, levels = c(1, 2, 3, 4)
, labels = c("Q1", "Q2", "Q3", "Q4")
)
)
```
In my data, more than half of the values are 0s which is the minimum value,
thus they all get grouped in "Q1"; this leaves no values for "Q2" -- oh well.
```{R}
summary(nesarc.sub$CigsSmokedFac)
```
#### From many categories to a few
The variable `SmokingFreq3` collapses `SmokingFreq` from 7 down to 4 categories
(three smoking levels plus Never).
This is an example to show you how to do this with your own variables.
Some of you will have categorical variables with a dozen or so categories;
having so many categories makes interpretation very difficult --- fewer is easier.
```{R}
## SmokingFreq3 variable
# notice that I'm doing this before I've assigned the labels
# if you do this after assigning labels (with the factor() command), then you'll want to use "as.numeric()" as below
# inspect original variable
table(nesarc.sub$SmokingFreq)
# look at numeric values
table(as.numeric(nesarc.sub$SmokingFreq))
# Note: it is sometimes easier to use the numeric "levels" of a factor variable, than the character "labels"
# collapse the levels into a new variable
nesarc.sub <-
nesarc.sub %>%
mutate(
SmokingFreq3 =
fct_collapse(
SmokingFreq
, "1" = c("1", "2", "3") # 1. daily
, "2" = c("4", "5") # 2. weekly
, "3" = "6" # 3. monthly
, "4" = "7" # 7. never
)
)
# new variable
table(nesarc.sub$SmokingFreq3)
# we will label the new factor variable below, with the others
```
#### Review results of new variables
Look at the first 10 observations in the dataset, with special attention to the
new variables.
```{R}
head(nesarc.sub, 10)
```
### Labeling Categorical variable levels
Look at summary of dataset to see the state of the labelling before we make changes.
```{R}
summary(nesarc.sub)
```
Informative labels are given to the factor levels.
The order of the levels is also rearranged for the variables
`SmokingFreq`, `TobaccoDependence`, and `Sex`.
```{R}
# Label factors with meaningful names in the same order as the codebook
nesarc.sub$Depression <-
factor(
nesarc.sub$Depression
, labels = c("No Depression"
, "Yes Depression"
)
)
nesarc.sub$SmokingStatus <-
factor(
nesarc.sub$SmokingStatus
, labels = c("Smoked past 12 months"
, "Smoked prior to 12 months"
, "Never smoked"
)
)
# Specify the order of the levels to be different from codebook order
nesarc.sub$Sex <-
factor(nesarc.sub$Sex
, levels = c( 2
, 1
)
, labels = c( "Female"
, "Male"
)
)
# check ordering with a frequency table
table(nesarc.sub$Sex)
nesarc.sub$TobaccoDependence <-
factor(nesarc.sub$TobaccoDependence
, levels = c( 1
, 0
)
, labels = c( "Nicotine Dependence"
, "No Nicotine Dependence"
)
)
# Remember the new category level from the original `NA` values.
nesarc.sub$SmokingFreq <-
factor(nesarc.sub$SmokingFreq
, levels = c( 7
, 6
, 5
, 4
, 3
, 2
, 1
)
, labels = c( "Never"
, "Once a month or less"
, "2 to 3 Days/month"
, "1 to 2 Days/week"
, "3 to 4 Days/week"
, "5 to 6 Days/week"
, "Every Day"
)
)
nesarc.sub$Ethnicity <-
factor(nesarc.sub$Ethnicity
# shorter labels are helpful for plots
, labels = c( "Cauc"
, "AfAm"
, "NaAm"
, "Asia"
, "Hisp"
)
# , labels = c("Caucasian"
# , "African American"
# , "Native American"
# , "Asian"
# , "Hispanic"
# )
)
nesarc.sub$SmokingFreq3 <-
factor(nesarc.sub$SmokingFreq3
, levels = c( 1
, 2
, 3
, 4
)
, labels = c( "Daily"
, "Weekly"
, "Monthly"
, "Never"
)
)
summary(nesarc.sub)
```
### Data subset rows to young smokers
In practice, I would probably subset the data much earlier.
In this example I waited until the end to subset so that I could illustrate
all of the other steps with complete values.
In our example,
we are interested in the _subset of people 25 or younger_
who _smoked in the last 12 months_ and this is
created using the `filter()` function.
Note that `SmokingStatus == 1` is used to choose people who
smoked in the past 12 months.
Notice, the missing values have been removed
(since `NA`s are not less than or equal to numbers).
```{R}
# the filter command below will not include a row that evaluates to NA for the
# specified condition. For example:
(NA == 1)
# age and smoking subset
nesarc.sub <-
nesarc.sub %>%
filter(
Age <= 25 # people 25 or younger
, SmokingStatus == "Smoked past 12 months" # smoked in the past 12 months
)
dim(nesarc.sub)
summary(nesarc.sub)
```
### EXTRA: Working with Dates
Dates can be tricky in R. The `lubridate` package makes them much easier.
Below I'm going to create a variable for the person's age in years
based on the person's date of birth and the interview date.
There is an `AGE` variable which I'm already using,
but if there wasn't, or I wanted to be more precise about their age (such as, 24.34 years old),
then this is a way to do it.
I'm going to create a separate data frame just for this example to show how to do it;
if you needed these variables, they should be part of your original subset above.
Below, the "C" variables (`CDAY`) are the interview day, month, and year,
and the "DOB" variables (`DOBD`) are the date of birth dat, month, and year.
```{R}
dat.sub <- subset(NESARC, select = c(CDAY, CMON, CYEAR, DOBD, DOBM, DOBY))
str(dat.sub)
```
Combine the date fields together, then interpret them into a date object.
I'll do this for both the interview date and the date of birth.
I print the head (first 6 observations) after each operation.
```{R}
## interview date
# combine day, month, and year into one text string
cdate.text <- paste(dat.sub$CDAY, dat.sub$CMON, dat.sub$CYEAR)
head(cdate.text)
# create date object by interpreting the day, month, year with dmy()
cdate.date <- dmy(cdate.text)
head(cdate.date)
## date of birth
# combine day, month, and year into one text string
dob.text <- paste(dat.sub$DOBD, dat.sub$DOBM, dat.sub$DOBY)
head(dob.text)
# create date object by interpreting the day, month, year with dmy()
# if any failed to parse, it's because a day, month, or year is out of range
# such as "99" for missing values or refused to answer
# in these cases, the date is NA (missing), which is what we want
dob.date <- dmy(dob.text)
head(dob.date)
```
Now that we have the interview date and date of birth as date objects,
we calculate the difference to give the person's age at the time of the interview.
Because the difference of dates is in the unit "days", we convert to years.
```{R}
# difference in days / 365.25 = difference in years
age.days <- cdate.date - dob.date
head(age.days)
age.years <- as.numeric(age.days / 365.25) # change from "difftime" to "numeric"
head(age.years)
# difference between the age we calculated and the age in the original dataset
head(age.years - NESARC$AGE)
```
Add the new age variable to your data frame,
drop the ones you don't need anymore,
and you're ready to go!
```{R}
# add the variable
dat.sub$age.years <- age.years
str(dat.sub)
# drop the original ones you don't need anymore
# select() with -c(VARS) excludes those columns
dat.sub <-
dat.sub %>%
select(-c(CDAY, CMON, CYEAR, DOBD, DOBM, DOBY))
str(dat.sub)
```
Now you've got an age (or other date object) you can use in your project.
--------------------------------------------------------------------------------
# Graphing
## Class 11 Graphing Univariate
### Categorical variables
#### Tables for categorical variables
Basic tables can be created using the functions `table` or `xtabs`.
Frequency tables are created for the variables
`TobaccoDependence`, `CigsSmokedFac`, and `SmokingFreq`.
When only looking at one variable, `table` is the same as `xtabs`, but
`xtabs` will make looking at two-way, three-way, and higher tables possible.
```{R}
# table() produces a one-variable frequency table
table(nesarc.sub$TobaccoDependence)
# proportions available by passing these frequencies to prop.table()
table(nesarc.sub$TobaccoDependence) %>% prop.table()
# xtabs() can be used for one or more variables
T1 <- xtabs( ~ TobaccoDependence, data = nesarc.sub)
T1
# proportions available with prop.table()
prop.table(T1)
T2 <- xtabs( ~ SmokingFreq, data = nesarc.sub)
T2
```
(Look at the .Rmd code to see that these numbers in the next paragraph are
calculated directly from the numbers summarized in the table!
This way, if the data changes, the text will be correctly updated.)
In the data frame `nesarc.sub`, there are
`r T1[1]` nicotine dependent subjects and
`r T1[2]` subjects that are not nicotine dependent.
Most of the subjects in `nesarc.sub` are daily smokers
(`r T2[6]`), with the remainder distributed roughly uniformly across the
first five levels of `SmokingFreq`.
#### Graphing frequency tables
The barplots are all created with the package `ggplot2`.
The barplots start with the defaults for the `geom_bar` and add more detail to the plot with each graph.
```{R}
library(ggplot2)
p1 <- ggplot(data = nesarc.sub, aes(x = TobaccoDependence))
p1 <- p1 + geom_bar()
p1
p2 <- ggplot(data = nesarc.sub, aes(x = SmokingFreq))
p2 <- p2 + geom_bar()
p2 <- p2 + labs(x = "", y = "Total Number", title = "Smoking Frequency for Young Smoking Adults")
p2 <- p2 + theme_bw()
p2 <- p2 + theme(axis.text.x = element_text(angle = 90, vjust = 0, hjust = 1))
p2
```
#### Removing NAs from factor axes
When a factor variable has `NA`s,
then `NA` appears as a level with frequencies (see bar plots above).
As of writing this, `ggplot()` does not have a way to remove the `NA`s before plotting.
Therefore, we have to do it manually.
The easiest and most transparent way is to use the `drop_na()` function below
where we only plot the non-`NA` values.
Compare the two plots below of the same data,
where the second plot has the `NA`s removed before plotting.
```{R, fig.height = 4, fig.width = 8}
library(ggplot2)
# Original plot with NAs
p1 <- ggplot(data = nesarc.sub, aes(x = SmokingFreq))
p1 <- p1 + geom_bar()
p1 <- p1 + labs(title = "Basic plot")
p1 <- p1 + theme(axis.text.x = element_text(angle = 90, vjust = 0, hjust = 1))
#p1
# subset() excludes the NAs for the variable being plotted:
# subset() specifies the rows to keep,
# and "!is.na()" means "keep the non-NA values"
#p2 <- ggplot(data = subset(nesarc.sub, !is.na(CigsSmokedFac)), aes(x = CigsSmokedFac))
p2 <- ggplot(data = nesarc.sub %>% drop_na(SmokingFreq), aes(x = SmokingFreq))
p2 <- p2 + geom_bar()
p2 <- p2 + labs(title = "Using drop_na()")
p2 <- p2 + theme(axis.text.x = element_text(angle = 90, vjust = 0, hjust = 1))
#p2
# grid.arrange() is a way to arrange several ggplot objects
library(gridExtra)
grid.arrange(grobs = list(p1, p2), nrow=1, top = "Excluding NAs from factor variable in ggplot")
```
#### Final versions
```{R}
library(ggplot2)
p1 <- ggplot(data = nesarc.sub, aes(x = TobaccoDependence))
p1 <- p1 + geom_bar()
p1 <- p1 + labs(x = ""
, y = "Total Number"
, title = "Nicotine Dependence for Young Smoking Adults"
)
print(p1)
```
**Interpretation:**
Among smokers 25 years and younger,
there are slightly more (`r (prop.table(T1)[1] - prop.table(T1)[2]) %>% round(3) * 100`% more)
people who are
nicotine dependent (`r prop.table(T1)[1] %>% round(3) * 100`%)
compared to those who are not nicotine dependent (`r prop.table(T1)[2] %>% round(3) * 100`%).
```{R}
p2 <- ggplot(data = nesarc.sub %>% drop_na(SmokingFreq), aes(x = SmokingFreq))
p2 <- p2 + geom_bar()
p2 <- p2 + labs(x = ""
, y = "Total Number"
, title = "Smoking Frequency for Young Smoking Adults"
)
p2 <- p2 + theme(axis.text.x = element_text(angle = 90, vjust = 0, hjust = 1))
print(p2)
```
**Interpretation:**
Among smokers 25 years and younger,
most smoke every day (`r prop.table(T2)[7] %>% round(3) * 100`%)
while the less frequent categories are roughly evenly distributed
around 4% to 5%.
### Numeric variables
#### Graphing numeric variables
One popular way to graph a continuous variable is to use a histogram.
`R` has the base function `hist()` which can be used for this purpose.
We will also use the package `ggplot2` to create histograms.
We start with a basic histogram of the variable `TotalCigsSmoked`.
```{R}
hist(nesarc.sub$TotalCigsSmoked)
```
Experiment with the code in the next code chunk to set the `binwidth = ` (extremely important!),
decide if you need `boundary = 0`,
add a title, and, if needed,
labels on the $x$- and $y$-axes.
```{R}
p <- ggplot(data = nesarc.sub, aes(x = TotalCigsSmoked))
p <- p + geom_histogram(binwidth = 200)
p <- p + geom_rug()
p <- p + labs(x = "Estimated Cigarettes Smoked per Month")
p <- p + theme_bw()
p
# Use "boundary = 0" for values that can't be negative
p <- ggplot(data = nesarc.sub, aes(x = TotalCigsSmoked))
p <- p + geom_histogram(boundary = 0, binwidth = 100)
p <- p + geom_rug()
p <- p + labs(x = "Estimated Cigarettes Smoked per Month", title = "boundary at 0")
p <- p + theme_bw()
p
```
#### Creating Density Plots
A smoothed density curve can be added with the `geom_density()` function,
where `adjust = 2.0` is a fairly smooth curve.
Experiment with `adjust` values such as 1.0, 1.5, 2.0, and 3.0 to decrease or increase the smoothness
and decide what level best captures the general shape of your distribution.
```{R}
p <- ggplot(data = nesarc.sub, aes(x = TotalCigsSmoked))
p <- p + geom_histogram(aes(y = ..density..), boundary = 0, binwidth = 100)
p <- p + geom_rug()
p <- p + geom_density(alpha = 0.2, fill = "gray50", adjust = 1.5)
p <- p + labs(x = "Estimated Cigarettes Smoked per Month")
p <- p + theme_bw()
p
```
#### Shape, center, and spread
```{R}
summary(nesarc.sub$TotalCigsSmoked)
fivenum(nesarc.sub$TotalCigsSmoked)
median(nesarc.sub$TotalCigsSmoked, na.rm = TRUE)
IQR(nesarc.sub$TotalCigsSmoked, na.rm = TRUE)
```
The "shape" of the distribution for the estimated cigarettes smoked per month
is skewed to the right.
The "center" (median) of the distribution is
`r median(nesarc.sub$TotalCigsSmoked, na.rm = TRUE)`
and the "spread" (interquartile range, middle 50%) for the distribution is
`r IQR(nesarc.sub$TotalCigsSmoked, na.rm = TRUE)`.
#### Final versions
```{R}
p <- ggplot(data = nesarc.sub, aes(x = TotalCigsSmoked))
p <- p + geom_histogram(aes(y = ..density..), boundary = 0, binwidth = 100)
p <- p + geom_rug()
p <- p + geom_density(alpha = 0.2, fill = "gray50", adjust = 1.5)
p <- p + labs(x = "Estimated Cigarettes Smoked per Month"
, y = ""
, title = "Total cigarettes smoked for Young Smoking Adults"
)
p <- p + theme_bw()
p
```
**Interpretation:**
Among smokers 25 years and younger,
the number of cigarettes smoked is right skewed
with a median (center) of `r median(nesarc.sub$TotalCigsSmoked, na.rm = TRUE)`
and the IQR (spread) (interquartile range, middle 50%) for the distribution is
`r IQR(nesarc.sub$TotalCigsSmoked, na.rm = TRUE)`.
There are three distinct modes (peaks) in the distribution
because of how we constructed this variable from two others,
and because of the common numbers of cigarettes being relative to a pack size (20).
There are several outlying values greater than 1200 (representing 40 cigarettes per day)
that are possibly overestimates from the people responding.
```{R}
p <- ggplot(data = nesarc.sub, aes(x = DailyCigsSmoked))
p <- p + geom_histogram(aes(y = ..density..), boundary = 0, binwidth = 2)
p <- p + geom_rug()
p <- p + geom_density(alpha = 0.2, fill = "gray50", adjust = 1.5)
p <- p + labs(x = "Estimated Cigarettes Smoked per Day"
, y = ""
, title = "Daily cigarettes smoked for Young Smoking Adults"
)
p <- p + theme_bw()
p
```
**Interpretation:**
Among smokers 25 years and younger,
most smoke less than 10 cigarettes per day with another peak at 20 (1 pack per day),
another small peak at 40 (2 packs per day),
and some extreme outlying values at 60, 80, and 100 (3, 4, and 5 packs per day).
--------------------------------------------------------------------------------
## Class 13 Graphing Bivariate
### Scatter plot (for regression): x = numerical, y = numerical
Research question:
With my questions, I don't have a meaningful question to ask that relates
two numeric variables.
An exploratory question is whether there is a relationship between a person's age (`Age`)
and total number of cigarettes smoked (`TotalCigsSmoked`).
Because age is an integer variable, I jitter the points,
and I reduce the opacity (increase the transparency) with `alpha = 1/4`
to see the density of points better (1/4 means that it takes 4 overlaid points to be fully opaque).
__Interpretation:__
I added a linear regression smoother to see whether there was any linear trend with age.
The horizontal line suggests that there isn't a (linear) relationship here.
The first plot has all the points in their original locations,
but they end up stacking on top of each other so you can't tell how many points are there.
```{R}
library(ggplot2)
p <- ggplot(nesarc.sub, aes(x = Age, y = TotalCigsSmoked))
p <- p + geom_point(alpha = 1/4)
p <- p + stat_smooth(method = lm)
p <- p + labs(title = "Total Cigarettes Smoked vs Age")
print(p)
```
This is the same data as above, but I have added some horizontal jitter (displacement)
to each point which doesn't affect their value on the y-axis
but allows us to see how many points are at each discrete age year.
```{R}
library(ggplot2)
p <- ggplot(nesarc.sub, aes(x = Age, y = TotalCigsSmoked))
p <- p + geom_jitter(position = position_jitter(width = 0.2), alpha = 1/4)
p <- p + stat_smooth(method = lm)
p <- p + labs(title = "Total Cigarettes Smoked vs Age")
print(p)
```
Here's an example where there's a strong positive relationship between the two
numeric variables.
Most variables in our dataset won't exhibit such a nice relationship.
```{R}
library(ggplot2)
p <- ggplot(nesarc.sub, aes(x = Height_inches, y = Weight_lb))
p <- p + geom_point(alpha = 1/4)
p <- p + stat_smooth(method = lm)
p <- p + labs(title = "Weight vs Height")
print(p)
```
If you have two groups you want to compare, you can create facets (small multiples)
and show the relationship by each group.
Below, I compare by sex.
The slopes appear similar;
later in the semester we'll learn how to formally compare these lines.
```{R}
library(ggplot2)
p <- ggplot(nesarc.sub, aes(x = Height_inches, y = Weight_lb))
p <- p + geom_point(alpha = 1/4)
p <- p + stat_smooth(method = lm)
p <- p + labs(title = "Weight vs Height")
p <- p + facet_wrap(~ Sex)
print(p)
```
### Box plots (for ANOVA): x = categorical, y = numerical
Research question:
1. Is there a relationship between depression (`Depression`) and total number of cigarettes smoked (`TotalCigsSmoked`)?
The $x$ variable should be a factor variable, and $y$ should be numeric.
__Interpretation:__
The boxplots indicate that the five-number summary is about the same for both depression groups.
Also, the means are about the same, though slightly higher for the people with depression.
```{R}
library(ggplot2)
p <- ggplot(nesarc.sub, aes(x = Depression, y = TotalCigsSmoked))
p <- p + geom_boxplot(width = 0.5, alpha = 0.5)
p <- p + geom_jitter(position = position_jitter(width = 0.1), alpha = 1/4)
# diamond at mean for each group
p <- p + stat_summary(fun.y = mean, geom = "point", shape = 18, size = 6,
colour = "red", alpha = 0.8)
p <- p + labs(title = "Depression does not predict Total Cigarettes Smoked")
print(p)
```
### Mosaic plot or bivariate bar plots (for contingency tables): x = categorical, y = categorical
Research question:
Is there a relationship between depression status (`Depression`) and nicotine dependence (`TobaccoDependence`)?
This strategy uses the base graphics by plotting the tables directly as bar plots.
The next set of plots using `ggplot()` may be preferred.
```{R}
T1 <- xtabs(~ TobaccoDependence + Depression, data = nesarc.sub)
T1
barplot(T1)
T2 <- prop.table(T1, 2)
T2
barplot(T2)
```
The last graph needs a legend as well as a title. While it is possible to
construct a legend and title for the last graph, it is much better to use an
approach that will generate the legend automatically.
__Interpretation:__
If a person has depression they are more likely to have nicotine dependence
than if they do not have depression.
```{R}
library(ggplot2)
p <- ggplot(data = nesarc.sub, aes(x = Depression, fill = TobaccoDependence))
p <- p + geom_bar(position = "fill")
p <- p + theme_bw()
p <- p + labs(title = "Fraction of young adult smokers with and without\n nicotine dependence by depression status")
# the legend title can be modified, if desired (try this line)
# p <- p + scale_fill_discrete(name="Tobacco Addiction\nStatus")
print(p)
```
Does the relationship between depression status (`Depression`) and nicotine dependence (`TobaccoDependence`)
vary by sex?
__Interpretation:__
It appears that nicotine dependence is slightly higher for women with depression than for men with depression.
```{R}
library(ggplot2)
p <- ggplot(data = nesarc.sub, aes(x = Depression, fill = TobaccoDependence))
p <- p + geom_bar(position = "fill")
p <- p + theme_bw()
# facet rows by Sex
p <- p + facet_grid(Sex ~ .)
# tilt the x-axis labels
p <- p + theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
p <- p + labs(title = "Fraction of young adult smokers with and without\n nicotine dependence by depression status and sex")
# the legend title can be modified, if desired (try this line)
p <- p + scale_fill_discrete(name="Tobacco Addiction\nStatus")
print(p)
```
Research question:
Is there a relationship between smoking frequency (`SmokingFreq`) and nicotine dependence (`TobaccoDependence`)?
Also compare by sex.
```{R}
library(ggplot2)
p <- ggplot(data = subset(nesarc.sub, !is.na(SmokingFreq)), aes(x = SmokingFreq, fill = TobaccoDependence))
p <- p + geom_bar(position = "fill")
p <- p + theme_bw()
# tilt axis labels
p <- p + theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
p <- p + labs(x = "Smoking Frequency", y = "Percent")
print(p)
# also compare by sex
p <- p + facet_grid(Sex ~ .)
print(p)
```
#### Using Mosaic Plots
__Interpretation:__
If a person has depression they are more likely to have nicotine dependence
than if they do not have depression.
```{R}
library(vcd)
mosaic( ~ TobaccoDependence + Depression, data = nesarc.sub)
mosaic( ~ TobaccoDependence + Depression, data = nesarc.sub, shade = TRUE)
mosaic( ~ TobaccoDependence + Depression + Sex, data = nesarc.sub, shade = TRUE)
```
It's easy to slice the data at such a fine resolution that it becomes difficult to understand what's happening.
```{R, fig.width = 8, fig.height = 8}
mosaic(~TobaccoDependence + Depression + Sex + CigsSmokedFac, data = nesarc.sub, shade = TRUE)
```
##### Customizing Mosaic Plots
There are several solutions to dealing with the messy labels that overlap.
The help line below will show the many options you have control over.
Read through the comments of each code chunk, see the new options, and see the result.
```{R, fig.height = 4, fig.width = 8}
library(vcd)
# help(labeling_border)
# Initial plot
mosaic( ~ TobaccoDependence + Depression, data = nesarc.sub)
# Abbreviate labels, specify the number of characters (try a few)
mosaic( ~ TobaccoDependence + Depression, data = nesarc.sub
, abbreviate_labs = 4)
# Rotate and align labels
mosaic( ~ TobaccoDependence + Depression, data = nesarc.sub
, rot_labels = c(0, 90, 0, 0)
, just_labels = "right" # right-justifies all the labels (top and side)
)
# Rotate and align labels only y-axis label
mosaic( ~ TobaccoDependence + Depression, data = nesarc.sub
, rot_labels = c(0, 90, 0, 0)
, just_labels = c("center", "right") # order is opposite as in first argument
)
# Rotate and align labels only y-axis label, and offset the variable name
mosaic( ~ TobaccoDependence + Depression, data = nesarc.sub
, rot_labels = c(0, 90, 0, 0)
, just_labels = c("center", "right") # order is opposite as in first argument
, offset_varnames = c(0,0,0,9) # offset the y-axis var name
)
```
### Logistic scatter plot (for logistic regression): x = numerical, y = categorical (binary)
Research question:
1. Does the quantity of smoking (`TotalCigsSmoked`) predict an individual's probability
of exhibiting nicotine dependence (`TobaccoDependence`)?
(Frequency and quantity of smoking are markedly imperfect indices of dependence;
this is true for other drugs as well.)
```{R}
library(ggplot2)
p <- ggplot(nesarc.sub, aes(x = TotalCigsSmoked, y = TobaccoDependence))
p <- p + geom_jitter(position = position_jitter(height = 0.1), alpha = 1/4)
print(p)
```
To overlay a curve, we need to convert our response variable $y$ to be
binary 0 or 1, where we let 1 indicate a "success" in the direction of interest.
__Interpretation:__
These two groups are quite balanced on the probability of tobacco dependence
given the total number of cigarettes smoked,
though they increase together slightly.
```{R}
# if "Nicotine Dependence", then code it as a 1, otherwise code as a 0
nesarc.sub$TobaccoDependence01 <- ifelse(nesarc.sub$TobaccoDependence == "Nicotine Dependence", 1, 0)
library(ggplot2)
p <- ggplot(nesarc.sub, aes(x = TotalCigsSmoked, y = TobaccoDependence01))
p <- p + geom_jitter(position = position_jitter(height = 0.1), alpha = 1/4)
# overlay a sigmoidal curve to indicate direction of relationship
# As of ggplot2 2.1.0, need to use this binomial_smooth() function:
binomial_smooth <- function(...) {
geom_smooth(method = "glm", method.args = list(family = "binomial"), ...)
}
p <- p + binomial_smooth() # old way: stat_smooth(method="glm", family="binomial", se=FALSE)
print(p)
```
#### Example of collapsing a variable to binary
It is convenient to collapse a continuous variable into a categorical variable
(as I did in Week03 with the `cut()` command)
or further collapse a categorical variable into a binary variable.
One strategy to create a binary variable is to use the `ifelse()` function.
Let's create a binary variable from `Age`: if a person is 21 years or older,
it is legal for them to purchase alcohol; otherwise, it is illegal.
See how the two columns below correspond as we wanted them to.
I do not recommend overwriting the original column,
but create a new variable in the data.frame.
```{R}
nesarc.sub$Alcohol_binary <- ifelse(nesarc.sub$Age >= 21, "Legal", "Illegal")
str(nesarc.sub$Alcohol_binary) #$
head(nesarc.sub[,c("Age", "Alcohol_binary")], 15)
```
# Statistical methods
## Mechanics of a hypothesis test (review)
1. Set up the __null and alternative hypotheses__ in words and notation.
* In words: "The population mean for [what is being studied] is different from [value of $\mu_0$]."
(Note that the statement in words is in terms of the alternative hypothesis.)
* In notation: $H_0: \mu=\mu_0$ versus $H_A: \mu \ne \mu_0$
(where $\mu_0$ is specified by the context of the problem).
2. Choose the __significance level__ of the test, such as $\alpha=0.05$.
3. Compute the __test statistic__, such as $t_{s} = \frac{\bar{Y}-\mu_0}{SE_{\bar{Y}}}$, where $SE_{\bar{Y}}=s/\sqrt{n}$ is the standard error.
4. Determine the __tail(s)__ of the sampling distribution where the __$p$-value__ from the test statistic will be calculated
(for example, both tails, right tail, or left tail).
(Historically, we would compare the observed test statistic, $t_{s}$,
with the __critical value__ $t_{\textrm{crit}}=t_{\alpha/2}$
in the direction of the alternative hypothesis from the
$t$-distribution table with degrees of freedom $df = n-1$.)
5. State the __conclusion__ in terms of the problem.
* Reject $H_0$ in favor of $H_A$ if $p\textrm{-value} < \alpha$.
* Fail to reject $H_0$ if $p\textrm{-value} \ge \alpha$.
(Note: We DO NOT _accept_ $H_0$.)
6. __Check assumptions__ of the test (for now we skip this).
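As a concrete illustration of steps 1-5 above, here is a minimal sketch of a one-sample $t$-test on simulated data (the sample `y`, the sample size, and `mu0` below are made up for illustration; they are not from our dataset):
```{R}
# Simulated example: test H0: mu = 50 versus HA: mu != 50
set.seed(76)
y   <- rnorm(25, mean = 53, sd = 8)  # hypothetical sample, n = 25
mu0 <- 50
# test statistic by hand: t = (ybar - mu0) / (s / sqrt(n))
t.by.hand <- (mean(y) - mu0) / (sd(y) / sqrt(length(y)))
t.by.hand
# t.test() computes the same statistic, df = n - 1, and the two-sided p-value,
# which we compare to alpha to state the conclusion
t.test(y, mu = mu0)
```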
## Class 18 Hypothesis Testing
### Two-sample $t$-test: `TotalCigsSmoked` by `Depression`
1. Using a numerical response variable and a two-level categorical variable
(or a categorical variable you can reduce to two levels),
specify a **two-sample $t$-test** associated with your research questions.
* (1 p) Indicate the two variables,
their definitions (the questions from the codebook that defines their data), and
a `summary()` of each variable.
* (2 p) Specify the hypotheses in words and notation (either one- or two-sided test).
* (0 p) Use `t.test()` to calculate the mean, test statistic, and p-value.
* (3 p) State the significance level, test statistic, and p-value.
* (2 p) State the conclusion in the context of the problem.
* (2 p) Provide an appropriate plot of the data and sample estimates in a well-labelled plot.
Variables: $y=$`TotalCigsSmoked` by $x=$`Depression`.
```
MAJORDEPLIFE (Depression)
MAJOR DEPRESSION - LIFETIME (NON-HIERARCHICAL)
Nominal
35254 0. No
7839 1. Yes
TotalCigsSmoked
estimated number of cigarettes per month a subject smokes (DaysSmoke * DailyCigsSmoked)
Continuous
0-large.
```
Calculate the means to use in our plots.
```{R}
## If we create a summary data.frame with a similar structure as our data, then we
## can annotate our plot with those summaries.
est.mean.TCS.D <-
nesarc.sub %>%
group_by(Depression) %>%
summarize(
TotalCigsSmoked = mean(TotalCigsSmoked, na.rm = TRUE)
)
est.mean.TCS.D
```
Use one of these two plots for your analysis.
For my data, I prefer the histogram.
```{R}
library(ggplot2)
p <- ggplot(nesarc.sub, aes(x = Depression, y = TotalCigsSmoked))
p <- p + geom_boxplot(width = 0.5, alpha = 0.5)
p <- p + geom_jitter(position = position_jitter(width = 0.1), alpha = 1/4)
# diamond at mean for each group
p <- p + stat_summary(fun.y = mean, geom = "point", shape = 18, size = 6,
colour = "red", alpha = 0.8)
p <- p + labs(title = "Comparison of Mean Total cigarettes smoked by Depression status")
print(p)
# histogram using ggplot
p <- ggplot(nesarc.sub, aes(x = TotalCigsSmoked))
p <- p + geom_histogram(binwidth = 40)
p <- p + geom_rug()
p <- p + geom_vline(data = est.mean.TCS.D, aes(xintercept = TotalCigsSmoked), colour = "red")
p <- p + facet_grid(Depression ~ .)
p <- p + labs(title = "Comparison of Mean Total cigarettes smoked by Depression status")
print(p)
```
Calculate the statistics for use in the hypothesis test.
```{R}
t.summary.TCS.D <- t.test(TotalCigsSmoked ~ Depression, nesarc.sub)
t.summary.TCS.D
```
In the statements below, I fill in the values automatically with some R code;
you can type the numbers manually from the `t.test()` output to make your life easier.
1. "Is the population mean total cigarettes smoked different for those with and without depression?"
* $H_0: \mu_{D} = \mu_{ND}$ versus $H_A: \mu_{D} \ne \mu_{ND}$
2. Let $\alpha=0.05$, the significance level of the test and the Type-I error probability if the null hypothesis is true.
3. $t_{s} = `r signif(t.summary.TCS.D$statistic, 4)`$.
4. $p=`r signif(t.summary.TCS.D$p.value, 3)`$, this is the observed significance of the test.
5. Because $p=`r signif(t.summary.TCS.D$p.value, 3)` > 0.05$,
we have insufficient evidence to reject $H_0$, concluding that the
mean total cigarettes smoked does not differ by depression status.
--------------------------------------------------------------------------------
## Class 19 ANOVA
### ANOVA Hypothesis test: Total cigarettes smoked by Ethnicity
A table of the ethnic groups to compare is below.
```{R}
summary(nesarc.sub$Ethnicity)
```
Let $\mu_j$ = pop mean number of the total cigarettes smoked for the five ethnicities identified in the dataset:
Caucasian,
African American,
Native American,
Asian, and
Hispanic,
numbered $(j = 1, 2, 3, 4, 5)$, respectively.
We wish to test $H_0: \mu_1=\cdots=\mu_5$ against $H_A: \textrm{not } H_0$ (at least one pair of means differ).
Plot the data in a way that compares the means.
Error bars are 95% confidence intervals of the mean.
```{R}
# Plot the data using ggplot
library(ggplot2)
p <- ggplot(nesarc.sub, aes(x = Ethnicity, y = TotalCigsSmoked))
# plot a reference line for the global mean (assuming no groups)
p <- p + geom_hline(yintercept = mean(nesarc.sub$TotalCigsSmoked, na.rm = TRUE),
colour = "black", linetype = "dashed", size = 0.3, alpha = 0.5)
# boxplot, size=.75 to stand out behind CI
p <- p + geom_violin(width = 0.5, alpha = 0.25)
p <- p + geom_boxplot(width = 0.25, alpha = 0.25)
# points for observed data
p <- p + geom_point(position = position_jitter(w = 0.05, h = 0), alpha = 0.2)
# diamond at mean for each group
p <- p + stat_summary(fun.y = mean, geom = "point", shape = 18, size = 4,
colour = "red", alpha = 0.8)
# confidence limits based on normal distribution
p <- p + stat_summary(fun.data = "mean_cl_normal", geom = "errorbar",
width = .2, colour = "red", alpha = 0.8)
p <- p + labs(title = "Total cigarettes smoked by Ethnicity")
p <- p + ylab("Total Cigs Smoked")
p <- p + labs(caption = "Dashed line is grand mean, red diamonds are group means, red intervals are 95% confidence intervals.")
print(p)
```
__Hypothesis test__
1. Set up the __null and alternative hypotheses__ in words and notation.
* In words: "The population mean total cigarettes smoked is different between ethnicities."
* In notation: $H_0: \mu_1=\cdots=\mu_5$ versus $H_A: \textrm{not } H_0$ (at least one pair of means differ).
2. Let the significance level of the test be $\alpha=0.05$.
3. Compute the __test statistic__.
```{R}
aov.summary <- aov(TotalCigsSmoked ~ Ethnicity, data = nesarc.sub)
summary(aov.summary)
```
The $F$-statistic for the ANOVA is $F = `r signif(unlist(summary(aov.summary))["F value1"], 3)`$.
4. Compute the __$p$-value__ from the test statistic.
The p-value for testing the null hypothesis is
$p = `r signif(unlist(summary(aov.summary))["Pr(>F)1"], 3)`$.
5. State the __conclusion__ in terms of the problem.
Because $p = `r signif(unlist(summary(aov.summary))["Pr(>F)1"], 3)` < 0.05$,
we reject $H_0$ in favor of $H_A$, concluding that
the mean total cigarettes smoked is different between ethnicities.
### Post Hoc pairwise comparison tests
6. If the ANOVA null hypothesis was rejected, then perform follow-up Post Hoc
pairwise comparison tests to determine which pairs of means are different.
There are several multiple comparison methods described in the notes.
Let's use Tukey's Honest Significant Difference (HSD) here to test which pairs of
populations differ.
```{R}
TotalCigsSmoked_means <-
nesarc.sub %>%
# group each Ethnicity to calculate a mean for each
group_by(Ethnicity) %>%
summarize( # calculate means by Ethnicity
m = mean(TotalCigsSmoked, na.rm = TRUE)
) %>%
arrange( # sort the Ethnicity by means
m
)
TotalCigsSmoked_means
# Tukey 95% Individual p-values
TukeyHSD(aov.summary)
```
Summarize results by ordering the means and grouping pairs that do not differ (e.g., `p adj` $\ge 0.05$).
I find it easiest to look for the large p-values (which pairs are not significantly different, `p adj` $\ge 0.05$)
and group those first, then look at the significant pairs (`p adj` $< 0.05$) and make sure none are connected.
Here are two ways to do this
(vertical is useful when there are many groups;
horizontal is easier to read).
```
NotSigDiff Mean Ethnicity
| 220 Hispanic
| 244 African American
| 260 Asian
|| 311 Native American
| 369 Caucasian
Ethnicity: Hispanic African American Asian Native American Caucasian
Mean: 220 244 260 311 369
--------------------------------------------------------
----------------------------
```
**Summary of differences:**
On average, Hispanics and African Americans smoke the least, while Native Americans and Caucasians smoke the most.
Furthermore, Hispanics are significantly different from the Native American and Caucasian groups,
and African Americans and Asians are significantly different from the Caucasian group.
--------------------------------------------------------------------------------
## Class 21 Contingency tables
### Hypothesis test for Homogeneity of Proportions (categorical association)
1. Set up the __null and alternative hypotheses__ in words and notation.
* In words: "There is an association between [row variable] and [column variable]."
(Note that the statement in words is in terms of the alternative hypothesis.)
* In notation: $H_0: p(i \textrm{ and } j) = p(i)p(j)$ versus $H_A: p(i \textrm{ and } j) \ne p(i)p(j)$, for all row categories $i$ and column categories $j$.
This is another way of saying that the probability of row category $i$ conditional on column category $j$,
$p(i|j) = p(i)$, does not depend on which column category $j$ we consider.
2. Choose the __significance level__ of the test, such as $\alpha=0.05$.
3. Compute the __test statistic__, such as $\chi^2$.
4. Compute the __$p$-value__ from the test statistic
with degrees of freedom $df = (R-1)(C-1)$,
where $R$ and $C$ are the number of rows and columns.
5. State the __conclusion__ in terms of the problem.
* Reject $H_0$ in favor of $H_A$ if $p\textrm{-value} < \alpha$.
* Fail to reject $H_0$ if $p\textrm{-value} \ge \alpha$.
(Note: We DO NOT _accept_ $H_0$.)
6. __Check assumptions__ of the test
(expected count for each cell is at least 5,
or at least 1 provided the expected counts are not too variable).
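Before applying these steps to our data below, here is a minimal sketch of the mechanics on a small made-up 2-by-3 table (the counts are invented for illustration):
```{R}
# invented 2x3 table of counts
tab.example <- matrix(c(30, 20, 10,
                        20, 25, 15), nrow = 2, byrow = TRUE)
# chi-squared test of association (no continuity correction)
chisq.example <- chisq.test(tab.example, correct = FALSE)
chisq.example
# df = (R - 1)(C - 1) = (2 - 1)(3 - 1) = 2
chisq.example$parameter
# expected counts, used to check the assumption that each is at least 5
chisq.example$expected
```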
### Two-way categorical tables: TobaccoDependence and Ethnicity
Using two categorical variables with two to five levels each,
specify a hypothesis test for homogeneity of proportions associated with your research questions.
1. Set up the __null and alternative hypotheses__ in words and notation.
* In words: "There is an association between TobaccoDependence and Ethnicity."
* In notation: $H_0: p(i \textrm{ and } j) = p(i)p(j)$ versus $H_A: p(i \textrm{ and } j) \ne p(i)p(j)$, for all row categories $i$ and column categories $j$.
Observed frequency and (row) proportion tables:
```{R}
# Tabulate by two categorical variables:
tab.E.TD <- xtabs(~ Ethnicity + TobaccoDependence, data = nesarc.sub)
tab.E.TD
# row proportions
prop.table(tab.E.TD, margin = 1)
```
2. GIVEN: Choose the __significance level__ of the test: let $\alpha=0.05$.
3. Compute the __test statistic__, such as $\chi^2$.
```{R}
chisq.E.TD <- chisq.test(tab.E.TD, correct = FALSE)
chisq.E.TD
```
The test statistic is $\chi^2 = `r signif(chisq.E.TD$statistic, 3)`$.
4. Compute the __$p$-value__ from the test statistic
with degrees of freedom $df = (R-1)(C-1)$,
where $R$ and $C$ are the number of rows and columns.
p-value $= `r signif(chisq.E.TD$p.value, 3)`$.
5. State the __conclusion__ in terms of the problem.
* Reject $H_0$ in favor of $H_A$ if $p\textrm{-value} < \alpha$.
* Fail to reject $H_0$ if $p\textrm{-value} \ge \alpha$.
(Note: We DO NOT _accept_ $H_0$.)
Because the p-value $= `r signif(chisq.E.TD$p.value, 3)` < 0.05$
we reject the null hypothesis
concluding that there is an association between
TobaccoDependence and Ethnicity.
6. __Check assumptions__ of the test
(expected count for each cell is at least 5,
or at least 1 provided the expected counts are not too variable).
```{R}
# table of expected frequencies:
chisq.E.TD$expected
# smallest expected frequency (needs to be at least 5 for model assumptions):
min(chisq.E.TD$expected)
```
The model assumptions are met since the expected count for each cell is at least 5.
7. Pearson residuals.
If you rejected the null hypothesis above, the Pearson residuals are a way to
indicate which cells of the table were different from expected.
Residuals more extreme than roughly $\pm 2$ are considered "large".
```{R}
# The Pearson residuals
chisq.E.TD$residuals
# Note: The sum of the squared residuals is the chi-squared statistic:
chisq.E.TD$residuals^2
sum(chisq.E.TD$residuals^2)
```
The Hispanic row has residuals more extreme than 2.
8. Plot a mosaic plot of the data and Pearson residuals.
```{R, fig.width = 6, fig.height = 6}
# mosaic plot
library(vcd)
# this layout gives us the interpretation we want:
mosaic(tab.E.TD, shade=TRUE, legend=TRUE
, rot_labels = c(0, 90, 0, 0)
, just_labels = c("center", "right") # order is opposite as in first argument
, offset_varnames = c(0,0,0,9) # offset the y-axis var name
)
#mosaic(~ Ethnicity + TobaccoDependence, data = nesarc.sub, shade=TRUE, legend=TRUE, direction = "v")
```
9. Interpret the mosaic plot with reference to the Pearson residuals.
The primary cause of rejecting $H_0$ is that
the Hispanic group has lower nicotine dependence than expected under independence,
compared to the other ethnicities.
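As a sanity check on steps 3 and 4, the chi-squared statistic, degrees of freedom, and p-value can be reproduced by hand from the observed table (a sketch; it assumes `tab.E.TD` from the earlier chunk is still in the workspace):

```{R}
# Reproduce the chi-squared test by hand from the observed table
O  <- tab.E.TD                                 # observed counts
E  <- outer(rowSums(O), colSums(O)) / sum(O)   # expected counts under independence
X2 <- sum((O - E)^2 / E)                       # Pearson chi-squared statistic
df <- (nrow(O) - 1) * (ncol(O) - 1)            # df = (R-1)(C-1)
pchisq(X2, df = df, lower.tail = FALSE)        # p-value; matches chisq.E.TD$p.value
```

The same numbers come out of `chisq.test(tab.E.TD, correct = FALSE)` above, so this is only a check on the formulas.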
--------------------------------------------------------------------------------
## Class 23 Correlation and Interactions
From a previous scatterplot, calculate the correlation and interpret it.
The plot from the previous section and correlation are calculated below.
(Unfortunately, my example is not very interesting.)
```{R}
library(ggplot2)
p <- ggplot(nesarc.sub, aes(x = Age, y = TotalCigsSmoked))
p <- p + geom_jitter(position = position_jitter(width = 0.1), alpha = 1/4)
p <- p + stat_smooth(method = lm)
p <- p + labs(title = "Total Cigarettes Smoked vs Age")
print(p)
```
```{R}
# Calculate correlation
# Because of possible missing values, the "use" command excludes NAs before calculation
cor.TL.A <- cor(nesarc.sub$TotalCigsSmoked, nesarc.sub$Age, use="pairwise.complete.obs")
cor.TL.A
```
### Interpretation of correlation
The correlation is `r signif(cor.TL.A, 3)`, which is negative but effectively 0.
This suggests there is effectively no linear relationship between Total Cigarettes Smoked and Age.
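If a confidence interval or formal test for the correlation is wanted, `cor.test()` provides both (a sketch; it drops incomplete pairs itself, so no `use=` argument is needed):

```{R}
# Test H0: rho = 0 and get a 95% CI for the correlation
cor.test(nesarc.sub$TotalCigsSmoked, nesarc.sub$Age)
```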
--------------------------------------------------------------------------------
## Class 26 Linear Regression
Select two numerical variables.
I don't have great variables in my dataset related to my research question for this.
I select `TotalCigsSmokedLog10` and `DaysSmoke`.
### Plot the data
If required, transform the variables so a roughly linear relationship is observed;
all interpretations will be done on this scale of the variables.
**Goal:**
We're looking for the regression line to pass through the centers of the groups of points
and for the groups of points to have roughly the same spread.
The original scale is a terrible fit.
The regression line does not pass through the centers of the groups of points
and the spread varies greatly between groups of points.
```{R}
library(ggplot2)
p <- ggplot(nesarc.sub, aes(x = DaysSmoke, y = TotalCigsSmoked))
p <- p + geom_vline(xintercept = 0, alpha = 0.25)
p <- p + geom_hline(yintercept = 0, alpha = 0.25)
# linear regression fit and confidence bands
p <- p + geom_smooth(method = lm, se = TRUE)
# jitter a little to uncover duplicate points
p <- p + geom_jitter(position = position_jitter(.1), alpha = 0.75)
p <- p + labs(title = "Regression with confidence bands"
, x = "Days smoked per month"
, y = "Number of cigarettes smoked per month"
)
print(p)
```
The square-root scale is better than the original scale but not as good as the log2 scale below.
The regression line is a little further from the middle of each group, and the spread for each group still increases substantially.
```{R}
library(ggplot2)
p <- ggplot(nesarc.sub, aes(x = DaysSmoke, y = TotalCigsSmokedSqrt))
p <- p + geom_vline(xintercept = 0, alpha = 0.25)
p <- p + geom_hline(yintercept = 0, alpha = 0.25)
# linear regression fit and confidence bands
p <- p + geom_smooth(method = lm, se = TRUE)
# jitter a little to uncover duplicate points
p <- p + geom_jitter(position = position_jitter(.1), alpha = 0.75)
p <- p + labs(title = "Regression with confidence bands, square-root scale"
, x = "Days smoked per month"
, y = "Square-root Number of cigarettes smoked per month"
)
print(p)
```
The log2 scale is a pretty good fit.
The regression line just misses the middle of each group and has roughly equal spread for each group.
In a more advanced class we would consider a regression curve that bends with the points.
For this introductory course, I feel pretty happy with this fit.
```{R}
library(ggplot2)
p <- ggplot(nesarc.sub, aes(x = DaysSmoke, y = TotalCigsSmokedLog2))
p <- p + geom_vline(xintercept = 0, alpha = 0.25)
p <- p + geom_hline(yintercept = 0, alpha = 0.25)
# linear regression fit and confidence bands
p <- p + geom_smooth(method = lm, se = TRUE)
# jitter a little to uncover duplicate points
p <- p + geom_jitter(position = position_jitter(.1), alpha = 0.75)
p <- p + labs(title = "Regression with confidence bands, log2 scale"
, x = "Days smoked per month"
, y = "Log2 Number of cigarettes smoked per month"
)
print(p)
```
Another way to handle the curve pattern of the points is to use another log transform on the x-axis of `DaysSmoke`.
Below I added a `scale_x_log10()` to scale the x-axis.
I don't think this looks substantially better than the y-axis only log2 plot above;
if I did, I'd create a new variable `DaysSmokeLog2` and use that for analysis going forward.
```{R}
library(ggplot2)
p <- ggplot(nesarc.sub, aes(x = DaysSmoke, y = TotalCigsSmokedLog2))
p <- p + geom_vline(xintercept = 0, alpha = 0.25)
p <- p + geom_hline(yintercept = 0, alpha = 0.25)
# linear regression fit and confidence bands
p <- p + geom_smooth(method = lm, se = TRUE)
# jitter a little to uncover duplicate points
p <- p + geom_jitter(position = position_jitter(.01), alpha = 0.75)
p <- p + scale_x_log10()#limits=c(NA, log10(5000)))
p <- p + labs(title = "Regression with confidence bands, log10 x and log2 y scales"
, x = "Log10 Days smoked per month"
, y = "Log2 Number of cigarettes smoked per month"
)
print(p)
```
#### Final plot
The relationship greatly improved with the log2-scale for the $y$-variable.
The line goes close to the center of each group and the spread (variance) is roughly equal for each group.
We'll use the log2-scale `TotalCigsSmokedLog2` variable for analysis.
```{R}
library(ggplot2)
p <- ggplot(nesarc.sub, aes(x = DaysSmoke, y = TotalCigsSmokedLog2))
p <- p + geom_vline(xintercept = 0, alpha = 0.25)
p <- p + geom_hline(yintercept = 0, alpha = 0.25)
# linear regression fit and confidence bands
p <- p + geom_smooth(method = lm, se = TRUE)
# jitter a little to uncover duplicate points
p <- p + geom_jitter(position = position_jitter(.1), alpha = 0.75)
p <- p + labs(title = "Number of cigarettes by days smoked per month"
, x = "Days smoked per month"
, y = "Log2 Number of cigarettes smoked per month"
)
print(p)
```
### Fit the simple linear regression model.
```{R}
# fit model
lm.fit <- lm(TotalCigsSmokedLog2 ~ DaysSmoke, data = nesarc.sub)
```
### Test whether the slope is different from zero, $H_A: \beta_1 \ne 0$.
```{R}
summary(lm.fit)
```
__Hypothesis test__ (THIS IS NOT PART OF THE ASSIGNMENT)
1. Set up the __null and alternative hypotheses__ in words and notation.
* In words: "The slope of the linear regression line is 0" versus "the slope is different from 0."
* In notation: $H_0: \beta_1=0$ versus $H_A: \beta_1\ne 0$.
2. Let the significance level of the test be $\alpha=0.05$.
3. Compute the __test statistic__.
```{R}
summary(lm.fit)$coefficients[2,3]
```
The $t$-statistic for the slope parameter is $t = `r signif(summary(lm.fit)$coefficients[2,3], 3)`$.
4. Compute the __$p$-value__ from the test statistic.
The p-value for testing the null hypothesis is
$p = `r signif(summary(lm.fit)$coefficients[2,4], 3)`$.
5. State the __conclusion__ in terms of the problem.
Because $p = `r signif(summary(lm.fit)$coefficients[2,4], 3)` < 0.05$,
we reject $H_0$ in favor of $H_A$ concluding that
the slope is different from 0, $\beta_1\ne 0$.
We have sufficient evidence to conclude the slope is different from zero.
Furthermore, for every additional day per month a person smokes,
there is a predicted increase of about 0.2 in the log2 number of cigarettes smoked.
On the original scale (by exponentiating), each extra day multiplies the (cigarettes + 1) count
by about $2^{0.2} \approx 1.15$, which compounds quickly over a month.
### Interpret the slope
The slope is `r signif(summary(lm.fit)$coefficients[2,1], 3)`,
thus for every 1-day increase per month that a person smokes,
we expect a `r signif(summary(lm.fit)$coefficients[2,1], 3)` increase
in the log2 number of cigarettes they smoke.
Since this is on a log scale, this means that the number of cigarettes increases exponentially with the number of days a person smokes.
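To make the multiplicative interpretation concrete, here is a small sketch using 0.2 as the approximate slope reported above (recall the response is $\log_2(\textrm{TotalCigsSmoked} + 1)$, so the factor applies to the count plus 1):

```{R}
# Back-transform an (approximate) log2 slope of 0.2
b1 <- 0.2
2^b1        # one extra smoking day multiplies (cigarettes + 1) by about 1.15
2^(b1 * 30) # 30 extra days multiply it by 2^6 = 64
```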
### Interpret the $R^2$ value.
The regression model explains a proportion
$R^2 = `r signif(summary(lm.fit)$r.squared, 3)`$
of the variation in the response
beyond what the mean value alone explains.
In this case, this is a rather large value.
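One way to check this value: in simple linear regression, $R^2$ is the square of the correlation between the two variables, so it can be computed directly (a sketch using the variables fit above):

```{R}
# R^2 equals the squared correlation in simple linear regression
r <- cor(nesarc.sub$DaysSmoke, nesarc.sub$TotalCigsSmokedLog2,
         use = "pairwise.complete.obs")
r^2  # matches summary(lm.fit)$r.squared
```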
### Interpret the intercept
The interpretation below doesn't make sense because it extrapolates beyond the range of our $x$-variable data.
We have limited our study to smokers, thus we don't have 0-day smokers.
A better intercept might be at 30 days, which would be for the daily smokers --
that's the largest smoking group in our data.
The intercept is `r signif(summary(lm.fit)$coefficients[1,1], 3)`,
thus the model predicts that a person who smokes 0 days per month smokes
`r round(summary(lm.fit)$coefficients[1,1], 0)` on the log2 scale.
Note that when the number of days smoked is 0,
we would expect the number of cigarettes to be 0.
Recall that this variable was created with `TotalCigsSmokedLog2 = log(TotalCigsSmoked + 1, base = 2)`.
Thus, we added 1 and then took the log.
Therefore, 0 cigarettes on the original scale corresponds to 0 on the log2 scale,
because $0 + 1 = 1$ and $\log_2(1) = 0$.
Working backwards to the original (non-log2) scale,
the model predicts
$2^{`r signif(summary(lm.fit)$coefficients[1,1], 3)`} - 1 = `r signif(2^summary(lm.fit)$coefficients[1,1] - 1, 3)`$
cigarettes at the intercept.
--------------------------------------------------------------------------------
# Week13: Complete poster in Project document
### Title
Smoking behavior is (barely) associated with major depression
### 1. Introduction
* __Smoking behavior is associated with major depression.__
- Longitudinal investigations have shown that depression increases risk
of later smoking [@breslau_major_1998; @dierker_association_2001].
- Many subsequent studies find a role of major depression in increasing
the probability and amount of smoking [@dierker_defining_2004;
@rohde_psychiatric_2004; @rohde_psychiatric_2003].
* __Smoking behavior and Nicotine dependence.__
- While smoking is a necessary requirement for nicotine dependence, smoking
behavior is a poor predictor of developing nicotine dependence
[@kandel_extent_2000; @stanton_antecedents_1995].
- Heavy smokers may not have nicotine dependence, conversely
light smokers may have nicotine dependence [@kandel_extent_2000].
### 2. Research Questions
* (Week02 Research Questions) 2 bullets, one for each research question, stated as the alternative hypothesis.
The goals of the analysis include answering these two questions:
* Is there an association between major depression and nicotine dependence?
* Does the prevalence of nicotine dependence differ by ethnicity?
### 3. Methods
* (Week01 Personal Codebook) Data source(s).
A sample from the first wave of the National Epidemiologic Survey on
Alcohol and Related Conditions (NESARC) (US adults).
We focus on young adult (18--25) smokers
(`r nrow(nesarc.sub)` of `r nrow(NESARC)` respondents).
* (Week01 Personal Codebook) Variables used.
* Lifetime major depression (`Depression`, 0=No, 1=Yes) [@grant_alcohol_2003; @grant_alcohol_1995].
* Total cigarettes per month (`TotalCigsSmoked` = days smoked times daily cigs smoked, range: 0--`r max(nesarc.sub$TotalCigsSmoked)`),
a combination of these questions:
"About how often did you usually smoke in the past year?"
and "On the days that you smoked in the last year,
about how many cigarettes did you usually smoke?"
* Nicotine dependence (`TobaccoDependence`, 0=No, 1=Yes) in the last 12 months.
* (Various) Statistical methods used to answer the research questions.
* A two-sample $t$-test comparing the mean total cigarettes smoked by depression status.
* A $\chi^2$ analysis of a two-way contingency table of nicotine dependence by ethnicity.
### 4. Results for your first research question.
* "Is the population mean square-root total cigarettes smoked different for those with depression or not?"
* $H_0: \mu_{D} = \mu_{ND}$ versus $H_A: \mu_{D} \ne \mu_{ND}$
```{R, echo=FALSE}
## If we create a summary data.frame with a similar structure as our data, then we
## can annotate our plot with those summaries.
#library(plyr)
est.mean.TCS.D <- plyr::ddply(nesarc.sub
, "Depression"
, summarise
, TotalCigsSmoked = mean(TotalCigsSmoked, na.rm = TRUE)
, TotalCigsSmokedSqrt = mean(TotalCigsSmokedSqrt, na.rm = TRUE)
)
# est.mean.TCS.D
# library(ggplot2)
# p <- ggplot(nesarc.sub, aes(x = Depression, y = TotalCigsSmokedSqrt))
# p <- p + geom_boxplot(width = 0.5, alpha = 0.5)
# p <- p + geom_jitter(position = position_jitter(width = 0.1), alpha = 1/4)
# # diamond at mean for each group
# p <- p + stat_summary(fun.y = mean, geom = "point", shape = 18, size = 6,
# colour = "red", alpha = 0.8)
# p <- p + labs(title = "Square-root of Total Cigarettes Smoked by Depression Status")
# print(p)
# histogram using ggplot
p <- ggplot(nesarc.sub, aes(x = TotalCigsSmoked))
p <- p + geom_histogram(binwidth = 2)
p <- p + geom_rug(alpha = 0.1)
p <- p + geom_vline(data = est.mean.TCS.D, aes(xintercept = TotalCigsSmokedSqrt^2), colour = "red")
p <- p + facet_grid(Depression ~ .)
p <- p + scale_x_sqrt(breaks = c(0, 5, 25, 50, 100, 250, 500, 1000, 2000, 3000))
print(p)
```
```{R, echo=FALSE, results="asis"}
t.summary.TCS.D <- t.test(TotalCigsSmokedSqrt ~ Depression, nesarc.sub)
t.summary.TCS.D
#names(t.summary.TCS.D)
t.summary.TCS.DXX <- t.test(TotalCigsSmoked ~ Depression, nesarc.sub)
t.summary.TCS.DXX
```
* Model assumptions are met, so the sampling distribution of the difference in means is approximately normal.
* Because $p=`r signif(t.summary.TCS.D$p.value, 3)` > 0.05$ (with $t_{s} = `r signif(t.summary.TCS.D$statistic, 4)`$),
we have insufficient evidence to reject $H_0$ at an $\alpha=0.05$ significance level;
the data do not show that mean total cigarettes smoked differs by depression
status.
* The difference is very small: on the square-root scale it is `r signif(diff(t.summary.TCS.D$estimate), 3)`,
and on the natural scale the group means are `r signif(t.summary.TCS.D$estimate^2, 3)`, a difference of
`r round(diff(t.summary.TCS.D$estimate^2))` cigarettes per month.
### 5. Results for your second research question.
* Is there an association between TobaccoDependence and Ethnicity?
```{R, echo=FALSE}
# Tabulate by two categorical variables:
tab.E.TD <- xtabs(~ Ethnicity + TobaccoDependence, data = nesarc.sub)
#tab.E.TD
# column proportions
#prop.table(tab.E.TD, margin = 2)
```
```{R, echo=FALSE}
chisq.E.TD <- chisq.test(tab.E.TD, correct=FALSE)
#chisq.E.TD
```
```{R, echo=FALSE}
# table of expected frequencies:
#chisq.E.TD$expected
# smallest expected frequency:
#min(chisq.E.TD$expected)
```
The model assumptions are met since the expected count for each cell is at least 5.
Because the p-value $= `r signif(chisq.E.TD$p.value, 3)` < 0.05$ ($X^2 = `r signif(chisq.E.TD$statistic, 3)`$)
we reject the null hypothesis
concluding that there is an association between
TobaccoDependence and Ethnicity.
```{R, echo=FALSE}
# The Pearson residuals
#chisq.E.TD$residuals
# The sum of the squared residuals is the chi-squared statistic:
#chisq.E.TD$residuals^2
#sum(chisq.E.TD$residuals^2)
```
```{R, echo=FALSE, fig.width = 6, fig.height = 6}
# mosaic plot
library(vcd)
# this layout gives us the interpretation we want:
#mosaic(tab.E.TD, shade=TRUE, legend=TRUE)
mosaic(tab.E.TD, shade=TRUE, legend=TRUE, rot_labels = c(0, 90, 0, 0))
#, rot_varnames = c(0, 90, 0, 0)
#, labeling = labeling_border(just_labels = "right"))
#mosaic(~ Ethnicity + TobaccoDependence, data = nesarc.sub, shade=TRUE, legend=TRUE, direction = "v")
```
### Interpret the mosaic plot with reference to the Pearson residuals.
The primary cause of rejecting $H_0$ is that
the Hispanic group has lower nicotine dependence than expected under independence.
### 6. Discussion
* Put the results you found for each research question in the context you provided in your introduction.
These results support the previous literature that individuals with major
depression are more sensitive to the development of nicotine dependence
regardless of how much they smoke. Thus, people with depression are an
important population subgroup for targeted smoking intervention programs.
### 7. Further directions or Future work or Next steps or something else that indicates there is more to do and you've thought about it.
* What do these results lead you to want to investigate?
We can extend focus from young adults to all adults, and (with multiple
regression techniques we'll learn in ADA2) assess the particular depressed
subpopulations most susceptible to developing nicotine dependence.
### 8. References
* By citing sources in your introduction, this section will automatically have your bibliography.
* [Bibliography will go here]
--------------------------------------------------------------------------------
## Class 27 Poster Presentation
# References