---
title: "ADA1: All assignments, Erik's NESARC Project"
author: Erik Erhardt
date: 10/26/2016
output:
  html_document:
    toc: true
bibliography: NESARC_NicotineDependence_LitReview.bib
nocite: |
  @breslau_major_1998,
  @dierker_association_2001,
  @dierker_defining_2004,
  @grant_alcohol_1995,
  @grant_alcohol_2003,
  @kandel_extent_2000,
  @rohde_psychiatric_2003,
  @rohde_psychiatric_2004,
  @stanton_antecedents_1995
---
# Document overview
This document is organized by Week number.
The in-class assignments are indicated by Tuesday and Thursday.
Each week's homework is often a combination of Tuesday and Thursday,
with a small extension.
Therefore, "fleshing out" the Tuesday and Thursday sections with a little addition
is often sufficient for your homework assignment.
Consider your readers (graders):
* organize the document clearly (use this document as an example)
* label minor sections under each day (use this document as an example)
* for each thing you do, always have these three parts:
    1. Say what you're going to do and why.
    2. Do it with code, and document your code.
    3. Interpret the results.
## Global code options
```{R}
# I set some GLOBAL R chunk options here.
# (to hide this message add "echo=FALSE" to the code chunk options)
# In particular, see the fig.height and fig.width (in inches)
# and notes about the cache option.
knitr::opts_chunk$set(comment = NA, message = FALSE, warning = FALSE)
options(width = 100)  # width is an R option (console width), not a chunk option
knitr::opts_chunk$set(fig.align = "center", fig.height = 4, fig.width = 6)
# Note: The "cache=TRUE" option will save the computations of code chunks
# so R doesn't recompute everything every time you recompile.
# This can save _tons of time_ if you're working on a small section of code
# at the bottom of the document.
# Code chunks will be recompiled only if they are edited.
# The autodep=TRUE will also update dependent code chunks.
# A folder is created with the cache in it -- if you delete the folder, then
# all the code chunks will recompute again.
# ** If things are not working as expected, or I want to freshly compute
# everything, I delete the *_cache folder.
knitr::opts_chunk$set(cache = TRUE) #, autodep=TRUE) #$
```
__Starting at Week03, we will concatenate all our assignments together to retain the
relevant information needed for subsequent sections.
You will also have an opportunity to revisit previous parts to make changes or improvements,
such as updating your codebook, modifying your research questions, improving tables and plots.
Please organize your headers using the \# symbol appropriately, and generate a table of contents
with "toc: true" in the YAML header (see any previous Rmd file on the website).__
----------------------------------------
# Week01: Personal Codebook
## Question of interest
## Thursday ---------
__Dataset__: National Epidemiologic Survey on Alcohol and Related Conditions (NESARC), with codebook wv1codebook.pdf.
__Initial thinking__: _While nicotine dependence is a good starting point, I
need to determine what it is about nicotine dependence that I am interested in.
It strikes me that friends and acquaintances that I have known through the
years that became hooked on cigarettes did so across very different periods of
time. Some seemed to be dependent soon after their first few experiences with
smoking and others after many years of generally irregular smoking behavior. I
decide that I am most interested in exploring the association between level of
smoking and nicotine dependence. I add to my codebook variables reflecting
smoking levels (e.g. smoking quantity and frequency)._
__Topic of interest__:
Is there an association between nicotine dependence
and the frequency and quantity of smoking among people up to 25 years old?
The association may differ by ethnicity, age, gender, and other factors
(though we won't be able to test these additional associations until next semester in ADA2).
__How I did it__: I looked through the codebook wv1codebook.pdf and found some
variables of interest. I searched the text with "Ctrl-F" (find) to locate these
variables.
For each variable, I copied/pasted the description here, then formatted it so it's
organized. You can choose to use a table or an outline format. I found this
text format to be very easy to work with. I retained the "frequency" of each
response because it's interesting to know, and because it was already in the
codebook --- this value is not required for your codebook.
## Codebook
```
Dataset: NESARC
Primary association: nicotine dependence vs frequency and quantity of smoking

Key:
VarName (RenamedVarName)
  Variable description
  Data type (Continuous, Discrete, Nominal, Ordinal)
  Frequency  ItemValue. Description

IDNUM
  UNIQUE ID NUMBER WITH NO ALPHABETICS
  Nominal
  43093  1-43093. Unique Identification number

SEX (Sex)
  SEX
  Nominal
  18518  1. Male
  24575  2. Female

AGE (Age)
  AGE
  Continuous
  43079  18-97. Age in years
     14  98. 98 years or older

CHECK321 (SmokingStatus)
  CIGARETTE SMOKING STATUS
  Nominal
   9913  1. Smoked cigarettes in the past 12 months
   8078  2. Smoked cigarettes prior to the last 12 months
     22  9. Unknown
  25080  BL. NA, never or unknown if ever smoked 100+ cigarettes

TAB12MDX (TobaccoDependence)
  NICOTINE DEPENDENCE IN THE LAST 12 MONTHS
  Nominal
  38131  0. No nicotine dependence
   4962  1. Nicotine dependence

S3AQ3B1 (SmokingFreq)
  USUAL FREQUENCY WHEN SMOKED CIGARETTES
  Ordinal
  14836  1. Every day
    460  2. 5 to 6 Day(s) a week
    687  3. 3 to 4 Day(s) a week
    747  4. 1 to 2 Day(s) a week
    409  5. 2 to 3 Day(s) a month
    772  6. Once a month or less
    102  9. Unknown
  25080  BL. NA, never or unknown if ever smoked 100+ cigarettes

S3AQ3C1 (DailyCigsSmoked)
  USUAL QUANTITY WHEN SMOKED CIGARETTES
  Discrete
  17751  1-98. Cigarette(s)
    262  99. Unknown
  25080  BL. NA, never or unknown if ever smoked 100+ cigarettes

ETHRACE2A (Ethnicity)
  IMPUTED RACE/ETHNICITY
  Nominal
  24507  1. White, Not Hispanic or Latino
   8245  2. Black, Not Hispanic or Latino
    701  3. American Indian/Alaska Native, Not Hispanic or Latino
   1332  4. Asian/Native Hawaiian/Pacific Islander, Not Hispanic or Latino
   8308  5. Hispanic or Latino

MAJORDEPLIFE (Depression)
  MAJOR DEPRESSION - LIFETIME (NON-HIERARCHICAL)
  Nominal
  35254  0. No
   7839  1. Yes
```
Additional variables were created from the original variables:
```
CREATED VARIABLES

DaysSmoke
  estimated number of days per month a subject smokes
  Continuous (due to the way it is estimated; assumes 30 days per month, based on SmokingFreq)
  1-30.

TotalCigsSmoked
  estimated number of cigarettes per month a subject smokes (DaysSmoke * DailyCigsSmoked)
  Continuous
  0-large.

CigsSmokedFac
  quartiles of TotalCigsSmoked
  Ordinal
  ranges for each of the four quarters

SmokingFreq3
  three levels of smoking frequency, from SmokingFreq
  Ordinal
  1. daily   = 1. Every day; 2. 5 to 6 Day(s) a week; 3. 3 to 4 Day(s) a week
  2. weekly  = 4. 1 to 2 Day(s) a week; 5. 2 to 3 Day(s) a month
  3. monthly = 6. Once a month or less
```
----------------------------------------
# Week02: Literature Review
__Dataset__: National Epidemiologic Survey on Alcohol and Related Conditions (NESARC), with codebook wv1codebook.pdf.
Given the association that I have decided to examine, I use such keywords as
_nicotine dependence_, _tobacco dependence_, and _smoking_. After reading
through several titles and abstracts, I notice that there has been relatively
little attention in the research literature to the association between smoking
exposure and nicotine dependence. I expand a bit to include other substance use
that provides relevant background as well.
## Tuesday ---------
## Research questions
Based on my reading of the articles in the **References** as well as others, I
have noted a few common and interesting themes:
1. While it is true that smoking exposure is a necessary requirement for
nicotine dependence, frequency and quantity of smoking are markedly imperfect
indices for determining an individual's probability of exhibiting nicotine
dependence (this is true for other drugs as well).
2. The association may differ based on ethnicity, age, and gender (although
there is little work on this).
3. One of the most potent risk factors consistently implicated in the etiology
of smoking behavior and nicotine dependence is depression.
I have decided to further focus my question by examining whether the
association between nicotine dependence and depression differs based on how
much a person smokes. I am wondering if at low levels of smoking compared to
high levels, nicotine dependence is more common among individuals with major
depression than those without major depression. I add relevant depression
questions/items/variables to my personal codebook as well as several
demographic measures (age, gender, ethnicity, etc.) and any other variables I
may wish to consider.
All required variables have been found and added to my personal codebook (by expanding Week01).
## Thursday ---------
## Citations
@breslau_major_1998,
@dierker_association_2001,
@dierker_defining_2004,
@grant_alcohol_1995,
@grant_alcohol_2003,
@kandel_extent_2000,
@rohde_psychiatric_2003,
@rohde_psychiatric_2004,
@stanton_antecedents_1995
----------------------------------------
# (Week02a Research Proposal -- skipping this step in ADA1)
__The Association between Nicotine Dependence and Major Depression at Different Levels of Smoking Exposure__
_In our class we won't do this assignment.
However, this example illustrates the next step toward clarifying your ideas and
writing introductory, methods, and discussion content for your poster._
## Introduction
One of the most potent risk factors consistently implicated in both the
etiology of smoking behavior as well as the subsequent development of nicotine
dependence is major depression. Evidence for this association comes from
longitudinal investigations in which depression has been shown to increase risk
of later smoking [@breslau_major_1998; @dierker_association_2001]. This
temporal ordering suggests the possibility of a causal relationship. In fact,
the vast majority of research to date has focused on the role of major
depression in increasing the probability and amount of smoking
[@dierker_defining_2004; @rohde_psychiatric_2004; @rohde_psychiatric_2003].
While it is true that smoking exposure is a necessary requirement for nicotine
dependence, frequency and quantity of smoking are markedly imperfect indices
for determining an individual's probability of developing nicotine dependence
[@kandel_extent_2000; @stanton_antecedents_1995]. For example, a substantial
number of individuals reporting daily and/or heavy smoking do not meet criteria
for nicotine dependence [@kandel_extent_2000]. Conversely, nicotine dependence
has been seen among population subgroups reporting relatively low levels of
daily and non-daily smoking [@kandel_extent_2000].
A complementary or alternate role that major depression may play is as a cause
or signal of greater sensitivity to nicotine dependence, over and above an
individual's level of smoking exposure. While major depression has been shown
to increase an individual's probability of smoking initiation, regular use
and nicotine dependence, it remains unclear whether it may signal greater
sensitivity for nicotine dependence regardless of smoking quantity.
The present study will examine young adults from the National Epidemiologic
Survey on Alcohol and Related Conditions (NESARC). The goals of the analysis
will include 1) establishing the relationship between major depression and
nicotine dependence; and 2) determining whether or not the relationship between
major depression and nicotine dependence exists above and beyond smoking
quantity.
## Methods
### Sample
The sample from the first wave of the National Epidemiologic Survey on Alcohol
and Related Conditions (NESARC) represents the civilian, non-institutionalized
adult population of the United States, and includes persons living in
households, military personnel living off base, and persons residing in the
following group quarters: boarding or rooming houses, non-transient hotels and
motels, shelters, facilities for housing workers, college quarters, and group
homes. The NESARC included oversampling of Blacks, Hispanics, and young adults
aged 18 to 24 years. The sample included 43,093 participants.
### Procedure
One adult was selected for interview in each household, and face-to-face
computer assisted interviews were conducted in respondents' homes following
informed consent procedures.
### Measures
Lifetime major depression (i.e., episodes experienced in the past 12 months and
prior to the past 12 months) was assessed using the NIAAA Alcohol Use
Disorder and Associated Disabilities Interview Schedule DSM-IV (AUDADIS-IV)
[@grant_alcohol_2003; @grant_alcohol_1995]. The tobacco module of the
AUDADIS-IV contains detailed questions on the frequency, quantity, and
patterning of tobacco use as well as symptom criteria for DSM-IV nicotine
dependence. Current smoking was evaluated through both smoking frequency
("About how often did you usually smoke in the past year?") coded dichotomously
in terms of the presence or absence of daily smoking, and quantity ("On the
days that you smoked in the last year, about how many cigarettes did you
usually smoke?").
### Implications
While chronic use is a key feature in the development of dependence, the
present study will evaluate whether individual differences in nicotine
dependence exist above and beyond level of exposure. If individuals with major
depression are more sensitive to the development of nicotine dependence
regardless of how much they smoke, they would represent an important population
subgroup for targeted smoking intervention programs.
----------------------------------------
# Week03: Data Subset, Univariate Summaries And Plots
## Purpose of study
(Note: This is the last paragraph of the research proposal used to remind the
reader of the research objectives.)
The present study will examine young adults from the National Epidemiologic
Survey on Alcohol and Related Conditions (NESARC). The goals of the analysis
will include 1) establishing the relationship between major depression and
nicotine dependence; and 2) determining whether or not the relationship between
major depression and nicotine dependence exists above and beyond smoking
quantity.
## Variables
Variables from NESARC that will be used include (already in codebook, above):
* `IDNUM` (unique ID number with no alphabetics),
* `MAJORDEPLIFE` (has subject experienced a major depression?),
* `CHECK321` (has subject smoked cigarettes in past 12 months?),
* `AGE` (age of subject),
* `TAB12MDX` (tobacco dependence past 12 months),
* `S3AQ3B1` (usual frequency when cigarettes smoked),
* `ETHRACE2A` (ethnicity of subject),
* `SEX` (gender of subject), and
* `S3AQ3C1` (usual smoking quantity of cigarettes when smoked).
## Tuesday ---------
## Data subset
First, the data is placed on the search path using the `PDS` package.
```{R}
library(PDS)
dim(NESARC)
```
The variables of interest are specified and a subset of the dataset including
only these variables is created and stored in the data frame `nesarc.sub`.
```{R}
# variables to include in our data subset
var.list <- c("IDNUM"
, "SEX"
, "AGE"
, "CHECK321"
, "TAB12MDX"
, "S3AQ3B1"
, "S3AQ3C1"
, "ETHRACE2A"
, "MAJORDEPLIFE")
var.list
# subset of data
nesarc.sub <- subset(NESARC, select = var.list)
```
Check size and structure of data subset.
Note that categorical variables are already `Factor` variables, but the levels do not have meaningful labels, yet.
```{R}
# size of subset
dim(nesarc.sub)
# structure of subset
str(nesarc.sub)
```
### Renaming Variables
Rename
```{R}
# note: if the "dplyr" package is also loaded, its rename() masks plyr's
# and this syntax may not work correctly
library(plyr) # for rename(dat, c("from" = "to"))
nesarc.sub <- rename(nesarc.sub, c(
"SEX" = "Sex"
, "AGE" = "Age"
, "CHECK321" = "SmokingStatus"
, "TAB12MDX" = "TobaccoDependence"
, "S3AQ3B1" = "SmokingFreq"
, "S3AQ3C1" = "DailyCigsSmoked"
, "ETHRACE2A" = "Ethnicity"
, "MAJORDEPLIFE" = "Depression"
))
```
A quick summary of the variables indicates many missing values.
```{R}
# summary of variables -- notice many NAs
summary(nesarc.sub)
```
### AddHealth, joining waves 1 and 4 together
Joining AddHealth waves 1 and 4 together into a single dataset
can be done if you want to use variables from when
the participants were both adolescents and adults.
```{R}
# Original AddHealth datasets
library(PDS)
# size of datasets, to verify the data is available
dim(AddHealth)
dim(addhealth_public4)
# First column is the unique ID
colnames(AddHealth)[1]
colnames(addhealth_public4)[1]
# Rename W4's unique ID variable "aid" to "AID"
# so both W1 and W4 have same variable name (lower/uppercase matters)
library(plyr)
addhealth_public4 <- rename(addhealth_public4, c("aid" = "AID"))
# note that the first column name has been updated
colnames(addhealth_public4)[1]
# Join two datasets by unique ID
AH14 <- join(AddHealth, addhealth_public4, by = "AID", type = "full")
# now we have twice as many columns
dim(AH14)
# Removing this large object since I don't need this data for this project
rm(AH14)
```
### Week04: Thursday ---------
### Data Cleaning
From the reading (ADA1 Ch 11 notes), there are three primary steps of data cleaning.
1. Define "edit rules" for detecting obvious inconsistencies for your variables.
2a. Determine which observations violate the rules and
2b. try to localize the reason for the violation.
3. Correct the errors.
#### 1. Define "edit rules" for detecting obvious inconsistencies for your variables
You should have at least one rule for each variable in your data subset.
Possibly some "mixed" rules if there are sensible variable relationships.
By writing these rules, I discovered I had not coded values from the codebook correctly.
For example, blank (`NA`) values actually mean "never or unknown if ever smoked 100+ cigarettes" for
`CHECK321 (SmokingStatus)`,
`S3AQ3B1 (SmokingFreq)`, and
`S3AQ3C1 (DailyCigsSmoked)` ---
so I updated the code above to
first recode `NA` to `0`, then `9` to `NA`!
_It is common to correct previous decisions as you learn new things by looking closer at your data._
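The recoding order matters; a tiny hypothetical vector (made up, not the NESARC data) illustrates why the blanks must be recoded before the `9`s:

```{R}
# hypothetical responses: blank (NA) means "never smoked", 9 means "unknown"
x <- c(1, 2, 9, NA, NA, 1)
x[is.na(x)] <- 0   # first: blanks become 0 (never smoked)
x[x == 9]   <- NA  # then: code 9 becomes missing
x
# reversing the two steps would turn the true unknowns (9 -> NA) into 0s
```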
```{R}
# I have saved my edit rules into this file (in the same folder as this .Rmd file):
fn.edit.rules <- "ADA1_HW_04_NESARC_DataCleaningEditRules.txt"
# You don't need to do this, but I think it's handy to verify that the correct
# file is being read.
# This is the content of the text file:
readLines(fn.edit.rules)
# Encode these rules using the editrules package
library(editrules)
EditRules.N.sub <- editfile(fn.edit.rules)
EditRules.N.sub
```
Show dependencies between rules and variables.
Blue circles represent variables and yellow boxes represent restrictions.
The lines indicate which restrictions pertain to what variables.
```{R}
op <- par(no.readonly = TRUE) # save plot settings
par(mfrow=c(1,1), mar = c(0,0,0,0))
plot(EditRules.N.sub)
par(op) # restore plot settings
```
#### 2a. Determine which observations violate the rules and
If there are several isolated issues, then best to correct those.
_Alternatively, if there are many errors this may indicate that your edit rules are incorrect;
maybe you don't understand your data._
```{R, fig.height = 8}
ve.N.sub <- violatedEdits(EditRules.N.sub, nesarc.sub)
summary(ve.N.sub)
plot(ve.N.sub)
```
Check whether all the violations are due to missing values.
Based on the numbers, they most certainly are, but best to make sure.
```{R}
## Strategy:
# Determine which edit rules pertain to which variables
# Find the NAs in the violatedEdits result and in the data set
# If these all match, and there are the same number as violations,
# then all the violations are due to NAs.
# The column names of the violatedEdits
colnames(ve.N.sub)
# The edits with violations
summary(ve.N.sub)
# The variables assigned to each edit name
EditRules.N.sub
# are all TRUEs for editrule dat2 the same as NAs for DailyCigsSmoked?
all(ve.N.sub[,"dat2"] == is.na(nesarc.sub$DailyCigsSmoked))
# are all TRUEs for editrule dat6 the same as NAs for SmokingFreq?
all(ve.N.sub[,"dat6"] == is.na(nesarc.sub$SmokingFreq))
# are all TRUEs for editrule dat7 the same as NAs for SmokingStatus?
all(ve.N.sub[,"dat7"] == is.na(nesarc.sub$SmokingStatus))
## If there were any FALSE values,
## that would indicate that at least one violatedEdit was not due to an NA.
## That would need to be investigated and corrected.
#$
```
Note that all the violations are `NA` values; thus, there are no issues with
the variables I have (fortunately).
#### 2b. Try to localize the reason for the violation.
Here, the `le.*` object contains some processing metadata and a logical array
labeled `adapt` which indicates the minimal set of variables to be altered
in each record.
It can be used in correction and imputation procedures for filling in valid values.
The values of `TRUE` are the recommended changes.
I don't run this code since it takes a long time and doesn't give much additional
information over the summary above from `violatedEdits()`.
```
# NOT RUN
# This can take a long time on a large data set
le.N.sub <- localizeErrors(EditRules.N.sub, nesarc.sub, method = "mip")
le.N.sub$adapt #$
```
The behaviour of `localizeErrors()` can be tuned with various options.
It is possible to supply a confidence weight for each variable, allowing
fine-grained control over which values should be adapted.
It is also possible to choose a branch-and-bound based solver (instead of the
MIP solver used here), which is typically slower but allows for more control.
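For example (not run; the `weight` argument is described in `?localizeErrors`, and the values below are purely illustrative), down-weighting the smoking variables would make the solver prefer adapting them rather than the demographics:

```
# NOT RUN -- illustrative sketch only
# lower weight = less confidence in the variable = more likely to be adapted
w <- c(IDNUM = 10, Sex = 10, Age = 10, SmokingStatus = 1, TobaccoDependence = 5
     , SmokingFreq = 1, DailyCigsSmoked = 1, Ethnicity = 10, Depression = 10)
le.N.sub.w <- localizeErrors(EditRules.N.sub, nesarc.sub, method = "mip", weight = w)
le.N.sub.w$adapt #$
```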
#### 3. Correct the errors
For the purposes of ADA1, we will manually correct errors, either by replacing
values or by excluding observations.
Manual corrections example (not run in my code):
```
people[2, "status"] <- "single"
people[5, "height"] <- 7
people[5, "agegroup"] <- "adult"
summary(violatedEdits(EditRules.N.sub, people[id, ]))
```
### End Week04 data cleaning ----
### Back to Week03
### Subset data to young smokers
Next, a _subset of people 25 or younger_ who _smoked in the last 12 months_ is
created using the `subset` function.
Note that `SmokingStatus == 1` selects subjects who smoked in the past 12
months.
Notice that the missing values have also been removed, since comparisons
involving `NA` evaluate to `NA` and `subset()` drops such rows.
```{R}
# the subset command below will not include a row that evaluates to NA for the
# specified subset condition. For example:
(NA == 1)
# age and smoking subset
nesarc.sub <- subset(nesarc.sub
, subset = (Age <= 25) # people 25 or younger
& (SmokingStatus == 1) # smoked in the past 12 months
)
dim(nesarc.sub)
summary(nesarc.sub)
```
### Coding missing values
Recall that many variables are coded in the codebook.
From the codebook, these are the variables and the codes for the missing or `unknown` values.
```
CHECK321 (SmokingStatus)
22 9. Unknown
S3AQ3B1 (SmokingFreq)
102 9. Unknown
S3AQ3C1 (DailyCigsSmoked)
262 99. Unknown
```
Replace the coded missing values with `NA`,
then remove the remaining records with missing values.
Not all of these variables need this operation, since the previous subset commands
happened to remove the missing values.
However, I'm going to run each as an extra precaution.
After all, if we were later to make different data decisions upstream,
we could end up with incorrectly coded missing values at this point
without this step!
```{R}
# Note for this first variable, there are still three factor levels
str(nesarc.sub$SmokingStatus)
# even though only the first category has observations.
table(nesarc.sub$SmokingStatus)
# Replace any values that are "9" with NA
nesarc.sub$SmokingStatus[nesarc.sub$SmokingStatus == 9] <- NA
table(nesarc.sub$SmokingStatus)
# Then drop unused factor categories.
nesarc.sub$SmokingStatus <- droplevels(nesarc.sub$SmokingStatus)
table(nesarc.sub$SmokingStatus)
# Repeat this for the other two variables, using table() to see effect:
table(nesarc.sub$SmokingFreq)
nesarc.sub$SmokingFreq[nesarc.sub$SmokingFreq == 9] <- NA
table(nesarc.sub$SmokingFreq)
nesarc.sub$SmokingFreq <- droplevels(nesarc.sub$SmokingFreq)
table(nesarc.sub$SmokingFreq)
# numeric variable, no levels to drop
nesarc.sub$DailyCigsSmoked[nesarc.sub$DailyCigsSmoked == 99] <- NA
summary(nesarc.sub)
```
Finally, consider using `na.omit()` to remove any records with missing values
if it won't cause issues with analysis.
The command `na.omit()` removes a row if any columns have an `NA` in that row.
This can have __drastic__ consequences.
For example, imagine that you have 7 variables and 6 variables have complete data,
but the 7th variable is about three-quarters `NA`s!
Then `na.omit()` will drop three-quarters of your data,
even though most of your variables are complete!
There's only one advantage I can think of for running `na.omit()`,
and that's if you have a large dataset and only a few observations have a few `NA`s.
You can't analyze data with `NA`s,
since any calculation involving `NA` results in an `NA`.
Therefore, imagine two scenarios:
(1) An analysis with Variables 1 and 2, each with a few `NA`s distributed throughout;
first the observations with `NA` for either Variable 1 or 2 are removed, then
the analysis proceeds.
(2) A similar analysis, but now with Variables 1 and 3, where Variable 3 has `NA`s
for different observations than Variable 2 --- therefore, different observations
are removed before the analysis.
In summary, these two analyses will include different observations from the dataset.
By running `na.omit()` first, you start with a dataset with no `NA`s,
so every analysis is of exactly the same set of data
(at the cost of not using all the possible data in each individual analysis).
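To make this concrete, here is a tiny sketch with made-up data (not NESARC) where one mostly-missing column causes `na.omit()` to discard most rows:

```{R}
# hypothetical data: v1 and v2 are complete, v3 is mostly missing
toy <- data.frame(v1 = 1:8
                , v2 = letters[1:8]
                , v3 = c(10, NA, NA, NA, 20, NA, NA, NA))
# 8 rows before
nrow(toy)
# only the 2 complete rows survive na.omit()
nrow(na.omit(toy))
```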
Below, I do __not run `na.omit()`__ since I'm concerned about having lots of `NA`s for some variables.
```{R}
dim(nesarc.sub)
# remove rows with missing values
#nesarc.sub <- na.omit(nesarc.sub)
dim(nesarc.sub)
summary(nesarc.sub)
```
Since `na.omit()` was not run, the dataset may still contain some missing values;
they will be excluded within each analysis as needed.
## Creating new variables
Additional variables that we created from the original variables are listed in
this amendment to the original codebook.
```
CREATED VARIABLES

DaysSmoke
  estimated number of days per month a subject smokes
  Continuous (due to the way it is estimated; assumes 30 days per month, based on SmokingFreq)
  1-30.

TotalCigsSmoked
  estimated number of cigarettes per month a subject smokes (DaysSmoke * DailyCigsSmoked)
  Continuous
  0-large.

CigsSmokedFac
  quartiles of TotalCigsSmoked
  Ordinal
  ranges for each of the four quarters

SmokingFreq3
  three levels of smoking frequency, from SmokingFreq
  Ordinal
  1. daily   = 1. Every day; 2. 5 to 6 Day(s) a week; 3. 3 to 4 Day(s) a week
  2. weekly  = 4. 1 to 2 Day(s) a week; 5. 2 to 3 Day(s) a month
  3. monthly = 6. Once a month or less
```
### From categories to numeric
`DaysSmoke` estimates the days per month a subject smokes
by converting `SmokingFreq` (a factor with 6 levels) to a numeric variable using `as.numeric()`,
then recoding each level to the midpoint of its frequency range, scaled to a 30-day month.
```{R}
nesarc.sub$DaysSmoke <- as.numeric(nesarc.sub$SmokingFreq)
nesarc.sub$DaysSmoke[nesarc.sub$DaysSmoke == 1] <- 30 # 1. Every day
nesarc.sub$DaysSmoke[nesarc.sub$DaysSmoke == 2] <- 4*5.5 # 2. 5 to 6 Day(s) a week
nesarc.sub$DaysSmoke[nesarc.sub$DaysSmoke == 3] <- 4*3.5 # 3. 3 to 4 Day(s) a week
nesarc.sub$DaysSmoke[nesarc.sub$DaysSmoke == 4] <- 4*1.5 # 4. 1 to 2 Day(s) a week
nesarc.sub$DaysSmoke[nesarc.sub$DaysSmoke == 5] <- 2.5 # 5. 2 to 3 Day(s) a month
nesarc.sub$DaysSmoke[nesarc.sub$DaysSmoke == 6] <- 1 # 6. Once a month or less
summary(nesarc.sub$DaysSmoke)
```
### From numeric to categories based on quantiles
The variable `TotalCigsSmoked` estimates the number of cigarettes
a subject smokes per month by multiplying `DaysSmoke` by `DailyCigsSmoked`
(the usual quantity smoked per day).
```{R}
nesarc.sub$TotalCigsSmoked <- nesarc.sub$DaysSmoke * nesarc.sub$DailyCigsSmoked
```
The numeric variable `TotalCigsSmoked` is converted into the factor
`CigsSmokedFac` using the `cut` function, with roughly equal numbers of
observations in each of the four levels (quartile groups).
```{R}
quantile(nesarc.sub$TotalCigsSmoked, na.rm = TRUE)
nesarc.sub$CigsSmokedFac <- cut(nesarc.sub$TotalCigsSmoked
, breaks = quantile(nesarc.sub$TotalCigsSmoked, na.rm = TRUE)
, include.lowest = TRUE)
summary(nesarc.sub$CigsSmokedFac)
```
### From many categories to a few
The variable `SmokingFreq3` collapses the `SmokingFreq` from 6 down to 3 categories.
This is an example to show you how to do this with your own variables.
Some of you will have categorical variables that have a dozen or so categories;
having so many categories makes interpretation very difficult --- fewer is easier.
```{R}
## SmokingFreq3 variable
# notice that I'm doing this before I've assigned the labels
# if you do this after assigning labels (with the factor() command), then you'll want to use "as.numeric()" as below
# inspect original variable
table(nesarc.sub$SmokingFreq)
# look at numeric values
table(as.numeric(nesarc.sub$SmokingFreq))
# Note: it is sometimes easier to use the numeric "levels" of a factor variable, than the character "labels"
# initialize new variable with NAs
nesarc.sub$SmokingFreq3 <- NA
# find indices for each group based on original variable, then assign value to new variable
nesarc.sub$SmokingFreq3[(nesarc.sub$SmokingFreq %in% c(1, 2, 3))] <- 1 # 1. daily
nesarc.sub$SmokingFreq3[(nesarc.sub$SmokingFreq %in% c(4, 5 ))] <- 2 # 2. weekly
nesarc.sub$SmokingFreq3[(nesarc.sub$SmokingFreq %in% c(6 ))] <- 3 # 3. monthly
# new variable
table(nesarc.sub$SmokingFreq3)
# we will label the new factor variable below, with the others
```
### Review results of new variables
Look at the first 10 observations in the dataset, with special attention to the
new variables.
```{R}
head(nesarc.sub, 10)
```
## Working with Dates
Dates can be tricky in R. The `lubridate` package makes them much easier.
Below I'm going to create a variable for the person's age in years
based on the person's date of birth and the interview date.
There is an `AGE` variable which I'm already using,
but if there wasn't, or I wanted to be more precise about their age (such as, 24.34 years old),
then this is a way to do it.
I'm going to create a separate data frame just for this example to show how to do it;
if you needed these variables, they should be part of your original subset above.
Below, the "C" variables (`CDAY`, `CMON`, `CYEAR`) are the interview day, month, and year,
and the "DOB" variables (`DOBD`, `DOBM`, `DOBY`) are the date of birth day, month, and year.
```{R}
dat.sub <- subset(NESARC, select = c(CDAY, CMON, CYEAR, DOBD, DOBM, DOBY))
str(dat.sub)
```
Combine the date fields together, then interpret them into a date object.
I'll do this for both the interview date and the date of birth.
I print the head (first 6 observations) after each operation.
```{R}
library(lubridate)
## interview date
# combine day, month, and year into one text string
cdate.text <- paste(dat.sub$CDAY, dat.sub$CMON, dat.sub$CYEAR)
head(cdate.text)
# create date object by interpreting the day, month, year with dmy()
cdate.date <- dmy(cdate.text)
head(cdate.date)
## date of birth
# combine day, month, and year into one text string
dob.text <- paste(dat.sub$DOBD, dat.sub$DOBM, dat.sub$DOBY)
head(dob.text)
# create date object by interpreting the day, month, year with dmy()
# if any failed to parse, it's because a day, month, or year is out of range
# such as "99" for missing values or refused to answer
# in these cases, the date is NA (missing), which is what we want
dob.date <- dmy(dob.text)
head(dob.date)
```
Now that we have the interview date and date of birth as date objects,
we calculate the difference to give the person's age at the time of the interview.
Because the difference of dates is in the unit "days", we convert to years.
```{R}
# difference in days / 365.25 = difference in years
age.days <- cdate.date - dob.date
head(age.days)
age.years <- as.numeric(age.days / 365.25) # change from "difftime" to "numeric"
head(age.years)
# difference between the age we calculated and the age in the original dataset
head(age.years - NESARC$AGE)
```
Add the new age variable to your data frame,
drop the ones you don't need anymore,
and you're ready to go!
```{R}
# add the variable
dat.sub$age.years <- age.years
str(dat.sub)
# drop the original ones you don't need anymore
# by using subset() and select = -VAR, it excludes the VAR column
dat.sub <- subset(dat.sub, select = c(-CDAY, -CMON, -CYEAR, -DOBD, -DOBM, -DOBY))
str(dat.sub)
```
Now you've got an age (or other date-derived variable) you can use in your project.
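As an aside, lubridate can also compute the age directly with `interval()` and `time_length()`, which account for leap years exactly rather than dividing by 365.25. A minimal sketch with made-up dates (not the NESARC variables):

```r
library(lubridate)
# made-up example dates, not from NESARC
dob.ex   <- dmy("15-06-1980")
visit.ex <- dmy("03-08-2001")
# exact age in years, accounting for leap years
age.ex <- time_length(interval(dob.ex, visit.ex), unit = "years")
age.ex
```

This gives an age a little over 21 years here; for most purposes the 365.25 approximation is close enough, but this version is exact.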
## Labeling Categorical variable levels
Look at a summary of the dataset to see the state of the labeling before we make changes.
```{R}
summary(nesarc.sub)
```
Informative labels are given to the factor levels.
The order of the levels is also rearranged for the variables
`SmokingFreq`, `TobaccoDependence`, and `Sex`.
```{R}
nesarc.sub$Depression <- factor(nesarc.sub$Depression
, labels = c("No Depression"
, "Yes Depression"
))
# first label the existing levels
nesarc.sub$Sex <- factor(nesarc.sub$Sex
, labels = c("Male"
, "Female"
)
)
# check ordering with a frequency table
table(nesarc.sub$Sex)
# then reorder the levels
nesarc.sub$Sex <- factor(nesarc.sub$Sex
, levels = c("Female"
, "Male"
)
)
# check ordering with a frequency table
table(nesarc.sub$Sex)
# labels
nesarc.sub$SmokingFreq <- factor(nesarc.sub$SmokingFreq
, labels = c("Every Day"
, "5 to 6 Days/week"
, "3 to 4 Days/week"
, "1 to 2 Days/week"
, "2 to 3 Days/month"
, "Once a month or less"
)
)
# include levels afterwards to put in order
nesarc.sub$SmokingFreq <- factor(nesarc.sub$SmokingFreq
, levels = c("Once a month or less"
, "2 to 3 Days/month"
, "1 to 2 Days/week"
, "3 to 4 Days/week"
, "5 to 6 Days/week"
, "Every Day"
)
)
nesarc.sub$SmokingFreq3 <- factor(nesarc.sub$SmokingFreq3
, labels = c("daily"
, "weekly"
, "monthly"
)
)
nesarc.sub$TobaccoDependence <- factor(nesarc.sub$TobaccoDependence
, labels = c("No Nicotine Dependence"
, "Nicotine Dependence"
)
)
nesarc.sub$TobaccoDependence <- factor(nesarc.sub$TobaccoDependence
, levels = c("Nicotine Dependence"
, "No Nicotine Dependence"
)
)
nesarc.sub$Ethnicity <- factor(nesarc.sub$Ethnicity
, labels = c("Cauc"
, "AfAm"
, "NaAm"
, "Asia"
, "Hisp"
)
# , labels = c("Caucasian"
# , "African American"
# , "Native American"
# , "Asian"
# , "Hispanic"
# )
)
summary(nesarc.sub)
```
## Thursday ---------
## Tables for categorical variables
Basic tables can be created using the functions `table` or `xtabs`.
Frequency tables are created for the variables
`TobaccoDependence`, `CigsSmokedFac`, and `SmokingFreq`.
When only looking at one variable, `table` is the same as `xtabs`, but
`xtabs` will make looking at two-way, three-way, and higher tables possible.
```{R}
table(nesarc.sub$TobaccoDependence)
T1 <- xtabs( ~ TobaccoDependence, data = nesarc.sub)
T1
T2 <- xtabs( ~ CigsSmokedFac, data = nesarc.sub)
T2
T3 <- xtabs( ~ SmokingFreq, data = nesarc.sub)
T3
```
(Look at the .Rmd code to see that these numbers in the next paragraph are
calculated directly from the numbers summarized in the table!)
In the data frame `nesarc.sub`, there are
`r T1[1]` nicotine dependent subjects and
`r T1[2]` subjects that are not nicotine dependent.
A small number of smokers
(`r T2[4]`) smoke over 600 cigarettes per month.
Most of the subjects in `nesarc.sub` are daily smokers
(`r T3[6]`) with the remainder distributed uniformly across the
first five levels of `SmokingFreq`.
## Graphing frequency tables
The barplots are all created with the package `ggplot2`. They start with the `geom_bar` defaults, and each successive graph adds more detail to the plot.
```{R}
library(ggplot2)
p1 <- ggplot(data = nesarc.sub, aes(x = TobaccoDependence))
p1 <- p1 + geom_bar()
p1
p2 <- ggplot(data = nesarc.sub, aes(x = CigsSmokedFac))
p2 <- p2 + geom_bar()
p2
p3 <- ggplot(data = nesarc.sub, aes(x = SmokingFreq))
p3 <- p3 + geom_bar()
p3 <- p3 + labs(x = "", y = "Total Number", title = "Smoking Frequency for Young Adults")
p3 <- p3 + theme_bw()
p3 <- p3 + theme(axis.text.x = element_text(angle = 90, vjust = 0))
p3
```
### Removing NAs from factor axes
When a factor variable has `NA`s,
then `NA` appears as a level with frequencies (see bar plots above).
As of writing this, `ggplot()` does not have a way to remove the `NA`s before plotting.
Therefore, we have to do it manually.
The easiest and most transparent way is to use the `subset()` function below
where we only plot the non-`NA` values.
Compare the two plots below of the same data,
but the second plot had the `NA`s removed before plotting.
```{R, fig.height = 4, fig.width = 8}
library(ggplot2)
# Original plot with NAs
p1 <- ggplot(data = nesarc.sub, aes(x = CigsSmokedFac))
p1 <- p1 + geom_bar()
#p1
# subset() excludes the NAs for the variable being plotted:
# subset() specifies the values to keep,
# and !is.na() means "keep the non-NA values"
p2 <- ggplot(data = subset(nesarc.sub, !is.na(CigsSmokedFac)), aes(x = CigsSmokedFac))
p2 <- p2 + geom_bar()
#p2
# grid.arrange() is a way to arrange several ggplot objects
library(gridExtra)
grid.arrange(grobs = list(p1, p2), nrow=1, top = "Excluding NAs from factor variable in ggplot")
```
## Graphing numeric variables
One popular way to graph a continuous variable is to use a histogram.
`R` has the base function `hist` which can be used for this purpose.
We will also use the package `ggplot2` to create histograms.
We start with a basic histogram of the variable `TotalCigsSmoked`.
```{R}
hist(nesarc.sub$TotalCigsSmoked)
```
Experiment with the code in the previous code chunk to change the color, the
title, and, if needed, the labels on the $x$- and $y$-axes.
```{R}
p <- ggplot(data = nesarc.sub, aes(x = TotalCigsSmoked))
p <- p + geom_histogram(binwidth = 200)
p <- p + geom_rug()
p <- p + labs(x = "Estimated Cigarettes Smoked per Month")
p <- p + theme_bw()
p
```
### Creating Density Plots
```{R}
p <- ggplot(data = nesarc.sub, aes(x = TotalCigsSmoked))
p <- p + geom_histogram(aes(y=..density..), binwidth = 200)
p <- p + geom_rug()
p <- p + geom_density(alpha=0.1, fill="white", adjust = 1.5)
p <- p + labs(x = "Estimated Cigarettes Smoked per Month")
p <- p + theme_bw()
p
```
### Shape, center, and spread
```{R}
summary(nesarc.sub$TotalCigsSmoked)
fivenum(nesarc.sub$TotalCigsSmoked)
median(nesarc.sub$TotalCigsSmoked, na.rm = TRUE)
IQR(nesarc.sub$TotalCigsSmoked, na.rm = TRUE)
```
The `shape` of the distribution for the estimated cigarettes smoked per month
is skewed to the right.
The `center` (median) of the distribution is
`r median(nesarc.sub$TotalCigsSmoked, na.rm = TRUE)`
and the `spread` (interquartile range) for the distribution is
`r IQR(nesarc.sub$TotalCigsSmoked, na.rm = TRUE)`.
----------------------------------------
# Week04: Bivariate summaries and data cleaning
## Tuesday ---------
## Bivariate graphs
### Scatter plot (for regression): x = numerical, y = numerical
Research question:
With my questions, I don't have a meaningful question to ask that relates
two numeric variables.
An exploratory question is whether there is a relationship between a person's age (`Age`)
and total number of cigarettes smoked (`TotalCigsSmoked`).
Because age is an integer variable, I jitter the points,
and I reduce the opacity (increase the transparency) with `alpha = 1/4`
to see the density of points better (1/4 means that it takes 4 overlaid points to be fully opaque).
__Interpretation:__
I added a linear regression smoother to see whether there was any linear trend with age.
The horizontal line suggests that there isn't a (linear) relationship here.
The first plot has all the points in their original locations,
but they end up stacking on top of each other so you can't tell how many points are there.
```{R}
library(ggplot2)
p <- ggplot(nesarc.sub, aes(x = Age, y = TotalCigsSmoked))
p <- p + geom_point(alpha = 1/4)
p <- p + stat_smooth(method = lm)
p <- p + labs(title = "Total Cigarettes Smoked vs Age")
print(p)
```
This is the same data as above, but I have added some horizontal jitter (displacement)
to each point which doesn't affect their value on the y-axis
but allows us to see how many points are at each discrete age year.
```{R}
library(ggplot2)
p <- ggplot(nesarc.sub, aes(x = Age, y = TotalCigsSmoked))
p <- p + geom_jitter(position = position_jitter(width = 0.2), alpha = 1/4)
p <- p + stat_smooth(method = lm)
p <- p + labs(title = "Total Cigarettes Smoked vs Age")
print(p)
```
### Box plots (for ANOVA): x = categorical, y = numerical
Research question:
1. Is there a relationship between depression (`Depression`) and total number of cigarettes smoked (`TotalCigsSmoked`)?
The $x$ variable should be a factor variable, and $y$ should be numeric.
__Interpretation:__
The boxplots indicate that the five number summary is about the same for both depression groups.
Also, the means are about the same, though slightly higher for the people with depression.
```{R}
library(ggplot2)
p <- ggplot(nesarc.sub, aes(x = Depression, y = TotalCigsSmoked))
p <- p + geom_boxplot(width = 0.5, alpha = 0.5)
p <- p + geom_jitter(position = position_jitter(width = 0.1), alpha = 1/4)
# diamond at mean for each group
p <- p + stat_summary(fun.y = mean, geom = "point", shape = 18, size = 6,
colour = "red", alpha = 0.8)
p <- p + labs(title = "Depression does not predict Total Cigarettes Smoked")
print(p)
```
### Mosaic plot or bivariate bar plots (for contingency tables): x = categorical, y = categorical
Research question:
Is there a relationship between depression status (`Depression`) and nicotine dependence (`TobaccoDependence`)?
This strategy uses the base graphics by plotting the tables directly as bar plots.
The next set of plots using `ggplot()` may be preferred.
```{R}
T1 <- xtabs(~ TobaccoDependence + Depression, data = nesarc.sub)
T1
barplot(T1)
T2 <- prop.table(T1, 2)
T2
barplot(T2)
```
The last graph needs a legend as well as a title. While it is possible to
construct a legend and title for the last graph, it is much better to use an
approach that will generate the legend automatically.
__Interpretation:__
If a person has depression they are more likely to have nicotine dependence
than if they do not have depression.
```{R}
library(ggplot2)
p <- ggplot(data = nesarc.sub, aes(x = Depression, fill = TobaccoDependence))
p <- p + geom_bar(position = "fill")
p <- p + theme_bw()
p <- p + labs(title = "Fraction of young adult smokers with and without\n nicotine dependence by depression status")
# the legend title can be modified, if desired (try this line)
# p <- p + scale_fill_discrete(name="Tobacco Addiction\nStatus")
print(p)
```
Does the relationship between depression status (`Depression`) and nicotine dependence (`TobaccoDependence`)
vary by sex?
__Interpretation:__
It appears that nicotine dependence is slightly higher for women with depression than for men with depression.
```{R}
library(ggplot2)
p <- ggplot(data = nesarc.sub, aes(x = Depression, fill = TobaccoDependence))
p <- p + geom_bar(position = "fill")
p <- p + theme_bw()
p <- p + labs(title = "Fraction of young adult smokers with and without\n nicotine dependence by depression status")
# added a row facets by Sex
p <- p + facet_grid(Sex ~ .)
# tilted the x-axis labels
p <- p + theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
p <- p + labs(title = "Fraction of young adult smokers with and without\n nicotine dependence by smoking frequency")
# the legend title can be modified, if desired (try this line)
p <- p + scale_fill_discrete(name="Tobacco Addiction\nStatus")
print(p)
```
```{R}
# ggplot(data = na.omit(nesarc.sub), aes(x = CigsSmokedFac, fill = TobaccoDependence)) +
# geom_bar(position = "fill") +
# theme_bw() + facet_grid(Sex ~ Depression) +
# theme(axis.text.x = element_text(angle = 85, vjust = 0.5)) +
# labs(x = "Estimated Number of Cigarettes Smoked per Month", y = "", title = "Fraction of young adult smokers with and without\n nicotine dependence by smoking quantity") +
# scale_fill_discrete(name="Tobacco Addiction\nStatus")
```
Research question:
Is there a relationship between smoking frequency (`SmokingFreq`) and nicotine dependence (`TobaccoDependence`)?
Also compare by sex.
```{R}
library(ggplot2)
p <- ggplot(data = subset(nesarc.sub, !is.na(SmokingFreq)), aes(x = SmokingFreq, fill = TobaccoDependence))
p <- p + geom_bar(position = "fill")
p <- p + theme_bw()
# tilt axis labels
p <- p + theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
p <- p + labs(x = "Smoking Frequency", y = "Percent")
print(p)
# also compare by sex
p <- p + facet_grid(Sex ~ .)
print(p)
```
#### Using Mosaic Plots
__Interpretation:__
If a person has depression they are more likely to have nicotine dependence
than if they do not have depression.
```{R}
library(vcd)
mosaic( ~ TobaccoDependence + Depression, data = nesarc.sub)
mosaic( ~ TobaccoDependence + Depression, data = nesarc.sub, shade = TRUE)
mosaic( ~ TobaccoDependence + Depression + Sex, data = nesarc.sub, shade = TRUE)
```
It's easy to look at such a fine resolution that it becomes difficult to see what's happening.
```{R, fig.width = 8, fig.height = 8}
mosaic(~TobaccoDependence + Depression + Sex + CigsSmokedFac, data = nesarc.sub, shade = TRUE)
```
##### Customizing Mosaic Plots
There are several solutions to dealing with the messy labels that overlap.
The help line below will show the many options you have control over.
Read through the comments of each code chunk, see the new options, and see the result.
```{R, fig.height = 4, fig.width = 8}
library(vcd)
# help(labeling_border)
# Initial plot
mosaic( ~ TobaccoDependence + Depression, data = nesarc.sub)
# Abbreviate labels, specify the number of characters (try a few)
mosaic( ~ TobaccoDependence + Depression, data = nesarc.sub
, abbreviate_labs = 4)
# Rotate and align labels
mosaic( ~ TobaccoDependence + Depression, data = nesarc.sub
, rot_labels = c(0, 90, 0, 0)
, just_labels = "right" # right-justifies all the labels (top and side)
)
# Rotate and align labels only y-axis label
mosaic( ~ TobaccoDependence + Depression, data = nesarc.sub
, rot_labels = c(0, 90, 0, 0)
, just_labels = c("center", "right") # order is opposite as in first argument
)
# Rotate and align labels only y-axis label, and offset the variable name
mosaic( ~ TobaccoDependence + Depression, data = nesarc.sub
, rot_labels = c(0, 90, 0, 0)
, just_labels = c("center", "right") # order is opposite as in first argument
, offset_varnames = c(0,0,0,9) # offset the y-axis var name
)
```
### Logistic scatter plot (for logistic regression): x = numerical, y = categorical (binary)
Research question:
1. Frequency and quantity of smoking (`TotalCigsSmoked`) are markedly imperfect indices for
determining an individual's probability of exhibiting nicotine dependence (`TobaccoDependence`) (this
is true for other drugs as well).
```{R}
library(ggplot2)
p <- ggplot(nesarc.sub, aes(x = TotalCigsSmoked, y = TobaccoDependence))
p <- p + geom_jitter(position = position_jitter(height = 0.1), alpha = 1/4)
print(p)
```
To overlay a curve, we need to convert our response variable $y$ to be
binary 0 or 1, where we let 1 indicate a "success" in the direction of interest.
__Interpretation:__
These two groups are quite balanced on the probability of tobacco dependence
given the total number of cigarettes smoked,
though they increase together slightly.
```{R}
# if "Nicotine Dependence", then code it as a 1, otherwise code as a 0
nesarc.sub$TobaccoDependence01 <- ifelse(nesarc.sub$TobaccoDependence == "Nicotine Dependence", 1, 0)
library(ggplot2)
p <- ggplot(nesarc.sub, aes(x = TotalCigsSmoked, y = TobaccoDependence01))
p <- p + geom_jitter(position = position_jitter(height = 0.1), alpha = 1/4)
# overlay a sigmoidal curve to indicate direction of relationship
# As of ggplot2 2.1.0, need to use this binomial_smooth() function:
binomial_smooth <- function(...) {
geom_smooth(method = "glm", method.args = list(family = "binomial"), ...)
}
p <- p + binomial_smooth() # old way: stat_smooth(method="glm", family="binomial", se=FALSE)
print(p)
```
#### Example of collapsing a variable to binary.
It is convenient to collapse a continuous variable into a categorical variable
(as I did in Week03 with the `cut()` command)
or further collapse a categorical variable into a binary variable.
One strategy to create a binary variable is to use the `ifelse()` function.
Let's create a binary `Age` variable: if 21 years or older, then it is legal to purchase alcohol;
otherwise it is illegal.
See how the two columns below correspond as we wanted them to.
I do not recommend overwriting the original column;
instead, create a new variable in the data frame.
```{R}
nesarc.sub$Alcohol.binary <- ifelse(nesarc.sub$Age >= 21, "Legal", "Illegal")
str(nesarc.sub$Alcohol.binary) #$
head(nesarc.sub[,c("Age", "Alcohol.binary")], 15)
```
## Week04: Thursday ---------
__I've moved the data cleaning above,
before recoding and other cleaning operations.__
# Week05: Simple linear regression, logarithm transformation
## Tuesday ---------
### 1. Scatter plot, add a regression line.
Reproduce scatterplot from before, with a regression line.
```{R}
library(ggplot2)
p <- ggplot(nesarc.sub, aes(x = Age, y = TotalCigsSmoked))
p <- p + geom_jitter(position = position_jitter(width = 0.1), alpha = 1/4)
p <- p + stat_smooth(method = lm)
p <- p + labs(title = "Total Cigarettes Smoked vs Age")
print(p)
```
### 2. Fit the linear regression, interpret slope and $R^2$ (R-squared) values
```{R}
# fit the simple linear regression model
lm.TCS.Age <- lm(TotalCigsSmoked ~ Age, data = nesarc.sub)
# use summary() to see parameter estimates (slope, intercept) and other summaries
summary(lm.TCS.Age)
```
The slope is `r signif(coef(lm.TCS.Age)[2],3)`.
Thus, for every year increase in age,
we expect an increase of `r signif(coef(lm.TCS.Age)[2],3)` in total cigarettes smoked.
The $R^2$ value is `r signif(summary(lm.TCS.Age)$r.squared, 2)`, which is effectively 0.
Compared to a model that uses only the mean of total cigarettes smoked,
the model with age explains an additional proportion `r signif(summary(lm.TCS.Age)$r.squared, 2)` of the variance.
When the $x$ and $y$ variables are correlated ($r \ne 0$),
$R^2$ will be larger (since $R^2 = r^2$, the capitalization doesn't matter in this case).
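The $R^2 = r^2$ identity for simple linear regression is easy to verify numerically; here is a small sketch with simulated data (not the NESARC variables):

```r
set.seed(42)
# simulated data with a known linear relationship
x.sim <- rnorm(100)
y.sim <- 2 * x.sim + rnorm(100)
# correlation squared vs regression R-squared
r.sim  <- cor(x.sim, y.sim)
R2.sim <- summary(lm(y.sim ~ x.sim))$r.squared
all.equal(r.sim^2, R2.sim)   # TRUE
```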
### 3. Interpret the intercept. Does it make sense?
The intercept of `r signif(coef(lm.TCS.Age)[1],3)` is the predicted number of total cigarettes smoked
for a person whose age is 0.
Not only is that a gross extrapolation,
but it doesn't make sense.
## Thursday ---------
### 4. Try plotting on log scale ($x$-only, $y$-only, both).
What happened to your data when you transformed it?
The plot with $(x, \log_{10}(y))$ looks much better.
That's because total cigarettes smoked (`TotalCigsSmoked`)
has a lower bound of 0 and
is right skewed (most values are low, some are extremely high).
```{R}
library(ggplot2)
p <- ggplot(nesarc.sub, aes(x = Age, y = TotalCigsSmoked))
p <- p + geom_jitter(position = position_jitter(width = 0.1), alpha = 1/4)
p <- p + stat_smooth(method = lm)
p <- p + scale_y_log10()
p <- p + labs(title = "log10(Total Cigarettes Smoked) vs Age")
print(p)
```
Since the age range is limited and not an interesting variable to transform in this case,
I omit those two plots: $(\log_{10}(x), y)$ and $(\log_{10}(x), \log_{10}(y))$.
You should try them on your data.
```{R}
library(ggplot2)
p1 <- ggplot(nesarc.sub, aes(x = Age, y = TotalCigsSmoked))
p1 <- p1 + geom_jitter(position = position_jitter(width = 0.1), alpha = 1/4)
p1 <- p1 + stat_smooth(method = lm)
p1 <- p1 + scale_x_log10()
p1 <- p1 + labs(title = "Total Cigarettes Smoked vs log10(Age)")
#print(p1)
library(ggplot2)
p2 <- ggplot(nesarc.sub, aes(x = Age, y = TotalCigsSmoked))
p2 <- p2 + geom_jitter(position = position_jitter(width = 0.1), alpha = 1/4)
p2 <- p2 + stat_smooth(method = lm)
p2 <- p2 + scale_x_log10()
p2 <- p2 + scale_y_log10()
p2 <- p2 + labs(title = "log10(Total Cigarettes Smoked) vs log10(Age)")
#print(p2)
library(gridExtra)
#grid.arrange(p1, p2, nrow=1, main = "log10(Age) isn't a good idea")
grid.arrange(grobs = list(p1, p2), nrow=1, top = "log10(Age) isn't a good idea")
```
### 5. Does log transformation help?
The plot with $(x, \log_{10}(y))$ is an improvement over $(x,y)$.
```{R}
nesarc.sub$TotalCigsSmokedLog10 <- log10(nesarc.sub$TotalCigsSmoked)
# fit the simple linear regression model
lm.TCSL10.Age <- lm(TotalCigsSmokedLog10 ~ Age, data = nesarc.sub)
# use summary() to see parameter estimates (slope, intercept) and other summaries
summary(lm.TCSL10.Age)
```
The slope is `r signif(coef(lm.TCSL10.Age)[2],3)`.
Thus, for every year increase in age,
we expect a change of `r signif(coef(lm.TCSL10.Age)[2],3)` in the log10 of total cigarettes smoked;
that is, we expect a small decrease on the log scale.
Notice that the interpretation is on the scale of the regression.
If the $x$ or $y$ variable is on a different scale (such as $\log_{10}$),
then the interpretation is on that scale, too.
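One way to bring a $\log_{10}$-scale slope back to the original scale: a slope of $b$ on the $\log_{10}(y)$ scale means $y$ is multiplied by $10^b$ for each one-unit increase in $x$. A sketch with an illustrative slope value (not the fitted one):

```r
# illustrative log10-scale slope, not the fitted value from the model above
b.log10 <- -0.01
# multiplicative change in y per one-unit increase in x
mult <- 10^b.log10
mult   # just under 1, i.e. a small percentage decrease per year
```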
# Week06: Correlation and Categorical contingency tables
## Tuesday ---------
From a previous scatterplot, calculate the correlation and interpret it.
The plot from the previous section and correlation are calculated below.
(Unfortunately, my example is not very interesting.)
```{R}
library(ggplot2)
p <- ggplot(nesarc.sub, aes(x = Age, y = TotalCigsSmoked))
p <- p + geom_jitter(position = position_jitter(width = 0.1), alpha = 1/4)
p <- p + stat_smooth(method = lm)
p <- p + scale_y_log10()
p <- p + labs(title = "log10(Total Cigarettes Smoked) vs Age")
print(p)
```
```{R}
# Calculate correlation
# Because of possible missing values, the "use" command excludes NAs before calculation
cor.TL.A <- cor(nesarc.sub$TotalCigsSmokedLog10, nesarc.sub$Age, use="pairwise.complete.obs")
cor.TL.A
```
### Interpretation of correlation
The correlation is `r signif(cor.TL.A, 3)`, which is negative (but effectively 0).
This suggests that there is effectively no linear relationship between the $\log_{10}$(Total Cigarettes Smoked) and Age.
(See the Rmd file for how I included the correlation value in the text.)
## Thursday ---------
From a previous multiple-variable categorical table,
calculate conditional proportions and interpret them.
Research question:
Is there a relationship between depression status (`Depression`) and nicotine dependence (`TobaccoDependence`),
and does it differ by Sex?
(We'll perform the statistical test for this in a couple weeks;
right now we're calculating, plotting, and interpreting the proportions.)
I'll modify a plot from before using code from Worksheet 12.
### Depression, TobaccoDependence, and Sex
```{R}
tab.freq <- xtabs( ~ TobaccoDependence + Depression + Sex, data = nesarc.sub)
#tab.freq
library(reshape2)
tab.freq.long <- melt(tab.freq, value.name = "Frequency")
#tab.freq.long
# For each Sex, calculate the column TobaccoDependence proportions by Depression
library(plyr)
tab.col.prop <- ddply(nesarc.sub, "Sex", function(.X) {
# print results to output as it goes
print(as.character(.X$Sex[1])) #$
tab.freq <- xtabs( ~ TobaccoDependence + Depression, data = .X)
print(tab.freq)
tab.col.prop <- prop.table(tab.freq, margin = 2)
print(tab.col.prop)
library(reshape2)
tab.col.prop.long <- melt(tab.col.prop, value.name = "Proportion")
#tab.col.prop.long
return(tab.col.prop.long)
}
)
# join these two datasets to have both Freq and Prop columns
library(plyr)
tab.long <- join(tab.freq.long, tab.col.prop)
tab.long
```
### Interpretation of conditional proportions
The proportions above show that Females without depression have less nicotine dependence
(`r signif(subset(tab.long
, (Sex == "Female")
& (TobaccoDependence == "Nicotine Dependence")
& (Depression == "No Depression")
, select = Proportion), 3)`)
than those with depression
(`r signif(subset(tab.long
, (Sex == "Female")
& (TobaccoDependence == "Nicotine Dependence")
& (Depression == "Yes Depression")
, select = Proportion), 3)`),
and this disparity is even greater for Males
(`r signif(subset(tab.long
, (Sex == "Male")
& (TobaccoDependence == "Nicotine Dependence")
& (Depression == "No Depression")
, select = Proportion), 3)`
vs
`r signif(subset(tab.long
, (Sex == "Male")
& (TobaccoDependence == "Nicotine Dependence")
& (Depression == "Yes Depression")
, select = Proportion), 3)`).
(See the Rmd file for how I included the values from the table in the text.)
```{R}
# lines are sometimes easier, especially when many categories along the x-axis
library(ggplot2)
p <- ggplot(data = tab.long, aes(x = Depression, y = Proportion, colour = TobaccoDependence))
p <- p + geom_hline(yintercept = c(0, 1), alpha = 1/4)
p <- p + geom_point(aes(shape = TobaccoDependence))
p <- p + geom_line(aes(linetype = TobaccoDependence, group = TobaccoDependence))
p <- p + theme_bw()
p <- p + labs(title = "Proportion of TobaccoDependence by Depression and Sex")
p <- p + scale_y_continuous(limits = c(0, 1))
p <- p + facet_grid(. ~ Sex)
print(p)
```
# Week07: Inference and Parameter estimation (one-sample)
## Tuesday ---------
### Dataset description of sampling
Using your dataset's description and/or codebook, briefly describe how the data
were collected (the sampling strategy).
(Since many of you are using NESARC, I'm going to let you find and write this yourself rather than copy from me.)
In the [Introduction to the National Epidemiologic Survey on Alcohol and Related Conditions](http://pubs.niaaa.nih.gov/publications/arh29-2/74-78.htm)
they describe the sampling design.
Here's a brief summary.
Adult respondents were sampled from a subset of households
included in the US Census 2000 Supplementary Survey.
To improve information about smaller groups,
Blacks and Hispanics were oversampled by about 50%,
as were young adults ages 18-24 by 125%.
Using the US Census as a sampling frame helps to provide a representative sampling of the country.
Oversampling smaller subsets of the population means that comparisons involving those
subsets will be more reliable.
```{R, echo=FALSE, eval=FALSE}
# > NESARC DESIGN
# > The Census Supplementary Survey (C2SS) formed the sampling frame for the household portion of the NESARC sample. Of the approximately 2,000 C2SS primary sampling units (PSUs), which represented all 3,142 counties and county equivalents in the United States, 655 PSUs were selected with certainty because of their size (a population of 250,000 or more in 1996). These were designated as self-representing (SR) PSUs. The remaining PSUs were stratified within each State by sociodemographic characteristics. Two PSUs were selected from each stratum with probability proportional to size, yielding 254 additional PSUs that were designated as non-self-representing (NSR). To prevent possible respondent disclosure, the resulting 401 SR and 254 NSR PSUs were subsequently collapsed into 305 SR and 130 NSR PSUs. Within sample PSUs, households were systematically selected, and one adult respondent age 18 or older was selected at random from each sample household.
# > The Census 2000 Group Quarters Inventory formed the sampling frame for the group quarters portion of the NESARC sample. Individuals were randomly selected from a systematic sample of group quarters in these PSUs.
# > NESARC oversampled Blacks and Hispanics at the design phase of the survey, increasing the representation of Black households from 12.3 percent to 19.1 percent and the representation of Hispanic households from 12.5 percent to 19.3 percent. In addition, NESARC oversampled young adults ages 18–24 at the household level at a rate of 2.25 to 1. Again, one sample adult was randomly selected for interview in each household.
# > The NESARC sample was weighted to adjust for nonresponse at the household and person levels, the selection of one person per household, and oversampling of young adults, Hispanics, and Blacks. Once weighted, the data were adjusted to be representative of the U.S. population for various sociodemographic variables, including region, age, sex, race, and ethnicity, based on the 2000 Decennial Census.
```
```{R, echo = FALSE}
#### Visual comparison of whether sampling distribution is close to Normal via Bootstrap
# a function to compare the bootstrap sampling distribution with
# a normal distribution with mean and SEM estimated from the data
bs.one.samp.dist <- function(dat, N = 1e4) {
n <- length(dat);
# resample from data
sam <- matrix(sample(dat, size = N * n, replace = TRUE), ncol=N);
# draw a histogram of the means
sam.mean <- colMeans(sam);
# save par() settings
old.par <- par(no.readonly = TRUE)
# make smaller margins
par(mfrow=c(2,1), mar=c(3,2,2,1), oma=c(1,1,1,1))
# Histogram overlaid with kernel density curve
hist(dat, freq = FALSE, breaks = 6
, main = "Plot of data with smoothed density curve")
points(density(dat), type = "l")
rug(dat)
hist(sam.mean, freq = FALSE, breaks = 25
, main = "Bootstrap sampling distribution of the mean"
, xlab = paste("Data: n =", n
, ", mean =", signif(mean(dat), digits = 5)
, ", se =", signif(sd(dat)/sqrt(n), digits = 5)))
# overlay a density curve for the sample means
points(density(sam.mean), type = "l")
# overlay a normal distribution, bold and red
x <- seq(min(sam.mean), max(sam.mean), length = 1000)
points(x, dnorm(x, mean = mean(dat), sd = sd(dat)/sqrt(n))
, type = "l", lwd = 2, col = "red")
# place a rug of points under the plot
rug(sam.mean)
# restore par() settings
par(old.par)
}
#### Visual comparison of whether sampling distribution is close to Normal via Bootstrap
# a function to compare the bootstrap sampling distribution
# of the difference of means from two samples with
# a normal distribution with mean and SEM estimated from the data
bs.two.samp.diff.dist <- function(dat1, dat2, N = 1e4) {
n1 <- length(dat1);
n2 <- length(dat2);
# resample from data
sam1 <- matrix(sample(dat1, size = N * n1, replace = TRUE), ncol=N);
sam2 <- matrix(sample(dat2, size = N * n2, replace = TRUE), ncol=N);
# calculate the means and take difference between populations
sam1.mean <- colMeans(sam1);
sam2.mean <- colMeans(sam2);
diff.mean <- sam1.mean - sam2.mean;
# save par() settings
old.par <- par(no.readonly = TRUE)
# make smaller margins
par(mfrow=c(3,1), mar=c(3,2,2,1), oma=c(1,1,1,1))
# Histogram overlaid with kernel density curve
hist(dat1, freq = FALSE, breaks = 6
, main = paste("Sample 1", "\n"
, "n =", n1
, ", mean =", signif(mean(dat1), digits = 5)
, ", sd =", signif(sd(dat1), digits = 5))
, xlim = range(c(dat1, dat2)))
points(density(dat1), type = "l")
rug(dat1)
hist(dat2, freq = FALSE, breaks = 6
, main = paste("Sample 2", "\n"
, "n =", n2
, ", mean =", signif(mean(dat2), digits = 5)
, ", sd =", signif(sd(dat2), digits = 5))
, xlim = range(c(dat1, dat2)))
points(density(dat2), type = "l")
rug(dat2)
hist(diff.mean, freq = FALSE, breaks = 25
, main = paste("Bootstrap sampling distribution of the difference in means", "\n"
, "mean =", signif(mean(diff.mean), digits = 5)
, ", se =", signif(sd(diff.mean), digits = 5)))
# overlay a density curve for the sample means
points(density(diff.mean), type = "l")
# overlay a normal distribution, bold and red
x <- seq(min(diff.mean), max(diff.mean), length = 1000)
points(x, dnorm(x, mean = mean(diff.mean), sd = sd(diff.mean))
, type = "l", lwd = 2, col = "red")
# place a rug of points under the plot
rug(diff.mean)
# restore par() settings
par(old.par)
}
```
## Thursday ---------
Using either a numerical variable or a two-level categorical variable,
calculate and interpret a confidence interval for the population mean or proportion.
### Numeric variable confidence interval for mean $\mu$
```{R}
# calculate mean and CI
t.result <- t.test(nesarc.sub$TotalCigsSmokedLog10, conf.level = 0.95) #$
t.result
```
The sample mean of the $\log_{10}($Total Cigs Smoked$)$ is `r signif(t.result$estimate, 3)`.
We are 95% confident that the population mean is between
`r signif(t.result$conf.int[1], 3)` and
`r signif(t.result$conf.int[2], 3)`.
Yes, the interval can be very narrow when the sample size is large.
(Note that this does _not_ mean that there's a 95% probability that the population mean
is in this particular interval. Rather, for 95% of possible samples, an interval constructed
this way will contain the true population mean. We do not know whether the true mean is in our interval or not.)
```{R, fig.width = 7, fig.height = 7}
# plot data and create table
library(ggplot2)
p <- ggplot(nesarc.sub, aes(x = TotalCigsSmokedLog10))
p <- p + geom_histogram(binwidth = 0.2, alpha = 1)
p <- p + geom_rug(alpha = 1/8)
# est
p <- p + geom_vline(aes(xintercept = as.numeric(t.result$estimate)), colour = "blue", size = 1)
p <- p + geom_rect(aes(xmin = t.result$conf.int[1], xmax = t.result$conf.int[2], ymin = -20, ymax = 0)
, fill = "blue", alpha = 1)
p <- p + labs(title = "Log 10 Total Cigs Smoked\nblue = est +- 95% CI")
print(p)
# checking model assumptions, some slight skewness observed
bs.one.samp.dist(subset(nesarc.sub, !is.na(TotalCigsSmokedLog10))$TotalCigsSmokedLog10)
```
### Categorical variable confidence interval for proportion $p$
Let's estimate the population proportion of people with Depression
for the subset of people we've sampled from.
(Because I have at least 5 "successes" and "failures",
and a large sample size ($n=`r nrow(nesarc.sub)`$),
I know `prop.test()` and `binom.test()` will give essentially the same result.
Try it for your data. They will only differ noticeably with quite small samples,
especially if the proportions are close to 0 or 1.)
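For intuition, the familiar large-sample approximation to such an interval is

$$\hat{p} \pm z_{0.975} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}.$$

(`prop.test()` actually reports a refinement of this, the Wilson score interval with continuity correction, and `binom.test()` an exact interval, but with a large $n$ and $\hat{p}$ away from 0 and 1 all three nearly coincide.)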
```{R}
tab.dep <- xtabs( ~ Depression, data = nesarc.sub)
library(reshape2)
tab.dep.long <- melt(tab.dep, value.name = "Frequency")
tab.dep.long <- data.frame(tab.dep.long, Proportion = as.numeric(prop.table(tab.dep)))
tab.dep.long
# prop.test() is an asymptotic (approximate) test for a binomial random variable
p.summary <- prop.test(x = as.numeric(subset(tab.dep.long, Depression == "Yes Depression", select = Frequency))
, n = sum(tab.dep.long$Frequency)
, conf.level = 0.95)
p.summary
```
Note how to add error bars using `geom_errorbar()`.
First determine the CI bounds from the `prop.test()` previously,
then set those as limits.
```{R, fig.height = 2}
# get names of objects in p.summary
names(p.summary)
# here's the confidence interval bounds (the attribute tells us this is a 95% interval)
p.summary$conf.int
p.summary$conf.int[1]
p.summary$conf.int[2]
library(ggplot2)
p <- ggplot(data = subset(tab.dep.long, Depression == "Yes Depression"), aes(x = Depression, y = Proportion))
p <- p + geom_hline(yintercept = c(0, 1), alpha = 1/4)
p <- p + geom_bar(stat = "identity")
p <- p + geom_errorbar(aes(ymin = p.summary$conf.int[1], ymax = p.summary$conf.int[2]), width = 0.25)
p <- p + scale_y_continuous(limits = c(0, 1))
# swap axes to take less vertical space
p <- p + coord_flip()
print(p)
```
The sample estimate of the population proportion of people depressed
is `r signif(p.summary$estimate, 3)`.
We are 95% confident that the population proportion is between
`r signif(p.summary$conf.int[1], 3)` and
`r signif(p.summary$conf.int[2], 3)`.
Yes, the interval can be very narrow when the sample size is large.
(Note that this does _not_ mean that there's 95% probability that the population proportion
is in that interval. Rather, for 95% of samples, the true population proportion will
be in the interval. We do not know whether the true proportion is in our interval or not.)
# Week08: Hypothesis testing (one- and two-sample)
## Tuesday ---------
### Two-sample $t$-test: `TotalCigsSmoked` by `Depression`
1. Using a numerical response variable and a two-level categorical variable (or a categorical variable you can reduce to two levels),
specify a two-sample $t$-test associated with your research questions.
* (2 p) Specify the hypotheses in words and notation (either one- or two-sided test),
* (0 p) use `t.test()` to calculate the mean, test statistic, and p-value,
* (3 p) state the significance level, test statistic, and p-value, and
* (2 p) state the conclusion in the context of the problem.
* (1 p) Given your conclusion, could you have committed a Type-I or Type-II error?
* (2 p) Provide an appropriate plot of the data and sample estimates in a well-labelled plot.
```{R}
## If we create a summary data.frame with a similar structure as our data, then we
## can annotate our plot with those summaries.
library(plyr)
est.mean.TCS.D <- ddply(nesarc.sub
, "Depression"
, summarise
, TotalCigsSmoked = mean(TotalCigsSmoked, na.rm = TRUE)
)
est.mean.TCS.D
library(ggplot2)
p <- ggplot(nesarc.sub, aes(x = Depression, y = TotalCigsSmoked))
p <- p + geom_boxplot(width = 0.5, alpha = 0.5)
p <- p + geom_jitter(position = position_jitter(width = 0.1), alpha = 1/4)
# diamond at mean for each group
p <- p + stat_summary(fun.y = mean, geom = "point", shape = 18, size = 6,
colour = "red", alpha = 0.8)
p <- p + labs(title = "Total Cigarettes Smoked by Depression")
print(p)
# histogram using ggplot
p <- ggplot(nesarc.sub, aes(x = TotalCigsSmoked))
p <- p + geom_histogram(binwidth = 40)
p <- p + geom_rug()
p <- p + geom_vline(data = est.mean.TCS.D, aes(xintercept = TotalCigsSmoked), colour = "red")
p <- p + facet_grid(Depression ~ .)
print(p)
```
```{R}
t.summary.TCS.D <- t.test(TotalCigsSmoked ~ Depression, nesarc.sub)
t.summary.TCS.D
```
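By default `t.test()` performs Welch's two-sample $t$-test, which does not assume equal group variances. Its test statistic is

$$t_{s} = \frac{\bar{x}_{D} - \bar{x}_{ND}}{\sqrt{s_{D}^{2}/n_{D} + s_{ND}^{2}/n_{ND}}},$$

with degrees of freedom estimated by the Satterthwaite approximation rather than $n_{D} + n_{ND} - 2$.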
1. "Is the population mean total cigarettes smoked different for those with depression or not?"
* $H_0: \mu_{D} = \mu_{ND}$ versus $H_A: \mu_{D} \ne \mu_{ND}$
2. Let $\alpha=0.05$, the significance level of the test and the Type-I error probability if the null hypothesis is true.
3. $t_{s} = `r signif(t.summary.TCS.D$statistic, 4)`$.
4. $p=`r signif(t.summary.TCS.D$p.value, 3)`$, this is the observed significance of the test.
5. Because $p=`r signif(t.summary.TCS.D$p.value, 3)` > 0.05$,
we have insufficient evidence to reject $H_0$, concluding that the
total cigarettes smoked does not differ by depression status.
Because we failed to reject $H_0$, I could have committed a Type-II error
(that there is a difference but we didn't detect it).
## Thursday ---------
Enjoy your Fall Break!
# Week09: ANOVA and Assessing Assumptions
## Tuesday ---------
(none)
## Thursday ---------
## ANOVA: Total cigarettes smoked by Ethnicity
### Transform the response variable to satisfy assumptions
I've decided to try to find a better transformation of the total cigarettes smoked,
`TotalCigsSmoked`,
to improve the symmetry (and normality) of the distributions
(the original scale was right skewed, and the $\log_{10}$ was left skewed).
The __Box-Cox transformation__ is a power transformation where we find a best power $\lambda$
for $y^{\lambda}$ such that the residuals from the model are most normal.
We're fitting ANOVA, so we first fit the model,
then have the `boxcox()` function estimate the power that gives us residuals that
are most "normal".
Finally, we'll create a new variable $y^{\lambda}$ for a sensible value of $\lambda$.
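For reference, the Box-Cox family indexed by the power $\lambda$ is

$$y^{(\lambda)} = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \ne 0, \\ \log(y), & \lambda = 0, \end{cases}$$

so $\lambda = 0.5$ corresponds (up to centering and scaling) to the square root, $\lambda = 0$ to the log, and $\lambda = 1$ to no transformation.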
```{R}
# Box-Cox transformation finds the best power transformation of the response
# to make the residuals most "normal".
# A transformation close to this will help us meet our normality assumption.
# It is often preferred to choose a transformation that is a "simple" exponent,
# such as 0.5, rather than the exact estimate of the best, such as 0.4242424.
library(car)
aov.summary <- aov(TotalCigsSmoked ~ Ethnicity, data = nesarc.sub)
bc.summary <- boxCox(aov.summary
, lambda = seq(-2, 2, 0.1)
, plotit = TRUE)
opt.lambda <- bc.summary$x[which.max(bc.summary$y)]
opt.lambda
```
The best power transformation value is $\lambda = `r signif(opt.lambda, 3)`$,
so I will choose 0.5 (a square-root transformation).
When I reassess the power after taking the square root,
$\lambda = 1$ falls close to the 95% interval,
so I will keep the square root as my transformation, both to ease interpretation
and because the plots below of the data on the square-root scale
are close to normal.
```{R}
## Choose a square root transformation.
nesarc.sub$TotalCigsSmokedSqrt <- sqrt(nesarc.sub$TotalCigsSmoked)
library(car)
aov.summary <- aov(TotalCigsSmokedSqrt ~ Ethnicity, data = nesarc.sub)
bc.summary <- boxCox(aov.summary
, lambda = seq(-2, 2, 0.1)
, plotit = TRUE)
opt.lambda <- bc.summary$x[which.max(bc.summary$y)]
opt.lambda
```
### ANOVA Hypothesis test
A table of the ethnic groups to compare is below.
```{R}
summary(nesarc.sub$Ethnicity)
```
Let $\mu_j$ = pop mean number of the square-root total cigarettes smoked for the five ethnicities identified in the dataset:
Caucasian,
African American,
Native American,
Asian, and
Hispanic,
numbered $(j = 1, 2, 3, 4, 5)$, respectively.
We wish to test $H_0: \mu_1=\cdots=\mu_5$ against $H_A: \textrm{not } H_0$ (at least one pair of means differ).
Plot the data in a way that compares the means.
Error bars are 95% confidence intervals of the mean.
```{R}
# Plot the data using ggplot
library(ggplot2)
p <- ggplot(nesarc.sub, aes(x = Ethnicity, y = TotalCigsSmokedSqrt))
# plot a reference line for the global mean (assuming no groups)
p <- p + geom_hline(yintercept = mean(nesarc.sub$TotalCigsSmokedSqrt),
colour = "black", linetype = "dashed", size = 0.3, alpha = 0.5)
# boxplot, size=.75 to stand out behind CI
p <- p + geom_violin(width = 0.5, alpha = 0.25)
p <- p + geom_boxplot(width = 0.25, alpha = 0.25)
# points for observed data
p <- p + geom_point(position = position_jitter(w = 0.05, h = 0), alpha = 0.2)
# diamond at mean for each group
p <- p + stat_summary(fun.y = mean, geom = "point", shape = 18, size = 4,
colour = "red", alpha = 0.8)
# confidence limits based on normal distribution
p <- p + stat_summary(fun.data = "mean_cl_normal", geom = "errorbar",
width = .2, colour = "red", alpha = 0.8)
p <- p + labs(title = "Total cigarettes smoked by Ethnicity")
p <- p + ylab("Square-root of Total Cigs Smoked")
print(p)
```
__Hypothesis test__
1. Set up the __null and alternative hypotheses__ in words and notation.
* In words: "The population mean square-root total cigarettes smoked is the same for all ethnicities" versus "the mean differs for at least one pair of ethnicities."
* In notation: $H_0: \mu_1=\cdots=\mu_5$ versus $H_A: \textrm{not } H_0$ (at least one pair of means differ).
2. Let the significance level of the test be $\alpha=0.05$.
3. Compute the __test statistic__.
```{R}
aov.summary <- aov(TotalCigsSmokedSqrt ~ Ethnicity, data = nesarc.sub)
summary(aov.summary)
```
The $F$-statistic for the ANOVA is $F = `r signif(unlist(summary(aov.summary))["F value1"], 3)`$.
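Recall that this $F$-statistic compares between-group to within-group variability,

$$F = \frac{\textrm{MS}_{\textrm{Ethnicity}}}{\textrm{MS}_{\textrm{Error}}} = \frac{\sum_{j} n_{j} (\bar{y}_{j} - \bar{y})^{2} / (k - 1)}{\sum_{j} \sum_{i} (y_{ij} - \bar{y}_{j})^{2} / (n - k)},$$

with $k = 5$ groups here; values much larger than 1 indicate the group means differ by more than sampling variability alone would produce.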
4. Compute the __$p$-value__ from the test statistic.
The p-value for testing the null hypothesis is
$p = `r signif(unlist(summary(aov.summary))["Pr(>F)1"], 3)`$.
5. State the __conclusion__ in terms of the problem.
Because $p = `r signif(unlist(summary(aov.summary))["Pr(>F)1"], 3)` < 0.05$,
we reject $H_0$ in favor of $H_A$ concluding that
the mean square-root total cigarettes smoked is different between ethnicities.
### Check assumptions
6. __Check assumptions__ of the test.
a. Residuals are normal
b. Populations have equal variances.
* Check whether residuals are normal.
- Plot the residuals and assess whether they appear normal.
```{R}
# Plot the data using ggplot
df.res <- data.frame(res = aov.summary$residuals)
library(ggplot2)
p <- ggplot(df.res, aes(x = res))
p <- p + geom_histogram(aes(y = ..density..), binwidth = 4)
p <- p + geom_density(colour = "blue")
p <- p + geom_rug()
p <- p + stat_function(fun = dnorm, colour = "red", args = list(mean = mean(df.res$res), sd = sd(df.res$res)))
p <- p + labs(title = "ANOVA Residuals")
print(p)
```
There are a few extreme observations, and a large peak of points in the center.
They are not symmetric, but they are unimodal.
- Plot the residuals versus the normal quantiles.
If the residuals are normal, then they will fall on the center line and
very few will be outside the error bands.
```{R}
# QQ plot
par(mfrow=c(1,1))
library(car)
qqPlot(aov.summary$residuals, las = 1, id.n = 0, id.cex = 1, lwd = 1, main="QQ Plot")
```
There are deviations from normality, and I think this largely has to do with rounding the number of cigarettes smoked.
Those lowest values are a group of very small numbers (1 cig - 10 cig in a month),
and because those are bounded at 0,
we're getting that the residuals are larger than expected
under the assumption that they are normal.
The points in the middle under the red line are a group of points of regular smokers who smoke a moderate amount.
- A formal test of normality on the residuals tests the hypothesis
$H_0:$ The distribution is Normal vs
$H_A:$ The distribution is not Normal.
We can test the distribution of the residuals.
Three formal tests for normality are reported below.
```{R}
shapiro.test(aov.summary$residuals)
library(nortest)
ad.test(aov.summary$residuals)
cvm.test(aov.summary$residuals)
```
All three tests reject normality.
In summary, both the plots and tests for normality suggest these residuals are not normal.
The original plots do indicate that for each ethnic group
the square-root transformation has made the data roughly symmetric, but not normal.
We have a few outliers we may want to worry about (remove and see if the results change).
Ultimately, we will probably want to use a nonparametric method to test the ANOVA hypothesis,
a method that doesn't require the normality assumption.
* Check whether populations have equal variances.
- Look at the numerical summaries below.
```{R}
# calculate summaries
library(plyr)
nesarc.sub.summary <- ddply(nesarc.sub, "Ethnicity",
function(X) { data.frame( m = mean(X$TotalCigsSmokedSqrt),
s = sd(X$TotalCigsSmokedSqrt),
n = length(X$TotalCigsSmokedSqrt) ) } )
nesarc.sub.summary
```
The standard deviations are between `r signif(min(nesarc.sub.summary$s), 3)`
and `r signif(max(nesarc.sub.summary$s), 3)`,
which is not a wide range.
- Formal tests for equal variances.
We can test whether the variances are equal between our five ethnic groups.
This is similar to the ANOVA hypothesis, but instead of testing means we're testing variances.
$H_0: \sigma^2_1=\cdots=\sigma^2_5$
versus $H_A: \textrm{not } H_0$ (at least one pair of variances differ).
```{R}
## Test equal variance
# assumes populations are normal
bartlett.test(TotalCigsSmokedSqrt ~ Ethnicity, data = nesarc.sub)
# does not assume normality, requires car package
library(car)
leveneTest(TotalCigsSmokedSqrt ~ Ethnicity, data = nesarc.sub)
# nonparametric test
fligner.test(TotalCigsSmokedSqrt ~ Ethnicity, data = nesarc.sub)
```
Interpret the result of the appropriate test.
If normality was reasonable then use Bartlett, otherwise use Levene.
Both the Levene and Fligner tests reject equal variance.
However, we have a large sample size, so we are "over powered"
to identify small deviations from equality of variance.
While this difference probably wouldn't violate our assumption so badly as to
invalidate the test, we will still want to use the nonparametric test next week, instead.
### Post Hoc pairwise comparison tests
7. If the ANOVA null hypothesis was rejected, then perform follow-up Post Hoc
pairwise comparison tests to determine which pairs of means are different.
(We'll continue assuming our assumptions are met,
for the sake of completing the exercise.
We understand that we'll want to use a nonparametric test, instead.)
There are several multiple comparison methods described in the notes.
Let's use Tukey's Honest Significant Difference (HSD) here to test which pairs of
populations differ.
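For reference, with unequal group sizes `TukeyHSD()` uses the Tukey-Kramer version of the procedure: roughly, groups $i$ and $j$ are declared different when

$$|\bar{y}_{i} - \bar{y}_{j}| > \frac{q_{\alpha}(k, n - k)}{\sqrt{2}} \sqrt{\textrm{MS}_{\textrm{Error}} \left( \frac{1}{n_{i}} + \frac{1}{n_{j}} \right)},$$

where $q_{\alpha}$ is the upper quantile of the studentized range for $k$ groups, which controls the familywise error rate over all $\binom{k}{2}$ pairwise comparisons.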
```{R}
# sort the Ethnicity by means
library(plyr)
arrange(nesarc.sub.summary, m)
# Tukey 95% Individual p-values
TukeyHSD(aov.summary)
```
Summarize results by ordering the means and grouping pairs that do not differ (see notes for examples).
I find it easiest to look for the large p-values (which pairs are not significantly different)
and group those first, then look at the significant pairs and make sure none are connected.
Here are two ways to do this
(vertical is useful when there are many groups,
horizontal is easier to read).
```
(numerical summaries are a nice addition)
NotSigDiff Mean Ethnicity m s n
| 12.5 Hispanic 12.5 7.8 333
|| 14.4 African American 14.4 7.1 210
|| 14.4 Asian 14.4 6.0 58
|| 15.7 Native American 15.7 7.9 42
| 17.3 Caucasian 17.3 8.3 1054
Ethnicity: Hispanic African American Asian Native American Caucasian
Mean: 12.5 14.4 14.4 15.7 17.3
-------------------------------------
--------------------------------------------
----------------------------
```
# Week10: Nonparametric methods and Binomial and multinomial proportion tests
## Tuesday ---------
## Thursday ---------
### Multinomial goodness-of-fit: `Ethnicity` at US census proportions
__Question:__
Did NESARC sample at random from the US population so that ethnicities are
proportionally represented?
To answer this question, I need to go back to the original NESARC dataset.
If I use my subset, I may have removed observations that will bias my answer to this question.
#### Observed
NB. All the results in this section depend on `temp.eth` and `US.eth`
the frequency tables of the full NESARC dataset and the 2010 US Census.
```{R}
# create a temporary Ethnicity factor variable from the original dataset
temp.eth <- factor(NESARC$ETHRACE2A
, labels = c("Caucasian"
, "African American"
, "Native American"
, "Asian"
, "Hispanic"
)
)
# observed proportions
obs.eth <- table(temp.eth)
obs.eth
obs.prop <- prop.table(obs.eth)
round(obs.prop, 3)
```
#### Expected
The US [Ethnicity](https://en.wikipedia.org/wiki/Demographics_of_the_United_States#Race_and_ethnicity) proportions
are listed below with numbered categories consistent with NESARC.
```
ETHRACE2A (Ethnicity)
IMPUTED RACE/ETHNICITY
Nominal
24507 1. White, Not Hispanic or Latino
8245 2. Black, Not Hispanic or Latino
701 3. American Indian/Alaska Native, Not Hispanic or Latino
1332 4. Asian/Native Hawaiian/Pacific Islander, Not Hispanic or Latino
8308 5. Hispanic or Latino
https://en.wikipedia.org/wiki/Demographics_of_the_United_States#Race_and_ethnicity
The U.S. population's distribution by race and ethnicity in 2010 was as follows:
1. Non-Hispanic White 196,817,552 63.7 %
2. Non-Hispanic Black or African American 37,685,848 12.2 %
3. Non-Hispanic American Indian or Alaska Native 2,247,098 0.7 %
4. Non-Hispanic Asian 14,465,124 4.7 %
4. Non-Hispanic Native Hawaiian or other Pacific Islander 481,576 0.2 %
4. Non-Hispanic some other race 604,265 0.2 %
4. Non-Hispanic two or more races 5,966,481 1.9 %
5. Hispanic or Latino 50,477,594 16.4 %
```
Expected numbers
```{R}
US.eth <- c(196817552
, 37685848
, 2247098
, 14465124 + 481576 + 604265 + 5966481
, 50477594
)
exp.prop <- US.eth / sum(US.eth)
round(exp.prop, 3)
# Expected number of eth
exp.eth <- sum(obs.eth) * exp.prop
round(exp.eth, 1)
```
Combining observed and expected into a table.
```{R}
eth <- data.frame(Ethnicity = factor(names(obs.eth))
, obs = as.numeric(obs.eth)
, exp = exp.eth
, obs.prop = as.numeric(obs.prop)
, exp.prop = exp.prop
)
str(eth)
eth
```
### Perform $\chi^2$ Goodness-of-fit test
1. Set up the __null and alternative hypotheses__ in words and notation.
* In words: "The NESARC and US Census proportions are equal for people of each Ethnicity."
* In notation:
$H_0:
p_{1} = `r round(eth$exp.prop[1], 3)`,
p_{2} = `r round(eth$exp.prop[2], 3)`,
p_{3} = `r round(eth$exp.prop[3], 3)`,
p_{4} = `r round(eth$exp.prop[4], 3)`, \textrm{and}
p_{5} = `r round(eth$exp.prop[5], 3)`$
versus $H_A: \textrm{not } H_0$
2. Choose the __significance level__ of the test, such as $\alpha=0.05$.
```{R}
# calculate chi-square goodness-of-fit test
x.summary <- chisq.test(eth$obs, correct = FALSE, p = eth$exp.prop)
# print result of test
x.summary
```
3. Compute the __test statistic__.
The test statistic is $X^2 = `r signif(x.summary$statistic, 5)`$.
4. Compute the __$p$-value__.
The p-value $= `r signif(x.summary$p.value, 3)`$.
5. (1 p) State the __conclusion__ in terms of the problem.
* We reject $H_0$ in favor of $H_A$ if $p\textrm{-value} < \alpha$.
* Fail to reject $H_0$ if $p\textrm{-value} \ge \alpha$.
(Note: We DO NOT _accept_ $H_0$.)
Because the p-value $= `r signif(x.summary$p.value, 3)` < 0.05 = \alpha$,
we reject $H_0$ in favor of $H_A$
concluding that the proportions of ethnicities sampled in NESARC differ
from the US Census proportions.
6. (1 p) __Check assumptions__ of the test (see notes).
Because there is an expected frequency of at least 5 for each category,
the assumptions are met for this test.
#### Chi-sq statistic helps us understand the deviations
```{R}
# use output in x.summary and create table
x.table <- data.frame(Ethnicity = eth$Ethnicity
, obs = x.summary$observed
, exp = x.summary$expected
, res = x.summary$residuals
, chisq = x.summary$residuals^2
, stdres = x.summary$stdres)
x.table
```
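For reference, the `res` and `chisq` columns above are, for category $j$,

$$r_{j} = \frac{O_{j} - E_{j}}{\sqrt{E_{j}}}, \qquad X^{2}_{s} = \sum_{j} r_{j}^{2},$$

and the standardized residuals (`stdres`) rescale further as $(O_{j} - E_{j})/\sqrt{n\,p_{j}(1 - p_{j})}$, so they behave like $z$-scores; values beyond roughly $\pm 2$ flag categories that deviate notably.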
Plot observed vs expected values to help identify `Ethnicity` groups that deviate the most.
Plot contribution to chi-square values to help identify `Ethnicity` groups that deviate the most.
The term "Contribution to Chi-Square" (`chisq`) refers to the values of
$\frac{(O-E)^2}{E}$ for each category.
$\chi^2_{\textrm{s}}$ is the sum of those contributions.
These plots suggest that African Americans and Native Americans were oversampled
and Asians were undersampled.
```{R, fig.width = 6, fig.height = 4}
library(reshape2)
x.table.obsexp <- melt(x.table,
# id.vars: ID variables
# all variables to keep but not split apart on
id.vars = c("Ethnicity"),
# measure.vars: The source columns
# (if unspecified then all other variables are measure.vars)
measure.vars = c("obs", "exp"),
# variable.name: Name of the destination column identifying each
# original column that the measurement came from
variable.name = "stat",
# value.name: column name for values in table
value.name = "value"
)
#x.table.obsexp
# Observed vs Expected counts
library(ggplot2)
p1 <- ggplot(x.table.obsexp, aes(x = Ethnicity, fill = stat, weight=value))
p1 <- p1 + geom_bar(position="dodge")
p1 <- p1 + labs(title = "Observed and Expected frequencies")
p1 <- p1 + xlab("Ethnicity category")
print(p1)
# Contribution to chi-sq
# pull out only the Ethnicity and chisq columns
x.table.chisq <- x.table[, c("Ethnicity", "chisq")]
# reorder the Ethnicity categories to be descending relative to the chisq statistic
x.table.chisq$Ethnicity <- with(x.table, reorder(Ethnicity, -chisq))
p2 <- ggplot(x.table.chisq, aes(x = Ethnicity, weight = chisq))
p2 <- p2 + geom_bar()
p2 <- p2 + labs(title = "Contribution to Chi-sq statistic")
p2 <- p2 + xlab("Sorted Ethnicity")
p2 <- p2 + ylab("Chi-sq")
print(p2)
#library(gridExtra)
#grid.arrange(p1, p2, nrow=1)
```
### Multiple Comparisons
The goodness-of-fit test suggests that at least one of the observed `Ethnicity`
category proportions differs from its expected (census) proportion.
A reasonable next step in the analysis would be to __separately__ test
the five hypotheses to see which `Ethnicity` categories led to rejecting
the goodness-of-fit null hypothesis:
* $H_0: p_{1} = `r round(eth$exp.prop[1], 3)`$
* $H_0: p_{2} = `r round(eth$exp.prop[2], 3)`$
* $H_0: p_{3} = `r round(eth$exp.prop[3], 3)`$
* $H_0: p_{4} = `r round(eth$exp.prop[4], 3)`$
* $H_0: p_{5} = `r round(eth$exp.prop[5], 3)`$
A Bonferroni comparison with a Family Error Rate $\alpha=0.05$
sets each test's significance level to $\alpha = 0.05 / 5 = `r signif(0.05/5, 3)`$,
which corresponds to `r signif(100 * (1 - 0.05/5), 4)`% two-sided CIs
for the individual proportions.
Below I perform exact binomial tests of proportion for each of the five `Ethnicity` categories
at the Bonferroni-adjusted significance level.
I save the p-values and confidence intervals in a table along with the
observed and census proportions in order to display the table below.
```{R}
b.sum1 <- binom.test(eth$obs[1], sum(eth$obs), p = eth$exp.prop[1], alternative = "two.sided", conf.level = 1-0.05/5)
b.sum2 <- binom.test(eth$obs[2], sum(eth$obs), p = eth$exp.prop[2], alternative = "two.sided", conf.level = 1-0.05/5)
b.sum3 <- binom.test(eth$obs[3], sum(eth$obs), p = eth$exp.prop[3], alternative = "two.sided", conf.level = 1-0.05/5)
b.sum4 <- binom.test(eth$obs[4], sum(eth$obs), p = eth$exp.prop[4], alternative = "two.sided", conf.level = 1-0.05/5)
b.sum5 <- binom.test(eth$obs[5], sum(eth$obs), p = eth$exp.prop[5], alternative = "two.sided", conf.level = 1-0.05/5)
# put the p-value and CI into a data.frame
b.sum <- data.frame(
rbind( c(b.sum1$p.value, b.sum1$conf.int)
, c(b.sum2$p.value, b.sum2$conf.int)
, c(b.sum3$p.value, b.sum3$conf.int)
, c(b.sum4$p.value, b.sum4$conf.int)
, c(b.sum5$p.value, b.sum5$conf.int)
)
)
names(b.sum) <- c("p.value", "CI.lower", "CI.upper")
b.sum$Ethnicity <- eth$Ethnicity
b.sum$Observed <- x.table$obs/sum(x.table$obs)
b.sum$exp.prop <- eth$exp.prop
b.sum
```
All five hypothesis tests have p-values less than $\alpha = `r signif(0.05/5, 3)`$,
thus they all reject $H_0$,
concluding that each Ethnicity was not sampled
in perfect proportion with the 2010 US Census.
In a survey like this, typically it is desirable to oversample smaller groups,
such as Asian and Native American,
and this was not done for Asian.
```{R}
# Observed vs Expected proportions
library(ggplot2)
p <- ggplot(b.sum, aes(x = Ethnicity, y = Observed))
# observed
p <- p + geom_point(size = 4)
p <- p + geom_line(aes(group = 1))
p <- p + geom_errorbar(aes(ymin = CI.lower, ymax = CI.upper), width = 0.1)
# expected
p <- p + geom_point(aes(y = exp.prop), colour = "red", shape = "x", size = 6)
p <- p + geom_line(aes(y = exp.prop, group = 1), colour = "red")
p <- p + labs(title = "NESARC (w/99% CI) and 2010 US Census (red x) proportions")
p <- p + expand_limits(y = 0)
print(p)
```
# Week11: Two-way categorical tables and Simple linear regression, inference
## Tuesday ---------
## Two-way categorical analysis.
Using two categorical variables with two to five levels each,
specify a hypothesis test for homogeneity of proportions associated with your research questions.
### (1 p) Specify the hypotheses in words and notation.
* In words: "There is no association between TobaccoDependence and Ethnicity" versus "there is an association between TobaccoDependence and Ethnicity."
* In notation: $H_0: p(i \textrm{ and } j) = p(i)p(j)$ for all row categories $i$ and column categories $j$, versus $H_A: p(i \textrm{ and } j) \ne p(i)p(j)$ for at least one $(i, j)$ pair.
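Under $H_0$, each expected cell count comes from the row and column totals, and the test statistic sums the scaled squared deviations:

$$E_{ij} = \frac{(\textrm{row } i \textrm{ total}) \times (\textrm{column } j \textrm{ total})}{n}, \qquad X^{2} = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^{2}}{E_{ij}},$$

which is approximately $\chi^2$-distributed with $(r - 1)(c - 1)$ degrees of freedom when $H_0$ is true.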
### (1 p) State the conclusion of the test in the context of the problem.
```{R}
# Tabulate by two categorical variables:
tab.E.TD <- xtabs(~ Ethnicity + TobaccoDependence, data = nesarc.sub)
tab.E.TD
# column proportions
prop.table(tab.E.TD, margin = 2)
```
```{R}
chisq.E.TD <- chisq.test(tab.E.TD, correct=FALSE)
chisq.E.TD
```
The test statistic is $X^2 = `r signif(chisq.E.TD$statistic, 3)`$.
Because the p-value $= `r signif(chisq.E.TD$p.value, 3)` < 0.05$
we reject the null hypothesis
concluding that there is an association between
TobaccoDependence and Ethnicity.
```{R}
# table of expected frequencies:
chisq.E.TD$expected
# smallest expected frequency:
min(chisq.E.TD$expected)
```
The model assumptions are met since the expected count for each cell is at least 5.
### (1 p) Plot a mosaic plot of the data and Pearson residuals.
```{R}
# The Pearson residuals
chisq.E.TD$residuals
# The sum of the squared residuals is the chi-squared statistic:
#chisq.E.TD$residuals^2
#sum(chisq.E.TD$residuals^2)
```
```{R, fig.width = 6, fig.height = 6}
# mosaic plot
library(vcd)
# this layout gives us the interpretation we want:
mosaic(tab.E.TD, shade=TRUE, legend=TRUE)
#mosaic(~ Ethnicity + TobaccoDependence, data = nesarc.sub, shade=TRUE, legend=TRUE, direction = "v")
```
### (1 p) Interpret the mosaic plot with reference to the Pearson residuals.
The primary cause of rejecting $H_0$ is that
Hispanics have less nicotine dependence than expected.
## Thursday ---------
## Simple linear regression.
Select two numerical variables.
I don't have great variables in my dataset related to my research question for this.
I select `TotalCigsSmokedLog10` and `DaysSmoke`.
### (1 p) Plot the data and, if required, transform the variables so a roughly linear relationship is observed. All interpretations will be done on this scale of the variables.
See below.
### (0 p) Fit the simple linear regression model.
```{R}
# fit model
lm.fit <- lm(TotalCigsSmokedSqrt ~ DaysSmoke, data = nesarc.sub)
```
Note that prediction intervals should be bounded below at 0.
```{R}
library(ggplot2)
p <- ggplot(nesarc.sub, aes(x = DaysSmoke, y = TotalCigsSmokedSqrt))
p <- p + geom_vline(xintercept = 0, alpha = 0.25)
p <- p + geom_hline(yintercept = 0, alpha = 0.25)
# prediction bands
p <- p + geom_ribbon(aes(ymin = predict(lm.fit, data.frame(DaysSmoke)
, interval = "prediction", level = 0.95)[, 2],
ymax = predict(lm.fit, data.frame(DaysSmoke)
, interval = "prediction", level = 0.95)[, 3])
, alpha=0.1, fill="darkgreen")
# linear regression fit and confidence bands
p <- p + geom_smooth(method = lm, se = TRUE)
# jitter a little to uncover duplicate points
p <- p + geom_jitter(position = position_jitter(.1), alpha = 0.75)
p <- p + labs(title = "Regression with confidence and prediction bands")
print(p)
```
### (1 p) Assess the residuals for lack of fit (interpret plots of residuals vs fitted and $x$-value).
```{R, fig.width = 8, fig.height = 6}
# plot diagnostics
par(mfrow=c(2,3))
plot(lm.fit, which = c(1,4,6))
# residuals vs DaysSmoke
# can only plot observations that did not include NA for the lm.fit
plot(subset(nesarc.sub, !is.na(TotalCigsSmokedSqrt) & !is.na(DaysSmoke))$DaysSmoke, lm.fit$residuals, main="Residuals vs DaysSmoke")
# horizontal line at zero
abline(h = 0, col = "gray75")
# Normality of Residuals
library(car)
qqPlot(lm.fit$residuals, las = 1, id.n = 3, main="QQ Plot")
# # residuals vs order of data
# plot(lm.fit$residuals, main="Residuals vs Order of data")
# # horizontal line at zero
# abline(h = 0, col = "gray75")
hist(lm.fit$residuals, breaks=10, main="Residuals")
```
There is some nonrandom pattern.
It almost appears that there'd be a shallower slope if it wasn't for the 30 days/month smokers.
### (1 p) Assess the residuals for normality (interpret QQ-plot and histogram).
Normality is not perfect, with some slight right skewness,
but probably about as good as it gets for this dataset.
The bigger problem is the mean pattern (mentioned above).
### (1 p) Assess the relative influence of points.
There are a few points that have large Cook's D,
which are those few people who smoke 30 days/month and smoke the most
and a person who smokes few days but smokes many times more.
It may be worth seeing the effect of removing those four observations.
__Seeing the effect of removing 4 influential observations...__
```{R}
tail(sort(cooks.distance(lm.fit)))
```
As seen below, removing the 4 largest Cook's D observations
(which is a rash move -- usually we only remove one at a time
since which points are influential can change based on removing
a single observation)
does not greatly improve the model fit
or normality.
```{R}
# data with deleted points with large Cook's D
nesarc.sub.D <- subset(nesarc.sub, !(rownames(nesarc.sub) %in% names(tail(sort(cooks.distance(lm.fit)), 4))))
dim(nesarc.sub)
dim(nesarc.sub.D)
lm.fit.D <- lm(TotalCigsSmokedSqrt ~ DaysSmoke, data = nesarc.sub.D)
library(ggplot2)
p <- ggplot(nesarc.sub.D, aes(x = DaysSmoke, y = TotalCigsSmokedSqrt))
p <- p + geom_vline(xintercept = 0, alpha = 0.25)
p <- p + geom_hline(yintercept = 0, alpha = 0.25)
# prediction bands
p <- p + geom_ribbon(aes(ymin = predict(lm.fit.D, data.frame(DaysSmoke)
, interval = "prediction", level = 0.95)[, 2],
ymax = predict(lm.fit.D, data.frame(DaysSmoke)
, interval = "prediction", level = 0.95)[, 3])
, alpha=0.1, fill="darkgreen")
# linear regression fit and confidence bands
p <- p + geom_smooth(method = lm, se = TRUE)
# jitter a little to uncover duplicate points
p <- p + geom_jitter(position = position_jitter(.1), alpha = 0.75)
p <- p + labs(title = "Regression with confidence and prediction bands")
print(p)
```
```{R, fig.width = 8, fig.height = 6}
# plot diagnostics
par(mfrow=c(2,3))
plot(lm.fit.D, which = c(1,4,6))
# residuals vs DaysSmoke
# can only plot observations that did not include NA for the lm.fit
plot(subset(nesarc.sub.D, !is.na(TotalCigsSmokedSqrt) & !is.na(DaysSmoke))$DaysSmoke, lm.fit.D$residuals, main="Residuals vs DaysSmoke")
# horizontal line at zero
abline(h = 0, col = "gray75")
# Normality of Residuals
library(car)
qqPlot(lm.fit.D$residuals, las = 1, id.n = 3, main="QQ Plot")
# # residuals vs order of data
# plot(lm.fit.D$residuals, main="Residuals vs Order of data")
# # horizontal line at zero
# abline(h = 0, col = "gray75")
hist(lm.fit.D$residuals, breaks=10, main="Residuals")
```
__Continuing with the original data set ...__
### (1 p) Test whether the slope is different from zero, $H_A: \beta_1 \ne 0$.
```{R}
summary(lm.fit)
```
First note that we have insufficient evidence to conclude that
$\beta_0$ is different from 0, which makes sense ---
if you smoke 0 days,
then the number of cigarettes you smoke should be 0.
We have sufficient evidence to conclude the slope is different from zero.
Furthermore, for every additional day per month a person smokes,
the predicted square root of cigarettes smoked increases by about 0.6.
When this is squared back to the original scale, the increase is substantial.
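As a sketch of the back-transformation (the intercept of 0 and slope of 0.6 are illustrative values approximating the fit above; substitute `coef(lm.fit)` for exact results), squaring the square-root-scale predictions returns them to the original scale:
```{R}
# Sketch: back-transform sqrt-scale predictions to the original scale.
# b0 and b1 are illustrative stand-ins for coef(lm.fit).
b0 <- 0    # intercept not distinguishable from 0
b1 <- 0.6  # approximate slope on the sqrt scale
days <- c(10, 20, 30)
pred.sqrt <- b0 + b1 * days  # predictions on the sqrt scale
pred.orig <- pred.sqrt^2     # squared back to cigarettes
stopifnot(all.equal(pred.orig, c(36, 144, 324)))
data.frame(days, pred.sqrt, pred.orig)
```
Note the nonlinearity: doubling the days from 10 to 20 quadruples the predicted cigarettes (36 to 144) on the original scale.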
### (1 p) Interpret the $R^2$ value.
The regression model explains
$R^2 = `r signif(summary(lm.fit)$r.squared, 3)`$
proportion of the variation
in the response
beyond what the mean value alone explains.
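A minimal sketch of what $R^2$ measures (using toy data, not the NESARC fit): it compares the residual variation left after the regression to the variation around the mean alone.
```{R}
# Sketch with toy data: R^2 = 1 - SSE/SST, the proportion of
# variation explained beyond using the mean alone.
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
fit <- lm(y ~ x)
sse <- sum(residuals(fit)^2)   # variation left after the regression
sst <- sum((y - mean(y))^2)    # variation around the mean alone
r2.manual <- 1 - sse / sst
stopifnot(all.equal(r2.manual, summary(fit)$r.squared))
r2.manual
```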
# Week12: Logistic regression and Experiments vs Observational studies
## Tuesday ---------
## Logistic Regression
If I provide the solutions here, then there's nothing for you to do for the in-class assignment ...
## Thursday ---------
## Experiments and observational studies
* (2 p) Write a few sentences describing these characteristics of the data you're using for your project.
* What is the sampling design?
* What is the target population and sampling frame?
* How was nonresponse handled? Answer this both in terms of individual
nonresponse (when a respondent is lost to follow-up, if this applies
for you (e.g., AddHealth)) and question nonresponse (when a question is
skipped).
You started doing this earlier in Week07 Tuesday, when you described the sampling.
# Week13: Complete poster in HW document
## Tuesday ---------
## Thursday ---------
### Title
Smoking behavior is (barely) associated with major depression
### 1. __(1 p)__ Introduction
* __Smoking behavior is associated with major depression.__
- Longitudinal investigations have shown that depression increases risk
of later smoking [@breslau_major_1998; @dierker_association_2001].
- Many subsequent studies find a role of major depression in increasing
the probability and amount of smoking [@dierker_defining_2004;
@rohde_psychiatric_2004; @rohde_psychiatric_2003].
* __Smoking behavior and Nicotine dependence.__
- While smoking is a necessary condition for nicotine dependence, smoking
behavior is a poor predictor of developing nicotine dependence
[@kandel_extent_2000; @stanton_antecedents_1995].
- Heavy smokers may not have nicotine dependence; conversely,
light smokers may have nicotine dependence [@kandel_extent_2000].
### 2. __(1 p)__ Research Questions
* (Week02 Research Questions) 2 bullets, one for each research question, stated as the alternative hypothesis.
The goals of the analysis include answering these two questions:
* Is there an association between major depression and nicotine dependence?
* Does the prevalence of nicotine dependence differ by ethnicity?
### 3. __(1 p)__ Methods
* (Week01 Personal Codebook) Data source(s).
A sample from the first wave of the National Epidemiologic Survey on
Alcohol and Related Conditions (NESARC) (US adults).
We focus on young adult (18--25) smokers
(`r nrow(nesarc.sub)` of `r nrow(NESARC)` respondents).
* (Week01 Personal Codebook) Variables used.
* Lifetime major depression (`Depression`, 0=No, 1=Yes) [@grant_alcohol_2003; @grant_alcohol_1995].
* Total cigarettes per month (`TotalCigsSmoked` = days smoked times daily cigs smoked, range: 0--`r max(nesarc.sub$TotalCigsSmoked)`),
a combination of these questions:
"About how often did you usually smoke in the past year?"
and "On the days that you smoked in the last year,
about how many cigarettes did you usually smoke?"
* Nicotine dependence (`TobaccoDependence`, 0=No, 1=Yes) in the last 12 months.
* (Various) Statistical methods used to answer the research questions.
* A two-sample $t$-test comparing the mean total cigarettes smoked by depression status.
* A $\chi^2$ analysis of a two-way contingency table of nicotine dependence by ethnicity.
### 4. __(1 p)__ Discussion (while this would follow your results, let's put it here so you have a full column to show the results of the analysis of both research questions)
* Put the results you found for each research question in the context you provided in your introduction.
These results support the previous literature that individuals with major
depression are more sensitive to the development of nicotine dependence
regardless of how much they smoke. Thus, people with depression are an
important population subgroup for targeted smoking intervention programs.
### 5. __(1 p)__ Further directions or Future work or Next steps or something else that indicates there is more to do and you've thought about it.
* What do these results lead you to want to investigate?
We can extend focus from young adults to all adults, and (with multiple
regression techniques we'll learn in ADA2) assess the particular depressed
subpopulations most susceptible to developing nicotine dependence.
### 6. __(1 p)__ References
* By citing sources in your introduction, this section will automatically have your bibliography.
* [Bibliography will go here]
### 7. __(2 p)__ Results for your first research question.
* "Is the population mean square-root total cigarettes smoked different for those with depression or not?"
* $H_0: \mu_{D} = \mu_{ND}$ versus $H_A: \mu_{D} \ne \mu_{ND}$
```{R, echo=FALSE}
## If we create a summary data.frame with a similar structure as our data, then we
## can annotate our plot with those summaries.
library(plyr)
est.mean.TCS.D <- ddply(nesarc.sub
, "Depression"
, summarise
, TotalCigsSmoked = mean(TotalCigsSmoked, na.rm = TRUE)
, TotalCigsSmokedSqrt = mean(TotalCigsSmokedSqrt, na.rm = TRUE)
)
# est.mean.TCS.D
# library(ggplot2)
# p <- ggplot(nesarc.sub, aes(x = Depression, y = TotalCigsSmokedSqrt))
# p <- p + geom_boxplot(width = 0.5, alpha = 0.5)
# p <- p + geom_jitter(position = position_jitter(width = 0.1), alpha = 1/4)
# # diamond at mean for each group
# p <- p + stat_summary(fun.y = mean, geom = "point", shape = 18, size = 6,
# colour = "red", alpha = 0.8)
# p <- p + labs(title = "Square-root of Total Cigarettes Smoked by Depression Status")
# print(p)
# histogram using ggplot
p <- ggplot(nesarc.sub, aes(x = TotalCigsSmoked))
p <- p + geom_histogram(binwidth = 2)
p <- p + geom_rug(alpha = 0.1)
# colour is set outside aes() so the lines are actually red (inside aes() it would be mapped as a factor level)
p <- p + geom_vline(data = est.mean.TCS.D, aes(xintercept = TotalCigsSmokedSqrt^2), colour = "red")
p <- p + facet_grid(Depression ~ .)
p <- p + scale_x_sqrt(breaks = c(0, 5, 25, 50, 100, 250, 500, 1000, 2000, 3000))
print(p)
```
```{R, echo=FALSE, results="asis"}
t.summary.TCS.D <- t.test(TotalCigsSmokedSqrt ~ Depression, nesarc.sub)
t.summary.TCS.D
#names(t.summary.TCS.D)
# checking model assumptions, looks perfect
# note that the !is.na() first finds where there's NAs, and the "!" keeps all the non-NAs
bs.two.samp.diff.dist(subset(nesarc.sub, ((Depression == "No Depression") & !is.na(TotalCigsSmokedSqrt)))$TotalCigsSmokedSqrt,
subset(nesarc.sub, ((Depression == "Yes Depression") & !is.na(TotalCigsSmokedSqrt)))$TotalCigsSmokedSqrt)
# original scale
bs.two.samp.diff.dist(subset(nesarc.sub, ((Depression == "No Depression") & !is.na(TotalCigsSmoked)))$TotalCigsSmoked,
subset(nesarc.sub, ((Depression == "Yes Depression") & !is.na(TotalCigsSmoked)))$TotalCigsSmoked)
t.summary.TCS.DXX <- t.test(TotalCigsSmoked ~ Depression, nesarc.sub)
t.summary.TCS.DXX
```
* Model assumptions are met: the sampling distribution of the difference in means is normal.
* Because $p=`r signif(t.summary.TCS.D$p.value, 3)` > 0.05$ (with $t_{s} = `r signif(t.summary.TCS.D$statistic, 4)`$),
we have insufficient evidence to reject $H_0$ at an $\alpha=0.05$ significance level,
concluding that the total cigarettes smoked does not differ by depression
status.
* The difference is very small: on the square-root scale it is `r signif(diff(t.summary.TCS.D$estimate), 3)`
and on the original scale the group means are `r t.summary.TCS.D$estimate^2`, a difference of
`r round(diff(t.summary.TCS.D$estimate^2))` cigarettes each month.
### 8. __(2 p)__ Results for your second research question.
* Is there an association between TobaccoDependence and Ethnicity?
```{R, echo=FALSE}
# Tabulate by two categorical variables:
tab.E.TD <- xtabs(~ Ethnicity + TobaccoDependence, data = nesarc.sub)
#tab.E.TD
# column proportions
#prop.table(tab.E.TD, margin = 2)
```
```{R, echo=FALSE}
chisq.E.TD <- chisq.test(tab.E.TD, correct=FALSE)
#chisq.E.TD
```
```{R, echo=FALSE}
# table of expected frequencies:
#chisq.E.TD$expected
# smallest expected frequency:
#min(chisq.E.TD$expected)
```
The model assumptions are met since the expected count for each cell is at least 5.
Because the p-value $= `r signif(chisq.E.TD$p.value, 3)` < 0.05$ ($X^2 = `r signif(chisq.E.TD$statistic, 3)`$)
we reject the null hypothesis
concluding that there is an association between
TobaccoDependence and Ethnicity.
```{R, echo=FALSE}
# The Pearson residuals
#chisq.E.TD$residuals
# The sum of the squared residuals is the chi-squared statistic:
#chisq.E.TD$residuals^2
#sum(chisq.E.TD$residuals^2)
```
```{R, echo=FALSE, fig.width = 6, fig.height = 6}
# mosaic plot
library(vcd)
# this layout gives us the interpretation we want:
#mosaic(tab.E.TD, shade=TRUE, legend=TRUE)
mosaic(tab.E.TD, shade=TRUE, legend=TRUE, rot_labels = c(0, 90, 0, 0))
#, rot_varnames = c(0, 90, 0, 0)
#, labeling = labeling_border(just_labels = "right"))
#mosaic(~ Ethnicity + TobaccoDependence, data = nesarc.sub, shade=TRUE, legend=TRUE, direction = "v")
```
### (1 p) Interpret the mosaic plot with reference to the Pearson residuals.
The primary cause of rejecting $H_0$ is that
Hispanics have less nicotine dependence than expected under independence,
seen as a large negative Pearson residual (blue shading) for that cell.
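As a sketch of how the mosaic shading arises (using a toy 2x2 table, not the NESARC data): each cell's Pearson residual is $(O - E)/\sqrt{E}$, and the squared residuals sum to the $X^2$ statistic.
```{R}
# Sketch with a toy 2x2 table: Pearson residuals (O - E)/sqrt(E)
# drive the mosaic plot shading; their squares sum to X^2.
tab.toy <- matrix(c(30, 10, 20, 40), nrow = 2)
chisq.toy <- chisq.test(tab.toy, correct = FALSE)
pearson.resid <- (tab.toy - chisq.toy$expected) / sqrt(chisq.toy$expected)
stopifnot(all.equal(sum(pearson.resid^2), unname(chisq.toy$statistic)))
pearson.resid  # negative cells are observed less often than independence predicts
```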
# Week14: Posters, finishing up
## Tuesday ---------
## Thursday ---------
## Statistics summary tables
We won't get to this in our class,
but anyone in biostatistics or epidemiology will appreciate how easy it is to
produce such an informative "Demographics" table.
```{R results='asis', echo = TRUE}
library(moonBook)
out <- mytable(Depression
+ TobaccoDependence
~
Age
+ SmokingFreq
+ DailyCigsSmoked
+ Ethnicity
+ Sex
+ DaysSmoke
+ TotalCigsSmoked
+ CigsSmokedFac
, data = nesarc.sub)
myhtml(out)
```
# References (from Week02)