---
title: "S4R: Cumulative project file, Example assignment document"
subtitle: "UNM Stat 145 Special, Statistics for Research, Spring 2019, Prof Erik Erhardt"
author: "Erik Erhardt"
date: "`r format(Sys.time(), '%B %d, %Y')`"
output:
html_document:
toc: true
toc_depth: 5
code_folding: show
bibliography: NESARC_NicotineDependence_LitReview.bib
---
*Note: Each class save this file with a new name,
updating the last two digits to the class number.
Then, you'll have a record of your progress,
as well as which files you turned in for grading.*
Starting in Class 07, we will concatenate all our WSs together to retain the
relevant information needed for subsequent classes.
You will also have an opportunity to revisit previous parts to make changes or improvements,
such as updating your codebook, modifying your research questions, improving tables and plots.
I've provided an initial predicted organization of our
sections and subsections using the \# and \#\# symbols.
A table of contents is automatically generated using the "toc: true" in the yaml
and can headings in the table of contents are clickable to jump down
to each (sub)section.
You will need to update any section symbols to be at least 3 deep (\#\#\#)
if you've copied them from an older assignment into this document.
Finally, you can delete all this header text once you don't need to refer to it.
---
# Research Questions
## Class 03 Datasets, Codebooks, Personal Codebook
### Question of interest
__Dataset__:
National Epidemiologic Survey on Alcohol and Related Conditions (NESARC),
with codebook NESARC_W1_CodeBook.pdf.
__Initial thinking__:
(My helpful narrative description to help you get going.)
_While nicotine dependence is a good starting point, I need to determine what
it is about nicotine dependence that I am interested in. It strikes me that
friends and acquaintances that I have known through the years that became
hooked on cigarettes did so across very different periods of time. Some seemed
to be dependent soon after their first few experiences with smoking and others
after many years of generally irregular smoking behavior. I decide that I am
most interested in exploring the association between level of smoking and
nicotine dependence. I add to my codebook variables reflecting smoking levels
(e.g., smoking quantity and frequency)._
__Topic of interest__:
(You need this part.)
I have decided to investigate the relationship between nicotine dependence and
the frequency and quantity of smoking on people up to 25 years old. The
association may differ by ethnicity, age, gender, and other factors.
__How I did it__:
(My helpful narrative description to help you get going.)
I look through the codebook and find some variables of interest. I searched
the text with "Ctrl-F" (find) to find these variables. For each variable, I
copy/paste the description here, then formatted it so it's organized. You can
choose to use a table or an outline format. I found this verbatim text format
to be very easy to format. I retained the "frequency" of each response because
it's interesting to know, and because it was already in the codebook --- this
value is not required for your codebook.
### Codebook
```
Dataset: NESARC
Primary association: nicotine dependence vs frequency and quantity of smoking
Key:
VarName (RenamedVarName)
Variable description
Data type (Continuous, Discrete, Nominal, Ordinal)
Frequency ItemValue Description
IDNUM (ID)
UNIQUE ID NUMBER WITH NO ALPHABETICS
Nominal
43093 1-43093. Unique Identification number
SEX (Sex)
SEX
Nominal
18518 1. Male
24575 2. Female
AGE (Age)
AGE
Continuous
43079 18-97. Age in years
14 98. 98 years or older
CHECK321 (SmokingStatus)
CIGARETTE SMOKING STATUS
Nominal
9913 1. Smoked cigarettes in the past 12 months
8078 2. Smoked cigarettes prior to the last 12 months
22 9. Unknown
25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
* change to 3. Never
TAB12MDX (TobaccoDependence)
NICOTINE DEPENDENCE IN THE LAST 12 MONTHS
Nominal
38131 0. No nicotine dependence
4962 1. Nicotine dependence
S3AQ3B1 (SmokingFreq)
USUAL FREQUENCY WHEN SMOKED CIGARETTES
Ordinal
14836 1. Every day
460 2. 5 to 6 Day(s) a week
687 3. 3 to 4 Day(s) a week
747 4. 1 to 2 Day(s) a week
409 5. 2 to 3 Day(s) a month
772 6. Once a month or less
102 9. Unknown
25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
* change to 7. Never
S3AQ3C1 (DailyCigsSmoked)
USUAL QUANTITY WHEN SMOKED CIGARETTES
Discrete
17751 1-98. Cigarette(s)
262 99. Unknown
25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
* change to 0 Cigarett(s)
ETHRACE2A (Ethnicity)
IMPUTED RACE/ETHNICITY
Nominal
24507 1. White, Not Hispanic or Latino
8245 2. Black, Not Hispanic or Latino
701 3. American Indian/Alaska Native, Not Hispanic or Latino
1332 4. Asian/Native Hawaiian/Pacific Islander, Not Hispanic or Latino
8308 5. Hispanic or Latino
MAJORDEPLIFE (Depression)
MAJOR DEPRESSION - LIFETIME (NON-HIERARCHICAL)
Nominal
35254 0. No
7839 1. Yes
```
Additional variables were created from the original variables:
```
CREATED VARIABLES
DaySmoke
estimated number of days per month a subject smokes
Continuous (due to way estimated), assumes 30 days per month using SmokingFreq)
1-30.
TotalCigsSmoked
estimated number of cigarettes per month a subject smokes (DaysSmoke * DailyCigsSmoked)
Continuous
0-large.
CigsSmokedFac
quartiles of TotalCigsSmoked
Ordinal
ranges for each of the four quarters
SmokingFreq3
three levels of smoking frequency
Ordinal
from SmokingFreq
1. Daily = 1. Every day, 2. 5 to 6 Day(s) a week, 3. 3 to 4 Day(s) a week
2. Weekly = 4. 1 to 2 Day(s) a week, 5. 2 to 3 Day(s) a month
3. Monthly = 6. Once a month or less
4. Never = 7. Never
```
## Class 04 Citations
I will use [@beck2014; @rich2013; @murp2012; @dean2014] for this work which was
obtained using `[@beck2014; @rich2013; @murp2012; @dean2014]`. Plus, @beck2014
says some interesting stuff and that citation was obtained using `@beck2014`.
For more documentation on bibliographies and citations with R Markdown, see
[http://rmarkdown.rstudio.com/authoring_bibliographies_and_citations.html](http://rmarkdown.rstudio.com/authoring_bibliographies_and_citations.html).
For general help with R Markdown, see
[https://www.rstudio.com/wp-content/uploads/2015/03/rmarkdown-reference.pdf](https://www.rstudio.com/wp-content/uploads/2015/03/rmarkdown-reference.pdf).
## Class 05 Research Questions
See Class 06 below.
## Class 06 Literature Review
__Dataset__:
National Epidemiologic Survey on Alcohol and Related Conditions (NESARC),
with codebook NESARC_W1_CodeBook.pdf.
Research question:
1. Is nicotine dependence [S3AQ10D] associated with smoking frequency [S3AQ3B1] and quantity [S3AQ3C1]?
* Google scholar search: "nicotine dependence smoking frequency"
* Citation: pdf file available for @dierker_association_2007
* Interesting points: Figures 2 and 3, quantity and frequency both positively related to probability of dependence.
* Others: @kandel_extent_2000 and @caraballo_linking_2009
*You don't need to include images in your literature review.
I'm providing these tables to illustrate what these tables look like:*
2. Is nicotine dependence [S3AQ10D] associated with depression [S4AQ1]?
* Google scholar search: "nicotine dependence depression"
* Citation: pdf file available for @breslau1995psychiatric
* Interesting points: Table 2, Smoking and Nicotine Dependence both associated with **Education**.
Table 3, Major depression associated with being nicotine dependent and **Sex**.
* Others: @breslau_major_1998
3. Is the associated between nicotine dependence [S3AQ10D] and depression [S4AQ1] different by demographics, such as Education or Sex?
* In Question 2 we see differences by Education and Sex.
I have decided to further focus my question by examining whether the
association between nicotine dependence and depression differs based on how
much a person smokes. I am wondering if at low levels of smoking compared to
high levels, nicotine dependence is more common among individuals with major
depression than those without major depression. I add relevant depression
questions/items/variables to my personal codebook as well as several
demographic measures (age, gender, ethnicity, education, etc.) and any other variables I
may wish to consider.
All required variables have been found and added to my personal codebook (by expanding Class 03).
# Data Management
## Class 07 Working With Data, Data Management
### Purpose of study: smoking and depression in young adults
(Note: This is the last paragraph of the research proposal used to remind the
reader of the research objectives.)
The present study will examine young adults from the National Epidemiologic
Survey of Alcohol and Related Conditions (NESARC). The goals of the analysis
will include 1) establishing the relationship between major depression and
nicotine dependence; and 2) determining whether or not the relationship between
major depression and nicotine dependence exists above and beyond smoking
quantity.
#### Variables
Variables from NESARC that will be used include (already in codebook, above):
* `IDNUM` (unique ID number with no alphabetics),
* `MAJORDEPLIFE` (has subject experienced a major depression?),
* `CHECK321` (has subject smoked cigarettes in past 12 months?),
* `AGE` (age of subject),
* `TAB12MDX` (tobacco dependence past 12 months),
* `S3AQ3B1` (usual frequency when cigarettes smoked),
* `ETHRACE2A` (ethnicity of subject),
* `SEX` (gender of subject), and
* `S3AQ3C1` (usual smoking quantity of cigarettes when smoked).
### Data subset columns
First, the data is placed on the search path using the `PDS` package.
```{R}
# data analysis packages
library(tidyverse) # Data manipulation and visualization suite
library(forcats) # Factor variables
library(lubridate) # Dates
# data package
library(PDS)
## Problems installing PDS package? Solution.
## If you had problems installing the PDS package, no problem; here's how to get the data:
## 1. Download the ".RData" file for your dataset into your ADA Folder.
## 2. Where I have "library(PDS)" in my code, change it to the two lines below.
## Update the "filename" to the name of the file.
##
## # library(PDS)
## load("filename.RData")
dim(NESARC)
```
The variables of interest are specified and a subset of the dataset including
only these variables is created and stored in the data frame `nesarc.sub`.
```{R}
# variables to include in our data subset
nesarc.sub <-
NESARC %>%
select(
IDNUM
, SEX
, AGE
, CHECK321
, TAB12MDX
, S3AQ3B1
, S3AQ3C1
, ETHRACE2A
, MAJORDEPLIFE
)
```
Check size and structure of data subset.
Note that categorical variables are already `Factor` variables, but the levels do not have meaningful labels, yet.
```{R}
# size of subset
dim(nesarc.sub)
# structure of subset
str(nesarc.sub)
```
#### EXTRA: AddHealth, joining waves 1 and 2 together
Joining AddHealth waves 1 and 2 together into a single dataset
can be done if you want to use variables from when
the participants were both adolecents and adults.
```{R}
# Original AddHealth datasets
library(PDS)
# size of datasets, to verify the data is available
dim(AddHealth)
dim(addhealth_public4)
# First column is the unique ID
colnames(AddHealth)[1]
colnames(addhealth_public4)[1]
# Rename W4's unique ID variable "AID" and "aid" to "ID"
# so both W1 and W4 have same variable name (lower/uppercase matters)
AddHealth <-
AddHealth %>%
rename(ID = AID)
addhealth_public4 <-
addhealth_public4 %>%
rename(ID = aid)
# note that the first column name has be updated
colnames(AddHealth)[1]
colnames(addhealth_public4)[1]
# Join two datasets by unique ID
AH14 <-
full_join(AddHealth, addhealth_public4)
# now we have twice as many columns
dim(AH14)
# Removing this large object since I don't need this data for the NESARC project
# DELETE THIS LINE if you're using AddHealth
rm(AH14)
```
### Renaming Variables
Rename variables `NewName = OriginalName`.
```{R}
nesarc.sub <-
nesarc.sub %>%
rename(
ID = IDNUM
, Sex = SEX
, Age = AGE
, SmokingStatus = CHECK321
, TobaccoDependence = TAB12MDX
, SmokingFreq = S3AQ3B1
, DailyCigsSmoked = S3AQ3C1
, Ethnicity = ETHRACE2A
, Depression = MAJORDEPLIFE
)
str(nesarc.sub)
```
The first 6 and last 6 values are given by `head()` and `tail()`.
```{R}
head(nesarc.sub)
tail(nesarc.sub)
```
A quick summary of the variables indicates many missing values.
```{R}
# summary of variables -- notice many NAs
summary(nesarc.sub)
```
## Class 08 Subsetting data and R Programming
### Coding missing values
There are two steps.
The first step is to recode any existing `NA`s to actual values, if necessary.
The method for doing this differs for numeric and categorical variables.
The second step is to recode any coded missing values, such as 9s or 99s, as actual `NA`.
#### Coding `NA`s as meaningful "missing"
First step: the existing blank values with `NA` mean "never",
and "never" has a meaning different from "missing".
For each variable we need to decide what "never" means
and code it appropriately.
##### `NA`s recoded as numeric
The easiest is `DailyCigsSmoked` where `NA` means 0 cigarettes.
We replace `NA` with 0.
The function `is.na()` identifies each observation that "is an `NA`".
I've updated the codebook to indicate that the original `NA` values were changed.
```
S3AQ3C1 (DailyCigsSmoked)
USUAL QUANTITY WHEN SMOKED CIGARETTES
Discrete
17751 1-98. Cigarette(s)
262 99. Unknown
25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
* change to 0 Cigarett(s)
```
```{R}
table(nesarc.sub$DailyCigsSmoked)
nesarc.sub <-
nesarc.sub %>%
mutate(
# replace NA with 0
DailyCigsSmoked = replace(DailyCigsSmoked, is.na(DailyCigsSmoked), 0)
)
table(nesarc.sub$DailyCigsSmoked)
```
##### `NA`s recoded as categorical
The next two variables require defining a new category level.
Both `SmokingStatus` and `SmokingFreq` with a value of `NA` means "never",
so we need to create a new "Never" category and put in order
with the rest of the levels.
I've updated the codebook to indicate that the original `NA` values were changed.
```
CHECK321 (SmokingStatus)
CIGARETTE SMOKING STATUS
Nominal
9913 1. Smoked cigarettes in the past 12 months
8078 2. Smoked cigarettes prior to the last 12 months
22 9. Unknown
25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
* change to 3. Never
S3AQ3B1 (SmokingFreq)
USUAL FREQUENCY WHEN SMOKED CIGARETTES
Ordinal
14836 1. Every day
460 2. 5 to 6 Day(s) a week
687 3. 3 to 4 Day(s) a week
747 4. 1 to 2 Day(s) a week
409 5. 2 to 3 Day(s) a month
772 6. Once a month or less
102 9. Unknown
25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
* change to 7. Never
```
```{R}
table(nesarc.sub$SmokingStatus )
table(nesarc.sub$SmokingFreq )
nesarc.sub <-
nesarc.sub %>%
mutate(
# first change the "factor" variables to "character" to add a new level
SmokingStatus = as.character(SmokingStatus)
, SmokingFreq = as.character(SmokingFreq )
# replace NA with new category values
, SmokingStatus = replace(SmokingStatus , is.na(SmokingStatus), "3")
, SmokingFreq = replace(SmokingFreq , is.na(SmokingFreq) , "7")
)
table(nesarc.sub$SmokingStatus )
table(nesarc.sub$SmokingFreq )
```
#### Coding 9s and 99s as `NA`s
Second step: replace the coded missing values with `NA`,
then remove the remaining records with missing values.
Not all of these variables need this operation, since the previous subset commands
removed missing values by chance.
However, I'm going to run each as an extra precaution.
Afterall, later, if we were to make different previous data decisions,
it's possible we could end up with missing values at this point that
are incorrectly coded without this step!
```{R}
table(nesarc.sub$SmokingStatus )
table(nesarc.sub$SmokingFreq )
table(nesarc.sub$DailyCigsSmoked)
nesarc.sub <-
nesarc.sub %>%
mutate(
# replace missing values
SmokingStatus = replace(SmokingStatus , SmokingStatus %in% c("9"), NA)
, SmokingFreq = replace(SmokingFreq , SmokingFreq %in% c("9"), NA)
, DailyCigsSmoked = replace(DailyCigsSmoked, DailyCigsSmoked %in% c( 99), NA)
# drop unused factor levels
, SmokingStatus = fct_drop(SmokingStatus)
, SmokingFreq = fct_drop(SmokingFreq )
)
table(nesarc.sub$SmokingStatus )
table(nesarc.sub$SmokingFreq )
table(nesarc.sub$DailyCigsSmoked)
summary(nesarc.sub)
```
Finally, consider using `na.omit()` to remove any records with missing values
if it won't cause issues with analysis.
The command `na.omit()` removes a row if any columns have an `NA` in that row.
This can have __drastic__ consequences.
For example, imagine that you have 7 variables and 6 variables have complete data,
but the 7th variable is about three-quarters `NA`s!
Then `na.omit()` will drop three-quarters of your data,
even though most of your variables are complete!
There's only one advantage I can think of for running `na.omit()`,
and that's if you have a large dataset and only a few observations have a few `NA`s.
You can't analyze data with `NA`s,
since any calculation involving `NA` results in an `NA`.
Therefore imagine two scenarios:
(1) An analysis with Variables 1 and 2, each with a few `NA` distributed throughout;
first the observations with `NA` for either Variable 1 or 2 will be removed, then
the analysis will proceed.
(2) A similar analysis, but now with Variables 1 and 3, but Variable 3 has `NA`
for different observations than Variable 2 --- therefore, different observations
will be removed before analysis.
In summary, these two analyses will include different observations from the dataset.
Therefore, by running `na.omit()` you will start with a dataset with no `NA`s,
thus every analysis will be of exactly the same set of data
(at the cost of not using all the possible data for each individual analysis).
Below, I do __not run `na.omit()`__ since I'm concerned about having lots of `NA`s for some variables.
```{R}
dim(nesarc.sub)
# remove rows with missing values
#nesarc.sub <- na.omit(nesarc.sub)
dim(nesarc.sub)
summary(nesarc.sub)
```
Now we have a dataset with no missing values.
### Creating new variables
Additional variables that we created from the original variables are listed in
this ammendment to the original codebook.
```
CREATED VARIABLES
DaySmoke
estimated number of days per month a subject smokes
Continuous (due to way estimated), assumes 30 days per month using SmokingFreq)
1-30.
TotalCigsSmoked
estimated number of cigarettes per month a subject smokes (DaysSmoke * DailyCigsSmoked)
Continuous
0-large.
CigsSmokedFac
quartiles of TotalCigsSmoked
Ordinal
ranges for each of the four quarters
SmokingFreq3
three levels of smoking frequency
Ordinal
from SmokingFreq
1. Daily = 1. Every day, 2. 5 to 6 Day(s) a week, 3. 3 to 4 Day(s) a week
2. Weekly = 4. 1 to 2 Day(s) a week, 5. 2 to 3 Day(s) a month
3. Monthly = 6. Once a month or less
4. Never = 7. Never
```
#### From categories to numeric
`DaysSmoke` estimates the days per month a subject smokes
by converting `SmokingFreq` (a factor with 6 levels) to a numeric variable using `as.numeric()`
and multiplying by the midpoint of the range of `SmokingFreq`.
```{R}
nesarc.sub <-
nesarc.sub %>%
mutate(
DaysSmoke =
case_when(
SmokingFreq == 1 ~ 30 # 1. Every day
, SmokingFreq == 2 ~ 4 * 5.5 # 2. 5 to 6 Day(s) a week
, SmokingFreq == 3 ~ 4 * 3.5 # 3. 3 to 4 Day(s) a week
, SmokingFreq == 4 ~ 4 * 1.5 # 4. 1 to 2 Day(s) a week
, SmokingFreq == 5 ~ 2.5 # 5. 2 to 3 Day(s) a month
, SmokingFreq == 6 ~ 1 # 6. Once a month or less
, SmokingFreq == 7 ~ 0 # 7. Never
)
)
summary(nesarc.sub$DaysSmoke)
```
#### From numeric to categories based on quantiles
The variable `TotalCigsSmoked` estimates the monthly number of cigarettes
a subject smokes per month by multiplying `DaysSmoke` times `DailyCigsSmoked`
(the usual quantity smoked per day).
```{R}
nesarc.sub <-
nesarc.sub %>%
mutate(
TotalCigsSmoked = DaysSmoke * DailyCigsSmoked
)
```
The numeric variable `TotalCigsSmoked` is converted into a factor
with four roughly equivalent numbers (quartiles) stored in each level of the
factor `CigsSmokedFac` using the `cut` function.
The label `"Q1"` indicates that the original value was in the first quarter (25%) of the ordered data,
that is, they are among the smallest 25% observations;
while `"Q4"` indicates the original value was in the last quarter (25%) of the ordered data,
that is, they are among the largest 25% observations.
```{R}
quantile(nesarc.sub$TotalCigsSmoked, na.rm = TRUE)
nesarc.sub <-
nesarc.sub %>%
mutate(
CigsSmokedFac =
.bincode(TotalCigsSmoked
, breaks = quantile(TotalCigsSmoked, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)
, include.lowest = TRUE
)
, CigsSmokedFac = factor( CigsSmokedFac
, levels = c(1, 2, 3, 4)
, labels = c("Q1", "Q2", "Q3", "Q4")
)
)
```
In my data, more than half of the values are 0s which is the minimum value,
thus they all get grouped in "Q1"; this leaves no values for "Q2" -- oh well.
```{R}
summary(nesarc.sub$CigsSmokedFac)
```
#### From many categories to a few
The variable `SmokingFreq3` collapses the `SmokingFreq` from 6 down to 3 categories.
This is an example to show you how to do this with your own variables.
Some of you will have categorical variables that have a dozen or so categories;
having so many categories makes interpretation very difficulty --- fewer is easier.
```{R}
## SmokingFreq3 variable
# notice that I'm doing this before I've assigned the labels
# if you do this after assigning labels (with the factor() command), then you'll want to use "as.numeric()" as below
# inspect original variable
table(nesarc.sub$SmokingFreq)
# look at numeric values
table(as.numeric(nesarc.sub$SmokingFreq))
# Note: it is sometimes easier to use the numeric "levels" of a factor variable, than the character "labels"
# initialize new variable with NAs
nesarc.sub <-
nesarc.sub %>%
mutate(
SmokingFreq3 =
fct_collapse(
SmokingFreq
, "1" = c("1", "2", "3") # 1. daily
, "2" = c("4", "5") # 2. weekly
, "3" = "6" # 3. monthly
, "4" = "7" # 7. never
)
)
# new variable
table(nesarc.sub$SmokingFreq3)
# we will label the new factor variable below, with the others
```
#### Review results of new variables
Look at the first 10 observations in the dataset, with special attention to the
new variables.
```{R}
head(nesarc.sub, 10)
```
### Labeling Categorical variable levels
Look at summary of dataset to see the state of the labelling before we make changes.
```{R}
summary(nesarc.sub)
```
Informative labels are given to the factor levels.
The order of the levels is also rearranged for the variables
`SmokingFreq`, `TobaccoDependence`, and `Sex`.
```{R}
# Label factors with meaningful names in the same order as the codebook
nesarc.sub$Depression <-
factor(
nesarc.sub$Depression
, labels = c("No Depression"
, "Yes Depression"
)
)
nesarc.sub$SmokingStatus <-
factor(
nesarc.sub$SmokingStatus
, labels = c("Smoked past 12 months"
, "Smoked prior to 12 months"
, "Never smoked"
)
)
# Specify the order of the levels to be different from codebook order
nesarc.sub$Sex <-
factor(nesarc.sub$Sex
, levels = c( 2
, 1
)
, labels = c( "Female"
, "Male"
)
)
# check ordering with a frequency table
table(nesarc.sub$Sex)
nesarc.sub$TobaccoDependence <-
factor(nesarc.sub$TobaccoDependence
, levels = c( 1
, 0
)
, labels = c( "Nicotine Dependence"
, "No Nicotine Dependence"
)
)
# Remember the new category level from the original `NA` values.
nesarc.sub$SmokingFreq <-
factor(nesarc.sub$SmokingFreq
, levels = c( 7
, 6
, 5
, 4
, 3
, 2
, 1
)
, labels = c( "Never"
, "Once a month or less"
, "2 to 3 Days/month"
, "1 to 2 Days/week"
, "3 to 4 Days/week"
, "5 to 6 Days/week"
, "Every Day"
)
)
nesarc.sub$Ethnicity <-
factor(nesarc.sub$Ethnicity
# shorter labels are helpful for plots
, labels = c( "Cauc"
, "AfAm"
, "NaAm"
, "Asia"
, "Hisp"
)
# , labels = c("Caucasian"
# , "African American"
# , "Native American"
# , "Asian"
# , "Hispanic"
# )
)
nesarc.sub$SmokingFreq3 <-
factor(nesarc.sub$SmokingFreq3
, levels = c( 1
, 2
, 3
, 4
)
, labels = c( "Daily"
, "Weekly"
, "Monthly"
, "Never"
)
)
summary(nesarc.sub)
```
### Data subset rows to young smokers
In practice, I would probably subset the data much earlier.
In this example I waited until the end to subset so that I could illustrate
all of the other steps with complete values.
In our example,
we are interested in the _subset of people 25 or younger_
who _smoked in the last 12 months_ and this is
created using the `filter()` function.
Note that `SmokingStatus == 1` is used to choose people who
smoked in the past 12 months.
Notice, the missing values have been removed
(since `NA`s are not less than or equal to numbers).
```{R}
# the subset command below will not include a row that evalutes to NA for the
# specified subset. For example:
(NA == 1)
# age and smoking subset
nesarc.sub <-
nesarc.sub %>%
filter(
Age <= 25 # people 25 or younger
, SmokingStatus == "Smoked past 12 months" # smoked in the past 12 months
)
dim(nesarc.sub)
summary(nesarc.sub)
```
### EXTRA: Working with Dates
Dates can be tricky in R. The `lubridate` package makes them much easier.
Below I'm going to create a variable for the person's age in years
based on the person's date of birth and the interview date.
There is an `AGE` variable which I'm already using,
but if there wasn't, or I wanted to be more precise about their age (such as, 24.34 years old),
then this is a way to do it.
I'm going to create a separate data frame just for this example to show how to do it;
if you needed these variables, they should be part of your original subset above.
Below, the "C" variables (`CDAY`) are the interview day, month, and year,
and the "DOB" variables (`DOBD`) are the date of birth dat, month, and year.
```{R}
dat.sub <- subset(NESARC, select = c(CDAY, CMON, CYEAR, DOBD, DOBM, DOBY))
str(dat.sub)
```
Combine the date fields together, then interpret them into a date object.
I'll do this for both the interview date and the date of birth.
I print the head (first 6 observations) after each operation.
```{R}
## interview date
# combine day, month, and year into one text string
cdate.text <- paste(dat.sub$CDAY, dat.sub$CMON, dat.sub$CYEAR)
head(cdate.text)
# create date object by interpretting the day, month, year with dmy()
cdate.date <- dmy(cdate.text)
head(cdate.date)
## date of birth
# combine day, month, and year into one text string
dob.text <- paste(dat.sub$DOBD, dat.sub$DOBM, dat.sub$DOBY)
head(dob.text)
# create date object by interpretting the day, month, year with dmy()
# if any failed to parse, it's because a day, month, or year is out of range
# such as "99" for missing values or refused to answer
# in these cases, the date is NA (missing), which is what we want
dob.date <- dmy(dob.text)
head(dob.date)
```
Now that we have the interview date and date of birth as date objects,
we calculate the difference to give the person's age at the time of the interview.
Because the difference of dates is in the unit "days", we convert to years.
```{R}
# difference in days / 365.25 = difference in years
age.days <- cdate.date - dob.date
head(age.days)
age.years <- as.numeric(age.days / 365.25) # change from "difftime" to "numeric"
head(age.years)
# difference between the age we calculated and the age in the original dataset
head(age.years - NESARC$AGE)
```
Add the new age variable to your data frame,
drop the ones you don't need anymore,
and you're ready to go!
```{R}
# add the variable
dat.sub$age.years <- age.years
str(dat.sub)
# drop the original ones you don't need anymore
# by using subset() and select = -VAR, it excludes the VAR column
dat.sub <-
dat.sub %>%
select(-c(CDAY, CMON, CYEAR, DOBD, DOBM, DOBY))
str(dat.sub)
```
Now you've got an age (or other date object) you can use in your project.
# Graphing
## Class 11 Graphing Univariate
### Categorical variables
#### Tables for categorical variables
Basic tables can be created using the functions `table` or `xtabs`.
Frequency tables are created for the variables
`TobaccoDependence`, `CigsSmokedFac`, and `SmokingFreq`.
When only looking at one variable, `table` is the same as `xtabs`, but
`xtabs` will make looking at two-way, three-way, and higher tables possible.
```{R}
# table() produces a one-varible frequency table
table(nesarc.sub$TobaccoDependence)
# proportions available by passing these frequencies to prop.table()
table(nesarc.sub$TobaccoDependence) %>% prop.table()
# xtabs() can be used for one or more variables
T1 <- xtabs( ~ TobaccoDependence, data = nesarc.sub)
T1
# proportions available with prop.table()
prop.table(T1)
T2 <- xtabs( ~ SmokingFreq, data = nesarc.sub)
T2
```
(Look at the .Rmd code to see that these numbers in the next paragraph are
calculated directly from the numbers summarized in the table!
This way, if the data changes, the text will be correctly updated.)
In the data frame `nesarc.sub`, there are
`r T1[1]` nicotine dependent subjects and
`r T1[2]` subjects that are not nicotine dependent.
Most of the subjects in `nesarc.sub` are daily smokers
(`r T2[6]`) with the remainder distributed uniformly across the
first five levels of `SmokingFreq`.
#### Graphing frequency tables
The barplots are all created with the package `ggplot2`.
The barplots start with the defaults for the `geom_bar` and add more detail to the plot with each graph.
```{R}
library(ggplot2)
p1 <- ggplot(data = nesarc.sub, aes(x = TobaccoDependence))
p1 <- p1 + geom_bar()
p1
p2 <- ggplot(data = nesarc.sub, aes(x = SmokingFreq))
p2 <- p2 + geom_bar()
p2 <- p2 + labs(x = "", y = "Total Number", title = "Smoking Frequency for Young Smoking Adults")
p2 <- p2 + theme_bw()
p2 <- p2 + theme(axis.text.x = element_text(angle = 90, vjust = 0, hjust = 1))
p2
```
#### Removing NAs from factor axes
When a factor variable has `NA`s,
then `NA` appears as a level with frequences (see bar plots above).
As of writing this, `ggplot()` does not have a way to remove the `NA`s before plotting.
Therefore, we have to do it manually.
The easiest and most transparent way is to use the `drop_na()` function below
where we only plot the non-`NA` values.
Compare the two plots below of the same data,
but the second plot had the `NA`s removed before plotting.
```{R, fig.height = 4, fig.width = 8}
library(ggplot2)
# Original plot with NAs
p1 <- ggplot(data = nesarc.sub, aes(x = SmokingFreq))
p1 <- p1 + geom_bar()
p1 <- p1 + labs(title = "Basic plot")
p1 <- p1 + theme(axis.text.x = element_text(angle = 90, vjust = 0, hjust = 1))
#p1
# subset excluded the NAs for the variable being plotted
# "subset() specifies the values to keep,
# and "!is.na()" means "keep the non-NA values"
#p2 <- ggplot(data = subset(nesarc.sub, !is.na(CigsSmokedFac)), aes(x = CigsSmokedFac))
p2 <- ggplot(data = nesarc.sub %>% drop_na(SmokingFreq), aes(x = SmokingFreq))
p2 <- p2 + geom_bar()
p2 <- p2 + labs(title = "Using drop_na()")
p2 <- p2 + theme(axis.text.x = element_text(angle = 90, vjust = 0, hjust = 1))
#p2
# grid.arrange() is a way to arrange several ggplot objects
library(gridExtra)
grid.arrange(grobs = list(p1, p2), nrow=1, top = "Excluding NAs from factor variable in ggplot")
```
#### Final versions
```{R}
library(ggplot2)
p1 <- ggplot(data = nesarc.sub, aes(x = TobaccoDependence))
p1 <- p1 + geom_bar()
p1 <- p1 + labs(x = ""
, y = "Total Number"
, title = "Nicotine Dependence for Young Smoking Adults"
)
print(p1)
```
**Interpretation:**
Among smokers 25 years and younger,
there are slightly more (`r (prop.table(T1)[1] - prop.table(T1)[2]) %>% round(3) * 100`% more)
people who are
nicotine dependent (`r prop.table(T1)[1] %>% round(3) * 100`%)
compared to those who are not nicotine dependent (`r prop.table(T1)[2] %>% round(3) * 100`%).
```{R}
p2 <- ggplot(data = nesarc.sub %>% drop_na(SmokingFreq), aes(x = SmokingFreq))
p2 <- p2 + geom_bar()
p2 <- p2 + labs(x = ""
, y = "Total Number"
, title = "Smoking Frequency for Young Smoking Adults"
)
p2 <- p2 + theme(axis.text.x = element_text(angle = 90, vjust = 0, hjust = 1))
print(p2)
```
**Interpretation:**
Among smokers 25 years and younger,
most smoke every day (`r prop.table(T2)[7] %>% round(3) * 100`%)
while the less frequent categories are roughly evenly distributed
around 4% to 5%.
### Numeric variables
#### Graphing numeric variables
One popular way to graph a continuous variable is to use a histogram.
`R` has the base function `hist()` which can be used for this purpose.
We will also use the package `ggplot2` to create histograms.
We start with a basic histogram of the variable `TotalCigsSmoked`.
```{R}
hist(nesarc.sub$TotalCigsSmoked)
```
Experiment with the code in the next code chunk to set the `binwidth = ` (extremely important!),
decide if you need `boundary = 0`,
add a title, and if needed
the labels on the $x$- and $y$-axes.
```{R}
p <- ggplot(data = nesarc.sub, aes(x = TotalCigsSmoked))
p <- p + geom_histogram(binwidth = 200)
p <- p + geom_rug()
p <- p + labs(x = "Estimated Cigarettes Smoked per Month")
p <- p + theme_bw()
p
# Use "boundary = 0" for values that can't be negative
p <- ggplot(data = nesarc.sub, aes(x = TotalCigsSmoked))
p <- p + geom_histogram(boundary = 0, binwidth = 100)
p <- p + geom_rug()
p <- p + labs(x = "Estimated Cigarettes Smoked per Month", title = "boundary at 0")
p <- p + theme_bw()
p
```
#### Creating Density Plots
A smoothed density curve can be added with the `geom_density()` function,
where `adjust = 2.0` is a fairly smooth curve.
Experiment with `adjust` values such as 1.0, 1.5, 2.0, and 3.0 to decrease or increase the smoothness
and decide what level best captures the general shape of your distribution.
```{R}
p <- ggplot(data = nesarc.sub, aes(x = TotalCigsSmoked))
p <- p + geom_histogram(aes(y = ..density..), boundary = 0, binwidth = 100)
p <- p + geom_rug()
p <- p + geom_density(alpha = 0.2, fill = "gray50", adjust = 1.5)
p <- p + labs(x = "Estimated Cigarettes Smoked per Month")
p <- p + theme_bw()
p
```
#### Shape, center, and spread
```{R}
summary(nesarc.sub$TotalCigsSmoked)
fivenum(nesarc.sub$TotalCigsSmoked)
median(nesarc.sub$TotalCigsSmoked, na.rm = TRUE)
IQR(nesarc.sub$TotalCigsSmoked, na.rm = TRUE)
```
The "shape" of the distribution for the estimated cigarettes smoked per month
is skewed to the right.
The "center" (median) of the distribution is
`r median(nesarc.sub$TotalCigsSmoked, na.rm = TRUE)`
and the "spread" (interquartile range, middle 50%) for the distribution is
`r IQR(nesarc.sub$TotalCigsSmoked, na.rm = TRUE)`.
#### Final versions
```{R}
p <- ggplot(data = nesarc.sub, aes(x = TotalCigsSmoked))
p <- p + geom_histogram(aes(y = ..density..), boundary = 0, binwidth = 100)
p <- p + geom_rug()
p <- p + geom_density(alpha = 0.2, fill = "gray50", adjust = 1.5)
p <- p + labs(x = "Estimated Cigarettes Smoked per Month"
, y = ""
, title = "Total cigaretts smoked for Young Smoking Adults"
)
p <- p + theme_bw()
p
```
**Interpretation:**
Among smokers 25 years and younger,
the number of cigarettes smoked is right skewed
with a median (center) of `r median(nesarc.sub$TotalCigsSmoked, na.rm = TRUE)`
and the IQR (spread) (interquartile range, middle 50%) for the distribution is
`r IQR(nesarc.sub$TotalCigsSmoked, na.rm = TRUE)`.
There are three distinct modes (peaks) in the distribution
because of how we constructed this variable from two others,
and because of the common numbers of cigarettes being relative to a pack size (20).
There are several outlying values greater than 1200 (representing 40 cigarettes per day)
that are possibly overestimates from the people responding.
```{R}
p <- ggplot(data = nesarc.sub, aes(x = DailyCigsSmoked))
p <- p + geom_histogram(aes(y = ..density..), boundary = 0, binwidth = 2)
p <- p + geom_rug()
p <- p + geom_density(alpha = 0.2, fill = "gray50", adjust = 1.5)
p <- p + labs(x = "Estimated Cigarettes Smoked per Month"
, y = ""
, title = "Total cigaretts smoked for Young Smoking Adults"
)
p <- p + theme_bw()
p
```
**Interpretation:**
Among smokers 25 years and younger,
most smoke less than 10 cigarettes per day with another peak at 20 (1 pack per day),
another small peak at 40 (2 packs per day),
and some extreme outlying values at 60, 80, and 100 (3, 4, and 5 packs per day).
## Class 13 Graphing Bivariate
# Statistical methods
## Class 17 Hypothesis Testing
## Class 19 ANOVA
## Class 21 Contingency tables
## Class 23 Correlation and Interactions
## Class 25 Linear Regression
# Poster
## Class 27 Poster Presentation
# References