Note: Each class save this file with a new name, updating the last two digits to the class number. Then, you’ll have a record of your progress, as well as which files you turned in for grading.
Starting in Class 07, we will concatenate all our WSs together to retain the relevant information needed for subsequent classes. You will also have an opportunity to revisit previous parts to make changes or improvements, such as updating your codebook, modifying your research questions, improving tables and plots. I’ve provided an initial predicted organization of our sections and subsections using the # and ## symbols. A table of contents is automatically generated using the “toc: true” in the yaml and can headings in the table of contents are clickable to jump down to each (sub)section.
You will need to update any section symbols to be at least 3 deep (###) if you’ve copied them from an older assignment into this document.
Finally, you can delete all this header text once you don’t need to refer to it.
This document is organized by Class number.
Consider your readers (grader):
# I set some GLOBAL R chunk options here.
# (to hide this message add "echo=FALSE" to the code chunk options)
# In particular, see the fig.height and fig.width (in inches)
# and notes about the cache option.
knitr::opts_chunk$set(comment = NA, message = FALSE, warning = FALSE, width = 100)
knitr::opts_chunk$set(fig.align = "center", fig.height = 4, fig.width = 6)
# Note: The "cache=TRUE" option will save the computations of code chunks
# so R it doesn't recompute everything every time you recompile.
# This can save _tons of time_ if you're working on a small section of code
# at the bottom of the document.
# Code chunks will be recompiled only if they are edited.
# The autodep=TRUE will also update dependent code chunks.
# A folder is created with the cache in it -- if you delete the folder, then
# all the code chunks will recompute again.
# ** If things are working as expected, or I want to freshly compute everything,
# I delete the *_cache folder.
knitr::opts_chunk$set(cache = TRUE) #, autodep=TRUE) #$
Dataset: National Epidemiologic Survey on Alcohol and Related Conditions (NESARC), with codebook NESARC_W1_CodeBook.pdf.
Initial thinking: (My helpful narrative description to help you get going.) While nicotine dependence is a good starting point, I need to determine what it is about nicotine dependence that I am interested in. It strikes me that friends and acquaintances that I have known through the years that became hooked on cigarettes did so across very different periods of time. Some seemed to be dependent soon after their first few experiences with smoking and others after many years of generally irregular smoking behavior. I decide that I am most interested in exploring the association between level of smoking and nicotine dependence. I add to my codebook variables reflecting smoking levels (e.g., smoking quantity and frequency).
Topic of interest: (You need this part.) I have decided to investigate the relationship between nicotine dependence and the frequency and quantity of smoking on people up to 25 years old. The association may differ by ethnicity, age, gender, and other factors.
How I did it: (My helpful narrative description to help you get going.) I look through the codebook and find some variables of interest. I searched the text with “Ctrl-F” (find) to find these variables. For each variable, I copy/paste the description here, then formatted it so it’s organized. You can choose to use a table or an outline format. I found this verbatim text format to be very easy to format. I retained the “frequency” of each response because it’s interesting to know, and because it was already in the codebook — this value is not required for your codebook.
Dataset: NESARC
Primary association: nicotine dependence vs frequency and quantity of smoking
Key:
VarName (RenamedVarName)
Variable description
Data type (Continuous, Discrete, Nominal, Ordinal)
Frequency ItemValue Description
IDNUM (ID)
UNIQUE ID NUMBER WITH NO ALPHABETICS
Nominal
43093 1-43093. Unique Identification number
SEX (Sex)
SEX
Nominal
18518 1. Male
24575 2. Female
AGE (Age)
AGE
Continuous
43079 18-97. Age in years
14 98. 98 years or older
CHECK321 (SmokingStatus)
CIGARETTE SMOKING STATUS
Nominal
9913 1. Smoked cigarettes in the past 12 months
8078 2. Smoked cigarettes prior to the last 12 months
22 9. Unknown
25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
* change NA to 3. Never
* change 9 to NA
TAB12MDX (TobaccoDependence)
NICOTINE DEPENDENCE IN THE LAST 12 MONTHS
Nominal
38131 0. No nicotine dependence
4962 1. Nicotine dependence
S3AQ3B1 (SmokingFreq)
USUAL FREQUENCY WHEN SMOKED CIGARETTES
Ordinal
14836 1. Every day
460 2. 5 to 6 Day(s) a week
687 3. 3 to 4 Day(s) a week
747 4. 1 to 2 Day(s) a week
409 5. 2 to 3 Day(s) a month
772 6. Once a month or less
102 9. Unknown
25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
* change NA to 7. Never
* change 9 to NA
S3AQ3C1 (DailyCigsSmoked)
USUAL QUANTITY WHEN SMOKED CIGARETTES
Discrete
17751 1-98. Cigarette(s)
262 99. Unknown
25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
* change NA to 0 Cigarett(s)
* change 99 to NA
ETHRACE2A (Ethnicity)
IMPUTED RACE/ETHNICITY
Nominal
24507 1. White, Not Hispanic or Latino
8245 2. Black, Not Hispanic or Latino
701 3. American Indian/Alaska Native, Not Hispanic or Latino
1332 4. Asian/Native Hawaiian/Pacific Islander, Not Hispanic or Latino
8308 5. Hispanic or Latino
MAJORDEPLIFE (Depression)
MAJOR DEPRESSION - LIFETIME (NON-HIERARCHICAL)
Nominal
35254 0. No
7839 1. Yes
S1Q10A (IncomePersonal)
TOTAL PERSONAL INCOME IN LAST 12 MONTHS
Continuous
43093 0-3000000. Income in dollars
S1Q24FT (Height_ft)
HEIGHT: FEET
42363 4-7. Feet
730 99. Unknown
* change 99 to NA
S1Q24IN (Height_in)
HEIGHT: INCHES
Continuous
3572 0. None
38760 1-11. Inches
761 99. Unknown
* change 99 to NA
S1Q24LB (Weight_lb)
WEIGHT: POUNDS
Continuous
41717 62-500. Pounds
1376 999. Unknown
* change 999 to NA
Additional variables were created from the original variables:
CREATED VARIABLES
DaySmoke
estimated number of days per month a subject smokes
Continuous (due to way estimated), assumes 30 days per month using SmokingFreq)
1-30.
TotalCigsSmoked
estimated number of cigarettes per month a subject smokes (DaysSmoke * DailyCigsSmoked)
Continuous
0-large.
CigsSmokedFac
quartiles of TotalCigsSmoked
Ordinal
ranges for each of the four quarters
SmokingFreq3
three levels of smoking frequency
Ordinal
from SmokingFreq
1. Daily = 1. Every day, 2. 5 to 6 Day(s) a week, 3. 3 to 4 Day(s) a week
2. Weekly = 4. 1 to 2 Day(s) a week, 5. 2 to 3 Day(s) a month
3. Monthly = 6. Once a month or less
4. Never = 7. Never
Height_inches
Total height in inches
Height_ft * 12 + Height_in
I will use (Beckschäfer et al. 2014; Richert 2013; Murphy 2012; Dean and ebrary, Inc 2014) for this work which was obtained using [@beck2014; @rich2013; @murp2012; @dean2014]
. Plus, Beckschäfer et al. (2014) says some interesting stuff and that citation was obtained using @beck2014
. For more documentation on bibliographies and citations with R Markdown, see http://rmarkdown.rstudio.com/authoring_bibliographies_and_citations.html. For general help with R Markdown, see https://www.rstudio.com/wp-content/uploads/2015/03/rmarkdown-reference.pdf.
See Class 06 below.
Dataset: National Epidemiologic Survey on Alcohol and Related Conditions (NESARC), with codebook NESARC_W1_CodeBook.pdf.
Research question:
You don’t need to include images in your literature review. I’m providing these tables to illustrate what these tables look like:
I have decided to further focus my question by examining whether the association between nicotine dependence and depression differs based on how much a person smokes. I am wondering if at low levels of smoking compared to high levels, nicotine dependence is more common among individuals with major depression than those without major depression. I add relevant depression questions/items/variables to my personal codebook as well as several demographic measures (age, gender, ethnicity, education, etc.) and any other variables I may wish to consider.
All required variables have been found and added to my personal codebook (by expanding Class 03).
(Note: This is the last paragraph of the research proposal used to remind the reader of the research objectives.)
The present study will examine young adults from the National Epidemiologic Survey of Alcohol and Related Conditions (NESARC). The goals of the analysis will include 1) establishing the relationship between major depression and nicotine dependence; and 2) determining whether or not the relationship between major depression and nicotine dependence exists above and beyond smoking quantity.
Variables from NESARC that will be used include (already in codebook, above):
IDNUM
(unique ID number with no alphabetics),MAJORDEPLIFE
(has subject experienced a major depression?),CHECK321
(has subject smoked cigarettes in past 12 months?),AGE
(age of subject),TAB12MDX
(tobacco dependence past 12 months),S3AQ3B1
(usual frequency when cigarettes smoked),ETHRACE2A
(ethnicity of subject),SEX
(gender of subject), andS3AQ3C1
(usual smoking quantity of cigarettes when smoked).First, the data is placed on the search path using the PDS
package.
# data analysis packages
library(tidyverse) # Data manipulation and visualization suite
library(forcats) # Factor variables
library(lubridate) # Dates
# data package
library(PDS)
## Problems installing PDS package? Solution.
## If you had problems installing the PDS package, no problem; here's how to get the data:
## 1. Download the ".RData" file for your dataset into your ADA Folder.
## 2. Where I have "library(PDS)" in my code, change it to the two lines below.
## Update the "filename" to the name of the file.
##
## # library(PDS)
## load("filename.RData")
dim(NESARC)
[1] 43093 3008
The variables of interest are specified and a subset of the dataset including only these variables is created and stored in the data frame nesarc.sub
.
# variables to include in our data subset
nesarc.sub <-
NESARC %>%
select(
IDNUM
, SEX
, AGE
, CHECK321
, TAB12MDX
, S3AQ3B1
, S3AQ3C1
, ETHRACE2A
, MAJORDEPLIFE
, S1Q10A
, S1Q24FT
, S1Q24IN
, S1Q24LB
)
Check size and structure of data subset. Note that categorical variables are already Factor
variables, but the levels do not have meaningful labels, yet.
# size of subset
dim(nesarc.sub)
[1] 43093 13
# structure of subset
str(nesarc.sub)
'data.frame': 43093 obs. of 13 variables:
$ IDNUM : Factor w/ 43093 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
$ SEX : Factor w/ 2 levels "1","2": 1 2 2 1 1 2 1 2 2 2 ...
$ AGE : num 23 28 81 18 36 34 19 84 29 18 ...
$ CHECK321 : Factor w/ 3 levels "1","2","9": NA NA NA NA NA NA NA NA NA NA ...
$ TAB12MDX : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ S3AQ3B1 : Factor w/ 7 levels "1","2","3","4",..: NA NA NA NA NA NA NA NA NA NA ...
$ S3AQ3C1 : num NA NA NA NA NA NA NA NA NA NA ...
$ ETHRACE2A : Factor w/ 5 levels "1","2","3","4",..: 5 5 5 5 2 2 2 1 1 5 ...
$ MAJORDEPLIFE: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 2 1 ...
$ S1Q10A : num 17500 11000 6000 27000 42000 500 2500 24000 65000 4000 ...
$ S1Q24FT : num 5 5 5 5 6 5 5 5 5 5 ...
$ S1Q24IN : num 7 1 4 7 1 5 9 4 4 10 ...
$ S1Q24LB : num 195 127 185 180 220 140 160 120 140 150 ...
Joining AddHealth waves 1 and 2 together into a single dataset can be done if you want to use variables from when the participants were both adolecents and adults.
# Original AddHealth datasets
library(PDS)
# size of datasets, to verify the data is available
dim(AddHealth)
[1] 6504 2829
dim(addhealth_public4)
[1] 6504 976
# First column is the unique ID
colnames(AddHealth)[1]
[1] "AID"
colnames(addhealth_public4)[1]
[1] "aid"
# Rename W4's unique ID variable "AID" and "aid" to "ID"
# so both W1 and W4 have same variable name (lower/uppercase matters)
AddHealth <-
AddHealth %>%
rename(ID = AID)
addhealth_public4 <-
addhealth_public4 %>%
rename(ID = aid)
# note that the first column name has be updated
colnames(AddHealth)[1]
[1] "ID"
colnames(addhealth_public4)[1]
[1] "ID"
# Join two datasets by unique ID
AH14 <-
full_join(AddHealth, addhealth_public4)
# now we have twice as many columns
dim(AH14)
[1] 6504 3804
# Removing this large object since I don't need this data for the NESARC project
# DELETE THIS LINE if you're using AddHealth
rm(AH14)
Rename variables NewName = OriginalName
.
nesarc.sub <-
nesarc.sub %>%
rename(
ID = IDNUM
, Sex = SEX
, Age = AGE
, SmokingStatus = CHECK321
, TobaccoDependence = TAB12MDX
, SmokingFreq = S3AQ3B1
, DailyCigsSmoked = S3AQ3C1
, Ethnicity = ETHRACE2A
, Depression = MAJORDEPLIFE
, IncomePersonal = S1Q10A
, Height_ft = S1Q24FT
, Height_in = S1Q24IN
, Weight_lb = S1Q24LB
)
str(nesarc.sub)
'data.frame': 43093 obs. of 13 variables:
$ ID : Factor w/ 43093 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Sex : Factor w/ 2 levels "1","2": 1 2 2 1 1 2 1 2 2 2 ...
$ Age : num 23 28 81 18 36 34 19 84 29 18 ...
$ SmokingStatus : Factor w/ 3 levels "1","2","9": NA NA NA NA NA NA NA NA NA NA ...
$ TobaccoDependence: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ SmokingFreq : Factor w/ 7 levels "1","2","3","4",..: NA NA NA NA NA NA NA NA NA NA ...
$ DailyCigsSmoked : num NA NA NA NA NA NA NA NA NA NA ...
$ Ethnicity : Factor w/ 5 levels "1","2","3","4",..: 5 5 5 5 2 2 2 1 1 5 ...
$ Depression : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 2 1 ...
$ IncomePersonal : num 17500 11000 6000 27000 42000 500 2500 24000 65000 4000 ...
$ Height_ft : num 5 5 5 5 6 5 5 5 5 5 ...
$ Height_in : num 7 1 4 7 1 5 9 4 4 10 ...
$ Weight_lb : num 195 127 185 180 220 140 160 120 140 150 ...
The first 6 and last 6 values are given by head()
and tail()
.
head(nesarc.sub)
ID Sex Age SmokingStatus TobaccoDependence SmokingFreq DailyCigsSmoked
1 1 1 23 <NA> 0 <NA> NA
2 2 2 28 <NA> 0 <NA> NA
3 3 2 81 <NA> 0 <NA> NA
4 4 1 18 <NA> 0 <NA> NA
5 5 1 36 <NA> 0 <NA> NA
6 6 2 34 <NA> 0 <NA> NA
Ethnicity Depression IncomePersonal Height_ft Height_in Weight_lb
1 5 0 17500 5 7 195
2 5 0 11000 5 1 127
3 5 0 6000 5 4 185
4 5 0 27000 5 7 180
5 2 0 42000 6 1 220
6 2 0 500 5 5 140
tail(nesarc.sub)
ID Sex Age SmokingStatus TobaccoDependence SmokingFreq
43088 43088 2 18 1 0 1
43089 43089 2 18 <NA> 0 <NA>
43090 43090 1 19 <NA> 0 <NA>
43091 43091 1 18 1 1 6
43092 43092 1 29 1 1 1
43093 43093 1 18 <NA> 0 <NA>
DailyCigsSmoked Ethnicity Depression IncomePersonal Height_ft
43088 5 1 0 4000 5
43089 NA 1 0 4000 5
43090 NA 1 0 2500 6
43091 3 1 0 3000 5
43092 20 1 0 6500 5
43093 NA 1 1 6000 5
Height_in Weight_lb
43088 3 120
43089 1 160
43090 1 175
43091 8 150
43092 10 140
43093 11 145
A quick summary of the variables indicates many missing values.
# summary of variables -- notice many NAs
summary(nesarc.sub)
ID Sex Age SmokingStatus TobaccoDependence
1 : 1 1:18518 Min. :18.0 1 : 9913 0:38131
2 : 1 2:24575 1st Qu.:32.0 2 : 8078 1: 4962
3 : 1 Median :44.0 9 : 22
4 : 1 Mean :46.4 NA's:25080
5 : 1 3rd Qu.:59.0
6 : 1 Max. :98.0
(Other):43087
SmokingFreq DailyCigsSmoked Ethnicity Depression IncomePersonal
1 :14836 Min. : 1.00 1:24507 0:35254 Min. : 0
6 : 772 1st Qu.: 6.00 2: 8245 1: 7839 1st Qu.: 8800
4 : 747 Median :15.00 3: 701 Median : 20000
3 : 687 Mean :16.81 4: 1332 Mean : 28185
2 : 460 3rd Qu.:20.00 5: 8308 3rd Qu.: 36000
(Other): 511 Max. :99.00 Max. :3000000
NA's :25080 NA's :25080
Height_ft Height_in Weight_lb
Min. : 4.0 Min. : 0.000 Min. : 62.0
1st Qu.: 5.0 1st Qu.: 3.000 1st Qu.:140.0
Median : 5.0 Median : 5.000 Median :168.0
Mean : 6.7 Mean : 6.934 Mean :197.1
3rd Qu.: 5.0 3rd Qu.: 8.000 3rd Qu.:197.0
Max. :99.0 Max. :99.000 Max. :999.0
There are two steps. The first step is to recode any existing NA
s to actual values, if necessary. The method for doing this differs for numeric and categorical variables. The second step is to recode any coded missing values, such as 9s or 99s, as actual NA
.
NA
s as meaningful “missing”First step: the existing blank values with NA
mean “never”, and “never” has a meaning different from “missing”. For each variable we need to decide what “never” means and code it appropriately.
NA
s recoded as numericThe easiest is DailyCigsSmoked
where NA
means 0 cigarettes. We replace NA
with 0. The function is.na()
identifies each observation that “is an NA
”. I’ve updated the codebook to indicate that the original NA
values were changed.
S3AQ3C1 (DailyCigsSmoked)
USUAL QUANTITY WHEN SMOKED CIGARETTES
Discrete
17751 1-98. Cigarette(s)
262 99. Unknown
25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
* change to 0 Cigarett(s)
table(nesarc.sub$DailyCigsSmoked)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
934 884 923 573 1070 463 269 299 49 3077 23 230 34 25 851
16 17 18 19 20 21 22 23 24 25 27 28 29 30 33
40 22 59 5 5366 1 10 2 7 155 2 3 3 909 1
34 35 37 39 40 45 50 55 57 60 66 70 75 80 98
1 30 2 1 993 8 106 2 1 241 1 12 2 47 15
99
262
nesarc.sub <-
nesarc.sub %>%
mutate(
# replace NA with 0
DailyCigsSmoked = replace(DailyCigsSmoked, is.na(DailyCigsSmoked), 0)
)
table(nesarc.sub$DailyCigsSmoked)
0 1 2 3 4 5 6 7 8 9 10 11
25080 934 884 923 573 1070 463 269 299 49 3077 23
12 13 14 15 16 17 18 19 20 21 22 23
230 34 25 851 40 22 59 5 5366 1 10 2
24 25 27 28 29 30 33 34 35 37 39 40
7 155 2 3 3 909 1 1 30 2 1 993
45 50 55 57 60 66 70 75 80 98 99
8 106 2 1 241 1 12 2 47 15 262
NA
s recoded as categoricalThe next two variables require defining a new category level. Both SmokingStatus
and SmokingFreq
with a value of NA
means “never”, so we need to create a new “Never” category and put in order with the rest of the levels. I’ve updated the codebook to indicate that the original NA
values were changed.
CHECK321 (SmokingStatus)
CIGARETTE SMOKING STATUS
Nominal
9913 1. Smoked cigarettes in the past 12 months
8078 2. Smoked cigarettes prior to the last 12 months
22 9. Unknown
25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
* change to 3. Never
S3AQ3B1 (SmokingFreq)
USUAL FREQUENCY WHEN SMOKED CIGARETTES
Ordinal
14836 1. Every day
460 2. 5 to 6 Day(s) a week
687 3. 3 to 4 Day(s) a week
747 4. 1 to 2 Day(s) a week
409 5. 2 to 3 Day(s) a month
772 6. Once a month or less
102 9. Unknown
25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
* change to 7. Never
table(nesarc.sub$SmokingStatus )
1 2 9
9913 8078 22
table(nesarc.sub$SmokingFreq )
1 2 3 4 5 6 9
14836 460 687 747 409 772 102
nesarc.sub <-
nesarc.sub %>%
mutate(
# first change the "factor" variables to "character" to add a new level
SmokingStatus = as.character(SmokingStatus)
, SmokingFreq = as.character(SmokingFreq )
# replace NA with new category values
, SmokingStatus = replace(SmokingStatus , is.na(SmokingStatus), "3")
, SmokingFreq = replace(SmokingFreq , is.na(SmokingFreq) , "7")
)
table(nesarc.sub$SmokingStatus )
1 2 3 9
9913 8078 25080 22
table(nesarc.sub$SmokingFreq )
1 2 3 4 5 6 7 9
14836 460 687 747 409 772 25080 102
NA
sSecond step: replace the coded missing values with NA
, then remove the remaining records with missing values. Not all of these variables need this operation, since the previous subset commands removed missing values by chance. However, I’m going to run each as an extra precaution. Afterall, later, if we were to make different previous data decisions, it’s possible we could end up with missing values at this point that are incorrectly coded without this step!
table(nesarc.sub$SmokingStatus )
1 2 3 9
9913 8078 25080 22
table(nesarc.sub$SmokingFreq )
1 2 3 4 5 6 7 9
14836 460 687 747 409 772 25080 102
table(nesarc.sub$DailyCigsSmoked)
0 1 2 3 4 5 6 7 8 9 10 11
25080 934 884 923 573 1070 463 269 299 49 3077 23
12 13 14 15 16 17 18 19 20 21 22 23
230 34 25 851 40 22 59 5 5366 1 10 2
24 25 27 28 29 30 33 34 35 37 39 40
7 155 2 3 3 909 1 1 30 2 1 993
45 50 55 57 60 66 70 75 80 98 99
8 106 2 1 241 1 12 2 47 15 262
nesarc.sub <-
nesarc.sub %>%
mutate(
# replace missing values
SmokingStatus = replace(SmokingStatus , SmokingStatus %in% c("9"), NA)
, SmokingFreq = replace(SmokingFreq , SmokingFreq %in% c("9"), NA)
, DailyCigsSmoked = replace(DailyCigsSmoked, DailyCigsSmoked %in% c( 99), NA)
, Height_ft = replace(Height_ft , Height_ft %in% c( 99), NA)
, Height_in = replace(Height_in , Height_in %in% c( 99), NA)
, Weight_lb = replace(Weight_lb , Weight_lb %in% c(999), NA)
# drop unused factor levels
, SmokingStatus = fct_drop(SmokingStatus)
, SmokingFreq = fct_drop(SmokingFreq )
)
table(nesarc.sub$SmokingStatus )
1 2 3
9913 8078 25080
table(nesarc.sub$SmokingFreq )
1 2 3 4 5 6 7
14836 460 687 747 409 772 25080
table(nesarc.sub$DailyCigsSmoked)
0 1 2 3 4 5 6 7 8 9 10 11
25080 934 884 923 573 1070 463 269 299 49 3077 23
12 13 14 15 16 17 18 19 20 21 22 23
230 34 25 851 40 22 59 5 5366 1 10 2
24 25 27 28 29 30 33 34 35 37 39 40
7 155 2 3 3 909 1 1 30 2 1 993
45 50 55 57 60 66 70 75 80 98
8 106 2 1 241 1 12 2 47 15
summary(nesarc.sub)
ID Sex Age SmokingStatus TobaccoDependence
1 : 1 1:18518 Min. :18.0 1 : 9913 0:38131
2 : 1 2:24575 1st Qu.:32.0 2 : 8078 1: 4962
3 : 1 Median :44.0 3 :25080
4 : 1 Mean :46.4 NA's: 22
5 : 1 3rd Qu.:59.0
6 : 1 Max. :98.0
(Other):43087
SmokingFreq DailyCigsSmoked Ethnicity Depression IncomePersonal
7 :25080 Min. : 0.000 1:24507 0:35254 Min. : 0
1 :14836 1st Qu.: 0.000 2: 8245 1: 7839 1st Qu.: 8800
6 : 772 Median : 0.000 3: 701 Median : 20000
4 : 747 Mean : 6.462 4: 1332 Mean : 28185
3 : 687 3rd Qu.:10.000 5: 8308 3rd Qu.: 36000
(Other): 869 Max. :98.000 Max. :3000000
NA's : 102 NA's :262
Height_ft Height_in Weight_lb
Min. :4.00 Min. : 0.000 Min. : 62.0
1st Qu.:5.00 1st Qu.: 3.000 1st Qu.:140.0
Median :5.00 Median : 5.000 Median :165.0
Mean :5.11 Mean : 5.279 Mean :170.6
3rd Qu.:5.00 3rd Qu.: 8.000 3rd Qu.:192.0
Max. :7.00 Max. :11.000 Max. :500.0
NA's :730 NA's :761 NA's :1376
Finally, consider using na.omit()
to remove any records with missing values if it won’t cause issues with analysis. The command na.omit()
removes a row if any columns have an NA
in that row. This can have drastic consequences. For example, imagine that you have 7 variables and 6 variables have complete data, but the 7th variable is about three-quarters NA
s! Then na.omit()
will drop three-quarters of your data, even though most of your variables are complete!
There’s only one advantage I can think of for running na.omit()
, and that’s if you have a large dataset and only a few observations have a few NA
s. You can’t analyze data with NA
s, since any calculation involving NA
results in an NA
. Therefore imagine two scenarios: (1) An analysis with Variables 1 and 2, each with a few NA
distributed throughout; first the observations with NA
for either Variable 1 or 2 will be removed, then the analysis will proceed. (2) A similar analysis, but now with Variables 1 and 3, but Variable 3 has NA
for different observations than Variable 2 — therefore, different observations will be removed before analysis. In summary, these two analyses will include different observations from the dataset. Therefore, by running na.omit()
you will start with a dataset with no NA
s, thus every analysis will be of exactly the same set of data (at the cost of not using all the possible data for each individual analysis).
Below, I do not run na.omit()
since I’m concerned about having lots of NA
s for some variables.
dim(nesarc.sub)
[1] 43093 13
# remove rows with missing values
#nesarc.sub <- na.omit(nesarc.sub)
dim(nesarc.sub)
[1] 43093 13
summary(nesarc.sub)
ID Sex Age SmokingStatus TobaccoDependence
1 : 1 1:18518 Min. :18.0 1 : 9913 0:38131
2 : 1 2:24575 1st Qu.:32.0 2 : 8078 1: 4962
3 : 1 Median :44.0 3 :25080
4 : 1 Mean :46.4 NA's: 22
5 : 1 3rd Qu.:59.0
6 : 1 Max. :98.0
(Other):43087
SmokingFreq DailyCigsSmoked Ethnicity Depression IncomePersonal
7 :25080 Min. : 0.000 1:24507 0:35254 Min. : 0
1 :14836 1st Qu.: 0.000 2: 8245 1: 7839 1st Qu.: 8800
6 : 772 Median : 0.000 3: 701 Median : 20000
4 : 747 Mean : 6.462 4: 1332 Mean : 28185
3 : 687 3rd Qu.:10.000 5: 8308 3rd Qu.: 36000
(Other): 869 Max. :98.000 Max. :3000000
NA's : 102 NA's :262
Height_ft Height_in Weight_lb
Min. :4.00 Min. : 0.000 Min. : 62.0
1st Qu.:5.00 1st Qu.: 3.000 1st Qu.:140.0
Median :5.00 Median : 5.000 Median :165.0
Mean :5.11 Mean : 5.279 Mean :170.6
3rd Qu.:5.00 3rd Qu.: 8.000 3rd Qu.:192.0
Max. :7.00 Max. :11.000 Max. :500.0
NA's :730 NA's :761 NA's :1376
Now we have a dataset with no missing values.
Additional variables that we created from the original variables are listed in this ammendment to the original codebook.
CREATED VARIABLES
DaySmoke
estimated number of days per month a subject smokes
Continuous (due to way estimated), assumes 30 days per month using SmokingFreq)
1-30.
TotalCigsSmoked
estimated number of cigarettes per month a subject smokes (DaysSmoke * DailyCigsSmoked)
Continuous
0-large.
CigsSmokedFac
quartiles of TotalCigsSmoked
Ordinal
ranges for each of the four quarters
SmokingFreq3
three levels of smoking frequency
Ordinal
from SmokingFreq
1. Daily = 1. Every day, 2. 5 to 6 Day(s) a week, 3. 3 to 4 Day(s) a week
2. Weekly = 4. 1 to 2 Day(s) a week, 5. 2 to 3 Day(s) a month
3. Monthly = 6. Once a month or less
4. Never = 7. Never
Height_inches
Total height in inches
Height_ft * 12 + Height_in
DaysSmoke
estimates the days per month a subject smokes by converting SmokingFreq
(a factor with 6 levels) to a numeric variable using as.numeric()
and multiplying by the midpoint of the range of SmokingFreq
.
nesarc.sub <-
nesarc.sub %>%
mutate(
DaysSmoke =
case_when(
SmokingFreq == 1 ~ 30 # 1. Every day
, SmokingFreq == 2 ~ 4 * 5.5 # 2. 5 to 6 Day(s) a week
, SmokingFreq == 3 ~ 4 * 3.5 # 3. 3 to 4 Day(s) a week
, SmokingFreq == 4 ~ 4 * 1.5 # 4. 1 to 2 Day(s) a week
, SmokingFreq == 5 ~ 2.5 # 5. 2 to 3 Day(s) a month
, SmokingFreq == 6 ~ 1 # 6. Once a month or less
, SmokingFreq == 7 ~ 0 # 7. Never
)
)
summary(nesarc.sub$DaysSmoke)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.00 0.00 0.00 10.96 30.00 30.00 102
Height_inches
is the height of a person in inches. Since the data was collected in two variables (Height_ft
and Height_in
) we need to add them together for our new variable.
nesarc.sub <-
nesarc.sub %>%
mutate(
Height_inches = Height_ft * 12 + Height_in
)
summary(nesarc.sub$Height_inches)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
48.0 64.0 66.0 66.6 70.0 84.0 761
The variable TotalCigsSmoked
estimates the monthly number of cigarettes a subject smokes per month by multiplying DaysSmoke
times DailyCigsSmoked
(the usual quantity smoked per day).
nesarc.sub <-
nesarc.sub %>%
mutate(
TotalCigsSmoked = DaysSmoke * DailyCigsSmoked
)
Log scale. I like log2 since its interpretation is that every unit is a doubling. Notice that I added 1 to the TotalCigsSmoked
before computing the log. This is because log(0) is undefined, thus I add 1 so that log(0 + 1) = 0 and all values are defined.
nesarc.sub <-
nesarc.sub %>%
mutate(
TotalCigsSmokedLog2 = log(TotalCigsSmoked + 1, base = 2)
)
summary(nesarc.sub$TotalCigsSmokedLog2)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.000 0.000 0.000 3.284 8.234 11.522 297
Square root scale. Square root is another transformation that spreads out values between 0 and 1 and compresses values from 1 to infinity. It can be preferred to a log transformation in some cases.
nesarc.sub <-
nesarc.sub %>%
mutate(
TotalCigsSmokedSqrt = sqrt(TotalCigsSmoked)
)
summary(nesarc.sub$TotalCigsSmokedSqrt)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.000 0.000 0.000 7.887 17.321 54.222 297
The numeric variable TotalCigsSmoked
is converted into a factor with four roughly equivalent numbers (quartiles) stored in each level of the factor CigsSmokedFac
using the cut
function. The label "Q1"
indicates that the original value was in the first quarter (25%) of the ordered data, that is, they are among the smallest 25% observations; while "Q4"
indicates the original value was in the last quarter (25%) of the ordered data, that is, they are among the largest 25% observations.
quantile(nesarc.sub$TotalCigsSmoked, na.rm = TRUE)
0% 25% 50% 75% 100%
0 0 0 300 2940
nesarc.sub <-
nesarc.sub %>%
mutate(
CigsSmokedFac =
.bincode(TotalCigsSmoked
, breaks = quantile(TotalCigsSmoked, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)
, include.lowest = TRUE
)
, CigsSmokedFac = factor( CigsSmokedFac
, levels = c(1, 2, 3, 4)
, labels = c("Q1", "Q2", "Q3", "Q4")
)
)
In my data, more than half of the values are 0s which is the minimum value, thus they all get grouped in “Q1”; this leaves no values for “Q2” – oh well.
summary(nesarc.sub$CigsSmokedFac)
Q1 Q2 Q3 Q4 NA's
25080 0 8630 9086 297
The variable SmokingFreq3
collapses the SmokingFreq
from 6 down to 3 categories. This is an example to show you how to do this with your own variables. Some of you will have categorical variables that have a dozen or so categories; having so many categories makes interpretation very difficulty — fewer is easier.
## SmokingFreq3 variable
# notice that I'm doing this before I've assigned the labels
# if you do this after assigning labels (with the factor() command), then you'll want to use "as.numeric()" as below
# inspect original variable
table(nesarc.sub$SmokingFreq)
1 2 3 4 5 6 7
14836 460 687 747 409 772 25080
# look at numeric values
table(as.numeric(nesarc.sub$SmokingFreq))
1 2 3 4 5 6 7
14836 460 687 747 409 772 25080
# Note: it is sometimes easier to use the numeric "levels" of a factor variable, than the character "labels"
# initialize new variable with NAs
nesarc.sub <-
nesarc.sub %>%
mutate(
SmokingFreq3 =
fct_collapse(
SmokingFreq
, "1" = c("1", "2", "3") # 1. daily
, "2" = c("4", "5") # 2. weekly
, "3" = "6" # 3. monthly
, "4" = "7" # 7. never
)
)
# new variable
table(nesarc.sub$SmokingFreq3)
1 2 3 4
15983 1156 772 25080
# we will label the new factor variable below, with the others
Look at the first 10 observations in the dataset, with special attention to the new variables.
head(nesarc.sub, 10)
ID Sex Age SmokingStatus TobaccoDependence SmokingFreq DailyCigsSmoked
1 1 1 23 3 0 7 0
2 2 2 28 3 0 7 0
3 3 2 81 3 0 7 0
4 4 1 18 3 0 7 0
5 5 1 36 3 0 7 0
6 6 2 34 3 0 7 0
7 7 1 19 3 0 7 0
8 8 2 84 3 0 7 0
9 9 2 29 3 0 7 0
10 10 2 18 3 0 7 0
Ethnicity Depression IncomePersonal Height_ft Height_in Weight_lb
1 5 0 17500 5 7 195
2 5 0 11000 5 1 127
3 5 0 6000 5 4 185
4 5 0 27000 5 7 180
5 2 0 42000 6 1 220
6 2 0 500 5 5 140
7 2 1 2500 5 9 160
8 1 0 24000 5 4 120
9 1 1 65000 5 4 140
10 5 0 4000 5 10 150
DaysSmoke Height_inches TotalCigsSmoked TotalCigsSmokedLog2
1 0 67 0 0
2 0 61 0 0
3 0 64 0 0
4 0 67 0 0
5 0 73 0 0
6 0 65 0 0
7 0 69 0 0
8 0 64 0 0
9 0 64 0 0
10 0 70 0 0
TotalCigsSmokedSqrt CigsSmokedFac SmokingFreq3
1 0 Q1 4
2 0 Q1 4
3 0 Q1 4
4 0 Q1 4
5 0 Q1 4
6 0 Q1 4
7 0 Q1 4
8 0 Q1 4
9 0 Q1 4
10 0 Q1 4
Look at summary of dataset to see the state of the labelling before we make changes.
summary(nesarc.sub)
ID Sex Age SmokingStatus TobaccoDependence
1 : 1 1:18518 Min. :18.0 1 : 9913 0:38131
2 : 1 2:24575 1st Qu.:32.0 2 : 8078 1: 4962
3 : 1 Median :44.0 3 :25080
4 : 1 Mean :46.4 NA's: 22
5 : 1 3rd Qu.:59.0
6 : 1 Max. :98.0
(Other):43087
SmokingFreq DailyCigsSmoked Ethnicity Depression IncomePersonal
7 :25080 Min. : 0.000 1:24507 0:35254 Min. : 0
1 :14836 1st Qu.: 0.000 2: 8245 1: 7839 1st Qu.: 8800
6 : 772 Median : 0.000 3: 701 Median : 20000
4 : 747 Mean : 6.462 4: 1332 Mean : 28185
3 : 687 3rd Qu.:10.000 5: 8308 3rd Qu.: 36000
(Other): 869 Max. :98.000 Max. :3000000
NA's : 102 NA's :262
Height_ft Height_in Weight_lb DaysSmoke
Min. :4.00 Min. : 0.000 Min. : 62.0 Min. : 0.00
1st Qu.:5.00 1st Qu.: 3.000 1st Qu.:140.0 1st Qu.: 0.00
Median :5.00 Median : 5.000 Median :165.0 Median : 0.00
Mean :5.11 Mean : 5.279 Mean :170.6 Mean :10.96
3rd Qu.:5.00 3rd Qu.: 8.000 3rd Qu.:192.0 3rd Qu.:30.00
Max. :7.00 Max. :11.000 Max. :500.0 Max. :30.00
NA's :730 NA's :761 NA's :1376 NA's :102
Height_inches TotalCigsSmoked TotalCigsSmokedLog2 TotalCigsSmokedSqrt
Min. :48.0 Min. : 0.0 Min. : 0.000 Min. : 0.000
1st Qu.:64.0 1st Qu.: 0.0 1st Qu.: 0.000 1st Qu.: 0.000
Median :66.0 Median : 0.0 Median : 0.000 Median : 0.000
Mean :66.6 Mean : 187.5 Mean : 3.284 Mean : 7.887
3rd Qu.:70.0 3rd Qu.: 300.0 3rd Qu.: 8.234 3rd Qu.:17.321
Max. :84.0 Max. :2940.0 Max. :11.522 Max. :54.222
NA's :761 NA's :297 NA's :297 NA's :297
CigsSmokedFac SmokingFreq3
Q1 :25080 1 :15983
Q2 : 0 2 : 1156
Q3 : 8630 3 : 772
Q4 : 9086 4 :25080
NA's: 297 NA's: 102
Informative labels are given to the factor levels. The order of the levels is also rearranged for the variables SmokingFreq
, TobaccoDependence
, and Sex
.
# Label factors with meaningful names in the same order as the codebook
nesarc.sub$Depression <-
factor(
nesarc.sub$Depression
, labels = c("No Depression"
, "Yes Depression"
)
)
nesarc.sub$SmokingStatus <-
factor(
nesarc.sub$SmokingStatus
, labels = c("Smoked past 12 months"
, "Smoked prior to 12 months"
, "Never smoked"
)
)
# Specify the order of the levels to be different from codebook order
nesarc.sub$Sex <-
factor(nesarc.sub$Sex
, levels = c( 2
, 1
)
, labels = c( "Female"
, "Male"
)
)
# check ordering with a frequency table
table(nesarc.sub$Sex)
Female Male
24575 18518
nesarc.sub$TobaccoDependence <-
factor(nesarc.sub$TobaccoDependence
, levels = c( 1
, 0
)
, labels = c( "Nicotine Dependence"
, "No Nicotine Dependence"
)
)
# Remember the new category level from the original `NA` values.
nesarc.sub$SmokingFreq <-
factor(nesarc.sub$SmokingFreq
, levels = c( 7
, 6
, 5
, 4
, 3
, 2
, 1
)
, labels = c( "Never"
, "Once a month or less"
, "2 to 3 Days/month"
, "1 to 2 Days/week"
, "3 to 4 Days/week"
, "5 to 6 Days/week"
, "Every Day"
)
)
nesarc.sub$Ethnicity <-
factor(nesarc.sub$Ethnicity
# shorter labels are helpful for plots
, labels = c( "Cauc"
, "AfAm"
, "NaAm"
, "Asia"
, "Hisp"
)
# , labels = c("Caucasian"
# , "African American"
# , "Native American"
# , "Asian"
# , "Hispanic"
# )
)
nesarc.sub$SmokingFreq3 <-
factor(nesarc.sub$SmokingFreq3
, levels = c( 1
, 2
, 3
, 4
)
, labels = c( "Daily"
, "Weekly"
, "Monthly"
, "Never"
)
)
summary(nesarc.sub)
ID Sex Age
1 : 1 Female:24575 Min. :18.0
2 : 1 Male :18518 1st Qu.:32.0
3 : 1 Median :44.0
4 : 1 Mean :46.4
5 : 1 3rd Qu.:59.0
6 : 1 Max. :98.0
(Other):43087
SmokingStatus TobaccoDependence
Smoked past 12 months : 9913 Nicotine Dependence : 4962
Smoked prior to 12 months: 8078 No Nicotine Dependence:38131
Never smoked :25080
NA's : 22
SmokingFreq DailyCigsSmoked Ethnicity
Never :25080 Min. : 0.000 Cauc:24507
Every Day :14836 1st Qu.: 0.000 AfAm: 8245
Once a month or less: 772 Median : 0.000 NaAm: 701
1 to 2 Days/week : 747 Mean : 6.462 Asia: 1332
3 to 4 Days/week : 687 3rd Qu.:10.000 Hisp: 8308
(Other) : 869 Max. :98.000
NA's : 102 NA's :262
Depression IncomePersonal Height_ft Height_in
No Depression :35254 Min. : 0 Min. :4.00 Min. : 0.000
Yes Depression: 7839 1st Qu.: 8800 1st Qu.:5.00 1st Qu.: 3.000
Median : 20000 Median :5.00 Median : 5.000
Mean : 28185 Mean :5.11 Mean : 5.279
3rd Qu.: 36000 3rd Qu.:5.00 3rd Qu.: 8.000
Max. :3000000 Max. :7.00 Max. :11.000
NA's :730 NA's :761
Weight_lb DaysSmoke Height_inches TotalCigsSmoked
Min. : 62.0 Min. : 0.00 Min. :48.0 Min. : 0.0
1st Qu.:140.0 1st Qu.: 0.00 1st Qu.:64.0 1st Qu.: 0.0
Median :165.0 Median : 0.00 Median :66.0 Median : 0.0
Mean :170.6 Mean :10.96 Mean :66.6 Mean : 187.5
3rd Qu.:192.0 3rd Qu.:30.00 3rd Qu.:70.0 3rd Qu.: 300.0
Max. :500.0 Max. :30.00 Max. :84.0 Max. :2940.0
NA's :1376 NA's :102 NA's :761 NA's :297
TotalCigsSmokedLog2 TotalCigsSmokedSqrt CigsSmokedFac SmokingFreq3
Min. : 0.000 Min. : 0.000 Q1 :25080 Daily :15983
1st Qu.: 0.000 1st Qu.: 0.000 Q2 : 0 Weekly : 1156
Median : 0.000 Median : 0.000 Q3 : 8630 Monthly: 772
Mean : 3.284 Mean : 7.887 Q4 : 9086 Never :25080
3rd Qu.: 8.234 3rd Qu.:17.321 NA's: 297 NA's : 102
Max. :11.522 Max. :54.222
NA's :297 NA's :297
In practice, I would probably subset the data much earlier. In this example I waited until the end to subset so that I could illustrate all of the other steps with complete values.
In our example, we are interested in the subset of people 25 or younger who smoked in the last 12 months and this is created using the filter()
function. Note that SmokingStatus == 1
is used to choose people who smoked in the past 12 months.
Notice, the missing values have been removed (since NA
s are not less than or equal to numbers).
# the subset command below will not include a row that evalutes to NA for the
# specified subset. For example:
(NA == 1)
[1] NA
# age and smoking subset
nesarc.sub <-
nesarc.sub %>%
filter(
Age <= 25 # people 25 or younger
, SmokingStatus == "Smoked past 12 months" # smoked in the past 12 months
)
dim(nesarc.sub)
[1] 1706 20
summary(nesarc.sub)
ID Sex Age
21 : 1 Female:850 Min. :18.00
77 : 1 Male :856 1st Qu.:20.00
103 : 1 Median :22.00
122 : 1 Mean :21.61
136 : 1 3rd Qu.:23.00
150 : 1 Max. :25.00
(Other):1700
SmokingStatus TobaccoDependence
Smoked past 12 months :1706 Nicotine Dependence :896
Smoked prior to 12 months: 0 No Nicotine Dependence:810
Never smoked : 0
SmokingFreq DailyCigsSmoked Ethnicity
Every Day :1320 Min. : 1.00 Cauc:1060
3 to 4 Days/week : 91 1st Qu.: 5.00 AfAm: 211
1 to 2 Days/week : 88 Median :10.00 NaAm: 42
Once a month or less: 71 Mean :11.34 Asia: 58
5 to 6 Days/week : 68 3rd Qu.:20.00 Hisp: 335
(Other) : 65 Max. :98.00
NA's : 3 NA's :9
Depression IncomePersonal Height_ft Height_in
No Depression :1260 Min. : 0 Min. :4.000 Min. : 0.000
Yes Depression: 446 1st Qu.: 4000 1st Qu.:5.000 1st Qu.: 2.000
Median : 11000 Median :5.000 Median : 5.000
Mean : 13298 Mean :5.196 Mean : 5.425
3rd Qu.: 20000 3rd Qu.:5.000 3rd Qu.: 8.000
Max. :130000 Max. :6.000 Max. :11.000
NA's :14 NA's :14
Weight_lb DaysSmoke Height_inches TotalCigsSmoked
Min. : 85.0 Min. : 1.00 Min. :58.00 Min. : 1.0
1st Qu.:135.5 1st Qu.:30.00 1st Qu.:65.00 1st Qu.: 90.0
Median :160.0 Median :30.00 Median :68.00 Median : 300.0
Mean :165.8 Mean :25.33 Mean :67.78 Mean : 320.5
3rd Qu.:185.0 3rd Qu.:30.00 3rd Qu.:71.00 3rd Qu.: 600.0
Max. :390.0 Max. :30.00 Max. :81.00 Max. :2940.0
NA's :27 NA's :3 NA's :14 NA's :9
TotalCigsSmokedLog2 TotalCigsSmokedSqrt CigsSmokedFac SmokingFreq3
Min. : 1.000 Min. : 1.000 Q1 : 0 Daily :1479
1st Qu.: 6.508 1st Qu.: 9.487 Q2 : 0 Weekly : 153
Median : 8.234 Median :17.321 Q3 :1107 Monthly: 71
Mean : 7.448 Mean :15.987 Q4 : 590 Never : 0
3rd Qu.: 9.231 3rd Qu.:24.495 NA's: 9 NA's : 3
Max. :11.522 Max. :54.222
NA's :9 NA's :9
Dates can be tricky in R. The lubridate
package makes them much easier.
Below I’m going to create a variable for the person’s age in years based on the person’s date of birth and the interview date. There is an AGE
variable which I’m already using, but if there wasn’t, or I wanted to be more precise about their age (such as, 24.34 years old), then this is a way to do it.
I’m going to create a separate data frame just for this example to show how to do it; if you needed these variables, they should be part of your original subset above. Below, the “C” variables (CDAY
) are the interview day, month, and year, and the “DOB” variables (DOBD
) are the date of birth dat, month, and year.
dat.sub <- subset(NESARC, select = c(CDAY, CMON, CYEAR, DOBD, DOBM, DOBY))
str(dat.sub)
'data.frame': 43093 obs. of 6 variables:
$ CDAY : num 14 12 23 9 18 7 1 18 9 16 ...
$ CMON : num 8 1 11 9 10 9 9 9 1 8 ...
$ CYEAR: num 2001 2002 2001 2001 2001 ...
$ DOBD : num 27 15 28 27 11 28 16 10 7 25 ...
$ DOBM : num 9 12 12 6 9 12 3 11 4 3 ...
$ DOBY : num 1977 1973 1919 1983 1965 ...
Combine the date fields together, then interpret them into a date object. I’ll do this for both the interview date and the date of birth. I print the head (first 6 observations) after each operation.
## interview date
# combine day, month, and year into one text string
cdate.text <- paste(dat.sub$CDAY, dat.sub$CMON, dat.sub$CYEAR)
head(cdate.text)
[1] "14 8 2001" "12 1 2002" "23 11 2001" "9 9 2001" "18 10 2001"
[6] "7 9 2001"
# create date object by interpretting the day, month, year with dmy()
cdate.date <- dmy(cdate.text)
head(cdate.date)
[1] "2001-08-14" "2002-01-12" "2001-11-23" "2001-09-09" "2001-10-18"
[6] "2001-09-07"
## date of birth
# combine day, month, and year into one text string
dob.text <- paste(dat.sub$DOBD, dat.sub$DOBM, dat.sub$DOBY)
head(dob.text)
[1] "27 9 1977" "15 12 1973" "28 12 1919" "27 6 1983" "11 9 1965"
[6] "28 12 1966"
# create date object by interpretting the day, month, year with dmy()
# if any failed to parse, it's because a day, month, or year is out of range
# such as "99" for missing values or refused to answer
# in these cases, the date is NA (missing), which is what we want
dob.date <- dmy(dob.text)
head(dob.date)
[1] "1977-09-27" "1973-12-15" "1919-12-28" "1983-06-27" "1965-09-11"
[6] "1966-12-28"
Now that we have the interview date and date of birth as date objects, we calculate the difference to give the person’s age at the time of the interview. Because the difference of dates is in the unit “days”, we convert to years.
# difference in days / 365.25 = difference in years
age.days <- cdate.date - dob.date
head(age.days)
Time differences in days
[1] 8722 10255 29916 6649 13186 12672
age.years <- as.numeric(age.days / 365.25) # change from "difftime" to "numeric"
head(age.years)
[1] 23.87953 28.07666 81.90554 18.20397 36.10130 34.69405
# difference between the age we calculated and the age in the original dataset
head(age.years - NESARC$AGE)
[1] 0.87953457 0.07665982 0.90554415 0.20396988 0.10130048 0.69404517
Add the new age variable to your data frame, drop the ones you don’t need anymore, and you’re ready to go!
# add the variable
dat.sub$age.years <- age.years
str(dat.sub)
'data.frame': 43093 obs. of 7 variables:
$ CDAY : num 14 12 23 9 18 7 1 18 9 16 ...
$ CMON : num 8 1 11 9 10 9 9 9 1 8 ...
$ CYEAR : num 2001 2002 2001 2001 2001 ...
$ DOBD : num 27 15 28 27 11 28 16 10 7 25 ...
$ DOBM : num 9 12 12 6 9 12 3 11 4 3 ...
$ DOBY : num 1977 1973 1919 1983 1965 ...
$ age.years: num 23.9 28.1 81.9 18.2 36.1 ...
# drop the original ones you don't need anymore
# by using subset() and select = -VAR, it excludes the VAR column
dat.sub <-
dat.sub %>%
select(-c(CDAY, CMON, CYEAR, DOBD, DOBM, DOBY))
str(dat.sub)
'data.frame': 43093 obs. of 1 variable:
$ age.years: num 23.9 28.1 81.9 18.2 36.1 ...
Now you’ve got an age (or other date object) you can use in your project.
Basic tables can be created using the functions table
or xtabs
. Frequency tables are created for the variables TobaccoDependence
, CigsSmokedFac
, and SmokingFreq
. When only looking at one variable, table
is the same as xtabs
, but xtabs
will make looking at two-way, three-way, and higher tables possible.
# table() produces a one-varible frequency table
table(nesarc.sub$TobaccoDependence)
Nicotine Dependence No Nicotine Dependence
896 810
# proportions available by passing these frequencies to prop.table()
table(nesarc.sub$TobaccoDependence) %>% prop.table()
Nicotine Dependence No Nicotine Dependence
0.5252052 0.4747948
# xtabs() can be used for one or more variables
T1 <- xtabs( ~ TobaccoDependence, data = nesarc.sub)
T1
TobaccoDependence
Nicotine Dependence No Nicotine Dependence
896 810
# proportions available with prop.table()
prop.table(T1)
TobaccoDependence
Nicotine Dependence No Nicotine Dependence
0.5252052 0.4747948
T2 <- xtabs( ~ SmokingFreq, data = nesarc.sub)
T2
SmokingFreq
Never Once a month or less 2 to 3 Days/month
0 71 65
1 to 2 Days/week 3 to 4 Days/week 5 to 6 Days/week
88 91 68
Every Day
1320
(Look at the .Rmd code to see that these numbers in the next paragraph are calculated directly from the numbers summarized in the table! This way, if the data changes, the text will be correctly updated.)
In the data frame nesarc.sub
, there are 896 nicotine dependent subjects and 810 subjects that are not nicotine dependent. Most of the subjects in nesarc.sub
are daily smokers (68) with the remainder distributed uniformly across the first five levels of SmokingFreq
.
The barplots are all created with the package ggplot2
. The barplots start with the defaults for the geom_bar
and add more detail to the plot with each graph.
library(ggplot2)
p1 <- ggplot(data = nesarc.sub, aes(x = TobaccoDependence))
p1 <- p1 + geom_bar()
p1
p2 <- ggplot(data = nesarc.sub, aes(x = SmokingFreq))
p2 <- p2 + geom_bar()
p2 <- p2 + labs(x = "", y = "Total Number", title = "Smoking Frequency for Young Smoking Adults")
p2 <- p2 + theme_bw()
p2 <- p2 + theme(axis.text.x = element_text(angle = 90, vjust = 0, hjust = 1))
p2
When a factor variable has NA
s, then NA
appears as a level with frequences (see bar plots above). As of writing this, ggplot()
does not have a way to remove the NA
s before plotting. Therefore, we have to do it manually.
The easiest and most transparent way is to use the drop_na()
function below where we only plot the non-NA
values. Compare the two plots below of the same data, but the second plot had the NA
s removed before plotting.
library(ggplot2)
# Original plot with NAs
p1 <- ggplot(data = nesarc.sub, aes(x = SmokingFreq))
p1 <- p1 + geom_bar()
p1 <- p1 + labs(title = "Basic plot")
p1 <- p1 + theme(axis.text.x = element_text(angle = 90, vjust = 0, hjust = 1))
#p1
# subset excluded the NAs for the variable being plotted
# "subset() specifies the values to keep,
# and "!is.na()" means "keep the non-NA values"
#p2 <- ggplot(data = subset(nesarc.sub, !is.na(CigsSmokedFac)), aes(x = CigsSmokedFac))
p2 <- ggplot(data = nesarc.sub %>% drop_na(SmokingFreq), aes(x = SmokingFreq))
p2 <- p2 + geom_bar()
p2 <- p2 + labs(title = "Using drop_na()")
p2 <- p2 + theme(axis.text.x = element_text(angle = 90, vjust = 0, hjust = 1))
#p2
# grid.arrange() is a way to arrange several ggplot objects
library(gridExtra)
grid.arrange(grobs = list(p1, p2), nrow=1, top = "Excluding NAs from factor variable in ggplot")
library(ggplot2)
p1 <- ggplot(data = nesarc.sub, aes(x = TobaccoDependence))
p1 <- p1 + geom_bar()
p1 <- p1 + labs(x = ""
, y = "Total Number"
, title = "Nicotine Dependence for Young Smoking Adults"
)
print(p1)
Interpretation: Among smokers 25 years and younger, there are slightly more (5% more) people who are nicotine dependent (52.5%) compared to those who are not nicotine dependent (47.5%).
p2 <- ggplot(data = nesarc.sub %>% drop_na(SmokingFreq), aes(x = SmokingFreq))
p2 <- p2 + geom_bar()
p2 <- p2 + labs(x = ""
, y = "Total Number"
, title = "Smoking Frequency for Young Smoking Adults"
)
p2 <- p2 + theme(axis.text.x = element_text(angle = 90, vjust = 0, hjust = 1))
print(p2)
Interpretation: Among smokers 25 years and younger, most smoke every day (77.5%) while the less frequent categories are roughly evenly distributed around 4% to 5%.
One popular way to graph a continuous variable is to use a histogram. R
has the base function hist()
which can be used for this purpose. We will also use the package ggplot2
to create histograms. We start with a basic histogram of the variable TotalCigsSmoked
.
hist(nesarc.sub$TotalCigsSmoked)
Experiment with the code in the next code chunk to set the binwidth =
(extremely important!), decide if you need boundary = 0
, add a title, and if needed the labels on the \(x\)- and \(y\)-axes.
p <- ggplot(data = nesarc.sub, aes(x = TotalCigsSmoked))
p <- p + geom_histogram(binwidth = 200)
p <- p + geom_rug()
p <- p + labs(x = "Estimated Cigarettes Smoked per Month")
p <- p + theme_bw()
p
# Use "boundary = 0" for values that can't be negative
p <- ggplot(data = nesarc.sub, aes(x = TotalCigsSmoked))
p <- p + geom_histogram(boundary = 0, binwidth = 100)
p <- p + geom_rug()
p <- p + labs(x = "Estimated Cigarettes Smoked per Month", title = "boundary at 0")
p <- p + theme_bw()
p
A smoothed density curve can be added with the geom_density()
function, where adjust = 2.0
is a fairly smooth curve. Experiment with adjust
values such as 1.0, 1.5, 2.0, and 3.0 to decrease or increase the smoothness and decide what level best captures the general shape of your distribution.
p <- ggplot(data = nesarc.sub, aes(x = TotalCigsSmoked))
p <- p + geom_histogram(aes(y = ..density..), boundary = 0, binwidth = 100)
p <- p + geom_rug()
p <- p + geom_density(alpha = 0.2, fill = "gray50", adjust = 1.5)
p <- p + labs(x = "Estimated Cigarettes Smoked per Month")
p <- p + theme_bw()
p
summary(nesarc.sub$TotalCigsSmoked)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
1.0 90.0 300.0 320.5 600.0 2940.0 9
fivenum(nesarc.sub$TotalCigsSmoked)
[1] 1 90 300 600 2940
median(nesarc.sub$TotalCigsSmoked, na.rm = TRUE)
[1] 300
IQR(nesarc.sub$TotalCigsSmoked, na.rm = TRUE)
[1] 510
The “shape” of the distribution for the estimated cigarettes smoked per month is skewed to the right. The “center” (median) of the distribution is 300 and the “spread” (interquartile range, middle 50%) for the distribution is 510.
p <- ggplot(data = nesarc.sub, aes(x = TotalCigsSmoked))
p <- p + geom_histogram(aes(y = ..density..), boundary = 0, binwidth = 100)
p <- p + geom_rug()
p <- p + geom_density(alpha = 0.2, fill = "gray50", adjust = 1.5)
p <- p + labs(x = "Estimated Cigarettes Smoked per Month"
, y = ""
, title = "Total cigaretts smoked for Young Smoking Adults"
)
p <- p + theme_bw()
p
Interpretation: Among smokers 25 years and younger, the number of cigarettes smoked is right skewed with a median (center) of 300 and the IQR (spread) (interquartile range, middle 50%) for the distribution is 510. There are three distinct modes (peaks) in the distribution because of how we constructed this variable from two others, and because of the common numbers of cigarettes being relative to a pack size (20). There are several outlying values greater than 1200 (representing 40 cigarettes per day) that are possibly overestimates from the people responding.
p <- ggplot(data = nesarc.sub, aes(x = DailyCigsSmoked))
p <- p + geom_histogram(aes(y = ..density..), boundary = 0, binwidth = 2)
p <- p + geom_rug()
p <- p + geom_density(alpha = 0.2, fill = "gray50", adjust = 1.5)
p <- p + labs(x = "Estimated Cigarettes Smoked per Month"
, y = ""
, title = "Total cigaretts smoked for Young Smoking Adults"
)
p <- p + theme_bw()
p
Interpretation: Among smokers 25 years and younger, most smoke less than 10 cigarettes per day with another peak at 20 (1 pack per day), another small peak at 40 (2 packs per day), and some extreme outlying values at 60, 80, and 100 (3, 4, and 5 packs per day).
Research question: With my questions, I don’t have a meaningful question to ask that will relate two numeric variable.
An exploratory question whether there is a relationship between a person’s age (Age
) and total number of cigaretts smoked (TotalCigsSmoked
). Because age is an integer varible, I jitter the points, and I reduce the opacity (increase the transparancy) with alpha = 1/4
to see the density of points better (1/4 means that it takes 4 overlayed points to be fully opaque).
Interpretation: I added a linear regression smoother to see whether there was any linear trend with age. The horizontal line suggests that there isn’t a (linear) relationship here.
The first plot has all the points in their original locations, but they end up stacking on top of each other so you can’t tell how many points are there.
library(ggplot2)
p <- ggplot(nesarc.sub, aes(x = Age, y = TotalCigsSmoked))
p <- p + geom_point(alpha = 1/4)
p <- p + stat_smooth(method = lm)
p <- p + labs(title = "Total Cigarettes Smoked vs Age")
print(p)