Document overview

This document is organized by Week number. The in-class assignments are indicated by Tuesday and Thursday. Each week’s homework is often a combination of Tuesday and Thursday, with a small extension. Therefore, “fleshing out” the Tuesday and Thursday sections with a little addition is often sufficient for your homework assignment.

Consider your readers (graders):

Global code options

# I set some GLOBAL R chunk options here.
#   (to hide this message add "echo=FALSE" to the code chunk options)
# In particular, see the fig.height and fig.width (in inches)
#   and notes about the cache option.

knitr::opts_chunk$set(comment = NA, message = FALSE, warning = FALSE, width = 100)
knitr::opts_chunk$set(fig.align = "center", fig.height = 4, fig.width = 6)

# Note: The "cache=TRUE" option will save the computations of code chunks
#   so R it doesn't recompute everything every time you recompile.
#   This can save _tons of time_ if you're working on a small section of code
#   at the bottom of the document.
#   Code chunks will be recompiled only if they are edited.
#   The autodep=TRUE will also update dependent code chunks.
#   A folder is created with the cache in it -- if you delete the folder, then
#   all the code chunks will recompute again.
#   ** If things are working as expected, or I want to freshly compute everything,
#      I delete the *_cache folder.
knitr::opts_chunk$set(cache = TRUE) #, autodep=TRUE)  #$

Starting at Week03, we will concatenate all our assignments together to retain the relevant information needed for subsequent sections. You will also have an opportunity to revisit previous parts to make changes or improvements, such as updating your codebook, modifying your research questions, improving tables and plots. Please organize your headers using the # symbol appropriately, and generate a table of contents by using the “toc: true” in the yaml (see any previous Rmd file on the website).


Week01: Personal Codebook

Question of interest

Thursday ———

Dataset: National Epidemiologic Survey on Alcohol and Related Conditions (NESARC), with codebook wv1codebook.pdf.

Initial thinking: While nicotine dependence is a good starting point, I need to determine what it is about nicotine dependence that I am interested in. It strikes me that friends and acquaintances that I have known through the years that became hooked on cigarettes did so across very different periods of time. Some seemed to be dependent soon after their first few experiences with smoking and others after many years of generally irregular smoking behavior. I decide that I am most interested in exploring the association between level of smoking and nicotine dependence. I add to my codebook variables reflecting smoking levels (e.g. smoking quantity and frequency).

Topic of interest: Is there an association between nicotine dependence and the frequency and quantity of smoking on people up to 25 years old? The association may differ by ethnicity, age, gender, and other factors (though we won’t be able to test these additional associations until next semester in ADA2).

How I did it: I look through the codebook wv1codebook.pdf and find some variables of interest. I searched the text with “Ctrl-F” (find) to find these variables. For each variable, I copy/paste the description here, then formatted so it’s organized. You can choose to use a table or an outline format. I found this text format to be very easy to format. I retained the “frequency” of each response because it’s interesting to know, and because it was already in the codebook — this value is not required for your codebook.

Codebook

Dataset: NESARC
Primary association: nicotine dependence vs frequency and quantity of smoking

Key:
VarName (RenamedVarName)
  Variable description
  Data type (Continuous, Discrete, Nominal, Ordinal)
  Frequency ItemValue Description

IDNUM
  UNIQUE ID NUMBER WITH NO ALPHABETICS
  Nominal
  43093 1-43093. Unique Identification number

SEX (Sex)
  SEX
  Nominal
  18518 1. Male
  24575 2. Female

AGE (Age)
  AGE
  Continuous
  43079 18-97. Age in years
     14 98. 98 years or older

CHECK321 (SmokingStatus)
  CIGARETTE SMOKING STATUS
  Nominal
   9913 1. Smoked cigarettes in the past 12 months
   8078 2. Smoked cigarettes prior to the last 12 months
     22 9. Unknown
  25080 BL. NA, never or unknown if ever smoked 100+ cigarettes

TAB12MDX (TobaccoDependence)
  NICOTINE DEPENDENCE IN THE LAST 12 MONTHS
  Nominal
  38131 0. No nicotine dependence
   4962 1. Nicotine dependence

S3AQ3B1 (SmokingFreq)
  USUAL FREQUENCY WHEN SMOKED CIGARETTES
  Ordinal
  14836 1. Every day
    460 2. 5 to 6 Day(s) a week
    687 3. 3 to 4 Day(s) a week
    747 4. 1 to 2 Day(s) a week
    409 5. 2 to 3 Day(s) a month
    772 6. Once a month or less
    102 9. Unknown
  25080 BL. NA, never or unknown if ever smoked 100+ cigarettes

S3AQ3C1 (DailyCigsSmoked)
  USUAL QUANTITY WHEN SMOKED CIGARETTES
  Discrete
  17751 1-98. Cigarette(s)
    262 99. Unknown
  25080 BL. NA, never or unknown if ever smoked 100+ cigarettes

ETHRACE2A (Ethnicity)
  IMPUTED RACE/ETHNICITY
  Nominal
  24507 1. White, Not Hispanic or Latino
   8245 2. Black, Not Hispanic or Latino
    701 3. American Indian/Alaska Native, Not Hispanic or Latino
   1332 4. Asian/Native Hawaiian/Pacific Islander, Not Hispanic or Latino
   8308 5. Hispanic or Latino

MAJORDEPLIFE (Depression)
  MAJOR DEPRESSION - LIFETIME (NON-HIERARCHICAL)
  Nominal
  35254 0. No
   7839 1. Yes

Additional variables were created from the original variables:

CREATED VARIABLES

DaySmoke
  estimated number of days per month a subject smokes
  Continuous (due to way estimated), assumes 30 days per month using SmokingFreq)
        1-30.

TotalCigsSmoked
  estimated number of cigarettes per month a subject smokes (DaysSmoke * DailyCigsSmoked)
  Continuous
        0-large.

CigsSmokedFac
  quartiles of TotalCigsSmoked
  Ordinal
        ranges for each of the four quarters

SmokingFreq3
  three levels of smoking frequency
  Ordinal
    from SmokingFreq
    1. daily   = 1. Every day, 2. 5 to 6 Day(s) a week, 3. 3 to 4 Day(s) a week
    2. weekly  = 4. 1 to 2 Day(s) a week, 5. 2 to 3 Day(s) a month
    3. monthly = 6. Once a month or less

Week02: Literature Review

Dataset: National Epidemiologic Survey on Alcohol and Related Conditions (NESARC), with codebook wv1codebook.pdf.

Given the association that I have decided to examine, I use such keywords as nicotine dependence, tobacco dependence, and smoking. After reading through several titles and abstracts, I notice that there has been relatively little attention in the research literature to the association between smoking exposure and nicotine dependence. I expand a bit to include other substance use that provides relevant background as well.

Tuesday ———

Research questions

Based on my reading of the articles in the References as well as others, I have noted a few common and interesting themes:

  1. While it is true that smoking exposure is a necessary requirement for nicotine dependence, frequency and quantity of smoking are markedly imperfect indices for determining an individual’s probability of exhibiting nicotine dependence (this is true for other drugs as well).

  2. The association may differ based on ethnicity, age, and gender (although there is little work on this).

  3. One of the most potent risk factors consistently implicated in the etiology of smoking behavior and nicotine dependence is depression.

I have decided to further focus my question by examining whether the association between nicotine dependence and depression differs based on how much a person smokes. I am wondering if at low levels of smoking compared to high levels, nicotine dependence is more common among individuals with major depression than those without major depression. I add relevant depression questions/items/variables to my personal codebook as well as several demographic measures (age, gender, ethnicity, etc.) and any other variables I may wish to consider.

All required variables have been found and added to my personal codebook (by expanding Week01).

Thursday ———

Citations

Breslau et al. (1998), Dierker et al. (2001), Dierker et al. (2004), B. F. Grant et al. (1995), Bridget F. Grant et al. (2003), Kandel and Chen (2000), Rohde et al. (2003), Rohde et al. (2004), Stanton, Lowe, and Silva (1995)


(Week02a Research Proposal – skipping this step in ADA1)

The Association between Nicotine Dependence and Major Depression at Different Levels of Smoking Exposure

In our class we won’t do this assignment. However, this example illustrates the next step toward clarifying your ideas and writing introductory, methods, and discussion content for your poster.

Introduction

One of the most potent risk factors consistently implicated in both the etiology of smoking behavior as well as the subsequent development of nicotine dependence is major depression. Evidence for this association comes from longitudinal investigations in which depression has been shown to increase risk of later smoking (Breslau et al. 1998; Dierker et al. 2001). This temporal ordering suggests the possibility of a causal relationship. In fact, the vast majority of research to date has focused on the role of major depression in increasing the probability and amount of smoking (Dierker et al. 2004; Rohde et al. 2004; Rohde et al. 2003).

While it is true that smoking exposure is a necessary requirement for nicotine dependence, frequency and quantity of smoking are markedly imperfect indices for determining an individual’s probability of developing nicotine dependence (Kandel and Chen 2000; Stanton, Lowe, and Silva 1995). For example, a substantial number of individuals reporting daily and/or heavy smoking do not meet criteria for nicotine dependence (Kandel and Chen 2000). Conversely, nicotine dependence has been seen among population subgroups reporting relatively low levels of daily and non daily smoking (Kandel and Chen 2000).

A complementary or alternate role that major depression may play is as a cause or signal of greater sensitivity to nicotine dependence, over and above an individual’s level of smoking exposure. While major depression has been shown to increase an individual’s probability of smoking initiation, regular use and nicotine dependence, it remains unclear whether it may signal greater sensitivity for nicotine dependence regardless of smoking quantity.

The present study will examine young adults from the National Epidemiologic Survey of Alcohol and Related Conditions (NESARC). The goals of the analysis will include 1) establishing the relationship between major depression and nicotine dependence; and 2) determining whether or not the relationship between major depression and nicotine dependence exists above and beyond smoking quantity.

Methods

Sample

The sample from the first wave of the National Epidemiologic Survey on Alcohol and Related Conditions (NESARC) represents the civilian, non-institutionalized adult population of the United States, and includes persons living in households, military personnel living off base, and persons residing in the following group quarters: boarding or rooming houses, non-transient hotels and motels, shelters, facilities for housing workers, college quarters, and group homes. The NESARC included over sampling of Blacks, Hispanics and young adults aged 18 to 24 years. The sample included 43,093 participants.

Procedure

One adult was selected for interview in each household, and face-to-face computer assisted interviews were conducted in respondents’ homes following informed consent procedures.

Measures

Lifetime major depression (i.e., those experienced in the past 12 months and prior to the past 12 months) were assessed using the NIAAA, Alcohol Use Disorder and Associated Disabilities Interview Schedule DSM-IV (AUDADIS-IV) (Bridget F. Grant et al. 2003; B. F. Grant et al. 1995). The tobacco module of the AUDADIS-IV contains detailed questions on the frequency, quantity, and patterning of tobacco use as well as symptom criteria for DSM-IV nicotine dependence. Current smoking was evaluated through both smoking frequency (“About how often did you usually smoke in the past year?”) coded dichotomously in terms of the presence or absence of daily smoking, and quantity (“On the days that you smoked in the last year, about how many cigarettes did you usually smoke?”)

Implications

While chronic use is a key feature in the development of dependence, the present study will evaluate whether individual differences in nicotine dependence exist above and beyond level of exposure. If individuals with major depression are more sensitive to the development of nicotine dependence regardless of how much they smoke, they would represent an important population subgroup for targeted smoking intervention programs.


Week03: Data Subset, Univariate Summaries And Plots

Purpose of study

(Note: This is the last paragraph of the research proposal used to remind the reader of the research objectives.)

The present study will examine young adults from the National Epidemiologic Survey of Alcohol and Related Conditions (NESARC). The goals of the analysis will include 1) establishing the relationship between major depression and nicotine dependence; and 2) determining whether or not the relationship between major depression and nicotine dependence exists above and beyond smoking quantity.

Variables

Variables from NESARC that will be used include (already in codebook, above):

  • IDNUM (unique ID number with no alphabetics),
  • MAJORDEPLIFE (has subject experienced a major depression?),
  • CHECK321 (has subject smoked cigarettes in past 12 months?),
  • AGE (age of subject),
  • TAB12MDX (tobacco dependence past 12 months),
  • S3AQ3B1 (usual frequency when cigarettes smoked),
  • ETHRACE2A (ethnicity of subject),
  • SEX (gender of subject), and
  • S3AQ3C1 (usual smoking quantity of cigarettes when smoked).

Tuesday ———

Data subset

First, the data is placed on the search path using the PDS package.

library(PDS)
dim(NESARC)
[1] 43093  3008

The variables of interest are specified and a subset of the dataset including only these variables is created and stored in the data frame nesarc.sub.

# variables to include in our data subset
var.list <- c("IDNUM"
            , "SEX"
            , "AGE"
            , "CHECK321"
            , "TAB12MDX"
            , "S3AQ3B1"
            , "S3AQ3C1"
            , "ETHRACE2A"
            , "MAJORDEPLIFE")
var.list
[1] "IDNUM"        "SEX"          "AGE"          "CHECK321"    
[5] "TAB12MDX"     "S3AQ3B1"      "S3AQ3C1"      "ETHRACE2A"   
[9] "MAJORDEPLIFE"
# subset of data
nesarc.sub <- subset(NESARC, select = var.list)

Check size and structure of data subset. Note that categorical variables are already Factor variables, but the levels do not have meaningful labels, yet.

# size of subset
dim(nesarc.sub)
[1] 43093     9
# structure of subset
str(nesarc.sub)
'data.frame':   43093 obs. of  9 variables:
 $ IDNUM       : Factor w/ 43093 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ SEX         : Factor w/ 2 levels "1","2": 1 2 2 1 1 2 1 2 2 2 ...
 $ AGE         : num  23 28 81 18 36 34 19 84 29 18 ...
 $ CHECK321    : Factor w/ 3 levels "1","2","9": NA NA NA NA NA NA NA NA NA NA ...
 $ TAB12MDX    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ S3AQ3B1     : Factor w/ 7 levels "1","2","3","4",..: NA NA NA NA NA NA NA NA NA NA ...
 $ S3AQ3C1     : num  NA NA NA NA NA NA NA NA NA NA ...
 $ ETHRACE2A   : Factor w/ 5 levels "1","2","3","4",..: 5 5 5 5 2 2 2 1 1 5 ...
 $ MAJORDEPLIFE: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 2 1 ...

Renaming Variables

Rename

  # note, if the new "dplyr" package is installed, this may not work correctly
library(plyr) # for rename(dat, c("from" = "to"))
nesarc.sub <- rename(nesarc.sub, c(
                        "SEX"          = "Sex"
                      , "AGE"          = "Age"
                      , "CHECK321"     = "SmokingStatus"
                      , "TAB12MDX"     = "TobaccoDependence"
                      , "S3AQ3B1"      = "SmokingFreq"
                      , "S3AQ3C1"      = "DailyCigsSmoked"
                      , "ETHRACE2A"    = "Ethnicity"
                      , "MAJORDEPLIFE" = "Depression"
                      ))

A quick summary of the variables indicates many missing values.

# summary of variables -- notice many NAs
summary(nesarc.sub)
     IDNUM       Sex            Age       SmokingStatus TobaccoDependence
 1      :    1   1:18518   Min.   :18.0   1   : 9913    0:38131          
 2      :    1   2:24575   1st Qu.:32.0   2   : 8078    1: 4962          
 3      :    1             Median :44.0   9   :   22                     
 4      :    1             Mean   :46.4   NA's:25080                     
 5      :    1             3rd Qu.:59.0                                  
 6      :    1             Max.   :98.0                                  
 (Other):43087                                                           
  SmokingFreq    DailyCigsSmoked Ethnicity Depression
 1      :14836   Min.   : 1.00   1:24507   0:35254   
 6      :  772   1st Qu.: 6.00   2: 8245   1: 7839   
 4      :  747   Median :15.00   3:  701             
 3      :  687   Mean   :16.81   4: 1332             
 2      :  460   3rd Qu.:20.00   5: 8308             
 (Other):  511   Max.   :99.00                       
 NA's   :25080   NA's   :25080                       

AddHealth, joining waves 1 and 2 together

Joining AddHealth waves 1 and 2 together into a single dataset can be done if you want to use variables from when the participants were both adolecents and adults.

# Original AddHealth datasets
library(PDS)
# size of datasets, to verify the data is available
dim(AddHealth)
[1] 6504 2829
dim(addhealth_public4)
[1] 6504  976
# First column is the unique ID
colnames(AddHealth)[1]
[1] "AID"
colnames(addhealth_public4)[1]
[1] "AID"
# Rename W4's unique ID variable "aid" to "AID"
#   so both W1 and W4 have same variable name (lower/uppercase matters)
library(plyr)
addhealth_public4 <- rename(addhealth_public4, c("aid" = "AID"))
# note that the first column name has be updated
colnames(addhealth_public4)[1]
[1] "AID"
# Join two datasets by unique ID
AH14 <- join(AddHealth, addhealth_public4, by = "AID", type = "full")
# now we have twice as many columns
dim(AH14)
[1] 6504 3804
# Removing this large object since I don't need this data for this project
rm(AH14)

Week04: Thursday ———

Data Cleaning

From the reading (ADA1 Ch 11 notes), there are three primary steps of data cleaning.

  1. Define “edit rules” for detecting obvious inconsistencies for your variables. 2a. Determine which observations violate the rules and 2b. try to localize the reason for the violation.
  2. Correct the errors.

1. Define “edit rules” for detecting obvious inconsistencies for your variables

You should have at least one rule for each variable in your data subset. Possibly some “mixed” rules if there are sensible variable relationships.

By writing these rules, I discovered I had not carefully coded values from the codebook correctly. For example, NAs actually mean 0 (or 100+ cigarettes) for CHECK321 (SmokingStatus), S3AQ3B1 (SmokingFreq), and S3AQ3C1 (DailyCigsSmoked) — so I updated the code above to first recode NA to 0, then 9 to NA! This is common to correct previous ideas as you learn new things by looking closer at your data.

# I have saved my edit rules into this file (in the same folder as this .Rmd file):
fn.edit.rules <- "ADA1_HW_04_NESARC_DataCleaningEditRules.txt"

# You don't need to do this, but I think it's handy to verify that the correct
#   file is being read.
# This is the content of the text file:
readLines(fn.edit.rules)
 [1] "# Erik Erhardt"                                                                                                 
 [2] "# NESARC Edit Rules for subset of variables"                                                                    
 [3] "# ADA1 HW04"                                                                                                    
 [4] "# http://statacumen.com/teaching/ada1/"                                                                         
 [5] ""                                                                                                               
 [6] "# Note:"                                                                                                        
 [7] "# Many of these ranges are taken directly from the codebook ranges."                                            
 [8] "# We don't expect to see violations, but let's catch suprises at this stage"                                    
 [9] "#   rather than during an analysis later!"                                                                      
[10] "# My spacing alignment is not important, it makes it a little easier for me to read."                           
[11] ""                                                                                                               
[12] "# I use %in% when there's a range of integers, so that values such as 1.5 aren't allowed."                      
[13] ""                                                                                                               
[14] "# Original (renamed) variables:"                                                                                
[15] "  ## I comment the IDNUM because it takes a LONG time to check and the unique IDs do not have admissible ranges"
[16] "#IDNUM               %in% 1:43093           # range of ID numbers"                                              
[17] "Sex                 %in% 1:2               # 1 and 2 = c(\"Male\", \"Female\")"                                 
[18] "Age                 %in% 18:98             # based on codebook"                                                 
[19] "SmokingStatus       %in% c(1:2, 9)         # 1 = smoked in the last 12 mo"                                      
[20] "TobaccoDependence   %in% 0:1               # factor takes values 1 or 2"                                        
[21] "SmokingFreq         %in% c(0:6, 9)         # daily to once/mo"                                                  
[22] "DailyCigsSmoked     %in% c(0:98, 99)       # number cig"                                                        
[23] "Ethnicity           %in% 1:5               # 5 groups"                                                          
[24] "Depression          %in% 0:1               # no/yes"                                                            
[25] ""                                                                                                               
[26] "# mixed rules"                                                                                                  
[27] "#   (Note: == is not allowed in a mixed edit (I don't know why not), so I use %in%, instead)"                   
[28] ""                                                                                                               
[29] "# Smokers can't have a 0 frequency"                                                                             
[30] "if (SmokingStatus %in% 1) (SmokingFreq     != 0)"                                                               
[31] "# Smokers must have non-zero daily cigs"                                                                        
[32] "if (SmokingStatus %in% 1) (DailyCigsSmoked != 0)"                                                               
[33] ""                                                                                                               
[34] "## The remaining variables I constructed from the original ones."                                               
[35] "## Ideally, I'd check these, as well."                                                                          
[36] "## If the original variables are correct, these should also be correct"                                         
[37] "##    (unless I coded them incorrectly -- but, I check these when I code them)."                                
[38] "# DaysSmoke"                                                                                                    
[39] "# TotalCigsSmoked"                                                                                              
[40] "# CigsSmokedFac"                                                                                                
[41] "# TobaccoDependence01"                                                                                          
[42] "# Alcohol.binary"                                                                                               
# Encode these rules using the editrules package
library(editrules)
EditRules.N.sub <- editfile(fn.edit.rules)
EditRules.N.sub

Data model:
dat1 : Age %in% c('18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '60', '61', '62', '63', '64', '65', '66', '67', '68', '69', '70', '71', '72', '73', '74', '75', '76', '77', '78', '79', '80', '81', '82', '83', '84', '85', '86', '87', '88', '89', '90', '91', '92', '93', '94', '95', '96', '97', '98')
dat2 : DailyCigsSmoked %in% c('0', '1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '2', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '3', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '4', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '5', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '6', '60', '61', '62', '63', '64', '65', '66', '67', '68', '69', '7', '70', '71', '72', '73', '74', '75', '76', '77', '78', '79', '8', '80', '81', '82', '83', '84', '85', '86', '87', '88', '89', '9', '90', '91', '92', '93', '94', '95', '96', '97', '98', '99')
dat3 : Depression %in% c('0', '1')
dat4 : Ethnicity %in% c('1', '2', '3', '4', '5')
dat5 : Sex %in% c('1', '2')
dat6 : SmokingFreq %in% c('0', '1', '2', '3', '4', '5', '6', '9')
dat7 : SmokingStatus %in% c('1', '2', '9')
dat8 : TobaccoDependence %in% c('0', '1') 

Edit set:
mix1 : if( SmokingFreq == '0' ) SmokingStatus != '1'
mix2 : if( DailyCigsSmoked == '0' ) SmokingStatus != '1' 

Show dependencies between rules and variables. Blue circles represent variables and yellow boxes represent restrictions. The lines indicate which restrictions pertain to what variables.

op <- par(no.readonly = TRUE)         # save plot settings
par(mfrow=c(1,1), mar = c(0,0,0,0))
plot(EditRules.N.sub)

par(op)                               # restore plot settings

2a. Determine which observations violate the rules and

If there are several isolated issues, then best to correct those. Alternatively, if there are many errors this may indicate that your edit rules are incorrect; maybe you don’t understand your data.

ve.N.sub <- violatedEdits(EditRules.N.sub, nesarc.sub)
summary(ve.N.sub)
Edit violations, 43093 observations, 0 completely missing (0%):

 editname  freq   rel
     dat2 25080 58.2%
     dat6 25080 58.2%
     dat7 25080 58.2%

Edit violations per record:

 errors  freq   rel
      0 18013 41.8%
      5 25080 58.2%
plot(ve.N.sub)

Check whether all the violations are due to missing values. Based on the numbers, they most certainly are, but best to make sure.

## Strategy:
# Determine which edit rules pertain to which variables
# Find the NAs in the violatedEdits result and in the data set
# If these all match, and there are the same number as violations,
#   then all the violations are due to NAs.

# The column names of the violatedEdits
colnames(ve.N.sub)
 [1] "dat1" "dat2" "dat3" "dat4" "dat5" "dat6" "dat7" "dat8" "mix1" "mix2"
# The edits with violations
summary(ve.N.sub)
Edit violations, 43093 observations, 0 completely missing (0%):

 editname  freq   rel
     dat2 25080 58.2%
     dat6 25080 58.2%
     dat7 25080 58.2%

Edit violations per record:

 errors  freq   rel
      0 18013 41.8%
      5 25080 58.2%
# The variables assigned to each edit name
EditRules.N.sub

Data model:
dat1 : Age %in% c('18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '60', '61', '62', '63', '64', '65', '66', '67', '68', '69', '70', '71', '72', '73', '74', '75', '76', '77', '78', '79', '80', '81', '82', '83', '84', '85', '86', '87', '88', '89', '90', '91', '92', '93', '94', '95', '96', '97', '98')
dat2 : DailyCigsSmoked %in% c('0', '1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '2', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '3', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '4', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '5', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '6', '60', '61', '62', '63', '64', '65', '66', '67', '68', '69', '7', '70', '71', '72', '73', '74', '75', '76', '77', '78', '79', '8', '80', '81', '82', '83', '84', '85', '86', '87', '88', '89', '9', '90', '91', '92', '93', '94', '95', '96', '97', '98', '99')
dat3 : Depression %in% c('0', '1')
dat4 : Ethnicity %in% c('1', '2', '3', '4', '5')
dat5 : Sex %in% c('1', '2')
dat6 : SmokingFreq %in% c('0', '1', '2', '3', '4', '5', '6', '9')
dat7 : SmokingStatus %in% c('1', '2', '9')
dat8 : TobaccoDependence %in% c('0', '1') 

Edit set:
mix1 : if( SmokingFreq == '0' ) SmokingStatus != '1'
mix2 : if( DailyCigsSmoked == '0' ) SmokingStatus != '1' 
# are all TRUEs for editrule dat2 the same as NAs for DailyCigsSmoked?
all(ve.N.sub[,"dat2"] == is.na(nesarc.sub$DailyCigsSmoked))
[1] TRUE
# are all TRUEs for editrule dat6 the same as NAs for SmokingFreq?
all(ve.N.sub[,"dat6"] == is.na(nesarc.sub$SmokingFreq))
[1] TRUE
# are all TRUEs for editrule dat7 the same as NAs for SmokingStatus?
all(ve.N.sub[,"dat7"] == is.na(nesarc.sub$SmokingStatus))
[1] TRUE
## If there were any FALSE values,
##    that would indicate that at least one violatedEdit was not due to an NA.
##    That would need to be investicated and corrected.
#$

Note that all the violations are NA values, thus, there’s no issues with the variables I have (fortunately).

2b. Try to localize the reason for the violation.

Here, the le.* object contains some processing metadata and a logical array labeled adapt which indicates the minimal set of variables to be altered in each record. It can be used in correction and imputation procedures for filling in valid values.

The values of TRUE are the recommended changes.

I don’t run this code since it takes so very long and doesn’t give much additional information over the summary above from the violatedEdits().

# NOT RUN
# This can take a long time on a large data set
le.N.sub <- localizeErrors(EditRules.N.sub, nesarc.sub, method = "mip")
le.N.sub$adapt   #$

The behaviour of localizeErrors() can be tuned with various options. It is possible to supply a confidence weight for each variable allowing for fine grained control on which values should be adapted. It is also possible to choose a branch-and-bound based solver (instead of the MIP solver used here), which is typically slower but allows for more control.

3. Correct the errors

For the purposes of ADA1, we will manually correct errors, either by replacing values or by excluding observations.

Manual corrections example (not run in my code):

people[2, "status"]   <- "single"
people[5, "height"]   <- 7
people[5, "agegroup"] <- "adult"
summary(violatedEdits(EditRules.N.sub, people[id, ]))

End Week04 data cleaning —-

Back to Week03

Subset data to young smokers

Next, a subset of people 25 or younger who smoked in the last 12 months is created using the subset function. Note that SmokingStatus == 1 is used to see if subject has smoked in the past 12 months.

Notice, the missing values have been removed (since NAs are not less than or equal to numbers).

# the subset command below will not include a row that evalutes to NA for the
# specified subset.  For example:
(NA == 1)
[1] NA
# age and smoking subset
nesarc.sub <- subset(nesarc.sub
                    , subset = (Age <= 25)     # people 25 or younger
                             & (SmokingStatus == 1) # smoked in the past 12 months
                    )
dim(nesarc.sub)
[1] 1706    9
summary(nesarc.sub)
     IDNUM      Sex          Age        SmokingStatus TobaccoDependence
 21     :   1   1:856   Min.   :18.00   1:1706        0:810            
 77     :   1   2:850   1st Qu.:20.00   2:   0        1:896            
 103    :   1           Median :22.00   9:   0                         
 122    :   1           Mean   :21.61                                  
 136    :   1           3rd Qu.:23.00                                  
 150    :   1           Max.   :25.00                                  
 (Other):1700                                                          
 SmokingFreq DailyCigsSmoked Ethnicity Depression
 1:1320      Min.   : 1.0    1:1060    0:1260    
 2:  68      1st Qu.: 5.0    2: 211    1: 446    
 3:  91      Median :10.0    3:  42              
 4:  88      Mean   :11.8    4:  58              
 5:  65      3rd Qu.:20.0    5: 335              
 6:  71      Max.   :99.0                        
 9:   3                                          

Coding missing values

Recall that many variables are coded in the codebook.

From the codebook, these are the variables and the codes for the missing or unknown values.

CHECK321 (SmokingStatus)
     22 9. Unknown
S3AQ3B1 (SmokingFreq)
    102 9. Unknown
S3AQ3C1 (DailyCigsSmoked)
    262 99. Unknown

Replace the coded missing values with NA, then remove the remaining records with missing values. Not all of these variables need this operation, since the previous subset commands removed missing values by chance. However, I’m going to run each as an extra precaution. Afterall, later, if we were to make different previous data decisions, it’s possible we could end up with missing values at this point that are incorrectly coded without this step!

# Note for this first variable, there are still three factor levels
str(nesarc.sub$SmokingStatus)
 Factor w/ 3 levels "1","2","9": 1 1 1 1 1 1 1 1 1 1 ...
#   even though only the first category has observations.
table(nesarc.sub$SmokingStatus)

   1    2    9 
1706    0    0 
# Replace any values that are "9" with NA
nesarc.sub$SmokingStatus[nesarc.sub$SmokingStatus == 9] <- NA
table(nesarc.sub$SmokingStatus)

   1    2    9 
1706    0    0 
# Then drop unused factor categories.
nesarc.sub$SmokingStatus <- factor(nesarc.sub$SmokingStatus)[, drop = TRUE]
table(nesarc.sub$SmokingStatus)

   1 
1706 
# Repeat this for the other two variables, using table() to see effect:
table(nesarc.sub$SmokingFreq)

   1    2    3    4    5    6    9 
1320   68   91   88   65   71    3 
nesarc.sub$SmokingFreq[nesarc.sub$SmokingFreq == 9] <- NA
table(nesarc.sub$SmokingFreq)

   1    2    3    4    5    6    9 
1320   68   91   88   65   71    0 
nesarc.sub$SmokingFreq <- factor(nesarc.sub$SmokingFreq)[, drop = TRUE]
table(nesarc.sub$SmokingFreq)

   1    2    3    4    5    6 
1320   68   91   88   65   71 
# numeric variable, no levels to drop
nesarc.sub$DailyCigsSmoked[nesarc.sub$DailyCigsSmoked == 99] <- NA

summary(nesarc.sub)
     IDNUM      Sex          Age        SmokingStatus TobaccoDependence
 21     :   1   1:856   Min.   :18.00   1:1706        0:810            
 77     :   1   2:850   1st Qu.:20.00                 1:896            
 103    :   1           Median :22.00                                  
 122    :   1           Mean   :21.61                                  
 136    :   1           3rd Qu.:23.00                                  
 150    :   1           Max.   :25.00                                  
 (Other):1700                                                          
 SmokingFreq DailyCigsSmoked Ethnicity Depression
 1   :1320   Min.   : 1.00   1:1060    0:1260    
 2   :  68   1st Qu.: 5.00   2: 211    1: 446    
 3   :  91   Median :10.00   3:  42              
 4   :  88   Mean   :11.34   4:  58              
 5   :  65   3rd Qu.:20.00   5: 335              
 6   :  71   Max.   :98.00                       
 NA's:   3   NA's   :9                           

Finally, consider using na.omit() to remove any records with missing values if it won’t cause issues with analysis. The command na.omit() removes a row if any columns have an NA in that row. This can have drastic consequences. For example, imagine that you have 7 variables and 6 variables have complete data, but the 7th variable is about three-quarters NAs! Then na.omit() will drop three-quarters of your data, even though most of your variables are complete!

There’s only one advantage I can think of for running na.omit(), and that’s if you have a large dataset and only a few observations have a few NAs. You can’t analyze data with NAs, since any calculation involving NA results in an NA. Therefore imagine two scenarios: (1) An analysis with Variables 1 and 2, each with a few NA distributed throughout; first the observations with NA for either Variable 1 or 2 will be removed, then the analysis will proceed. (2) A similar analysis, but now with Variables 1 and 3, but Variable 3 has NA for different observations than Variable 2 — therefore, different observations will be removed before analysis. In summary, these two analyses will include different observations from the dataset. Therefore, by running na.omit() you will start with a dataset with no NAs, thus every analysis will be of exactly the same set of data (at the cost of not using all the possible data for each individual analysis).

Below, I do not run na.omit() since I’m concerned about having lots of NAs for some variables.

dim(nesarc.sub)
[1] 1706    9
# remove rows with missing values
#nesarc.sub <- na.omit(nesarc.sub)
dim(nesarc.sub)
[1] 1706    9
summary(nesarc.sub)
     IDNUM      Sex          Age        SmokingStatus TobaccoDependence
 21     :   1   1:856   Min.   :18.00   1:1706        0:810            
 77     :   1   2:850   1st Qu.:20.00                 1:896            
 103    :   1           Median :22.00                                  
 122    :   1           Mean   :21.61                                  
 136    :   1           3rd Qu.:23.00                                  
 150    :   1           Max.   :25.00                                  
 (Other):1700                                                          
 SmokingFreq DailyCigsSmoked Ethnicity Depression
 1   :1320   Min.   : 1.00   1:1060    0:1260    
 2   :  68   1st Qu.: 5.00   2: 211    1: 446    
 3   :  91   Median :10.00   3:  42              
 4   :  88   Mean   :11.34   4:  58              
 5   :  65   3rd Qu.:20.00   5: 335              
 6   :  71   Max.   :98.00                       
 NA's:   3   NA's   :9                           

Now we have a dataset with no missing values.

Creating new variables

Additional variables that we created from the original variables are listed in this ammendment to the original codebook.

CREATED VARIABLES

DaySmoke
  estimated number of days per month a subject smokes
  Continuous (due to way estimated), assumes 30 days per month using SmokingFreq)
        1-30.

TotalCigsSmoked
  estimated number of cigarettes per month a subject smokes (DaysSmoke * DailyCigsSmoked)
  Continuous
        0-large.

CigsSmokedFac
  quartiles of TotalCigsSmoked
  Ordinal
        ranges for each of the four quarters

SmokingFreq3
  three levels of smoking frequency
  Ordinal
    from SmokingFreq
    1. daily   = 1. Every day, 2. 5 to 6 Day(s) a week, 3. 3 to 4 Day(s) a week
    2. weekly  = 4. 1 to 2 Day(s) a week, 5. 2 to 3 Day(s) a month
    3. monthly = 6. Once a month or less

From categories to numeric

DaysSmoke estimates the days per month a subject smokes by converting SmokingFreq (a factor with 6 levels) to a numeric variable using as.numeric() and multiplying by the midpoint of the range of SmokingFreq.

nesarc.sub$DaysSmoke <- as.numeric(nesarc.sub$SmokingFreq)
nesarc.sub$DaysSmoke[nesarc.sub$DaysSmoke == 1] <- 30    # 1. Every day
nesarc.sub$DaysSmoke[nesarc.sub$DaysSmoke == 2] <- 4*5.5 # 2. 5 to 6 Day(s) a week
nesarc.sub$DaysSmoke[nesarc.sub$DaysSmoke == 3] <- 4*3.5 # 3. 3 to 4 Day(s) a week
nesarc.sub$DaysSmoke[nesarc.sub$DaysSmoke == 4] <- 4*1.5 # 4. 1 to 2 Day(s) a week
nesarc.sub$DaysSmoke[nesarc.sub$DaysSmoke == 5] <- 2.5   # 5. 2 to 3 Day(s) a month
nesarc.sub$DaysSmoke[nesarc.sub$DaysSmoke == 6] <- 1     # 6. Once a month or less
summary(nesarc.sub$DaysSmoke)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   1.00   30.00   30.00   25.07   30.00   30.00       3 

From numeric to categories based on quantiles

The variable TotalCigsSmoked estimates the monthly number of cigarettes a subject smokes per month by multiplying DaysSmoke times DailyCigsSmoked (the usual quantity smoked per day).

nesarc.sub$TotalCigsSmoked <- nesarc.sub$DaysSmoke * nesarc.sub$DailyCigsSmoked

The numeric variable TotalCigsSmoked is converted into a factor with four roughly equivalent numbers (quartiles) stored in each level of the factor CigsSmokedFac using the cut function.

quantile(nesarc.sub$TotalCigsSmoked, na.rm = TRUE)
  0%  25%  50%  75% 100% 
   1   90  300  600 2940 
nesarc.sub$CigsSmokedFac <- cut(nesarc.sub$TotalCigsSmoked
                              , breaks = quantile(nesarc.sub$TotalCigsSmoked, na.rm = TRUE)
                              , include.lowest = TRUE)

summary(nesarc.sub$CigsSmokedFac)
        [1,90]       (90,300]      (300,600] (600,2.94e+03]           NA's 
           429            678            503             87              9 

From many categories to a few

The variable SmokingFreq3 collapses the SmokingFreq from 6 down to 3 categories. This is an example to show you how to do this with your own variables. Some of you will have categorical variables that have a dozen or so categories; having so many categories makes interpretation very difficulty — fewer is easier.

## SmokingFreq3 variable
# notice that I'm doing this before I've assigned the labels
#   if you do this after assigning labels (with the factor() command), then you'll want to use "as.numeric()" as below
# inspect original variable
table(nesarc.sub$SmokingFreq)

   1    2    3    4    5    6 
1320   68   91   88   65   71 
# look at numeric values
table(as.numeric(nesarc.sub$SmokingFreq))

   1    2    3    4    5    6 
1320   68   91   88   65   71 
# Note: it is sometimes easier to use the numeric "levels" of a factor variable, than the character "labels"

# initialize new variable with NAs
nesarc.sub$SmokingFreq3 <- NA
# find indicies for each group based on original variable, then assign value to new variable
nesarc.sub$SmokingFreq3[(nesarc.sub$SmokingFreq %in% c(1, 2, 3))] <- 1  # 1. daily
nesarc.sub$SmokingFreq3[(nesarc.sub$SmokingFreq %in% c(4, 5   ))] <- 2  # 2. weekly
nesarc.sub$SmokingFreq3[(nesarc.sub$SmokingFreq %in% c(6      ))] <- 3  # 3. monthly

# new variable
table(nesarc.sub$SmokingFreq3)

   1    2    3 
1479  153   71 
# we will label the new factor variable below, with the others

Review results of new variables

Look at the first 10 observations in the dataset, with special attention to the new variables.

head(nesarc.sub, 10)
    IDNUM Sex Age SmokingStatus TobaccoDependence SmokingFreq
21     21   2  25             1                 1           1
77     77   1  21             1                 1           2
103   103   2  24             1                 1           1
122   122   2  23             1                 1           1
136   136   1  21             1                 0           1
150   150   2  23             1                 0           1
155   155   2  21             1                 1           1
174   174   1  19             1                 0           1
178   178   2  19             1                 0           1
184   184   2  19             1                 0           1
    DailyCigsSmoked Ethnicity Depression DaysSmoke TotalCigsSmoked
21                3         2          0        30              90
77                3         5          1        22              66
103              10         1          1        30             300
122              10         1          1        30             300
136              20         1          0        30             600
150               5         1          1        30             150
155               8         2          0        30             240
174               1         1          0        30              30
178              10         5          0        30             300
184              20         1          0        30             600
    CigsSmokedFac SmokingFreq3
21         [1,90]            1
77         [1,90]            1
103      (90,300]            1
122      (90,300]            1
136     (300,600]            1
150      (90,300]            1
155      (90,300]            1
174        [1,90]            1
178      (90,300]            1
184     (300,600]            1

Working with Dates

Dates can be tricky in R. The lubridate package makes them much easier.

Below I’m going to create a variable for the person’s age in years based on the person’s date of birth and the interview date. There is an AGE variable which I’m already using, but if there wasn’t, or I wanted to be more precise about their age (such as, 24.34 years old), then this is a way to do it.

I’m going to create a separate data frame just for this example to show how to do it; if you needed these variables, they should be part of your original subset above. Below, the “C” variables (CDAY) are the interview day, month, and year, and the “DOB” variables (DOBD) are the date of birth dat, month, and year.

dat.sub <- subset(NESARC, select = c(CDAY, CMON, CYEAR, DOBD, DOBM, DOBY))
str(dat.sub)
'data.frame':   43093 obs. of  6 variables:
 $ CDAY : num  14 12 23 9 18 7 1 18 9 16 ...
 $ CMON : num  8 1 11 9 10 9 9 9 1 8 ...
 $ CYEAR: num  2001 2002 2001 2001 2001 ...
 $ DOBD : num  27 15 28 27 11 28 16 10 7 25 ...
 $ DOBM : num  9 12 12 6 9 12 3 11 4 3 ...
 $ DOBY : num  1977 1973 1919 1983 1965 ...

Combine the date fields together, then interpret them into a date object. I’ll do this for both the interview date and the date of birth. I print the head (first 6 observations) after each operation.

library(lubridate)
## interview date
# combine day, month, and year into one text string
cdate.text <- paste(dat.sub$CDAY, dat.sub$CMON, dat.sub$CYEAR)
head(cdate.text)
[1] "14 8 2001"  "12 1 2002"  "23 11 2001" "9 9 2001"   "18 10 2001"
[6] "7 9 2001"  
# create date object by interpretting the day, month, year with dmy()
cdate.date <- dmy(cdate.text)
head(cdate.date)
[1] "2001-08-14" "2002-01-12" "2001-11-23" "2001-09-09" "2001-10-18"
[6] "2001-09-07"
## date of birth
# combine day, month, and year into one text string
dob.text <- paste(dat.sub$DOBD, dat.sub$DOBM, dat.sub$DOBY)
head(dob.text)
[1] "27 9 1977"  "15 12 1973" "28 12 1919" "27 6 1983"  "11 9 1965" 
[6] "28 12 1966"
# create date object by interpretting the day, month, year with dmy()
  # if any failed to parse, it's because a day, month, or year is out of range
  #    such as "99" for missing values or refused to answer
  # in these cases, the date is NA (missing), which is what we want
dob.date <- dmy(dob.text)
head(dob.date)
[1] "1977-09-27" "1973-12-15" "1919-12-28" "1983-06-27" "1965-09-11"
[6] "1966-12-28"

Now that we have the interview date and date of birth as date objects, we calculate the difference to give the person’s age at the time of the interview. Because the difference of dates is in the unit “days”, we convert to years.

# difference in days / 365.25 = difference in years
age.days <- cdate.date - dob.date
head(age.days)
Time differences in days
[1]  8722 10255 29916  6649 13186 12672
age.years <- as.numeric(age.days / 365.25)  # change from "difftime" to "numeric"
head(age.years)
[1] 23.87953 28.07666 81.90554 18.20397 36.10130 34.69405
# difference between the age we calculated and the age in the original dataset
head(age.years - NESARC$AGE)
[1] 0.87953457 0.07665982 0.90554415 0.20396988 0.10130048 0.69404517

Add the new age variable to your data frame, drop the ones you don’t need anymore, and you’re ready to go!

# add the variable
dat.sub$age.years <- age.years
str(dat.sub)
'data.frame':   43093 obs. of  7 variables:
 $ CDAY     : num  14 12 23 9 18 7 1 18 9 16 ...
 $ CMON     : num  8 1 11 9 10 9 9 9 1 8 ...
 $ CYEAR    : num  2001 2002 2001 2001 2001 ...
 $ DOBD     : num  27 15 28 27 11 28 16 10 7 25 ...
 $ DOBM     : num  9 12 12 6 9 12 3 11 4 3 ...
 $ DOBY     : num  1977 1973 1919 1983 1965 ...
 $ age.years: num  23.9 28.1 81.9 18.2 36.1 ...
# drop the original ones you don't need anymore
#   by using subset() and select = -VAR, it excludes the VAR column
dat.sub <- subset(dat.sub, select = c(-CDAY, -CMON, -CYEAR, -DOBD, -DOBM, -DOBY))
str(dat.sub)
'data.frame':   43093 obs. of  1 variable:
 $ age.years: num  23.9 28.1 81.9 18.2 36.1 ...

Now you’ve got an age (or other date object) you can use in your project.

Labeling Categorical variable levels

Look at summary of dataset to see the state of the labelling before we make changes.

summary(nesarc.sub)
     IDNUM      Sex          Age        SmokingStatus TobaccoDependence
 21     :   1   1:856   Min.   :18.00   1:1706        0:810            
 77     :   1   2:850   1st Qu.:20.00                 1:896            
 103    :   1           Median :22.00                                  
 122    :   1           Mean   :21.61                                  
 136    :   1           3rd Qu.:23.00                                  
 150    :   1           Max.   :25.00                                  
 (Other):1700                                                          
 SmokingFreq DailyCigsSmoked Ethnicity Depression   DaysSmoke    
 1   :1320   Min.   : 1.00   1:1060    0:1260     Min.   : 1.00  
 2   :  68   1st Qu.: 5.00   2: 211    1: 446     1st Qu.:30.00  
 3   :  91   Median :10.00   3:  42               Median :30.00  
 4   :  88   Mean   :11.34   4:  58               Mean   :25.07  
 5   :  65   3rd Qu.:20.00   5: 335               3rd Qu.:30.00  
 6   :  71   Max.   :98.00                        Max.   :30.00  
 NA's:   3   NA's   :9                            NA's   :3      
 TotalCigsSmoked         CigsSmokedFac  SmokingFreq3  
 Min.   :   1.0   [1,90]        :429   Min.   :1.000  
 1st Qu.:  90.0   (90,300]      :678   1st Qu.:1.000  
 Median : 300.0   (300,600]     :503   Median :1.000  
 Mean   : 319.4   (600,2.94e+03]: 87   Mean   :1.173  
 3rd Qu.: 600.0   NA's          :  9   3rd Qu.:1.000  
 Max.   :2940.0                        Max.   :3.000  
 NA's   :9                             NA's   :3      

Informative labels are given to the factor levels. The order of the levels is also rearranged for the variables SmokingFreq, TobaccoDependence, and Sex.

nesarc.sub$Depression <- factor(nesarc.sub$Depression
                              , labels = c("No Depression"
                                         , "Yes Depression"
                                ))

# first label the existing levels
nesarc.sub$Sex <- factor(nesarc.sub$Sex
                       , labels = c("Male"
                                  , "Female"
                                   )
                        )
# check ordering with a frequency table
table(nesarc.sub$Sex)

  Male Female 
   856    850 
# then reorder the levels
nesarc.sub$Sex <- factor(nesarc.sub$Sex
                       , levels = c("Female"
                                  , "Male"
                                   )
                        )
# check ordering with a frequency table
table(nesarc.sub$Sex)

Female   Male 
   850    856 
# labels
nesarc.sub$SmokingFreq <- factor(nesarc.sub$SmokingFreq
                               , labels = c("Every Day"
                                          , "5 to 6 Days/week"
                                          , "3 to 4 Days/week"
                                          , "1 to 2 Days/week"
                                          , "2 to 3 Days/month"
                                          , "Once a month or less"
                                          )
                                )
# include levels afterwards to put in order
nesarc.sub$SmokingFreq <- factor(nesarc.sub$SmokingFreq
                               , levels = c("Once a month or less"
                                          , "2 to 3 Days/month"
                                          , "1 to 2 Days/week"
                                          , "3 to 4 Days/week"
                                          , "5 to 6 Days/week"
                                          , "Every Day"
                                          )
                                )


nesarc.sub$SmokingFreq3 <- factor(nesarc.sub$SmokingFreq3
                                , labels = c("daily"
                                           , "weekly"
                                           , "monthly"
                                           )
                                  )

nesarc.sub$TobaccoDependence <- factor(nesarc.sub$TobaccoDependence
                                     , labels = c("No Nicotine Dependence"
                                                , "Nicotine Dependence"
                                                )
                                      )
nesarc.sub$TobaccoDependence <- factor(nesarc.sub$TobaccoDependence
                                     , levels = c("Nicotine Dependence"
                                                , "No Nicotine Dependence"
                                                 )
                                      )

nesarc.sub$Ethnicity <- factor(nesarc.sub$Ethnicity
                             , labels = c("Cauc"
                                        , "AfAm"
                                        , "NaAm"
                                        , "Asia"
                                        , "Hisp"
                                         )
                             # , labels = c("Caucasian"
                             #            , "African American"
                             #            , "Native American"
                             #            , "Asian"
                             #            , "Hispanic"
                             #             )
                              )

summary(nesarc.sub)
     IDNUM          Sex           Age        SmokingStatus
 21     :   1   Female:850   Min.   :18.00   1:1706       
 77     :   1   Male  :856   1st Qu.:20.00                
 103    :   1                Median :22.00                
 122    :   1                Mean   :21.61                
 136    :   1                3rd Qu.:23.00                
 150    :   1                Max.   :25.00                
 (Other):1700                                             
              TobaccoDependence               SmokingFreq  
 Nicotine Dependence   :896     Once a month or less:  71  
 No Nicotine Dependence:810     2 to 3 Days/month   :  65  
                                1 to 2 Days/week    :  88  
                                3 to 4 Days/week    :  91  
                                5 to 6 Days/week    :  68  
                                Every Day           :1320  
                                NA's                :   3  
 DailyCigsSmoked Ethnicity            Depression     DaysSmoke    
 Min.   : 1.00   Cauc:1060   No Depression :1260   Min.   : 1.00  
 1st Qu.: 5.00   AfAm: 211   Yes Depression: 446   1st Qu.:30.00  
 Median :10.00   NaAm:  42                         Median :30.00  
 Mean   :11.34   Asia:  58                         Mean   :25.07  
 3rd Qu.:20.00   Hisp: 335                         3rd Qu.:30.00  
 Max.   :98.00                                     Max.   :30.00  
 NA's   :9                                         NA's   :3      
 TotalCigsSmoked         CigsSmokedFac  SmokingFreq3 
 Min.   :   1.0   [1,90]        :429   daily  :1479  
 1st Qu.:  90.0   (90,300]      :678   weekly : 153  
 Median : 300.0   (300,600]     :503   monthly:  71  
 Mean   : 319.4   (600,2.94e+03]: 87   NA's   :   3  
 3rd Qu.: 600.0   NA's          :  9                 
 Max.   :2940.0                                      
 NA's   :9                                           

Thursday ———

Tables for categorical variables

Basic tables can be created using the functions table or xtabs. Frequency tables are created for the variables TobaccoDependence, CigsSmokedFac, and SmokingFreq. When only looking at one variable, table is the same as xtabs, but xtabs will make looking at two-way, three-way, and higher tables possible.

table(nesarc.sub$TobaccoDependence)

   Nicotine Dependence No Nicotine Dependence 
                   896                    810 
T1 <- xtabs( ~ TobaccoDependence, data = nesarc.sub)
T1
TobaccoDependence
   Nicotine Dependence No Nicotine Dependence 
                   896                    810 
T2 <- xtabs( ~ CigsSmokedFac, data = nesarc.sub)
T2
CigsSmokedFac
        [1,90]       (90,300]      (300,600] (600,2.94e+03] 
           429            678            503             87 
T3 <- xtabs( ~ SmokingFreq, data = nesarc.sub)
T3
SmokingFreq
Once a month or less    2 to 3 Days/month     1 to 2 Days/week 
                  71                   65                   88 
    3 to 4 Days/week     5 to 6 Days/week            Every Day 
                  91                   68                 1320 

(Look at the .Rmd code to see that these numbers in the next paragraph are calculated directly from the numbers summarized in the table!)

In the data frame nesarc.sub, there are 896 nicotine dependent subjects and 810 subjects that are not nicotine dependent. A small number of smokers (87) smoke over 600 cigarettes per month. Most of the subjects in nesarc.sub are daily smokers (1320) with the remainder distributed uniformly across the first five levels of SmokingFreq.

Graphing frequency tables

The barplots are all created with the package ggplot2. The barplots start with the defaults for the geom_bar and add more detail to the plot with each graph.

library(ggplot2)
p1 <- ggplot(data = nesarc.sub, aes(x = TobaccoDependence))
p1 <- p1 + geom_bar()
p1

p2 <- ggplot(data = nesarc.sub, aes(x = CigsSmokedFac))
p2 <- p2 + geom_bar()
p2

p3 <- ggplot(data = nesarc.sub, aes(x = SmokingFreq))
p3 <- p3 + geom_bar()
p3 <- p3 + labs(x = "", y = "Total Number", title = "Smoking Frequency for Young Adults")
p3 <- p3 + theme_bw()
p3 <- p3 + theme(axis.text.x  = element_text(angle = 90, vjust = 0))
p3

Removing NAs from factor axes

When a factor variable has NAs, then NA appears as a level with frequences (see bar plots above). As of writing this, ggplot() does not have a way to remove the NAs before plotting. Therefore, we have to do it manually.

The easiest and most transparent way is to use the subset() function below where we only plot the non-NA values. Compare the two plots below of the same data, but the second plot had the NAs removed before plotting.

library(ggplot2)
# Original plot with NAs
p1 <- ggplot(data = nesarc.sub, aes(x = CigsSmokedFac))
p1 <- p1 + geom_bar()
#p1

# subset excluded the NAs for the variable being plotted
# "subset() specifies the values to keep,
#   and "!is.na()" means "keep the non-NA values"
p2 <- ggplot(data = subset(nesarc.sub, !is.na(CigsSmokedFac)), aes(x = CigsSmokedFac))
p2 <- p2 + geom_bar()
#p2

# grid.arrange() is a way to arrange several ggplot objects
library(gridExtra)
grid.arrange(grobs = list(p1, p2), nrow=1, top = "Excluding NAs from factor variable in ggplot")

Graphing numeric variables

One popular way to graph a continuous variable is to use a histogram. R has the base function hist which can be used for this purpose. We will also use the package ggplot2 to create histograms. We start with a basic histogram of the variable TotalCigsSmoked.

hist(nesarc.sub$TotalCigsSmoked)

Experiment with the code in the previous code chunk to change the color, the title, and if needed the labels on the \(x\)- and \(y\)-axes.

p <- ggplot(data = nesarc.sub, aes(x = TotalCigsSmoked))
p <- p + geom_histogram(binwidth = 200)
p <- p + geom_rug()
p <- p + labs(x = "Estimated Cigarettes Smoked per Month")
p <- p + theme_bw()
p

Creating Density Plots

p <- ggplot(data = nesarc.sub, aes(x = TotalCigsSmoked))
p <- p + geom_histogram(aes(y=..density..), binwidth = 200)
p <- p + geom_rug()
p <- p + geom_density(alpha=0.1, fill="white", adjust = 1.5)
p <- p + labs(x = "Estimated Cigarettes Smoked per Month")
p <- p + theme_bw()
p

Shape, center, and spread

summary(nesarc.sub$TotalCigsSmoked)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
    1.0    90.0   300.0   319.4   600.0  2940.0       9 
fivenum(nesarc.sub$TotalCigsSmoked)
[1]    1   90  300  600 2940
median(nesarc.sub$TotalCigsSmoked, na.rm = TRUE)
[1] 300
IQR(nesarc.sub$TotalCigsSmoked, na.rm = TRUE)
[1] 510

The shape of the distribution for the estimated cigarettes smoked per month is skewed to the right. The center (median) of the distribution is 300 and the spread (interquartile range) for the distribution is 510.


Week04: Bivariate summaries and data cleaning

Tuesday ———

Bivariate graphs

Scatter plot (for regression): x = numerical, y = numerical

Research question: With my questions, I don’t have a meaningful question to ask that will relate two numeric variable.

An exploratory question whether there is a relationship between a person’s age (Age) and total number of cigaretts smoked (TotalCigsSmoked). Because age is an integer varible, I jitter the points, and I reduce the opacity (increase the transparancy) with alpha = 1/4 to see the density of points better (1/4 means that it takes 4 overlayed points to be fully opaque).

Interpretation: I added a linear regression smoother to see whether there was any linear trend with age. The horizontal line suggests that there isn’t a (linear) relationship here.

The first plot has all the points in their original locations, but they end up stacking on top of each other so you can’t tell how many points are there.

library(ggplot2)
p <- ggplot(nesarc.sub, aes(x = Age, y = TotalCigsSmoked))
p <- p + geom_point(alpha = 1/4)
p <- p + stat_smooth(method = lm)
p <- p + labs(title = "Total Cigarettes Smoked vs Age")
print(p)

This is the same data as above, but I have added some horizontal jitter (displacement) to each point which doesn’t affect their value on the y-axis but allows us to see how many points are at each discrete age year.

library(ggplot2)
p <- ggplot(nesarc.sub, aes(x = Age, y = TotalCigsSmoked))
p <- p + geom_jitter(position = position_jitter(width = 0.2), alpha = 1/4)
p <- p + stat_smooth(method = lm)
p <- p + labs(title = "Total Cigarettes Smoked vs Age")
print(p)

Box plots (for ANOVA): x = categorical, y = numerical

Research question: 1. Is there a relationship between depression (Depression) and total number of cigarettes smoked (TotalCigsSmoked)?

The \(x\) variable should be a factor variable, and \(y\) should be numeric.

Interpretation: The boxplots indicate that the five number summary is about the same for both depression groups. Also, the means are about the same, though slightly higher for the people with depression.

library(ggplot2)
p <- ggplot(nesarc.sub, aes(x = Depression, y = TotalCigsSmoked))
p <- p + geom_boxplot(width = 0.5, alpha = 0.5)
p <- p + geom_jitter(position = position_jitter(width = 0.1), alpha = 1/4)
# diamond at mean for each group
p <- p + stat_summary(fun.y = mean, geom = "point", shape = 18, size = 6,
                      colour = "red", alpha = 0.8)
p <- p + labs(title = "Depression does not predict Total Cigarettes Smoked")
print(p)