Please don’t let the initial size and detail of this document intimidate you. You got this!
This document is organized by Week and Class number. The worksheet assignments are indicated by the Class numbers.
Consider your readers (graders):
organize the document clearly (use this document as an example)
label minor sections under each day (use this document as an example)
For each thing you do, always have these three parts:
Say what you’re going to do and why.
Do it with code, and document your code.
Interpret the results.
1.1 Document
1.1.1 Naming
Note
Each class save this file with a new name, updating the last two digits to the class number. Then, you’ll have a record of your progress, as well as which files you turned in for grading.
ADA1_ALL_05.qmd
ADA1_ALL_06.qmd
ADA1_ALL_07.qmd …
A version that I prefer is to use a date using Year-Month-Day, YYYYMMDD:
ADA1_ALL_20220903.qmd
ADA1_ALL_20220905.qmd
ADA1_ALL_20220910.qmd …
1.1.2 Structure
We will include all of our assignments together in this document to retain the relevant information needed for subsequent assignments since our analysis is cumulative. You will also have an opportunity to revisit previous parts to make changes or improvements, such as updating your codebook, recoding variables, and improving tables and plots. I’ve provided an initial predicted organization of our sections and subsections using the # and ## symbols. A table of contents is automatically generated using the “toc: true” in the yaml and can headings in the table of contents are clickable to jump down to each (sub)section.
1.1.3 Classes not appearing in this document
Some assignments are in a separate worksheet and are indicated with “(separate worksheet)”. For these, I’ll provide a dataset I want you to analyze. Typically, you’ll then return to this document and repeat the same type of analysis with your dataset.
2 Research Questions
2.1 Class 02, Personal Codebook
Rubric
(1 p) Is there a topic of interest?
(2 p) Are the variables relevant to a set of research questions?
(4 p) Are there at least 2 categorical and 2 numerical variables (at least 4 “data” variables)?
1 categorical variable with only 2 levels
1 categorical variable with at least 3 levels
2 numerical variables with many possible unique values
More variables are welcome and you’re likely to add to this later in the semester
(3 p) For each variable, is there a variable description, a data type, and coded value descriptions?
Compile this qmd file to an html, print/save to pdf, and upload to UNM Canvas.
2.1.1 Topic and research questions
Topic:
Is there an association between nicotine dependence and the frequency and quantity of smoking in adults? The association may differ by ethnicity, age, gender, and other factors (though we won’t be able to test these additional associations until next semester in ADA2).
Research questions:
Is nicotine dependence [S3AQ10D] associated with smoking frequency [S3AQ3B1] and quantity [S3AQ3C1]?
Is nicotine dependence [S3AQ10D] associated with depression [S4AQ1]?
Is the associated between nicotine dependence [S3AQ10D] and depression [S4AQ1] different by demographics, such as Education or Sex?
3 Codebook
National Epidemiologic Survey on Alcohol and Related Conditions-III (NESARC-III)
Dataset: NESARC
Primary association: nicotine dependence vs frequency and quantity of smoking
Key:
RenamedVarName
VarName original in dataset
Variable description
Data type (Continuous, Discrete, Nominal, Ordinal)
Frequency ItemValue Description
ID
IDNUM
UNIQUE ID NUMBER WITH NO ALPHABETICS
Categorical, Nominal
43093 1-43093. Unique Identification number
Sex
SEX
SEX
Categorical, Nominal
18518 1. Male
24575 2. Female
Age
AGE
AGE
Numerical, Continuous
43079 18-97. Age in years
14 98. 98 years or older
SmokingStatus
CHECK321
CIGARETTE SMOKING STATUS
Categorical, Nominal
9913 1. Smoked cigarettes in the past 12 months
8078 2. Smoked cigarettes prior to the last 12 months
22 9. Unknown
25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
NicotineDependence
TAB12MDX
NICOTINE DEPENDENCE IN THE LAST 12 MONTHS
Categorical, Nominal
38131 0. No nicotine dependence
4962 1. Nicotine dependence
SmokingFreq
S3AQ3B1
USUAL FREQUENCY WHEN SMOKED CIGARETTES
Categorical, Ordinal
14836 1. Every day
460 2. 5 to 6 Day(s) a week
687 3. 3 to 4 Day(s) a week
747 4. 1 to 2 Day(s) a week
409 5. 2 to 3 Day(s) a month
772 6. Once a month or less
102 9. Unknown
25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
DailyCigsSmoked
S3AQ3C1
USUAL QUANTITY WHEN SMOKED CIGARETTES
Numerical, Discrete
17751 1-98. Cigarette(s)
262 99. Unknown
25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
Ethnicity
ETHRACE2A
IMPUTED RACE/ETHNICITY
Categorical, Nominal
24507 1. White, Not Hispanic or Latino
8245 2. Black, Not Hispanic or Latino
701 3. American Indian/Alaska Native, Not Hispanic or Latino
1332 4. Asian/Native Hawaiian/Pacific Islander, Not Hispanic or Latino
8308 5. Hispanic or Latino
Depression
MAJORDEPLIFE
MAJOR DEPRESSION - LIFETIME (NON-HIERARCHICAL)
Categorical, Nominal
35254 0. No
7839 1. Yes
IncomePersonal
S1Q10A
TOTAL PERSONAL INCOME IN LAST 12 MONTHS
Numerical, Continuous
43093 0-3000000. Income in dollars
IncomePersonalCat
S1Q10B
TOTAL PERSONAL INCOME IN LAST 12 MONTHS: CATEGORY
Categorical, Ordinal
2462 0. $ 0 (No personal income)
3571 1. $ 1 to $ 4,999
3823 2. $ 5,000 to $ 7,999
2002 3. $ 8,000 to $ 9,999
3669 4. $ 10,000 to $12,999
1634 5. $ 13,000 to $14,999
3940 6. $ 15,000 to $19,999
3887 7. $ 20,000 to $24,999
3085 8. $ 25,000 to $29,999
3003 9. $ 30,000 to $34,999
2351 10. $ 35,000 to $39,999
3291 11. $ 40,000 to $49,999
2059 12. $ 50,000 to $59,999
1328 13. $ 60,000 to $69,999
857 14. $ 70,000 to $79,999
521 15. $ 80,000 to $89,999
290 16. $ 90,000 to $99,999
1320 17. $100,000 or more
Height_ft
S1Q24FT
HEIGHT: FEET
Numerical, Continuous
42363 4-7. Feet
730 99. Unknown
* change 99 to NA
Height_in
S1Q24IN
HEIGHT: INCHES
Numerical, Continuous
3572 0. None
38760 1-11. Inches
761 99. Unknown
* change 99 to NA
Weight_lb
S1Q24LB
WEIGHT: POUNDS
Numerical, Continuous
41717 62-500. Pounds
1376 999. Unknown
* change 999 to NA
Additional variables were created from the original variables:
CREATED VARIABLES
DaysSmoke
estimated number of days per month a subject smokes
Numerical, Continuous (due to way estimated), assumes 30 days per month using SmokingFreq
1-30.
TotalCigsSmoked
estimated number of cigarettes per month a subject smokes (DaysSmoke * DailyCigsSmoked)
Numerical, Continuous
0-large.
TotalCigsSmoked_smoked
estimated number of cigarettes per month a subject smokes (DaysSmoke * DailyCigsSmoked)
Smokers, only! (replaced 0s with NAs)
Continuous
1-large.
CigsSmokedFac
quartiles of TotalCigsSmoked
Categorical, Ordinal
ranges for each of the four quarters
SmokingFreq3
three levels of smoking frequency
Categorical, Ordinal
from SmokingFreq
1. daily = 1. Every day, 2. 5 to 6 Day(s) a week, 3. 3 to 4 Day(s) a week
2. weekly = 4. 1 to 2 Day(s) a week, 5. 2 to 3 Day(s) a month
3. monthly = 6. Once a month or less
4. Never = 7. Never
Height_inches
Total height in inches
Numerical, Continuous
Height_ft * 12 + Height_in
4 Data Management
4.1 Class 03, Data subset and numerical summaries
Rubric
(4 p) The data are loaded and a data.frame subset is created by selecting only the variables in the personal codebook.
Scroll down to sections labeled “(Class 03)”.
(1 p) Output confirms the subset is correct (e.g., using dim() and str()).
(3 p) Rename your variables to descriptive names (e.g., from “S3AQ3B1” to “SmokingFreq”).
Scroll down to sections labeled “(Class 03)”.
(2 p) Provide numerical summaries for all variables (e.g., using summary()).
Scroll down to sections labeled “(Class 03)”.
4.1.1 Data subset (Class 03)
First, read the data and see the size (dim()).
# data analysis packageslibrary(erikmisc) # Helpful functions
Attaching package: 'lubridate'
The following objects are masked from 'package:base':
date, intersect, setdiff, union
ggplot2::theme_set(ggplot2::theme_bw()) # set theme_bw for all plots## 1. Download the ".RData" file for your dataset into your ADA Folder.## 2. Use the load() statement for the dataset you want to use.# read data exampleload("NESARC.RData")dim(NESARC)
[1] 43093 3008
Select the variables to include in our subset.
# variables to include in our data subsetnesarc_sub <- NESARC %>%select( IDNUM , SEX , AGE , CHECK321 , TAB12MDX , S3AQ3B1 , S3AQ3C1 , ETHRACE2A , MAJORDEPLIFE , S1Q10A , S1Q10B , S1Q24FT , S1Q24IN , S1Q24LB )# Check size and structure of data subset.# Note that categorical variables are already `Factor` variables, but the levels do not have meaningful labels, yet.# size of subsetdim(nesarc_sub)
[1] 43093 14
# structure of subsetstr(nesarc_sub)
'data.frame': 43093 obs. of 14 variables:
$ IDNUM : Factor w/ 43093 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
$ SEX : Factor w/ 2 levels "1","2": 1 2 2 1 1 2 1 2 2 2 ...
$ AGE : num 23 28 81 18 36 34 19 84 29 18 ...
$ CHECK321 : Factor w/ 3 levels "1","2","9": NA NA NA NA NA NA NA NA NA NA ...
$ TAB12MDX : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ S3AQ3B1 : Factor w/ 7 levels "1","2","3","4",..: NA NA NA NA NA NA NA NA NA NA ...
$ S3AQ3C1 : num NA NA NA NA NA NA NA NA NA NA ...
$ ETHRACE2A : Factor w/ 5 levels "1","2","3","4",..: 5 5 5 5 2 2 2 1 1 5 ...
$ MAJORDEPLIFE: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 2 1 ...
$ S1Q10A : num 17500 11000 6000 27000 42000 500 2500 24000 65000 4000 ...
$ S1Q10B : Factor w/ 18 levels "0","1","2","3",..: 7 5 3 9 12 2 2 8 14 2 ...
$ S1Q24FT : num 5 5 5 5 6 5 5 5 5 5 ...
$ S1Q24IN : num 7 1 4 7 1 5 9 4 4 10 ...
$ S1Q24LB : num 195 127 185 180 220 140 160 120 140 150 ...
'data.frame': 43093 obs. of 14 variables:
$ ID : Factor w/ 43093 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Sex : Factor w/ 2 levels "1","2": 1 2 2 1 1 2 1 2 2 2 ...
$ Age : num 23 28 81 18 36 34 19 84 29 18 ...
$ SmokingStatus : Factor w/ 3 levels "1","2","9": NA NA NA NA NA NA NA NA NA NA ...
$ NicotineDependence: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ SmokingFreq : Factor w/ 7 levels "1","2","3","4",..: NA NA NA NA NA NA NA NA NA NA ...
$ DailyCigsSmoked : num NA NA NA NA NA NA NA NA NA NA ...
$ Ethnicity : Factor w/ 5 levels "1","2","3","4",..: 5 5 5 5 2 2 2 1 1 5 ...
$ Depression : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 2 1 ...
$ IncomePersonal : num 17500 11000 6000 27000 42000 500 2500 24000 65000 4000 ...
$ IncomePersonalCat : Factor w/ 18 levels "0","1","2","3",..: 7 5 3 9 12 2 2 8 14 2 ...
$ Height_ft : num 5 5 5 5 6 5 5 5 5 5 ...
$ Height_in : num 7 1 4 7 1 5 9 4 4 10 ...
$ Weight_lb : num 195 127 185 180 220 140 160 120 140 150 ...
4.1.3 Coding missing values (Class 04)
There are two steps. The first step is to recode any existing NAs to actual values, if necessary. The method for doing this differs for numeric and categorical variables. The second step is to recode any coded missing values, such as 9s or 99s, as actual NA.
4.1.3.1 Coding NAs as meaningful “missing”
Note
Most of you will not need to code this type of missing.
First step: the existing blank values with NA mean “never”, and “never” has a meaning different from “missing”. For each variable we need to decide what “never” means and code it appropriately.
4.1.3.1.1NAs recoded as numeric
The easiest is DailyCigsSmoked where NA means 0 cigarettes. We replace NA with 0. The function is.na() identifies each observation that “is an NA”. I’ve updated the codebook to indicate that the original NA values were changed.
S3AQ3C1 (DailyCigsSmoked)
USUAL QUANTITY WHEN SMOKED CIGARETTES
Discrete
17751 1-98. Cigarette(s)
262 99. Unknown
25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
* change to 0 Cigarett(s)
The next two variables require defining a new category level. Both SmokingStatus and SmokingFreq with a value of NA means “never”, so we need to create a new “Never” category and put in order with the rest of the levels. I’ve updated the codebook to indicate that the original NA values were changed.
CHECK321 (SmokingStatus)
CIGARETTE SMOKING STATUS
Nominal
9913 1. Smoked cigarettes in the past 12 months
8078 2. Smoked cigarettes prior to the last 12 months
22 9. Unknown
25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
* change to 3. Never
S3AQ3B1 (SmokingFreq)
USUAL FREQUENCY WHEN SMOKED CIGARETTES
Ordinal
14836 1. Every day
460 2. 5 to 6 Day(s) a week
687 3. 3 to 4 Day(s) a week
747 4. 1 to 2 Day(s) a week
409 5. 2 to 3 Day(s) a month
772 6. Once a month or less
102 9. Unknown
25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
* change to 7. Never
table(nesarc_sub$SmokingStatus )
1 2 9
9913 8078 22
table(nesarc_sub$SmokingFreq )
1 2 3 4 5 6 9
14836 460 687 747 409 772 102
nesarc_sub <- nesarc_sub %>%mutate(# first change the "factor" variables to "character" to add a new levelSmokingStatus =as.character(SmokingStatus) , SmokingFreq =as.character(SmokingFreq ) ) %>%replace_na(# replace NA with new category valueslist(SmokingStatus ="3" , SmokingFreq ="7" ) )table(nesarc_sub$SmokingStatus )
Second step: replace the coded missing values with NA, then remove the remaining records with missing values. Not all of these variables need this operation, since the previous subset commands removed missing values by chance. However, I’m going to run each as an extra precaution. Afterall, later, if we were to make different previous data decisions, it’s possible we could end up with missing values at this point that are incorrectly coded without this step!
ID Sex Age SmokingStatus NicotineDependence
1 : 1 1:18518 Min. :18.0 1 : 9913 0:38131
2 : 1 2:24575 1st Qu.:32.0 2 : 8078 1: 4962
3 : 1 Median :44.0 3 :25080
4 : 1 Mean :46.4 NA's: 22
5 : 1 3rd Qu.:59.0
6 : 1 Max. :98.0
(Other):43087
SmokingFreq DailyCigsSmoked Ethnicity Depression IncomePersonal
7 :25080 Min. : 0.000 1:24507 0:35254 Min. : 0
1 :14836 1st Qu.: 0.000 2: 8245 1: 7839 1st Qu.: 8800
6 : 772 Median : 0.000 3: 701 Median : 20000
4 : 747 Mean : 6.462 4: 1332 Mean : 28185
3 : 687 3rd Qu.:10.000 5: 8308 3rd Qu.: 36000
(Other): 869 Max. :98.000 Max. :3000000
NA's : 102 NA's :262
IncomePersonalCat Height_ft Height_in Weight_lb
6 : 3940 Min. :4.00 Min. : 0.000 Min. : 62.0
7 : 3887 1st Qu.:5.00 1st Qu.: 3.000 1st Qu.:140.0
2 : 3823 Median :5.00 Median : 5.000 Median :165.0
4 : 3669 Mean :5.11 Mean : 5.279 Mean :170.6
1 : 3571 3rd Qu.:5.00 3rd Qu.: 8.000 3rd Qu.:192.0
11 : 3291 Max. :7.00 Max. :11.000 Max. :500.0
(Other):20912 NA's :730 NA's :761 NA's :1376
4.1.3.3 Removing rows with missing values (don’t do this)
Note
Do not do this.
Finally, consider using na.omit() to remove any records with missing values if it won’t cause issues with analysis. The command na.omit() removes a row if any columns have an NA in that row. This can have drastic consequences. For example, imagine that you have 7 variables and 6 variables have complete data, but the 7th variable is about three-quarters NAs! Then na.omit() will drop three-quarters of your data, even though most of your variables are complete!
There’s only one advantage I can think of for running na.omit(), and that’s if you have a large dataset and only a few observations have a few NAs. You can’t analyze data with NAs, since any calculation involving NA results in an NA. Therefore imagine two scenarios: (1) An analysis with Variables 1 and 2, each with a few NA distributed throughout; first the observations with NA for either Variable 1 or 2 will be removed, then the analysis will proceed. (2) A similar analysis, but now with Variables 1 and 3, but Variable 3 has NA for different observations than Variable 2 — therefore, different observations will be removed before analysis. In summary, these two analyses will include different observations from the dataset. Therefore, by running na.omit() you will start with a dataset with no NAs, thus every analysis will be of exactly the same set of data (at the cost of not using all the possible data for each individual analysis).
Below, I do not run na.omit() since I’m concerned about having lots of NAs for some variables.
dim(nesarc_sub)
[1] 43093 14
# remove rows with missing values#nesarc_sub <- na.omit(nesarc_sub)dim(nesarc_sub)
[1] 43093 14
summary(nesarc_sub)
ID Sex Age SmokingStatus NicotineDependence
1 : 1 1:18518 Min. :18.0 1 : 9913 0:38131
2 : 1 2:24575 1st Qu.:32.0 2 : 8078 1: 4962
3 : 1 Median :44.0 3 :25080
4 : 1 Mean :46.4 NA's: 22
5 : 1 3rd Qu.:59.0
6 : 1 Max. :98.0
(Other):43087
SmokingFreq DailyCigsSmoked Ethnicity Depression IncomePersonal
7 :25080 Min. : 0.000 1:24507 0:35254 Min. : 0
1 :14836 1st Qu.: 0.000 2: 8245 1: 7839 1st Qu.: 8800
6 : 772 Median : 0.000 3: 701 Median : 20000
4 : 747 Mean : 6.462 4: 1332 Mean : 28185
3 : 687 3rd Qu.:10.000 5: 8308 3rd Qu.: 36000
(Other): 869 Max. :98.000 Max. :3000000
NA's : 102 NA's :262
IncomePersonalCat Height_ft Height_in Weight_lb
6 : 3940 Min. :4.00 Min. : 0.000 Min. : 62.0
7 : 3887 1st Qu.:5.00 1st Qu.: 3.000 1st Qu.:140.0
2 : 3823 Median :5.00 Median : 5.000 Median :165.0
4 : 3669 Mean :5.11 Mean : 5.279 Mean :170.6
1 : 3571 3rd Qu.:5.00 3rd Qu.: 8.000 3rd Qu.:192.0
11 : 3291 Max. :7.00 Max. :11.000 Max. :500.0
(Other):20912 NA's :730 NA's :761 NA's :1376
Now we (would) have a dataset with no missing values.
4.1.4 Creating new variables (Class 04+)
4.1.4.1 From categories to numeric
Note
Most of you will not need to do this.
DaysSmoke estimates the days per month a subject smokes by converting SmokingFreq (a factor with 6 levels) to a numeric variable using as.numeric() and multiplying by the midpoint of the range of SmokingFreq.
nesarc_sub <- nesarc_sub %>%mutate(DaysSmoke =case_when( SmokingFreq ==1~30# 1. Every day , SmokingFreq ==2~4*5.5# 2. 5 to 6 Day(s) a week , SmokingFreq ==3~4*3.5# 3. 3 to 4 Day(s) a week , SmokingFreq ==4~4*1.5# 4. 1 to 2 Day(s) a week , SmokingFreq ==5~2.5# 5. 2 to 3 Day(s) a month , SmokingFreq ==6~1# 6. Once a month or less , SmokingFreq ==7~0# 7. Never ) )summary(nesarc_sub$DaysSmoke)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.00 0.00 0.00 10.96 30.00 30.00 102
This shows an example with many levels. Note, that there is a numeric variable in this dataset, but this categorical sitation occurs frequent enough to show you an example.
Height_inches is the height of a person in inches. Since the data was collected in two variables (Height_ft and Height_in) we need to add them together for our new variable.
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
48.0 64.0 66.0 66.6 70.0 84.0 761
Note
Most of you will need to transform a variable to address extreme right skewness.
The variable TotalCigsSmoked estimates the monthly number of cigarettes a subject smokes per month by multiplying DaysSmoke times DailyCigsSmoked (the usual quantity smoked per day).
Log scale. I like log2 since its interpretation is that every unit is a doubling. Notice that I added 1 to the TotalCigsSmoked before computing the log. This is because log(0) is undefined, thus I add 1 so that log(0 + 1) = 0 and all values are defined.
Square root scale. Square root is another transformation that spreads out values between 0 and 1 and compresses values from 1 to infinity. It can be preferred to a log transformation in some cases.
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.0 0.0 0.0 187.5 300.0 2940.0 297
summary(nesarc_sub$CigsSmokedFac)
No smoking (0) Lowest smoking (1-6) Low smoking (7-30)
25080 1045 1129
Medium smoking (31-300) High smoking (301+) NA's
6456 9086 297
table(nesarc_sub$CigsSmokedFac)
No smoking (0) Lowest smoking (1-6) Low smoking (7-30)
25080 1045 1129
Medium smoking (31-300) High smoking (301+)
6456 9086
Finer points
Cutting a numeric variable into equal-sized categories.
The numeric variable TotalCigsSmoked is converted into a factor with four roughly equivalent numbers (quartiles) stored in each level of the factor CigsSmokedFac using the cut function. The label "Q1" indicates that the original value was in the first quarter (25%) of the ordered data, that is, they are among the smallest 25% observations; while "Q4" indicates the original value was in the last quarter (25%) of the ordered data, that is, they are among the largest 25% observations.
In my data, more than half of the values are 0s which is the minimum value, thus they all get grouped in “Q1”; this leaves no values for “Q2” – oh well.
summary(nesarc_sub$CigsSmokedFac2)
Q1 Q2 Q3 Q4 NA's
0 0 0 0 43093
4.1.4.4 From many categories to a few
The variable SmokingFreq3 collapses the SmokingFreq from 6 down to 3 categories. This is an example to show you how to do this with your own variables. Some of you will have categorical variables that have a dozen or so categories; having so many categories makes interpretation very difficulty — fewer is easier.
## SmokingFreq3 variable# notice that I'm doing this before I've assigned the labels# if you do this after assigning labels (with the factor() command), then you'll want to use "as.numeric()" as below# inspect original variabletable(nesarc_sub$SmokingFreq)
1 2 3 4 5 6 7
14836 460 687 747 409 772 25080
# Note: it is sometimes easier to use the numeric "levels" of a factor variable, than the character "labels"# initialize new variable with NAsnesarc_sub <- nesarc_sub %>%mutate(SmokingFreq3 =fct_collapse( SmokingFreq , "1"=c("1", "2", "3") # 1. daily , "2"=c("4", "5") # 2. weekly , "3"="6"# 3. monthly , "4"="7"# 7. never ) )# new variabletable(nesarc_sub$SmokingFreq3)
1 2 3 4
15983 1156 772 25080
# we will label the new factor variable below, with the others
4.1.4.5 Working with Dates
Note
Most of you will not need to work with dates.
How to calculate the duration between two dates.
(You probably won’t need to use this for your project.)
Dates can be tricky in R. The lubridate package makes them much easier.
Below I’m going to create a variable for the person’s age in years based on the difference between the person’s date of birth and the interview date. There is an AGE variable which I’m already using, but if there wasn’t, or I wanted to be more precise about their age (such as, 24.34 years old), then this is a way to do it.
I’m going to create a separate data frame just for this example to show how to do it; if you needed these variables, they should be part of your original subset above. Below, the “C” variables (CDAY) are the interview day, month, and year, and the “DOB” variables (DOBD) are the date of birth day, month, and year.
dat_sub <- NESARC %>%select( AGE , CDAY , CMON , CYEAR , DOBD , DOBM , DOBY )str(dat_sub)
'data.frame': 43093 obs. of 7 variables:
$ AGE : num 23 28 81 18 36 34 19 84 29 18 ...
$ CDAY : num 14 12 23 9 18 7 1 18 9 16 ...
$ CMON : num 8 1 11 9 10 9 9 9 1 8 ...
$ CYEAR: num 2001 2002 2001 2001 2001 ...
$ DOBD : num 27 15 28 27 11 28 16 10 7 25 ...
$ DOBM : num 9 12 12 6 9 12 3 11 4 3 ...
$ DOBY : num 1977 1973 1919 1983 1965 ...
Combine the date fields together, then convert them into a date object. I’ll do this for both the interview date and the date of birth. I print the head (first 6 observations) after all the operations are complete.
dat_sub <- dat_sub %>%mutate(## interview date# combine day, month, and year into one text stringcdate_text =paste(dat_sub$CDAY, dat_sub$CMON, dat_sub$CYEAR)# create date object by interpretting the day, month, year with dmy() , cdate_date =dmy(cdate_text)## date of birth# combine day, month, and year into one text string , dob_text =paste(dat_sub$DOBD, dat_sub$DOBM, dat_sub$DOBY)# create date object by interpretting the day, month, year with dmy()# if any failed to parse, it's because a day, month, or year is out of range# such as "99" for missing values or refused to answer# in these cases, the date is NA (missing), which is what we want , dob_date =dmy(dob_text) )
Now that we have the interview date and date of birth as date objects, we calculate the difference to give the person’s age at the time of the interview. Because the difference of dates is in the unit “days”, we convert to years.
dat_sub <- dat_sub %>%mutate(age_days = cdate_date - dob_date , age_years =as.numeric(age_days /365.25) # change from "difftime" to "numeric"# difference between the age we calculated and the age in the original dataset , age_diff = age_years - AGE )head(dat_sub)
Finally, keep new age variable in your data frame, drop (unselect) the variables you don’t need anymore, and you’re ready to go!
# drop the original ones you don't need anymore# by select(-VAR), it excludes the VAR columndat_sub <- dat_sub %>%select(-c(CDAY, CMON, CYEAR, DOBD, DOBM, DOBY)# -c(cdate_text, cdate_date, dob_text, dob_date) , -c(age_days, age_diff) )str(dat_sub)
Look at summary of dataset to see the state of the labeling before we make changes.
summary(nesarc_sub)
ID Sex Age SmokingStatus NicotineDependence
1 : 1 1:18518 Min. :18.0 1 : 9913 0:38131
2 : 1 2:24575 1st Qu.:32.0 2 : 8078 1: 4962
3 : 1 Median :44.0 3 :25080
4 : 1 Mean :46.4 NA's: 22
5 : 1 3rd Qu.:59.0
6 : 1 Max. :98.0
(Other):43087
SmokingFreq DailyCigsSmoked Ethnicity Depression IncomePersonal
7 :25080 Min. : 0.000 1:24507 0:35254 Min. : 0
1 :14836 1st Qu.: 0.000 2: 8245 1: 7839 1st Qu.: 8800
6 : 772 Median : 0.000 3: 701 Median : 20000
4 : 747 Mean : 6.462 4: 1332 Mean : 28185
3 : 687 3rd Qu.:10.000 5: 8308 3rd Qu.: 36000
(Other): 869 Max. :98.000 Max. :3000000
NA's : 102 NA's :262
IncomePersonalCat Height_ft Height_in Weight_lb
6 : 3940 Min. :4.00 Min. : 0.000 Min. : 62.0
7 : 3887 1st Qu.:5.00 1st Qu.: 3.000 1st Qu.:140.0
2 : 3823 Median :5.00 Median : 5.000 Median :165.0
4 : 3669 Mean :5.11 Mean : 5.279 Mean :170.6
1 : 3571 3rd Qu.:5.00 3rd Qu.: 8.000 3rd Qu.:192.0
11 : 3291 Max. :7.00 Max. :11.000 Max. :500.0
(Other):20912 NA's :730 NA's :761 NA's :1376
DaysSmoke IncomePersonalCat_num Height_inches DaysSmoke_log2
Min. : 0.00 Min. : 0 Min. :48.0 Min. : -Inf
1st Qu.: 0.00 1st Qu.: 9000 1st Qu.:64.0 1st Qu.: -Inf
Median : 0.00 Median : 22500 Median :66.0 Median : -Inf
Mean :10.96 Mean : 27523 Mean :66.6 Mean : -Inf
3rd Qu.:30.00 3rd Qu.: 37500 3rd Qu.:70.0 3rd Qu.:4.907
Max. :30.00 Max. :120000 Max. :84.0 Max. :4.907
NA's :102 NA's :761 NA's :102
TotalCigsSmoked TotalCigsSmoked_log2 TotalCigsSmoked_sqrt
Min. : 0.0 Min. : 0.000 Min. : 0.000
1st Qu.: 0.0 1st Qu.: 7.229 1st Qu.: 0.000
Median : 0.0 Median : 8.492 Median : 0.000
Mean : 187.5 Mean : 7.888 Mean : 7.887
3rd Qu.: 300.0 3rd Qu.: 9.229 3rd Qu.:17.321
Max. :2940.0 Max. :11.522 Max. :54.222
NA's :297 NA's :25377 NA's :297
TotalCigsSmoked_smokers TotalCigsSmoked_smokers_log2
Min. : 1.0 Min. : 0.000
1st Qu.: 150.0 1st Qu.: 7.229
Median : 360.0 Median : 8.492
Mean : 453.1 Mean : 7.888
3rd Qu.: 600.0 3rd Qu.: 9.229
Max. :2940.0 Max. :11.522
NA's :25377 NA's :25377
CigsSmokedFac CigsSmokedFac2 SmokingFreq3
No smoking (0) :25080 Q1 : 0 1 :15983
Lowest smoking (1-6) : 1045 Q2 : 0 2 : 1156
Low smoking (7-30) : 1129 Q3 : 0 3 : 772
Medium smoking (31-300): 6456 Q4 : 0 4 :25080
High smoking (301+) : 9086 NA's:43093 NA's: 102
NA's : 297
Informative labels are given to the factor levels. The order of the levels is also rearranged for the variables SmokingFreq, NicotineDependence, and Sex.
# Label factors with meaningful names in the same order as the codebooknesarc_sub$Depression <-factor( nesarc_sub$Depression , labels =c("No Depression" , "Yes Depression" ) )nesarc_sub$SmokingStatus <-factor( nesarc_sub$SmokingStatus , labels =c("Smoked past 12 months" , "Smoked prior to 12 months" , "Never smoked" ) )# Specify the order of the levels to be different from codebook ordernesarc_sub$Sex <-factor(nesarc_sub$Sex , levels =c( 2 , 1 ) , labels =c( "Female" , "Male" ) )# check ordering with a frequency tabletable(nesarc_sub$Sex)
Female Male
24575 18518
nesarc_sub$NicotineDependence <-factor(nesarc_sub$NicotineDependence , levels =c( 1 , 0 ) , labels =c( "Nicotine Dependence" , "No Nicotine Dependence" ) )# Remember to include the new category level from the original `NA` values.nesarc_sub$SmokingFreq <-factor(nesarc_sub$SmokingFreq , levels =c( 7 , 6 , 5 , 4 , 3 , 2 , 1 ) , labels =c( "Never" , "Once a month or less" , "2 to 3 Days/month" , "1 to 2 Days/week" , "3 to 4 Days/week" , "5 to 6 Days/week" , "Every Day" ) )nesarc_sub$Ethnicity <-factor(nesarc_sub$Ethnicity# shorter labels are helpful for plots , labels =c( "Cauc" , "AfAm" , "NaAm" , "Asia" , "Hisp" )# , labels = c("Caucasian"# , "African American"# , "Native American"# , "Asian"# , "Hispanic"# ))nesarc_sub$SmokingFreq3 <-factor(nesarc_sub$SmokingFreq3 , levels =c( 4 , 3 , 2 , 1 ) , labels =c( "Never" , "Monthly" , "Weekly" , "Daily" ) )summary(nesarc_sub)
ID Sex Age
1 : 1 Female:24575 Min. :18.0
2 : 1 Male :18518 1st Qu.:32.0
3 : 1 Median :44.0
4 : 1 Mean :46.4
5 : 1 3rd Qu.:59.0
6 : 1 Max. :98.0
(Other):43087
SmokingStatus NicotineDependence
Smoked past 12 months : 9913 Nicotine Dependence : 4962
Smoked prior to 12 months: 8078 No Nicotine Dependence:38131
Never smoked :25080
NA's : 22
SmokingFreq DailyCigsSmoked Ethnicity
Never :25080 Min. : 0.000 Cauc:24507
Every Day :14836 1st Qu.: 0.000 AfAm: 8245
Once a month or less: 772 Median : 0.000 NaAm: 701
1 to 2 Days/week : 747 Mean : 6.462 Asia: 1332
3 to 4 Days/week : 687 3rd Qu.:10.000 Hisp: 8308
(Other) : 869 Max. :98.000
NA's : 102 NA's :262
Depression IncomePersonal IncomePersonalCat Height_ft
No Depression :35254 Min. : 0 6 : 3940 Min. :4.00
Yes Depression: 7839 1st Qu.: 8800 7 : 3887 1st Qu.:5.00
Median : 20000 2 : 3823 Median :5.00
Mean : 28185 4 : 3669 Mean :5.11
3rd Qu.: 36000 1 : 3571 3rd Qu.:5.00
Max. :3000000 11 : 3291 Max. :7.00
(Other):20912 NA's :730
Height_in Weight_lb DaysSmoke IncomePersonalCat_num
Min. : 0.000 Min. : 62.0 Min. : 0.00 Min. : 0
1st Qu.: 3.000 1st Qu.:140.0 1st Qu.: 0.00 1st Qu.: 9000
Median : 5.000 Median :165.0 Median : 0.00 Median : 22500
Mean : 5.279 Mean :170.6 Mean :10.96 Mean : 27523
3rd Qu.: 8.000 3rd Qu.:192.0 3rd Qu.:30.00 3rd Qu.: 37500
Max. :11.000 Max. :500.0 Max. :30.00 Max. :120000
NA's :761 NA's :1376 NA's :102
Height_inches DaysSmoke_log2 TotalCigsSmoked TotalCigsSmoked_log2
Min. :48.0 Min. : -Inf Min. : 0.0 Min. : 0.000
1st Qu.:64.0 1st Qu.: -Inf 1st Qu.: 0.0 1st Qu.: 7.229
Median :66.0 Median : -Inf Median : 0.0 Median : 8.492
Mean :66.6 Mean : -Inf Mean : 187.5 Mean : 7.888
3rd Qu.:70.0 3rd Qu.:4.907 3rd Qu.: 300.0 3rd Qu.: 9.229
Max. :84.0 Max. :4.907 Max. :2940.0 Max. :11.522
NA's :761 NA's :102 NA's :297 NA's :25377
TotalCigsSmoked_sqrt TotalCigsSmoked_smokers TotalCigsSmoked_smokers_log2
Min. : 0.000 Min. : 1.0 Min. : 0.000
1st Qu.: 0.000 1st Qu.: 150.0 1st Qu.: 7.229
Median : 0.000 Median : 360.0 Median : 8.492
Mean : 7.887 Mean : 453.1 Mean : 7.888
3rd Qu.:17.321 3rd Qu.: 600.0 3rd Qu.: 9.229
Max. :54.222 Max. :2940.0 Max. :11.522
NA's :297 NA's :25377 NA's :25377
CigsSmokedFac CigsSmokedFac2 SmokingFreq3
No smoking (0) :25080 Q1 : 0 Never :25080
Lowest smoking (1-6) : 1045 Q2 : 0 Monthly: 772
Low smoking (7-30) : 1129 Q3 : 0 Weekly : 1156
Medium smoking (31-300): 6456 Q4 : 0 Daily :15983
High smoking (301+) : 9086 NA's:43093 NA's : 102
NA's : 297
4.1.6 Data subset rows
Note
Most of you will not need to subset your data.
Finer points
By subsetting your data, you can focus on questions related to a specific subset of the population.
Data subset rows to young smokers
Only as an example (not used in the rest of the assignment), consider restricting our attention to only young smokers.
In practice, I would probably subset the data much earlier. In this example I waited until the end to subset so that I could illustrate all of the other steps with complete values. You may also want to subset to a new dataset object name to keep it separate from the full dataset.
In our example, we are interested in the subset of people 25 or younger who smoked in the last 12 months and this is created using the filter() function. Note that SmokingStatus == "Smoked past 12 months" is used to choose people who smoked in the past 12 months.
Notice, the missing values have been removed (since NAs are not less than or equal to numbers).
# the subset command below will not include a row that evalutes to NA for the# specified subset. For example: (NA==1)
[1] NA
# age and smoking subsetnesarc_sub_young_smokers <- nesarc_sub %>%filter( Age <=25# people 25 or younger , SmokingStatus =="Smoked past 12 months"# smoked in the past 12 months )dim(nesarc_sub_young_smokers)
[1] 1706 26
summary(nesarc_sub_young_smokers$Age)
Min. 1st Qu. Median Mean 3rd Qu. Max.
18.00 20.00 22.00 21.61 23.00 25.00
summary(nesarc_sub_young_smokers$SmokingStatus)
Smoked past 12 months Smoked prior to 12 months Never smoked
1706 0 0
4.2 Data is complete (Class 04)
4.2.1 Plot entire dataset, show missing values
Only keep one of these for your assignment. I think the second one is much more informative.
e_plot_missing(nesarc_sub)
Warning: Using alpha for a discrete variable is not advised.
Adding additional options can help understand the structure of the missing values.
Warning: Using alpha for a discrete variable is not advised.
4.2.2 Numerical summaries to assess correctness (Class 03)
summary(nesarc_sub)
ID Sex Age
1 : 1 Female:24575 Min. :18.0
2 : 1 Male :18518 1st Qu.:32.0
3 : 1 Median :44.0
4 : 1 Mean :46.4
5 : 1 3rd Qu.:59.0
6 : 1 Max. :98.0
(Other):43087
SmokingStatus NicotineDependence
Smoked past 12 months : 9913 Nicotine Dependence : 4962
Smoked prior to 12 months: 8078 No Nicotine Dependence:38131
Never smoked :25080
NA's : 22
SmokingFreq DailyCigsSmoked Ethnicity
Never :25080 Min. : 0.000 Cauc:24507
Every Day :14836 1st Qu.: 0.000 AfAm: 8245
Once a month or less: 772 Median : 0.000 NaAm: 701
1 to 2 Days/week : 747 Mean : 6.462 Asia: 1332
3 to 4 Days/week : 687 3rd Qu.:10.000 Hisp: 8308
(Other) : 869 Max. :98.000
NA's : 102 NA's :262
Depression IncomePersonal IncomePersonalCat Height_ft
No Depression :35254 Min. : 0 6 : 3940 Min. :4.00
Yes Depression: 7839 1st Qu.: 8800 7 : 3887 1st Qu.:5.00
Median : 20000 2 : 3823 Median :5.00
Mean : 28185 4 : 3669 Mean :5.11
3rd Qu.: 36000 1 : 3571 3rd Qu.:5.00
Max. :3000000 11 : 3291 Max. :7.00
(Other):20912 NA's :730
Height_in Weight_lb DaysSmoke IncomePersonalCat_num
Min. : 0.000 Min. : 62.0 Min. : 0.00 Min. : 0
1st Qu.: 3.000 1st Qu.:140.0 1st Qu.: 0.00 1st Qu.: 9000
Median : 5.000 Median :165.0 Median : 0.00 Median : 22500
Mean : 5.279 Mean :170.6 Mean :10.96 Mean : 27523
3rd Qu.: 8.000 3rd Qu.:192.0 3rd Qu.:30.00 3rd Qu.: 37500
Max. :11.000 Max. :500.0 Max. :30.00 Max. :120000
NA's :761 NA's :1376 NA's :102
Height_inches DaysSmoke_log2 TotalCigsSmoked TotalCigsSmoked_log2
Min. :48.0 Min. : -Inf Min. : 0.0 Min. : 0.000
1st Qu.:64.0 1st Qu.: -Inf 1st Qu.: 0.0 1st Qu.: 7.229
Median :66.0 Median : -Inf Median : 0.0 Median : 8.492
Mean :66.6 Mean : -Inf Mean : 187.5 Mean : 7.888
3rd Qu.:70.0 3rd Qu.:4.907 3rd Qu.: 300.0 3rd Qu.: 9.229
Max. :84.0 Max. :4.907 Max. :2940.0 Max. :11.522
NA's :761 NA's :102 NA's :297 NA's :25377
TotalCigsSmoked_sqrt TotalCigsSmoked_smokers TotalCigsSmoked_smokers_log2
Min. : 0.000 Min. : 1.0 Min. : 0.000
1st Qu.: 0.000 1st Qu.: 150.0 1st Qu.: 7.229
Median : 0.000 Median : 360.0 Median : 8.492
Mean : 7.887 Mean : 453.1 Mean : 7.888
3rd Qu.:17.321 3rd Qu.: 600.0 3rd Qu.: 9.229
Max. :54.222 Max. :2940.0 Max. :11.522
NA's :297 NA's :25377 NA's :25377
CigsSmokedFac CigsSmokedFac2 SmokingFreq3
No smoking (0) :25080 Q1 : 0 Never :25080
Lowest smoking (1-6) : 1045 Q2 : 0 Monthly: 772
Low smoking (7-30) : 1129 Q3 : 0 Weekly : 1156
Medium smoking (31-300): 6456 Q4 : 0 Daily :15983
High smoking (301+) : 9086 NA's:43093 NA's : 102
NA's : 297
5 Graphing and Tabulating
5.1 Class 04, Plotting univariate
Rubric
(3 p) For one categorical variable, a barplot is plotted with axis labels and a title. Interpret the plot: describe the relationship between categories you observe.
(3 p) For one numerical variable, a histogram or boxplot is plotted with axis labels and a title. Interpret the plot: describe the distribution (shape, center, spread, outliers).
(2 p) Code missing variables, remove records with missing values, indicate with R output that this was done correctly (e.g., str(), dim(), summary()).
Scroll up to sections labeled “(Class 04)”.
(2 p) Label levels of factor variables.
Scroll up to sections labeled “(Class 04)”.
5.2 Categorical variables
5.2.1 Tables for categorical variables
Basic tables can be created using the function table or xtabs, though for multiple variables the group_by(), summarize(), mutate() recipe makes a nicer table with proportions.
# table() produces a one-variable frequency tabletable(nesarc_sub$NicotineDependence)
Nicotine Dependence No Nicotine Dependence
4962 38131
# proportions available by passing these frequencies to prop.table()table(nesarc_sub$NicotineDependence) %>%prop.table()
Nicotine Dependence No Nicotine Dependence
0.1151463 0.8848537
# A tibble: 2 × 3
NicotineDependence n prop
<fct> <int> <dbl>
1 Nicotine Dependence 4962 0.115
2 No Nicotine Dependence 38131 0.885
5.2.2 Graphing frequency tables
The barplots are all created with the package ggplot2.
library(ggplot2)p <-ggplot(data = nesarc_sub, aes(x = NicotineDependence))p <- p +theme_bw()p <- p +geom_bar()p <- p +labs(title ="Nicotine Dependence for Adults" , x ="Nicotine Dependence" , y ="Total Number" )print(p)
Interpretation: Among adults (n = 43093), there are about 8 times more people (88%) who are not nicotine dependent compared to those who are (12%).
5.2.2.1 Removing NAs from factor axes and tables
Finer points
Remove NAs using drop_na() from categorical variables in plots and tables when the NAs are unwanted.
When a factor variable has NAs, then NA appears as a level with frequences (see bar plot below). As of writing this, ggplot() does not have a way to remove the NAs before plotting; therefore, we have to do it manually.
The easiest and most transparent way is to use the drop_na() function below where we only plot the non-NA values.
Note
The table() function doesn’t show the NAs unless you ask it to.
# table() produces a one-variable frequency tabletable(nesarc_sub$SmokingStatus)
Smoked past 12 months Smoked prior to 12 months Never smoked
9913 8078 25080
table(nesarc_sub$SmokingStatus, useNA ="always")
Smoked past 12 months Smoked prior to 12 months Never smoked
9913 8078 25080
<NA>
22
`summarise()` has grouped output by 'Sex'. You can override using the `.groups`
argument.
tab_smoke
# A tibble: 6 × 4
Sex SmokingStatus n prop
<fct> <fct> <int> <dbl>
1 Female Smoked past 12 months 5070 0.206
2 Female Smoked prior to 12 months 3861 0.157
3 Female Never smoked 15635 0.636
4 Male Smoked past 12 months 4843 0.262
5 Male Smoked prior to 12 months 4217 0.228
6 Male Never smoked 9445 0.51
Compare the two plots below of the same data, but the second plot had the NAs removed before plotting.
library(ggplot2)# Original plot with NAsp1 <-ggplot(data = nesarc_sub, aes(x = SmokingStatus))p1 <- p1 +geom_bar()p1 <- p1 +labs(title ="Basic plot")p1 <- p1 +theme(axis.text.x =element_text(angle =90, vjust =0, hjust =1))#p1# subset excluded the NAs for the variable being plotted# "subset() specifies the values to keep, and "!is.na()" means "keep the non-NA values"# p2 <- ggplot(data = subset(nesarc_sub, !is.na(CigsSmokedFac)), aes(x = CigsSmokedFac))# tidyverse syntax uses drop_na(Var) to drop observations where Var is NAp2 <-ggplot(data = nesarc_sub %>%drop_na(SmokingStatus), aes(x = SmokingStatus))p2 <- p2 +geom_bar()p2 <- p2 +labs(title ="Using drop_na()")p2 <- p2 +theme(axis.text.x =element_text(angle =90, vjust =0, hjust =1))#p2p_arranged <- cowplot::plot_grid(plotlist =list(p1, p2) # list of plots , nrow =1# number of rows for grid of plots , ncol =NULL# number of columns, left unspecified , labels ="AUTO"# A and B panel labels )print(p_arranged)
5.3 Numeric variables
5.3.1 Graphing numeric variables
p <-ggplot(data = nesarc_sub, aes(x = DailyCigsSmoked))p <- p +theme_bw()p <- p +geom_histogram(aes(y = ..density..), boundary =0, binwidth =5)#p <- p + geom_rug() # this plots every point, takes too long for big datap <- p +geom_density(alpha =0.2, fill ="gray50", adjust =4)p <- p +labs(title ="Monthly cigaretts smoked for Young Smoking Adults" , x ="Estimated Cigarettes Smoked per Month" , y ="" )print(p)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.000 0.000 0.000 6.462 10.000 98.000 262
fivenum(nesarc_sub$DailyCigsSmoked)
[1] 0 0 0 10 98
median(nesarc_sub$DailyCigsSmoked, na.rm =TRUE)
[1] 0
IQR(nesarc_sub$DailyCigsSmoked, na.rm =TRUE)
[1] 10
Interpretation: Among smokers, the number of cigarettes smoked is right skewed (shape) with a median (center) of 0 and the IQR (spread) (interquartile range, middle 50%) for the distribution is 300. There are two distinct modes (peaks) in the distribution, the first between 0 and 10 cigarettes (half pack) and the second at 20 (full pack). There are several outlying values (outliers) greater than 40 (representing 40 cigarettes per day) that are possibly overestimates from the people responding.
Finer points
Experiment with the geom_histogram() code in the next code chunk to set the binwidth = for your data (extremely important!). Often, a binwidth of 1 is good if the data are frequencies and high resolution is important. Decide if you need boundary = 0. Add a title and the labels on the \(x\)- and \(y\)-axes.
A smoothed density curve can be added with the geom_density() function, where adjust = 2.0 is a fairly smooth curve. Experiment with adjust values such as 1.0, 1.5, 2.0, and 3.0 to decrease or increase the smoothness and decide what level best captures the general shape of your distribution.
Finer points
The numeric summaries can be used in the text. See the .qmd file to see the R code in the following sentences and in the interpretation.
The “shape” of the distribution for the estimated cigarettes smoked per month is skewed to the right.
The “center” (median) of the distribution is
The “spread” (interquartile range, middle 50%) for the distribution is
There are several “outliers”.
A second example follows. Notice that I filter the data to smokers, first.
Interpretation: Among smokers, about half smoke at most 10 cigarettes per day with the mode at 20 (1 pack per day), two smaller peaks at 30 and 40 (1.5 and 2 packs per day), and some extreme outlying values at 60, 80, and 100 (3, 4, and 5 packs per day).
p <-ggplot(data = nesarc_sub %>%filter(DailyCigsSmoked >0), aes(x = DailyCigsSmoked))p <- p +geom_histogram(aes(y = ..density..), boundary =0, binwidth =2)p <- p +geom_rug()p <- p +geom_density(alpha =0.2, fill ="gray50", adjust =1.5)p <- p +labs(x ="Estimated Cigarettes Smoked per Month" , y ="" , title ="Monthly cigaretts smoked for Young Smoking Adults" )p <- p +theme_bw()p
5.4 Class 05-1, Plotting bivariate, numeric response
Rubric
Each of the following (2 p for plot, 2 p for labeled axes and title, 1 p for interpretation):
Scatter plot (for regression): \(x\) = numerical, \(y\) = numerical, include axis labels and a title. Interpret the plot: describe the relationship.
Box plots (for ANOVA): \(x\) = categorical, \(y\) = numerical, include axis labels and a title. Interpret the plot: describe the relationship.
5.4.1 Scatter plot (for regression): x = numerical, y = numerical
library(ggplot2)p <-ggplot(nesarc_sub %>%filter(TotalCigsSmoked >0), aes(x = Age, y = TotalCigsSmoked))p <- p +theme_bw()#p <- p + geom_point(alpha = 1/10)p <- p +geom_jitter(width =0.25, height =5, alpha =1/10)p <- p +geom_smooth()p <- p +stat_smooth(method = lm, colour ="red")p <- p +labs(title ="Total Cigarettes Smoked vs Age"#, x = , y ="Total Cigarettes Smoked per month" , caption ="Key: Blue line is GAM smoother, Red line is simple linear regression.\nFiltered to include only smokers (cig > 0)." )print(p)
`geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
`geom_smooth()` using formula 'y ~ x'
Interpretation: The relationship between Age and TotalCigsSmoked, among people who smoke, is very weakly positive, but does not resemble a straight line. TotalCigsSmoked is right-skewed which results in many extreme values above the regression line.
Finer points
Improving plots and comparing by a categorical variable using facets.
Research question: With my questions, I don’t have a meaningful question to ask that will relate two numeric variables.
An exploratory question whether there is a relationship between a person’s age (Age) and total number of cigaretts smoked (TotalCigsSmoked). Because age is an integer varible, I jitter the points, and I reduce the opacity (increase the transparancy) with alpha = 1/4 to see the density of points better (1/4 means that it takes 4 overlayed points to be fully opaque).
The first plot has all the points in their original locations, but they end up stacking on top of each other so you can’t tell how many points are there.
library(ggplot2)p <-ggplot(nesarc_sub, aes(x = Age, y = TotalCigsSmoked))p <- p +geom_point(alpha =1/4)p <- p +stat_smooth(method = lm)p <- p +labs(title ="Total Cigarettes Smoked vs Age")print(p)
This is the same data as above, but I have added some horizontal jitter (displacement) to each point which doesn’t affect their value on the y-axis but allows us to see how many points are at each discrete age year.
library(ggplot2)p <-ggplot(nesarc_sub, aes(x = Age, y = TotalCigsSmoked))p <- p +geom_jitter(position =position_jitter(width =0.2), alpha =1/4)p <- p +stat_smooth(method = lm)p <- p +labs(title ="Total Cigarettes Smoked vs Age")print(p)
Here’s an example of height and weight with a strong positive relationship between the two numeric variables. Most variables in our dataset won’t exhibit such a nice relationship.
library(ggplot2)p <-ggplot(nesarc_sub, aes(x = Height_inches, y = Weight_lb))p <- p +theme_bw()p <- p +geom_jitter(height =0.5, alpha =1/10)p <- p +stat_smooth(method = lm)p <- p +labs(title ="Weight vs Height")print(p)
If you have two groups you want to compare, you can create facets (small multiples) and show the relationship by each group. Below, I compare by sex. The slopes appear similar; later in the semester we’ll learn how to formally compare these lines.
library(ggplot2)p <-ggplot(nesarc_sub, aes(x = Height_inches, y = Weight_lb))p <- p +geom_point(alpha =1/4)p <- p +stat_smooth(method = lm)p <- p +labs(title ="Weight vs Height")p <- p +facet_wrap(~ Sex)print(p)
5.4.2 Box plots (for ANOVA): x = categorical, y = numerical
Research question: Is there a relationship between depression (Depression) and total number of cigarettes smoked (TotalCigsSmoked)?
library(ggplot2)p <-ggplot(nesarc_sub %>%filter(TotalCigsSmoked >0), aes(x = Depression, y = TotalCigsSmoked))p <- p +theme_bw()p <- p +geom_boxplot(width =0.5, alpha =0.5)p <- p +geom_jitter(position =position_jitter(width =0.1), alpha =1/100)# diamond at mean for each groupp <- p +stat_summary(fun = mean, geom ="point", shape =18, size =6,colour ="red", alpha =0.8)p <- p +labs(title ="Total Cigarettes Smoked by Depression" , x ="Depression status" , y ="Total Cigarettes Smoked per month" , caption ="Red diamond is the mean.\nFiltered to include only smokers (cig > 0)." )print(p)
Interpretation: For NESARC participants who smoke (cig > 0), those who are depressed have a slightly higher mean number of cigarettes per month (red diamond) than those who are not depressed. Total cigarettes smoked per month is right skewed so there are many people who smoke very little but few people who smoke a lot, which increases the mean beyond the median.
5.5 Class 05-2, Plotting bivariate, categorical response
Rubric
Each of the following (2 p for plot, 2 p for labeled axes and title, 1 p for interpretation):
Mosaic plot or bivariate bar plots (for contingency tables): \(x\) = categorical, \(y\) = categorical, include axis labels and a title. Interpret the plot: describe the relationship.
Logistic scatter plot (for logistic regression): \(x\) = numerical, \(y\) = categorical (binary), include axis labels and a title. Interpret the plot: describe the relationship.
5.5.1 Mosaic plot or bivariate bar plots (for contingency tables): x = categorical, y = categorical
Research question: Is there a relationship between depression status (Depression) and nicotine dependence (NicotineDependence)?
`summarise()` has grouped output by 'NicotineDependence'. You can override
using the `.groups` argument.
tab_nic_dep
# A tibble: 4 × 4
NicotineDependence Depression n prop
<fct> <fct> <int> <dbl>
1 Nicotine Dependence No Depression 3193 0.643
2 Nicotine Dependence Yes Depression 1769 0.357
3 No Nicotine Dependence No Depression 32061 0.841
4 No Nicotine Dependence Yes Depression 6070 0.159
Reversing color order: It is sometimes preferable to reverse the order of your color-filled variable to read the more interesting proportions from the bottom of the plot up the proportion scale. Use the function forcats::fct_rev() on your fill= variable.
library(ggplot2)p <-ggplot(data = nesarc_sub, aes(x = Depression, fill = forcats::fct_rev(NicotineDependence)))p <- p +theme_bw()p <- p +geom_bar(position ="fill")p <- p +labs(x ="Depression status" , y ="Proportion" , fill ="Nicotine Dependence\nStatus"# the legend title can be modified, if desired (try this line) , title ="Proportion of young adult smokers with and without\nnicotine dependence by depression status" )print(p)
Interpretation: When a person has depression, they are more than twice as likely to have nicotine dependence (22.6%) than those without depression (9.1%).
Finer points
We can compare by a categorical variable with facets (or “small multiples”).
Legends can be (A) removed or (B) moved and rearranged.
Research question: Is there a relationship between smoking frequency (SmokingFreq) and nicotine dependence (NicotineDependence)?
Also compare by sex.
library(ggplot2)p <-ggplot(data =subset(nesarc_sub, !is.na(SmokingFreq)), aes(x = SmokingFreq, fill = NicotineDependence))p <- p +geom_bar(position ="fill")p <- p +theme_bw()# tilt axis labelsp <- p +theme(axis.text.x =element_text(angle =45, vjust =1, hjust =1))p <- p +labs(x ="Smoking Frequency", y ="Percent")p1 <- pp1 <- p1 +theme(legend.position ="none") # remove legend# also compare by sexp <- p +facet_grid(Sex ~ .)p <- p +theme(legend.position ="bottom") # "none"p <- p +theme(legend.box="vertical", legend.margin=margin()) # stack verticallyp <- p +guides(fill =guide_legend(ncol =1, byrow =FALSE)) # shape levels of one legendp2 <- pp_arranged <- cowplot::plot_grid(plotlist =list(p1, p2) # list of plots , nrow =1# number of rows for grid of plots , ncol =NULL# number of columns, left unspecified , labels ="AUTO"# A and B panel labels )print(p_arranged)
5.5.1.1 Using Mosaic Plots
Note
This section on Mosaic plots can be completely skipped.
Finer points
Mosaic plots will be used later in the course when we calculate statistics to test whether a pair of categorical variables are associated. These are helpful because they color-code each cell based on whether they provide evidence of an association. These plots are difficult to customize, so the ggplot() versions above will generally be preferred to these.
Interpretation: If a person has depression they are more likely to have nicotine dependence than if they do not have depression.
library(vcd)
Loading required package: grid
mosaic( ~ NicotineDependence + Depression, data = nesarc_sub)
mosaic( ~ NicotineDependence + Depression, data = nesarc_sub, shade =TRUE)
It’s easy to look at such a fine resolution (too many variables) that it’s difficult to understand what’s happening.
mosaic(~ NicotineDependence + Depression + Sex + CigsSmokedFac, data = nesarc_sub, shade =TRUE)
5.5.1.1.1 Customizing Mosaic Plots
There are several solutions to dealing with the messy labels that overlap. The help line below will show the many options you have control over. Read through the comments of each code chunk, see the new options, and see the result.
# Abbreviate labels, specify the number of characters (try a few)mosaic( ~ NicotineDependence + Depression, data = nesarc_sub , abbreviate_labs =4)
# Rotate and align labelsmosaic( ~ NicotineDependence + Depression, data = nesarc_sub , rot_labels =c(0, 90, 0, 0) , just_labels ="right"# right-justifies all the labels (top and side) )
# Rotate and align labels only y-axis labelmosaic( ~ NicotineDependence + Depression, data = nesarc_sub , rot_labels =c(0, 90, 0, 0) , just_labels =c("center", "right") # order is opposite as in first argument )
# Rotate and align labels only y-axis label, and offset the variable namemosaic( ~ NicotineDependence + Depression, data = nesarc_sub , rot_labels =c(0, 90, 0, 0) , just_labels =c("center", "right") # order is opposite as in first argument , offset_varnames =c(0,0,0,9) # offset the y-axis var name )
Final version
Labeling this plot well is, in my opinion, a bit too hard to be worth it. Notice in my code chunk options (you’ll need to look at the .Rmd file), that I’ve included fig.height = 8, fig.width = 8 which makes the size of the plot in the output 8 inches high by 8 inches wide.
library(vcd)mosaic( ~ NicotineDependence + Depression, data = nesarc_sub , shade =TRUE , rot_labels =c(0, 90, 0, 0) # order is: top, right, bottom, left , just_labels =c("center", "right") # order is opposite as in first argument , offset_varnames =c(0, 0, 0, 9) # offset the y-axis var name# a big left margin makes space for the plot labels , margins =c(3, 3, 3, 14) # order is: top, right, bottom, left , main ="Fraction of young adult smokers with and without\n nicotine dependence by depression status" , labeling_args =list(set_varnames =c(NicotineDependence ="Tobacco Addiction Status" , Depression ="Depression status") ) )
Interpretation: If a person has nicotine dependence they are more likely to have depression than if they do not have nicotine dependence.
5.5.2 Logistic scatter plot (for logistic regression): x = numerical, y = categorical (binary)
Research question: Frequency and quantity of smoking (TotalCigsSmoked) are markedly imperfect indices for determining an individual’s probability of exhibiting nicotine dependence (NicotineDependence) (this is true for other drugs as well).
table(nesarc_sub$NicotineDependence)
Nicotine Dependence No Nicotine Dependence
4962 38131
# if "Nicotine Dependence", then code it as a 1, otherwise code as a 0nesarc_sub$NicotineDependence01 <-ifelse(nesarc_sub$NicotineDependence =="Nicotine Dependence", 1, 0)# check that the frequencies are correct before and aftertable(nesarc_sub$NicotineDependence01)
0 1
38131 4962
# # Note, sometimes it is necessary to make your variable a character type first.# # If the code above doesn't produce both 0s and 1s, then try this:# nesarc_sub$NicotineDependence01 <- ifelse(as.character(nesarc_sub$NicotineDependence) == "Nicotine Dependence", 1, 0)
library(ggplot2)p <-ggplot(nesarc_sub, aes(x = TotalCigsSmoked, y = NicotineDependence01))p <- p +theme_bw()p <- p +geom_jitter(position =position_jitter(height =0.1), alpha =1/4)# overlay a sigmodal curve to indicate direction of relationship# As of ggplot2 2.1.0, need to use this binomial_smooth() function: binomial_smooth <-function(...) {geom_smooth(method ="glm", method.args =list(family ="binomial"), ...) }p <- p +binomial_smooth() # old way: stat_smooth(method="glm", family="binomial", se=FALSE)p <- p +scale_y_continuous(breaks =seq(0, 1, by=0.2), minor_breaks =seq(0, 1, by=0.1))p <- p +labs(x ="Estimated Cigarettes Smoked per Month" , y ="Nicotine Dependence (1=Yes, 0=No)" , title ="Likelihood of Nicotine Dependence\nincreases with number of cigarettes smoked" )print(p)
Interpretation: The probability of someone being tobacco dependent increases as the number of cigarettes smoked per month increases. The logistic curve model probably doesn’t apply at (or near) 0 cigarettes.
Finer points
While the basic plot can be made without collapsing to binary, later in the semester we will learn about logistic regression where we model the probability of “success”. Categories associated with “success” are coded as 1, and all other categories are “failure” coded as 0. Below I show how to create this binary 0/1 variable from a numeric variable and a multi-level categorical variable.
Research question: 1. Frequency and quantity of smoking (TotalCigsSmoked) are markedly imperfect indices for determining an individual’s probability of exhibiting nicotine dependence (NicotineDependence) (this is true for other drugs as well).
library(ggplot2)p <-ggplot(nesarc_sub, aes(x = TotalCigsSmoked, y = NicotineDependence))p <- p +geom_jitter(position =position_jitter(height =0.1), alpha =1/4)print(p)
Interpretation: These two groups are quite balanced on the probability of tobacco dependence given the total number of cigarettes smoked, though they increase together slightly.
To overlay a curve, we need to convert our response variable \(y\) to be binary 0 or 1, where we let 1 indicate a “success” in the direction of interest.
5.5.2.1 Collapsing a variable to binary
Whether you have a numeric or categorical variable that you’d like to represent with two levels, you’ll need to convert either to a numeric binary variable (values of 0 or 1, only). The value of 1 indicates “success” or the feature you’re interested in (below, 1 = someone with tobacco dependence) and 0 is “failure” (not tobacco dependent). Note that “success” does not necessarily mean “good”.
5.5.2.1.1 Numeric to binary
It is convenient to collapse a continuous variable into a categorical variable (as with the cut() command) or further collapse a categorical variable into a binary variable.
One strategy to create a binary variable is to use the ifelse() function.
Let’s create a binary Age variable; if 21 years and older, than legal to purchase alcohol, otherwise it is illegal. See how the two columns below correspond as we wanted them to. I do not recommend overwriting the original column, but create a new variable in the data.frame.
# Smoking frequency has 7 categories# if at least 3 Days/week, then code as a 1, otherwise code as a 0table(nesarc_sub$SmokingFreq)
Never Once a month or less 2 to 3 Days/month
25080 772 409
1 to 2 Days/week 3 to 4 Days/week 5 to 6 Days/week
747 687 460
Every Day
14836
nesarc_sub$SmokingFreq01 <-ifelse( nesarc_sub$SmokingFreq%in%c("3 to 4 Days/week" , "5 to 6 Days/week" , "Every Day" ) , 1# success , 0# failure )# check that the frequencies are correct before and aftertable(nesarc_sub$SmokingFreq01)
0 1
27110 15983
5.6 Class 06, Figure arrangement, captions, cross-referencing
Rubric
Reorganize your Class 05 bivariate plots above using plot_grid() and quarto, creating captions and cross-referencing them from the text.
(3 p) For your numeric response plots, use cowplot::plot_grid() to create a single figure with separate plot panels.
(3 p) For your categorical response plots, use quarto chunk options fig-cap, fig-subcap, and layout-ncol to create a single figure with separate plot panels.
(2 p) Use captions to describe (not interpret) both sets of plots so a reader understands what is being plotted.
(2 p) Use cross-referencing from the text to refer to the plots when you interpret them.
Note
Go above and reformat your plots and update your interpretations with cross-referencing.
Note
In my example, for pedagogical purposes, I reproduce the plots and illustrate the cross-referencing below. This allows me to separate for you the original creation of the plots from this step of arranging and cross-referencing them.
While I show the chunk options in the html output file, it’s best to look at the .qmd file to see the chuck options.
Start by creating two plots and call them p1 and p2. Notice that I creating them with variable p, then assign p to p1 at the end.
# These two plots are copied from above.# Scatter plot (for regression): x = numerical, y = numericallibrary(ggplot2)p <-ggplot(nesarc_sub %>%filter(TotalCigsSmoked >0), aes(x = Age, y = TotalCigsSmoked))p <- p +theme_bw()#p <- p + geom_point(alpha = 1/10)p <- p +geom_jitter(width =0.25, height =5, alpha =1/10)p <- p +geom_smooth()p <- p +stat_smooth(method = lm, colour ="red")p <- p +labs(title ="Total Cigarettes Smoked vs Age"#, x = , y ="Total Cigarettes Smoked per month" , caption ="Key: Blue line is GAM smoother, Red line is simple linear regression.\nFiltered to include only smokers (cig > 0)." )p1 <- p# Box plots (for ANOVA): x = categorical, y = numericallibrary(ggplot2)p <-ggplot(nesarc_sub %>%filter(TotalCigsSmoked >0), aes(x = Depression, y = TotalCigsSmoked))p <- p +theme_bw()p <- p +geom_boxplot(width =0.5, alpha =0.5)p <- p +geom_jitter(position =position_jitter(width =0.1), alpha =1/100)# diamond at mean for each groupp <- p +stat_summary(fun = mean, geom ="point", shape =18, size =6,colour ="red", alpha =0.8)p <- p +labs(title ="Total Cigarettes Smoked by Depression" , x ="Depression status" , y ="Total Cigarettes Smoked per month" , caption ="Red diamond is the mean.\nFiltered to include only smokers (cig > 0)." )p2 <- p
Finer points
Look at the help for ?cowplot::plot_grid; there are many options that I don’t use below. I put the p1 and p2 into a list and print the one plot. In the code chunk, I specified a label and a figure caption; I also increase the size of the plot. From the text, I use the chunk label as a cross-reference; note that I have to manually specify figure panels “A” and “B”. I just copied the previous interpretations.
These are the chunk options used below, I recommend looking in the .qmd file.
#| label: fig-bivariate-plots-num
#| fig-cap: Numeric response bivariate plots. (A) Total Cigarettes Smoked vs Age. (B) Total Cigarettes Smoked by Depression
#| fig-width: 8
#| fig-height: 4
p_arranged <- cowplot::plot_grid(plotlist =list(p1, p2) # list of plots , nrow =1# number of rows for grid of plots , ncol =NULL# number of columns, left unspecified , labels ="AUTO"# A and B panel labels )
`geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
`geom_smooth()` using formula 'y ~ x'
print(p_arranged)
Figure 1: Numeric response bivariate plots. (A) Total Cigarettes Smoked vs Age. (B) Total Cigarettes Smoked by Depression
Interpretation:
In Figure 1 A, the relationship between Age and TotalCigsSmoked, among people who smoke, is very weakly positive, but does not resemble a straight line. TotalCigsSmoked is right-skewed which results in many extreme values above the regression line.
In Figure 1 B, for NESARC participants who smoke (cig > 0), those who are depressed have a slightly higher mean number of cigarettes per month (red diamond) than those who are not depressed. Total cigarettes smoked per month is right skewed so there are many people who smoke very little but few people who smoke a lot, which increases the mean beyond the median.
5.6.2 Plotting bivariate, categorical response, quarto chunk options
Finer points
I’ve now also added a fig-subcap (figure subcaptions) which appear under each figure. They must be specified in the same order as the plots are printed and their cross-reference label appends a suffix of -1, -2, etc., to the fig-cap. In the interpretation below, I’ve included the cross-reference parenthetically at the end of the first sentence describing them, which is another common style.
These are the chunk options used below, I recommend looking in the .qmd file.
#| label: fig-bivariate-plots-cat
#| fig-cap: Categorical response bivariate plots.
#| fig-subcap:
#| - "Proportion of young adult smokers with and without nicotine dependence by depression status"
#| - "Likelihood of Nicotine Dependence increases with number of cigarettes smoked"
#| layout-ncol: 1 # one column for plots # alternatively, #| layout: [45,-10, 45] puts 2 on one row with a 10% width space between
#| fig-width: 5
#| fig-height: 3
# Mosaic plot or bivariate bar plots (for contingency tables): x = categorical, y = categoricallibrary(ggplot2)p <-ggplot(data = nesarc_sub, aes(x = Depression, fill = forcats::fct_rev(NicotineDependence)))p <- p +theme_bw()p <- p +geom_bar(position ="fill")p <- p +labs(x ="Depression status" , y ="Proportion" , fill ="Nicotine Dependence\nStatus"# the legend title can be modified , title ="Proportion of young adult smokers with and without\nnicotine dependence by depression status" )print(p)# Logistic scatter plot (for logistic regression): x = numerical, y = categorical (binary)library(ggplot2)p <-ggplot(nesarc_sub, aes(x = TotalCigsSmoked, y = NicotineDependence01))p <- p +theme_bw()p <- p +geom_jitter(position =position_jitter(height =0.1), alpha =1/4)# overlay a sigmodal curve to indicate direction of relationship# As of ggplot2 2.1.0, need to use this binomial_smooth() function: binomial_smooth <-function(...) {geom_smooth(method ="glm", method.args =list(family ="binomial"), ...) }p <- p +binomial_smooth() # old way: stat_smooth(method="glm", family="binomial", se=FALSE)p <- p +scale_y_continuous(breaks =seq(0, 1, by=0.2), minor_breaks =seq(0, 1, by=0.1))p <- p +labs(x ="Estimated Cigarettes Smoked per Month" , y ="Nicotine Dependence (1=Yes, 0=No)" , title ="Likelihood of Nicotine Dependence\nincreases with number of cigarettes smoked" )print(p)
(a) Proportion of young adult smokers with and without nicotine dependence by depression status
(b) Likelihood of Nicotine Dependence increases with number of cigarettes smoked
Figure 2: Categorical response bivariate plots.
Interpretation:
Categorical response bivariate plots are in Figure 2.
When a person has depression, they are more than twice as likely to have nicotine dependence (22.6%) than those without depression (9.1%) (Figure 2 (a)).
The probability of someone being tobacco dependent increases as the number of cigarettes smoked per month increases (Figure 2 (b)). The logistic curve model probably doesn’t apply at (or near) 0 cigarettes.
6 Statistical methods
6.1 Class 07-1, Simple linear regression (separate worksheet)
Note
Find this assignment on the course website.
6.2 Class 07-2, Simple linear regression
Rubric
With your previous (or new) bivariate scatter plot, add a regression line.
(2 p) plot with regression line,
(1 p) label axes and title.
Use lm() to fit the linear regression and interpret slope and \(R^2\) (R-squared) values.
(2 p) lm summary() table is presented,
(2 p) slope is interpreted with respect to a per-unit increase of the \(x\) variable in the context of the variables in the plot,
(2 p) \(R^2\) is interpretted in a sentence.
(1 p) Interpret the intercept. Does it make sense in the context of your study?
6.2.1 1. Scatter plot, add a regression line.
library(ggplot2)p <-ggplot(nesarc_sub %>%filter(TotalCigsSmoked >0), aes(x = Age, y = TotalCigsSmoked))p <- p +theme_bw()#p <- p + geom_point(alpha = 1/10)p <- p +geom_jitter(width =0.25, height =5, alpha =1/10)#p <- p + geom_smooth()p <- p +stat_smooth(method = lm, fullrange =TRUE) # adds the regression linep <- p +labs(title ="Total Cigarettes Smoked vs Age"#, x = , y ="Total Cigarettes Smoked per month" , caption ="Key: Blue line is simple linear regression.\nFiltered to include only smokers (cig > 0)." )p <- p +xlim(c(0, NA)) # extends the range of the axis down to 0 to see the "intercept"print(p)
`geom_smooth()` using formula 'y ~ x'
6.2.2 2. Fit the linear regression, interpret slope and \(R^2\) (R-squared) values
# fit the simple linear regression modellm_TCS_Age <-lm(formula = TotalCigsSmoked ~ Age , data = nesarc_sub )# use summary() to parameters estimates (slope, intercept) and other summariessummary(lm_TCS_Age)
Call:
lm(formula = TotalCigsSmoked ~ Age, data = nesarc_sub)
Residuals:
Min 1Q Median 3Q Max
-332.1 -189.4 -139.1 106.3 2828.9
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 57.92792 4.35989 13.29 <2e-16 ***
Age 2.79797 0.08762 31.93 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 329.2 on 42794 degrees of freedom
(297 observations deleted due to missingness)
Multiple R-squared: 0.02327, Adjusted R-squared: 0.02325
F-statistic: 1020 on 1 and 42794 DF, p-value: < 2.2e-16
Slope: For every 1 year increase in Age, we expect an increase of 2.798 total cigarettes smoked per month.
\(R^2\): The proportion of variance explained by the regression model, compared to the grand mean, is \(R^2 = 0.0233\), which is very small.
Note
When the \(x\) and \(y\) variables are correlated (\(r \ne 0\)), \(R^2\) will be larger (since \(R^2 = r^2\), the capitalization doesn’t matter in this case).
6.2.3 3. Interpret the intercept. Does it make sense?
For someone 0 years old, the expected number of cigarettes smoked per month is 57.93.
This does not make sense since newborns do not smoke and because this is a large extrapolation from the data.
6.3 Class 08-1, Logarithm transformation (separate worksheet)
Note
Find this assignment on the course website.
6.4 Class 08-2, Logarithm transformation
Rubric
Try plotting the data on a logarithmic scale
(6 p) Each of the logarithmic relationships is plotted, axes are labeled with scale.
original scales
\(\log(x)\)-only
\(\log(y)\)-only
both \(\log(x)\) and \(\log(y)\)
What happened to your data when you transformed it?
(2 p) Describe what happened to the relationship after each log transformation (compare transformed scale to original scale; is the relationship more linear, more curved?).
(1 p) Choose the best scale for a linear relationship and explain why.
(1 p) Does your relationship benefit from a logarithmic transformation? Say why or why not.
6.4.1 1. Try plotting on log scale (original scale, \(\log(x)\)-only, \(\log(y)\)-only, both \(\log(x)\) and \(\log(y)\))
# IncomePersonal#p <- p + scale_x_log10()#p <- p + scale_y_log10()library(ggplot2)p1 <-ggplot(nesarc_sub %>%filter(TotalCigsSmoked >0), aes(x = Age, y = TotalCigsSmoked))p1 <- p1 +theme_bw()p1 <- p1 +geom_point(alpha =1/20)p1 <- p1 +stat_smooth(method = lm)p1 <- p1 +labs(title ="Original scale" , x ="Age" , y ="Total Cigarettes Smoked per month" )#print(p1)library(ggplot2)p2 <-ggplot(nesarc_sub %>%filter(TotalCigsSmoked >0), aes(x = Age, y = TotalCigsSmoked))p2 <- p2 +theme_bw()p2 <- p2 +geom_point(alpha =1/20)p2 <- p2 +stat_smooth(method = lm)p2 <- p2 +scale_x_log10() # convert the x-axis to the log10 scalep2 <- p2 +labs(title ="Log(x)" , x ="Age" , y ="Total Cigarettes Smoked per month" )#print(p2)library(ggplot2)p3 <-ggplot(nesarc_sub %>%filter(TotalCigsSmoked >0), aes(x = Age, y = TotalCigsSmoked))p3 <- p3 +theme_bw()p3 <- p3 +geom_point(alpha =1/20)p3 <- p3 +stat_smooth(method = lm)p3 <- p3 +scale_y_log10() # convert the y-axis to the log10 scalep3 <- p3 +labs(title ="Log(y)" , x ="Age" , y ="Total Cigarettes Smoked per month" )#print(p3)library(ggplot2)p4 <-ggplot(nesarc_sub %>%filter(TotalCigsSmoked >0), aes(x = Age, y = TotalCigsSmoked))p4 <- p4 +theme_bw()p4 <- p4 +geom_point(alpha =1/20)p4 <- p4 +stat_smooth(method = lm)p4 <- p4 +scale_x_log10() # convert the x-axis to the log10 scalep4 <- p4 +scale_y_log10() # convert the y-axis to the log10 scalep4 <- p4 +labs(title ="Log(x) and Log(y)" , x ="Age" , y ="Total Cigarettes Smoked per month" )#print(p4)# arrange them into a single figurep_arranged <- cowplot::plot_grid(plotlist =list(p1, p2, p3, p4) # list of plots , nrow =2# number of rows for grid of plots , ncol =NULL# number of columns, left unspecified , labels ="AUTO"# A and B panel labels )
`geom_smooth()` using formula 'y ~ x'
`geom_smooth()` using formula 'y ~ x'
`geom_smooth()` using formula 'y ~ x'
`geom_smooth()` using formula 'y ~ x'
print(p_arranged)
6.4.2 2. What happened to your data when you transformed it?
Describe what happened to the relationship after each log transformation (compare transformed scale to original scale).
Original: Same
log(x): no improvement, age values became left skewed
log(y): some improvement, cigs smoked became left skewed, but the regression line was closer to the center of the data.
log(x) and log(y): some improvement, cigs smoked became left skewed, but the regression line was closer to the center of the data; age transformation doesn’t help.
Choose the best scale for a linear relationship and explain why.
log(y) because the cigs smoked became slightly left skewed (better than the extreme right skewness from before), and there was no need for an age transformation.
Does your relationship benefit from a logarithmic transformation? Say why or why not.
A little bit since the degree of skewness vertically around the regression line was reduced with the log(y) transformation.
Finer points
Notice that the interpretation is on the scale of the regression. If the \(x\) or \(y\) variable is on a different scale (such as \(\log_{10}\)), then the interpretation is on that scale, too.
The plot with \((x, \log_{10}(y))\) is an improvement over \((x,y)\) since the points are more symmetric around the regression line (they are not perfectly symmetric, but they are better). There might be a better transformation (such as square-root), or we might have to live with a non-symmetric response.
nesarc_sub <- nesarc_sub %>%mutate(TotalCigsSmokedLog10 =log10(TotalCigsSmoked) )# fit the simple linear regression modellm_TCSL10_Age <-lm( TotalCigsSmokedLog10 ~ Age , data = nesarc_sub %>%filter(TotalCigsSmoked >0) )# use summary() to parameters estimates (slope, intercept) and other summariessummary(lm_TCSL10_Age)
Call:
lm(formula = TotalCigsSmokedLog10 ~ Age, data = nesarc_sub %>%
filter(TotalCigsSmoked > 0))
Residuals:
Min 1Q Median 3Q Max
-2.6113 -0.1787 0.2314 0.4303 1.2553
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.1093341 0.0152392 138.41 <2e-16 ***
Age 0.0054558 0.0002952 18.48 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6836 on 17714 degrees of freedom
Multiple R-squared: 0.01892, Adjusted R-squared: 0.01886
F-statistic: 341.6 on 1 and 17714 DF, p-value: < 2.2e-16
The slope is 0.00546. Thus, for every year increase in age, we expect an increase of 0.00546 in log 10 of total cigarettes smoked; that is, we expect a small decrease on the log scale.
Finer points
Logorithm transformations of any base (2, \(e\), 10) modify the shape of the data in the same way. The difference is a multiplicative scaling of the axis. My favorite is base 2 because many datasets have several orders of “doubling” and it’s easy for me to think of each unit on the log 2 scale being twice as big as the previous unit. Compare this to base 10 where each unit is 10 times as large as the previous — very few datasets have such a vast range!
6.4.2.0.1 Additional example plots
Finer points
Here’s an example where transforming both \(x\) and \(y\) with the log makes a big improvement.
# IncomePersonal#p <- p + scale_x_log10()#p <- p + scale_y_log10()library(ggplot2)p1 <-ggplot(nesarc_sub %>%filter(TotalCigsSmoked >0), aes(x = IncomePersonal, y = TotalCigsSmoked))p1 <- p1 +theme_bw()p1 <- p1 +geom_point(alpha =1/20)p1 <- p1 +stat_smooth(method = lm)p1 <- p1 +labs(title ="Original scale" , x ="Personal Income" , y ="Total Cigarettes Smoked per month" )#print(p1)library(ggplot2)p2 <-ggplot(nesarc_sub %>%filter(TotalCigsSmoked >0), aes(x = IncomePersonal, y = TotalCigsSmoked))p2 <- p2 +theme_bw()p2 <- p2 +geom_point(alpha =1/20)p2 <- p2 +stat_smooth(method = lm)p2 <- p2 +scale_x_log10()p2 <- p2 +labs(title ="Log(x)" , x ="Personal Income" , y ="Total Cigarettes Smoked per month" )#print(p2)library(ggplot2)p3 <-ggplot(nesarc_sub %>%filter(TotalCigsSmoked >0), aes(x = IncomePersonal, y = TotalCigsSmoked))p3 <- p3 +theme_bw()p3 <- p3 +geom_point(alpha =1/20)p3 <- p3 +stat_smooth(method = lm)p3 <- p3 +scale_y_log10()p3 <- p3 +labs(title ="Log(y)" , x ="Personal Income" , y ="Total Cigarettes Smoked per month" )#print(p3)library(ggplot2)p4 <-ggplot(nesarc_sub %>%filter(TotalCigsSmoked >0), aes(x = IncomePersonal, y = TotalCigsSmoked))p4 <- p4 +theme_bw()p4 <- p4 +geom_point(alpha =1/20)p4 <- p4 +stat_smooth(method = lm)p4 <- p4 +scale_x_log10()p4 <- p4 +scale_y_log10()p4 <- p4 +labs(title ="Log(x) and Log(y)" , x ="Personal Income" , y ="Total Cigarettes Smoked per month" )#print(p4)p_arranged <- cowplot::plot_grid(plotlist =list(p1, p2, p3, p4) # list of plots , nrow =2# number of rows for grid of plots , ncol =NULL# number of columns, left unspecified , labels ="AUTO"# A and B panel labels )
`geom_smooth()` using formula 'y ~ x'
Warning: Transformation introduced infinite values in continuous x-axis
Transformation introduced infinite values in continuous x-axis
### This is another way to arrange several ggplot objects# library(gridExtra)# grid.arrange(grobs = list(p1, p2, p3, p4), nrow=2, top = "Total Cigarettes Smoked vs Personal Income")
6.5 Class 09, Correlation (separate worksheet)
Note
Find this assignment on the course website.
6.6 Class 10, Categorical contingency tables (separate worksheet)
Note
Find this assignment on the course website.
6.7 Class 11, Correlation and Categorical contingency tables
Rubric
With your previous (or a new) bivariate scatter plot, calculate the correlation and interpret.
(1 p) plot is repeated here or the plot is referenced and easy to find from a plot above,
(1 p) correlation is calculated,
(2 p) correlation is interpreted (direction, strength of LINEAR relationship).
With your previous (or a new) two- or three-variable categorical plot, calculate conditional proportions and interpret.
(1 p) frequency table of variables is given,
(2 p) conditional proportion tables are calculated of the outcome variable conditional on one or two other variables,
(1 p) a well-labeled plot of the proportion table is given,
(2 p) the conditional proportions are interpreted and compared between conditions.
6.7.1 Correlation
library(ggplot2)p <-ggplot(nesarc_sub, aes(x = Age, y = TotalCigsSmoked_log2))p <- p +theme_bw()p <- p +geom_point(alpha =1/20)p <- p +stat_smooth(method = lm)p <- p +labs(title ="Log2 Cigarettes smoked vs Age" , x ="Age" , y ="Log2 Total Cigarettes Smoked per month" )print(p)
cor_A_C <-cor( nesarc_sub$Age , nesarc_sub$TotalCigsSmoked_log2 , use ="complete.obs" )cor_A_C
[1] 0.1375481
6.7.2 Interpretation of correlation
The correlation is 0.138, this is a positive moderately-weak linear relationship between Age and Log2 total cigarettes smoked per month, meaning that as people get older they tend slightly to increase the Log2 Total Number of Cigarettes smoked per month.
`summarise()` has grouped output by 'Sex', 'Depression'. You can override using
the `.groups` argument.
# tab_S_D_Ntab_S_D_N %>% knitr::kable()
Sex
Depression
NicotineDependence
Frequency
Proportion
Female
No Depression
Nicotine Dependence
1455
0.076
Female
No Depression
No Nicotine Dependence
17706
0.924
Female
Yes Depression
Nicotine Dependence
1117
0.206
Female
Yes Depression
No Nicotine Dependence
4297
0.794
Male
No Depression
Nicotine Dependence
1738
0.108
Male
No Depression
No Nicotine Dependence
14355
0.892
Male
Yes Depression
Nicotine Dependence
652
0.269
Male
Yes Depression
No Nicotine Dependence
1773
0.731
library(ggplot2)p <-ggplot(data = tab_S_D_N, aes(x = Depression, y = Proportion, fill = forcats::fct_rev(NicotineDependence)))p <- p +theme_bw()p <- p +geom_hline(yintercept =c(0, 1), alpha =1/4)p <- p +geom_bar(stat="identity")p <- p +labs(title ="Proportion of Nicotine Dependence by Depression status\nfor each Gender" , y ="Proportion of Nicotine Dependence" , fill ="Nicotine Dependence" )p <- p +scale_y_continuous(limits =c(0, 1))p <- p +facet_wrap(~ Sex)p <- p +theme(legend.position ="bottom")print(p)
6.7.4 Interpretation of conditional proportions
For Females without depression, the proportion who are nicotine dependent is 0.076, while for those with depression it is 0.206 (nearly 3 times as much).
For Males without depression, the proportion who are nicotine dependent is 0.108, while for those with depression it is 0.269 (nearly 3 times as much).
Males generally are more nicotine dependent than Females, but both are nearly 3 times more likely to be nicotine dependent if they are depressed compared to if they are not depressed.
6.8 Class 12-1, Parameter estimation (one-sample) (separate worksheet)
Note
Find this assignment on the course website.
6.9 Class 12-2, Inference and Parameter estimation (one-sample)
Rubric
Using a numerical variable, calculate and interpret a confidence interval for the population mean.
(1 p) Identify and describe the variable,
(1 p) use t.test() to calculate the mean and confidence interval, and
(1 p) interpret the confidence interval.
(2 p) Using plotting code from the last two classes, plot the data, estimate, and confidence interval in a single well-labeled plot.
Using a two-level categorical variable, calculate and interpret a confidence interval for the population proportion.
(1 p) Identify and describe the variable,
(1 p) use binom.test() to calculate the sample proportion and confidence interval, and
(1 p) interpret the confidence interval.
(2 p) Using plotting code from the last two classes, plot the data, estimate, and confidence interval in a single well-labeled plot.
6.9.1 Numeric variable confidence interval for mean \(\mu\)
TotalCigsSmoked_log2 is the estimated number of cigarettes per month a subject smokes (DaysSmoke * DailyCigsSmoked) on the log2 scale.
# calculate mean and CIt_result <-t.test( nesarc_sub$TotalCigsSmoked_log2 )t_result
One Sample t-test
data: nesarc_sub$TotalCigsSmoked_log2
t = 457.94, df = 17715, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
7.854201 7.921726
sample estimates:
mean of x
7.887963
Interpretation:
The mean of TotalCigsSmoked_log2 is 7.89 and the 95% confidence interval is (7.85, 7.92).
In 95% of samples, a confidence interval constructed from the data will contain the true population mean. In our case, our CI was (7.85, 7.92). It either contains the true population mean or it doesn’t, but we don’t know.
# plot data and create tablelibrary(ggplot2)p <-ggplot(nesarc_sub, aes(x = TotalCigsSmoked_log2))p <- p +theme_bw()p <- p +geom_histogram(binwidth =0.5, alpha =1)#p <- p + geom_rug(alpha = 1/8)# estp <- p +geom_vline(aes(xintercept =as.numeric(t_result$estimate)), colour ="blue", size =1, linetype =3)p <- p +geom_rect(aes(xmin = t_result$conf.int[1] , xmax = t_result$conf.int[2] , ymin =-150 , ymax =0) , fill ="blue", alpha =1)p <- p +labs(title ="Log2 Total Cigs Smoked\nblue = est +- 95% CI")print(p)
6.9.2 Categorical variable confidence interval for proportion \(p\)
Depression is a Yes/No variable indicating that the person has major depression (lifetime).
# proportions are of the last group_by() variable within combinations of the other variablestab_dep_long <- nesarc_sub %>%group_by(Depression) %>%summarise(Frequency =n()) %>%mutate(Proportion = Frequency /sum(Frequency)) %>%ungroup()tab_dep_long
# A tibble: 2 × 3
Depression Frequency Proportion
<fct> <int> <dbl>
1 No Depression 35254 0.818
2 Yes Depression 7839 0.182
# prop.test() is an asymptotic (approximate) test for a binomial random variablep_summary <-prop.test(x = tab_dep_long %>%filter(Depression =="Yes Depression") %>%pull(Frequency) , n =sum(tab_dep_long$Frequency)#, conf.level = 0.95 )p_summary
1-sample proportions test with continuity correction
data: tab_dep_long %>% filter(Depression == "Yes Depression") %>% pull(Frequency) out of sum(tab_dep_long$Frequency), null probability 0.5
X-squared = 17440, df = 1, p-value < 2.2e-16
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
0.1782835 0.1855912
sample estimates:
p
0.1819089
Interpretation:
The proportion of people who have Depression is 0.182 and the 95% confidence interval is (0.178, 0.186).
In 95% of samples, a confidence interval constructed from the data will contain the true population proportion In our case, our CI was (0.178, 0.186). It either contains the true population proportion or it doesn’t, but we don’t know.
library(ggplot2)p <-ggplot(data = tab_dep_long %>%filter(Depression =="Yes Depression"), aes(x = Depression, y = Proportion))p <- p +theme_bw()p <- p +geom_hline(yintercept =c(0, 1), alpha =1/4)#p <- p + geom_bar(stat = "identity")p <- p +geom_errorbar(aes(min = p_summary$conf.int[1], max = p_summary$conf.int[2]), width=0.25)p <- p +geom_point(colour ="red", size =6, pch =18)p <- p +scale_y_continuous(limits =c(0.17, 0.19))# swap axes to take less vertical space#p <- p + coord_flip()print(p)
6.10 Class 13, Hypothesis testing (one- and two-sample) (separate worksheet)
Note
Find this assignment on the course website.
6.11 Class 14, Paired data, assumption assessment (separate worksheet)
Note
Find this assignment on the course website.
6.12 Class 15, Hypothesis testing (one- and two-sample)
6.12.1 Mechanics of a hypothesis test (review)
Set up the null and alternative hypotheses in words and notation.
In words: ``The population mean for [what is being studied] is different from [value of \(\mu_0\)].’’ (Note that the statement in words is in terms of the alternative hypothesis.)
In notation: \(H_0: \mu=\mu_0\) versus \(H_A: \mu \ne \mu_0\) (where \(\mu_0\) is specified by the context of the problem).
Choose the significance level of the test, such as \(\alpha=0.05\).
Compute the test statistic, such as \(t_{s} = \frac{\bar{Y}-\mu_0}{SE_{\bar{Y}}}\), where \(SE_{\bar{Y}}=s/\sqrt{n}\) is the standard error.
Determine the tail(s) of the sampling distribution where the \(p\)-value from the test statistic will be calculated (for example, both tails, right tail, or left tail). (Historically, we would compare the observed test statistic, \(t_{s}\), with the critical value\(t_{\textrm{crit}}=t_{\alpha/2}\) in the direction of the alternative hypothesis from the \(t\)-distribution table with degrees of freedom \(df = n-1\).)
State the conclusion in terms of the problem.
Reject \(H_0\) in favor of \(H_A\) if \(p\textrm{-value} < \alpha\).
Fail to reject \(H_0\) if \(p\textrm{-value} \ge \alpha\). (Note: We DO NOT accept\(H_0\).)
Check assumptions of the test (for now we skip this).
Recent calls have been made to abandon the term “statistical significance”. The American Statistical Association (ASA) issued its statement and recommendation on p-values (see the special issue of p-values for more).
In summary, the problem of “significance” is one of misuse, misunderstanding, and misinterpretation. The recommendation in this class is that it is no longer sufficient to say that a result is “statistically significant” or “non-significant” depending on whether a p-value is less than a threshold. Instead, we will be looking for wording as in the following paragraph.
“The difference between the two groups turns out to be small (8%), while the probability (\(p\)) of observing a result at least as extreme as this under the null hypothesis of no difference between the two groups is \(p = 0.003\) (that is, 0.3%). This p-value is statistically significant as it is below our pre-defined threshold (\(p < 0.05\)). However, the p-value tells us only that the 8% difference between the two groups is somewhat unlikely given our hypothesis and null model’s assumptions. More research is required, or other considerations may be needed, to conclude that the difference is of practical importance and reproducible.”
6.12.3 Two-sample \(t\)-test
Rubric
Using a numerical response variable and a two-level categorical variable (or a categorical variable you can reduce to two levels), specify a two-sample \(t\)-test associated with your research questions.
(2 p) Specify the hypotheses in words and notation (either one- or two-sided test),
(0 p) use t.test() to calculate the mean, test statistic, and p-value,
(2 p) state the significance level, test statistic, and p-value, and
(2 p) state the conclusion in the context of the problem.
(1 p) Given your conclusion, could you have committed at Type-I or Type-II error?
(2 p) Provide an appropriate plot of the data and sample estimates in a well-labeled plot.
(1 p) Check the assumptions of the test using the bootstrap and interpret the bootstrap sampling distribution plot.
Set up the null and alternative hypotheses in words and notation.
In words: “The population mean total cigarettes smoked (TotalCigsSmoked) is greater for people with depression than for those without depression (Depression).”
In notation: \(H_0: \mu_D = \mu_N\) versus \(H_A: \mu_D > \mu_N\).
t_summary_TCS_D <-t.test( TotalCigsSmoked ~ Depression , data = nesarc_sub , alternative ="less" )t_summary_TCS_D
Welch Two Sample t-test
data: TotalCigsSmoked by Depression
t = -17.226, df = 10439, p-value < 2.2e-16
alternative hypothesis: true difference in means between group No Depression and group Yes Depression is less than 0
95 percent confidence interval:
-Inf -71.94239
sample estimates:
mean in group No Depression mean in group Yes Depression
173.0296 252.5677
Choose the significance level of the test, such as \(\alpha=0.05\).
Compute the test statistic, such as \(t_{s} = -17.2\).
\(p\)-value is \(p = 6.85\times 10^{-66}\).
State the conclusion in terms of the problem.
Reject \(H_0\) in favor of \(H_A\) concluding that the population mean total cigarettes smoked (TotalCigsSmoked) is greater for people with depression than for those without depression (Depression).
Because we rejected the Null hypothesis, we could have made a Type I error.
## If we create a summary data.frame with a similar structure as our data, then we## can annotate our plot with those summaries.est_mean_TCS_D <- nesarc_sub %>%group_by(Depression) %>%summarise(TotalCigsSmoked =mean(TotalCigsSmoked, na.rm =TRUE)) %>%ungroup()est_mean_TCS_D
# A tibble: 2 × 2
Depression TotalCigsSmoked
<fct> <dbl>
1 No Depression 173.
2 Yes Depression 253.
library(ggplot2)p <-ggplot(nesarc_sub, aes(x = Depression, y = TotalCigsSmoked))p <- p +theme_bw()p <- p +geom_boxplot(width =0.5, alpha =0.5)p <- p +geom_jitter(position =position_jitter(width =0.1), alpha =1/4)# diamond at mean for each groupp <- p +stat_summary(fun = mean, geom ="point", shape =18, size =6,colour ="red", alpha =0.8)p <- p +labs(title ="Total Cigarettes Smoked by Depression" , x ="Depression status" , y ="Total cigarettes smoked per month" )print(p)
# histogram using ggplotp <-ggplot(nesarc_sub, aes(x = TotalCigsSmoked))p <- p +theme_bw()p <- p +geom_histogram(binwidth =40)#p <- p + geom_rug()p <- p +geom_vline(data = est_mean_TCS_D, aes(xintercept = TotalCigsSmoked), colour ="red")p <- p +facet_grid(Depression ~ .)p <- p +labs(title ="Total Cigarettes Smoked by Depression" , x ="Depression status" , y ="Total cigarettes smoked per month" )print(p)
The bootstrap sampling distribution of the difference in means is close to normal, so the model assumptions are met and we can rely on the decision we made in the hypothesis test above.
6.13 Class 16, ANOVA, Pairwise comparisons (separate worksheet)
Note
Find this assignment on the course website.
6.14 Class 17, ANOVA and Assessing Assumptions
Rubric
Using a numerical response variable and a categorical variable with three to five levels (or a categorical variable you can reduce to three to five levels), specify an ANOVA hypothesis associated with your research questions.
(1 p) Specify the ANOVA hypotheses in words and notation,
(1 p) plot the data in a way that is consistent with hypothesis test (comparing means, assess equal variance assumption),
(1 p) use aov() to calculate the hypothesis test statistic and p-value,
(1 p) state the significance level, test statistic, and p-value,
(1 p) state the conclusion in the context of the problem,
(2 p) assess the normality assumption of the residuals using appropriate methods (QQ-plot and Anderson-Darling test), and
(1 p) assess the assumption of equal variance between your groups using an appropriate test (also mention standard deviations of each group).
(2 p) If you rejected the ANOVA null hypothesis, perform follow-up pairwise comparisons using Tukey’s HSD to indicate which groups have statistically different means and summarize the results.
6.14.1 Hypothesis and plot
Let \(\mu_j\) = pop mean number of the total cigarettes smoked for the five ethnicities identified in the dataset: Caucasian, African American, Native American, Asian, and Hispanic, numbered \((j = 1, 2, 3, 4, 5)\), respectively. We wish to test \(H_0: \mu_1 = \cdots = \mu_5\) against \(H_A: \textrm{not } H_0\) (at least one pair of means differ).
A table of the ethnicities to compare is below.
summary(nesarc_sub$Ethnicity)
Cauc AfAm NaAm Asia Hisp
24507 8245 701 1332 8308
Plot the data in a way that compares the means. Error bars are 95% confidence intervals of the mean.
# Plot the data using ggplotlibrary(ggplot2)p <-ggplot(nesarc_sub, aes(x = Ethnicity, y = TotalCigsSmoked))p <- p +theme_bw()# plot a reference line for the global mean (assuming no groups)p <- p +geom_hline(yintercept =mean(nesarc_sub$TotalCigsSmoked),colour ="black", linetype ="dashed", size =0.3, alpha =0.5)# boxplot, size=.75 to stand out behind CIp <- p +geom_violin(width =0.5, alpha =0.25)p <- p +geom_boxplot(width =0.25, alpha =0.25)# points for observed datap <- p +geom_point(position =position_jitter(w =0.05, h =0), alpha =0.2)# diamond at mean for each groupp <- p +stat_summary(fun = mean, geom ="point", shape =18, size =4,colour ="red", alpha =0.8)# confidence limits based on normal distributionp <- p +stat_summary(fun.data ="mean_cl_normal", geom ="errorbar",width = .2, colour ="red", alpha =0.8)p <- p +labs(title ="Total cigarettes smoked by Ethnicity")p <- p +ylab("Total Cigs Smoked")print(p)
6.14.2 (NOT USED) Transform the response variable to satisfy assumptions
6.14.3 ANOVA Hypothesis test
Hypothesis test
Set up the null and alternative hypotheses in words and notation.
In words: “The population mean total cigarettes smoked is different between ethnicities.”
In notation: \(H_0: \mu_1 = \cdots = \mu_5\) versus \(H_A: \textrm{not } H_0\) (at least one pair of means differ).
Let the significance level of the test be \(\alpha = 0.05\).
Compute the test statistic.
aov_summary <-aov( TotalCigsSmoked ~ Ethnicity , data = nesarc_sub )summary(aov_summary)
Df Sum Sq Mean Sq F value Pr(>F)
Ethnicity 4 2.106e+08 52648664 496.4 <2e-16 ***
Residuals 42791 4.538e+09 106050
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
297 observations deleted due to missingness
The \(F\)-statistic for the ANOVA is \(F = 496\).
Compute the \(p\)-value from the test statistic.
The p-value for testing the null hypothesis is \(p = 0\).
State the conclusion in terms of the problem.
Because \(p = 0 < 0.05\), we reject \(H_0\) in favor of \(H_1\) concluding that the mean total cigarettes smoked is different between ethnicities.
6.14.4 Check assumptions
Check assumptions of the test.
Residuals are normal
Populations have equal variances.
Check whether residuals are normal.
Plot the residuals and assess whether they appear normal.
# Plot the data using ggplotdf_res <-data.frame(res = aov_summary$residuals)library(ggplot2)p <-ggplot(df_res, aes(x = res))p <- p +theme_bw()p <- p +geom_histogram(aes(y = ..density..), binwidth =50)p <- p +geom_density(colour ="blue", adjust =5)#p <- p + geom_rug()p <- p +stat_function(fun = dnorm, colour ="red", args =list(mean =mean(df_res$res), sd =sd(df_res$res)))p <- p +labs(title ="ANOVA Residuals" , caption ="Blue = Kernal density curve\nRed = Normal distribution")print(p)
The plot of the residuals is very right skewed (not normal).
# QQ plotpar(mfrow=c(1,1))#library(car)car::qqPlot( aov_summary$residuals , las =1 , id =list(n =4, cex =1) , lwd =1 , main="QQ Plot" )
38173 279 1034 2883
37909 276 1026 2863
The QQ-plot of the residuals versus the normal quantiles is very right skewed (U-shaped, not normal).
A formal test of normality on the residuals tests the hypothesis \(H_0:\) The distribution is Normal vs \(H_1:\) The distribution is not Normal. We can test the distribution of the residuals.
# A tibble: 5 × 4
Ethnicity m s n
<fct> <dbl> <dbl> <int>
1 Cauc 244. 376. 24507
2 AfAm 127. 250. 8245
3 NaAm 293. 420. 701
4 Asia 76.9 194. 1332
5 Hisp 89.7 226. 8308
The standard deviations appear different between groups, in particular, the Asian group has less than half the standard deviation of the Native American group.
Formal tests for equal variances. We can test whether the variances are equal between our three groups. This is similar to the ANOVA hypothesis, but instead of testing means we’re tesing variances. \(H_0: \sigma^2_1 = \cdots = \sigma^2_5\) versus \(H_A: \textrm{not } H_0\) (at least one pair of variances differ).
## Test equal variance# assumes populations are normalbartlett.test(TotalCigsSmoked ~ Ethnicity, data = nesarc_sub)
Bartlett test of homogeneity of variances
data: TotalCigsSmoked by Ethnicity
Bartlett's K-squared = 4464.9, df = 4, p-value < 2.2e-16
# does not assume normality, requires car package#library(car)car::leveneTest(TotalCigsSmoked ~ Ethnicity, data = nesarc_sub)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 4 498.24 < 2.2e-16 ***
42791
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# nonparametric testfligner.test(TotalCigsSmoked ~ Ethnicity, data = nesarc_sub)
Fligner-Killeen test of homogeneity of variances
data: TotalCigsSmoked by Ethnicity
Fligner-Killeen:med chi-squared = 2160.6, df = 4, p-value < 2.2e-16
Because the data were not normal, we interpret the Levene test. The p-value is less than 0.05, therefore we reject \(H_0\) of equal variances in favor of \(H_A\) that the variances are not equal.
6.14.5 Post Hoc pairwise comparison tests
EMM plot interpretation
This EMM plot (Estimated Marginal Means, aka Least-Squares Means) is only available when conditioning on one variable. The blue bars are confidence intervals for the EMMs; don’t ever use confidence intervals for EMMs to perform comparisons – they can be very misleading. The red arrows are for the comparisons among means; the degree to which the “comparison arrows” overlap reflects as much as possible the significance of the comparison of the two estimates. If an arrow from one mean overlaps an arrow from another group, the difference is not significant, based on the adjust setting (which defaults to “tukey”).
## Contrastsadjust_method <-c("none", "tukey", "bonferroni")[2]library(emmeans)emm_cont <- emmeans::emmeans( aov_summary , specs ="Ethnicity" )# means and CIsemm_cont %>%print()
All pairs of Ethnicities differ except Asian and Hispanic.
Cauc - AfAm: sig diff
Cauc - NaAm: sig diff
Cauc - Asia: sig diff
Cauc - Hisp: sig diff
AfAm - NaAm: sig diff
AfAm - Asia: sig diff
AfAm - Hisp: sig diff
NaAm - Asia: sig diff
NaAm - Hisp: sig diff
Asia - Hisp: NOT sig diff
dat_EthCig_summary %>%arrange(m)
# A tibble: 5 × 4
Ethnicity m s n
<fct> <dbl> <dbl> <int>
1 Asia 76.9 194. 1332
2 Hisp 89.7 226. 8308
3 AfAm 127. 250. 8245
4 Cauc 244. 376. 24507
5 NaAm 293. 420. 701
Summarize results by ordering the means and grouping pairs that do not differ.
Ethnicity m s n
<fct> <dbl> <dbl> <int>
| 1 Asia 76.9 194. 1332
| 2 Hisp 89.7 226. 8308
| 3 AfAm 127. 250. 8245
| 4 Cauc 244. 376. 24507
| 5 NaAm 293. 420. 701
6.15 Class 18, Nonparametric methods (separate worksheet)
Note
Find this assignment on the course website.
6.16 Class 19, Binomial and Multinomial tests (separate worksheet)
Note
Find this assignment on the course website.
6.17 Class 20-1, Two-way categorical tables (separate worksheet)
Note
Find this assignment on the course website.
6.18 Class 20-2, Simple linear regression (separate worksheet)
Note
Find this assignment on the course website.
6.19 Class 21, Two-way categorical and simple linear regression
6.19.1 Two-way categorical analysis
Two-way categorical analysis.
Using two categorical variables with two to five levels each, specify a hypothesis test for homogeneity of proportions associated with your research questions.
6.19.1.1 (1 p) Specify the hypotheses in words and notation.
In words: “There is an association between NicotineDependence and Ethnicity.”
In notation: \(H_0: p(i \textrm{ and } j) = p(i)p(j)\) for all row categories \(i\) and column categories \(j\), versus \(H_A: p(i \textrm{ and } j) \ne p(i)p(j)\) for at least one row category \(i\) and column category \(j\).
6.19.1.2 (1 p) State the conclusion of the test in the context of the problem.
# Tabulate by two categorical variables:tab_E_TD <-xtabs(~ Ethnicity + NicotineDependence , data = nesarc_sub )tab_E_TD
NicotineDependence
Ethnicity Nicotine Dependence No Nicotine Dependence
Cauc 3337 21170
AfAm 836 7409
NaAm 158 543
Asia 95 1237
Hisp 536 7772
Because the p-value \(= 8.84\times 10^{-94} < 0.05\) we reject the null hypothesis concluding that there is an association between NicotineDependence and Ethnicity.
# table of expected frequencies:chisq_E_TD$expected
NicotineDependence
Ethnicity Nicotine Dependence No Nicotine Dependence
Cauc 2821.89066 21685.1093
AfAm 949.38134 7295.6187
NaAm 80.71756 620.2824
Asia 153.37489 1178.6251
Hisp 956.63556 7351.3644
The model assumptions are met since the expected count for each cell is at least 5.
6.19.1.2.1 Nicer version of contingency (chi-square) table for publication
## gtsummary tablestab_cross <- nesarc_sub %>% gtsummary::tbl_cross(row ="Ethnicity" , col ="NicotineDependence" , label =NULL , statistic =NULL , digits =NULL , percent =c("none", "column", "row", "cell")[3] , margin =c("column", "row") , missing =c("ifany", "always", "no")[1] , missing_text ="Unknown" , margin_text ="Total" ) %>% gtsummary::add_p() %>%# add p-values to the output comparing values across groups (tbl_cross options: test = c("chisq.test", "fisher.test")[2]) gtsummary::bold_labels() %>%# bold variable name labels gtsummary::italicize_levels() %>%# italicize levels gtsummary::modify_caption("Nicotine dependence by Ethnicity")tab_cross
Nicotine dependence by Ethnicity
NicotineDependence
Total
p-value1
Nicotine Dependence
No Nicotine Dependence
Ethnicity
<0.001
Cauc
3,337 (14%)
21,170 (86%)
24,507 (100%)
AfAm
836 (10%)
7,409 (90%)
8,245 (100%)
NaAm
158 (23%)
543 (77%)
701 (100%)
Asia
95 (7.1%)
1,237 (93%)
1,332 (100%)
Hisp
536 (6.5%)
7,772 (94%)
8,308 (100%)
Total
4,962 (12%)
38,131 (88%)
43,093 (100%)
1 Pearson's Chi-squared test
6.19.1.3 (1 p) Plot a mosaic plot of the data and Pearson residuals.
# The Pearson residualschisq_E_TD$residuals
NicotineDependence
Ethnicity Nicotine Dependence No Nicotine Dependence
Cauc 9.696820 -3.497990
AfAm -3.679775 1.327427
NaAm 8.601947 -3.103031
Asia -4.713559 1.700350
Hisp -13.599806 4.905937
# The sum of the squared residuals is the chi-squared statistic:#chisq_E_TD$residuals^2#sum(chisq_E_TD$residuals^2)
# mosaic plotlibrary(vcd)# this layout gives us the interpretation we want:mosaic( tab_E_TD , shade =TRUE , legend =TRUE , abbreviate_labs =4 , rot_labels =c(0, 90, 0, 0) , just_labels =c("center", "right") # order is opposite as in first argument , offset_varnames =c(0, 0, 0, 2) # offset the y-axis var name#, margins = c(3, 3, 3, 14) # order is: top, right, bottom, left , main ="Fraction of smokers with and without\n nicotine dependence by Ethnicity" , labeling_args =list(set_varnames =c(NicotineDependence ="Nicotine Dependence/Addiction Status" , Ethnicity ="Ethnicity") ) )
6.19.1.4 (1 p) Interpret the mosaic plot with reference to the Pearson residuals.
The primary cause of rejecting \(H_0\) is that Hispanics and Asians have less nicotine dependence than expected while Caucasians and (especially) Native Americans have higher nicotine dependence than expected.
6.19.2 Simple linear regression
Simple linear regression.
Select two numerical variables.
6.19.2.1 (1 p) Plot the data and, if required, transform the variables so a roughly linear relationship is observed. All interpretations will be done on this scale of the variables.
(See below.)
6.19.2.2 (0 p) Fit the simple linear regression model.
# fit model#lm_fit <- lm(TotalCigsSmoked_sqrt ~ DaysSmoke, data = nesarc_sub)lm_fit <-lm(TotalCigsSmoked_log2 ~ DaysSmoke_log2, data = nesarc_sub)coef(lm_fit)
(Intercept) DaysSmoke_log2
0.4585303 1.6635414
Note that prediction intervals should be bounded below at 0.
library(ggplot2)p <-ggplot(nesarc_sub, aes(x = DaysSmoke_log2, y = TotalCigsSmoked_log2))p <- p +theme_bw()p <- p +geom_vline(xintercept =0, alpha =0.25)p <- p +geom_hline(yintercept =0, alpha =0.25)# prediction bandsp <- p +geom_ribbon(aes(ymin =predict(lm_fit, data.frame(DaysSmoke_log2) , interval ="prediction", level =0.95)[, 2],ymax =predict(lm_fit, data.frame(DaysSmoke_log2) , interval ="prediction", level =0.95)[, 3],) , alpha=0.1, fill="darkgreen")# linear regression fit and confidence bandsp <- p +geom_smooth(method = lm, se =TRUE)# jitter a little to uncover duplicate pointsp <- p +geom_jitter(position =position_jitter(0.05), alpha =0.01)p <- p +labs(title ="Regression with confidence and prediction bands" , x ="log2 of Days smoke" , y ="log2 of Total cigarettes smoked" , caption ="Blue line is fitted regression line.\nGray band is the confidence band.\nGreen band is the prediction band.")print(p)
Residuals vs x: each group (based on x-variable) of values is roughly symmetric and the y=0 line passes through the center of most groups. The x=0 group has many observations at the lowest value which is why the x=0 (or regression) is below the distribution that is spread above it.
6.19.2.4 (1 p) Assess the residuals for normality (interpret QQ-plot and histogram).
QQ-Plot: On original scale, the pattern is non-normal with right-skewness in y-variable. After log2 transformation of both variables, we still do not have normality (but it is much better).
6.19.2.5 (1 p) Assess the relative influence of points.
Cook’s distance: several points with slightly higher influence, but none with extremely large influence.
Cook’s distance vs Leverage: Those points with high influence also have high leverage, thus, those points that are extremes in the x-direction (least frequent or most frequent smokers) have the highest influence.
6.19.2.6 (1 p) Test whether the slope is different from zero, \(H_A: \beta_1 \ne 0\).
Though the model assumptions are not perfectly met, we will continue to interpret the model that we have.
summary(lm_fit)
Call:
lm(formula = TotalCigsSmoked_log2 ~ DaysSmoke_log2, data = nesarc_sub)
Residuals:
Min 1Q Median 3Q Max
-3.7145 -0.4585 0.1924 0.6075 5.5859
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.45853 0.03447 13.3 <2e-16 ***
DaysSmoke_log2 1.66354 0.00746 223.0 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.175 on 17714 degrees of freedom
(25377 observations deleted due to missingness)
Multiple R-squared: 0.7373, Adjusted R-squared: 0.7373
F-statistic: 4.973e+04 on 1 and 17714 DF, p-value: < 2.2e-16
Because the p-value = \(0 < 0.05\), we reject \(H_0\) in favor of \(H_A\) concluding that the slope is different from 0.
Interpretation (extra): For every unit increase in log2 of DaysSmoke, we expect an increase of 1.66 in the log2 of TotalCigsSmoked. On the original scale, this increase is scary.
6.19.2.7 (1 p) Interpret the \(R^2\) value.
The regression model explains \(R^2 = 0.737\) proportion of the variation in the response above using the mean value alone.
6.19.2.7.1 Nicer version of linear model parameter estimate table for publication
tab_fit_lm <- lm_fit %>% gtsummary::tbl_regression(# label = NULL#, exponentiate = FALSE#, include = everything()#, show_single_row = NULLconf.level =0.95 , intercept =TRUE#, estimate_fun = NULL#, pvalue_fun = NULL#, tidy_fun = NULL#, add_estimate_to_reference_rows = FALSE , conf.int =TRUE ) %>%#gtsummary::add_q() %>% # add a column of q values to control for multiple comparisons gtsummary::bold_p(t =0.05, q =FALSE) %>%# Bold significant p-values or q-values gtsummary::add_global_p() %>%# adds the global p-value for a categorical variables, instead of difference from intercept gtsummary::add_glance_source_note() %>%# adds statistics from `broom::glance()` as source note#gtsummary::add_vif(statistic = c("VIF", "GVIF", "aGVIF", "df")[c(2, 4)]) %>% # adds column of the variance inflation factors (VIF) gtsummary::bold_labels() %>%# bold variable name labels gtsummary::italicize_levels() %>%# italicize levels gtsummary::modify_caption("Table title")tab_fit_lm
6.21.4 (1 p) Plot the \(\hat{p}\) values vs the \(x\)-variable and plot the empirical logits vs the \(x\)-variable.
6.21.4.1 Probability/proportion scale
Plots on the probability scale should follow a sigmoidal curve (a little difficult to see here). Note that the overlayed reference curve (red) is a weighted smoothed curve (loess), not the model fit.
library(ggplot2)p1 <-ggplot(nesarc_sub, aes(x = TotalCigsSmoked_smokers_log2, y = NicotineDependence01))p1 <- p1 +theme_bw()# datap1 <- p1 +geom_jitter(width =0.05, height =0.025, size =0.5, alpha =1/10, colour ="green")# summariesp1 <- p1 +geom_point(data = dat_Nic_Cigs_sum, aes(x = TotalCigsSmoked_smokers_log2, y = p_hat, size = Total))p1 <- p1 +geom_smooth(data = dat_Nic_Cigs_sum, aes(x = TotalCigsSmoked_smokers_log2, y = p_hat, weight = Total), se =FALSE, colour ="red") # just for reference# axesp1 <- p1 +scale_y_continuous(breaks =seq(0, 1, by =0.2))p1 <- p1 +expand_limits(y =c(0, 1))#p1 <- p1 + scale_x_continuous(breaks = seq(0, 100, by = 2))# labelsp1 <- p1 +labs(title ="Proportion of nicotine dependence by log2 Total cigarettes smoked" , subtitle="Observed data, smokers only" , x ="log2 Total cigarettes smoked" , y ="Nicotine dependence (0/1)\nObserved probability of nicotine dependence" , caption =paste("Green = Indicator points of nicotine dependence (1) or not (0)." , "Black = Observed proportions of nicotine dependence" , "Red = Smoothed curve to proportions" , sep ="\n"# separate by new lines ) )print(p1)
`geom_smooth()` using method = 'loess' and formula 'y ~ x'
On the logit scale, if points follow a straight line, then we can fit a simple logistic regression model. Note that the overlayed reference straight line (blue) is from weighted least squares (not the official fitted line), and the curve (red) is a weighted smoothed curve (loess). If these two lines are similar, that suggests a straight line on the logit scale is a good fit. Both lines are weighted by the total number of observations that each point represents, so that points representing few observations don’t contribute as much as points representing many observations, thus our decision should not be heavily influenced by random deviations where there is little data.
# plot on logit scalelibrary(ggplot2)p1 <-ggplot(dat_Nic_Cigs_sum, aes(x = TotalCigsSmoked_smokers_log2, y = emp_logit))p1 <- p1 +theme_bw()# summariesp1 <- p1 +geom_point(aes(size = Total))p1 <- p1 +stat_smooth(aes(weight = Total), method ="lm", se =FALSE, linetype =3) # just for referencep1 <- p1 +geom_smooth(aes(weight = Total), se =FALSE, colour ="red") # just for reference# axes#p1 <- p1 + scale_y_continuous(breaks = seq(-10, 10, by = 0.5))#p1 <- p1 + scale_x_continuous(breaks = seq(0, 100, by = 2))# labelsp1 <- p1 +labs(title ="Logit of nicotine dependence by log2 Total cigarettes smoked" , subtitle="Observed data, smokers only" , x ="log2 Total cigarettes smoked" , y ="Empirical logit of the probability of nicotine dependence" , caption =paste("Black = Observed logit proportions of nicotine dependence" , "Blue = Naive LM fit of logit proportions" , "Red = Loess smooth curve of empirical logits" , sep ="\n"# separate by new lines ) )print(p1)
`geom_smooth()` using formula 'y ~ x'
`geom_smooth()` using method = 'loess' and formula 'y ~ x'
6.21.5 (1 p) Describe the logit-vs-\(x\) plot. Is it linear? If not, consider a transformation of \(x\) to improve linearity; describe the transformation you chose if you needed one.
Log2 of Total cigarettes smoked improved the linear relationship. A straight line describes the data well from 0 to 10, but does not fit the data when x is greater than 10.
6.21.6 (1 p) Fit the glm() model and assess the deviance lack-of-fit test.
The simple logistic regression model expresses the population proportion \(p\) of individuals with a given attribute (called the probability of success) as a function of a single predictor variable \(X\). The model assumes that \(p\) is related to \(X\) through \[
\log \left( \frac{p}{1-p} \right) = \beta_0 + \beta_1 X
\] or, equivalently, as \[
p = \frac{ \exp( \beta_0 + \beta_1 X ) }{ 1 + \exp( \beta_0 + \beta_1 X ) }.
\] The logistic regression model is a binary response model, where the response for each case falls into one of two exclusive and exhaustive categories, success (cases with the attribute of interest) and failure (cases without the attribute of interest).
Fit the model.
# For our summarized data (with frequencies and totals for each age)# The left-hand side of our formula binds two columns together with cbind():# the columns are the number of "successes" and "failures".# For logistic regression with logit link we specify family = binomial,# where logit is the default link function for the binomial family.glm_n_c <-glm(cbind(Success, Total - Success) ~ TotalCigsSmoked_smokers_log2 , family = binomial , data = dat_Nic_Cigs_sum )
6.21.6.1 Deviance statistic for lack-of-fit
Unfortunately, there aren’t many model diagnostics for logistic regression. Visual inspection of the empirical logits is good when you can summarize the data as we’ve done.
One simple test for lack-of-fit uses the deviance statistic. Under the null hypothesis (that you’ll state below), the residual deviance follows a \(\chi^2\) distribution with the associated degrees-of-freedom.
Below, we calculate the deviance p-value and perform the test for lack-of-fit.
# Test residual deviance for lack-of-fit (if > 0.10, little-to-no lack-of-fit)glm_n_c$deviance
State the null and alternative hypotheses for lack-of-fit.
\(H_0:\) Model fits the data.
\(H_A:\) Model does not fit the data.
For your preferred model, the deviance statistic is
D = 310 with
101 df, giving
p-value = 0.
Because the p-value is less than 0.05, we reject \(H_0\) in favor of \(H_A\) concluding that the model does not fit the data. However, for this assignment, we will continue with interpretation.
6.21.7 (1 p) Calculate the confidence bands around the model fit/predictions. Plot on both the logit and \(\hat{p}\) scales.
The glm() statement creates an object which we can use to create the fitted probabilities and 95% CIs for the population proportions at the ages at first vaginal intercourse. The fitted probabilities and the limits are stored in columns labeled fit_p, fit_p_lower, and fit_p_upper, respectively.
Below I create confidence bands for the model and plot the fit against the data, first on the logit scale to assess model fit and then on the probability scale for interpretation.
# predict() uses all the Load values in dataset, including appended valuesfit_logit_pred <-predict( glm_n_c#, newdata = data.frame(TotalCigsSmoked_smokers_log2 = dat_Nic_Cigs_sum$TotalCigsSmoked_smokers_log2) , type ="link" , se.fit =TRUE ) %>%as_tibble()# put the fitted values in the data.framedat_Nic_Cigs_sum <- dat_Nic_Cigs_sum %>%mutate(# logit scale valuesfit_logit = fit_logit_pred$fit , fit_logit_se = fit_logit_pred$se.fit , fit_logit_lower = fit_logit -1.96* fit_logit_se , fit_logit_upper = fit_logit +1.96* fit_logit_se# proportion scale values , fit_p =exp(fit_logit) / (1+exp(fit_logit)) , fit_p_lower =exp(fit_logit_lower) / (1+exp(fit_logit_lower)) , fit_p_upper =exp(fit_logit_upper) / (1+exp(fit_logit_upper)) )
6.21.7.1 Logit scale
# plot on logit scalelibrary(ggplot2)p1 <-ggplot(dat_Nic_Cigs_sum, aes(x = TotalCigsSmoked_smokers_log2, y = emp_logit))p1 <- p1 +theme_bw()# MODEL FIT# predicted curve and point-wise 95% CIp1 <- p1 +geom_ribbon(aes(x = TotalCigsSmoked_smokers_log2, ymin = fit_logit_lower, ymax = fit_logit_upper), fill ="blue", linetype =1, alpha =0.2)p1 <- p1 +geom_line(aes(x = TotalCigsSmoked_smokers_log2, y = fit_logit), colour ="blue", size =1)# fitted valuesp1 <- p1 +geom_point(aes(y = fit_logit), colour ="blue", size =2)# summariesp1 <- p1 +geom_point(aes(size = Total))# axes#p1 <- p1 + scale_y_continuous(breaks = seq(-10, 10, by = 0.5))#p1 <- p1 + scale_x_continuous(breaks = seq(0, 100, by = 2))# labelsp1 <- p1 +labs(title ="Logit of nicotine dependence by log2 Total cigarettes smoked" , subtitle="Logistic model fit" , x ="log2 Total cigarettes smoked" , y ="Logit scale of the probability of nicotine dependence" , caption =paste("Black = Observed logit proportions of nicotine dependence given log2 Total cigarettes smoked" , "Blue = Logistic model fitted logit proportions" , sep ="\n"# separate by new lines ) )print(p1)
6.21.7.2 Probability/proportion scale
# plot on probability scalelibrary(ggplot2)p1 <-ggplot(nesarc_sub, aes(x = TotalCigsSmoked_smokers_log2, y = NicotineDependence01))p1 <- p1 +theme_bw()# datap1 <- p1 +geom_jitter(width =0.05, height =0.025, size =0.5, alpha =1/10, colour ="green")# summariesp1 <- p1 +geom_point(data = dat_Nic_Cigs_sum, aes(x = TotalCigsSmoked_smokers_log2, y = p_hat, size = Total))# MODEL FIT# predicted curve and point-wise 95% CIp1 <- p1 +geom_ribbon(data = dat_Nic_Cigs_sum, aes(x = TotalCigsSmoked_smokers_log2, y = fit_p, ymin = fit_p_lower, ymax = fit_p_upper), fill ="blue", linetype =1, alpha =0.2)p1 <- p1 +geom_line(data = dat_Nic_Cigs_sum, aes(x = TotalCigsSmoked_smokers_log2, y = fit_p), colour ="blue", size =1)# fitted valuesp1 <- p1 +geom_point(data = dat_Nic_Cigs_sum, aes(y = fit_p), colour ="blue", size =2)# axesp1 <- p1 +scale_y_continuous(breaks =seq(0, 1, by =0.2))p1 <- p1 +expand_limits(y =c(0, 1))#p1 <- p1 + scale_x_continuous(breaks = seq(0, 100, by = 2))# labelsp1 <- p1 +labs(title ="Proportion of nicotine dependence by log2 Total cigarettes smoked" , subtitle="Logistic model fit" , x ="log2 Total cigarettes smoked" , y ="At least one pregnancy (0/1)\nObserved probability of nicotine dependence" , caption =paste("Green = Indicator points of nicotine dependence (1) or not (0)." , "Black = Observed proportions of nicotine dependence given log2 Total cigarettes smoked" , "Blue = Logistic model fitted proportions" , sep ="\n"# separate by new lines ) )print(p1)
6.21.8 (1 p) Interpret the sign (\(+\) or \(-\)) of the slope parameter and test whether the slope is different from zero, \(H_A: \beta_1 \ne 0\).
The summary table gives MLEs and standard errors for the regression parameters. The z-value column is the parameter estimate divided by its standard error. The p-values are used to test whether the corresponding parameters of the logistic model are zero.
summary(glm_n_c)
Call:
glm(formula = cbind(Success, Total - Success) ~ TotalCigsSmoked_smokers_log2,
family = binomial, data = dat_Nic_Cigs_sum)
Deviance Residuals:
Min 1Q Median 3Q Max
-6.9776 -1.1004 -0.6275 0.4655 3.8816
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.94584 0.08730 -33.74 <2e-16 ***
TotalCigsSmoked_smokers_log2 0.23882 0.01014 23.55 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1020.01 on 102 degrees of freedom
Residual deviance: 310.04 on 101 degrees of freedom
AIC: 607.54
Number of Fisher Scoring iterations: 4
# see names(summary(glm_n_c)) to find the object that has the coefficients.# can also use coef(glm_n_c)
Interpret the sign (\(+\) or \(-\)) of the slope.
Because our slope is positive, as (x) log2 Total cigarettes smoked increases, the probability of “success” (y) of nicotine dependence increases.
Hypothesis test
\(H_0: \beta_1 = 0\)
\(H_A: \beta_1 \ne 0\)
The p-value = \(1.41\times 10^{-122} < 0.05\), therefore we reject \(H_0\) in favor of \(H_A\) concluding that the slope is different from 0.
6.21.8.0.1 Nicer version of logistic linear model parameter estimate table for publication
# Logit scale uses exponentiate = FALSEtab_fit_glm <- glm_n_c %>% gtsummary::tbl_regression(# label = NULLexponentiate =FALSE## do not exponentiate gives logit scale#, include = everything()#, show_single_row = NULL , conf.level =0.95 , intercept =TRUE#, estimate_fun = NULL#, pvalue_fun = NULL#, tidy_fun = NULL#, add_estimate_to_reference_rows = FALSE , conf.int =TRUE ) %>%#gtsummary::add_q() %>% # add a column of q values to control for multiple comparisons gtsummary::bold_p(t =0.05, q =FALSE) %>%# Bold significant p-values or q-values gtsummary::add_global_p() %>%# adds the global p-value for a categorical variables, instead of difference from intercept gtsummary::add_glance_source_note() %>%# adds statistics from `broom::glance()` as source note#gtsummary::add_vif(statistic = c("VIF", "GVIF", "aGVIF", "df")[c(2, 4)]) %>% # adds column of the variance inflation factors (VIF) gtsummary::bold_labels() %>%# bold variable name labels gtsummary::italicize_levels() %>%# italicize levels gtsummary::modify_caption("Logit scale results")tab_fit_glm
7.1.2.1 Class Literature review (detailed example)
A slightly more detailed literature review would typically be completed for a research project. Here I show what that might look like. However, in our class I’m not expecting so much detail. See the bullet points that I have used in the “Class 26 poster preparation” section.
Dataset: National Epidemiologic Survey on Alcohol and Related Conditions (NESARC), with codebook NESARC_W1_CodeBook.pdf.
Research question:
Is nicotine dependence [S3AQ10D] associated with smoking frequency [S3AQ3B1] and quantity [S3AQ3C1]?
Google scholar search: “nicotine dependence smoking frequency”
Citation: pdf file available for Dierker et al. (2007)
Interesting points: Figures 2 and 3, quantity and frequency both positively related to probability of dependence.
Others: Kandel and Chen (2000) and Caraballo, Novak, and Asman (2009)
You don’t need to include images in your literature review. I’m providing these figures and tables to illustrate what they look like.
Is nicotine dependence [S3AQ10D] associated with depression [S4AQ1]?
Google scholar search: “nicotine dependence depression”
Citation: pdf file available for Naomi Breslau (1995)
Interesting points: Table 2, Smoking and Nicotine Dependence both associated with Education. Table 3, Major depression associated with being nicotine dependent and Sex.
Is the associated between nicotine dependence [S3AQ10D] and depression [S4AQ1] different by demographics, such as Education or Sex?
In Question 2 we see differences by Education and Sex.
I have decided to further focus my question by examining whether the association between nicotine dependence and depression differs based on how much a person smokes. I am wondering if at low levels of smoking compared to high levels, nicotine dependence is more common among individuals with major depression than those without major depression. I add relevant depression questions/items/variables to my personal codebook as well as several demographic measures (age, gender, ethnicity, education, etc.) and any other variables I may wish to consider.
All required variables have been found and added to my personal codebook (by expanding Class 03).
Citations are at the bottom of this document in the “References” section.
7.1.3 Class 26, Poster Preparation, complete content
Rubric
Organize the content of your poster.
Complete the content for each of these sections:
Title: Smoking behavior is (barely) associated with major depression in young adults
(Class 25) (3 p) Introduction
(Lit Review) 2-4 bullets describing the study, previous research.
Smoking behavior is associated with major depression.
Heavy smokers may not have nicotine dependence, conversely light smokers may have nicotine dependence (Kandel and Chen 2000).
(Class 24) (2 p) Research Questions
(Class 02, Personal Codebook) 2 bullets, one for each research question, stated as the alternative hypothesis.
The goals of the analysis include answering these two questions:
Is there an association between major depression and nicotine dependence?
Does the prevalence of nicotine dependence differ by ethnicity?
(Class 24) (2 p) Methods
(Class 02, Personal Codebook) Data source(s).
A sample from the first wave of the National Epidemiologic Survey on Alcohol and Related Conditions (NESARC) (US adults). We focus on young adult (18–25) smokers (43093 of 43093 respondants).
In the Introduction to the National Epidemiologic Survey on Alcohol and Related Conditions they describe the sampling design. Adult respondants were sampled from a subset of households included in the US Census 2000 Supplementary Survey. To improve information about smaller groups, Blacks and Hispanics were oversampled by about 50%, as were young adults ages 18-24 by 125%. Using the US Census as a sampling frame helps to provide a representative sampling of the country. Oversampling smaller subsets of the population means that comparisons involving those subsets will be more reliable.
Total cigarettes per month (TotalCigsSmoked = days smoked times daily cigs smoked, range: 0–NA), a combination of these questions: “About how often did you usually smoke in the past year?” and “On the days that you smoked in the last year, about how many cigarettes did you usually smoke?”
Nicotine dependence (NicotineDependence, 0=No, 1=Yes) in the last 12 months.
(Various) Statistical methods used to answer the research questions.
A two-sample \(t\)-test comparing the mean total cigarettes smoked by depression status.
A \(\chi^2\) analysis of a two-way contingency table of nicotine dependence by ethnicity.
(Class 24) (3 p) Results for your first research question.
Plot and describe the data, as well as the statistical model. This can often be done in a single plot. Examples:
ANOVA: A mean with CI bars is the statistical model overlayed on the data points.
Contingency table: A mosaic plot with colored boxes relative to contribution to Pearson \(\chi^2\) shows the data with evidence towards the alternative hypothesis.
Simple linear regression: A regression line is the statistical model overlayed on the data points.
Logistic regression: The logistic curve is the statistical model overlayed on the top/bottom histograms of the data.
State the conclusion of the hypothesis test and interpret it in the context of the research question.
“Is the population mean square-root total cigarettes smoked different for those with depression or not?”
\(H_0: \mu_{D} = \mu_{ND}\) versus \(H_A: \mu_{D} \ne \mu_{ND}\)
data: TotalCigsSmoked_sqrt by Depression t = -19.317, df = 10794, p-value < 2.2e-16 alternative hypothesis: true difference in means between group No Depression and group Yes Depression is not equal to 0 95 percent confidence interval: -3.176909 -2.591550 sample estimates: mean in group No Depression mean in group Yes Depression 7.360964 10.245194
Welch Two Sample t-test
data: TotalCigsSmoked by Depression t = -17.226, df = 10439, p-value < 2.2e-16 alternative hypothesis: true difference in means between group No Depression and group Yes Depression is not equal to 0 95 percent confidence interval: -88.58909 -70.48702 sample estimates: mean in group No Depression mean in group Yes Depression 173.0296 252.5677
Model assumptions are met, the the sampling distribution of the difference in means is normal.
Because \(p=9.24\times 10^{-82} > 0.05\) (with \(t_{s} = -19.32\)), we have insufficient evidence to reject \(H_0\) at an \(\alpha=0.05\) significance level, concluding that the total cigarettes smoked does not differ by depression status.
The difference is very small, on the square-root scale it is 2.88 and on the natural scale it is 54.1837915, 104.9639984, a difference of 51 cigarettes each year.
(Class 24) (3 p) Results for your second research question.
Plot and describe the data, as well as the statistical model. This can often be done in a single plot.
State the conclusion of the hypothesis test and interpret it in the context of the research question.
Is there an association between NicotineDependence and Ethnicity?
The model assumptions are met since the expected count for each cell is at least 5.
Because the p-value \(= 8.84\times 10^{-94} < 0.05\) (\(X^2 = 439\)) we reject the null hypothesis concluding that there is an association between NicotineDependence and Ethnicity.
The primary cause of rejecting \(H_0\) is that Hispanics have less nicotine dependence than expected by chance.
(Class 25) (4 p) Discussion
Put the results you found for each research question in the context you provided in your introduction.
These results support the previous literature that individuals with major depression are more sensitive to the development of nicotine dependence regardless of how much they smoke. Thus, people with depression are an important population subgroup for targeted smoking intervention programs.
(Class 25) (1 p) Further directions or Future work or Next steps or something else that indicates there more to do and you’ve thought about it.
What do these results lead you to want to investigate?
We can extend focus from young adults to all adults, and (with multiple regression techniques we’ll learn in ADA2) assess the particular depressed subpopulations most susceptible to developing nicotine dependence.
(Class 25) (2 p) References
By citing sources in your introduction, this section will automatically have your bibliography.
References
References are supposed to appear here, but they may appear at the end of the document.
7.2 Class 27, Poster Preparation, into poster template
Caraballo, Ralph S., Scott P. Novak, and Katherine Asman. 2009. “Linking Quantity and Frequency Profiles of Cigarette Smoking to the Presence of Nicotine Dependence Symptoms Among Adolescent Smokers: Findings from the 2004 National Youth Tobacco Survey.”Nicotine & Tobacco Research, January, ntn008. https://doi.org/10.1093/ntr/ntn008.
Dierker, Lisa C., Shelli Avenevoli, Kathleen R. Merikangas, Brian P. Flaherty, and Marilyn Stolar. 2001. “Association Between Psychiatric Disorders and the Progression of Tobacco Use Behaviors.”Journal of the American Academy of Child & Adolescent Psychiatry 40 (10): 1159–67. https://doi.org/10.1097/00004583-200110000-00009.
Dierker, Lisa C., Eric Donny, Stephen Tiffany, Suzanne M. Colby, Nicholas Perrine, and Richard R. Clayton. 2007. “The Association Between Cigarette Smoking and DSM-IV Nicotine Dependence Among First Year College Students.”Drug and Alcohol Dependence 86 (2–3): 106–14. https://doi.org/10.1016/j.drugalcdep.2006.05.025.
Rohde, Paul, Christopher W. Kahler, Peter M. Lewinsohn, and Richard A. Brown. 2004. “Psychiatric Disorders, Familial Factors, and Cigarette Smoking: II. Associations with Progression to Daily Smoking.”Nicotine & Tobacco Research 6 (1): 119–32. https://doi.org/10.1080/14622200310001656948.
Rohde, Paul, Peter M. Lewinsohn, Richard A. Brown, Jeffrey M. Gau, and Christopher W. Kahler. 2003. “Psychiatric Disorders, Familial Factors and Cigarette Smoking: I. Associations with Smoking Initiation.”Nicotine & Tobacco Research 5 (1): 85–98. https://doi.org/10.1080/1462220031000070507.
Stanton, Warren R., John B. Lowe, and Phil A. Silva. 1995. “Antecedents of Vulnerability and Resilience to Smoking Among Adolescents.”Journal of Adolescent Health 16 (1): 71–77. https://doi.org/10.1016/1054-139X(94)00051-F.