---
title: "ADA2: Class 27, ATUS data subset and model selection"
author: Your Name
date: last-modified
description: |
[Advanced Data Analysis 2](https://StatAcumen.com/teach/ada2),
Stat 428/528, Spring 2024, Prof. Erik Erhardt, UNM
format:
html:
theme: litera
highlight-style: atom-one
page-layout: full # article, full # https://quarto.org/docs/output-formats/page-layout.html
toc: true
toc-location: body # body, left, right
number-sections: true
self-contained: false # !!! this can cause a render error
code-overflow: scroll # scroll, wrap
code-block-bg: true
code-block-border-left: "#30B0E0"
code-copy: false # true, false, hover a copy buttom in top-right of code block
fig-width: 6
fig-height: 4
fig-align: center # default, left, right, or center
execute: # https://quarto.org/docs/computations/execution-options.html, https://quarto.org/docs/computations/r.html
cache: false # false, true
echo: true # true, false Include the source code in output
warning: true # true, false Include warnings in the output.
error: true # true, false Include errors in the output (note that this implies that errors executing code will not halt processing of the document).
#bibliography: ADA2_proj.bib
---
The __American Time Use Survey (ATUS)__ measures the amount of time people spend
doing various activities, such as paid work, childcare, volunteering, and
socializing.
We use my summarized version of the USA BLS.gov [ATUS 2003-2021 Multi-Year
Microdata Files](https://www.bls.gov/tus/data/datafiles-0321.htm).
# Rubric
Complete everything below.
---
# (1 p) Personalized analysis conditions
Set your personalized analysis conditions.
This assigns to you an effectively random sample from the dataset,
a stepwise selection strategy, and
a model selection criterion.
Your path through this analysis will differ depending on your name and birth date.
1. Sample of data.
* Choose a random number seed based on your first and last name initials and birth day of the month.
```{r}
# example: EE and 14th becomes 050514, where each E = 05th letter of the alphabet
condition_1_seed <- 050514
```
2. Stepwise selection starting model.
* If your birth day of the month is __01--10__, then your stepwise model selection will start with the __mean (empty) model__. Choose index `[1]` below.
* If your birth day of the month is __11--20__, then your stepwise model selection will start with the __main effects model__. Choose index `[2]` below.
* If your birth day of the month is __21--31__, then your stepwise model selection will start with the __two-way interaction model__. Choose index `[3]` below.
```{r}
# example: 14th becomes "Main effects", that's the 2nd index
condition_2_init_model <- c("Mean", "Main effects", "Two-way interaction")[2]
```
3. Model selection criterion.
* If your birth month is __01--06__, then use __AIC__ for model selection. Choose index `[1]` below.
* If your birth month is __07--12__, then use __BIC__ for model selection. Choose index `[2]` below.
We will compare the results of all of the assignments, so please do your best with your given conditions.
The point I hope will become clear is that the results are sensitive to the sample and the choices you make in analysis.
```{r}
# example: December is 12th month, giving "BIC", that's the 2nd index
condition_3_criterion <- c("AIC", "BIC")[2]
```
4. Everyone will use $n = 500$ observations for analysis.
* This is large enough to clearly identify patterns in the data but not
overwhelmingly large to detect tiny effects.
* Do not change this value.
```{r}
n_analysis <- 500
```
# Data
## Read and format
These data have been prepared for you.
```{r}
library(erikmisc)
library(tidyverse)
library(erikdata) # ATUS data, install with devtools::install_github("erikerhardt/erikdata")
library(labelled) # for variabel labels, use: var_label(dat_atus$TUCASEID)
set.seed(condition_1_seed) # must run prior to dplyr::slice_sample() to draw the same sample
dat_atus <-
erikdata::dat_atus %>%
dplyr::select(
TUCASEID
, t0101, TESEX, TEAGE, GTMETSTA, PEEDUCA, TRERNHLY, TEHRUSL1, TEHRUSL2, TRHHCHILD, TRTALONE, TRTHHFAMILY
)
# list of variables with their labels
labels_dat_atus %>%
dplyr::filter(
Var %in% names(dat_atus)
)
dat_atus <-
dat_atus %>%
dplyr::filter(
TRERNHLY > 0 # only people who work and earn an hourly wage
, t0101 > 0 # only people who went to sleep
) %>%
dplyr::mutate(
t0101 = t0101 / 60 # convert minutes to hours
, PEEDUCA_num =
case_when(
PEEDUCA == "Less than 1st grade" ~ 0 # 1
, PEEDUCA == "1st, 2nd, 3rd, or 4th grade" ~ 2.5 # 2
, PEEDUCA == "5th or 6th grade" ~ 5.5 # 3
, PEEDUCA == "7th or 8th grade" ~ 7.5 # 4
, PEEDUCA == "9th grade" ~ 9 # 5
, PEEDUCA == "10th grade" ~ 10 # 6
, PEEDUCA == "11th grade" ~ 11 # 7
, PEEDUCA == "12th grade - no diploma" ~ 12 # 8
, PEEDUCA == "High school graduate - diploma or equivalent (GED)" ~ 12 # 9
, PEEDUCA == "Some college but no degree" ~ 13 # 10
, PEEDUCA == "Associate degree - occupational/vocational" ~ 14 # 11
, PEEDUCA == "Associate degree - academic program" ~ 14 # 12
, PEEDUCA == "Bachelor's degree (BA, AB, BS, etc.)" ~ 16 # 13
, PEEDUCA == "Master's degree (MA, MS, MEng, MEd, MSW, etc.)" ~ 18 # 14
, PEEDUCA == "Professional school degree (MD, DDS, DVM, etc.)" ~ 21 # 15
, PEEDUCA == "Doctoral degree (PhD, EdD, etc.)" ~ 21 # 16
, TRUE ~ NA %>% as.numeric()
)
# set the "Not identified" Metropolitan areas to NA
, GTMETSTA =
GTMETSTA %>%
factor(
levels =
# keep the levels that are not "Not identified"
stringr::str_subset(
string = levels(dat_atus$GTMETSTA)
, pattern = "Not identified"
, negate = TRUE
)
)
# hours worked at all jobs
, TEHRUSL_all = TEHRUSL1 + TEHRUSL2
) %>%
dplyr::select(
-PEEDUCA
, -TEHRUSL1
, -TEHRUSL2
) %>%
# drop rows with any missing values
tidyr::drop_na() %>%
# select your sample of rows for analysis
dplyr::slice_sample(
n = n_analysis
)
# label new variables
labelled::var_label(dat_atus[[ "PEEDUCA_num" ]]) <-
labelled::var_label(dat_atus[[ "PEEDUCA" ]])
# relabel variables that were modified in a way that removes the label attribute
labelled::var_label(dat_atus[[ "GTMETSTA" ]]) <-
labels_dat_atus %>% filter(Var == "GTMETSTA") %>% pull(Label)
labelled::var_label(dat_atus[[ "TEHRUSL_all" ]]) <-
labels_dat_atus %>% filter(Var == "TEHRUSL1") %>% pull(Label)
# wrap all labels for plots
for (i_var in seq_len(ncol(dat_atus))) {
labelled::var_label(dat_atus[, i_var]) <-
labelled::var_label(dat_atus[, i_var]) %>%
str_wrap(width = 30)
}
str(dat_atus)
```
## Data decisions start here
There are probably some unsustainably short or long numbers of hours slept.
Let's filter to keep only people who slept at least 5 and at most 12 hours of sleep.
Add any other changes to this code for filtering, excluding outliers, or transforming variables.
```{r}
## filter and mutate data here to satisfy model assumptions
dat_atus <-
dat_atus %>%
dplyr::filter(
t0101 >= 5
, t0101 <= 12
#, TUCASEID %notin% c() # Can use this to exclude observations by ID number
) %>%
dplyr::mutate(
)
str(dat_atus)
```
## Plot
Set `eval: false` to skip this plot once you're satisfied with the data choices you've made.
__Please remember to set it back to `true`__ so that this plot appears as part of your homework submission.
```{r}
#| fig-width: 12
#| fig-height: 12
#| eval: true # false
## Scatterplot matrix
library(ggplot2)
library(GGally)
p <-
ggpairs(
dat_atus %>% dplyr::select(-TUCASEID)
, title = "ATUS Sleeping"
, mapping = ggplot2::aes(colour = TESEX, alpha = 0.5)
, diag = list(
continuous =
wrap(
c("densityDiag", "barDiag", "blankDiag")[1]
, alpha = 1/2
)
, discrete =
c("barDiag", "blankDiag")[1]
)
# scatterplots on top so response as first variable has y on vertical axis
, upper = list(
continuous =
wrap(
c("points", "smooth", "smooth_loess", "density", "cor", "blank")[2]
, se = FALSE
, alpha = 1/2
, size = 1
)
, discrete =
c("ratio", "facetbar", "blank")[2]
, combo =
wrap(
c("box", "box_no_facet", "dot", "dot_no_facet", "facethist", "facetdensity", "denstrip", "blank")[2]
#, bins = 10 # for facethist
)
)
, lower = list(
continuous =
wrap(
c("points", "smooth", "smooth_loess", "density", "cor", "blank")[5]
#, se = FALSE
#, alpha = 1/2
#, size = 1
)
, discrete =
c("ratio", "facetbar", "blank")[2]
, combo =
wrap(
c("box", "box_no_facet", "dot", "dot_no_facet", "facethist", "facetdensity", "denstrip", "blank")[5]
, bins = 10 # for facethist
)
)
, progress = FALSE
, legend = 1 # create legend
)
p <- p + theme_bw()
p <- p + theme(legend.position = "bottom")
print(p)
```
# Analysis
## Initial model
This is done automatically based on your personalized analysis conditions.
```{r}
#| fig-width: 9
#| fig-height: 4
# Mean model
if (condition_2_init_model == "Mean") {
lm_fit_init <-
lm(
t0101 ~ 1
, data = dat_atus
)
}
# Main-effects model
if (condition_2_init_model == "Main effects") {
lm_fit_init <-
lm(
t0101 ~ TESEX + TEAGE + GTMETSTA + PEEDUCA_num + TRERNHLY + TEHRUSL_all + TRHHCHILD + TRTALONE + TRTHHFAMILY
, data = dat_atus
)
}
# Two-way interaction model
if (condition_2_init_model == "Two-way interaction") {
lm_fit_init <-
lm(
t0101 ~ (TESEX + TEAGE + GTMETSTA + PEEDUCA_num + TRERNHLY + TEHRUSL_all + TRHHCHILD + TRTALONE + TRTHHFAMILY)^2
, data = dat_atus
)
# If the two-way interaction model has NA coefficients,
# then there were probably pairs of categories that had no observations so could not be estimated.
# In this case, set the argument "singular.ok = TRUE" in the car::Anova() function below.
}
lm_fit_init
car::Anova(lm_fit_init, type = 3, singular.ok = FALSE)
# plot diagnostics
e_plot_lm_diagnostics(lm_fit_init)
```
## (3 p) Interpret diagnostics for initial model, resolve issues
## Model selection
This is done automatically based on your personalized analysis conditions.
```{r}
#| fig-width: 9
#| fig-height: 4
# criterion
if (condition_3_criterion == "AIC") {
AIC_k = 2
}
if (condition_3_criterion == "BIC") {
AIC_k = log(nrow(dat_atus))
}
## AIC/BIC stepwise selection
# option: test="F" includes additional information
# for parameter estimate tests that we're familiar with
# option: for BIC, include k=log(nrow( [data.frame name] ))
lm_fit_AIC <-
step(
lm_fit_init
, scope =
list(
upper = t0101 ~ (TESEX + TEAGE + GTMETSTA + PEEDUCA_num + TRERNHLY + TEHRUSL_all + TRHHCHILD + TRTALONE + TRTHHFAMILY)^2
, lower = t0101 ~ 1
)
, direction = "both"
, test = "F"
, trace = 1
, k = AIC_k # condition_3_criterion takes effect here
)
lm_fit_final <- lm_fit_AIC
car::Anova(lm_fit_final, type = 3)
# plot diagnostics
e_plot_lm_diagnostics(lm_fit_final)
summary(lm_fit_final)
lm_fit_criteria <-
e_lm_model_criteria(
lm_fit = lm_fit_final
, dat_fit = dat_atus
)
```
## (3 p) Interpret diagnostics for selected model, resolve issues
## Model effects/contrasts
```{r}
#| fig-width: 10
#| fig-height: 12
p_cont <-
e_plot_model_contrasts(
fit = lm_fit_final
, dat_cont = dat_atus
, choose_contrasts = NULL
, sw_table_in_plot = TRUE
, adjust_method = c("none", "tukey", "scheffe", "sidak", "bonferroni", "dunnettx", "mvt")[2]
, CI_level = 0.95
, sw_glm_scale = c("link", "response")[1]
, sw_print = FALSE
, sw_marginal_even_if_interaction = FALSE
, sw_TWI_plots_keep = c("singles", "both", "all")[1]
, sw_TWI_both_orientation = c("wide", "tall")[1]
, sw_plot_quantiles_values = c("quantiles", "values")[1]
, plot_quantiles = c(0.05, 0.25, 0.5, 0.75, 0.95)
, sw_quantile_type = 7
, plot_values = NULL
, emmip_rg.limit = 1000
)
# Since plot interactions have sublists of plots, we want to pull those out
# into a one-level plot list.
# The code here works for sw_TWI_plots_keep = "singles"
# which will make each plot the same size in the plot_grid() below.
# For a publications, you'll want to manually choose which plots to show.
# index for plot list,
# needed since interactions add 2 plots to the list, so the number of terms
# is not necessarily the same as the number of plots.
i_list <- 0
# initialize a list of plots
p_list <- list()
for (i_term in 1:length(p_cont$plots)) {
## i_term = 1
if ( length(p_cont$plots) == 0 ) {
print("Skip printing contrasts if intercept-only model")
next
}
# extract the name of the plot
n_list <- names(p_cont$plots)[i_term]
# test whether the name has a colon ":"; if so, it's an interaction
if (stringr::str_detect(string = n_list, pattern = stringr::fixed(":"))) {
# an two-way interaction has two plots
# first plot
i_list <- i_list + 1
p_list[[ i_list ]] <- p_cont$plots[[ i_term ]][[ 1 ]]
# second plot
i_list <- i_list + 1
p_list[[ i_list ]] <- p_cont$plots[[ i_term ]][[ 2 ]]
} else {
# not an interaction, only one plot
i_list <- i_list + 1
p_list[[ i_list ]] <- p_cont$plots[[ i_term ]]
} # if
# Every 4 plots, print them
if (i_list >= 4) {
p_arranged <-
cowplot::plot_grid(
plotlist = p_list
, nrow = NULL
, ncol = 2
, labels = "AUTO"
)
p_arranged %>% print()
i_list <- 0
next
}
# if last term, print the plots
if (i_term == length(p_cont$plots)) {
p_arranged <-
cowplot::plot_grid(
plotlist = p_list
, nrow = NULL
, ncol = 2
, labels = "AUTO"
)
p_arranged %>% print()
}
} # for
```
## (2 p) Interpret one main effect and one interaction
If you don't have an interaction, interpret a second main effect.
Choose the two you think are the most interesting (to you).
1. Main effect: [Variable name here]
* [Interpretation here]
2. Interaction: [Variable names here]
* [Interpretation here]
# (1 p) Summarize and share your decisions and results
We will compile all of the results in a google form so that we can see the
variability from random sampling, starting model, model selection criteria, and
transformations to the response and selected covariates.
Prepare these answers to enter into the __[Google Form](https://forms.gle/x6dZei2dM15e7mAU6)__.
I've automatically filled in the answers that I could (1, 2, 3, 9, and 10).
Please review what you did above to complete the rest, thank you.
1. Random number seed (number): __`r condition_1_seed %>% format(scientific = FALSE)`__
2. Starting model: __`r condition_2_init_model`__
* Mean
* Main effects
* Two-way interaction
3. Model selection criteria: __`r condition_3_criterion`__
* AIC
* BIC
4. Response variable transformation:
* none
* log
* sqrt (y^0.5)
* other power transformation
5. Any covariate transformations? (includes any change to a covariate, such as grouping factor levels)
* Yes
* No
* If "Yes" to covariate transformations, list each variable and its transformation on separate lines: "var_name, transformation".
* var_name1, transformation
* var_name2, transformation
* var_name3, transformation
* var_name4, transformation
6. How many outliers dropped?
* number:
7. Were you able to satisfy model assumptions?
* Yes
* No
8. If model assumptions were not met, which were violated, including other issues?
* Residuals not normal
* Non-random structure in a residual plot
* Non-constant variance in a residual plot
* Influential points (large Cook's D)
* Outliers
9. Final model criteria statistics:
* r2 = __`r lm_fit_criteria$r2`__
* aic (even if you bic for selection) = __`r lm_fit_criteria$aic`__
* p (number of model parameters) = __`r lm_fit_criteria$p`__
* df = __`r lm_fit_criteria$df`__
10. Terms in model:
* Initial model: __`r attr(lm_fit_init $terms, "term.labels")`__
* Final model: __`r attr(lm_fit_final$terms, "term.labels")`__