---
title: "ADA2: Class 05, Ch 03 A Taste of Model Selection for Multiple Regression"
author: "Name Here"
date: "mm/dd/yyyy"
output:
pdf_document:
number_sections: yes
toc: yes
html_document:
toc: true
number_sections: true
code_folding: show
---
```{R, echo=FALSE}
# I set some GLOBAL R chunk options here.
# (to hide this message add "echo=FALSE" to the code chunk options)
knitr::opts_chunk$set(comment = NA, message = FALSE, warning = FALSE, width = 100)
knitr::opts_chunk$set(fig.align = "center", fig.height = 4, fig.width = 6)
knitr::opts_chunk$set(cache = TRUE, autodep=TRUE) #$
```
# CCHD birth weight
The California Child Health and Development Study involved women on the Kaiser
Health plan who received prenatal care and later gave birth in the Kaiser
clinics. Approximately 19,000 live-born children were delivered in the 20,500
pregnancies. We consider the subset of the 680 live-born white male infants in the study.
Data were collected on a variety of features of the child, the mother, and the
father.
The columns in the data set are, from left to right:
```
col var name description
1 id ID
2 cheadcir child's head circumference (inches)
3 clength child's length (inches), $y$ response
4 cbwt child's birth weight (pounds)
5 gest gestation (weeks)
6 mage maternal age (years)
7 msmoke maternal smoking (cigarettes/day)
8 mht maternal height (inches)
9 mppwt maternal pre-pregnancy weight (pounds)
10 page paternal age (years)
11 ped paternal education (years)
12 psmoke paternal smoking (cigarettes/day)
13 pht paternal height (inches)
```
```{R}
fn.data <- "http://statacumen.com/teach/ADA2/worksheet/ADA2_WS_05_cchd-birthwt.txt"
cchd <- read.table(fn.data, header=TRUE)
# only keep the variables we're analyzing
cchd <- subset(cchd, select=c("cbwt", "mage", "msmoke", "mht", "mppwt"
, "page", "ped", "psmoke", "pht")
)
str(cchd)
head(cchd)
```
# Rubric
A goal here is to build a multiple regression model to predict child's
birth weight (column 4, `cbwt`) from the data on the mother and father (columns 6--13).
A reasonable strategy would be to:
1. Examine the relationship between birth weight and the potential predictors.
2. Decide whether any of the variables should be transformed.
3. Perform a backward elimination using the desired response and predictors.
4. Given the selected model, examine the residuals and check for influential cases.
5. Repeat the process, if necessary.
6. Interpret the model and discuss any model limitations.
## __(1 p)__ Looking at the data
_Describe any patterns you see in the data.
Are the ranges for each variable reasonable?
Extreme/unusual observations?
Strong nonlinear trends with the response suggesting a transformation?_
```{R}
summary(cchd)
```
```{R, fig.height = 8, fig.width = 8}
library(ggplot2)
library(GGally)
#p <- ggpairs(cchd)
# put scatterplots on top so y axis is vertical
p <- ggpairs(cchd, upper = list(continuous = wrap("points", alpha = 0.2, size = 0.5))
, lower = list(continuous = "cor")
)
print(p)
```
```{R}
# correlation matrix and associated p-values testing "H0: rho == 0"
library(Hmisc)
rcorr(as.matrix(cchd))
```
### Solution
[answer]
## __(2 p)__ Backward selection, diagnostics of reduced model
Below I fit the linear model with all the selected main effects.
```{R}
# fit full model
lm.cchd.full <- lm(cbwt ~ mage + msmoke + mht + mppwt
+ page + ped + psmoke + pht
, data = cchd)
library(car)
#Anova(aov(lm.cchd.full), type=3)
summary(lm.cchd.full)
```
Model selection starts here.
```{R}
## AIC
# option: test="F" includes additional information
# for parameter estimate tests that we're familiar with
# option: for BIC, include k=log(nrow( [data.frame name] ))
lm.cchd.red.AIC <- step(lm.cchd.full, direction="backward", test="F")
lm.cchd.final <- lm.cchd.red.AIC
summary(lm.cchd.final)
# BIC (not shown)
# step(lm.cchd.full, direction="backward", test="F", k=log(nrow(cchd)))
```
Backward selection results in a model with `msmoke`, `mht`, `mppwt`, and `pht`
all significant at a 0.05 level.
__Diagnostics__
```{R, fig.height = 8, fig.width = 10, echo=FALSE}
# plot diagnistics
par(mfrow=c(3,3))
plot(lm.cchd.final, which = c(1,4,6))
plot(cchd$msmoke, lm.cchd.final$residuals, main="Residuals vs msmoke")
# horizontal line at zero
abline(h = 0, col = "gray75")
plot(cchd$mht, lm.cchd.final$residuals, main="Residuals vs mht")
# horizontal line at zero
abline(h = 0, col = "gray75")
plot(cchd$mppwt, lm.cchd.final$residuals, main="Residuals vs mppwt")
# horizontal line at zero
abline(h = 0, col = "gray75")
plot(cchd$pht, lm.cchd.final$residuals, main="Residuals vs pht")
# horizontal line at zero
abline(h = 0, col = "gray75")
# Normality of Residuals
library(car)
qqPlot(lm.cchd.final$residuals, las = 1, id.n = 3, main="QQ Plot")
# residuals vs order of data
plot(lm.cchd.final$residuals, main="Residuals vs Order of data")
# horizontal line at zero
abline(h = 0, col = "gray75")
```
```{R, fig.height = 8, fig.width = 8, echo=FALSE}
library(car)
avPlots(lm.cchd.final, id.n=5)
```
__Discuss the diagnostics in terms of influential observations or problematic structure in the residuals.__
In particular, if an observation is influential, describe _how_ it is influential;
does it change the slope, intercept, or both for the regression surface?
### Solution
[answer]
## __(3 p)__ Address model fit
If the model doesn't fit well (diagnostics tell you this, not $R^2$ or significance tests),
then address the lack of model fit.
Transformations and removing influential points are two strategies.
The decisions you make should be based on what you observed in the residual plots.
If there's an influential observation, remove it and see how that affects
the backward selection (whether the same predictors are retained),
the model fit (diagnostics),
and regression coefficient estimates (betas).
If there's a pattern in the residuals that can be addressed by a transformation,
guess at the appropriate transformation and try it.
Repeat until you are satisfied with the diagnostics meeting the model assumptions.
Below, briefly outline what you did (no need to show all the output)
by (1) identifying what you observed in the diagostics
and (2) the strategy you took to address that issue.
Finally, show the final model and the diagnostics for that.
Describe how the final model is different from the original;
in particular discuss whether variables retained are different from backward selection
and whether the sign and magnitude of the regression coefficients are much different.
### Solution
[answer]
## __(3 p)__ Interpret the final model
What proportion of variation in the response does the model explain over the mean of the response?
(This quantity indicates how precisely this model will predict new observations.)
Finally, write the equation for the final model and interpret each model coefficient.
Do these quantities make sense?
### Solution
[answer]
## __(1 p)__ Inference to whom
To which population of people does this model make inference to?
Does this generalize to all humans?
Sometimes this is call the "limitations" section.
By carefully specifying what the population is that inference applies to,
often that accounts for the limitations.
### Solution
[answer]