---
title: "ADA2: Class 05, Ch 03 A Taste of Model Selection for Multiple Regression"
author: anonymous
date: "2/2/2016"
output:
html_document:
toc: true
---
```{R, echo=FALSE}
# I set some GLOBAL R chunk options here.
# (to hide this message add "echo=FALSE" to the code chunk options)
knitr::opts_chunk$set(comment = NA, message = FALSE, warning = FALSE, width = 100)
knitr::opts_chunk$set(fig.align = "center", fig.height = 4, fig.width = 6)
knitr::opts_chunk$set(cache = TRUE, autodep=TRUE) #$
```
# CCHD birth weight
The California Child Health and Development Study involved women on the Kaiser
Health plan who received prenatal care and later gave birth in the Kaiser
clinics. Approximately 19,000 live-born children were delivered in the 20,500
pregnancies. We consider the subset of the 680 live-born white male infants in the study.
Data were collected on a variety of features of the child, the mother, and the
father.
The columns in the data set are, from left to right:
```
col var name description
1 id ID
2 cheadcir child's head circumference (inches)
3 clength child's length (inches), $y$ response
4 cbwt child's birth weight (pounds)
5 gest gestation (weeks)
6 mage maternal age (years)
7 msmoke maternal smoking (cigarettes/day)
8 mht maternal height (inches)
9 mppwt maternal pre-pregnancy weight (pounds)
10 page paternal age (years)
11 ped paternal education (years)
12 psmoke paternal smoking (cigarettes/day)
13 pht paternal height (inches)
```
```{R}
fn.data <- "http://statacumen.com/teach/ADA2/worksheet/ADA2_WS_05_cchd-birthwt.txt"
cchd <- read.table(fn.data, header=TRUE)
# only keep the variables we're analyzing
cchd <- subset(cchd, select=c("cbwt", "mage", "msmoke", "mht", "mppwt"
, "page", "ped", "psmoke", "pht")
)
str(cchd)
head(cchd)
```
# Rubric
A goal here is to build a multiple regression model to predict child's
birth weight (column 4, `cbwt`) from the data on the mother and father (columns 6--13).
A reasonable strategy would be to:
1. Examine the relationship between birth weight and the potential predictors.
2. Decide whether any of the variables should be transformed.
3. Perform a backward elimination using the desired response and predictors.
4. Given the selected model, examine the residuals and check for influential cases.
5. Repeat the process, if necessary.
6. Interpret the model and discuss any model limitations.
## __(1 p)__ Looking at the data
_Describe any patterns you see in the data.
Are the ranges for each variable reasonable?
Extreme/unusual observations?
Strong nonlinear trends with the response suggesting a transformation?_
```{R}
summary(cchd)
```
```{R, fig.height = 8, fig.width = 8}
library(ggplot2)
library(GGally)
#p <- ggpairs(cchd)
# put scatterplots on top so y axis is vertical
p <- ggpairs(cchd, upper = list(continuous = wrap("points", alpha = 0.2, size = 0.5))
, lower = list(continuous = "cor")
)
print(p)
```
```{R}
# correlation matrix and associated p-values testing "H0: rho == 0"
library(Hmisc)
rcorr(as.matrix(cchd))
```
### Solution
The minimum and maximum values in the numerical summaries seem reasonable for each variable.
The scatterplots reveals one outlier for maternal pre-pregnancy weight (`mppwt`).
The scatterplots and correlation matrix suggest `msmoke`, `mht`, `mppwt`, and `pht` as weak linear
predictors of $y = $ `cbwt`.
There is no strong suggestion for transformations of $x$s to be linear with $y = $ `cbwt`.
Note that relationships may change with multiple predictors together.
## __(2 p)__ Backward selection, diagnostics of reduced model
Below I fit the linear model with all the selected main effects.
```{R}
# fit full model
lm.cchd.full <- lm(cbwt ~ mage + msmoke + mht + mppwt
+ page + ped + psmoke + pht
, data = cchd)
library(car)
#Anova(aov(lm.cchd.full), type=3)
summary(lm.cchd.full)
```
Model selection starts here.
```{R}
## AIC
# option: test="F" includes additional information
# for parameter estimate tests that we're familiar with
# option: for BIC, include k=log(nrow( [data.frame name] ))
lm.cchd.red.AIC <- step(lm.cchd.full, direction="backward", test="F")
lm.cchd.final <- lm.cchd.red.AIC
summary(lm.cchd.final)
# BIC (not shown)
# step(lm.cchd.full, direction="backward", test="F", k=log(nrow(cchd)))
```
Backward selection results in a model with `msmoke`, `mht`, `mppwt`, and `pht`
all significant at a 0.05 level.
__Diagnostics__
```{R, fig.height = 8, fig.width = 10, echo=FALSE}
# plot diagnistics
par(mfrow=c(3,3))
plot(lm.cchd.final, which = c(1,4,6))
plot(cchd$msmoke, lm.cchd.final$residuals, main="Residuals vs msmoke")
# horizontal line at zero
abline(h = 0, col = "gray75")
plot(cchd$mht, lm.cchd.final$residuals, main="Residuals vs mht")
# horizontal line at zero
abline(h = 0, col = "gray75")
plot(cchd$mppwt, lm.cchd.final$residuals, main="Residuals vs mppwt")
# horizontal line at zero
abline(h = 0, col = "gray75")
plot(cchd$pht, lm.cchd.final$residuals, main="Residuals vs pht")
# horizontal line at zero
abline(h = 0, col = "gray75")
# Normality of Residuals
library(car)
qqPlot(lm.cchd.final$residuals, las = 1, id.n = 3, main="QQ Plot")
# residuals vs order of data
plot(lm.cchd.final$residuals, main="Residuals vs Order of data")
# horizontal line at zero
abline(h = 0, col = "gray75")
```
```{R, fig.height = 8, fig.width = 8, echo=FALSE}
library(car)
avPlots(lm.cchd.final, id.n=5)
```
__Discuss the diagnostics in terms of influential observations or problematic structure in the residuals.__
In particular, if an observation is influential, describe _how_ it is influential;
does it change the slope, intercept, or both for the regression surface?
### Solution
Observation 506 has larger influence than other observations on the model fit,
seen below by the Cook's D.
Obs 506 is near the center of the data, but extreme for `cbwt`,
thus it has more influence on the intercept of the model
(by shifting the entire regression surface down)
and does not greatly change the slopes.
Other diagnostics do not show any issues.
## __(3 p)__ Address model fit
If the model doesn't fit well (diagnostics tell you this, not $R^2$ or significance tests),
then address the lack of model fit.
Transformations and removing influential points are two strategies.
The decisions you make should be based on what you observed in the residual plots.
If there's an influential observation, remove it and see how that affects
the backward selection (whether the same predictors are retained),
the model fit (diagnostics),
and regression coefficient estimates (betas).
If there's a pattern in the residuals that can be addressed by a transformation,
guess at the appropriate transformation and try it.
Repeat until you are satisfied with the diagnostics meeting the model assumptions.
Below, briefly outline what you did (no need to show all the output)
by (1) identifying what you observed in the diagostics
and (2) the strategy you took to address that issue.
Finally, show the final model and the diagnostics for that.
Describe how the final model is different from the original;
in particular discuss whether variables retained are different from backward selection
and whether the sign and magnitude of the regression coefficients are much different.
### Solution
Because of the large influence,
I will __remove observation 506__ from the data and redo backward selection.
Backward selection again results in a model with `msmoke`, `mht`, `mppwt`, and `pht`
all significant at a 0.05 level
conditional on all other variables in the model.
The model coefficients are largely unchanged (same sign and similar magnitude for each coefficient).
```{R}
# remove obs 506
cchd506 <- cchd[-506,]
# fit full model
lm.cchd506.full <- lm(cbwt ~ mage + msmoke + mht + mppwt
+ page + ped + psmoke + pht
, data = cchd506)
#library(car)
#Anova(aov(lm.cchd506.full), type=3)
#summary(lm.cchd506.full)
## AIC
# option: test="F" includes additional information
# for parameter estimate tests that we're familiar with
# option: for BIC, include k=log(nrow( [data.frame name] ))
lm.cchd506.red.AIC <- step(lm.cchd506.full, direction="backward", test="F", trace = 0)
lm.cchd506.final <- lm.cchd506.red.AIC
summary(lm.cchd506.final)
```
The plots indicate that
the residuals do not show any violation of constant variance or normality,
and Cook's distances do not suggest any overly influential observations.
```{R, fig.height = 8, fig.width = 10, echo=FALSE}
# plot diagnistics
par(mfrow=c(3,3))
plot(lm.cchd506.final, which = c(1,4,6))
plot(cchd506$msmoke, lm.cchd506.final$residuals, main="Residuals vs msmoke")
# horizontal line at zero
abline(h = 0, col = "gray75")
plot(cchd506$mht, lm.cchd506.final$residuals, main="Residuals vs mht")
# horizontal line at zero
abline(h = 0, col = "gray75")
plot(cchd506$mppwt, lm.cchd506.final$residuals, main="Residuals vs mppwt")
# horizontal line at zero
abline(h = 0, col = "gray75")
plot(cchd506$pht, lm.cchd506.final$residuals, main="Residuals vs pht")
# horizontal line at zero
abline(h = 0, col = "gray75")
# Normality of Residuals
library(car)
qqPlot(lm.cchd506.final$residuals, las = 1, id.n = 3, main="QQ Plot")
# residuals vs order of data
plot(lm.cchd506.final$residuals, main="Residuals vs Order of data")
# horizontal line at zero
abline(h = 0, col = "gray75")
```
The partial regression residual plots show weak linear relationships for each
variable conditional on the others.
Variable `mppwt` shows a potential high-leverage point (167, high pre-pregnancy weight)
that might influence this
predictor, but given the Cook's distance plot above, I won't worry too much
about this point at this time
(though it could be worth redoing the analysis without this point to see whether the selected model changes).
```{R, fig.height = 8, fig.width = 8, echo=FALSE}
library(car)
avPlots(lm.cchd506.final, id.n=5)
summary(lm.cchd506.final)
```
## __(3 p)__ Interpret the final model
What proportion of variation in the response does the model explain over the mean of the response?
(This quantity indicates how precisely this model will predict new observations.)
Finally, write the equation for the final model and interpret each model coefficient.
Do these quantities make sense?
### Solution
The model explains $R^2=`r signif(summary(lm.cchd.final)$r.squared, 3)*100`%
of the variation in the response.
This is not a lot, but birth weight is more complicated that these predictors can capture.
The final model is
$$
\textrm{cbwt}
=
`r signif(summary(lm.cchd.final)$coefficients[1,1], 2)`
+ `r signif(summary(lm.cchd.final)$coefficients[2,1], 2)` \textrm{ msmoke}
+ `r signif(summary(lm.cchd.final)$coefficients[3,1], 2)` \textrm{ mht}
+ `r signif(summary(lm.cchd.final)$coefficients[4,1], 2)` \textrm{ mppwt}
+ `r signif(summary(lm.cchd.final)$coefficients[5,1], 2)` \textrm{ pht}
.
$$
The model predicts that, with all other predictors held constant,
that a child's birth weight (pounds) is expected to
* increase `r signif(summary(lm.cchd.final)$coefficients[2,1], 2)` pounds for each additional cigarette the mother smokes each day,
* increase `r signif(summary(lm.cchd.final)$coefficients[3,1], 2)` pounds for each inch taller the mother is,
* increase `r signif(summary(lm.cchd.final)$coefficients[4,1], 2)` pounds for each pound heavier the mother is pre-pregnancy, and
* increase `r signif(summary(lm.cchd.final)$coefficients[5,1], 2)` pounds for each inch taller the father is.
So maternal cigarette use decreases birth weight and parental physical size increases birth weight.
This makes sense to me.
## __(1 p)__ Inference to whom
To which population of people does this model make inference to?
Does this generalize to all humans?
Sometimes this is call the "limitations" section.
By carefully specifying what the population is that inference applies to,
often that accounts for the limitations.
### Solution
The model applies to live-born white male babies to women under the Kaiser
Health plan who recieve prenatal care and later give birth in a Kaiser clinic
in California within a reasonable time range of the children studied in this
dataset.
This does not generalize to all humans, though we would expect similarities.