1 CCHD birth weight

The California Child Health and Development Study involved women on the Kaiser Health plan who received prenatal care and later gave birth in the Kaiser clinics. Approximately 19,000 live-born children were delivered in the 20,500 pregnancies. We consider the subset of the 680 live-born white male infants in the study. Data were collected on a variety of features of the child, the mother, and the father.

The columns in the data set are, from left to right:

col   var name   description
  1   id         ID
  2   cheadcir   child's head circumference (inches)
  3   clength    child's length (inches), $y$ response
  4   cbwt       child's birth weight (pounds)
  5   gest       gestation (weeks)
  6   mage       maternal age (years)
  7   msmoke     maternal smoking (cigarettes/day)
  8   mht        maternal height (inches)
  9   mppwt      maternal pre-pregnancy weight (pounds)
 10   page       paternal age (years)
 11   ped        paternal education (years)
 12   psmoke     paternal smoking (cigarettes/day)
 13   pht        paternal height (inches)
fn.data <- "http://statacumen.com/teach/ADA2/worksheet/ADA2_WS_05_cchd-birthwt.txt"
cchd <- read.table(fn.data, header=TRUE)
# only keep the variables we're analyzing
cchd <- subset(cchd, select=c("cbwt", "mage", "msmoke", "mht", "mppwt"
                                    , "page", "ped", "psmoke", "pht")
              )
str(cchd)
'data.frame':   680 obs. of  9 variables:
 $ cbwt  : num  7.3 8 7.5 7 5.3 8.6 9.1 6.5 3.3 8.1 ...
 $ mage  : int  33 28 32 27 32 30 23 27 32 28 ...
 $ msmoke: int  25 0 0 2 17 0 0 17 0 0 ...
 $ mht   : int  66 63 61 68 67 63 65 64 64 66 ...
 $ mppwt : int  140 130 126 150 112 131 134 125 142 113 ...
 $ page  : int  37 35 38 30 28 34 26 29 32 41 ...
 $ ped   : int  12 10 12 16 10 12 12 12 14 16 ...
 $ psmoke: int  25 7 17 7 17 17 0 7 0 0 ...
 $ pht   : int  74 71 65 73 71 66 71 71 66 68 ...
head(cchd)
  cbwt mage msmoke mht mppwt page ped psmoke pht
1  7.3   33     25  66   140   37  12     25  74
2  8.0   28      0  63   130   35  10      7  71
3  7.5   32      0  61   126   38  12     17  65
4  7.0   27      2  68   150   30  16      7  73
5  5.3   32     17  67   112   28  10     17  71
6  8.6   30      0  63   131   34  12     17  66

2 Rubric

A goal here is to build a multiple regression model to predict child’s birth weight (column 4, cbwt) from the data on the mother and father (columns 6–13). A reasonable strategy would be to:

  1. Examine the relationship between birth weight and the potential predictors.
  2. Decide whether any of the variables should be transformed.
  3. Perform a backward elimination using the desired response and predictors.
  4. Given the selected model, examine the residuals and check for influential cases.
  5. Repeat the process, if necessary.
  6. Interpret the model and discuss any model limitations.

2.1 (1 p) Looking at the data

Describe any patterns you see in the data. Are the ranges for each variable reasonable? Extreme/unusual observations? Strong nonlinear trends with the response suggesting a transformation?

summary(cchd)
      cbwt             mage           msmoke            mht       
 Min.   : 3.300   Min.   :15.00   Min.   : 0.000   Min.   :57.00  
 1st Qu.: 6.800   1st Qu.:21.00   1st Qu.: 0.000   1st Qu.:63.00  
 Median : 7.600   Median :25.00   Median : 0.000   Median :64.00  
 Mean   : 7.516   Mean   :25.86   Mean   : 7.431   Mean   :64.43  
 3rd Qu.: 8.200   3rd Qu.:29.00   3rd Qu.:12.000   3rd Qu.:66.00  
 Max.   :11.400   Max.   :42.00   Max.   :50.000   Max.   :71.00  
     mppwt            page           ped            psmoke     
 Min.   : 85.0   Min.   :18.0   Min.   : 6.00   Min.   : 0.00  
 1st Qu.:115.0   1st Qu.:24.0   1st Qu.:12.00   1st Qu.: 0.00  
 Median :125.0   Median :28.0   Median :14.00   Median :12.00  
 Mean   :126.9   Mean   :28.8   Mean   :13.38   Mean   :14.44  
 3rd Qu.:135.0   3rd Qu.:33.0   3rd Qu.:16.00   3rd Qu.:25.00  
 Max.   :246.0   Max.   :52.0   Max.   :16.00   Max.   :50.00  
      pht       
 Min.   :62.00  
 1st Qu.:69.00  
 Median :71.00  
 Mean   :70.62  
 3rd Qu.:72.00  
 Max.   :79.00  
library(ggplot2)
library(GGally)
#p <- ggpairs(cchd)
# put scatterplots on top so y axis is vertical
p <- ggpairs(cchd, upper = list(continuous = wrap("points", alpha = 0.2, size = 0.5))
                 , lower = list(continuous = "cor")
            )
print(p)

# correlation matrix and associated p-values testing "H0: rho == 0"
library(Hmisc)
rcorr(as.matrix(cchd))
        cbwt  mage msmoke   mht mppwt  page   ped psmoke   pht
cbwt    1.00  0.00  -0.18  0.20  0.22  0.02  0.03  -0.02  0.15
mage    0.00  1.00   0.05  0.02  0.12  0.82  0.24   0.02 -0.07
msmoke -0.18  0.05   1.00  0.03 -0.03  0.03  0.02   0.26  0.01
mht     0.20  0.02   0.03  1.00  0.49  0.02  0.11  -0.01  0.30
mppwt   0.22  0.12  -0.03  0.49  1.00  0.12  0.00  -0.03  0.17
page    0.02  0.82   0.03  0.02  0.12  1.00  0.22   0.04 -0.13
ped     0.03  0.24   0.02  0.11  0.00  0.22  1.00  -0.18  0.11
psmoke -0.02  0.02   0.26 -0.01 -0.03  0.04 -0.18   1.00  0.01
pht     0.15 -0.07   0.01  0.30  0.17 -0.13  0.11   0.01  1.00

n= 680 


P
       cbwt   mage   msmoke mht    mppwt  page   ped    psmoke pht   
cbwt          0.9729 0.0000 0.0000 0.0000 0.6685 0.3899 0.5430 0.0000
mage   0.9729        0.2412 0.6490 0.0025 0.0000 0.0000 0.6653 0.0639
msmoke 0.0000 0.2412        0.4996 0.5024 0.4707 0.5370 0.0000 0.7791
mht    0.0000 0.6490 0.4996        0.0000 0.6396 0.0048 0.7019 0.0000
mppwt  0.0000 0.0025 0.5024 0.0000        0.0012 0.9736 0.4745 0.0000
page   0.6685 0.0000 0.4707 0.6396 0.0012        0.0000 0.3015 0.0004
ped    0.3899 0.0000 0.5370 0.0048 0.9736 0.0000        0.0000 0.0049
psmoke 0.5430 0.6653 0.0000 0.7019 0.4745 0.3015 0.0000        0.7224
pht    0.0000 0.0639 0.7791 0.0000 0.0000 0.0004 0.0049 0.7224       

2.1.1 Solution

[answer]

2.2 (2 p) Backward selection, diagnostics of reduced model

Below I fit the linear model with all the selected main effects.

# fit full model
lm.cchd.full <- lm(cbwt ~ mage + msmoke + mht + mppwt
                        + page + ped + psmoke + pht
                      , data = cchd)

library(car)
#Anova(aov(lm.cchd.full), type=3)
summary(lm.cchd.full)

Call:
lm(formula = cbwt ~ mage + msmoke + mht + mppwt + page + ped + 
    psmoke + pht, data = cchd)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.2194 -0.7005  0.0236  0.6527  3.7613 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.510508   1.374821   0.371 0.710511    
mage        -0.009105   0.012791  -0.712 0.476840    
msmoke      -0.018180   0.003687  -4.931 1.03e-06 ***
mht          0.044131   0.019280   2.289 0.022389 *  
mppwt        0.009221   0.002613   3.529 0.000445 ***
page         0.008121   0.011481   0.707 0.479591    
ped          0.011172   0.019446   0.575 0.565820    
psmoke       0.002546   0.002993   0.851 0.395230    
pht          0.041670   0.016233   2.567 0.010473 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.04 on 671 degrees of freedom
Multiple R-squared:  0.1036,    Adjusted R-squared:  0.09291 
F-statistic: 9.693 on 8 and 671 DF,  p-value: 8.952e-13

Model selection starts here.

## AIC
# option: test="F" includes additional information
#           for parameter estimate tests that we're familiar with
# option: for BIC, include k=log(nrow( [data.frame name] ))
lm.cchd.red.AIC <- step(lm.cchd.full, direction="backward", test="F")
Start:  AIC=62.76
cbwt ~ mage + msmoke + mht + mppwt + page + ped + psmoke + pht

         Df Sum of Sq    RSS    AIC F value    Pr(>F)    
- ped     1    0.3572 726.62 61.093  0.3301 0.5658197    
- page    1    0.5416 726.81 61.265  0.5004 0.4795906    
- mage    1    0.5484 726.81 61.272  0.5067 0.4768396    
- psmoke  1    0.7833 727.05 61.491  0.7237 0.3952296    
<none>                726.26 62.758                      
- mht     1    5.6711 731.94 66.048  5.2396 0.0223886 *  
- pht     1    7.1324 733.40 67.404  6.5896 0.0104731 *  
- mppwt   1   13.4808 739.74 73.265 12.4550 0.0004454 ***
- msmoke  1   26.3137 752.58 84.960 24.3114 1.034e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Step:  AIC=61.09
cbwt ~ mage + msmoke + mht + mppwt + page + psmoke + pht

         Df Sum of Sq    RSS    AIC F value    Pr(>F)    
- mage    1    0.4705 727.09 59.533  0.4351 0.5097278    
- psmoke  1    0.6029 727.22 59.657  0.5576 0.4555018    
- page    1    0.6150 727.24 59.668  0.5688 0.4510024    
<none>                726.62 61.093                      
- mht     1    6.0455 732.67 64.727  5.5911 0.0183357 *  
- pht     1    7.6382 734.26 66.204  7.0641 0.0080516 ** 
- mppwt   1   13.1673 739.79 71.305 12.1775 0.0005153 ***
- msmoke  1   26.0343 752.66 83.030 24.0772 1.162e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Step:  AIC=59.53
cbwt ~ msmoke + mht + mppwt + page + psmoke + pht

         Df Sum of Sq    RSS    AIC F value    Pr(>F)    
- page    1    0.1513 727.24 57.674  0.1401  0.708332    
- psmoke  1    0.6475 727.74 58.138  0.5994  0.439092    
<none>                727.09 59.533                      
- mht     1    6.1466 733.24 63.257  5.6893  0.017344 *  
- pht     1    7.4130 734.50 64.431  6.8615  0.009006 ** 
- mppwt   1   13.0524 740.14 69.632 12.0813  0.000542 ***
- msmoke  1   26.4382 753.53 81.820 24.4713 9.538e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Step:  AIC=57.67
cbwt ~ msmoke + mht + mppwt + psmoke + pht

         Df Sum of Sq    RSS    AIC F value    Pr(>F)    
- psmoke  1    0.6734 727.92 56.304  0.6241 0.4298016    
<none>                727.24 57.674                      
- mht     1    6.1270 733.37 61.379  5.6784 0.0174507 *  
- pht     1    7.2623 734.51 62.431  6.7306 0.0096834 ** 
- mppwt   1   13.7100 740.95 68.374 12.7062 0.0003902 ***
- msmoke  1   26.3602 753.60 79.886 24.4303 9.733e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Step:  AIC=56.3
cbwt ~ msmoke + mht + mppwt + pht

         Df Sum of Sq    RSS    AIC F value    Pr(>F)    
<none>                727.92 56.304                      
- mht     1    6.0566 733.97 59.938  5.6163 0.0180745 *  
- pht     1    7.3497 735.27 61.135  6.8154 0.0092379 ** 
- mppwt   1   13.6367 741.55 66.925 12.6454 0.0004028 ***
- msmoke  1   25.9791 753.90 78.150 24.0905 1.154e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lm.cchd.final <- lm.cchd.red.AIC
summary(lm.cchd.final)

Call:
lm(formula = cbwt ~ msmoke + mht + mppwt + pht, data = cchd)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.2726 -0.6965  0.0145  0.6661  3.8167 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.643876   1.331579   0.484 0.628867    
msmoke      -0.017376   0.003540  -4.908 1.15e-06 ***
mht          0.045320   0.019124   2.370 0.018074 *  
mppwt        0.009129   0.002567   3.556 0.000403 ***
pht          0.041392   0.015855   2.611 0.009238 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.038 on 675 degrees of freedom
Multiple R-squared:  0.1016,    Adjusted R-squared:  0.09623 
F-statistic: 19.07 on 4 and 675 DF,  p-value: 7.095e-15
# BIC (not shown)
# step(lm.cchd.full, direction="backward", test="F", k=log(nrow(cchd)))

Backward selection results in a model with msmoke, mht, mppwt, and pht all significant at a 0.05 level.

Diagnostics

  9 170 506 
  1 680   2