1 CCHD birth weight

The California Child Health and Development Study involved women enrolled in the Kaiser Health plan who received prenatal care and later gave birth in the Kaiser clinics. Approximately 19,000 live-born children were delivered in the 20,500 pregnancies. We consider the subset of 680 live-born white male infants in the study. Data were collected on a variety of features of the child, the mother, and the father.

The columns in the data set are, from left to right:

col   var name   description
  1   id         ID
  2   cheadcir   child's head circumference (inches)
  3   clength    child's length (inches)
  4   cbwt       child's birth weight (pounds), $y$ response
  5   gest       gestation (weeks)
  6   mage       maternal age (years)
  7   msmoke     maternal smoking (cigarettes/day)
  8   mht        maternal height (inches)
  9   mppwt      maternal pre-pregnancy weight (pounds)
 10   page       paternal age (years)
 11   ped        paternal education (years)
 12   psmoke     paternal smoking (cigarettes/day)
 13   pht        paternal height (inches)
library(tidyverse)

# load ada functions
source("ada_functions.R")

# Leading 0s cause otherwise numeric columns to be class character.
# Thus, we add the column format "col_double()" for those columns with
#   leading 0s that we wish to be numeric.

dat_cchd <-
  read_csv(
    "ADA2_WS_05_cchd-birthwt.csv"
  , col_types =
    cols(
      msmoke  = col_double()
    , mppwt   = col_double()
    , ped     = col_double()
    , psmoke  = col_double()
    )
  ) %>%
# only keep the variables we're analyzing
  select(
    cbwt
  , mage, msmoke, mht, mppwt
  , page, psmoke, pht, ped
  )
  #   %>%
  # slice(
  #   -123  #  -123 excludes observation (row number) 123
  # )
str(dat_cchd)
Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame':    680 obs. of  9 variables:
 $ cbwt  : num  7.3 8 7.5 7 5.3 8.6 9.1 6.5 3.3 8.1 ...
 $ mage  : num  33 28 32 27 32 30 23 27 32 28 ...
 $ msmoke: num  25 0 0 2 17 0 0 17 0 0 ...
 $ mht   : num  66 63 61 68 67 63 65 64 64 66 ...
 $ mppwt : num  140 130 126 150 112 131 134 125 142 113 ...
 $ page  : num  37 35 38 30 28 34 26 29 32 41 ...
 $ psmoke: num  25 7 17 7 17 17 0 7 0 0 ...
 $ pht   : num  74 71 65 73 71 66 71 71 66 68 ...
 $ ped   : num  12 10 12 16 10 12 12 12 14 16 ...
 - attr(*, "spec")=
  .. cols(
  ..   id = col_character(),
  ..   cheadcir = col_double(),
  ..   clength = col_double(),
  ..   cbwt = col_double(),
  ..   gest = col_double(),
  ..   mage = col_double(),
  ..   msmoke = col_double(),
  ..   mht = col_double(),
  ..   mppwt = col_double(),
  ..   page = col_double(),
  ..   ped = col_double(),
  ..   psmoke = col_double(),
  ..   pht = col_double()
  .. )
head(dat_cchd)
# A tibble: 6 x 9
   cbwt  mage msmoke   mht mppwt  page psmoke   pht   ped
  <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>
1   7.3    33     25    66   140    37     25    74    12
2   8      28      0    63   130    35      7    71    10
3   7.5    32      0    61   126    38     17    65    12
4   7      27      2    68   150    30      7    73    16
5   5.3    32     17    67   112    28     17    71    10
6   8.6    30      0    63   131    34     17    66    12

2 Rubric

The goal here is to build a multiple regression model to predict child’s birth weight (column 4, cbwt) from the data on the mother and father (columns 6–13). A reasonable strategy would be to:

  1. Examine the relationship between birth weight and the potential predictors.
  2. Decide whether any of the variables should be transformed.
  3. Perform a backward elimination using the desired response and predictors.
  4. Given the selected model, examine the residuals and check for influential cases.
  5. Repeat the process, if necessary.
  6. Interpret the model and discuss any model limitations.

2.1 (1 p) Looking at the data

Describe any patterns you see in the data. Are the ranges for each variable reasonable? Are there extreme or unusual observations? Are there strong nonlinear trends with the response that suggest a transformation?

summary(dat_cchd)
      cbwt             mage           msmoke            mht       
 Min.   : 3.300   Min.   :15.00   Min.   : 0.000   Min.   :57.00  
 1st Qu.: 6.800   1st Qu.:21.00   1st Qu.: 0.000   1st Qu.:63.00  
 Median : 7.600   Median :25.00   Median : 0.000   Median :64.00  
 Mean   : 7.516   Mean   :25.86   Mean   : 7.431   Mean   :64.43  
 3rd Qu.: 8.200   3rd Qu.:29.00   3rd Qu.:12.000   3rd Qu.:66.00  
 Max.   :11.400   Max.   :42.00   Max.   :50.000   Max.   :71.00  
     mppwt            page          psmoke           pht       
 Min.   : 85.0   Min.   :18.0   Min.   : 0.00   Min.   :62.00  
 1st Qu.:115.0   1st Qu.:24.0   1st Qu.: 0.00   1st Qu.:69.00  
 Median :125.0   Median :28.0   Median :12.00   Median :71.00  
 Mean   :126.9   Mean   :28.8   Mean   :14.44   Mean   :70.62  
 3rd Qu.:135.0   3rd Qu.:33.0   3rd Qu.:25.00   3rd Qu.:72.00  
 Max.   :246.0   Max.   :52.0   Max.   :50.00   Max.   :79.00  
      ped       
 Min.   : 6.00  
 1st Qu.:12.00  
 Median :14.00  
 Mean   :13.38  
 3rd Qu.:16.00  
 Max.   :16.00  
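
As a rough screen for extreme values, the quartiles printed above can be plugged into the usual boxplot rule, which flags values more than 1.5 IQR beyond the quartiles. This is only a sketch: the helper name `iqr_fences` is hypothetical, and the 1.5 IQR rule is a common convention, not something this worksheet prescribes.

```r
# Hypothetical helper: boxplot-style fences from the first and third quartiles.
iqr_fences <- function(q1, q3, mult = 1.5) {
  iqr <- q3 - q1
  c(lower = q1 - mult * iqr, upper = q3 + mult * iqr)
}

# Quartiles copied from the summary() output above.
fences_mppwt <- iqr_fences(q1 = 115, q3 = 135)  # maternal pre-pregnancy weight
fences_cbwt  <- iqr_fences(q1 = 6.8, q3 = 8.2)  # child's birth weight

# Compare the printed minimum and maximum against the fences.
246  > fences_mppwt[["upper"]]  # is the maximum mppwt beyond the upper fence?
3.3  < fences_cbwt[["lower"]]   # is the minimum cbwt beyond the lower fence?
11.4 > fences_cbwt[["upper"]]   # is the maximum cbwt beyond the upper fence?
```

Values outside the fences are not automatically errors. They are candidates to inspect in the scatterplot matrix and, later, in the residual diagnostics.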
library(ggplot2)
library(GGally)
#p <- ggpairs(dat_cchd)
# put scatterplots on top so y axis is vertical
p <-
  ggpairs(
    dat_cchd
  , upper = list(continuous = wrap("points", alpha = 0.2, size = 0.5))
  , lower = list(continuous = "cor")
  )
print(p)

# correlation matrix and associated p-values testing "H0: rho == 0"
library(Hmisc)
rcorr(as.matrix(dat_cchd))
        cbwt  mage msmoke   mht mppwt  page psmoke   pht   ped
cbwt    1.00  0.00  -0.18  0.20  0.22  0.02  -0.02  0.15  0.03
mage    0.00  1.00   0.05  0.02  0.12  0.82   0.02 -0.07  0.24
msmoke -0.18  0.05   1.00  0.03 -0.03  0.03   0.26  0.01  0.02
mht     0.20  0.02   0.03  1.00  0.49  0.02  -0.01  0.30  0.11
mppwt   0.22  0.12  -0.03  0.49  1.00  0.12  -0.03  0.17  0.00
page    0.02  0.82   0.03  0.02  0.12  1.00   0.04 -0.13  0.22
psmoke -0.02  0.02   0.26 -0.01 -0.03  0.04   1.00  0.01 -0.18
pht     0.15 -0.07   0.01  0.30  0.17 -0.13   0.01  1.00  0.11
ped     0.03  0.24   0.02  0.11  0.00  0.22  -0.18  0.11  1.00

n= 680 


P
       cbwt   mage   msmoke mht    mppwt  page   psmoke pht    ped   
cbwt          0.9729 0.0000 0.0000 0.0000 0.6685 0.5430 0.0000 0.3899
mage   0.9729        0.2412 0.6490 0.0025 0.0000 0.6653 0.0639 0.0000
msmoke 0.0000 0.2412        0.4996 0.5024 0.4707 0.0000 0.7791 0.5370
mht    0.0000 0.6490 0.4996        0.0000 0.6396 0.7019 0.0000 0.0048
mppwt  0.0000 0.0025 0.5024 0.0000        0.0012 0.4745 0.0000 0.9736
page   0.6685 0.0000 0.4707 0.6396 0.0012        0.3015 0.0004 0.0000
psmoke 0.5430 0.6653 0.0000 0.7019 0.4745 0.3015        0.7224 0.0000
pht    0.0000 0.0639 0.7791 0.0000 0.0000 0.0004 0.7224        0.0049
ped    0.3899 0.0000 0.5370 0.0048 0.9736 0.0000 0.0000 0.0049       
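
The P matrix above tests "H0: rho == 0" for each pair of variables. For a Pearson correlation the standard test statistic is $t = r\sqrt{n-2}/\sqrt{1-r^2}$ with $n-2$ degrees of freedom. The sketch below applies that formula to the rounded correlations printed above, so its p-values should agree with the table only approximately. The function name `cor_p_value` is hypothetical, and it is an assumption here that `rcorr()` uses this same t approximation.

```r
# Hypothetical helper: two-sided p-value for H0: rho == 0,
# given a Pearson correlation r computed from n pairs.
cor_p_value <- function(r, n) {
  t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
  2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)
}

n_obs <- 680  # sample size reported by rcorr()

# Rounded correlations with cbwt, copied from the matrix above.
p_msmoke <- cor_p_value(r = -0.18, n = n_obs)  # cbwt vs msmoke
p_ped    <- cor_p_value(r =  0.03, n = n_obs)  # cbwt vs ped

print(c(msmoke = p_msmoke, ped = p_ped))
```

With n = 680, even a correlation as small as 0.15 in magnitude is highly significant, so a small p-value here indicates a detectable linear association, not necessarily a strong one.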

2.1.1 Solution

[answer]

2.2 (2 p) Backward selection, diagnostics of reduced model

Below I fit the linear model with all the selected main effects.

# fit full model
lm_cchd_full <- lm(cbwt ~ mage + msmoke + mht + mppwt
                        + page + ped + psmoke + pht
                      , data = dat_cchd)

library(car)
#Anova(aov(lm_cchd_full), type=3)
summary(lm_cchd_full)

Call:
lm(formula = cbwt ~ mage + msmoke + mht + mppwt + page + ped + 
    psmoke + pht, data = dat_cchd)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.2194 -0.7005  0.0236  0.6527  3.7613 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.510508   1.374821   0.371 0.710511    
mage        -0.009105   0.012791  -0.712 0.476840    
msmoke      -0.018180   0.003687  -4.931 1.03e-06 ***
mht          0.044131   0.019280   2.289 0.022389 *  
mppwt        0.009221   0.002613   3.529 0.000445 ***
page         0.008121   0.011481   0.707 0.479591    
ped          0.011172   0.019446   0.575 0.565820    
psmoke       0.002546   0.002993   0.851 0.395230    
pht          0.041670   0.016233   2.567 0.010473 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.04 on 671 degrees of freedom
Multiple R-squared:  0.1036,    Adjusted R-squared:  0.09291 
F-statistic: 9.693 on 8 and 671 DF,  p-value: 8.952e-13
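
Steps 3 and 4 of the strategy (backward elimination, then residual and influence checks) can be sketched as follows. To keep the example runnable on its own, it uses a small synthetic data set, which is NOT the CCHD data, and all `*_sim` names are hypothetical. It uses `step()` with its default AIC criterion; the worksheet does not say which criterion to use, so that choice is an assumption (setting `k = log(n)` would give BIC instead). For the worksheet's model the analogous call would presumably be `step(lm_cchd_full, direction = "backward")`.

```r
# Synthetic stand-in data (NOT the CCHD data); used only so the sketch runs.
set.seed(1)
n_sim <- 200
dat_sim <- data.frame(
  msmoke = rpois(n_sim, lambda = 7),
  mht    = rnorm(n_sim, mean = 64, sd = 2.5),
  mppwt  = rnorm(n_sim, mean = 127, sd = 18),
  junk   = rnorm(n_sim)  # predictor unrelated to the response
)
dat_sim$cbwt <- 2 - 0.02 * dat_sim$msmoke + 0.05 * dat_sim$mht +
  0.01 * dat_sim$mppwt + rnorm(n_sim)

# Full main-effects model on the synthetic data.
lm_sim_full <- lm(cbwt ~ msmoke + mht + mppwt + junk, data = dat_sim)

# Backward elimination by AIC (step()'s default penalty, k = 2).
lm_sim_red <- step(lm_sim_full, direction = "backward", trace = 0)
print(formula(lm_sim_red))

# Influence diagnostics for the reduced model.
cooks_d  <- cooks.distance(lm_sim_red)
leverage <- hatvalues(lm_sim_red)

# Rule-of-thumb screen (an assumption, not a formal test): Cook's D > 4 / n.
flagged <- which(cooks_d > 4 / n_sim)
print(flagged)

# Standard residual plots: residuals vs fitted, QQ, scale-location, leverage.
# par(mfrow = c(2, 2)); plot(lm_sim_red)
```

Flagged cases are candidates for a closer look, not automatic deletions. If a case is removed, for example with the commented-out `slice()` call in the data-loading code, the selection and diagnostics should be rerun, as step 5 of the strategy suggests.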