The California Child Health and Development Study involved women on the Kaiser Health plan who received prenatal care and later gave birth in the Kaiser clinics. Approximately 19,000 live-born children were delivered in the 20,500 pregnancies. We consider the subset of the 680 live-born white male infants in the study. Data were collected on a variety of features of the child, the mother, and the father.
The columns in the data set are, from left to right:
col  var name  description
1    id        ID
2    cheadcir  child's head circumference (inches)
3    clength   child's length (inches)
4    cbwt      child's birth weight (pounds), $y$ response
5    gest      gestation (weeks)
6    mage      maternal age (years)
7    msmoke    maternal smoking (cigarettes/day)
8    mht       maternal height (inches)
9    mppwt     maternal pre-pregnancy weight (pounds)
10   page      paternal age (years)
11   ped       paternal education (years)
12   psmoke    paternal smoking (cigarettes/day)
13   pht       paternal height (inches)
library(tidyverse)
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Leading 0s cause otherwise numeric columns to be class character.
# Thus, we add the column format "col_double()" for those columns with
# leading 0s that we wish to be numeric.
dat_cchd <-
  read_csv(
    "ADA2_CL_05_cchd-birthwt.csv"
  , col_types =
      cols(
        msmoke = col_double()
      , mppwt  = col_double()
      , ped    = col_double()
      , psmoke = col_double()
      )
  ) %>%
  # only keep the variables we're analyzing
  select(
    cbwt
  , mage, msmoke, mht, mppwt
  , page, psmoke, pht, ped
  )
  # %>%
  # slice(
  #   -123  # -123 excludes observation (row number) 123
  # )
str(dat_cchd)
A goal here is to build a multiple regression model to predict child’s birth weight (column 4, cbwt) from the data on the mother and father (columns 6–13). A reasonable strategy would be to:
1. Examine the relationship between birth weight and the potential predictors.
2. Decide whether any of the variables should be transformed.
3. Perform a backward elimination using the desired response and predictors.
4. Given the selected model, examine the residuals and check for influential cases.
5. Repeat the process, if necessary.
6. Interpret the model and discuss any model limitations.
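The selection step in the strategy above can be sketched in base R. This is a minimal sketch under stated assumptions, not the prescribed solution: `dat_sim` is simulated stand-in data (only the column names mirror the study), the data-generating coefficients are made up for illustration, and `step()` with a BIC penalty is just one common way to run backward elimination.

```r
# Simulated stand-in data (NOT the CCHD data); only the column names mirror the study.
set.seed(1)
n <- 200
dat_sim <- data.frame(
  mage   = round(rnorm(n, 26, 5)),
  msmoke = rpois(n, 7),
  mht    = rnorm(n, 64, 2.5),
  mppwt  = rnorm(n, 127, 18)
)
# Assumed data-generating model, chosen only for illustration
dat_sim$cbwt <- 2 + 0.08 * dat_sim$mht - 0.02 * dat_sim$msmoke + rnorm(n, 0, 1)

# Full model with all candidate predictors
lm_full <- lm(cbwt ~ mage + msmoke + mht + mppwt, data = dat_sim)

# Backward elimination; k = log(n) gives the BIC penalty (k = 2 would give AIC)
lm_back <- step(lm_full, direction = "backward", k = log(nrow(dat_sim)), trace = 0)

# Coefficient table for the selected model
summary(lm_back)$coefficients
```

With the real data, the same two calls would be applied to `dat_cchd`; diagnostics for the selected model still need to be checked before interpreting it.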
(1 p) Looking at the data
Describe any patterns you see in the data. Are the ranges for each variable reasonable? Are there extreme or unusual observations? Are there strong nonlinear trends with the response that suggest a transformation?
summary(dat_cchd)
cbwt mage msmoke mht
Min. : 3.300 Min. :15.00 Min. : 0.000 Min. :57.00
1st Qu.: 6.800 1st Qu.:21.00 1st Qu.: 0.000 1st Qu.:63.00
Median : 7.600 Median :25.00 Median : 0.000 Median :64.00
Mean : 7.516 Mean :25.86 Mean : 7.431 Mean :64.43
3rd Qu.: 8.200 3rd Qu.:29.00 3rd Qu.:12.000 3rd Qu.:66.00
Max. :11.400 Max. :42.00 Max. :50.000 Max. :71.00
mppwt page psmoke pht ped
Min. : 85.0 Min. :18.0 Min. : 0.00 Min. :62.00 Min. : 6.00
1st Qu.:115.0 1st Qu.:24.0 1st Qu.: 0.00 1st Qu.:69.00 1st Qu.:12.00
Median :125.0 Median :28.0 Median :12.00 Median :71.00 Median :14.00
Mean :126.9 Mean :28.8 Mean :14.44 Mean :70.62 Mean :13.38
3rd Qu.:135.0 3rd Qu.:33.0 3rd Qu.:25.00 3rd Qu.:72.00 3rd Qu.:16.00
Max. :246.0 Max. :52.0 Max. :50.00 Max. :79.00 Max. :16.00
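The summary above hints at possible extreme values (for example, the maximum of mppwt is 246 while its third quartile is 135). One rough way to screen every column at once is to count values outside the 1.5 × IQR fences. The helper `count_outside_fences()` and the toy data frame below are hypothetical names introduced for illustration; the fence rule is a screening convention, not a test.

```r
# Count values outside the 1.5 * IQR fences for each column of a numeric data frame.
# (Hypothetical helper, for illustration only.)
count_outside_fences <- function(dat) {
  sapply(dat, function(x) {
    q     <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
    fence <- 1.5 * diff(q)
    sum(x < q[1] - fence | x > q[2] + fence, na.rm = TRUE)
  })
}

# Toy data: column a has one planted extreme value, column b has none
dat_toy <- data.frame(
  a = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100),
  b = c(10, 11, 12, 13, 14, 15, 16, 17, 18, 19)
)
count_outside_fences(dat_toy)
```

Calling `count_outside_fences(dat_cchd)` would give a per-variable count to follow up on in the scatterplot matrix.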
library(ggplot2)
library(GGally)
Registered S3 method overwritten by 'GGally':
method from
+.gg ggplot2
#p <- ggpairs(dat_cchd)
# put scatterplots on top so y axis is vertical
p <-
  ggpairs(
    dat_cchd
  , upper = list(continuous = wrap("points", alpha = 0.2, size = 0.5))
  , lower = list(continuous = "cor")
  )
print(p)
Error in e_plot_lm_diagostics(lm_cchd_final, sw_plot_set = "simpleAV"): could not find function "e_plot_lm_diagostics"
Discuss the diagnostics in terms of influential observations or problematic structure in the residuals. In particular, if an observation is influential, describe how it is influential; does it change the slope, intercept, or both for the regression surface?
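To judge how an observation is influential, base R's `cooks.distance()` and `dfbetas()` can be used alongside the diagnostic plots. The sketch below uses simulated data with one deliberately planted high-leverage point (not the CCHD data); the 2/sqrt(n) cutoff is a common rule of thumb, not a formal test.

```r
# Simulated data with one planted high-leverage outlier (illustration only)
set.seed(2)
x <- c(rnorm(30, 10, 1), 20)
y <- c(2 + 0.5 * x[1:30] + rnorm(30, 0, 0.3), 2)  # last point lies far off the trend
fit <- lm(y ~ x)

cook <- cooks.distance(fit)  # overall influence of each case on the fitted values
dfb  <- dfbetas(fit)         # standardized change in each coefficient when a case is dropped

# Case with the largest Cook's distance
i_max <- which.max(cook)
i_max

# Which coefficients does that case move?
# A large |value| under "x" means the slope changes; under "(Intercept)", the intercept.
dfb[i_max, ]
abs(dfb[i_max, ]) > 2 / sqrt(nrow(dfb))
```

In a multiple regression, `dfbetas()` returns one column per coefficient, so the same check shows which partial slopes of the regression surface a case is pulling.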
Solution
[answer]
(3 p) Address model fit
If the model doesn’t fit well (diagnostics tell you this, not $R^2$ or significance tests), then address the lack of model fit. Transformations and removing influential points are two strategies. The decisions you make should be based on what you observed in the residual plots. If there’s an influential observation, remove it and see how that affects the backward selection (whether the same predictors are retained), the model fit (diagnostics), and regression coefficient estimates (betas). If there’s a pattern in the residuals that can be addressed by a transformation, guess at the appropriate transformation and try it.
Repeat until you are satisfied that the diagnostics meet the model assumptions. Below, briefly outline what you did (no need to show all the output) by (1) identifying what you observed in the diagnostics and (2) the strategy you took to address that issue. Finally, show the final model and its diagnostics. Describe how the final model differs from the original; in particular, discuss whether the variables retained differ from those of the original backward selection and whether the signs and magnitudes of the regression coefficients are much different.
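One way to document the effect of removing an influential observation is to refit without it and put the coefficients side by side. The sketch below is hedged accordingly: the data are simulated with one planted outlier, the names `fit_all`, `fit_drop`, and `cmp` are hypothetical, and dropping the case with the largest Cook's distance is only one possible choice.

```r
# Simulated data with one planted influential case (illustration only)
set.seed(3)
dat_sim <- data.frame(x = c(rnorm(40, 10, 1), 20))
dat_sim$y <- c(2 + 0.5 * dat_sim$x[1:40] + rnorm(40, 0, 0.3), 2)

# Fit with all rows, then refit without the case with the largest Cook's distance
fit_all  <- lm(y ~ x, data = dat_sim)
i_drop   <- unname(which.max(cooks.distance(fit_all)))
fit_drop <- lm(y ~ x, data = dat_sim[-i_drop, ])

# Side-by-side coefficients and their percent change relative to the full fit
cmp <- cbind(all = coef(fit_all), dropped = coef(fit_drop))
cmp <- cbind(cmp, pct_change = 100 * (cmp[, "dropped"] - cmp[, "all"]) / abs(cmp[, "all"]))
round(cmp, 3)
```

With the real data, the backward selection would be rerun on the reduced data set as well, since the set of retained predictors can change once a case is removed.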
Solution
[answer]
(3 p) Interpret the final model
What proportion of variation in the response does the model explain over the mean of the response? (This quantity indicates how precisely this model will predict new observations.)
Finally, write the equation for the final model and interpret each model coefficient. Do these quantities make sense?
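The proportion of variation explained over the mean is $R^2 = 1 - SSE/SST$, which `summary()` reports as `r.squared`. The sketch below computes it by hand and assembles the fitted equation from the coefficient estimates. It uses R's built-in `cars` data purely as a stand-in for the final model; the variable names `sse`, `sst`, and `eqn` are introduced here for illustration.

```r
# Built-in 'cars' data used purely as a stand-in for the final model
fit <- lm(dist ~ speed, data = cars)

sse <- sum(resid(fit)^2)                     # variation left over after fitting the model
sst <- sum((cars$dist - mean(cars$dist))^2)  # variation around the mean of the response alone
r2  <- 1 - sse / sst

# The hand computation should agree with the value reported by summary()
all.equal(r2, summary(fit)$r.squared)

# Assemble the fitted equation from the rounded coefficient estimates
b   <- round(coef(fit), 3)
eqn <- paste0(
  "dist-hat = ", b[1],
  paste0(sprintf(" %+.3f * %s", b[-1], names(b)[-1]), collapse = "")
)
eqn
```

Each slope in the equation is read as the expected change in the response for a one-unit increase in that predictor, holding the other predictors in the model fixed.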
Solution
[answer]
(1 p) Inference to whom
To which population of people does this model make inference? Does this generalize to all humans?
Sometimes this is called the “limitations” section. Carefully specifying the population to which inference applies often accounts for the limitations.