---
title: "ADA1: Ch 09 Correlation and Regression, Diagnostics examples"
subtitle: "Video supplement"
author: Erik Erhardt
date: last-modified
description: |
[Advanced Data Analysis 1](https://StatAcumen.com/teach/ada1),
Stat 427/527, Fall 2023, Prof. Erik Erhardt, UNM
format:
html:
theme: litera
highlight-style: atom-one
page-layout: full # article, full # https://quarto.org/docs/output-formats/page-layout.html
toc: true
toc-location: body # body, left, right
number-sections: false
self-contained: false # !!! this can cause a render error
code-overflow: scroll # scroll, wrap
code-block-bg: true
code-block-border-left: "#30B0E0"
code-copy: false # true, false, hover a copy buttom in top-right of code block
fig-width: 6
fig-height: 4
---
From .
# Correlation and Regression {#ch:ADA1_08_CorrRegression}
```{r}
library(erikmisc)
library(tidyverse)
```
### Residual and Diagnostic Analysis of the Blood Loss Data
We looked at much of this
before, but let us go through the above steps systematically.
Recall the data set (we want to predict blood loss from weight):
```{r ex08-10}
#### Residual and Diagnostic Analysis of the Blood Loss Data
dat_thyroid <- read.table(text="
weight time blood_loss
44.3 105 503
40.6 80 490
69.0 86 471
43.7 112 505
50.3 109 482
50.2 100 490
35.4 96 513
52.2 120 464
", header=TRUE) |>
mutate(
# create data ids
id = 1:n()
)
# show the structure of the data.frame
str(dat_thyroid)
# display the data.frame
dat_thyroid
```
1. *Plot the data.*
Plot blood loss vs. weight.
```{r res_bloodloss_weight_lm_ggplot, include=TRUE, fig.width=6, fig.height=6}
# ggplot: Plot the data with linear regression fit and confidence bands
library(ggplot2)
p <- ggplot(dat_thyroid, aes(x = weight, y = blood_loss, label = id))
p <- p + geom_point()
# plot labels next to points
p <- p + geom_text(hjust = 0.5, vjust = -0.5)
# plot regression line and confidence band
p <- p + geom_smooth(method = lm)
print(p)
```
Clearly the heaviest individual is an unusual value that warrants
a closer look (maybe data recording error).
I might be inclined to try a transformation here (such as `log(weight)`) to
make that point a little less influential.
2. *Do any obvious transformations of the data.*
We will look at transformations later.
3. *Fit the least squares equation.*
Blood Loss appears significantly negatively associated with weight.
```{r ex08-12}
lm_blood_wt <- lm(blood_loss ~ weight, data = dat_thyroid)
# use summary() to get t-tests of parameters (slope, intercept)
summary(lm_blood_wt)
```
1. *Graphs: Check Standardized Residuals (or the Deleted Residuals).*
The residual plots:
```{r res_bloodloss_diag, include=TRUE, fig.width=8.5, fig.height=3}
e_plot_lm_diagnostics(lm_blood_wt, sw_plot_set = "simple")
```
4. *Examine the residual plots and results.*
1. *Do you see curvature?*
There does not appear to be curvature (and it could be hard to detect with so few points).
2. *Does it appear $\sigma_{Y|X}$ depends upon X?*
Not much evidence for this.
3. *Do you see obvious outliers?*
Observation 3 is an outlier in the $x$ direction, and therefore possibly a
high leverage point and influential on the model fit.
4. *Is the normality assumption reasonable?*
There appears to be some skewness, but with so few points normality may be reasonable.
5. *Is there a striking pattern in residuals vs. order of the data?*
No striking pattern.
5. *Check the Cook's $D$ values.*
We anticipated that the $3^{\textrm{rd}}$ observation is affecting the fit by a lot more than any other values.
The $D$-value is much larger than 1.
Note that the residual is not large for this value.
6. *Omit problem observations from the analysis and see if any conclusions change substantially.*
Let us refit the equation without observation 3 to see if
anything changes drastically. I will use the weighted least
squares approach discussed earlier on this example. Define a
variable `wt` that is 1 for all observations except obs. 3,
and make it 0 for that one.
```{r ex08-13}
# wt = 1 for all except obs 3 where wt = 0
dat_thyroid <-
dat_thyroid |>
mutate(
wt = ifelse(id == 3, 0, 1)
)
dat_thyroid$wt
```
What changes by deleting case 3? The fitted line gets steeper
(slope changes from $-1.30$ to $-2.19$), adjusted $R^2$ gets larger
(up to 58% from 53%), and $S$ changes from 11.7 to 10.6.
Because the Weight values are much less spread out,
$SE(\hat{\beta_1})$ becomes quite a bit larger (to 0.714, up from 0.436)
and we lose a degree of freedom for MS Error (which will
penalize us on tests and CIs). Just about any quantitative
statement we would want to make using CIs would be about the same
either way since CIs will overlap a great deal, and our
qualitative interpretations are unchanged (Blood Loss drops with
Weight). Unless something shows up in the plots, I don't see any
very important changes here.
```{r ex08-14}
lm_blood_wt_no3 <- lm(blood_loss ~ weight, data = dat_thyroid, weights = wt)
# use summary() to get t-tests of parameters (slope, intercept)
summary(lm_blood_wt_no3)
```
```{r res_no3_bloodloss_weight_lm_ggplot, include=TRUE, fig.width=6, fig.height=6}
# exclude obs 3
dat_thyroid_no3 <-
dat_thyroid |>
filter(
wt == 1
)
# ggplot: Plot the data with linear regression fit and confidence bands
library(ggplot2)
p <- ggplot(dat_thyroid, aes(x = weight, y = blood_loss, label = id))
p <- p + geom_point()
# plot labels next to points
p <- p + geom_text(hjust = 0.5, vjust = -0.5)
# plot regression line and confidence band
p <- p + geom_smooth(method = lm, color = "blue", se = FALSE, linetype = "solid")
p <- p + geom_smooth(data = dat_thyroid_no3, method = lm, color = "red", se = FALSE, linetype = "dashed", fullrange = TRUE)
p <- p + labs(caption = "Blue solid is full data, Red dashed is without obs 3.")
print(p)
```
Nothing very striking shows up in the residual plots, and no
Cook's $D$ values are very large among the remaining observations.
```{r res_no3_bloodloss_diag, include=TRUE, fig.width=8.5, fig.height=3}
e_plot_lm_diagnostics(lm_blood_wt_no3, sw_plot_set = "simple")
```
How much difference is there in a practical sense? Examine the
95% prediction interval for a new observation at Weight = 50kg.
Previously we saw that interval based on all 8 observations was
from 457.1 to 517.8 ml of Blood Loss. Based on just the 7
observations the prediction interval is 451.6 to 512.4 ml. There
really is no practical difference here.
```{r ex08-15}
# CI for the mean and PI for a new observation at weight=50
predict(lm_blood_wt , data.frame(weight=50), interval = "prediction")
predict(lm_blood_wt_no3, data.frame(weight=50), interval = "prediction")
```
Therefore, while obs. 3 was potentially influential,
whether the value is included or not
makes very little difference in the model fit or relationship between Weight and BloodLoss.
### Gesell data
These data are from a UCLA study of cyanotic heart disease in
children. The predictor is the age of the child in months at first
word and the response variable is the Gesell adaptive score, for
each of 21 children.
```{r ex08-16, echo=FALSE, size='tiny'}
#### Gesell data
dat_gesell <- read.table(text="
id age score
1 15 95
2 26 71
3 10 83
4 9 91
5 15 102
6 20 87
7 18 93
8 11 100
9 8 104
10 20 94
11 7 113
12 9 96
13 10 83
14 11 84
15 11 102
16 10 100
17 12 105
18 42 57
19 17 121
20 11 86
21 10 100
", header=TRUE)
# show the structure of the data.frame
str(dat_gesell)
# display the data.frame
dat_gesell
```
Let us go through the same steps as before.
1. Plot Score versus Age. Comment on the relationship between
Score and Age.
```{r gesell_score_age_lm_ggplot, include=TRUE, fig.width=8, fig.height=6}
# ggplot: Plot the data with linear regression fit and confidence bands
library(ggplot2)
p <- ggplot(dat_gesell, aes(x = age, y = score, label = id))
p <- p + theme_bw()
p <- p + geom_point()
# plot labels next to points
p <- p + geom_text(hjust = 0.5, vjust = -0.5)
# plot regression line and confidence band
#p <- p + geom_smooth(method = lm)
p <- p + geom_smooth(method = lm, color = "blue", se = FALSE, linetype = "solid")
p <- p + geom_smooth(data = dat_gesell |> filter(!(id == 18)), method = lm, color = "red", se = FALSE, linetype = "dashed", fullrange = TRUE)
p <- p + geom_smooth(data = dat_gesell |> filter(!(id == 19)), method = lm, color = "purple", se = FALSE, linetype = "dotted", fullrange = TRUE)
p <- p + geom_smooth(data = dat_gesell |> filter(!(id %in% c(18, 19))), method = lm, color = "orange", se = FALSE, linetype = "solid", fullrange = TRUE)
p <- p + labs(caption = "Blue solid is full data, Red dashed is without obs 18, Green dotted is without obs 19.")
print(p)
```
2. There are no obvious transformations to try here.
3. Fit a simple linear regression model. Provide an equation
for the LS line. Does age at first word appear to be an "important
predictor" of Gesell adaptive score? (i.e., is the estimated slope
significantly different from zero?)
```{r ex08-18}
lm_score_age <- lm(score ~ age, data = dat_gesell)
# use summary() to get t-tests of parameters (slope, intercept)
summary(lm_score_age)
```
4. Do these plots suggest any inadequacies with the model?
```{r gesell_score_age_diag, include=TRUE, fig.width=8.5, fig.height=3}
e_plot_lm_diagnostics(lm_score_age, sw_plot_set = "simple")
```
#### What if restrict to age <= 20?
```{r gesell_score_age_lm_ggplot2, include=TRUE, fig.width=8, fig.height=6}
# ggplot: Plot the data with linear regression fit and confidence bands
library(ggplot2)
p <- ggplot(dat_gesell |> filter(age <= 20), aes(x = age, y = score, label = id))
p <- p + theme_bw()
p <- p + geom_point()
# plot labels next to points
p <- p + geom_text(hjust = 0.5, vjust = -0.5)
# plot regression line and confidence band
#p <- p + geom_smooth(method = lm)
p <- p + geom_smooth(method = lm, color = "blue", se = FALSE, linetype = "solid")
print(p)
```