---
title: "ADA1: Class 25, Simple linear regression"
author: Your Name
date: last-modified
description: |
[Advanced Data Analysis 1](https://StatAcumen.com/teach/ada1),
Stat 427/527, Fall 2023, Prof. Erik Erhardt, UNM
format:
html:
theme: litera
highlight-style: atom-one
page-layout: full # article, full # https://quarto.org/docs/output-formats/page-layout.html
toc: true
toc-location: body # body, left, right
number-sections: false
self-contained: false # !!! this can cause a render error
code-overflow: scroll # scroll, wrap
code-block-bg: true
code-block-border-left: "#30B0E0"
code-copy: false # true, false, hover a copy buttom in top-right of code block
fig-width: 6
fig-height: 4
---
# Rubric
```{r load-packages, message=FALSE}
library(erikmisc)
library(tidyverse)
ggplot2::theme_set(ggplot2::theme_bw()) # set theme_bw for all plots
```
Answer the questions with the data example.
---
# Height vs Hand Span
In a previous year, this was the procedure for collecting data:
1. Record your height in inches. For example 5'0" is 60 inches.
2. Use a ruler to measure your hand span in centimeters:
the distance from the tip of your thumb to pinky finger
with your hand splayed as wide as possible.
3. Have someone from your table enter your measurements at this link (as well as your gender):
[link not provided here]
4. Analysis.
```{R}
# Height vs Hand Span
dat_hand <-
read_csv("ADA1_CL_09_Data-CorrHandSpan.csv") |>
na.omit() |>
mutate(
Gender_M_F = factor(Gender_M_F, levels = c("M", "F"))
, Year = Year |> factor()
)
str(dat_hand)
```
Plot data for `Height_in` vs `HandSpan_cm` for Females and Males.
```{R}
#| fig-width: 6
#| fig-height: 4
library(ggplot2)
p <- ggplot(dat_hand, aes(x = HandSpan_cm, y = Height_in))
p <- p + theme_bw()
# linear regression fit and confidence bands
p <- p + geom_smooth(method = lm, se = TRUE)
# jitter a little to uncover duplicate points
p <- p + geom_jitter(position = position_jitter(.1), alpha = 0.5)
# separate for Females and Males
p <- p + facet_wrap(~ Gender_M_F, nrow = 1)
print(p)
```
__Choose either Females or Males for the remaining analysis.__
Uncomment one of the `Gender_M_F ==` lines below to choose Females or Males.
```{R}
# choose one:
dat_use <-
dat_hand |>
filter(
# Gender_M_F == "F"
# Gender_M_F == "M"
)
```
__Plan:__
1. Center the explanatory variable `HandSpan_cm`,
2. fit a simple linear regression model,
3. check model assumptions,
4. interpret the parameter estimate table, and
5. interpret a confidence and prediction interval.
## Center the explanatory variable `HandSpan_cm`
Recentering the $x$-variable doesn't change the model,
but it does provide an interpretation for the intercept of the model.
For example, if you interpret the intercept for the regression lines above,
it's the "expected height for a person with a hand span of zero",
but that's not meaningful.
Choose a value to center your data on.
A good choice is a nice round number near the mean (or center) of your data.
This becomes the value for the interpretation of your intercept.
```{R}
val_center <- 20
dat_use <-
dat_use |>
mutate(
HandSpan_cm_centered = HandSpan_cm - val_center
)
```
## Fit a simple linear regression model
```{R}
# fit model
lm_fit <-
lm(
Height_in ~ HandSpan_cm_centered
, data = dat_use
)
```
Here's the data you're using for the linear regression,
with the regression line and confidence and prediction intervals.
```{R}
#| fig-width: 6
#| fig-height: 5
library(ggplot2)
p <- ggplot(dat_use, aes(x = HandSpan_cm_centered, y = Height_in))
p <- p + theme_bw()
p <- p + geom_vline(xintercept = 0, alpha = 0.25)
# prediction bands
p <- p + geom_ribbon(aes(ymin = predict(lm_fit, data.frame(HandSpan_cm_centered)
, interval = "prediction", level = 0.95)[, 2],
ymax = predict(lm_fit, data.frame(HandSpan_cm_centered)
, interval = "prediction", level = 0.95)[, 3],)
, alpha=0.1, fill="darkgreen")
# linear regression fit and confidence bands
p <- p + geom_smooth(method = lm, se = TRUE)
# jitter a little to uncover duplicate points
p <- p + geom_jitter(position = position_jitter(.1), alpha = 0.5)
p <- p + labs(
title = "Regression with confidence and prediction bands"
, caption = paste0("Handspan centered at ", val_center, " cm.")
)
print(p)
```
## Check model assumptions
Present and interpret the residual plots with respect to model assumptions.
```{R}
#| fig-width: 9
#| fig-height: 3
e_plot_lm_diagnostics(
lm_fit
#, rc_mfrow = c(1, 2)
, sw_plot_set = "simple"
)
```
### (1 p) Assess normality
If the normality assumption seems to be violated, perform a normality test on the standardized residuals.
### (1 p) Assess residuals
Do the residuals versus the fitted values and `HandSpan_cm_centered` values appear random?
Or is there a pattern?
## Investigate the relative influence of points
Investigate the leverages and Cook's Distance.
There are recommendations for what's considered large,
for example, a $3p/n$ cutoff for large leverages,
and a cutoff of 1 for large Cook's D values.
I find it more practical to consider the relative
leverage or Cook's D between all the points and
worry when there are only a few that are much more influential than others.
Here's a plot that duplicates a plot above.
Here, the observation number is used as both the plotting point and a label.
```{R}
#| fig-width: 6
#| fig-height: 5
# plot diagnistics
par(mfrow=c(1,2))
plot(influence(lm_fit)$hat, main="Leverages", type = "n")
text(1:nrow(dat_use), influence(lm_fit)$hat, label=paste(1:nrow(dat_use)))
# horizontal line at zero
abline(h = 3 * 2 / nrow(dat_use), col = "gray75")
plot(cooks.distance(lm_fit), main="Cook's Distances", type = "n")
text(1:nrow(dat_use), cooks.distance(lm_fit), label=paste(1:nrow(dat_use)))
# horizontal line at zero
abline(h = qchisq(0.1, 2) / 2, col = "gray75")
```
### (1 p) Interpret the leverages and Cook's D values
Are any observations are having undue influence on model fit?
## Interpret the parameter estimate table
Here's the parameter estimate table.
We're estimating the $\beta$ parameter coefficients in the regression model
$y_i = \beta_0 + \beta_1 x_i + e_i$.
```{R}
#| fig-width: 4
#| fig-height: 3
summary(lm_fit)
e_plot_model_contrasts(lm_fit, dat_cont = dat_use)
```
### (1 p) Write model equation
Assuming the model fits well, complete this equation
(fill in the $\beta$ values with values from the table)
with the appropriate numbers from the table above
(3 numbers: each beta and the HandSpan centering value).
The regression line is
$\hat{\textrm{Height}} = \hat{\beta}_0 + \hat{\beta}_1 (\textrm{HandSpan} - `r val_center`)$.
### (2 p) Hypothesis test of regression line slope
State the hypothesis test related to the slope of the line,
indicate the p-value for the test,
and state the conclusion.
Words and notation:
* Words:
* Notation: $H_0:\beta_? = ?$ vs $H_A:\beta_? \ne ?$
### (1 p) Interpret slope
Interpret the slope coefficient in the context of the model.
### (0.5 p) Interpret intercept
Interpret the intercept in the context of the model.
### (0.5 p) Interpret $R^2$
State and interpret the $R^2$ value.
## Interpret a confidence and prediction interval
Below is a 95% confidence interval (CI) for the mean (the regression line)
and a prediction interval (PI) for a new observation
at $\textrm{HandSpan centered} = -1$.
See how these match up with the plot above.
```{R}
predict(lm_fit, data.frame(HandSpan_cm_centered = -1)
, interval = "confidence", level = 0.95)
predict(lm_fit, data.frame(HandSpan_cm_centered = -1)
, interval = "prediction", level = 0.95)
```
### (1 p) Interpret the CI
### (1 p) Interpret the PI