---
title: "ADA1: Class 08, Introduction to linear regression"
author: Your Name
date: last-modified
description: |
  [Advanced Data Analysis 1](https://StatAcumen.com/teach/ada1),
  Stat 427/527, Fall 2023, Prof. Erik Erhardt, UNM
format:
  html:
    theme: litera
    highlight-style: atom-one
    page-layout: full # article, full # https://quarto.org/docs/output-formats/page-layout.html
    toc: true
    toc-location: body # body, left, right
    number-sections: false
    self-contained: false # !!! this can cause a render error
    code-overflow: scroll # scroll, wrap
    code-block-bg: true
    code-block-border-left: "#30B0E0"
    code-copy: false # true, false, hover a copy button in top-right of code block
fig-width: 6
fig-height: 4
---
# Rubric
The context of this assignment comes from
[OpenIntro Labs](http://openintrostat.github.io/oilabs-tidy/) for R and tidyverse:
* [8. Simple linear regression](http://openintrostat.github.io/oilabs-tidy/08_simple_regression/simple_regression.html)
_This is a template for the assignment. Modify this and turn it in._
Some questions are answered by the code you've written. In those cases, in your answer write "see code".
```{r load-packages, message=FALSE}
library(erikmisc)
library(tidyverse)
library(openintro)
#library(statsr)
library(broom)
```
The Human Freedom Index is a report that attempts to summarize the idea of "freedom" through many different variables for many countries around the globe.
It serves as a rough objective measure of the relationships between the different types of freedom (political, religious, economic, or personal) and other social and economic circumstances.
The Human Freedom Index is an annually co-published report by the Cato Institute, the Fraser Institute, and the Liberales Institut at the Friedrich Naumann Foundation for Freedom.
In this lab, you'll be analysing data from the Human Freedom Index reports.
Your aim will be to summarize a few of the relationships within the data both graphically and numerically in order to find which variables can help tell a story about freedom.
# Getting Started
The data we're working with is in the `openintro` package and it's called `hfi`, short for Human Freedom Index.
## (1 p) 1. Data basics
1. *What are the dimensions of the dataset?*
2. *What does each row represent?*
* Write the text of your answer here...
```{r}
# Insert code here
data(hfi, package = "openintro")
attr(hfi, "spec") <- NULL # remove variable class specification attribute
dim(hfi)
hfi
```
## (1 p) 2. Data subset `hfi_2016`
1. *The dataset spans a lot of years, but we are only interested in data from year 2016.*
2. *Filter the `hfi` data frame for year 2016, select the listed variables, and assign the result to a data frame named `hfi_2016`.*
    * `year`
    * `ISO_code`
    * `countries`
    * `region`
    * `pf_expression_control`
    * `pf_score`
    * `hf_score`
* Write the text of your answer here...
```{r}
# Insert code here
hfi_2016 <-
  hfi
# more code here...
```
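As a hedged illustration of the filter-then-select pattern this question asks for, here is a minimal sketch on made-up toy data, not on `hfi`. The data frame `toy_scores`, its columns, and `toy_2016` are hypothetical names invented for this example; only `filter()` and `select()` from `dplyr` carry over to your own answer.

```{r}
library(dplyr)

# Made-up toy data (NOT the hfi data); every name here is hypothetical
toy_scores <- data.frame(
  year    = c(2015, 2016, 2016, 2016),
  country = c("A", "A", "B", "C"),
  score_x = c(5.0, 5.5, 7.0, 8.5),
  score_y = c(6.1, 6.4, 7.2, 8.0),
  extra   = c(1, 2, 3, 4)
)

toy_2016 <-
  toy_scores %>%
  filter(year == 2016) %>%          # keep only the rows from 2016
  select(year, country, score_x)    # keep only the listed columns

dim(toy_2016)                       # rows, then columns
```

Filtering first and selecting second keeps the `year` condition readable, although either order works here because `year` is among the selected columns.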
## (1 p) 3. Model 1, Linear relationship?
1. *What type of plot would you use to display the relationship between the personal freedom score, `pf_score`, and `pf_expression_control`?*
2. *Plot this relationship using the variable `pf_expression_control` as the predictor. Does the relationship look linear?*
3. *`pf_expression_control` scores political pressures and controls on media content on a scale from 0 to 10, with 0 being the most pressure. If you knew a country's `pf_expression_control`, would you be comfortable using a linear model to predict its personal freedom score?*
4. *If the relationship looks linear, quantify the strength of the relationship with the correlation coefficient.*
* Write the text of your answer here...
```{r}
# Insert code here
```
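The sketch below shows, on made-up toy data, one way to draw a scatterplot with the predictor on the x-axis and then compute the Pearson correlation coefficient. The data frame `toy_df` and the object `toy_r` are hypothetical names, and base R graphics are used only to keep the example free of extra packages; a `ggplot2` version would work equally well.

```{r}
# Made-up toy data (NOT the hfi data); names are hypothetical
toy_df <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2.1, 3.9, 6.2, 7.8, 10.1)
)

# Scatterplot: predictor on the x-axis, response on the y-axis
plot(toy_df$x, toy_df$y, xlab = "x (predictor)", ylab = "y (response)")

# Pearson correlation: values near +1 or -1 indicate a strong linear relationship
toy_r <- cor(toy_df$x, toy_df$y)
toy_r
```

The correlation coefficient only summarizes linear association, so it is worth looking at the plot before relying on the number.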
# Sum of squared residuals
## (1 p) 4. Model 1, Residuals
1. *Looking at your plot from the previous exercise, describe the relationship between these two variables.*
2. *Make sure to discuss the form, direction, and strength of the relationship as well as any unusual observations.*
* Write the text of your answer here...
```{r}
# Insert code here
```
## (0 p) 5. Model 1, Least squares line, building intuition (practice, only)
1. *Using `plot_ss`, choose a line that does a good job of minimizing the sum of squares.*
2. *Run the function several times. What was the smallest sum of squares that you got?*
3. *How does it compare to the values your neighbours got?*
```{r, eval=FALSE}
statsr::plot_ss(x = pf_expression_control, y = pf_score, data = hfi_2016, showSquares = TRUE)
```
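To make the quantity being minimized concrete, the following sketch computes a sum of squared residuals by hand for a guessed line and for the least squares line. It uses made-up toy data; `toy_df`, `toy_ss()`, `ss_guess`, and `ss_ls` are hypothetical names introduced only for this illustration.

```{r}
# Made-up toy data (NOT the hfi data); names are hypothetical
toy_df <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2.1, 3.9, 6.2, 7.8, 10.1)
)

# Hypothetical helper: sum of squared residuals for the line y = b0 + b1 * x
toy_ss <- function(b0, b1, x, y) {
  sum((y - (b0 + b1 * x))^2)
}

# A line chosen by eye
ss_guess <- toy_ss(b0 = 1, b1 = 1.5, x = toy_df$x, y = toy_df$y)

# The least squares line from lm()
toy_fit <- lm(y ~ x, data = toy_df)
ss_ls <- toy_ss(
  b0 = coef(toy_fit)[["(Intercept)"]],
  b1 = coef(toy_fit)[["x"]],
  x  = toy_df$x,
  y  = toy_df$y
)

c(guess = ss_guess, least_squares = ss_ls)
```

No line chosen by eye can have a smaller sum of squares than the one `lm()` returns, which is what "least squares" means.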
# The linear model
## (2 p) 6. Model 2, Linear relationship
1. *Plot the relationship between x=`pf_expression_control` and y=`hf_score`, or the total human freedom score.*
2. *Fit a new model for this relationship.*
3. *Using the estimates from the R output, write the equation of the regression line.*
4. *What does the slope tell us in the context of the relationship between human freedom and the amount of political pressure on media content?*
* Write the text of your answer here...
*(I recommend that you follow the example from the website.
I've included two example equations for you to select from and complete by replacing "beta0" and "beta1" with numbers.)*
$$
\hat{y} = \hat{\beta}_{0} + \hat{\beta}_{1} \times \textrm{pf\_expression\_control}
$$
$$
\widehat{\textrm{hf\_score}} = \hat{\beta}_{0} + \hat{\beta}_{1} \times \textrm{pf\_expression\_control}
$$
```{r}
# Insert code here
```
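One possible way to move from R output to a written equation is sketched below on made-up toy data. The names `toy_df`, `toy_fit`, and `toy_b` are hypothetical; `coef()` and `sprintf()` are standard base R functions.

```{r}
# Made-up toy data (NOT the hfi data); names are hypothetical
toy_df <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2.1, 3.9, 6.2, 7.8, 10.1)
)

toy_fit <- lm(y ~ x, data = toy_df)

# Named vector of estimates: "(Intercept)" is beta0-hat, "x" is beta1-hat
toy_b <- coef(toy_fit)
toy_b

# Paste the estimates into the equation of the fitted line
sprintf("y-hat = %.2f + %.2f * x", toy_b[["(Intercept)"]], toy_b[["x"]])
```

The slope estimate is the change in the predicted response for a one-unit increase in the predictor, which is the interpretation the question asks you to put in context.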
# Prediction and prediction errors
## (1 p) 7. Model 1, Prediction
1. *If someone saw the least squares regression line and not the actual data, how would they predict the personal freedom score (`pf_score`) for a country with a `pf_expression_control` rating of 3?*
2. *Is this an overestimate or an underestimate, and by how much? In other words, what is the residual for this prediction?*
* Write the text of your answer here...
```{r}
# Insert code here
m1 <- lm(pf_score ~ pf_expression_control, data = hfi_2016)
#summary(m1)
tidy(m1)
#glance(m1)
```
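The idea of a prediction and its residual can be sketched on made-up toy data as follows. The names `toy_df`, `toy_fit`, `toy_pred`, and `toy_resid` are hypothetical; the residual is defined as observed minus predicted, so a positive residual means the line underestimates the observed value.

```{r}
# Made-up toy data (NOT the hfi data); names are hypothetical
toy_df <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2.1, 3.9, 6.2, 7.8, 10.1)
)

toy_fit <- lm(y ~ x, data = toy_df)

# Predicted response at x = 3, using the fitted line
toy_pred <- predict(toy_fit, newdata = data.frame(x = 3))

# Observed response for the toy observation that has x = 3
toy_obs <- toy_df$y[toy_df$x == 3]

# Residual = observed - predicted
toy_resid <- toy_obs - toy_pred

c(predicted = unname(toy_pred), observed = toy_obs, residual = unname(toy_resid))
```

A residual can only be computed where an observed value exists, so in your own data you would first need to find an observation with the predictor value of interest.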
# Model diagnostics
To assess whether the linear model is reliable, we need to check for
(1) linearity,
(2) nearly normal residuals, and
(3) constant variability.
## (1 p) 8. Model 1, Residual plots, linearity
1. *Is there any apparent pattern in the residuals plot?*
2. *What does this indicate about the linearity of the relationship between the two variables?*
::: {.callout-note}
The `e_plot_lm_diagnostics()` function with the "simple" set of plots gives
6 plots to review. We will focus on 3 of these for now.
1. **QQ-plot for Normality** (points stay within band)
2. Cook's Distance for influential points (ignore for now)
3. Cook's Distance by leverage for explaining influential points (ignore for now)
4. **Residuals vs predicted values**
5. **Residuals vs each x variable**
6. Box-Cox transformation for y-variable transformation if residuals weren't normal (ignore for now)
:::
* Write the text of your answer here...
```{r}
#| fig-width: 8
#| fig-height: 3
e_plot_lm_diagnostics(m1, sw_plot_set = "simple")
```
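If it helps to see what the residual plots are built from, the sketch below draws a residuals vs. fitted plot and a normal QQ-plot using only base R on made-up toy data. This is an assumed, simplified stand-in for illustration and not a description of how `e_plot_lm_diagnostics()` works internally; `toy_df`, `toy_fit`, `toy_fitted`, and `toy_res` are hypothetical names.

```{r}
# Made-up toy data (NOT the hfi data); names are hypothetical
toy_df <- data.frame(
  x = 1:10,
  y = c(2.3, 3.8, 6.1, 8.2, 9.7, 12.4, 13.8, 16.1, 18.3, 19.6)
)

toy_fit    <- lm(y ~ x, data = toy_df)
toy_fitted <- fitted(toy_fit)     # predicted values on the line
toy_res    <- residuals(toy_fit)  # observed minus predicted

# Residuals vs fitted: look for curvature (linearity) and fanning (constant variability)
plot(toy_fitted, toy_res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normal QQ-plot: points close to the line suggest nearly normal residuals
qqnorm(toy_res)
qqline(toy_res)
```

A patternless horizontal band around zero in the first plot is what you would hope to see when the linear model is reasonable.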
## (1 p) 9. Model 1, Residual and histogram plots, normality
1. *Based on the histogram, does the nearly normal residuals condition appear to be violated? Why or why not?*
* Write the text of your answer here...
```{r}
hist(residuals(m1), breaks = 30)  # histogram of Model 1 residuals, about 30 bins
```
## (1 p) 10. Model 1, Constant variability
1. *Based on the residuals vs. fitted plot, does the constant variability condition appear to be violated? Why or why not?*
* Write the text of your answer here...
![Creative Commons License](https://i.creativecommons.org/l/by-sa/4.0/88x31.png){style="border-width:0"}

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.