```
library(erikmisc)
library(tidyverse)
library(openintro)
#library(statsr)
library(broom)
```

# ADA1: Class 08, Introduction to linear regression

Advanced Data Analysis 1, Stat 427/527, Fall 2023, Prof. Erik Erhardt, UNM

# Rubric

The context of this assignment comes from OpenIntro Labs for R and tidyverse.

*This is a template for the assignment. Modify this and turn it in.*

Some questions are answered by the code you’ve written. In those cases, in your answer write “see code”.

The Human Freedom Index is a report that attempts to summarize the idea of “freedom” through many different variables for many countries around the globe. It serves as a rough objective measure for the relationships between the different types of freedom (political, religious, economic, or personal) and other social and economic circumstances. The Human Freedom Index is co-published annually by the Cato Institute, the Fraser Institute, and the Liberales Institut at the Friedrich Naumann Foundation for Freedom.

In this lab, you’ll be analysing data from the Human Freedom Index reports. Your aim will be to summarize a few of the relationships within the data both graphically and numerically in order to find which variables can help tell a story about freedom.

# Getting Started

The data we’re working with is in the `openintro` package and it’s called `hfi`, short for Human Freedom Index.

## (1 p) 1. Data basics

*What are the dimensions of the dataset?* *What does each row represent?*

- Write the text of your answer here…

```
# Insert code here
data(hfi, package = "openintro")
attr(hfi, "spec") <- NULL # remove variable class specification attribute
dim(hfi)
```

`[1] 1458 123`

`hfi`

```
# A tibble: 1,458 × 123
year ISO_code countries region pf_rol_procedural pf_rol_civil
<dbl> <chr> <chr> <chr> <dbl> <dbl>
1 2016 ALB Albania Eastern Europe 6.66 4.55
2 2016 DZA Algeria Middle East & North… NA NA
3 2016 AGO Angola Sub-Saharan Africa NA NA
4 2016 ARG Argentina Latin America & the… 7.10 5.79
5 2016 ARM Armenia Caucasus & Central … NA NA
6 2016 AUS Australia Oceania 8.44 7.53
7 2016 AUT Austria Western Europe 8.97 7.87
8 2016 AZE Azerbaijan Caucasus & Central … NA NA
9 2016 BHS Bahamas Latin America & the… 6.93 6.01
10 2016 BHR Bahrain Middle East & North… NA NA
# ℹ 1,448 more rows
# ℹ 117 more variables: pf_rol_criminal <dbl>, pf_rol <dbl>,
# pf_ss_homicide <dbl>, pf_ss_disappearances_disap <dbl>,
# pf_ss_disappearances_violent <dbl>, pf_ss_disappearances_organized <dbl>,
# pf_ss_disappearances_fatalities <dbl>, pf_ss_disappearances_injuries <dbl>,
# pf_ss_disappearances <dbl>, pf_ss_women_fgm <dbl>,
# pf_ss_women_missing <dbl>, pf_ss_women_inheritance_widows <dbl>, …
```

## (1 p) 2. Data subset `hfi_2016`

*The dataset spans a lot of years, but we are only interested in data from year 2016.* *Filter the* `hfi` data frame for year 2016, select the listed variables, and assign the result to a data frame named `hfi_2016`.

- `year`
- `ISO_code`
- `countries`
- `region`
- `pf_expression_control`
- `pf_score`
- `hf_score`

- Write the text of your answer here…

```
# Insert code here
hfi_2016 <-
  hfi # more code here...
```
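The filter-then-select pattern described above can be sketched on a small stand-in data frame. This is only an illustration: `toy_hfi` and its values are hypothetical, the column list is shortened, and it assumes `dplyr` is available (it is loaded as part of `tidyverse` at the top of this document).

```r
library(dplyr)

# Hypothetical stand-in for hfi; the values are made up for illustration
toy_hfi <- data.frame(
  year     = c(2015, 2016, 2016),
  ISO_code = c("AAA", "AAA", "BBB"),
  pf_score = c(7.1, 7.3, 5.2),
  hf_score = c(7.0, 7.2, 5.5),
  ef_score = c(6.9, 7.1, 5.8)  # extra column that select() will drop
)

toy_2016 <-
  toy_hfi %>%
  filter(year == 2016) %>%                     # keep only rows from 2016
  select(year, ISO_code, pf_score, hf_score)   # keep only the listed columns

dim(toy_2016)
```

The same two verbs should apply to `hfi` with the seven variables listed in the question.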

## (1 p) 3. Model 1, Linear relationship?

*What type of plot would you use to display the relationship between the personal freedom score,* `pf_score`, and `pf_expression_control`? *Plot this relationship using the variable* `pf_expression_control` as the predictor. Does the relationship look linear? *If you knew a country’s* `pf_expression_control` (its score out of 10 for political pressures and controls on media content, where 0 indicates the most pressure), would you be comfortable using a linear model to predict the personal freedom score? *If the relationship looks linear, quantify the strength of the relationship with the correlation coefficient.*

- Write the text of your answer here…

`# Insert code here`
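As a rough illustration of pairing a scatterplot with a correlation coefficient, the sketch below uses a hypothetical data frame `toy` (made-up values, not the `hfi` data) and base R functions only. The `use = "complete.obs"` argument is included because the `hfi` columns contain `NA` values, as the printed tibble above shows.

```r
# Hypothetical toy data; x plays the role of the predictor
toy <- data.frame(
  x = c(1, 2, 3, 4, 5, 6, 7, 8),
  y = c(4.9, 5.8, 6.1, 6.4, 7.2, 7.4, 8.1, 8.6)
)

# Scatterplot with the predictor on the horizontal axis
plot(y ~ x, data = toy, xlab = "predictor (x)", ylab = "response (y)")

# Pearson correlation, dropping any rows with missing values
r <- cor(toy$x, toy$y, use = "complete.obs")
r
```

A value of `r` near 1 or -1 indicates a strong linear relationship; the plot is still needed to judge whether a line is a sensible description at all.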

# Sum of squared residuals

## (1 p) 4. Model 1, Residuals

*Looking at your plot from the previous exercise, describe the relationship between these two variables.* *Make sure to discuss the form, direction, and strength of the relationship as well as any unusual observations.*

- Write the text of your answer here…

`# Insert code here`

## (0 p) 5. Model 1, Least squares line, building intuition (practice, only)

*Using* `plot_ss`, choose a line that does a good job of minimizing the sum of squares. *Run the function several times. What was the smallest sum of squares that you got?* *How does it compare to your neighbours?*

`statsr::plot_ss(x = pf_expression_control, y = pf_score, data = hfi_2016, showSquares = TRUE)`
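To make the idea of "minimizing the sum of squares" concrete, here is a minimal sketch that computes the sum of squared residuals for a hand-picked line and compares it with the least squares line from `lm()`. The data frame `toy`, the helper `sum_sq()`, and the guessed intercept and slope are all hypothetical and chosen only for illustration.

```r
# Hypothetical toy data
toy <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2.1, 2.9, 4.2, 4.8, 6.0)
)

# Sum of squared residuals for the line y = b0 + b1 * x (hypothetical helper)
sum_sq <- function(b0, b1, data) {
  sum((data$y - (b0 + b1 * data$x))^2)
}

# A hand-picked line, like one chosen by eye
ss_guess <- sum_sq(b0 = 1.5, b1 = 0.8, data = toy)

# The least squares line
fit   <- lm(y ~ x, data = toy)
b     <- unname(coef(fit))
ss_ls <- sum_sq(b0 = b[1], b1 = b[2], data = toy)

c(guess = ss_guess, least_squares = ss_ls)
```

No hand-picked line can have a smaller sum of squares than the least squares line, which is why `ss_ls` is never larger than `ss_guess`.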

# The linear model

## (2 p) 6. Model 2, Linear relationship

*Plot the relationship between x =* `pf_expression_control` and y = `hf_score`, the total human freedom score. *Fit a new model for this relationship.* *Using the estimates from the R output, write the equation of the regression line.* *What does the slope tell us in the context of the relationship between human freedom and the amount of political pressure on media content?*

- Write the text of your answer here…

*(I recommend that you follow the example from the website. I’ve included two example equations for you to select from and complete by replacing “beta0” and “beta1” with numbers.)*

\[ \hat{y} = \hat{\beta}_{0} + \hat{\beta}_{1} \times \textrm{pf\_expression\_control} \]

\[ \widehat{\textrm{hf\_score}} = \hat{\beta}_{0} + \hat{\beta}_{1} \times \textrm{pf\_expression\_control} \]

`# Insert code here`
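One way to go from fitted estimates to a written equation is to pull the intercept and slope out of the model object with `coef()`. The sketch below does this for a hypothetical data frame `toy` (made-up values), so the numbers it prints are not the answer to this question.

```r
# Hypothetical toy data
toy <- data.frame(
  x = c(0, 2, 4, 6, 8, 10),
  y = c(5.1, 5.9, 6.8, 7.4, 8.7, 9.2)
)

fit <- lm(y ~ x, data = toy)

b0 <- unname(coef(fit)["(Intercept)"])  # estimated intercept
b1 <- unname(coef(fit)["x"])            # estimated slope

# Assemble the fitted equation as text, rounded to two decimals
eq <- sprintf("y-hat = %.2f + %.2f * x", b0, b1)
eq
```

The slope `b1` is the estimated change in the mean response for a one-unit increase in the predictor.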

# Prediction and prediction errors

## (1 p) 7. Model 1, Prediction

*If someone saw the least squares regression line and not the actual data, how would they predict a country’s personal freedom score (*`pf_score`*) for a country with a rating of 3 for* `pf_expression_control`? *Is this an overestimate or an underestimate, and by how much? In other words, what is the residual for this prediction?*

- Write the text of your answer here…

```
# Insert code here
m1 <- lm(pf_score ~ pf_expression_control, data = hfi_2016)
#summary(m1)
tidy(m1)
```

```
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 4.62 0.0575 80.4 0
2 pf_expression_control 0.491 0.0101 48.8 8.19e-303
```

`#glance(m1)`
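The mechanics of a prediction and its residual can be sketched with `predict()`. The example uses a hypothetical data frame `toy` (made-up values); the residual is the observed value minus the predicted value, so a positive residual means the line underestimates that observation.

```r
# Hypothetical toy data
toy <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2.1, 2.9, 4.2, 4.8, 6.0)
)

fit <- lm(y ~ x, data = toy)

# Predicted response at x = 3
y_hat <- unname(predict(fit, newdata = data.frame(x = 3)))

# Observed response for the toy row with x = 3
y_obs <- toy$y[toy$x == 3]

# Residual = observed - predicted
res <- y_obs - y_hat
c(predicted = y_hat, observed = y_obs, residual = res)
```

For the assignment, the same steps would use `m1` and a `newdata` data frame with a `pf_expression_control` column.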

# Model diagnostics

To assess whether the linear model is reliable, we need to check for (1) linearity, (2) nearly normal residuals, and (3) constant variability.
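For readers without the `erikmisc` package, the three checks can be approximated with base R graphics. This is a rough sketch on simulated data (the data frame `toy` and the seed are arbitrary choices), not a replacement for the diagnostics function used in the questions that follow.

```r
set.seed(427)  # arbitrary seed so the simulated data are reproducible

# Simulated toy data with a linear trend and constant-variance noise
toy   <- data.frame(x = seq(0, 10, length.out = 50))
toy$y <- 4 + 0.5 * toy$x + rnorm(50, sd = 0.8)

fit <- lm(y ~ x, data = toy)
res <- residuals(fit)

par(mfrow = c(1, 3))
# (1) linearity and (3) constant variability: residuals vs fitted values
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
# (2) nearly normal residuals: histogram and normal QQ-plot
hist(res, main = "Histogram of residuals", xlab = "Residuals")
qqnorm(res)
qqline(res)
```

A residual plot with no curve and a roughly even vertical spread supports linearity and constant variability; points close to the line in the QQ-plot support nearly normal residuals.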

## (1 p) 8. Model 1, Residual plots, linearity

*Is there any apparent pattern in the residuals plot?* *What does this indicate about the linearity of the relationship between the two variables?*

The `e_plot_lm_diagnostics()` function with the “simple” set of plots gives 6 plots to review. We will focus on 3 of these for now.

- **QQ-plot for Normality** (points stay within band)
- Cook’s Distance for influential points (ignore for now)
- Cook’s Distance by leverage for explaining influential points (ignore for now)
- **Residuals vs predicted values**
- **Residuals vs each x variable**
- Box-Cox transformation for y-variable transformation if residuals weren’t normal (ignore for now)

- Write the text of your answer here…

`e_plot_lm_diagnostics(m1, sw_plot_set = "simple")`

## (1 p) 9. Model 1, Residual and histogram plots, normality

*Based on the histogram, does the nearly normal residuals condition appear to be violated? Why or why not?*

- Write the text of your answer here…

`hist(residuals(m1), 30)`

## (1 p) 10. Model 1, Constant variability

*Based on the residuals vs. fitted plot, does the constant variability condition appear to be violated? Why or why not?*

- Write the text of your answer here…

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.