ADA1: Class 08, Introduction to linear regression

Advanced Data Analysis 1, Stat 427/527, Fall 2023, Prof. Erik Erhardt, UNM


Your Name


September 10, 2023


The context of this assignment comes from the OpenIntro Labs for R and tidyverse.

This is a template for the assignment. Modify this and turn it in.

Some questions are answered by the code you’ve written. In those cases, in your answer write “see code”.


The Human Freedom Index is a report that attempts to summarize the idea of “freedom” through many different variables for countries around the globe. It serves as a rough objective measure of the relationships between the different types of freedom (political, religious, economic, or personal) and other social and economic circumstances. The Human Freedom Index is co-published annually by the Cato Institute, the Fraser Institute, and the Liberales Institut at the Friedrich Naumann Foundation for Freedom.

In this lab, you’ll be analysing data from the Human Freedom Index reports. Your aim will be to summarize a few of the relationships within the data both graphically and numerically in order to find which variables can help tell a story about freedom.

Getting Started

The data we’re working with is in the openintro package and it’s called hfi, short for Human Freedom Index.

(1 p) 1. Data basics

  1. What are the dimensions of the dataset?
  2. What does each row represent?
  • Write the text of your answer here…
# Insert code here
data(hfi, package = "openintro")
attr(hfi, "spec") <- NULL  # remove variable class specification attribute

[1] 1458  123
# A tibble: 1,458 × 123
    year ISO_code countries  region               pf_rol_procedural pf_rol_civil
   <dbl> <chr>    <chr>      <chr>                            <dbl>        <dbl>
 1  2016 ALB      Albania    Eastern Europe                    6.66         4.55
 2  2016 DZA      Algeria    Middle East & North…             NA           NA   
 3  2016 AGO      Angola     Sub-Saharan Africa               NA           NA   
 4  2016 ARG      Argentina  Latin America & the…              7.10         5.79
 5  2016 ARM      Armenia    Caucasus & Central …             NA           NA   
 6  2016 AUS      Australia  Oceania                           8.44         7.53
 7  2016 AUT      Austria    Western Europe                    8.97         7.87
 8  2016 AZE      Azerbaijan Caucasus & Central …             NA           NA   
 9  2016 BHS      Bahamas    Latin America & the…              6.93         6.01
10  2016 BHR      Bahrain    Middle East & North…             NA           NA   
# ℹ 1,448 more rows
# ℹ 117 more variables: pf_rol_criminal <dbl>, pf_rol <dbl>,
#   pf_ss_homicide <dbl>, pf_ss_disappearances_disap <dbl>,
#   pf_ss_disappearances_violent <dbl>, pf_ss_disappearances_organized <dbl>,
#   pf_ss_disappearances_fatalities <dbl>, pf_ss_disappearances_injuries <dbl>,
#   pf_ss_disappearances <dbl>, pf_ss_women_fgm <dbl>,
#   pf_ss_women_missing <dbl>, pf_ss_women_inheritance_widows <dbl>, …
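
As a minimal sketch of how the dimensions of a data frame can be inspected, the following uses a small made-up data frame. The name toy_df and its values are hypothetical and only stand in for hfi; dim() returns the number of rows followed by the number of columns.

```r
# Toy stand-in for hfi (hypothetical values, for illustration only)
toy_df <- data.frame(
  year      = c(2016, 2016, 2015),
  countries = c("A", "B", "A"),
  pf_score  = c(7.1, 6.4, 7.0)
)

toy_dims <- dim(toy_df)  # c(number of rows, number of columns)
toy_dims
nrow(toy_df)             # rows only
ncol(toy_df)             # columns only
```

The same calls should work unchanged on hfi once the data are loaded.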

(1 p) 2. Data subset hfi_2016

  1. The dataset spans many years, but we are only interested in data from the year 2016.
  2. Filter the hfi data frame for year 2016, select the variables listed below, and assign the result to a data frame named hfi_2016.
  • year
  • ISO_code
  • countries
  • region
  • pf_expression_control
  • pf_score
  • hf_score

  • Write the text of your answer here…

# Insert code here

hfi_2016 <-
  # more code here...
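
The filter-then-select pattern described above can be sketched on a toy data frame. This is an illustration in base R under assumed names (toy_df, toy_2016, and extra_var are hypothetical), not the assignment's solution.

```r
# Toy stand-in for hfi (hypothetical values, for illustration only)
toy_df <- data.frame(
  year      = c(2015, 2016, 2016),
  countries = c("A", "A", "B"),
  region    = c("R1", "R1", "R2"),
  pf_score  = c(7.0, 7.1, 6.4),
  extra_var = c(1, 2, 3)
)

# Keep only rows from 2016 and only the listed columns
toy_2016 <- subset(
  toy_df,
  year == 2016,
  select = c(year, countries, pf_score)
)
toy_2016
```

In the tidyverse, the same result is usually obtained by piping the data frame through dplyr::filter() and then dplyr::select().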

(1 p) 3. Model 1, Linear relationship?

  1. What type of plot would you use to display the relationship between the personal freedom score, pf_score, and pf_expression_control?
  2. Plot this relationship using the variable pf_expression_control as the predictor. Does the relationship look linear?
  3. If you knew a country’s pf_expression_control (its score out of 10 for political pressures and controls on media content, with 0 indicating the most pressure), would you be comfortable using a linear model to predict its personal freedom score?
  4. If the relationship looks linear, quantify the strength of the relationship with the correlation coefficient.
  • Write the text of your answer here…
# Insert code here
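
As a rough sketch of the workflow in this question, a scatterplot shows the form of the relationship and cor() quantifies its linear strength. The vectors toy_x and toy_y below are made-up values, not the hfi data. The argument use = "complete.obs" drops pairs with missing values, which may matter here because the hfi output above shows NA entries.

```r
# Hypothetical predictor and response values (for illustration only)
toy_x <- c(1, 2, 3, 4, 5, 6, 7, 8)
toy_y <- c(5.1, 5.4, 6.2, 6.3, 7.1, 7.4, 8.2, 8.5)

# Scatterplot with the predictor on the horizontal axis
plot(toy_x, toy_y, xlab = "predictor (toy)", ylab = "response (toy)")

# Pearson correlation, ignoring any incomplete pairs
toy_r <- cor(toy_x, toy_y, use = "complete.obs")
toy_r
```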

Sum of squared residuals

(1 p) 4. Model 1, Residuals

  1. Looking at your plot from the previous exercise, describe the relationship between these two variables.
  2. Make sure to discuss the form, direction, and strength of the relationship as well as any unusual observations.
  • Write the text of your answer here…
# Insert code here

(0 p) 5. Model 1, Least squares line, building intuition (practice, only)

  1. Using plot_ss, choose a line that does a good job of minimizing the sum of squares.
  2. Run the function several times. What was the smallest sum of squares that you got?
  3. How does it compare to your neighbours’ results?
statsr::plot_ss(x = pf_expression_control, y = pf_score, data = hfi_2016, showSquares = TRUE)
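
To build intuition for what plot_ss is minimizing, the sum of squared residuals can also be computed by hand. The sketch below uses made-up data and a hypothetical helper, ssr(), to compare a hand-picked line against the least squares line from lm().

```r
# Hypothetical data (for illustration only)
toy_x <- c(1, 2, 3, 4, 5)
toy_y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Sum of squared residuals for a line with intercept b0 and slope b1
ssr <- function(b0, b1, x, y) {
  sum((y - (b0 + b1 * x))^2)
}

# A hand-picked candidate line
ssr_guess <- ssr(b0 = 1, b1 = 1.5, x = toy_x, y = toy_y)

# The least squares line, which minimizes this quantity
fit <- lm(toy_y ~ toy_x)
ssr_ls <- ssr(coef(fit)[1], coef(fit)[2], toy_x, toy_y)

c(guess = ssr_guess, least_squares = ssr_ls)
```

No hand-picked line can give a smaller sum of squares than the least squares line, which is why repeated attempts with plot_ss approach but do not beat it.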

The linear model

(2 p) 6. Model 2, Linear relationship

  1. Plot the relationship between x=pf_expression_control and y=hf_score, or the total human freedom score.
  2. Fit a new model for this relationship.
  3. Using the estimates from the R output, write the equation of the regression line.
  4. What does the slope tell us in the context of the relationship between human freedom and the amount of political pressure on media content?
  • Write the text of your answer here…

(I recommend that you follow the example from the website. I’ve included two example equations for you to select from and complete by replacing “beta0” and “beta1” with numbers.)

\[ \hat{y} = \hat{\beta}_{0} + \hat{\beta}_{1} \times \textrm{pf\_expression\_control} \]

\[ \widehat{\textrm{hf\_score}} = \hat{\beta}_{0} + \hat{\beta}_{1} \times \textrm{pf\_expression\_control} \]

# Insert code here
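
One possible way to move from R output to a written equation is to pull the estimates out with coef(). The sketch below uses toy data constructed to lie exactly on y = 1 + 2x; toy_df, toy_fit, and toy_eq are hypothetical names.

```r
# Toy data lying exactly on the line y = 1 + 2x (for illustration only)
toy_df <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(3, 5, 7, 9, 11)
)

toy_fit <- lm(y ~ x, data = toy_df)

# Named vector of estimates: "(Intercept)" and "x"
b <- coef(toy_fit)

# Format the fitted equation as text
toy_eq <- sprintf("y-hat = %.2f + %.2f * x", b[["(Intercept)"]], b[["x"]])
toy_eq
```

The slope is read as the change in the predicted response for a one-unit increase in the predictor; the intercept is the predicted response when the predictor equals 0.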

Prediction and prediction errors

(1 p) 7. Model 1, Prediction

  1. If someone saw the least squares regression line and not the actual data, how would they predict the personal freedom score (pf_score) of a country with a pf_expression_control rating of 3?
  2. Is this an overestimate or an underestimate, and by how much? In other words, what is the residual for this prediction?
  • Write the text of your answer here…
# Insert code here

m1 <- lm(pf_score ~ pf_expression_control, data = hfi_2016)
# A tibble: 2 × 5
  term                  estimate std.error statistic   p.value
  <chr>                    <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)              4.62     0.0575      80.4 0        
2 pf_expression_control    0.491    0.0101      48.8 8.19e-303
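
A prediction and its residual can be sketched as follows, again on made-up data rather than hfi_2016. Here predict() evaluates the fitted line at a new predictor value, and the residual is taken as observed minus predicted, so a positive residual means the line underestimates the observation.

```r
# Hypothetical data (for illustration only)
toy_df <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2.0, 4.5, 5.5, 8.5, 9.5)
)

toy_fit <- lm(y ~ x, data = toy_df)

# Predicted response at x = 3
toy_pred <- predict(toy_fit, newdata = data.frame(x = 3))

# Residual for the observation with x = 3 (observed y is 5.5)
toy_resid <- 5.5 - toy_pred

c(predicted = unname(toy_pred), residual = unname(toy_resid))
```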

Model diagnostics

To assess whether the linear model is reliable, we need to check for (1) linearity, (2) nearly normal residuals, and (3) constant variability.

(1 p) 8. Model 1, Residual plots, linearity

  1. Is there any apparent pattern in the residuals plot?
  2. What does this indicate about the linearity of the relationship between the two variables?

The e_plot_lm_diagnostics() function with the “simple” set of plots gives 6 plots to review. We will focus on 3 of these for now.

  1. QQ-plot for Normality (points stay within band)
  2. Cook’s Distance for influential points (ignore for now)
  3. Cook’s Distance by leverage for explaining influential points (ignore for now)
  4. Residuals vs predicted values
  5. Residuals vs each x variable
  6. Box-Cox transformation for y-variable transformation if residuals weren’t normal (ignore for now)
  • Write the text of your answer here…
e_plot_lm_diagnostics(m1, sw_plot_set = "simple")
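
For readers who want to see what a residuals vs predicted values plot contains, here is a base R sketch on simulated data. The data, seed, and object names are assumptions for illustration. Points scattered evenly around the horizontal line at 0, with no curve, are consistent with linearity.

```r
# Simulated data from a truly linear relationship (for illustration only)
set.seed(1)
toy_x <- seq(1, 10, length.out = 50)
toy_y <- 2 + 0.5 * toy_x + rnorm(50, sd = 0.3)

toy_fit <- lm(toy_y ~ toy_x)

# Residuals against fitted (predicted) values
plot(
  fitted(toy_fit), residuals(toy_fit),
  xlab = "Fitted values", ylab = "Residuals"
)
abline(h = 0, lty = 2)  # reference line at zero
```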

(1 p) 9. Model 1, Residual and histogram plots, normality

  1. Based on the histogram, does the nearly normal residuals condition appear to be violated? Why or why not?
  • Write the text of your answer here…
hist(residuals(m1), breaks = 30)
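
A normal QQ-plot is a common companion to the histogram when judging normality. The sketch below uses base R's qqnorm() and qqline() on residuals from a simulated fit; the data and names are hypothetical. Points that follow the reference line closely suggest nearly normal residuals.

```r
# Simulated data with normal errors (for illustration only)
set.seed(2)
toy_x <- seq(1, 10, length.out = 50)
toy_y <- 1 + 0.8 * toy_x + rnorm(50, sd = 0.5)

toy_res <- residuals(lm(toy_y ~ toy_x))

# Sample quantiles of the residuals against theoretical normal quantiles
qq <- qqnorm(toy_res)
qqline(toy_res)  # reference line through the first and third quartiles
```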

(1 p) 10. Model 1, Constant variability

  1. Based on the residuals vs. fitted plot, does the constant variability condition appear to be violated? Why or why not?
  • Write the text of your answer here…

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.