ADA1: Class 27, Multiple regression, introduction

Advanced Data Analysis 1, Stat 427/527, Fall 2023, Prof. Erik Erhardt, UNM

Author

Your Name

Published

October 21, 2023

Rubric

# install.packages("mosaicData")  # install package for data

library(erikmisc)
library(tidyverse)
#library(openintro)
ggplot2::theme_set(ggplot2::theme_bw())  # set theme_bw for all plots

Answer the questions with the data example.


Northampton Rail Trail System

RailTrail. The Pioneer Valley Planning Commission (PVPC) collected data north of Chestnut Street in Florence, MA for ninety days from April 5, 2005 to November 15, 2005. Data collectors set up a laser sensor, with breaks in the laser beam recording when a rail-trail user passed the data collection station. In the sections below, describe the relationship between variables and develop a model for predicting volume given hightemp and cloudcover.

A data frame with 90 observations on the following variables.

  • volume estimated number of trail users that day (number of breaks recorded)
  • hightemp daily high temperature (in degrees Fahrenheit)
  • cloudcover measure of cloud cover (in oktas)
dat_rail <-
  mosaicData::RailTrail |>
  as_tibble() |>
  select(
    volume
  , hightemp
  , cloudcover
  ) |>
  mutate(
  #  volume = volume    # Any of the variables can be transformed here
  )

str(dat_rail)
tibble [90 × 3] (S3: tbl_df/tbl/data.frame)
 $ volume    : int [1:90] 501 419 397 385 200 375 417 629 533 547 ...
 $ hightemp  : int [1:90] 83 73 74 95 44 69 66 66 80 79 ...
 $ cloudcover: num [1:90] 7.6 6.3 7.5 2.6 10 ...
summary(dat_rail)
     volume         hightemp       cloudcover    
 Min.   :129.0   Min.   :41.00   Min.   : 0.000  
 1st Qu.:291.5   1st Qu.:59.25   1st Qu.: 3.650  
 Median :373.0   Median :69.50   Median : 6.400  
 Mean   :375.4   Mean   :68.83   Mean   : 5.807  
 3rd Qu.:451.2   3rd Qu.:77.75   3rd Qu.: 8.475  
 Max.   :736.0   Max.   :97.00   Max.   :10.000  

(1 p) Scatterplot matrix

In a scatterplot matrix below interpret the relationship between each pair of variables. If a transformation is suggested by the plot (that is, because there is a curved relationship), also plot the data on the transformed scale and perform the analysis on the transformed scale. Otherwise indicate that no transformation is necessary.

  • [answer]
library(GGally)
Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2
p <-
  ggpairs(
    dat_rail
  , diag  = list(continuous = wrap("densityDiag", alpha = 1/2), discrete = "barDiag")
  # scatterplots on top so response as first variable has y on vertical axis
  , upper = list(continuous = wrap("smooth", se = FALSE, alpha = 1/2, size = 1)
            , discrete = "facetbar"
            , combo = wrap("box_no_facet")
            )
  , lower = list(continuous = wrap("cor")
            , discrete = "facetbar"
            , combo = wrap("facethist", bins = 10)
            )
  , progress = FALSE
  )
p <- p + theme_bw()
print(p)

(1 p) Correlation matrix

Below is the correlation matrix and tests for the hypothesis that each correlation is equal to zero. Interpret the hypothesis tests and relate this to the plot that you produced above.

  • [answer]
# correlation matrix and associated p-values testing "H0: rho == 0"
#library(Hmisc)
dat_rail |>
  as.matrix() |>
  Hmisc::rcorr()
           volume hightemp cloudcover
volume       1.00     0.58      -0.37
hightemp     0.58     1.00      -0.10
cloudcover  -0.37    -0.10       1.00

n= 90 


P
           volume hightemp cloudcover
volume            0.0000   0.0003    
hightemp   0.0000          0.3702    
cloudcover 0.0003 0.3702             

(2 p) Multiple regression assumptions (assessing model fit)

Below the multiple regression is fit. Start by assessing the model assumptions by interpretting what you learn from the plots. If assumptions are not met, attempt to address by transforming a variable and restart at the beginning using the new transformed variable.

# fit the simple linear regression model
lm_fit <-
  lm(
    volume ~ hightemp + cloudcover
  , data = dat_rail
  )

Plot diagnostics.

# plot diagnostics
e_plot_lm_diagnostics(lm_fit, sw_plot_set = "simple")
Warning in plot.xy(xy.coords(x, y), type = type, ...): "add" is not a graphical
parameter

First provide a summary of whether the model assumptions are met, then go into detail about what information you get from each plot.

  1. Residual plot: [answer]
  2. QQ-plot: [answer]
  3. Box-Cox transformation: [answer]
  4. Cook’s distance: [answer]
  5. Cook’s dist vs Leverage: [answer]
  6. Residuals vs Fitted: [answer]
  7. Residuals vs hightemp: [answer]
  8. Residuals vs cloudcover: [answer]

(2 p) Multiple regression hypothesis tests

State the hypothesis test and conclusion for each regression coefficient.

# use summary() to get t-tests of parameters (slope, intercept)
summary(lm_fit)

Call:
lm(formula = volume ~ hightemp + cloudcover, data = dat_rail)

Residuals:
     Min       1Q   Median       3Q      Max 
-246.433  -43.651    9.449   42.550  306.237 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  77.4664    59.7717   1.296 0.198390    
hightemp      5.4008     0.7874   6.859 9.63e-10 ***
cloudcover  -12.7138     3.1783  -4.000 0.000133 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 96.3 on 87 degrees of freedom
Multiple R-squared:  0.442, Adjusted R-squared:  0.4292 
F-statistic: 34.46 on 2 and 87 DF,  p-value: 9.508e-12

Each hypothesis is testing, conditional on all other predictors being in the model, whether the addition of the predictor being tested explains significantly more variability in volume than without it.

For each regression coefficient, write the hypothesis in notation, report the \(t\)-statistic and \(p\)-value, state the conclusion of the test, and interpret the result in the context of the problem.

  • [answer]

(2 p) Multiple regression interpret coefficients

Interpret the coefficients of the multiple regression model.

  • [answer]
fit_contrasts <-
  e_plot_model_contrasts(
    fit       = lm_fit
  , dat_cont  = dat_rail
  , sw_print  = FALSE
  )

p_arranged <-
  cowplot::plot_grid(
    plotlist  = list(fit_contrasts$plots[[1]], fit_contrasts$plots[[2]])
  , nrow      = 1
  , labels    = "AUTO"
  )

p_arranged %>% print()

(1 p) Multiple regression \(R^2\)

Interpret the Multiple R-squared value.

  • [answer]

(1 p) Summary

Summarize your findings in one sentence.

  • [answer]

View model in 3D, if you want

## Aside: I generally recommend against 3D plots for a variety of reasons.
## However, here's a 3D version of the plot so you can visualize the surface fit in 3D.

# library(rgl)
# library(car)
# scatter3d(volume ~ hightemp + cloudcover, data = dat_rail)