# ADA1: Class 27, Multiple regression, introduction

Advanced Data Analysis 1, Stat 427/527, Fall 2023, Prof. Erik Erhardt, UNM

Author

Published

October 21, 2023

# Rubric

# install.packages("mosaicData")  # install package for data

library(erikmisc)
library(tidyverse)
#library(openintro)
ggplot2::theme_set(ggplot2::theme_bw())  # set theme_bw for all plots

Answer the questions with the data example.

# Northampton Rail Trail System

RailTrail. The Pioneer Valley Planning Commission (PVPC) collected data north of Chestnut Street in Florence, MA for ninety days from April 5, 2005 to November 15, 2005. Data collectors set up a laser sensor, with breaks in the laser beam recording when a rail-trail user passed the data collection station. In the sections below, describe the relationship between variables and develop a model for predicting volume given hightemp and cloudcover.

A data frame with 90 observations on the following variables.

• volume estimated number of trail users that day (number of breaks recorded)
• hightemp daily high temperature (in degrees Fahrenheit)
• cloudcover measure of cloud cover (in oktas)
dat_rail <-
mosaicData::RailTrail |>
as_tibble() |>
select(
volume
, hightemp
, cloudcover
) |>
mutate(
#  volume = volume    # Any of the variables can be transformed here
)

str(dat_rail)
tibble [90 × 3] (S3: tbl_df/tbl/data.frame)
$volume : int [1:90] 501 419 397 385 200 375 417 629 533 547 ...$ hightemp  : int [1:90] 83 73 74 95 44 69 66 66 80 79 ...
$cloudcover: num [1:90] 7.6 6.3 7.5 2.6 10 ... summary(dat_rail)  volume hightemp cloudcover Min. :129.0 Min. :41.00 Min. : 0.000 1st Qu.:291.5 1st Qu.:59.25 1st Qu.: 3.650 Median :373.0 Median :69.50 Median : 6.400 Mean :375.4 Mean :68.83 Mean : 5.807 3rd Qu.:451.2 3rd Qu.:77.75 3rd Qu.: 8.475 Max. :736.0 Max. :97.00 Max. :10.000  ## (1 p) Scatterplot matrix In a scatterplot matrix below interpret the relationship between each pair of variables. If a transformation is suggested by the plot (that is, because there is a curved relationship), also plot the data on the transformed scale and perform the analysis on the transformed scale. Otherwise indicate that no transformation is necessary. • [answer] library(GGally) Registered S3 method overwritten by 'GGally': method from +.gg ggplot2 p <- ggpairs( dat_rail , diag = list(continuous = wrap("densityDiag", alpha = 1/2), discrete = "barDiag") # scatterplots on top so response as first variable has y on vertical axis , upper = list(continuous = wrap("smooth", se = FALSE, alpha = 1/2, size = 1) , discrete = "facetbar" , combo = wrap("box_no_facet") ) , lower = list(continuous = wrap("cor") , discrete = "facetbar" , combo = wrap("facethist", bins = 10) ) , progress = FALSE ) p <- p + theme_bw() print(p) ## (1 p) Correlation matrix Below is the correlation matrix and tests for the hypothesis that each correlation is equal to zero. Interpret the hypothesis tests and relate this to the plot that you produced above. • [answer] # correlation matrix and associated p-values testing "H0: rho == 0" #library(Hmisc) dat_rail |> as.matrix() |> Hmisc::rcorr()  volume hightemp cloudcover volume 1.00 0.58 -0.37 hightemp 0.58 1.00 -0.10 cloudcover -0.37 -0.10 1.00 n= 90 P volume hightemp cloudcover volume 0.0000 0.0003 hightemp 0.0000 0.3702 cloudcover 0.0003 0.3702  ## (2 p) Multiple regression assumptions (assessing model fit) Below the multiple regression is fit. Start by assessing the model assumptions by interpretting what you learn from the plots. If assumptions are not met, attempt to address by transforming a variable and restart at the beginning using the new transformed variable. # fit the simple linear regression model lm_fit <- lm( volume ~ hightemp + cloudcover , data = dat_rail ) Plot diagnostics. # plot diagnostics e_plot_lm_diagnostics(lm_fit, sw_plot_set = "simple") Warning in plot.xy(xy.coords(x, y), type = type, ...): "add" is not a graphical parameter First provide a summary of whether the model assumptions are met, then go into detail about what information you get from each plot. 1. Residual plot: [answer] 2. QQ-plot: [answer] 3. Box-Cox transformation: [answer] 4. Cook’s distance: [answer] 5. Cook’s dist vs Leverage: [answer] 6. Residuals vs Fitted: [answer] 7. Residuals vs hightemp: [answer] 8. Residuals vs cloudcover: [answer] ## (2 p) Multiple regression hypothesis tests State the hypothesis test and conclusion for each regression coefficient. # use summary() to get t-tests of parameters (slope, intercept) summary(lm_fit)  Call: lm(formula = volume ~ hightemp + cloudcover, data = dat_rail) Residuals: Min 1Q Median 3Q Max -246.433 -43.651 9.449 42.550 306.237 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 77.4664 59.7717 1.296 0.198390 hightemp 5.4008 0.7874 6.859 9.63e-10 *** cloudcover -12.7138 3.1783 -4.000 0.000133 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 96.3 on 87 degrees of freedom Multiple R-squared: 0.442, Adjusted R-squared: 0.4292 F-statistic: 34.46 on 2 and 87 DF, p-value: 9.508e-12 Each hypothesis is testing, conditional on all other predictors being in the model, whether the addition of the predictor being tested explains significantly more variability in volume than without it. For each regression coefficient, write the hypothesis in notation, report the $$t$$-statistic and $$p$$-value, state the conclusion of the test, and interpret the result in the context of the problem. • [answer] ## (2 p) Multiple regression interpret coefficients Interpret the coefficients of the multiple regression model. • [answer] fit_contrasts <- e_plot_model_contrasts( fit = lm_fit , dat_cont = dat_rail , sw_print = FALSE ) p_arranged <- cowplot::plot_grid( plotlist = list(fit_contrasts$plots[[1]], fit_contrasts\$plots[[2]])
, nrow      = 1
, labels    = "AUTO"
)

p_arranged %>% print()

## (1 p) Multiple regression $$R^2$$

Interpret the Multiple R-squared value.

## (1 p) Summary

Summarize your findings in one sentence.

## Aside: I generally recommend against 3D plots for a variety of reasons.
# scatter3d(volume ~ hightemp + cloudcover, data = dat_rail)