```
# install.packages("mosaicData") # install package for data
library(erikmisc)
library(tidyverse)
#library(openintro)
::theme_set(ggplot2::theme_bw()) # set theme_bw for all plots ggplot2
```

# ADA1: Class 27, Multiple regression, introduction

Advanced Data Analysis 1, Stat 427/527, Fall 2023, Prof. Erik Erhardt, UNM

# Rubric

Answer the questions with the data example.

# Northampton Rail Trail System

**RailTrail.** The Pioneer Valley Planning Commission (PVPC) collected data north of Chestnut Street in Florence, MA for ninety days from April 5, 2005 to November 15, 2005. Data collectors set up a laser sensor, with breaks in the laser beam recording when a rail-trail user passed the data collection station. In the sections below, describe the relationship between variables and develop a model for predicting `volume`

given `hightemp`

and `cloudcover`

.

A data frame with 90 observations on the following variables.

`volume`

estimated number of trail users that day (number of breaks recorded)`hightemp`

daily high temperature (in degrees Fahrenheit)`cloudcover`

measure of cloud cover (in oktas)

```
<-
dat_rail ::RailTrail |>
mosaicDataas_tibble() |>
select(
volume
, hightemp
, cloudcover|>
) mutate(
# volume = volume # Any of the variables can be transformed here
)
str(dat_rail)
```

```
tibble [90 × 3] (S3: tbl_df/tbl/data.frame)
$ volume : int [1:90] 501 419 397 385 200 375 417 629 533 547 ...
$ hightemp : int [1:90] 83 73 74 95 44 69 66 66 80 79 ...
$ cloudcover: num [1:90] 7.6 6.3 7.5 2.6 10 ...
```

`summary(dat_rail)`

```
volume hightemp cloudcover
Min. :129.0 Min. :41.00 Min. : 0.000
1st Qu.:291.5 1st Qu.:59.25 1st Qu.: 3.650
Median :373.0 Median :69.50 Median : 6.400
Mean :375.4 Mean :68.83 Mean : 5.807
3rd Qu.:451.2 3rd Qu.:77.75 3rd Qu.: 8.475
Max. :736.0 Max. :97.00 Max. :10.000
```

## (1 p) Scatterplot matrix

*In a scatterplot matrix below interpret the relationship between each pair of variables. If a transformation is suggested by the plot (that is, because there is a curved relationship), also plot the data on the transformed scale and perform the analysis on the transformed scale. Otherwise indicate that no transformation is necessary.*

- [answer]

`library(GGally)`

```
Registered S3 method overwritten by 'GGally':
method from
+.gg ggplot2
```

```
<-
p ggpairs(
dat_raildiag = list(continuous = wrap("densityDiag", alpha = 1/2), discrete = "barDiag")
, # scatterplots on top so response as first variable has y on vertical axis
upper = list(continuous = wrap("smooth", se = FALSE, alpha = 1/2, size = 1)
, discrete = "facetbar"
, combo = wrap("box_no_facet")
,
)lower = list(continuous = wrap("cor")
, discrete = "facetbar"
, combo = wrap("facethist", bins = 10)
,
)progress = FALSE
,
)<- p + theme_bw()
p print(p)
```

## (1 p) Correlation matrix

*Below is the correlation matrix and tests for the hypothesis that each correlation is equal to zero. Interpret the hypothesis tests and relate this to the plot that you produced above.*

- [answer]

```
# correlation matrix and associated p-values testing "H0: rho == 0"
#library(Hmisc)
|>
dat_rail as.matrix() |>
::rcorr() Hmisc
```

```
volume hightemp cloudcover
volume 1.00 0.58 -0.37
hightemp 0.58 1.00 -0.10
cloudcover -0.37 -0.10 1.00
n= 90
P
volume hightemp cloudcover
volume 0.0000 0.0003
hightemp 0.0000 0.3702
cloudcover 0.0003 0.3702
```

## (2 p) Multiple regression assumptions (assessing model fit)

*Below the multiple regression is fit. Start by assessing the model assumptions by interpretting what you learn from the plots.* *If assumptions are not met, attempt to address by transforming a variable and restart at the beginning using the new transformed variable.*

```
# fit the simple linear regression model
<-
lm_fit lm(
~ hightemp + cloudcover
volume data = dat_rail
, )
```

Plot diagnostics.

```
# plot diagnostics
e_plot_lm_diagnostics(lm_fit, sw_plot_set = "simple")
```

```
Warning in plot.xy(xy.coords(x, y), type = type, ...): "add" is not a graphical
parameter
```

*First provide a summary of whether the model assumptions are met, then go into detail about what information you get from each plot.*

**Residual plot:**[answer]**QQ-plot:**[answer]**Box-Cox transformation:**[answer]**Cook’s distance:**[answer]**Cook’s dist vs Leverage:**[answer]**Residuals vs Fitted:**[answer]**Residuals vs**[answer]`hightemp`

:**Residuals vs**[answer]`cloudcover`

:

## (2 p) Multiple regression hypothesis tests

*State the hypothesis test and conclusion for each regression coefficient.*

```
# use summary() to get t-tests of parameters (slope, intercept)
summary(lm_fit)
```

```
Call:
lm(formula = volume ~ hightemp + cloudcover, data = dat_rail)
Residuals:
Min 1Q Median 3Q Max
-246.433 -43.651 9.449 42.550 306.237
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 77.4664 59.7717 1.296 0.198390
hightemp 5.4008 0.7874 6.859 9.63e-10 ***
cloudcover -12.7138 3.1783 -4.000 0.000133 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 96.3 on 87 degrees of freedom
Multiple R-squared: 0.442, Adjusted R-squared: 0.4292
F-statistic: 34.46 on 2 and 87 DF, p-value: 9.508e-12
```

Each hypothesis is testing, conditional on all other predictors being in the model, whether the addition of the predictor being tested explains significantly more variability in volume than without it.

For each regression coefficient, write the hypothesis in notation, report the \(t\)-statistic and \(p\)-value, state the conclusion of the test, and interpret the result in the context of the problem.

- [answer]

## (2 p) Multiple regression interpret coefficients

*Interpret the coefficients of the multiple regression model.*

- [answer]

```
<-
fit_contrasts e_plot_model_contrasts(
fit = lm_fit
dat_cont = dat_rail
, sw_print = FALSE
,
)
<-
p_arranged ::plot_grid(
cowplotplotlist = list(fit_contrasts$plots[[1]], fit_contrasts$plots[[2]])
nrow = 1
, labels = "AUTO"
,
)
%>% print() p_arranged
```

## (1 p) Multiple regression \(R^2\)

*Interpret the Multiple R-squared value.*

- [answer]

## (1 p) Summary

*Summarize your findings in one sentence.*

- [answer]

## View model in 3D, if you want

```
## Aside: I generally recommend against 3D plots for a variety of reasons.
## However, here's a 3D version of the plot so you can visualize the surface fit in 3D.
# library(rgl)
# library(car)
# scatter3d(volume ~ hightemp + cloudcover, data = dat_rail)
```