---
title: "ADA2: Class 04, Ch 02 Introduction to Multiple Linear Regression"
author: "Name Here"
date: "`r format(Sys.time(), '%B %d, %Y')`"
output:
html_document:
toc: true
number_sections: true
toc_depth: 5
code_folding: show
#df_print: paged
#df_print: kable
#toc_float: true
#collapsed: false
#smooth_scroll: TRUE
theme: cosmo #spacelab #yeti #united #cosmo
highlight: tango
pdf_document:
df_print: kable
fontsize: 12pt
geometry: margin=0.25in
always_allow_html: yes
---
```{R, echo=FALSE}
# I set some GLOBAL R chunk options here.
# (to hide this message add "echo=FALSE" to the code chunk options)
knitr::opts_chunk$set(comment = NA, message = FALSE, warning = FALSE, width = 100)
knitr::opts_chunk$set(fig.align = "center", fig.height = 4, fig.width = 6)
```
# Water Usage of Production Plant
A production plant cost-control engineer is responsible for cost reduction. One
of the costly items in his plant is the amount of water used by the production
facilities each month. He decided to investigate water usage by collecting
seventeen observations on his plant's water usage and other variables.
Variable | Description
----------- | ----------------------------------------------
Temperature | Average monthly temperate (F)
Production | Amount of production (M pounds)
Days | Number of plant operating days in the month
Persons | Number of persons on the monthly plant payroll
Water | Monthly water usage (gallons)
```{R}
library(erikmisc)
library(tidyverse)
# First, download the data to your computer,
# save in the same folder as this Rmd file.
# read the data
dat_water <- read_csv("ADA2_CL_04_water.csv")
str(dat_water)
#dat_water
dat_water <-
dat_water %>%
mutate(
# Add an ID column
id = 1:n()
) %>%
# filter to remove observations (if needed)
# TRUE indicates that ALL observations are included
# !(id %in% c(4, 12)) indicates to include all observations that are NOT id 4 or 12
filter(
TRUE # !(id %in% c(4, 12))
)
```
__Note:__
Because of the high correlation between `Production` and `Persons`,
do not include `Persons` in the model.
# Rubric
Following the in-class assignment this week,
perform a complete multiple regression analysis.
1. (1 p) Scatterplot matrix and interpretation
1. (2 p) Fit model, assess multiple regression assumptions
1. (1 p) Interpret added variable plots
1. (1 p) If there are model assumption issues, say how you address them at the beginning and start again.
1. (1 p) State and interpret the multiple regression hypothesis tests
1. (2 p) Interpret the significant multiple regression coefficients
1. (1 p) Interpret the multiple regression $R^2$
1. (1 p) One- or two-sentence summary
# Solutions
## __(1 p)__ Scatterplot matrix
_In a scatterplot matrix below interpret the relationship between each pair of variables.
If a transformation is suggested by the plot (that is, because there is a curved relationship),
also plot the data on the transformed scale and
perform the following analysis on the transformed scale.
Otherwise indicate that no transformation is necessary._
```{R}
library(ggplot2)
library(GGally)
#p <- ggpairs(dat_water)
p <- ggpairs(dat_water %>% select(-id, -Persons)) ## use select to remove vars
print(p)
```
_A parallel coordinate plot is another way of seeing patterns of observations
over a range of variables._
```{R, fig.height = 4, fig.width = 10, echo=FALSE}
# http://www.inside-r.org/packages/cran/GGally/docs/ggparcoord
library(ggplot2)
library(GGally)
# univariate min/max scaling
p_uniminmax <-
ggparcoord(
data = dat_water
, columns = c(5, 2, 4, 1, 3) #1:5
, groupColumn = 5 # color (pick the response)
#, order = "anyClass"
, scale = "uniminmax" # "uniminmax". "globalminmax"
, showPoints = FALSE
, title = "Parallel Coordinate Plot for the Water Data"
#, alphaLines = 1/3
#, shadeBox = "white"
#, boxplot = TRUE
) + theme_bw()
print(p_uniminmax)
```
### Solution
[answer]
## __(2 p)__ Multiple regression assumptions (assessing model fit)
_Below the multiple regression is fit.
Start by assessing the model assumptions by interpretting what you learn from the first seven plots (save the added variable plots for the next question)._
_If assumptions are not met, attempt to address by transforming a variable (or removing an outlier) and
restart at the beginning using the new transformed variable._
```{R}
# fit the simple linear regression model
#lm_w_tpdp <- lm(Water ~ Temperature + Production + Days + Persons, data = dat_water)
lm_w_tpd <-
lm(
Water ~ Temperature + Production + Days
, data = dat_water
)
```
Plot diagnostics.
```{R, fig.height = 3, fig.width = 10, echo=FALSE}
# plot diagnostics
e_plot_lm_diagostics(lm_w_tpd, sw_plot_set = "simpleAV")
```
### Solution
[answer]
From the diagnostic plots above,
(1)
(2)
(3)
(4)
(5)
(6)
(7)
## __(1 p)__ Added variable plots
_Use partial regression residual plots (added variable plots)
to check for the need for transformations.
If linearity is not supported, address and restart at the beginning._
### Solution
[answer]
## __(1 p)__ Multiple regression hypothesis tests
_State the hypothesis test and conclusion for each regression coefficient._
```{R}
# use summary() to get t-tests of parameters (slope, intercept)
summary(lm_w_tpd)
```
### Solution
[answer]
(I'll get you started by writing the beginning of the first conclusion.)
Each hypothesis is testing, conditional on all other predictors
being in the model, whether the addition of the predictor being tested explains
significantly more variability in Water than without it.
For $H_0: \beta_{\textrm{Temperature}}=0$, the $t$-statistic is
`r signif(summary(lm_w_tpd)$coefficients[2,3],4)`
with an associated p-value of
`r signif(summary(lm_w_tpd)$coefficients[2,4],4)`.
Thus, we
[reject / FAIL TO reject]
$H_0$ concluding that the slope is
[NOT] statistically significantly different
from 0 conditional on the other variables in the model.
Similarly,
for $H_0: \beta_{\textrm{Production}}=0$, the $t$-statistic is ...
Similarly,
for $H_0: \beta_{\textrm{Days}}=0$, the $t$-statistic is ...
## __(1 p)__ Multiple regression interpret coefficients
_Interpret the significant coefficients of the multiple regression model._
### Solution
[answer]
(I'll get you started by writing the beginning of the first conclusion.)
The coefficient for Temperature is estimated at
$\hat{\beta}_{\textrm{Temperature}}$=`r signif(summary(lm_w_tpd)$coefficients[2,1],4)`,
thus we expect the Water to increase by
`r signif(summary(lm_w_tpd)$coefficients[2,1],4)`
for each year increase in Temperature
holding the other variables constant.
The coefficient for Production is estimated at ...
The coefficient for Days is estimated at ...
## __(1 p)__ Multiple regression $R^2$
_Interpret the Multiple R-squared value._
### Solution
[answer]
## __(1 p)__ Summary
_Summarize your findings in one sentence._
### Solution
[answer]
# Unused plots
```{R}
## Aside: While I generally recommend against 3D plots for a variety of reasons,
## so you can visualize the surface fit in 3D, here's a 3D version of the plot.
## I will point out a feature in this plot that we would't see in other plots
## and would typically only be detected by careful consideration
## of a "more complicated" second-order model that includes curvature.
# library(rgl)
# library(car)
# scatter3d(Water ~ Temperature + Production, data = dat_water)
```
These bivariate plots can help show the relationships between the response and
predictor variables and identify each observation.
```{R, fig.height = 4, fig.width = 10, echo=FALSE}
# ggplot: Plot the data with linear regression fit and confidence bands
library(ggplot2)
p1 <- ggplot(dat_water, aes(x = Temperature, y = Water, label = id))
p1 <- p1 + geom_point(aes(colour = Production), size=3)
# plot labels next to points
p1 <- p1 + geom_text(hjust = 0.5, vjust = -0.5, alpha = 1/2)
# plot regression line and confidence band
p1 <- p1 + geom_smooth(method = lm)
p1 <- p1 + labs(title="Selling Water by Temperature with colored Production")
#print(p1)
library(ggplot2)
p2 <- ggplot(dat_water, aes(x = Days, y = Water, label = id))
p2 <- p2 + geom_point(aes(colour = Temperature), size=3)
# plot labels next to points
p2 <- p2 + geom_text(hjust = 0.5, vjust = -0.5, alpha = 1/2)
# plot regression line and confidence band
p2 <- p2 + geom_smooth(method = lm)
p2 <- p2 + labs(title="Selling Water by Days with colored Temperature")
library(ggplot2)
p3 <- ggplot(dat_water, aes(x = Production, y = Water, label = id))
p3 <- p3 + geom_point(aes(colour = Days), size=3)
# plot labels next to points
p3 <- p3 + geom_text(hjust = 0.5, vjust = -0.5, alpha = 1/2)
# plot regression line and confidence band
p3 <- p3 + geom_smooth(method = lm)
p3 <- p3 + labs(title="Selling Water by Production with colored Days")
#print(p3)
library(gridExtra)
grid.arrange(grobs = list(p1, p2, p3), nrow=1)
```