1 Water Usage of Production Plant

A production plant cost-control engineer is responsible for cost reduction. One of the costly items in his plant is the amount of water used by the production facilities each month. He decided to investigate water usage by collecting seventeen observations on his plant's water usage and other variables.

Variable     Description
Temperature  Average monthly temperature (°F)
Production   Amount of production (M pounds)
Days         Number of plant operating days in the month
Persons      Number of persons on the monthly plant payroll
Water        Monthly water usage (gallons)

library(erikmisc)
library(tidyverse)

# First, download the data to your computer,
#   save in the same folder as this Rmd file.

# read the data
dat_water <- read_csv("ADA2_CL_04_water.csv")
str(dat_water)
spec_tbl_df [17 x 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Temperature: num [1:17] 58.8 65.2 70.9 77.4 79.3 81 71.9 63.9 54.5 39.5 ...
 $ Production : num [1:17] 7107 6373 6796 9208 14792 ...
 $ Days       : num [1:17] 21 22 22 20 25 23 20 23 20 20 ...
 $ Persons    : num [1:17] 129 141 153 166 193 189 175 186 190 187 ...
 $ Water      : num [1:17] 3067 2828 2891 2994 3082 ...
 - attr(*, "spec")=
  .. cols(
  ..   Temperature = col_double(),
  ..   Production = col_double(),
  ..   Days = col_double(),
  ..   Persons = col_double(),
  ..   Water = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 
#dat_water

dat_water <-
  dat_water %>%
  mutate(
    # Add an ID column
    id = 1:n()
  ) %>%
  # filter to remove observations (if needed)
  #   TRUE        indicates that ALL observations are included
  #   !(id %in% c(4, 12)) indicates to include all observations that are NOT id 4 or 12
  filter(
    TRUE  # !(id %in% c(4, 12))
  )

Note: Because of the high correlation between Production and Persons, do not include Persons in the model.
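
The correlation behind this note can be checked directly. The following is a minimal sketch, assuming `dat_water` has been read in as above; the helper name `cor_pair` is hypothetical and not part of the original analysis.

```r
# Hypothetical helper (not from the original worksheet):
# Pearson correlation between two named columns of a data frame.
cor_pair <- function(dat, x, y) {
  cor(dat[[x]], dat[[y]], use = "complete.obs")
}

# Usage, assuming dat_water from the chunk above is in the workspace
if (exists("dat_water")) {
  # correlation between the two predictors named in the note
  print(cor_pair(dat_water, "Production", "Persons"))

  # full correlation matrix of the measured variables (id excluded)
  vars <- c("Temperature", "Production", "Days", "Persons", "Water")
  print(round(cor(as.data.frame(dat_water)[, vars]), 2))
}
```

A correlation near 1 or -1 between two predictors means they carry largely the same information, which is the usual reason for keeping only one of them in the model.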

2 Rubric

Following the in-class assignment this week, perform a complete multiple regression analysis.

  1. (1 p) Scatterplot matrix and interpretation
  2. (2 p) Fit model, assess multiple regression assumptions
  3. (1 p) Interpret added variable plots
  4. (1 p) If there are model assumption issues, say how you address them at the beginning and start again.
  5. (1 p) State and interpret the multiple regression hypothesis tests
  6. (2 p) Interpret the significant multiple regression coefficients
  7. (1 p) Interpret the multiple regression \(R^2\)
  8. (1 p) One- or two-sentence summary

3 Solutions

3.1 (1 p) Scatterplot matrix

Using the scatterplot matrix below, interpret the relationship between each pair of variables. If the plot suggests a transformation (that is, a relationship is curved), also plot the data on the transformed scale and perform the rest of the analysis on that scale. Otherwise, state that no transformation is necessary.

library(ggplot2)
library(GGally)
#p <- ggpairs(dat_water)
p <- ggpairs(dat_water %>% select(-id, -Persons))  ## use select to remove vars
print(p)

A parallel coordinate plot is another way of seeing patterns of observations over a range of variables.
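
One way to draw such a plot is with `ggparcoord()` from the GGally package already loaded above. This is a sketch under assumptions: the helper name `plot_parcoord` is hypothetical, and the choice to standardize each variable (`scale = "std"`) is illustrative rather than prescribed by the assignment.

```r
library(GGally)

# Hypothetical helper (not from the original worksheet):
# parallel coordinate plot of the chosen columns, each standardized
# (mean 0, sd 1) so that the vertical axes are comparable.
plot_parcoord <- function(dat, vars) {
  ggparcoord(
    data    = as.data.frame(dat)
  , columns = match(vars, names(dat))   # column positions of the chosen variables
  , scale   = "std"
  )
}

# Usage, assuming dat_water from the earlier chunk is in the workspace
if (exists("dat_water")) {
  p_par <- plot_parcoord(dat_water, c("Temperature", "Production", "Days", "Water"))
  print(p_par)
}
```

Each line in the plot is one monthly observation; lines that cross between two adjacent axes suggest a negative association between those variables, while roughly parallel lines suggest a positive one.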

3.1.1 Solution

[answer]

3.2 (2 p) Multiple regression assumptions (assessing model fit)

The multiple regression model is fit below. Start by assessing the model assumptions by interpreting what you learn from the first seven plots (save the added variable plots for the next question). If the assumptions are not met, attempt to address the problem by transforming a variable (or removing an outlier) and restart at the beginning using the new transformed variable.

# fit the multiple regression model
#lm_w_tpdp <- lm(Water ~ Temperature + Production + Days + Persons, data = dat_water)
lm_w_tpd <-
  lm(
    Water ~ Temperature + Production + Days
  , data = dat_water
  )

Plot diagnostics.
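
The worksheet's own diagnostic plotting code is not shown here. As a stand-in, the sketch below uses only base R (`plot()` on an `lm` object, `shapiro.test()`, `cooks.distance()`, `hatvalues()`); it produces six standard plots rather than the seven referred to above, and the helper name `lm_diag_base` is hypothetical.

```r
# Hypothetical helper (not from the original worksheet):
# base-R diagnostics for a fitted lm object.
lm_diag_base <- function(fit) {
  op <- par(mfrow = c(2, 3))
  on.exit(par(op))   # restore the plotting layout on exit

  # residuals vs fitted, normal QQ, scale-location,
  # Cook's distance, residuals vs leverage, Cook's distance vs leverage
  plot(fit, which = 1:6)

  # numeric summaries to accompany the plots
  list(
    shapiro = shapiro.test(residuals(fit))   # normality test of residuals
  , cooks   = cooks.distance(fit)            # influence of each observation
  , hat     = hatvalues(fit)                 # leverage of each observation
  )
}

# Usage, assuming lm_w_tpd from the chunk above is in the workspace
if (exists("lm_w_tpd")) {
  diag_out <- lm_diag_base(lm_w_tpd)
  print(diag_out$shapiro)
  # the three observations with the largest Cook's distance
  print(head(sort(diag_out$cooks, decreasing = TRUE), 3))
}
```

With only seventeen observations, the plots are usually more informative than the normality test, which has little power at this sample size.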