A production plant cost-control engineer is responsible for cost reduction. One of the costly items in his plant is the amount of water used by the production facilities each month. He decided to investigate water usage by collecting seventeen observations on his plant’s water usage and other variables.
|Variable|Description|
|---|---|
|Temperature|Average monthly temperature (°F)|
|Production|Amount of production (M pounds)|
|Days|Number of plant operating days in the month|
|Persons|Number of persons on the monthly plant payroll|
|Water|Monthly water usage (gallons)|
library(erikmisc)
library(tidyverse)

# First, download the data to your computer,
#   save in the same folder as this Rmd file.

# read the data
dat_water <- read_csv("ADA2_CL_04_water.csv")
str(dat_water)
spec_tbl_df [17 x 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Temperature: num [1:17] 58.8 65.2 70.9 77.4 79.3 81 71.9 63.9 54.5 39.5 ...
 $ Production : num [1:17] 7107 6373 6796 9208 14792 ...
 $ Days       : num [1:17] 21 22 22 20 25 23 20 23 20 20 ...
 $ Persons    : num [1:17] 129 141 153 166 193 189 175 186 190 187 ...
 $ Water      : num [1:17] 3067 2828 2891 2994 3082 ...
 - attr(*, "spec")=
  .. cols(
  ..   Temperature = col_double(),
  ..   Production = col_double(),
  ..   Days = col_double(),
  ..   Persons = col_double(),
  ..   Water = col_double()
  .. )
 - attr(*, "problems")=<externalptr>
dat_water <-
  dat_water %>%
  mutate(
    # Add an ID column
    id = 1:n()
  ) %>%
  # filter to remove observations (if needed)
  #   TRUE indicates that ALL observations are included
  #   !(id %in% c(4, 12)) indicates to include all observations that are NOT id 4 or 12
  filter(
    TRUE
    # !(id %in% c(4, 12))
  )
Note: Because of the high correlation between Production and Persons, do not include Persons in the model.
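A correlation claim like the one in this note can be checked with a correlation matrix of the candidate predictors. The sketch below is only an illustration: `dat_demo` is simulated stand-in data (not the plant's observations), and the 0.8 cutoff is an assumed rule of thumb rather than a value from the assignment.

```r
# Hypothetical stand-in data: simulated values, NOT the plant's data.
set.seed(1)
n <- 17
dat_demo <- data.frame(
  Production = runif(n, 6000, 20000)
)
# Persons is constructed to track Production, mimicking a collinear predictor
dat_demo$Persons <- 120 + 0.004 * dat_demo$Production + rnorm(n, sd = 3)
dat_demo$Days    <- sample(19:25, n, replace = TRUE)

# Pairwise Pearson correlations among the candidate predictors
cor_mat <- cor(dat_demo)
print(round(cor_mat, 2))

# Flag predictor pairs whose absolute correlation exceeds the cutoff
cutoff <- 0.8  # assumption: a common rule of thumb
high <- which(abs(cor_mat) > cutoff & upper.tri(cor_mat), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r    = cor_mat[high]
)
print(high_pairs)
```

With the real data, the same two steps would be applied to the predictor columns of `dat_water`. When two predictors are this strongly related, their individual coefficients become hard to estimate separately, which is the usual reason for dropping one of them.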
Following the in-class assignment this week, perform a complete multiple regression analysis.
In the scatterplot matrix below, interpret the relationship between each pair of variables. If the plot suggests a transformation (that is, the relationship is curved), also plot the data on the transformed scale and perform the following analysis on the transformed scale. Otherwise, indicate that no transformation is necessary.
library(ggplot2)
library(GGally)
#p <- ggpairs(dat_water)
p <- ggpairs(dat_water %>% select(-id, -Persons)) ## use select to remove vars
print(p)
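If a panel of the scatterplot matrix shows a curved relationship, one common remedy is a log transformation of the curved variable, followed by a replot on the new scale. The sketch below uses simulated data (`dat_curve`, with hypothetical columns `x` and `y`) purely to show the mechanics; whether any transformation is needed for the water data is for you to judge from the plot.

```r
# Hypothetical stand-in data: an exponential (curved) relationship, simulated.
set.seed(2)
x <- runif(17, 1, 10)
y <- exp(0.5 * x + rnorm(17, sd = 0.2))
dat_curve <- data.frame(x = x, y = y)

# Add the transformed variable as a new column, keeping the original
dat_curve$log_y <- log(dat_curve$y)

# Linear correlation on each scale; a straighter relationship gives a larger value
r_raw <- cor(dat_curve$x, dat_curve$y)
r_log <- cor(dat_curve$x, dat_curve$log_y)
print(c(raw = r_raw, log = r_log))

# Side-by-side scatterplots on the original and transformed scales
op <- par(mfrow = c(1, 2))
plot(y ~ x, data = dat_curve, main = "Original scale")
plot(log_y ~ x, data = dat_curve, main = "Log scale")
par(op)
```

If a transformation is adopted, the transformed column replaces the original in every later step, including the model formula.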
A parallel coordinate plot is another way of seeing patterns of observations over a range of variables.
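One way to draw such a plot is `ggparcoord()` from the GGally package. The sketch below is an illustration on simulated stand-in data (`dat_demo` is hypothetical, not the plant's observations); the choice of `scale = "uniminmax"`, which rescales each variable to the range 0 to 1, is an assumption made here so that variables with very different units share one axis.

```r
library(GGally)

# Hypothetical stand-in data: simulated values, NOT the plant's data.
set.seed(3)
dat_demo <- data.frame(
  Temperature = runif(17, 30, 85),
  Production  = runif(17, 6000, 20000),
  Days        = sample(19:25, 17, replace = TRUE),
  Water       = runif(17, 2500, 4500)
)

# Each observation becomes one line crossing the four variable axes
p_par <- ggparcoord(
  data       = dat_demo,
  columns    = 1:4,          # columns to show as axes, in this order
  scale      = "uniminmax",  # rescale each variable to [0, 1]
  alphaLines = 0.6           # partial transparency so overlapping lines stay visible
)
print(p_par)
```

Lines that stay roughly parallel between two adjacent axes indicate a positive association; lines that cross heavily indicate a negative one.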
The multiple regression is fit below. Start by assessing the model assumptions by interpreting what you learn from the first seven plots (save the added variable plots for the next question). If the assumptions are not met, attempt to address this by transforming a variable (or removing an outlier), then restart at the beginning using the new transformed variable.
# fit the multiple linear regression model
#lm_w_tpdp <- lm(Water ~ Temperature + Production + Days + Persons, data = dat_water)
lm_w_tpd <-
  lm(
    Water ~ Temperature + Production + Days
  , data = dat_water
  )
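For reference, base R can produce a comparable set of diagnostics for a fitted `lm` object. This is a sketch using base R's `plot()` method, not the diagnostic function the assignment itself uses, and it gives six plots rather than seven. The data are simulated (`dat_demo` and `fit_demo` are hypothetical names), and the `4 / n` Cook's distance cutoff is an assumed rule of thumb.

```r
# Hypothetical stand-in data: simulated values, NOT the plant's data.
set.seed(4)
n <- 17
dat_demo <- data.frame(
  Temperature = runif(n, 30, 85),
  Production  = runif(n, 6000, 20000),
  Days        = sample(19:25, n, replace = TRUE)
)
dat_demo$Water <- 2000 + 5 * dat_demo$Temperature +
  0.05 * dat_demo$Production + rnorm(n, sd = 150)

# Same model structure as the assignment, fit to the stand-in data
fit_demo <- lm(Water ~ Temperature + Production + Days, data = dat_demo)

# Six base R diagnostic plots: residuals vs fitted, normal QQ, scale-location,
# Cook's distance, residuals vs leverage, Cook's distance vs leverage
op <- par(mfrow = c(2, 3))
plot(fit_demo, which = 1:6)
par(op)

# Numeric companions to the plots
sw <- shapiro.test(residuals(fit_demo))  # normality of residuals
cd <- cooks.distance(fit_demo)           # influence of each observation
influential <- which(cd > 4 / n)         # assumption: rule-of-thumb cutoff
print(sw)
print(influential)
```

The plots carry the main evidence; the Shapiro-Wilk test and Cook's distances are only a numeric cross-check, and with 17 observations the test has limited power.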