---
title: "ADA2: Homework 14, Ch 07b, ANCOVA 2"
author: "Name Here"
date: "`r format(Sys.time(), '%B %d, %Y')`"
output:
  html_document:
    toc: true
    number_sections: true
    toc_depth: 5
    code_folding: show
    #df_print: paged
    #df_print: kable
    #toc_float: true
      #collapsed: false
      #smooth_scroll: TRUE
    theme: cosmo #spacelab #yeti #united #cosmo
    highlight: tango
  pdf_document:
    df_print: kable
fontsize: 12pt
geometry: margin=0.25in
always_allow_html: yes
---
```{R, echo=FALSE}
# I set some GLOBAL R chunk options here.
# (to hide this message add "echo=FALSE" to the code chunk options)
knitr::opts_chunk$set(comment = NA, message = FALSE, warning = FALSE, width = 100)
knitr::opts_chunk$set(fig.align = "center", fig.height = 4, fig.width = 6)
```
This is a challenging dataset, in part because it's real and messy.
I will guide you through a simplified sensible analysis,
but other models are possible.
_Note that I needed to set `cache=FALSE` to ensure all output was updated._
# ANCOVA model: Albuquerque NM 87108, House and Apartment listing prices
Prof Erhardt constructed a dataset of listing prices for dwellings (homes and apartments) for sale from
[Zillow.com](http://www.zillow.com/homes/for_sale/Albuquerque-NM-87108/95303_rid/any_days/35.095087-106.52167835.035021-106.633258_rect/13_zm/0_mmm/)
on Feb 26, 2016 at 1 PM for Albuquerque NM 87108.
In this assignment we'll develop a model to help understand which qualities
contribute to a __typical dwelling's listing price__.
We will then also predict the listing prices of new listings
posted on the following day, Feb 27, 2016 by 2 PM.
Because we want to model a _typical dwelling_,
it is completely reasonable to remove "unusual" dwellings from the dataset.
Dwelling prices have a distribution with a [long tail](https://en.wikipedia.org/wiki/Long_tail)!
## Unusual assignment, not top-down, but up-down-up-down
This is an unusual assignment because the workflow of this assignment isn't top-down;
instead, you'll be scrolling up and down as you make decisions about the data and model you're fitting.
Yes, I have much of the code worked out for you.
However, there are data decisions to make early in the code
(such as excluding observations, transforming variables, etc.)
that depend on the analysis (model checking) later.
Think of it as a "choose your own adventure" that I've written for you.
### Keep a record of your decisions
It is always desirable to make your work reproducible,
either by someone else or by your future self.
For each step you take, keep a diary of
(a) what the next minor goal is,
(b) what evidence/information you have,
(c) what decision you make, and
(d) what the outcome was.
For example, here are the first couple of steps of your diary:
1. Include only "typical dwellings". Based on scatterplot, remove extreme observations. Keep only HOUSE and APARTMENT.
2. Exclude a few variables to reduce multicollinearity between predictor variables. Exclude `Baths` and `LotSize`.
3. etc.
## __(1 p)__ (Step 1) Restrict data to "typical" dwellings
__Step 1:__
After looking at the scatterplot below, identify what you consider to be a "typical dwelling"
and exclude observations far from that range.
For example, there are only a couple of `TypeSale` levels that are common enough to model;
remember to run `factor()` again to remove factor levels that no longer appear.
```{R}
library(tidyverse)
# First, download the data to your computer,
# save in the same folder as this Rmd file.
# read the data, skip the first two comment lines of the data file
dat_abq <-
read_csv("ADA2_HW_14_HomePricesZillow_Abq87108.csv", skip=2) %>%
mutate(
id = 1:n()
, TypeSale = factor(TypeSale)
# To help scale the intercept to a more reasonable value
# Scaling the x-variables is sometimes done, e.g., centering each x at its mean.
# center year at 1900 (negative values are older, -10 is built in 1890)
, YearBuilt_1900 = YearBuilt - 1900
) %>%
select(
id, everything()
, -Address, -YearBuilt
)
head(dat_abq)
## RETURN HERE TO SUBSET THE DATA
dat_abq <-
dat_abq %>%
filter(
TRUE # (X <= z) # keep observations where variable X <= value z
)
# note, if you remove a level from a categorical variable, then run factor() again
# SOLUTION
# these deletions are based only on the scatter plot in order to have
# "typical" dwellings
str(dat_abq)
```
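As a reminder of why `factor()` must be re-run, here is a small self-contained sketch (a toy vector, not the Zillow data) showing that subsetting alone does not drop unused levels:

```r
# Toy example: subsetting a factor does NOT drop unused levels;
# re-running factor() does.
x <- factor(c("HOUSE", "APARTMENT", "CONDO"))
x_sub <- x[x != "CONDO"]
levels(x_sub)          # "APARTMENT" "CONDO" "HOUSE" -- "CONDO" lingers
x_sub <- factor(x_sub) # drop the unused level
levels(x_sub)          # "APARTMENT" "HOUSE"
```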
## __(1 p)__ (Step 3) Transform response, if necessary.
__Step 3:__
Does the response variable require a transformation?
If so, what transformation is recommended from the model diagnostic plots (Box-Cox)?
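A hedged sketch of the Box-Cox check on synthetic data (not the Zillow listings; all names below are made up for illustration): when the estimated lambda is near 0, a log transformation of the response is suggested.

```r
# Synthetic illustration of a Box-Cox check; lambda near 0 suggests log(y).
library(MASS)
set.seed(1)
x <- runif(100, 500, 3000)
y <- exp(0.5 + 0.001 * x + rnorm(100, sd = 0.2)) # response is log-linear in x
fit <- lm(y ~ x)
bc <- boxcox(fit, lambda = seq(-2, 2, length = 101), plotit = FALSE)
bc$x[which.max(bc$y)] # lambda maximizing the profile log-likelihood; near 0 here
```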
### Solution
[answer]
```{R}
dat_abq <-
dat_abq %>%
mutate(
# Price in units of $1000
PriceListK = PriceList / 1000
) %>%
select(
-PriceList
)
str(dat_abq)
```
## __(1 p)__ (Step 4) Remove extremely influential observations.
__Step 4:__
The goal is to develop a model that will work well for the typical dwellings.
If an observation is highly influential, then it's not typical, so it is reasonable to remove it.
```{R}
## Remove influential observation
#dat_abq <- dat_abq[-which(row.names(dat_abq) %in% c(...)),]
# SOLUTION
```
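As a self-contained sketch (toy data, not the Zillow listings) of how such observations can be flagged, here is Cook's distance with the common 4/(n - p - 1) cutoff used in the diagnostic plots below:

```r
# Toy data with one planted gross outlier in row 31.
set.seed(2)
d <- data.frame(x = c(rnorm(30), 10), y = c(rnorm(30), 25))
fit <- lm(y ~ x, data = d)
cd <- cooks.distance(fit)
cutoff <- 4 / (nrow(d) - length(coef(fit)) - 1) # 4 / (n - p - 1)
which(cd > cutoff) # the planted outlier (row 31) is flagged
# d <- d[-which(cd > cutoff), ] # remove flagged rows, then refit
```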
## Subset data for model building and prediction
Create a subset of the data for building the model,
and another subset for prediction later on.
```{R}
# remove observations with NAs
dat_abq <-
dat_abq %>%
na.omit()
# the data subset we will use to build our model
dat_sub <-
dat_abq %>%
filter(
DaysListed > 0
)
# the data subset we will predict from our model
dat_pred <-
dat_abq %>%
filter(
DaysListed == 0
) %>%
mutate(
# the prices we hope to predict closely from our model
PriceListK_true = PriceListK
# set them to NA to predict them later
, PriceListK = NA
)
```
Scatterplot of the model-building subset.
```{R, fig.height = 8, fig.width = 8}
# NOTE: this plot takes a long time if you're repeatedly recompiling the document.
# Comment out the "print(p)" line to save some time when you're not evaluating this plot.
library(GGally)
library(ggplot2)
p <- ggpairs(dat_sub
, mapping = ggplot2::aes(colour = TypeSale, alpha = 0.5)
, lower = list(continuous = "points")
, upper = list(continuous = "cor")
)
print(p)
```
There are clearly some unusual observations.
Go back to the first code chunk and remove some observations that don't represent a "typical" dwelling.
For example, remove these dwellings (in code above):
* Super expensive dwelling
* Dwellings with huge lots
* Dwellings that were listed for years
* Because most dwellings were APARTMENTs and HOUSEs, remove the other types (there is only about one listing of each).
Discuss the observed correlations or other outstanding features in the data.
### Solution
[answer]
Features of data:
## __(1 p)__ (Step 2) Fit full two-way interaction model.
_You'll revisit this section after each modification of the data above._
__Step 2:__
Let's fit the full two-way interaction model and assess the assumptions.
However, some of the predictor variables are highly correlated.
Recall that the interpretation of a beta coefficient is
"the expected increase in the response for a 1-unit increase in $x$
with all other predictors held constant".
It's hard to hold one variable constant if it's correlated with another variable you're increasing.
Therefore, we'll make a decision to retain some variables but not others
depending on their correlation values.
(In the PCA chapter, we'll see another strategy.)
Somewhat arbitrarily, let's exclude `Baths` (since highly correlated with `Beds` and `Size_sqft`).
Let's also exclude `LotSize` (since highly correlated with `Size_sqft`).
Modify the code below.
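A toy sketch of the kind of correlation check behind this decision (synthetic variables with made-up names, not the Zillow columns): compute the correlation matrix of the numeric predictors and keep only one variable from any strongly correlated pair.

```r
set.seed(3)
size_sqft <- rnorm(60, mean = 1600, sd = 400)
lot_size  <- 3 * size_sqft + rnorm(60, sd = 300) # strongly tied to size
beds      <- pmax(1, round(size_sqft / 700))
cmat <- round(cor(cbind(size_sqft, lot_size, beds)), 2)
cmat
cmat["size_sqft", "lot_size"] # near 1, so keep only one of the pair
```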
Notice that because APARTMENTs don't have more than 1 bed or bath,
the `TypeSale:Beds` interaction term needs to be excluded from the model;
I show you how to do this manually using the `update()` function.
Note that the formula below `y ~ (x1 + x2 + x3)^2` expands into all main effects and two-way interactions.
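You can confirm this expansion directly with `terms()` (generic variable names here, just to illustrate):

```r
# The ^2 on a sum of terms expands into all main effects
# plus all two-way interactions.
f <- y ~ (x1 + x2 + x3)^2
attr(terms(f), "term.labels")
# "x1" "x2" "x3" "x1:x2" "x1:x3" "x2:x3"
```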
```{R}
## SOLUTION
lm_full <-
lm(
PriceListK ~ (TypeSale + Beds + Size_sqft + DaysListed + YearBuilt_1900)^2
, data = dat_sub
)
#lm_full <-
# lm(
# PriceListK ~ (Beds + Baths + Size_sqft + LotSize + DaysListed + YearBuilt_1900)^2
# , data = dat_sub
# )
lm_full
library(car)
try(Anova(lm_full, type=3))
## Note that this doesn't work because APARTMENTs only have 1 bed and 1 bath.
## There isn't a second level of bed to estimate the TypeSale:Beds interaction.
## Therefore, remove that interaction term.
lm_full <-
update(
lm_full
, . ~ . - TypeSale:Beds
)
library(car)
try(Anova(lm_full, type=3))
```
Note: the diagnostic plots below are produced by a helper function defined in the next chunk (the chunk's `echo=FALSE` option hides its code from the output, so look in the Rmd file).
```{R, echo=FALSE}
#' lm_diag_plots() is a general function for plotting residual diagnostics for an lm() object
#' By: Prof. Erik B. Erhardt, UNM Statistics, erike@stat.unm.edu, StatAcumen.com
#' 02/02/2020
#'
#' @param fit linear model object returned by lm()
#' @param rc_mfrow number of rows and columns for the graphic plot, e.g., c(2,3)
#' @param which_plot default plot numbers for lm()
#' @param n_outliers number to identify in plots from lm() and qqPlot()
#' @param sw_order_of_data T/F for whether to show residuals by order of data
#'
#' @return NULL, invisibly
#' @export
#'
#' @examples
#' lm_full <- lm(mpg ~ cyl + disp + hp + gear, data = mtcars)
#' lm_diag_plots(lm_full, rc_mfrow = c(1, 3))
lm_diag_plots <-
function(
fit
, rc_mfrow = NA
, which_plot = c(4, 6, 1)
, n_outliers = 3
, sw_qqplot = TRUE
, sw_boxcox = TRUE
, sw_constant_var = TRUE
, sw_collinearity = TRUE
, sw_order_of_data = FALSE
) {
# variable names
var_names <- names(fit$model)[-1]
# display settings
if (is.na(rc_mfrow[1])) {
n_plots <-
sw_qqplot +
length(which_plot) +
length(var_names) +
sw_boxcox +
sw_constant_var +
sw_collinearity +
sw_order_of_data
rc_mfrow <- c(ceiling(n_plots / 3), 3)
#rc_mfrow <- c(1, 3)
}
op <- par(no.readonly = TRUE) # the whole list of settable par
par(mfrow = rc_mfrow)
# Normal quantile plot (QQ-plot)
#library(car)
if(sw_qqplot) {
car::qqPlot(as.numeric(fit$residuals), las = 1, id = list(n = n_outliers), main="QQ Plot", ylab = "Residuals")
}
# default: Fitted, Cook's distance (with cutoff), and Leverage (with cutoffs)
for(i_plot in which_plot) {
plot(fit, which = i_plot, id.n = n_outliers)
if (i_plot == 4) {
Di_large <- 4 / (dim(fit$model)[1] - dim(fit$model)[2] - 1)
abline(h = Di_large, col = "blue", lty = 3) # horizontal line
}
if (i_plot == 6) {
Di_large <- 4 / (dim(fit$model)[1] - dim(fit$model)[2] - 1)
lev_large <- c(2, 3) * dim(fit$model)[2] / dim(fit$model)[1]
abline(h = Di_large, col = "blue", lty = 3) # horizontal line at Cook's distance cutoff
abline(v = lev_large[1], col = "blue", lty = 3) # vertical line at 2p/n leverage
abline(v = lev_large[2], col = "blue", lty = 2) # vertical line at 3p/n leverage
}
}
# residuals plotted vs each main effect
for(i_plot in 1:length(var_names)) {
m_lab <- paste("Residuals vs.", var_names[i_plot])
plot(x = fit$model[,var_names[i_plot]], y = fit$residuals, main=m_lab, ylab = "Residuals", xlab = var_names[i_plot])
abline(h = 0, col = "gray75", lty = 3) # horizontal line at zero
if((class(fit$model[,var_names[i_plot]]) %in% c("numeric", "integer"))) {
# use a simple smoother if fewer than 4 unique x values, otherwise use splines
if (length(unique(fit$model[,var_names[i_plot]])) < 4) {
# Loess
#lines(predict(loess(fit$residuals ~ fit$model[,var_names[i_plot]], enp.target=1)), col="red", lwd=1)
# Friedman's SuperSmoother
lines(supsmu(x=fit$model[,var_names[i_plot]], y=fit$residuals), col="red", lwd=1)
} else {
lines(smooth.spline(fit$residuals ~ fit$model[,var_names[i_plot]], spar=0.8), col="red", lwd=1, lty=1)
}
}
}
# residuals vs order of data
if(sw_order_of_data) {
# order of data (not always interesting)
plot(fit$residuals, main="Residuals vs Order of data", ylab = "Residuals")
abline(h = 0, col = "gray75", lty = 3) # horizontal line at zero
}
# Box-Cox transformation suggestion
# only if all values are positive
if(sw_boxcox) {
if (min(fit$model[, 1]) > 0) {
#library(car)
car::boxCox(fit, lambda = seq(-3, 3, length = 101), main = "Box-Cox power transformation")
abline(v = 1, col = "orange", lty = 3, lwd = 2) # vertical line at lambda = 1 (no transformation)
}
}
}
# Evaluate homoscedasticity
if(sw_constant_var) {
#library(car)
# non-constant error variance test
print(car::ncvTest(fit))
# plot studentized residuals vs. fitted values
try(car::spreadLevelPlot(fit, sub = "(Homoscedasticity)"))
}
# Evaluate Collinearity
if(sw_collinearity) {
#library(car)
vif_val <- car::vif(fit) # variance inflation factors
dotchart(vif_val, main = "Collinearity", xlab = "Variance Inflation Factor (VIF)", sub = "Not as useful with interactions")
abline(v = 0 , col = "gray75", lty = 3) # vertical line at zero
abline(v = 2^2, col = "blue", lty = 2) # vertical line at VIF = 4
abline(v = 10 , col = "blue", lty = 3) # vertical line at VIF = 10
}
par(op) # reset plotting options
invisible(NULL)
## Useful list of diags: http://www.statmethods.net/stats/rdiagnostics.html
} # end of lm_diag_plots()
```
```{R, fig.height = 3, fig.width = 8}
## Uncomment this line when you're ready to assess the model assumptions
#lm_diag_plots(lm_full, rc_mfrow = c(1, 3))
```
After Step 2, interpret the residual plots.
What are the primary issues in the original model?
### Solution
[answer]
## __(1 p)__ (Step 5) Model selection, check model assumptions.
Using `step(..., direction="both")` with the BIC criterion,
perform model selection.
### Solution
```{R, fig.height = 3, fig.width = 8}
## BIC
# option: test="F" includes additional information
# for parameter estimate tests that we're familiar with
# option: for BIC, include k=log(nrow( [data.frame name] ))
lm_red_BIC <-
step(
lm_full
, direction = "both"
, test = "F"
, trace = 0
, k = log(nrow(dat_sub))
)
lm_final <- lm_red_BIC
## Uncomment this line when you're ready to assess the model assumptions
#lm_diag_plots(lm_final, rc_mfrow = c(1, 3))
```
Model assumptions appear to be reasonably met.
A few influential observations exist.
## __(2 p)__ (Step 6) Plot final model, interpret coefficients.
If you arrived at the same model I did,
then the code below will plot it.
Eventually (after Step 7), the fitted model equations will describe
each dwelling `TypeSale`,
and you will interpret the coefficients.
```{R, fig.height = 5, fig.width = 8, echo=FALSE}
library(ggplot2)
p <- ggplot(dat_sub, aes(x = Size_sqft, y = PriceListK, colour = TypeSale, shape = TypeSale))
p <- p + geom_point(size = 2, alpha = 1)
#p <- p + expand_limits(x = 0, y = 8.5)
p <- p + geom_smooth(method = lm, se = FALSE) # , alpha=0.15, fullrange = TRUE)
p <- p + labs(title="Log Listing Price", x = "log10(Size_sqft)")
print(p)
```
```{R}
library(car)
#Anova(lm_final, type=3)
summary(lm_final)
```
Fitted model equation is
$$
\widehat{\log_{10}(\textrm{PriceList})}
=
`r signif(coef(lm_final)[1], 3)`
+ `r signif(coef(lm_final)[2], 3)` I(\textrm{TypeSale} = \textrm{HOUSE})
+ `r signif(coef(lm_final)[3], 3)` \log_{10}(\textrm{Size_sqft})
$$
### Solution
After Step 7, return and interpret the model coefficients above.
[answer]
## __(1 p)__ (Step 7) Transform predictors.
We now have enough information to see that a transformation of a predictor can be useful.
See the curvature with `Size_sqft`?
This is one of the headaches of regression modelling:
_everything depends on everything else_,
and you learn as you go.
Return to the top and transform `Size_sqft` and `LotSize`.
A nice feature of this transformation is that the model interaction goes away.
Our interpretation is now on the log scale, but it's a simpler model.
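The Step 7 change itself is a one-line `mutate()` in the first data chunk; here is a sketch on a toy tibble (not the Zillow data):

```r
# Toy illustration of transforming size variables to the log10 scale.
library(dplyr)
toy <- tibble(Size_sqft = c(600, 1200, 2400), LotSize = c(3000, 6000, 12000))
toy %>% mutate(across(c(Size_sqft, LotSize), log10)) # both sizes now on log10 scale
```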
## __(2 p)__ (Step 8) Predict new observations, interpret model's predictive ability.
Using the `predict()` function, we'll input the data we held out to predict earlier,
and use our final model to predict the `PriceListK` response.
Note that `10^lm_pred` is the table of values on the scale of "thousands of dollars".
Interpret the predictions below the output.
How well do you expect this model to predict? Justify your answer.
```{R}
# predict new observations, convert to data frame
lm_pred <-
as.data.frame(
predict(
lm_final
, newdata = dat_pred
, interval = "prediction"
)
) %>%
mutate(
# add column of actual list prices
PriceListK = dat_pred$PriceListK_true
)
lm_pred
# on "thousands of dollars" scale
10^lm_pred
# attributes of the three predicted observations
dat_pred
```
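One way to judge predictive ability is a coverage check: does each held-out true value fall inside its 95% prediction interval? A self-contained toy sketch of the idea (the same check applies to `lm_pred` above after back-transforming with `10^`):

```r
# Toy coverage check for prediction intervals from predict.lm().
set.seed(4)
d <- data.frame(x = runif(40, 0, 10))
d$y <- 2 + 3 * d$x + rnorm(40)
fit <- lm(y ~ x, data = d)
new   <- data.frame(x = c(2, 5, 8))
truth <- 2 + 3 * new$x # true mean values we hope to capture
pr <- predict(fit, newdata = new, interval = "prediction")
truth >= pr[, "lwr"] & truth <= pr[, "upr"] # TRUE means the interval covered
```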
### Solution
[answer]