This is a challenging dataset, in part because it's real and messy. I will guide you through a simplified, sensible analysis, but other models are possible. Note that I needed to set cache=FALSE to ensure all output was updated.
Prof Erhardt constructed a dataset of listing prices for dwellings (homes and apartments) for sale on Zillow.com on Feb 26, 2016 at 1 PM for Albuquerque NM 87108. In this assignment we'll develop a model to help understand which qualities contribute to a typical dwelling's listing price. We will then also predict the listing prices of new listings posted on the following day, Feb 27, 2016 by 2 PM.
Because we want to model a typical dwelling, it is completely reasonable to remove "unusual" dwellings from the dataset. Dwelling prices form a distribution with a long tail!
This is an unusual assignment because the workflow of this assignment isn’t top-down; instead, you’ll be scrolling up and down as you make decisions about the data and model you’re fitting. Yes, I have much of the code worked out for you. However, there are data decisions to make early in the code (such as excluding observations, transforming variables, etc.) that depend on the analysis (model checking) later. Think of it as a “choose your own adventure” that I’ve chosen for you.
It is always desirable to make your work reproducible, either by someone else or by your future self. For each step you take, keep a diary of (a) what the next minor goal is, (b) what evidence/information you have, (c) what decision you make, and (d) what the outcome was.
For example, here’s the first couple entries of your diary:
Baths
and LotSize
.Step 1: After looking at the scatterplot below, identify what you consider to be a “typical dwelling” and exclude observations far from that range. For example, there are only a couple TypeSale
that are common enough to model; remember to run factor()
again to remove factor levels that no longer appear.
# read the data, skip the first two comment lines of the data file
dat.abq <- read.csv("http://statacumen.com/teach/ADA2/homework/ADA2_HW_07_HomePricesZillow_Abq87108.csv"
, skip=2, stringsAsFactors = FALSE)[,-1]
dat.abq$TypeSale <- factor(dat.abq$TypeSale)
# To help scale the intercept to a more reasonable value,
# the x-variables are sometimes centered (e.g., at the mean of each x).
# center year at 1900
dat.abq$YearBuilt_1900 <- dat.abq$YearBuilt - 1900
dat.abq$YearBuilt <- NULL
head(dat.abq)
   TypeSale PriceList Beds Baths Size.sqft LotSize DaysListed YearBuilt_1900
1     HOUSE    186900    3     2      1305    6969          0             54
2 APARTMENT    305000    1     1      2523    6098          0             48
3 APARTMENT    244000    1     1      2816    6098          0             89
4     CONDO    108000    3     2      1137      NA          0             96
5     CONDO     64900    2     1      1000      NA          1             85
6     HOUSE    275000    3     3      2022    6098          1             52
## RETURN HERE TO SUBSET THE DATA
# dat.abq <- subset(dat.abq, !(X > Z)) # exclude observations of this type
# dat.abq <- dat.abq[!(dat.abq$X > Z),] # exclude observations of this type
# SOLUTION
# these deletions are based only on the scatter plot in order to have
# "typical" dwellings
Step 3: Does the response variable require a transformation? If so, what transformation is recommended from the model diagnostic plots (Box-Cox)?
[answer]
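A minimal sketch of the Box-Cox check, assuming the car package and a fitted model such as lm.full from Step 2 below (the workflow is non-linear, so fit the model first, then return here):
library(car)
# profile the likelihood over power transformations of the response;
# lambda near 0 suggests log(y), lambda near 1 suggests no transformation
boxCox(lm.full)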
# Price in units of $1000
dat.abq$PriceListK <- dat.abq$PriceList / 1000
dat.abq$PriceList <- NULL
# SOLUTION
Step 4: The goal is to develop a model that will work well for the typical dwellings. If an observation is highly influential, then it's unusual, and it's reasonable to remove it.
## Remove influential observation
#dat.abq <- dat.abq[-which(row.names(dat.abq) %in% c(...)),]
# SOLUTION
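One common way to find such observations (a sketch; the 4/n cutoff is a rule of thumb, not part of the assignment):
# Cook's distance from the fitted model; flag observations above 4/n
cooks.d <- cooks.distance(lm.full)
which(cooks.d > 4 / length(cooks.d))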
Create a subset of the data for building the model, and another subset for prediction later on.
# remove observations with NAs
dat.abq <- na.omit(dat.abq)
# the data subset we will use to build our model
dat.sub <- subset(dat.abq, DaysListed > 0)
# the data subset we will predict from our model
dat.pred <- subset(dat.abq, DaysListed == 0)
# the prices we hope to predict well from our model
dat.pred$PriceListK_true <- dat.pred$PriceListK
# set them to NA to predict them later
dat.pred$PriceListK <- NA
Scatterplot of the model-building subset.
# NOTE: this plot takes a long time if you're repeatedly recompiling the document.
# Comment out the "print(p)" line to save some time when you're not evaluating this plot.
library(GGally)
library(ggplot2)
p <- ggpairs(dat.sub
, mapping = ggplot2::aes(colour = TypeSale, alpha = 0.5)
, lower = list(continuous = "points")
, upper = list(continuous = "cor")
)
print(p)
There are clearly some unusual observations. Go back to the first code chunk and remove some observations that don’t represent a “typical” dwelling.
For example, remove these dwellings (in the code above):
Discuss the observed correlations or other outstanding features in the data.
[answer]
You’ll revisit this section after each modification of the data above.
Step 2: Let’s fit the full two-way interaction model and assess the assumptions. However, some of the predictor variables are highly correlated. Recall that the interpretation of a beta coefficient is “the expected increase in the response for a 1-unit increase in \(x\) with all other predictors held constant”. It’s hard to hold one variable constant if it’s correlated with another variable you’re increasing. Therefore, we’ll make a decision to retain some variables but not others depending on their correlation values. (In the PCA chapter, we’ll see another strategy.)
Somewhat arbitrarily, let's exclude Baths (since it is highly correlated with Beds and Size.sqft). Let's also exclude LotSize (since it is highly correlated with Size.sqft). Modify the code below. Notice that because APARTMENTs don't have more than 1 Beds or Baths, those interaction terms need to be excluded from the model; I show you how to do this manually using the update() function.
Note that the formula below, y ~ (x1 + x2 + x3)^2, expands into all main effects and two-way interactions.
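You can see the expansion for yourself; a small illustration with placeholder variables x1, x2, x3:
# the ^2 notation expands to all main effects plus all two-way interactions
attr(terms(y ~ (x1 + x2 + x3)^2), "term.labels")
# [1] "x1"    "x2"    "x3"    "x1:x2" "x1:x3" "x2:x3"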
## SOLUTION
lm.full <- lm(PriceListK ~ (TypeSale + Beds + Baths + Size.sqft + LotSize + DaysListed + YearBuilt_1900)^2, data = dat.sub)
#lm.full <- lm(PriceListK ~ (Beds + Baths + Size.sqft + LotSize + DaysListed + YearBuilt_1900)^2, data = dat.sub)
library(car)
try(Anova(lm.full, type=3))
## Note that this doesn't work because APARTMENTs only have 1 bed and 1 bath.
## There isn't a second level of bed or bath to estimate the interaction.
## Therefore, remove those two terms
lm.full <- update(lm.full, . ~ . - TypeSale:Beds - TypeSale:Baths)
library(car)
try(Anova(lm.full, type=3))
Note that the diagnostic plots below are printed automatically with a new function (defined with echo=FALSE, so look in the Rmd file). Thanks to Eric Kruger for the idea of a diagnostic plotting function.
## Uncomment this line when you're ready to assess the model assumptions
#lm.diag.plots(lm.full, rc.mfrow = c(1,3))
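The actual lm.diag.plots() is hidden in the Rmd; a minimal sketch of such a function, using base graphics (this is an assumption, not the original code), might be:
# sketch (not the original function): standard residual diagnostics,
# arranged in a user-supplied mfrow layout
lm.diag.plots <- function(fit, rc.mfrow = c(1, 3)) {
  op <- par(mfrow = rc.mfrow)
  on.exit(par(op))
  # residuals vs fitted: look for curvature and non-constant variance
  plot(fitted(fit), resid(fit), xlab = "Fitted", ylab = "Residuals")
  abline(h = 0, lty = 2)
  # normal Q-Q plot of residuals
  qqnorm(resid(fit)); qqline(resid(fit))
  # Cook's distance: look for influential observations
  plot(cooks.distance(fit), type = "h", ylab = "Cook's distance")
}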
After Step 2, interpret the residual plots. What are the primary issues?
[answer]
Using step(..., direction="both") with the BIC criterion, perform model selection.
## BIC
# option: test="F" includes additional information
# for parameter estimate tests that we're familiar with
# option: for BIC, include k=log(nrow( [data.frame name] ))
lm.red.BIC <- step(lm.full, direction="both", test="F", trace = 0
, k=log(nrow(dat.sub)))
lm.final <- lm.red.BIC
## Uncomment this line when you're ready to assess the model assumptions
#lm.diag.plots(lm.final, rc.mfrow = c(1,3))
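For comparison, a sketch of the same selection under AIC (the default penalty k = 2); BIC's larger penalty log(n) typically selects a smaller model:
## AIC (for comparison; not required by the assignment)
lm.red.AIC <- step(lm.full, direction = "both", test = "F", trace = 0)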
Model assumptions appear to be reasonably met. A few influential observations exist.
If you arrived at the same model I did, then the code below will summarize it. Eventually (after Step 7), the fitted model equations will describe each dwelling TypeSale, and you'll interpret the coefficients.
library(car)
#Anova(lm.final, type=3)
summary(lm.final)
Call:
lm(formula = PriceListK ~ TypeSale + Beds + Baths + Size.sqft +
LotSize + YearBuilt_1900 + TypeSale:LotSize + TypeSale:YearBuilt_1900 +
Beds:Baths + Beds:Size.sqft + Beds:LotSize + Baths:YearBuilt_1900 +
Size.sqft:LotSize, data = dat.sub)
Residuals:
Min 1Q Median 3Q Max
-174.692 -21.403 -0.624 23.065 133.337
Coefficients: (2 not defined because of singularities)
                                           Estimate Std. Error t value Pr(>|t|)    
(Intercept)                               8.404e+01  7.847e+01   1.071  0.28686    
TypeSaleFOR SALE BY OWNER                 1.071e+02  6.729e+01   1.591  0.11486    
TypeSaleFORECLOSURE                       5.796e+02  2.456e+02   2.360  0.02029 *  
TypeSaleHOUSE                             6.191e+02  1.162e+02   5.329 6.52e-07 ***
Beds                                     -6.977e+01  2.939e+01  -2.374  0.01960 *  
Baths                                     3.718e+01  5.579e+01   0.667  0.50666    
Size.sqft                                -4.852e-02  1.503e-02  -3.228  0.00171 ** 
LotSize                                   7.109e-03  5.577e-03   1.275  0.20547    
YearBuilt_1900                           -3.434e+00  1.213e+00  -2.831  0.00565 ** 
TypeSaleFOR SALE BY OWNER:LotSize                NA         NA      NA       NA    
TypeSaleFORECLOSURE:LotSize              -3.183e-02  1.629e-02  -1.954  0.05362 .  
TypeSaleHOUSE:LotSize                    -4.710e-02  9.702e-03  -4.854 4.68e-06 ***
TypeSaleFOR SALE BY OWNER:YearBuilt_1900         NA         NA      NA       NA    
TypeSaleFORECLOSURE:YearBuilt_1900       -8.471e+00  4.862e+00  -1.742  0.08469 .  
TypeSaleHOUSE:YearBuilt_1900             -5.598e+00  1.694e+00  -3.304  0.00134 ** 
Beds:Baths                               -7.267e+01  1.318e+01  -5.514 2.96e-07 ***
Beds:Size.sqft                            4.958e-02  6.367e-03   7.787 8.01e-12 ***
Beds:LotSize                              1.245e-02  4.525e-03   2.751  0.00711 ** 
Baths:YearBuilt_1900                      3.647e+00  8.460e-01   4.310 3.94e-05 ***
Size.sqft:LotSize                         1.057e-06  1.784e-07   5.921 4.96e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 54.98 on 96 degrees of freedom
Multiple R-squared: 0.9701, Adjusted R-squared: 0.9648
F-statistic: 182.9 on 17 and 96 DF, p-value: < 2.2e-16
Fitted model equation (the first terms, matching the summary above; complete and interpret it after Step 7): \[ \widehat{\textrm{PriceListK}} = 84 + 107\, I(\textrm{TypeSale} = \textrm{FOR SALE BY OWNER}) + 580\, I(\textrm{TypeSale} = \textrm{FORECLOSURE}) + 619\, I(\textrm{TypeSale} = \textrm{HOUSE}) + \cdots \]
After Step 7, return and interpret the model coefficients above.
[answer]
We now have enough information to see that a transformation of a predictor can be useful. See the curvature with Size.sqft? This is one of the headaches of regression modelling: everything depends on everything else, and you learn as you go. Return to the top and transform Size.sqft and LotSize.
A nice feature of this transformation is that the model interaction goes away. Our interpretation is now on the log scale, but it's a simpler model.
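A sketch of that change, to be placed near the top with the other data preparation; log10 (rather than natural log) is an assumption, chosen to match the 10^lm.pred back-transformation below:
# sketch: log10-transform the skewed predictors (add near the top)
# dat.abq$Size.sqft <- log10(dat.abq$Size.sqft)
# dat.abq$LotSize   <- log10(dat.abq$LotSize)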
Using the predict() function, we'll input the data we held out earlier and use our final model to predict the PriceListK response. Note that once the response has been log10-transformed, 10^lm.pred is the table of values back on the scale of "thousands of dollars".
Interpret the predictions below the output.
How well do you expect this model to predict? Justify your answer.
# predict new observations, convert to data frame
lm.pred <- as.data.frame(predict(lm.final, newdata = dat.pred, interval = "prediction"))
# add column of actual list prices
lm.pred$PriceListK <- dat.pred$PriceListK_true
lm.pred
fit lwr upr PriceListK
1 160.2250 49.43439 271.0157 186.9
2 127.1842 12.81698 241.5515 305.0
3 138.1019 17.26725 258.9365 244.0
# on "thousands of dollars" scale
10^lm.pred
fit lwr upr PriceListK
1 1.678934e+160 2.718868e+49 1.036763e+271 7.943282e+186
2 1.528426e+127 6.561084e+12 3.560519e+241 1.000000e+305
3 1.264370e+138 1.850337e+17 8.639680e+258 1.000000e+244
# attributes of the three predicted observations
dat.pred
   TypeSale Beds Baths Size.sqft LotSize DaysListed YearBuilt_1900 PriceListK PriceListK_true
1     HOUSE    3     2      1305    6969          0             54         NA           186.9
2 APARTMENT    1     1      2523    6098          0             48         NA           305.0
3 APARTMENT    1     1      2816    6098          0             89         NA           244.0
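To ground your answer, a quick sketch: check whether each true list price falls inside its 95% prediction interval (compare on a common scale; back-transform first if you logged the response).
# does each true price fall inside its prediction interval?
with(lm.pred, PriceListK >= lwr & PriceListK <= upr)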
[answer]