- 1 ANCOVA model: Albuquerque NM 87108, House and Apartment listing prices
- 1.1 Unusual assignment, not top-down, but up-down-up-down
- 1.2
**(2 p)**(Step 1) Restrict data to “typical” dwellings - 1.3
**(2 p)**(Step 3) Transform response, if necessary. - 1.4
**(2 p)**(Step 4) Remove extremely influential observations. - 1.5 Subset data for model building and prediction
- 1.6
**(2 p)**(Step 2) Fit full two-way interaction model. - 1.7
**(2 p)**(Step 5) Model selection, check model assumptions. - 1.8
**(4 p)**(Step 6) Plot final model, interpret coefficients. - 1.9
**(2 p)**(Step 7) Transform predictors. - 1.10
**(4 p)**(Step 8) Predict new observations, interpret model’s predictive ability.

This is a challenging dataset, in part because it’s real and messy. I will guide you through a simplified sensible analysis, but other models are possible.

*Note that I needed to set cache=FALSE to assure all output was updated.*

Prof Erhardt constructed a dataset of listing prices for dwellings (homes and apartments) for sale from Zillow.com on Feb 26, 2016 at 1 PM for Albuquerque NM 87108. In this assignment we’ll develop a model to help understand which qualities that contribute to a **typical dwelling’s listing price**. We will then also predict the listing prices of new listings posted on the following day, Feb 27, 2016 by 2 PM.

Because we want to model a *typical dwelling*, it is completely reasonable to remove “unusual” dwellings from the dataset. Dwellings have a distribution with a long tail!

This is an unusual assignment because the workflow of this assignment isn’t top-down; instead, you’ll be scrolling up and down as you make decisions about the data and model you’re fitting. Yes, I have much of the code worked out for you. However, there are data decisions to make early in the code (such as excluding observations, transforming variables, etc.) that depend on the analysis (model checking) later. Think of it as a “choose your own adventure” that I’ve written for you.

It is always desirable to make your work reproducible, either by someone else or by your future self. For each step you take, keep a diary of (a) what the next minor goal is, (b) what evidence/information you have, (c) what decision you make, and (d) what the outcome was.

For example, here’s the first couple steps of your diary:

- Include only “typical dwellings”. Based on scatterplot, remove extreme observations. Keep only HOUSE and APARTMENT.
- Exclude a few variables to reduce multicollinearity between predictor variables. Exclude
`Baths`

and`LotSize`

. - etc.

**Step 1:** After looking at the scatterplot below, identify what you consider to be a “typical dwelling” and exclude observations far from that range. For example, there are only a couple `TypeSale`

that are common enough to model; remember to run `factor()`

again to remove factor levels that no longer appear.

```
library(tidyverse)
# load ada functions
source("ada_functions.R")
# First, download the data to your computer,
# save in the same folder as this Rmd file.
# read the data, skip the first two comment lines of the data file
<-
dat_abq read_csv("ADA2_CL_14_HomePricesZillow_Abq87108.csv", skip=2) %>%
mutate(
id = 1:n()
TypeSale = factor(TypeSale)
, # To help scale the intercept to a more reasonable value
# Scaling the x-variables are sometimes done to the mean of each x.
# center year at 1900 (negative values are older, -10 is built in 1890)
YearBuilt_1900 = YearBuilt - 1900
, %>%
) select(
everything()
id, -Address, -YearBuilt
,
)
head(dat_abq)
```

```
# A tibble: 6 x 9
id TypeSale PriceList Beds Baths Size_sqft LotSize DaysListed
<int> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 HOUSE 186900 3 2 1305 6969 0
2 2 APARTME~ 305000 1 1 2523 6098 0
3 3 APARTME~ 244000 1 1 2816 6098 0
4 4 CONDO 108000 3 2 1137 NA 0
5 5 CONDO 64900 2 1 1000 NA 1
6 6 HOUSE 275000 3 3 2022 6098 1
# ... with 1 more variable: YearBuilt_1900 <dbl>
```

```
## RETURN HERE TO SUBSET THE DATA
<-
dat_abq %>%
dat_abq filter(
TRUE # (X <= z) # keep observations where variable X <= value z
)# note, if you remove a level from a categorical variable, then run factor() again
# SOLUTION
# these deletions are based only on the scatter plot in order to have
# "typical" dwellings
str(dat_abq)
```

```
tibble [143 x 9] (S3: tbl_df/tbl/data.frame)
$ id : int [1:143] 1 2 3 4 5 6 7 8 9 10 ...
$ TypeSale : Factor w/ 5 levels "APARTMENT","CONDO",..: 5 1 1 2 2 5 5 4 5 5 ...
$ PriceList : num [1:143] 186900 305000 244000 108000 64900 ...
$ Beds : num [1:143] 3 1 1 3 2 3 2 2 3 3 ...
$ Baths : num [1:143] 2 1 1 2 1 3 1 2 2 2 ...
$ Size_sqft : num [1:143] 1305 2523 2816 1137 1000 ...
$ LotSize : num [1:143] 6969 6098 6098 NA NA ...
$ DaysListed : num [1:143] 0 0 0 0 1 1 1 1 1 2 ...
$ YearBuilt_1900: num [1:143] 54 48 89 96 85 52 52 65 58 52 ...
```

**Step 3:** Does the response variable require a transformation? If so, what transformation is recommended from the model diagnostic plots (Box-Cox)?

[answer]

```
<-
dat_abq %>%
dat_abq mutate(
# Price in units of $1000
PriceListK = PriceList / 1000
# SOLUTION
%>%
) select(
-PriceList
)
str(dat_abq)
```

```
tibble [143 x 9] (S3: tbl_df/tbl/data.frame)
$ id : int [1:143] 1 2 3 4 5 6 7 8 9 10 ...
$ TypeSale : Factor w/ 5 levels "APARTMENT","CONDO",..: 5 1 1 2 2 5 5 4 5 5 ...
$ Beds : num [1:143] 3 1 1 3 2 3 2 2 3 3 ...
$ Baths : num [1:143] 2 1 1 2 1 3 1 2 2 2 ...
$ Size_sqft : num [1:143] 1305 2523 2816 1137 1000 ...
$ LotSize : num [1:143] 6969 6098 6098 NA NA ...
$ DaysListed : num [1:143] 0 0 0 0 1 1 1 1 1 2 ...
$ YearBuilt_1900: num [1:143] 54 48 89 96 85 52 52 65 58 52 ...
$ PriceListK : num [1:143] 186.9 305 244 108 64.9 ...
```

**Step 4:** The goal is to develop a model that will work well for the typical dwellings. If an observation is highly influential, then it’s unusual.

```
## Remove influential observation
<-
dat_abq %>%
dat_abq filter(
TRUE # !(id %in% c( ... ))
)
# SOLUTION
```

Create a subset of the data for building the model, and another subset for prediction later on.

```
# remove observations with NAs
<-
dat_abq %>%
dat_abq na.omit()
# the data subset we will use to build our model
<-
dat_sub %>%
dat_abq filter(
> 0
DaysListed
)
# the data subset we will predict from our model
<-
dat_pred %>%
dat_abq filter(
== 0
DaysListed %>%
) mutate(
# the prices we hope to predict closely from our model
PriceListK_true = PriceListK
# set them to NA to predict them later
PriceListK = NA
, )
```

Scatterplot of the model-building subset.

```
# NOTE, this plot takes a long time if you're repeadly recompiling the document.
# comment the "print(p)" line so save some time when you're not evaluating this plot.
library(GGally)
library(ggplot2)
<-
p ggpairs(
dat_submapping = ggplot2::aes(colour = TypeSale, alpha = 0.5)
, lower = list(continuous = "points")
, upper = list(continuous = "cor")
, progress = FALSE
,
)#print(p)
```

There are clearly some unusual observations. Go back to the first code chunk and remove some observations that don’t represent a “typical” dwelling.

For example, remove these dwellings (in code above):

- Super expensive dwelling
- Dwellings with huge lots
- Dwellings that were listed for years
- Because most dwellings were APARTMENTs and HOUSEs, remove the others (there are only 1 or so of each).

Discuss the observed correlations or other outstanding features in the data.

[answer]

Features of data:

*You’ll revisit this section after each modification of the data above.*

**Step 2:** Let’s fit the full two-way interaction model and assess the assumptions. However, some of the predictor variables are highly correlated. Recall that the interpretation of a beta coefficient is “the expected increase in the response for a 1-unit increase in \(x\) with all other predictors held constant”. It’s hard to hold one variable constant if it’s correlated with another variable you’re increasing. Therefore, we’ll make a decision to retain some variables but not others depending on their correlation values. (In the PCA chapter, we’ll see another strategy.)

Somewhat arbitrarily, let’s exclude `Baths`

(since highly correlated with `Beds`

and `Size_sqft`

). Let’s also exclude `LotSize`

(since highly correlated with `Size_sqft`

). Modify the code below. Notice that because APARTMENTs don’t have more than 1 Beds or Baths, those interaction terms need to be excluded from the model; I show you how to do this manually using the `update()`

function.

Note that the formula below `y ~ (x1 + x2 + x3)^2`

expands into all main effects and two-way interactions.

```
## SOLUTION
<-
lm_full lm(
~ (TypeSale + Beds + Size_sqft + DaysListed + YearBuilt_1900)^2
PriceListK data = dat_sub
,
)#lm_full <-
# lm(
# PriceListK ~ (Beds + Baths + Size_sqft + LotSize + DaysListed + YearBuilt_1900)^2
# , data = dat_sub
# )
lm_full
```

```
Call:
lm(formula = PriceListK ~ (TypeSale + Beds + Size_sqft + DaysListed +
YearBuilt_1900)^2, data = dat_sub)
Coefficients:
(Intercept)
2.183e+00
TypeSaleFOR SALE BY OWNER
1.447e+02
TypeSaleFORECLOSURE
2.084e+03
TypeSaleHOUSE
2.859e+02
Beds
-1.299e+02
Size_sqft
1.445e-01
DaysListed
-2.441e-01
YearBuilt_1900
3.642e-01
TypeSaleFOR SALE BY OWNER:Beds
NA
TypeSaleFORECLOSURE:Beds
-1.706e+02
TypeSaleHOUSE:Beds
NA
TypeSaleFOR SALE BY OWNER:Size_sqft
NA
TypeSaleFORECLOSURE:Size_sqft
3.828e-01
TypeSaleHOUSE:Size_sqft
3.209e-02
TypeSaleFOR SALE BY OWNER:DaysListed
NA
TypeSaleFORECLOSURE:DaysListed
-1.066e+00
TypeSaleHOUSE:DaysListed
3.068e-01
TypeSaleFOR SALE BY OWNER:YearBuilt_1900
NA
TypeSaleFORECLOSURE:YearBuilt_1900
-3.516e+01
TypeSaleHOUSE:YearBuilt_1900
-5.218e+00
Beds:Size_sqft
-4.801e-04
Beds:DaysListed
-1.260e-01
Beds:YearBuilt_1900
2.629e+00
Size_sqft:DaysListed
2.270e-04
Size_sqft:YearBuilt_1900
-1.584e-03
DaysListed:YearBuilt_1900
-2.272e-03
```

```
library(car)
try(Anova(lm_full, type=3))
```

```
Error in Anova.III.lm(mod, error, singular.ok = singular.ok, ...) :
there are aliased coefficients in the model
```

```
## Note that this doesn't work because APARTMENTs only have 1 bed and 1 bath.
## There isn't a second level of bed or bath to estimate the interaction.
## Therefore, remove those two terms
<-
lm_full update(
lm_full~ . - TypeSale:Beds
, .
)library(car)
try(Anova(lm_full, type=3))
```

```
Error in Anova.III.lm(mod, error, singular.ok = singular.ok, ...) :
there are aliased coefficients in the model
```

```
## Uncomment this line when you're ready to assess the model assumptions
# plot diagnostics
#lm_diag_plots(lm_full)
# List the row numbers with id numbers
# The row numbers appear in the residual plots.
# The id number can be used to exclude values in code above.
%>% select(id) %>% print(n = Inf) dat_sub
```

```
# A tibble: 114 x 1
id
<int>
1 6
2 7
3 8
4 9
5 10
6 12
7 13
8 14
9 15
10 17
11 18
12 19
13 20
14 21
15 22
16 24
17 27
18 29
19 30
20 31
21 32
22 33
23 35
24 36
25 37
26 38
27 39
28 40
29 41
30 42
31 45
32 46
33 47
34 48
35 49
36 50
37 51
38 52
39 53
40 54
41 55
42 56
43 57
44 58
45 59
46 60
47 61
48 62
49 64
50 65
51 66
52 67
53 68
54 69
55 70
56 71
57 72
58 77
59 78
60 79
61 80
62 81
63 83
64 84
65 85
66 86
67 87
68 91
69 92
70 93
71 94
72 95
73 96
74 99
75 100
76 101
77 102
78 103
79 104
80 105
81 106
82 108
83 109
84 110
85 111
86 113
87 114
88 115
89 116
90 117
91 119
92 120
93 121
94 122
95 123
96 124
97 125
98 126
99 127
100 128
101 129
102 130
103 131
104 132
105 133
106 134
107 135
108 136
109 137
110 138
111 140
112 141
113 142
114 143
```

After Step 2, interpret the residual plots. What are the primary issues in the original model?

[answer]

Using `step(..., direction="both")`

with the BIC criterion, perform model selection.

```
## BIC
# option: test="F" includes additional information
# for parameter estimate tests that we're familiar with
# option: for BIC, include k=log(nrow( [data.frame name] ))
<-
lm_red_BIC step(
lm_fulldirection = "both"
, test = "F"
, trace = 0
, k = log(nrow(dat_sub))
,
)<- lm_red_BIC lm_final
```

```
## Uncomment this line when you're ready to assess the model assumptions
# plot diagnostics
#lm_diag_plots(lm_final)
```

Model assumptions appear to be reasonably met. A few influential observations exist.

If you arrived at the same model I did, then the code below will plot it. Eventually (after Step 7), the fitted model equations will describe the each dwelling `TypeSale`

and interpret the coefficients.

```
library(car)
#Anova(lm_final, type=3)
summary(lm_final)
```

```
Call:
lm(formula = PriceListK ~ TypeSale + Size_sqft + YearBuilt_1900 +
Size_sqft:YearBuilt_1900, data = dat_sub)
Residuals:
Min 1Q Median 3Q Max
-184.16 -41.71 -1.96 28.74 417.46
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.416e+02 9.358e+01 -3.650 0.000407 ***
TypeSaleFOR SALE BY OWNER 1.160e+02 7.686e+01 1.509 0.134125
TypeSaleFORECLOSURE 9.160e+00 3.669e+01 0.250 0.803342
TypeSaleHOUSE 1.045e+02 1.783e+01 5.862 5.13e-08 ***
Size_sqft 2.592e-01 3.855e-02 6.724 8.93e-10 ***
YearBuilt_1900 4.883e+00 1.630e+00 2.995 0.003409 **
Size_sqft:YearBuilt_1900 -2.750e-03 6.432e-04 -4.276 4.14e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 75.1 on 107 degrees of freedom
Multiple R-squared: 0.9377, Adjusted R-squared: 0.9342
F-statistic: 268.6 on 6 and 107 DF, p-value: < 2.2e-16
```

Fitted model equation is \[ \widehat{\log_{10}(\textrm{PriceList})} = -342 + 116 I(\textrm{TypeSale} = \textrm{HOUSE}) + 9.16 \log_{10}(\textrm{Size_sqft}) \]

After Step 7, return and intepret the model coefficients above.

[answer]

We now have enough information to see that a transformation of a predictor can be useful. See the curvature with `Size_sqft`

? This is one of the headaches of regression modelling, *everything depends on everything else* and you learn as you go. Return to the top and transform `Size_sqft`

and `LotSize`

.

A nice feature of this transformation is that the model interaction goes away. Our interpretation is now on the log scale, but it’s a simpler model.

Using the `predict()`

function, we’ll input the data we held out to predict earlier, and use our final model to predict the `PriceListK`

response. Note that `10^lm_pred`

is the table of values on the scale of “thousands of dollars”.

Interpret the predictions below the output.

How well do you expect this model to predict? Justify your answer.

```
# predict new observations, convert to data frame
<-
lm_pred as.data.frame(
predict(
lm_finalnewdata = dat_pred
, interval = "prediction"
,
)%>%
) mutate(
# add column of actual list prices
PriceListK = dat_pred$PriceListK_true
) lm_pred
```

```
fit lwr upr PriceListK
1 171.0553 21.13390 320.9766 186.9
2 213.6503 60.32045 366.9801 305.0
3 133.5463 -21.88382 288.9764 244.0
```

```
# on "thousands of dollars" scale
10^lm_pred
```

```
fit lwr upr PriceListK
1 1.135702e+171 1.361129e+21 Inf 7.943282e+186
2 4.469914e+213 2.091480e+60 Inf 1.000000e+305
3 3.517992e+133 1.306721e-22 9.471238e+288 1.000000e+244
```

```
# attributes of the three predicted observations
%>% print(n = Inf, width = Inf) dat_pred
```

```
# A tibble: 3 x 10
id TypeSale Beds Baths Size_sqft LotSize DaysListed YearBuilt_1900
<int> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 HOUSE 3 2 1305 6969 0 54
2 2 APARTMENT 1 1 2523 6098 0 48
3 3 APARTMENT 1 1 2816 6098 0 89
PriceListK PriceListK_true
<lgl> <dbl>
1 NA 187.
2 NA 305
3 NA 244
```

[answer]