This is a challenging dataset, in part because it’s real and messy. I will guide you through a simplified but sensible analysis; other models are possible.

Note that I needed to set cache=FALSE to ensure all output was updated.

1 ANCOVA model: Albuquerque NM 87108, House and Apartment listing prices

Prof Erhardt constructed a dataset of listing prices for dwellings (homes and apartments) for sale from Zillow.com on Feb 26, 2016 at 1 PM for Albuquerque NM 87108. In this assignment we’ll develop a model to help understand which qualities contribute to a typical dwelling’s listing price. We will then also predict the listing prices of new listings posted by 2 PM the following day, Feb 27, 2016.

Because we want to model a typical dwelling, it is completely reasonable to remove “unusual” dwellings from the dataset. Dwelling prices have a long-tailed distribution!

1.1 Unusual assignment, not top-down, but up-down-up-down

This is an unusual assignment because the workflow of this assignment isn’t top-down; instead, you’ll be scrolling up and down as you make decisions about the data and model you’re fitting. Yes, I have much of the code worked out for you. However, there are data decisions to make early in the code (such as excluding observations, transforming variables, etc.) that depend on the analysis (model checking) later. Think of it as a “choose your own adventure” that I’ve written for you.

1.1.1 Keep a record of your decisions

It is always desirable to make your work reproducible, either by someone else or by your future self. For each step you take, keep a diary of (a) what the next minor goal is, (b) what evidence/information you have, (c) what decision you make, and (d) what the outcome was.

For example, here are the first couple of steps of your diary:

  1. Include only “typical dwellings”. Based on scatterplot, remove extreme observations. Keep only HOUSE and APARTMENT.
  2. Exclude a few variables to reduce multicollinearity between predictor variables. Exclude Baths and LotSize.
  3. etc.

1.2 (1 p) (Step 1) Restrict data to “typical” dwellings

Step 1: After looking at the scatterplot below, identify what you consider to be a “typical dwelling” and exclude observations far from that range. For example, there are only a couple of TypeSale levels that are common enough to model; remember to run factor() again to drop factor levels that no longer appear.

library(tidyverse)

# First, download the data to your computer,
#   save in the same folder as this Rmd file.

# read the data, skip the first two comment lines of the data file
dat_abq <-
  read_csv("ADA2_HW_14_HomePricesZillow_Abq87108.csv", skip=2) %>%
  mutate(
    id = 1:n()
  , TypeSale = factor(TypeSale)
    # To help scale the intercept to a more reasonable value.
    #   Scaling the x-variables is sometimes done, e.g., centering each x at its mean.
    # center year at 1900 (negative values are older, -10 is built in 1890)
  , YearBuilt_1900 = YearBuilt - 1900
  ) %>%
  select(
    id, everything()
    , -Address, -YearBuilt
  )

head(dat_abq)
# A tibble: 6 x 9
     id TypeSale  PriceList  Beds Baths Size_sqft LotSize DaysListed YearBuilt_1900
  <int> <fct>         <dbl> <dbl> <dbl>     <dbl>   <dbl>      <dbl>          <dbl>
1     1 HOUSE        186900     3     2      1305    6969          0             54
2     2 APARTMENT    305000     1     1      2523    6098          0             48
3     3 APARTMENT    244000     1     1      2816    6098          0             89
4     4 CONDO        108000     3     2      1137      NA          0             96
5     5 CONDO         64900     2     1      1000      NA          1             85
6     6 HOUSE        275000     3     3      2022    6098          1             52
## RETURN HERE TO SUBSET THE DATA

dat_abq <-
  dat_abq %>%
  filter(
    TRUE    # (X <= z)  # keep observations where variable X <= value z
  )
# note, if you remove a level from a categorical variable, then run factor() again
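As a concrete illustration of that note, here is a hedged sketch of the kind of subset the diary describes (keeping only HOUSE and APARTMENT), run on a toy stand-in for dat_abq so it doesn’t clobber your real data; adapt the filter to your own choices:

```r
library(dplyr)

# Toy stand-in for dat_abq (hypothetical values):
dat_toy <- data.frame(
  id       = 1:4,
  TypeSale = factor(c("HOUSE", "APARTMENT", "CONDO", "HOUSE"))
)

# Keep only the common sale types, then re-run factor() to drop empty levels:
dat_toy <-
  dat_toy %>%
  filter(TypeSale %in% c("HOUSE", "APARTMENT")) %>%
  mutate(TypeSale = factor(TypeSale))

levels(dat_toy$TypeSale)  # the unused CONDO level is gone
```

Without the final factor() call, the empty CONDO level would linger and produce empty groups in plots and model summaries.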

  # SOLUTION
  # these deletions are based only on the scatter plot in order to have
  #  "typical" dwellings




str(dat_abq)
Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame':    143 obs. of  9 variables:
 $ id            : int  1 2 3 4 5 6 7 8 9 10 ...
 $ TypeSale      : Factor w/ 5 levels "APARTMENT","CONDO",..: 5 1 1 2 2 5 5 4 5 5 ...
 $ PriceList     : num  186900 305000 244000 108000 64900 ...
 $ Beds          : num  3 1 1 3 2 3 2 2 3 3 ...
 $ Baths         : num  2 1 1 2 1 3 1 2 2 2 ...
 $ Size_sqft     : num  1305 2523 2816 1137 1000 ...
 $ LotSize       : num  6969 6098 6098 NA NA ...
 $ DaysListed    : num  0 0 0 0 1 1 1 1 1 2 ...
 $ YearBuilt_1900: num  54 48 89 96 85 52 52 65 58 52 ...

1.3 (1 p) (Step 3) Transform response, if necessary.

Step 3: Does the response variable require a transformation? If so, what transformation is recommended from the model diagnostic plots (Box-Cox)?

1.3.1 Solution

[answer]
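One way to carry out the Box-Cox check is sketched below on toy data (the lm() formula is a placeholder; in the assignment, fit your own model to dat_abq). MASS::boxcox() is called with the `::` prefix so that MASS does not mask dplyr::select.

```r
# Sketch of a Box-Cox diagnostic (toy data; substitute your own lm fit).
set.seed(1)
dat_toy <- data.frame(Size_sqft = runif(50, 800, 3000))
# Simulated response with a log-linear relationship (hypothetical):
dat_toy$PriceListK <- exp(0.001 * dat_toy$Size_sqft + rnorm(50, 0, 0.2))

fit <- lm(PriceListK ~ Size_sqft, data = dat_toy)
bc  <- MASS::boxcox(fit, plotit = FALSE)   # profile log-likelihood over lambda
lambda_hat <- bc$x[which.max(bc$y)]
lambda_hat  # near 0 suggests log(y); near 0.5, sqrt(y); near 1, no transform
```

In practice you would look at the plotted profile (plotit = TRUE) and its confidence interval, and choose a simple, interpretable transformation whose lambda falls inside it.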

dat_abq <-
  dat_abq %>%
  mutate(
    # Price in units of $1000
    PriceListK = PriceList / 1000

  ) %>%
  select(
    -PriceList
  )

str(dat_abq)
Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame':    143 obs. of  9 variables:
 $ id            : int  1 2 3 4 5 6 7 8 9 10 ...
 $ TypeSale      : Factor w/ 5 levels "APARTMENT","CONDO",..: 5 1 1 2 2 5 5 4 5 5 ...
 $ Beds          : num  3 1 1 3 2 3 2 2 3 3 ...
 $ Baths         : num  2 1 1 2 1 3 1 2 2 2 ...
 $ Size_sqft     : num  1305 2523 2816 1137 1000 ...
 $ LotSize       : num  6969 6098 6098 NA NA ...
 $ DaysListed    : num  0 0 0 0 1 1 1 1 1 2 ...
 $ YearBuilt_1900: num  54 48 89 96 85 52 52 65 58 52 ...
 $ PriceListK    : num  186.9 305 244 108 64.9 ...

1.4 (1 p) (Step 4) Remove extremely influential observations.

Step 4: The goal is to develop a model that will work well for the typical dwellings. If an observation is highly influential, then it’s unusual.

## Remove influential observation
#dat_abq <- dat_abq[-which(row.names(dat_abq) %in% c(...)),]
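One standard way to find candidates for removal is Cook’s distance; the sketch below uses toy data with a planted outlier (the 4/n cutoff is a common rule of thumb, not a hard rule), and `fit` stands in for whatever model you currently have:

```r
# Sketch: flag influential observations with Cook's distance (toy data).
set.seed(2)
dat_toy <- data.frame(x = 1:20, y = 2 * (1:20) + rnorm(20))
dat_toy$y[20] <- 100  # plant one wild observation
fit <- lm(y ~ x, data = dat_toy)

cooks_d <- cooks.distance(fit)
flagged <- which(cooks_d > 4 / nrow(dat_toy))  # 4/n rule of thumb
flagged  # the planted observation shows up here
```

With the id column built into dat_abq above, flagged rows can then be dropped with something like dat_abq %>% filter(!(id %in% c(...))), refitting and rechecking afterward, since removing one point changes the influence of the rest.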

  # SOLUTION

1.5 Subset data for model building and prediction

Create a subset of the data for building the model, and another subset for prediction later on.

# remove observations with NAs
dat_abq <-
  dat_abq %>%
  na.omit()

# the data subset we will use to build our model
dat_sub <-
  dat_abq %>%
  filter(
    DaysListed > 0
  )

# the data subset we will predict from our model
dat_pred <-
  dat_abq %>%
  filter(
    DaysListed == 0
  ) %>%
  mutate(
    # the prices we hope to predict closely from our model
    PriceListK_true = PriceListK
    # set them to NA to predict them later
  , PriceListK = NA
  )

Scatterplot of the model-building subset.

# NOTE: this plot takes a long time if you're repeatedly recompiling the document.
# Comment out the "print(p)" line to save time when you're not evaluating this plot.
library(GGally)
library(ggplot2)
p <- ggpairs(dat_sub
            , mapping = ggplot2::aes(colour = TypeSale, alpha = 0.5)
            , lower = list(continuous = "points")
            , upper = list(continuous = "cor")
            )
print(p)