1 NM County-level Poverty Data

In this example we’ll use NM county-level poverty data to understand how counties differ by living conditions, and how those living conditions vary together. We hope to reduce our 13-dimensional dataset to the vital few components that explain about 75% of the variability.

Here is a description of the codebook for this data.

NM county-level poverty data from S16 student:
Nathan Dobie, Student Technical Specialist, Bureau of Business Economic Research, UNM
Thanks, Nathan!

Data combined from:
http://bber.unm.edu/county-profiles                                        (poverty)
http://factfinder.census.gov/bkmk/table/1.0/en/ACS/14_5YR/DP04/0400000US35 (other values)
http://www2.census.gov/geo/docs/reference/codes/files/national_county.txt  (county names)

DATA COLUMNS:
 1 area
 2 county
 3 periodyear (2014)
   -Vacancy Status %
 4   Homeowner vacancy rate
 5   Rental vacancy rate
   -Occupancy Status %
 6   Owner-occupied
 7   Renter-occupied
   -Main source of heating (% of homes)
 8   Utility gas
 9   Electricity
10   Wood
11 Lacking complete plumbing facilities %
12 No telephone service available %
13 rentover35        (gross rent as a percentage of household income (grapi))
   -Poverty
14   est_percent     (Estimated percent of people of all ages in poverty)
15   child_percent   (Estimate of people age 0-17 in poverty)
16   fam_percent     (Estimated percent of related children age 5-17 in families in poverty)
fn.data <- "http://statacumen.com/teach/ADA2/homework/ADA2_HW_11_PCA_NMCensusPovertyHousingCharacteristics_DP04.csv"
nm.census <- read.csv(fn.data, header=TRUE, skip=1, as.is = TRUE)
# remove state average, use county-level
nm.census <- subset(nm.census, area != 0)

# Shorter column names
colnames(nm.census) <- c("Area", "County", "Year"
                       , "VacantH", "VacantR"
                       , "Owner", "Renter"
                       , "HeatG", "HeatE", "HeatW"
                       , "NoPlumb", "NoPhone", "Rent35"
                       , "PovAll", "PovChild", "PovFam")

str(nm.census)
'data.frame':   33 obs. of  16 variables:
 $ Area    : int  1 3 5 6 7 9 11 13 15 17 ...
 $ County  : chr  "Bernalillo" "Catron" "Chaves" "Cibola" ...
 $ Year    : int  2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
 $ VacantH : num  1.7 14.8 2.3 1.6 7.4 3.9 11.4 2.1 0.4 3.2 ...
 $ VacantR : num  6.9 7.5 7.8 6.8 20.4 7 8.5 7.4 7.5 8.1 ...
 $ Owner   : num  62.4 87.2 65.4 74.8 67.6 59.4 82.7 64.7 73.5 75.6 ...
 $ Renter  : num  37.6 12.8 34.6 25.2 32.4 40.6 17.3 35.3 26.5 24.4 ...
 $ HeatG   : num  81.7 3.8 50.1 49.1 50.2 47 46.6 70.4 53.5 51.4 ...
 $ HeatE   : num  13 2.8 42.6 10 15.4 46.4 19.1 15.6 38.7 18.6 ...
 $ HeatW   : num  2 51.2 1.9 21.7 13.7 1.1 9 1.6 1 10.6 ...
 $ NoPlumb : num  0.5 0.9 0.5 5.2 0.1 0.1 0 0.7 0.7 1.2 ...
 $ NoPhone : num  3 2.4 2.8 3.5 4.4 3.2 4.5 3.1 2.2 3 ...
 $ Rent35  : num  43.8 51.7 36.7 45.1 38 42 0 46.9 31.4 41.9 ...
 $ PovAll  : num  18.7 22.2 23.4 28.8 20.5 19.2 20.6 27.9 14.1 19.1 ...
 $ PovChild: num  24.5 42.8 32.4 37.6 30.6 27.3 32.1 39.4 18.5 27.8 ...
 $ PovFam  : num  22.6 40.1 28.7 35.9 27.2 26.7 31.6 36 17.3 25.3 ...
#head(nm.census, 3)

# columns to use for analysis
ind.col <- c(4:6, 8:14)

1.1 (2 p) Scatterplot matrix of variables of interest

# Scatterplot matrix
library(ggplot2)
library(GGally)
p <- ggpairs(subset(nm.census, select = ind.col))
print(p)

In the scatterplot matrix describe qualitatively what you see.

1.1.1 Solution

[answer]

1.2 (1 p) Remove the one most extreme county

One coutry is seriously “rustic”. Remove that one observation at the end of the data section above, then rerun the analysis.

1.2.1 Solution

[answer]

1.3 (0 p) PCA using correlation matrix

The PCA output below will be used for the rest of the assignment.

nm.census.pca <- princomp(nm.census[, ind.col], cor = TRUE)
summary(nm.census.pca)
Importance of components:
                          Comp.1    Comp.2    Comp.3    Comp.4     Comp.5
Standard deviation     1.7919163 1.3686422 1.1971798 1.0956960 0.89319298
Proportion of Variance 0.3210964 0.1873182 0.1433239 0.1200550 0.07977937
Cumulative Proportion  0.3210964 0.5084145 0.6517385 0.7717935 0.85157282
                           Comp.6     Comp.7     Comp.8     Comp.9
Standard deviation     0.76255653 0.65427686 0.55243900 0.34242633
Proportion of Variance 0.05814925 0.04280782 0.03051889 0.01172558
Cumulative Proportion  0.90972207 0.95252989 0.98304877 0.99477435
                           Comp.10
Standard deviation     0.228596735
Proportion of Variance 0.005225647
Cumulative Proportion  1.000000000
print(loadings(nm.census.pca), cutoff = 0.2) # to show all values

Loadings:
        Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
VacantH        -0.432        -0.510 -0.275  0.393 -0.350        -0.265
VacantR               -0.539        -0.553 -0.578                     
Owner   -0.286 -0.450                0.474         0.513  0.268 -0.350
HeatG    0.343        -0.393  0.469         0.384               -0.424
HeatE    0.315  0.317        -0.515        -0.363         0.412 -0.345
HeatW   -0.504         0.204                      -0.363              
NoPlumb -0.352  0.376 -0.252  0.212               -0.299  0.635       
NoPhone -0.350  0.437 -0.220         0.216               -0.546 -0.510
Rent35  -0.202         0.593  0.395 -0.412                      -0.444
PovAll  -0.315  0.372               -0.357  0.435  0.602              
        Comp.10
VacantH  0.258 
VacantR        
Owner          
HeatG   -0.372 
HeatE   -0.305 
HeatW   -0.730 
NoPlumb  0.291 
NoPhone        
Rent35   0.226 
PovAll         

               Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
SS loadings       1.0    1.0    1.0    1.0    1.0    1.0    1.0    1.0
Proportion Var    0.1    0.1    0.1    0.1    0.1    0.1    0.1    0.1
Cumulative Var    0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8
               Comp.9 Comp.10
SS loadings       1.0     1.0
Proportion Var    0.1     0.1
Cumulative Var    0.9     1.0
par(mfrow=c(1,2))
screeplot(nm.census.pca)
biplot(nm.census.pca)

1.4 (1 p) Dimension reduction

How many principal components would you retain to explain about 3/4ths of the total variability? How much variability is actually retained?

(To answer this question, I would like you to embed R notation in the sentence you write to print the variability proportion. For example, the proportion for the first two components is 0.5084, which is calculated by squaring the standard deviations of each component to get the variances, adding the first two variances, then dividing by the sum of all the variances. That ratio give the proportion of variance for the first two components.)

1.4.1 Solution

[answer]

1.5 (3 p) PC interpretations

Interpret the number of principal components you retained in the previous step.

Here’s an interpretation of the first component.

PC1 explains 32.1% of the total variation.

As PC1 increases, households heating with gas and electric increase, while poverty, rent greater than 35% of income, propotion with no phone or plumbing, heating with wood, owner-occupied dwellings, and vacant homes decrease.

I’m surprised to learn that heating with wood and poverty are related, everything else makes sense. This seems to be the primary poverty component.

Interpret the rest below.

1.5.1 Solution

[answer]

PC2 explains 18.7% of the total variation.

As PC2 increases,

PC3 explains 14.3% of the total variation.

As PC3 increases,

etc.

1.6 (3 p) Visualizing and interpretting PCs

Here are two-dimensional plots of PC1 against PC2, PC1 against PC3, and PC2 against PC3. The points and labels are colored by poverty proportion.

library(ggplot2)
p1 <- ggplot(as.data.frame(nm.census.pca$scores), aes(x = Comp.1, y = Comp.2, colour = nm.census$PovAll))
p1 <- p1 + scale_colour_gradientn(colours=c("red", "blue"))
p1 <- p1 + geom_text(aes(label = nm.census$County), vjust = -0.5, alpha = 0.25)
p1 <- p1 + geom_point(size = 3)
p1 <- p1 + theme(legend.position="bottom")
p2 <- ggplot(as.data.frame(nm.census.pca$scores), aes(x = Comp.1, y = Comp.3, colour = nm.census$PovAll))
p2 <- p2 + scale_colour_gradientn(colours=c("red", "blue"))
p2 <- p2 + geom_text(aes(label = nm.census$County), vjust = -0.5, alpha = 0.25)
p2 <- p2 + geom_point(size = 3)
p2 <- p2 + theme(legend.position="none")
p3 <- ggplot(as.data.frame(nm.census.pca$scores), aes(x = Comp.2, y = Comp.3, colour = nm.census$PovAll))
p3 <- p3 + scale_colour_gradientn(colours=c("red", "blue"))
p3 <- p3 + geom_text(aes(label = nm.census$County), vjust = -0.5, alpha = 0.25)
p3 <- p3 + geom_point(size = 3)
p3 <- p3 + theme(legend.position="none")

print(p1)

library(gridExtra)
grid.arrange(grobs = list(p2, p3), nrow=1, top = "Scatterplots of first three PCs")