NM County-level Poverty Data
In this example we’ll use NM county-level poverty data to understand how counties differ by living conditions, and how those living conditions vary together. We hope to reduce our 13-dimensional dataset to the vital few components that explain about 75% of the variability.
Maps: Labels are easier to read on the left, but road features on right make the counties easier to place.

Here is a description of the codebook for this data.
NM county-level poverty data from S16 student:
Nathan Dobie, Student Technical Specialist, Bureau of Business Economic Research, UNM
Thanks, Nathan!
Data combined from:
http://bber.unm.edu/county-profiles (poverty)
http://factfinder.census.gov/bkmk/table/1.0/en/ACS/14_5YR/DP04/0400000US35 (other values)
http://www2.census.gov/geo/docs/reference/codes/files/national_county.txt (county names)
DATA COLUMNS:
1 area
2 county
3 periodyear (2014)
-Vacancy Status %
4 Homeowner vacancy rate
5 Rental vacancy rate
-Occupancy Status %
6 Owner-occupied
7 Renter-occupied
-Main source of heating (% of homes)
8 Utility gas
9 Electricity
10 Wood
11 Lacking complete plumbing facilities %
12 No telephone service available %
13 rentover35 (gross rent as a percentage of household income (grapi))
-Poverty
14 est_percent (Estimated percent of people of all ages in poverty)
15 child_percent (Estimate of people age 0-17 in poverty)
16 fam_percent (Estimated percent of related children age 5-17 in families in poverty)
── Attaching packages ─────────────────────────────────────── erikmisc 0.1.20 ──
✔ tibble 3.1.8 ✔ dplyr 1.1.0
── Conflicts ─────────────────────────────────────────── erikmisc_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
erikmisc, solving common complex data analysis workflows
by Dr. Erik Barry Erhardt <erik@StatAcumen.com>
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats 1.0.0 ✔ readr 2.1.4
✔ ggplot2 3.4.1 ✔ stringr 1.5.0
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# First, download the data to your computer,
# save in the same folder as this Rmd file.
# read the data
dat_nmcensus <-
read_csv(
"ADA2_CL_20_PCA_NMCensusPovertyHousingCharacteristics_DP04.csv"
, skip = 1
) %>%
rename(
# Shorter column names
"Area" = "area"
, "County" = "county"
, "Year" = "periodyear"
, "VacantH" = "Homeowner vacancy rate"
, "VacantR" = "Rental vacancy rate"
, "Owner" = "Owner-occupied"
, "Renter" = "Renter-occupied"
, "HeatG" = "Utility gas"
, "HeatE" = "Electricity"
, "HeatW" = "Wood"
, "NoPlumb" = "Lacking complete plumbing facilities"
, "NoPhone" = "No telephone service available"
, "Rent35" = "rentover35"
, "PovAll" = "est_percent"
, "PovChild" = "child_percent"
, "PovFam" = "fam_percent"
) %>%
filter(
# remove state average, use county-level
Area != 0
)
Rows: 34 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): county
dbl (15): area, periodyear, Homeowner vacancy rate, Rental vacancy rate, Own...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# remove column attributes from read_csv()
attr(dat_nmcensus, "spec") <- NULL
# columns to use for analysis,
use_col_ind <- c(4:6, 8:14)
use_col_names <- names(dat_nmcensus)[use_col_ind]
use_col_names
[1] "VacantH" "VacantR" "Owner" "HeatG" "HeatE" "HeatW" "NoPlumb"
[8] "NoPhone" "Rent35" "PovAll"
spc_tbl_ [33 × 16] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ Area : num [1:33] 1 3 5 6 7 9 11 13 15 17 ...
$ County : chr [1:33] "Bernalillo" "Catron" "Chaves" "Cibola" ...
$ Year : num [1:33] 2014 2014 2014 2014 2014 ...
$ VacantH : num [1:33] 1.7 14.8 2.3 1.6 7.4 3.9 11.4 2.1 0.4 3.2 ...
$ VacantR : num [1:33] 6.9 7.5 7.8 6.8 20.4 7 8.5 7.4 7.5 8.1 ...
$ Owner : num [1:33] 62.4 87.2 65.4 74.8 67.6 59.4 82.7 64.7 73.5 75.6 ...
$ Renter : num [1:33] 37.6 12.8 34.6 25.2 32.4 40.6 17.3 35.3 26.5 24.4 ...
$ HeatG : num [1:33] 81.7 3.8 50.1 49.1 50.2 47 46.6 70.4 53.5 51.4 ...
$ HeatE : num [1:33] 13 2.8 42.6 10 15.4 46.4 19.1 15.6 38.7 18.6 ...
$ HeatW : num [1:33] 2 51.2 1.9 21.7 13.7 1.1 9 1.6 1 10.6 ...
$ NoPlumb : num [1:33] 0.5 0.9 0.5 5.2 0.1 0.1 0 0.7 0.7 1.2 ...
$ NoPhone : num [1:33] 3 2.4 2.8 3.5 4.4 3.2 4.5 3.1 2.2 3 ...
$ Rent35 : num [1:33] 43.8 51.7 36.7 45.1 38 42 0 46.9 31.4 41.9 ...
$ PovAll : num [1:33] 18.7 22.2 23.4 28.8 20.5 19.2 20.6 27.9 14.1 19.1 ...
$ PovChild: num [1:33] 24.5 42.8 32.4 37.6 30.6 27.3 32.1 39.4 18.5 27.8 ...
$ PovFam : num [1:33] 22.6 40.1 28.7 35.9 27.2 26.7 31.6 36 17.3 25.3 ...
- attr(*, "problems")=<externalptr>
Place your code to subset, filter, or transform variables in this code chunk below.
dat_nmcensus <-
dat_nmcensus %>%
filter(
TRUE
)
(2 p) Scatterplot matrix of variables of interest
# Scatterplot matrix
library(ggplot2)
library(GGally)
Registered S3 method overwritten by 'GGally':
method from
+.gg ggplot2
p <-
ggpairs(
dat_nmcensus %>% select(use_col_names)
)
Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
ℹ Please use `all_of()` or `any_of()` instead.
# Was:
data %>% select(use_col_names)
# Now:
data %>% select(all_of(use_col_names))
See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
In the scatterplot matrix describe qualitatively what you see.
(1 p) Remove the one most extreme county
One coutry is seriously “rustic”. Remove that one observation at the end of the data section above, then rerun the analysis.
(0 p) PCA using correlation matrix
The PCA output below will be used for the rest of the assignment.
pca_nmcensus <-
princomp(
dat_nmcensus[, use_col_ind]
, cor = TRUE
)
summary(pca_nmcensus)
Importance of components:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
Standard deviation 1.7919163 1.3686422 1.1971798 1.0956960 0.89319298
Proportion of Variance 0.3210964 0.1873182 0.1433239 0.1200550 0.07977937
Cumulative Proportion 0.3210964 0.5084145 0.6517385 0.7717935 0.85157282
Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
Standard deviation 0.76255653 0.65427686 0.55243900 0.34242633 0.228596735
Proportion of Variance 0.05814925 0.04280782 0.03051889 0.01172558 0.005225647
Cumulative Proportion 0.90972207 0.95252989 0.98304877 0.99477435 1.000000000
pca_nmcensus %>% loadings() %>% print(cutoff = 0.2) # cutoff = 0 to show all values
Loadings:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
VacantH 0.432 0.510 0.275 0.393 0.350 0.265 0.258
VacantR 0.539 0.553 -0.578
Owner 0.286 0.450 -0.474 -0.513 0.268 0.350
HeatG -0.343 0.393 -0.469 0.384 0.424 -0.372
HeatE -0.315 -0.317 0.515 -0.363 0.412 0.345 -0.305
HeatW 0.504 -0.204 0.363 -0.730
NoPlumb 0.352 -0.376 0.252 -0.212 0.299 0.635 0.291
NoPhone 0.350 -0.437 0.220 -0.216 -0.546 0.510
Rent35 0.202 -0.593 -0.395 0.412 0.444 0.226
PovAll 0.315 -0.372 0.357 0.435 -0.602
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
SS loadings 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
Proportion Var 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
Cumulative Var 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
Comp.10
SS loadings 1.0
Proportion Var 0.1
Cumulative Var 1.0
par(mfrow=c(1,2))
screeplot(pca_nmcensus)
biplot(pca_nmcensus)
(1 p) Dimension reduction
How many principal components would you retain to explain about 3/4ths of the total variability? How much variability is actually retained?
To answer this question, I would like you to embed R notation in the sentence you write to print the variability proportion. For example:
- Retaining 2 components explains 0.5084 of the total variability.
This is calculated by squaring the standard deviations of each component to get the variances, adding the first two variances (the indices [1:2]
), then dividing by the sum of all the variances. That ratio give the proportion of variance for the first two components.
(3 p) PC interpretations
Interpret the number of principal components you retained in the previous step.
Here’s an interpretation of the first component.
PC1 explains 32.1% of the total variation.
As PC1 increases, households heating with gas and electric increase, while poverty, rent greater than 35% of income, propotion with no phone or plumbing, heating with wood, owner-occupied dwellings, and vacant homes decrease.
I’m surprised to learn that heating with wood and poverty are related, everything else makes sense. This seems to be the primary poverty component.
Interpret the rest below.
Solution
[answer]
PC2 explains 18.7% of the total variation.
As PC2 increases,
PC3 explains 14.3% of the total variation.
As PC3 increases,
etc.
(3 p) Visualizing and interpretting PCs
Here are two-dimensional plots of PC1 against PC2, PC1 against PC3, and PC2 against PC3. The points and labels are colored by poverty proportion.
library(ggplot2)
p1 <- ggplot(as.data.frame(pca_nmcensus$scores), aes(x = Comp.1, y = Comp.2, colour = dat_nmcensus$PovAll))
p1 <- p1 + scale_colour_gradientn(colours=c("red", "blue"))
p1 <- p1 + geom_text(aes(label = dat_nmcensus$County), vjust = -0.5, alpha = 0.25)
p1 <- p1 + geom_point(size = 3)
p1 <- p1 + theme(legend.position="bottom")
p2 <- ggplot(as.data.frame(pca_nmcensus$scores), aes(x = Comp.1, y = Comp.3, colour = dat_nmcensus$PovAll))
p2 <- p2 + scale_colour_gradientn(colours=c("red", "blue"))
p2 <- p2 + geom_text(aes(label = dat_nmcensus$County), vjust = -0.5, alpha = 0.25)
p2 <- p2 + geom_point(size = 3)
p2 <- p2 + theme(legend.position="none")
p3 <- ggplot(as.data.frame(pca_nmcensus$scores), aes(x = Comp.2, y = Comp.3, colour = dat_nmcensus$PovAll))
p3 <- p3 + scale_colour_gradientn(colours=c("red", "blue"))
p3 <- p3 + geom_text(aes(label = dat_nmcensus$County), vjust = -0.5, alpha = 0.25)
p3 <- p3 + geom_point(size = 3)
p3 <- p3 + theme(legend.position="none")
print(p1)
Attaching package: 'gridExtra'
The following object is masked from 'package:dplyr':
combine
grid.arrange(grobs = list(p2, p3), nrow=1, top = "Scatterplots of first three PCs")
#### For a rotatable 3D plot, use plot3d() from the rgl library
# ## This uses the R version of the OpenGL (Open Graphics Library)
# library(rgl)
# plot3d(x = pca_nmcensus$scores[,"Comp.1"]
# , y = pca_nmcensus$scores[,"Comp.2"]
# , z = pca_nmcensus$scores[,"Comp.3"])
Using your interpretations of PC1 and PC2 above, describe these three counties: Bernalillo, Mora, and Roosevelt.
As an example, here’s a description for Los Alamos.
Los Alamos has large PC1 and large PC2, this indicates (both) there is very low poverty, (PC1) dwellings heat with gas and electric, and (PC2) there tends to be high dwelling vacancy.
The characteristics roughly match that description:
dat_nmcensus %>% filter(County == "Los Alamos") %>% print(n = Inf, width = Inf)
# A tibble: 1 × 16
Area County Year VacantH VacantR Owner Renter HeatG HeatE HeatW NoPlumb
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 28 Los Alamos 2014 0.9 14.5 75 25 88 8.3 2.5 0
NoPhone Rent35 PovAll PovChild PovFam
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1.9 22.6 4.2 4.6 3.8
Solution
[answer]
Bernalillo has XXX PC1 and XXX PC2, this indicates (both) (PC1) (PC2)
The characteristics (do/do not) match that description:
# check after you describe it using the PCs
#dat_nmcensus %>% filter(County == "Bernalillo") %>% print(n = Inf, width = Inf)
Mora has XXX PC1 and XXX PC2, this indicates (both) (PC1) (PC2)
The characteristics (do/do not) match that description:
# check after you describe it using the PCs
#dat_nmcensus %>% filter(County == "Mora") %>% print(n = Inf, width = Inf)
Roosevelt has XXX PC1 and XXX PC2, this indicates (both) (PC1) (PC2)
The characteristics (do/do not) match that description:
# check after you describe it using the PCs
#dat_nmcensus %>% filter(County == "Roosevelt") %>% print(n = Inf, width = Inf)