# ADA2: Class 25, Chs 13+11+17, PCA and Logistic Regression for Classification

Advanced Data Analysis 2, Stat 428/528, Spring 2024, Prof. Erik Erhardt, UNM

Author

Published

December 20, 2023

• This assignment looks longer than it is.
• Most of the assignment includes tasks that you have done in previous assignments.
• I’ve written most of the challenging code for you. When I’ve asked you to write code, I’ve given you an example to follow.
• Later code chunks depend on computations I’d like you to do, so those have been commented out by moving the {R} inside the first line of the code chunk. Simply move the {R} back after the first triple-tick to let those compute.
• Overview
• Goal: Develop a logistic regression model to predict the source of fish based on PCA components of the elemental isotopes and compare that with a QDA model.
• I clean and transform the data for you.
• I break the data into a “training” and “test” set so that we can develop (train) models and then see how they perform (test).
• We do PCA for each element (Ba, Ca, Mg, and Sr) by combining the isotopes into a single principal component, each explaining more than 99% of the variability of the set of isotopes.
• I do one, you have to do the other three. Copy/paste, then rename the variables.
• Logistic regression on the training data
• I do an example main-effects model that doesn’t fit well.
• Put in two-way interactions and quadratic terms and it will fit well, even after backward selection.
• The classification metrics on the training set are pretty good (Sensitivity and Specificity).
• Classification on the test data
• I do the hard part of transforming the test data into the same PCA space as the training data. (This is handy code if you find you ever need to do this.)
• I have all the code to do the classification using both Logistic regression and QDA. You only need to interpret the results of the classification.
• Finally, consider how well the models worked and what could be done to improve them.
• You don’t need to do additional improvements, just brainstorm a little.

# San Juan River Razorback Suckers

Peer mentor (Spring 2016) Adam Barkalow’s wife uses stable isotope ratios to analyze fish.

Razorback Suckers were collected in 2014 on the San Juan River. Elemental isotopic ratios from finrays were analyzed for Ba (Barium 56), Ca (Calcium 20), Mg (Magnesium 12), and Sr (Strontium 38). Finrays are non-lethally obtained and are used to detect natal origin since the material in the reys are partially developed early in life.

One issue is that hatchery fish can get into the river and lose their tags. It is important for environmental resource managers to know whether untagged fish are wild or hatchery fish. However, this is actually quite easy to determine with the isotope data that we have; I’d like you to solve a harder problem.

Another issue is distinguishing between the two Wild fish, NAP from Ponds and SJR from the River.

There are five fish sources in the dataset.

4 known sources, 1 mix of unknown sources

Hatchery
DEX = Dexter National Fish Hatchery
GJH = Ouray  National Fish Hatchery, Grand Valley Unit

Wild
NAP = NAPI ponds
SJR = San Juan River

Unknown
UNK = untagged Razorback Suckers captured in the San Juan River
these could be from any of the above sources

Our goal is to classify the Wild fish into their location sources of Pond or River. First, we will use PCA to reduce the sets of highly correlated isotopes of the same element to a single PC feature for each element. Then, using the binary response of “Pond” versus “River”, and we’ll fit a logistic regression and do model selection. Finally, we’ll assess the expected classification accuracy using logistic regression for classification, and predict some observations I excluded from the model building.

## Clean and transform data

Looking at the scatterplot matrix below, clean and/or transform the data if you think it will be helpful. Note that measurement error can be an issue in complicated biological measurements. Furthermore, a transformation might help separate observations that are tightly grouped in space.

library(erikmisc)
Registered S3 methods overwritten by 'ggpp':
method                  from
heightDetails.titleGrob ggplot2
widthDetails.titleGrob  ggplot2
Registered S3 method overwritten by 'ggpmisc':
method                  from
as.character.polynomial polynom
Warning: replacing previous import 'ggplot2::annotate' by 'ggpp::annotate' when
loading 'erikmisc'
── Attaching packages ─────────────────────────────────────── erikmisc 0.2.12 ──
✔ tibble 3.2.1     ✔ dplyr  1.1.4
── Conflicts ─────────────────────────────────────────── erikmisc_conflicts() ──
✖ dplyr::lag()    masks stats::lag()
erikmisc, solving common complex data analysis workflows
by Dr. Erik Barry Erhardt <erik@StatAcumen.com>
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ readr     2.1.5
✔ ggplot2   3.4.4     ✔ stringr   1.5.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# First, download the data to your computer,
#   save in the same folder as this Rmd file.

dat_sjrs <-
) %>%
# the last set of observations only have a few isotopes, so exclude
na.omit() %>%
# select the columns to use for analysis
select(
Source, Ba137:Sr88
) %>%
# include only the Wild groups
filter(
Source %in% c("NAP", "SJR")
## There are a few unual observations, remove those assuming measurement errors
# remove two small Ca43 values
, Ca43 > 0.5
) %>%
mutate(
# change to character so we can easily change the labels
Source = Source %>% as.character()
# Simplify Source to be "Pond" and "River"
, Source =
case_when(
Source == "NAP" ~ "Pond"
, Source == "SJR" ~ "River"
)
# refactor with new labels
, Source = factor(Source)
# transforming the Ba values separates the tight clustering on the boundary
, Ba137 = log10(Ba137)
, Ba138 = log10(Ba138)
)
Rows: 1512 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): Station, Source, Type
dbl (10): Sort Key, Ba137, Ba138, Ca43, Mg24, Mg25, Mg26, Sr86, Sr87, Sr88

ℹ Use spec() to retrieve the full column specification for this data.
ℹ Specify the column types or set show_col_types = FALSE to quiet this message.
names(dat_sjrs)
 [1] "Source" "Ba137"  "Ba138"  "Ca43"   "Mg24"   "Mg25"   "Mg26"   "Sr86"
[9] "Sr87"   "Sr88"  
dat_sjrs %>% dim()
[1] 375  10
## NOTE HERE
## Subset for classification later
# Start random number generator in same place for everyone
#   and so that random partitions are the same each time code is run
set.seed(3)
# sample a subset of observation indices to predict
ind_pred <-
sample.int(
nrow(dat_sjrs)
, size = 100
) %>%
sort()
ind_pred
  [1]   4   9  10  12  15  16  19  22  28  29  33  36  37  40  47  48  50  62
[19]  65  66  68  70  71  73  75  88  91  99 101 102 104 105 108 114 117 123
[37] 127 128 131 136 137 138 139 140 145 161 164 165 166 168 171 176 182 183
[55] 185 186 192 195 197 198 204 206 218 225 233 237 241 245 247 256 258 261
[73] 262 265 266 274 275 276 293 296 297 302 309 317 330 333 334 341 343 344
[91] 347 350 363 368 370 371 372 373 374 375
# prediction subset
dat_sjrs_pred <-
dat_sjrs %>%
slice(
ind_pred
)
# remove observations to predict from data to develop the model
dat_sjrs <-
dat_sjrs %>%
slice(
-ind_pred
)

# data sizes
dat_sjrs      %>% dim()
[1] 275  10
dat_sjrs_pred %>% dim()
[1] 100  10

## Known fish scatterplot

Note that this plot can take a while to generate. You’re welcome to subset the data further for this plot if some of the variables are redundant (highly correlated). You could probably get away with 5 columns of data without any loss of interpretation. If you want to do this, replace the dat_sjrs in the ggpairs() function with dat_sjrs %>% select(col1, col2, ...) and specify the columns to plot.

{R, fig.height = 8, fig.width = 8}
# Scatterplot matrix
library(ggplot2)
library(GGally)
p <-
ggpairs(
dat_sjrs
, mapping = ggplot2::aes(colour = Source, alpha = 0.5)
, upper = list(continuous = "density", combo = "box")
, lower = list(continuous = "points", combo = "dot")
#, lower = list(continuous = "cor")
, title = "Original data by source"
)
print(p)

# Principal Components Analysis (PCA)

In this section, we’ll reduce the number of variables in the dataset by using PCA to generate new features which are linear combinations of selected variables. In this way, we can greatly reduce the dimension of the problem while retaining most of the information in the data.

## (2 p) PCA of selected sets of variables

I expect that you’ll have four features at the end of this part.

I’ll do the first one as an example and you can do the rest.

Ba variables:

pca_Ba <-
princomp(
~ Ba137 + Ba138
, data = dat_sjrs
, cor  = FALSE
)
pca_Ba %>% summary()
Importance of components:
Comp.1      Comp.2
Standard deviation     0.2064719 0.009226185
Proportion of Variance 0.9980072 0.001992764
Cumulative Proportion  0.9980072 1.000000000
pca_Ba %>% loadings() %>% print(cutoff = 0)

Comp.1 Comp.2
Ba137  0.714  0.700
Ba138  0.700 -0.714

Comp.1 Comp.2
Proportion Var    0.5    0.5
Cumulative Var    0.5    1.0
# If the loadings for Comp.1 are negative,
#   then switch the signs of the scores (observations on the PCA scale)
#   so that positive still indicates larger values.
# For Ba, we need to use a positive sign in front of the scores to do this.
dat_sjrs <-
dat_sjrs %>%
mutate(
PC1_Ba = pca_Ba$scores[, "Comp.1"] %>% as.numeric() ) Note that Comp.1 explains 99.801% of the variability of both the Ba variables. Calculate the remaining features to use with PCA below, and report the proportion of variance explained by the first component, Comp.1. ### Solution [answer] Ca variables: Mg variables: Sr variables: ## (1 p) Plot a scatterplot matrix of the new PCA variables This plot should have five variables with a title indicating what is being plotted. Hint: Use the dat_sjrs %>% select(Source, PC1_Ba, ...) command in the first argument of the ggpairs() function. ### Solution [answer] # Logistic Regression In this section, we’ll use logistic regression to develop a model using the PCA features to calculate the probability that a given fish is from the Pond (versus River). ## (2 p) Fit a logistic regression model We will model the probability that a fish came from a pond. First we need a variable indicating whether it is from a pond or not. # response variable indicating "Success" dat_sjrs <- dat_sjrs %>% mutate( Pond = (Source == "Pond") ) Fit the logistic regression model below. If it does not fit, consider a more complex model (interactions and quadratic terms) until you find that the model fits. Perform backward selection and make sure reduced model also fits. {R} glm_pond <- glm( cbind(Pond, 1 - Pond) ~ PC1_Ba + PC1_Ca + PC1_Mg + PC1_Sr , family = binomial , data = dat_sjrs ) summary(glm_pond) # Test residual deviance for lack-of-fit (if > 0.10, little-to-no lack-of-fit) dev_p_val <- 1 - pchisq(glm_pond$deviance, glm_pond$df.residual) dev_p_val # option: trace = 0 doesn't show each step of the automated selection glm_pond_red_AIC <- step( glm_pond , direction = "both" , trace = 0 ) # the anova object provides a summary of the selection steps in order glm_pond_red_AIC$anova
summary(glm_pond_red_AIC)
# Test residual deviance for lack-of-fit (if > 0.10, little-to-no lack-of-fit)
dev_p_val <- 1 - pchisq(glm_pond_red_AIC$deviance, glm_pond_red_AIC$df.residual)
dev_p_val

Note that the model doesn’t fit well since the lack-of-fit p-value < 0.10. Adding higher-order terms, such as two-way interactions and squared terms, may help.

## (2 p) Assess prediction ability, choose classification threshold

In logistic regression, we have a prediction probability of success. We can find a threshold of that prediction probability (e.g., $$\hat{p}=0.3$$) as a boundary to classify two groups. Below, we summarize all possible threshold with an ROC curve. We also extract the optimal threshold (giving the jointly best Sensitivity and Specificity).

{R, fig.height = 5, fig.width = 5}
glm_roc <-
e_plot_roc(
labels_true     = dat_sjrs$Pond , pred_values_pos = glm_pond_red_AIC$fitted.values
, label_neg_pos   = c(FALSE, TRUE)
, sw_plot         = TRUE
, cm_mode         = c("sens_spec", "prec_recall", "everything")[1]
)

glm_roc$roc_curve_best %>% print(width = Inf) glm_roc$plot_roc

Use these values for the Sensitivity and Specificity for the interpretation below.

{R}
glm_roc$confusion_matrix Interpret the optimal Sensitivity and Specificity values by reading the linked article on the ROC curve or Sensitivity and Specificity. ### Solution [answer] # Classification In this section, we’ll will compare the predictions using the logistic regression model to the discriminant analysis from last class. We’ll predict the observations that were held out (after first projecting them into the PCA feature space). Then we’ll create the confusion matrix (table of which observations were classified from which populations into which other populations) then compare the error rates between the two methods. ## Projecting the “test” set of observations into the PCA space In order to project the “test” subset of the data into the PCA space, we need to perform the same centering and PCA rotation (the loadings) as was done on the “training” subset of data. All this information is in the PCA objects calculated above. Below, we first subtract the centering values (the means of the training data) then use matrix multiplication with the loadings to calculate the linear combinations of the isotope data for the rotation. As before, we choose the same $$+$$ or $$-$$ sign as above for each isotopic element. Now the “test” subset is mapped onto the same PC1 axis as the “training” subset. {R} ## The equation first subtracts the mean of the variables (in$center),
##   then calculates the linear combination for PC1 via matrix multiplication (%*%).

# A function to perform the transformation
f_pca_pred <-
function(
dat
, var_list
, pca_obj
) {
## TESTING ## dat = dat_sjrs_pred; var_list = c("Ba137", "Ba138"); pca_obj = pca_Ba
out <-
(as.matrix(dat %>% select(var_list)) -
matrix(rep(pca_obj$center, nrow(dat)), byrow = TRUE, nrow = nrow(dat), ncol = length(pca_obj$center))) %*%
as.matrix(pca_obj$loadings[,1], ncol = 1) %>% as.numeric() return(out) } # Do the transformation for each element dat_sjrs_pred$PC1_Ba <- f_pca_pred(dat_sjrs_pred, c("Ba137", "Ba138")      , pca_Ba)
dat_sjrs_pred$PC1_Ca <- f_pca_pred(dat_sjrs_pred, c("Ca43") , pca_Ca) dat_sjrs_pred$PC1_Mg <- f_pca_pred(dat_sjrs_pred, c("Mg24", "Mg25", "Mg26"), pca_Mg)
dat_sjrs_pred$PC1_Sr <- f_pca_pred(dat_sjrs_pred, c("Sr86", "Sr87", "Sr88"), pca_Sr) The test data should look similar to the training data because it was randomly sampled from the whole. {R, fig.height = 8, fig.width = 8} # Scatterplot matrix library(ggplot2) library(GGally) p <- ggpairs( dat_sjrs_pred %>% select(Source, PC1_Ba, PC1_Ca, PC1_Mg, PC1_Sr) , mapping = ggplot2::aes(colour = Source, alpha = 0.5) , upper = list(continuous = "density", combo = "box") , lower = list(continuous = "points", combo = "dot") #, lower = list(continuous = "cor") , title = "PCA features by source, Test Prediction data" ) print(p) ## (2 p) Logistic regression vs QDA classification Logistic regression classification using the optimal threshold from the ROC curve. {R, fig.height = 5, fig.width = 5} # classifications in the training set dat_sjrs_pred <- dat_sjrs_pred %>% mutate( class = ifelse( (predict(glm_pond_red_AIC, newdata = dat_sjrs_pred, type = "response") %>% as.numeric() >= glm_roc$roc_curve_best$thresh) , "Pond" , "River" ) %>% factor() ) glm_roc_pred <- e_plot_roc( labels_true = dat_sjrs_pred$Source
, pred_values_pos = dat_sjrs_pred$class , label_neg_pos = c("Pond", "River") , sw_plot = TRUE , cm_mode = c("sens_spec", "prec_recall", "everything")[1] ) glm_roc_pred$roc_curve_best %>% print(width = Inf)
glm_roc_pred$plot_roc Use these values for the Sensitivity and Specificity for the interpretation below. {R} glm_roc_pred$confusion_matrix

Quadratic discriminant analysis classification using the pca features.

{R, fig.height = 5, fig.width = 5}
#library(MASS)
qda_sjrs <-
MASS::qda(
Source ~ PC1_Ba + PC1_Ca + PC1_Mg + PC1_Sr
, data = dat_sjrs
)
#qda_sjrs
# CV = TRUE does jackknife (leave-one-out) crossvalidation
#qda_sjrs.cv <- qda(Source ~ PC1_Ba + PC1_Ca + PC1_Mg + PC1_Sr
#                 , data = dat_sjrs, CV = TRUE)

# predict the test data from the training data LDFs
qda_sjrs_pred <-
predict(
qda_sjrs
, newdata = dat_sjrs_pred
)
qda_sjrs_pred_class <-
data.frame(
Source = dat_sjrs_pred$Source , class = qda_sjrs_pred$class
#, error = ""
, round(qda_sjrs_pred$posterior,3) ) colnames(qda_sjrs_pred_class) <- c( "Source" , "class" #, "error" , paste("post", colnames(qda_sjrs_pred$posterior), sep="")
)

qda_roc_pred <-
e_plot_roc(
labels_true     = qda_sjrs_pred_class$Source , pred_values_pos = qda_sjrs_pred_class$class
, label_neg_pos   = c("Pond", "River")
, sw_plot         = TRUE
, cm_mode         = c("sens_spec", "prec_recall", "everything")[1]
)

qda_roc_pred$roc_curve_best %>% print(width = Inf) qda_roc_pred$plot_roc

Use these values for the Sensitivity and Specificity for the interpretation below.

{R}
qda_roc_pred\$confusion_matrix

Summarize the error rates and note and differences you observe between the logistic regression and QDA methods.

Decide which method is preferred.

If you’d like, change the set.seed() value to draw a different random sample and see how the results change from sample to sample, but please return the seed to the original value before turning in your solutions.