---
title: "ADA2: Homework 13, Chs 13+11+17, PCA and Logistic Regression for Classification"
author: "Name Here"
date: "mm/dd/yyyy"
output:
  pdf_document:
    number_sections: yes
    toc: yes
  html_document:
    toc: true
    number_sections: true
    code_folding: show
---
```{R, echo=FALSE}
# I set some GLOBAL R chunk options here.
# (to hide this message add "echo=FALSE" to the code chunk options)
knitr::opts_chunk$set(comment = NA, message = FALSE, warning = FALSE, width = 100)
knitr::opts_chunk$set(fig.align = "center", fig.height = 4, fig.width = 6)
knitr::opts_chunk$set(cache = TRUE, autodep=TRUE)
```
__Later code chunks depend on computations I'd like you to do, so those chunks have been
disabled by moving the `{R}` onto the line below the opening triple-backticks.
Simply move the `{R}` back up to follow the opening triple-backticks to let those chunks compute.__
# San Juan River Razorback Suckers
Peer mentor (Spring 2016) _Adam Barkalow_'s wife uses stable isotope ratios to analyze fish.
__Razorback Suckers__ were collected in 2014 on the San Juan River.
Elemental [isotopic ratios](https://en.wikipedia.org/wiki/Isotope_analysis)
from finrays were analyzed for Ba (Barium 56), Ca (Calcium 20), Mg (Magnesium 12), and Sr (Strontium 38).
Finrays are non-lethally obtained and are used to detect natal origin since the
material in the rays is deposited early in life.
__The issue__ is that hatchery fish can get into the river and lose their tags.
It is important for environmental resource managers to know whether untagged fish are wild or hatchery fish.
There are five fish sources in the dataset.
```
4 known sources, 1 mix of unknown sources
Hatchery
DEX = Dexter National Fish Hatchery
GJH = Ouray National Fish Hatchery, Grand Valley Unit
Wild
NAP = NAPI ponds
SJR = San Juan River
Unknown
UNK = untagged Razorback Suckers captured in the San Juan River
these could be from any of the above sources
```
__Our goal__ is to use PCA to reduce the sets of isotopes of the same element to a single feature for each element.
Then, using the binary response of "Pond" versus "River", we'll fit a logistic regression model and do model selection.
Finally, we'll assess the expected classification accuracy of the logistic regression classifier, and predict some observations I withheld from the model building.
## Clean and transform data
Looking at the scatterplot matrix below, clean and/or transform the data if you think it will be helpful.
Note that measurement error can be an issue in complicated biological measurements.
Furthermore, a transformation might help separate observations that are tightly grouped in space.
__Please download__ the data into your working directory to save downloads from
my website each time you knit this document.
```{R}
sjrs.full <- read.csv("ADA2_HW_12_MANOVA_SanJuanRazorbackSuckers_data2014.csv")
dim(sjrs.full)
# the last set of observations only have a few isotopes, so exclude
sjrs.full <- na.omit(sjrs.full)
dim(sjrs.full)
# no missing values
sum(is.na(sjrs.full))
#str(sjrs)
sjrs <- sjrs.full
names(sjrs)
nrow(sjrs)
# don't include UNK group
sjrs <- subset(sjrs, subset = (Source != "UNK"), select = c(Source, Ba137:Sr88))
# Simplify Source to be "Pond" and "River"
sjrs$Source <- as.character(sjrs$Source)
sjrs <- subset(sjrs, subset = (Source %in% c("NAP", "SJR")), select = c(Source, Ba137:Sr88))
sjrs$Source[(sjrs$Source %in% c("NAP"))] <- "Pond"
sjrs$Source[(sjrs$Source %in% c("SJR"))] <- "River"
sjrs$Source <- factor(sjrs$Source)
# transforming the Ba values separates the tight clustering on the boundary
sjrs$Ba137 <- log10(sjrs$Ba137)
sjrs$Ba138 <- log10(sjrs$Ba138)
## There are a few unusual observations; remove those, assuming measurement errors
# remove two small Ca43 values
sjrs <- subset(sjrs, Ca43 > 0.5)
## NOTE HERE
## Subset for classification later
# start random number generator in same place for everyone
# and so that random partitions are the same each time code is run
set.seed(3)
# sample a subset of observation indices to predict
pred.ind <- sort(sample.int(nrow(sjrs), size = 100))
# prediction subset
sjrs.pred <- sjrs[pred.ind,]
# remove observations to predict from data to develop the model
sjrs <- sjrs[-pred.ind,]
```
## Known fish scatterplot
Note that this plot can take a while to generate.
You're welcome to subset the data further for this plot if some of the variables are redundant.
You could probably get away with 5 columns of data without any loss of interpretation.
If you want to do this, replace the `sjrs` in the `ggpairs()` function with
`subset(sjrs, select = c(col1, col2, ...))` and specify the columns to plot.
```{R, fig.height = 8, fig.width = 8}
# Scatterplot matrix
library(ggplot2)
library(GGally)
p <- ggpairs(sjrs
, mapping = ggplot2::aes(colour = Source, alpha = 0.5)
, upper = list(continuous = "density", combo = "box")
, lower = list(continuous = "points", combo = "dot")
#, lower = list(continuous = "cor")
, title = "Original data by source"
)
print(p)
```
# Principal Components Analysis (PCA)
__In this section__, we'll reduce the number of variables in the dataset by
using PCA to generate new features which are linear combinations of selected variables.
In this way, we can greatly reduce the dimension of the problem while retaining most of
the information in the data.
## __(2 p)__ PCA of selected sets of variables
I expect that you'll have four features at the end of this part.
I'll do the first one as an example and you can do the rest.
__Ba variables__:
```{R}
pca.Ba <- princomp( ~ Ba137 + Ba138, data = sjrs, cor = FALSE)
summary(pca.Ba)
print(loadings(pca.Ba), cutoff = 0)
# If the loadings for Comp.1 are negative,
# then switch the signs of the scores (observations on the PCA scale)
# so that positive still indicates larger values.
# For Ba, we need to use a negative sign in front of the scores to do this.
sjrs$PC1.Ba <- -pca.Ba$scores[,"Comp.1"]
```
Note that `Comp.1` explains
`r signif(100*summary(pca.Ba)$sdev[1]^2/sum(summary(pca.Ba)$sdev^2), 5)`%
of the variability of both the Ba variables.
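As a standalone check of this proportion-of-variance calculation, here is a minimal sketch on simulated data (`x1` and `x2` are made-up variables, not the isotope measurements):

```r
# sketch: proportion of variance explained by Comp.1, on simulated data
set.seed(1)
x1 <- rnorm(50)
x2 <- x1 + rnorm(50, sd = 0.3)  # strongly correlated with x1
pca.ex <- princomp(~ x1 + x2, cor = FALSE)
# each component's variance is its sdev squared;
# divide by the total to get the proportion explained
prop1 <- pca.ex$sdev[1]^2 / sum(pca.ex$sdev^2)
prop1  # close to 1 since x1 and x2 are nearly collinear
```

Because the simulated variables are nearly collinear, `Comp.1` carries almost all of the variability, which is the situation we hope for when collapsing each element's isotopes to a single feature.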
__Calculate__ the remaining features to use with PCA below, and
__indicate__ the proportion of variance explained by the first component, `Comp.1`.
### Solution
[answer]
__Ca variables__:
__Mg variables__:
__Sr variables__:
## __(1 p)__ Plot a scatterplot matrix of the new PCA variables
This plot should have five variables with a title indicating what is being plotted.
__Hint:__ Use the `subset(..., select = c())` command in the first argument of the `ggpairs()` function.
### Solution
[answer]
# Logistic Regression
__In this section__, we'll use logistic regression to develop a model using the PCA
features to calculate the probability that a given fish is from a hatchery
(versus wild).
## __(2 p)__ Fit a logistic regression model
We will model the probability that a fish came from a hatchery.
First we need a variable indicating whether it is from a hatchery or not.
```{R}
# response variable indicating "Success"
sjrs$Pond <- (sjrs$Source == "Pond")
```
__Fit the logistic regression model__ below.
If it does not fit, consider a __more complex model__ (interactions and quadratic terms)
until you find that the model fits.
Perform __backward selection__ and make sure the reduced model also fits.
```
{R}
glm.hatch <- glm(cbind(Pond, 1 - Pond) ~ PC1.Ba + PC1.Ca + PC1.Mg + PC1.Sr
, family = binomial, sjrs)
summary(glm.hatch)
# Test residual deviance for lack-of-fit (if > 0.10, little-to-no lack-of-fit)
dev.p.val <- 1 - pchisq(glm.hatch$deviance, glm.hatch$df.residual)
dev.p.val
# option: trace = 0 doesn't show each step of the automated selection
glm.hatch.red.AIC <- step(glm.hatch, direction="both", trace = 0)
# the anova object provides a summary of the selection steps in order
glm.hatch.red.AIC$anova
summary(glm.hatch.red.AIC)
# Test residual deviance for lack-of-fit (if > 0.10, little-to-no lack-of-fit)
dev.p.val <- 1 - pchisq(glm.hatch.red.AIC$deviance, glm.hatch.red.AIC$df.residual)
dev.p.val
```
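For reference, the same fit / lack-of-fit / selection pattern can be run on simulated data (a sketch; `z1`, `z2`, and `y` are hypothetical, not the fish data):

```r
# sketch: logistic fit, deviance lack-of-fit test, and AIC selection on simulated data
set.seed(2)
n  <- 200
z1 <- rnorm(n)
z2 <- rnorm(n)  # unrelated to the response
y  <- rbinom(n, size = 1, prob = plogis(-0.5 + 2 * z1))
glm.ex <- glm(cbind(y, 1 - y) ~ z1 + z2, family = binomial)
# residual deviance lack-of-fit test (p-value > 0.10 suggests adequate fit)
1 - pchisq(glm.ex$deviance, glm.ex$df.residual)
# stepwise selection by AIC; AIC can only improve (or stay equal)
glm.ex.red <- step(glm.ex, direction = "both", trace = 0)
formula(glm.ex.red)
```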
### Solution
[answer]
## __(2 p)__ Assess prediction ability, choose classification threshold
In logistic regression,
we have a prediction probability of success.
We can find a threshold of that prediction probability (e.g., $\hat{p}=0.3$)
as a boundary to classify two groups.
Below, we summarize all possible thresholds with an [ROC curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic).
We also extract the optimal threshold (jointly closest to perfect Sensitivity and Specificity).
```
{R, fig.height = 6, fig.width = 6}
library(ROCR)
pred <- prediction(predictions = glm.hatch$fitted.values, labels = sjrs$Pond)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
# determine the best threshold as the ROC point closest to the
#   ideal corner (Sens = 1, Spec = 1)
roc.curve <- data.frame(Spec = 1 - unlist(perf@x.values)
, Sens = unlist(perf@y.values)
, thresh = unlist(perf@alpha.values))
# distance from the ideal corner; smaller is better
roc.curve$dist <- sqrt((1 - roc.curve$Sens)^2 + (1 - roc.curve$Spec)^2)
opt_t <- subset(roc.curve, roc.curve$dist == min(roc.curve$dist))
opt_t
# color the ROC curve by threshold
plot(perf
, colorize.palette = rev(rainbow(256, start=0, end=4/6))
, colorize = TRUE)
abline(0,1, col = "gray80") # reference line
# optimal threshold
abline(v = 1 - opt_t$Spec, col = "gray80")
abline(h = opt_t$Sens, col = "gray80")
```
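The closest-to-the-corner criterion can also be computed by hand, without `ROCR` (a sketch on simulated fitted probabilities; all names here are made up):

```r
# sketch: choose the threshold minimizing distance to the ideal corner (Sens = Spec = 1)
set.seed(4)
lab   <- rep(c(TRUE, FALSE), each = 100)                # true group labels
score <- c(rnorm(100, 0.7, 0.2), rnorm(100, 0.3, 0.2))  # stand-in fitted probabilities
thresh <- sort(unique(score))
sens <- sapply(thresh, function(t) mean(score[lab]  >= t))  # true positive rate
spec <- sapply(thresh, function(t) mean(score[!lab] <  t))  # true negative rate
dist <- sqrt((1 - sens)^2 + (1 - spec)^2)
opt  <- thresh[which.min(dist)]
opt  # typically lands between the two group means
```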
__Interpret__ the optimal Sensitivity and Specificity values by reading the
linked article on the [ROC curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic)
or [Sensitivity and Specificity](https://en.wikipedia.org/wiki/Sensitivity_and_specificity).
### Solution
[answer]
# Classification
__In this section__, we'll compare the predictions using the
logistic regression model to the discriminant analysis from last class.
We'll predict the observations that were held out
(after first projecting them into the PCA feature space).
Then we'll create the confusion matrix
(table of which observations were classified from which populations into which other populations)
then compare the error rates between the two methods.
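As a tiny standalone illustration of a confusion matrix and the error-rate calculations used in this section (the labels here are hypothetical):

```r
# sketch: confusion matrix and overall error rate from true vs classified labels
true  <- factor(c("Pond", "Pond", "Pond", "River", "River", "River"))
class <- factor(c("Pond", "Pond", "River", "River", "River", "Pond"))
cm <- table(true, class)  # rows = true source, cols = classification
cm
prop.table(cm, 1)                     # proportions by row (per-class rates)
err <- 1 - sum(diag(prop.table(cm)))  # overall error rate
err  # 2 of 6 misclassified, so 1/3
```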
## Projecting the "test" set of observations into the PCA space
In order to project the "test" subset of the data into the PCA space,
we need to perform the same centering and PCA rotation (the loadings) as was done on the "training" subset of data.
All this information is in the PCA objects calculated above.
Below, we first subtract the centering values (the means of the training data)
then use matrix multiplication with the loadings to calculate the linear combinations of the isotope data for the rotation.
As before, we choose the same $+$ or $-$ sign as above for each isotopic element.
Now the "test" subset is mapped onto the same PC1 axis as the "training" subset.
```
{R}
## The equation first subtracts the mean of the variables (in $center),
## then calculates the linear combination for PC1 via matrix multiplication (%*%).
# A function to perform the transformation
f.pca.pred <- function(dat, var.list, pca.obj) {
  out <- as.numeric(
           (as.matrix(subset(dat, select = var.list))
             - matrix(rep(pca.obj$center, nrow(dat)), byrow = TRUE
                      , nrow = nrow(dat), ncol = length(pca.obj$center)))
           %*% as.matrix(pca.obj$loadings[, 1], ncol = 1)
         )
  return(out)
}
# Do the transformation for each element
sjrs.pred$PC1.Ba <- -f.pca.pred(sjrs.pred, c("Ba137", "Ba138") , pca.Ba)
sjrs.pred$PC1.Ca <- f.pca.pred(sjrs.pred, c("Ca43") , pca.Ca)
sjrs.pred$PC1.Mg <- f.pca.pred(sjrs.pred, c("Mg24", "Mg25", "Mg26"), pca.Mg)
sjrs.pred$PC1.Sr <- -f.pca.pred(sjrs.pred, c("Sr86", "Sr87", "Sr88"), pca.Sr)
```
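That centering-and-rotation recipe can be verified on simulated data: projecting observations by hand reproduces the scores that `princomp()` computed (a sketch; `d`, `a`, and `b` are made-up names):

```r
# sketch: manual projection (center, then multiply by loadings) matches princomp scores
set.seed(5)
d <- data.frame(a = rnorm(30), b = rnorm(30))
d$b <- d$b + d$a
pca.d <- princomp(~ a + b, data = d, cor = FALSE)
manual <- (as.matrix(d) - matrix(pca.d$center, nrow(d), 2, byrow = TRUE)) %*%
  as.matrix(pca.d$loadings[, 1], ncol = 1)
all.equal(as.numeric(manual), as.numeric(pca.d$scores[, "Comp.1"]))  # TRUE
```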
The test data should look similar to the training data because it was randomly sampled from the whole.
```
{R, fig.height = 8, fig.width = 8}
# Scatterplot matrix
library(ggplot2)
library(GGally)
p <- ggpairs(subset(sjrs.pred, select = c(Source, PC1.Ba, PC1.Ca, PC1.Mg, PC1.Sr))
, mapping = ggplot2::aes(colour = Source, alpha = 0.5)
, upper = list(continuous = "density", combo = "box")
, lower = list(continuous = "points", combo = "dot")
#, lower = list(continuous = "cor")
, title = "PCA features by source, Test Prediction data"
)
print(p)
```
## __(2 p)__ Logistic regression vs QDA classification
__Logistic regression classification__ using the optimal threshold from the ROC curve.
```
{R}
pred <- predict(glm.hatch.red.AIC, newdata = sjrs.pred, type = "response")
sjrs.pred$class <- factor(ifelse((pred >= opt_t$thresh), "Pond", "River"))
# Assess the accuracy of the prediction
# row = true source, col = classified source
pred.freq <- table(sjrs.pred$Source, sjrs.pred$class)
pred.freq
prop.table(pred.freq, 1) # proportions by row
# proportion correct for each category
#diag(prop.table(pred.freq, 1))
# total proportion correct
sum(diag(prop.table(pred.freq)))
# total error rate
logis.error <- 1 - sum(diag(prop.table(pred.freq)))
logis.error
```
__Quadratic discriminant analysis__ classification using the pca features.
```
{R}
library(MASS)
qda.sjrs <- qda(Source ~ PC1.Ba + PC1.Ca + PC1.Mg + PC1.Sr
, data = sjrs)
#qda.sjrs
# CV = TRUE does jackknife (leave-one-out) crossvalidation
#qda.sjrs.cv <- qda(Source ~ PC1.Ba + PC1.Ca + PC1.Mg + PC1.Sr
# , data = sjrs, CV = TRUE)
# predict the test data from the training data LDFs
pred.sjrs.pred <- predict(qda.sjrs, newdata = sjrs.pred)
classify.sjrs.pred <- data.frame(Source = sjrs.pred$Source
, class = pred.sjrs.pred$class
#, error = ""
, round(pred.sjrs.pred$posterior,3))
colnames(classify.sjrs.pred) <- c("Source", "class"#, "error"
, paste("post", colnames(pred.sjrs.pred$posterior), sep=""))
# Assess the accuracy of the prediction
# row = true Source, col = classified Source
pred.freq <- table(classify.sjrs.pred$Source, classify.sjrs.pred$class)
pred.freq
prop.table(pred.freq, 1) # proportions by row
# proportion correct for each category
#diag(prop.table(pred.freq, 1))
# total proportion correct
sum(diag(prop.table(pred.freq)))
# total error rate
qda.error <- 1 - sum(diag(prop.table(pred.freq)))
qda.error
```
__Summarize__ the error rates and note any differences between the methods you observe.
__Decide__ which method is preferred.
_If you'd like_, change the `set.seed()` value to draw a different random sample and see
how the results change from sample to sample.
### Solution
[answer]
## __(1 p)__ Accuracy assessment
__Comment__ on whether you are surprised by the prediction accuracy on the "test" subset.
Did the classifier work much better, worse, or roughly what you expected?
Do you have any ideas for how classification could be improved between these two difficult-to-distinguish sources?
### Solution
[answer]