Razorback Suckers were collected in 2014 on the San Juan River. Elemental isotopic ratios from fin rays were analyzed for Ba (barium, atomic number 56), Ca (calcium, 20), Mg (magnesium, 12), and Sr (strontium, 38). Fin rays are obtained non-lethally and are used to detect natal origin because the material in the rays is partially developed early in life.
The issue is that hatchery fish can get into the river and lose their tags. It is important for environmental resource managers to know whether untagged fish are wild or hatchery fish. There are five fish sources in the dataset. The UNK fish are very strange, though, so we’re going to exclude them from our analysis for the purpose of this class and focus on a subset from the four Hatchery and Wild sources.
There are 4 known sources and 1 group that is a mix of the Hatchery and Wild sources:

Hatchery
  DEX = Dexter National Fish Hatchery
  GJH = Ouray National Fish Hatchery, Grand Valley Unit
Wild
  NAP = NAPI ponds
  SJR = San Juan River
Unknown (removed)
  UNK = untagged Razorback Suckers captured in the San Juan River;
        these could be from any of the above sources
New (subset of Hatchery and Wild sources)
  NEW = Razorback Suckers from the Hatchery and Wild sources whose source label is withheld for prediction
Our goal is to use these observations with linear or quadratic discriminant analysis to evaluate classification accuracy for the four known sources using the jackknife (leave-one-out cross-validation), and then to predict a set of observations whose source identity I have withheld.
NOTE: This assignment is slightly different from what is in the helper video. The helper video develops a classification model on all known fish to predict an unknown population; however, that unknown population is very different from the known fish and doesn’t demonstrate the classification method well for pedagogical purposes. Therefore, I made the following changes: I removed the unknown population and instead assigned 1 in every 10 observations to the NEW group, which we’ll use for prediction at the end of the assignment. Otherwise, the assignment is similar enough that I won’t record a new video for this version.
Clean and transform data
Looking at the scatterplot matrix below, clean and/or transform the data if you think it will be helpful. Note that measurement error can be an issue in complicated biological measurements. Furthermore, a transformation might help separate observations that are tightly grouped in space.
# First, download the data to your computer,
#   save in the same folder as this Rmd file.
# read the data
dat_sjrs_full <-
  read_csv("ADA2_CL_21_Clustering_SanJuanRazorbackSuckers_data2014.csv")
Rows: 1512 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Station, Source, Type
dbl (10): Sort Key, Ba137, Ba138, Ca43, Mg24, Mg25, Mg26, Sr86, Sr87, Sr88
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dim(dat_sjrs_full)
[1] 1512 13
# the last set of observations only have a few isotopes, so exclude
dat_sjrs <- dat_sjrs_full %>% na.omit()
dim(dat_sjrs)
[1] 1447 13
# no missing values
dat_sjrs %>% is.na() %>% sum()
[1] 0
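Before committing to the log10 transform of the Ba measurements used below, a quick histogram comparison can show whether the transform actually spreads out the tightly clustered values. This is a minimal sketch, not part of the original assignment; Ba138 is chosen only for illustration.

# optional check (sketch): raw vs log10 scale for one Ba isotope
library(ggplot2)  # already attached if library(tidyverse) was run
p_raw <- ggplot(dat_sjrs, aes(x = Ba138))        + geom_histogram(bins = 50) + labs(title = "Ba138, raw scale")
p_log <- ggplot(dat_sjrs, aes(x = log10(Ba138))) + geom_histogram(bins = 50) + labs(title = "Ba138, log10 scale")
print(p_raw)
print(p_log)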
#str(dat_sjrs)
dat_sjrs <-
  dat_sjrs %>%
  select(
    Source
  , Ba137:Sr88
  ) %>%
  filter(
    # Exclude unknown sources
    Source != "UNK"
  ) %>%
  mutate(
    Source_org = factor(Source)  # original source
    # every 10th observation, assign to "NEW"
  , Source = ifelse((1:n() %% 10) == 0, "NEW", Source)
  , Source = factor(Source)
    # transforming the Ba values separates the tight clustering on the boundary
  , Ba137 = log10(Ba137)
  , Ba138 = log10(Ba138)
  )
names(dat_sjrs)
# 1/10 of the sources have been assigned to "NEW"
dat_sjrs %>% pull(Source_org) %>% table()
.
DEX GJH NAP SJR
224 199 133 244
dat_sjrs %>% pull(Source) %>% table()
.
DEX GJH NAP NEW SJR
201 180 120 80 219
## NOTE HERE
# NEW group to predict
dat_sjrs_new <-
  dat_sjrs %>%
  filter(
    Source == "NEW"
  )
# Known groups
dat_sjrs <-
  dat_sjrs %>%
  filter(
    Source != "NEW"
  ) %>%
  filter(
    # There are a few unusual observations; remove those assuming measurement errors
    # remove two small Ca43 values
    Ca43 > 0.5
  ) %>%
  mutate(
    Source = factor(Source)  # to remove unused levels
  )
# data sizes
dat_sjrs_new %>% dim()
[1] 80 11
dat_sjrs %>% dim()
[1] 719 11
Known fish scatterplot
Note that this plot can take a while to generate. You’re welcome to subset the data further for this plot if some of the variables are redundant (highly correlated). You could probably get away with 5 columns of data without any loss of interpretation. If you want to do this, replace the dat_sjrs in the ggpairs() function with dat_sjrs %>% select(col1, col2, ...) and specify the columns to plot.
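If you want to subset, a correlation matrix can confirm which columns are redundant; isotopes of the same element are likely nearly collinear. The sketch below is only a suggestion, and the column choice in the comment is an example, not a requirement.

# check within-element redundancy before subsetting for the plot (sketch)
dat_sjrs %>%
  select(Ba137:Sr88) %>%
  cor() %>%
  round(2)
# if the within-element correlations are near 1, one isotope per element is probably enough, e.g.:
#   ggpairs(dat_sjrs %>% select(Source, Ba138, Ca43, Mg24, Sr88), ...)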
p <-
  ggpairs(
    dat_sjrs
  , mapping = ggplot2::aes(colour = Source, alpha = 0.5)
  , upper   = list(continuous = "density", combo = "box")
  , lower   = list(continuous = "points", combo = "dot")
  #, lower  = list(continuous = "cor")
  , title   = "Original data by source"
  )
print(p)
Classification
Goal: Develop a classification model using discriminant analysis that will reliably classify new observations from the known populations. Then, apply the classification model to the NEW population in hopes of providing insight into the untagged razorbacks.
State the assumptions of both the LDA and QDA discriminant methods.
Interpret the plots above to evaluate whether the assumptions are met.
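As a rough numeric supplement to the plots, comparing group-wise variances and covariances can help judge whether the groups share a common covariance structure. This is a sketch, not required, and the two variables are chosen only for illustration.

# group-wise spread for two variables (sketch); similar values across sources
#   suggest a common covariance structure, large differences suggest otherwise
dat_sjrs %>%
  group_by(Source) %>%
  summarize(
    var_Ba138 = var(Ba138)
  , var_Sr87  = var(Sr87)
  , cov_BaSr  = cov(Ba138, Sr87)
  )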
Solution
[answer]
(3 p) Stepwise selection for classification
If multivariate normality is violated, state so, then continue anyway.
Above, if the assumptions of the LDA are met, then use LDA; otherwise use QDA.
Below I provide the code for both backward selection (starting with full model) and forward selection (starting with empty model).
You’ll need to specify LDA or QDA throughout the code (note that a find/replace of “lda” with “qda”, or vice versa, is the quickest and safest way to do this).
dat_sjrs_d <- dat_sjrs %>% select(Ba137:Sr88)  # the data
dat_sjrs_c <- dat_sjrs %>% pull(Source)        # the classes

# start random number generator in same place for everyone
#   and so that random partitions are the same each time code is run
set.seed(7)

#library(klaR)  # don't run this since it does library(MASS) and breaks select() from dplyr

# Backward
step_sjrs_b <-
  klaR::stepclass(
    dat_sjrs_d
  , dat_sjrs_c
  , method      = "lda"    # or "qda"
  , improvement = 0.001    # stop criterion: improvement less than
  , direction   = "backward"
  , start.vars  = colnames(dat_sjrs_d)
  )
`stepwise classification', using 10-fold cross-validated correctness rate of method lda'.
719 observations of 9 variables in 4 classes; direction: backward
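The forward-selection run is referenced in the plots below as step_sjrs_f, but its code is not shown above. Here is a minimal sketch of what it presumably looks like, mirroring the backward call but starting from the empty model (no start.vars).

# Forward (assumed to mirror the backward call, without start.vars, so selection starts empty)
step_sjrs_f <-
  klaR::stepclass(
    dat_sjrs_d
  , dat_sjrs_c
  , method      = "lda"    # or "qda"
  , improvement = 0.001    # stop criterion: improvement less than
  , direction   = "forward"
  )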
op <- par(no.readonly = TRUE)  # the whole list of settable par's
# make wider left margin to fit contrast labels
par(mfrow = c(1, 2), mar = 0 * rep(1, 4))  # order is c(bottom, left, top, right)
plot(step_sjrs_f, ylim = c(0, 1), main = "empty model, forward")
plot(step_sjrs_b, ylim = c(0, 1), main = "full model, backward")
par(op)  # reset plotting options

## NOTE HERE
# set the formula you're using here, then it will be used throughout the rest
sjrs_formula <- step_sjrs_b

# Select and print the final model
#library(MASS)  # don't run library(MASS) because it breaks select() from dplyr
lda_sjrs_final <-
  MASS::lda(
    grouping = dat_sjrs_c
  , x = dat_sjrs_d %>% select(sjrs_formula$model$name)
  )
lda_sjrs_final
Discuss the differences between the backward and forward selection results.
Specify the final model you’ll use for classification.
Indicate the expected classification accuracy of the selected final model.
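If helpful, printing the stepclass objects created above shows the cross-validated correctness rate of each selected model. This is only a sketch of one way to read off that number, not a required step.

# estimated correctness rates from the stepwise runs (reported in their print output)
step_sjrs_f
step_sjrs_b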
Solution
[answer]
(2 p) Classification accuracy
Let’s look more closely at classification accuracy by evaluating the confusion matrix for classification, the table of how many observations from each population were classified into which populations. Numbers along the diagonal are correctly classified, and off-diagonals are errors.
(Make sure you use the correct lda() or qda() function as selected above.)
# CV = TRUE does jackknife (leave-one-out) crossvalidation
#library(MASS)  # don't run library(MASS) because it breaks select() from dplyr
lda_sjrs_cv <-
  MASS::lda(
    grouping = dat_sjrs_c
  , x  = dat_sjrs_d %>% select(sjrs_formula$model$name)
  , CV = TRUE
  )
#lda_sjrs_cv

# Create a table of classification and posterior probabilities for each observation
classify_sjrs <-
  data.frame(
    Source = dat_sjrs$Source
  , class  = lda_sjrs_cv$class
  , error  = ""
  , round(lda_sjrs_cv$posterior, 3)
  )
colnames(classify_sjrs) <-
  c(
    "Source"
  , "class"
  , "error"
  , paste("post", colnames(lda_sjrs_cv$posterior), sep = "_")
  )
# error column
classify_sjrs$error <- as.character(classify_sjrs$error)
classify_agree <- as.character(as.numeric(dat_sjrs$Source) - as.numeric(lda_sjrs_cv$class))
classify_sjrs$error[!(classify_agree == 0)] <- classify_agree[!(classify_agree == 0)]
# print table
#classify_sjrs

# A list of classification statistics
library(caret)
Loading required package: lattice
Attaching package: 'caret'
The following object is masked from 'package:purrr':
lift
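The confusionMatrix() call that produces the output below is not shown; presumably it mirrors the call used for the NEW data at the end of the assignment. A minimal sketch, assuming the jackknife predictions and true sources as arguments:

confusionMatrix(
    data      = lda_sjrs_cv$class  # jackknife predictions
  , reference = dat_sjrs_c         # true labels (same as dat_sjrs$Source)
  , mode      = "sens_spec"        # restrict output to relevant summaries
  )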
Confusion Matrix and Statistics
Reference
Prediction DEX GJH NAP SJR
DEX 200 0 0 0
GJH 1 180 4 2
NAP 0 0 11 13
SJR 0 0 105 203
Overall Statistics
Accuracy : 0.8261
95% CI : (0.7964, 0.8532)
No Information Rate : 0.3032
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.7591
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: DEX Class: GJH Class: NAP Class: SJR
Sensitivity 0.9950 1.0000 0.09167 0.9312
Specificity 1.0000 0.9870 0.97830 0.7904
Pos Pred Value 1.0000 0.9626 0.45833 0.6591
Neg Pred Value 0.9981 1.0000 0.84317 0.9635
Prevalence 0.2796 0.2503 0.16690 0.3032
Detection Rate 0.2782 0.2503 0.01530 0.2823
Detection Prevalence 0.2782 0.2601 0.03338 0.4284
Balanced Accuracy 0.9975 0.9935 0.53498 0.8608
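The per-source classification rates can also be pulled out directly from a hand-built confusion table, which may make the comparison below easier. This is a sketch using only base R on the objects defined above.

# rebuild the jackknife confusion table and check the overall accuracy reported above
tab_cv <- table(Prediction = lda_sjrs_cv$class, Reference = dat_sjrs_c)
sum(diag(tab_cv)) / sum(tab_cv)
# per-source correct-classification (recall) rate; compare with the sensitivities above
diag(prop.table(table(dat_sjrs_c, lda_sjrs_cv$class), margin = 1))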
Determine whether some populations are better classified than others, or whether each population seems to have roughly the same error rate.
Solution
[answer]
(0 p) View Classification
klaR::partimat(
    grouping = dat_sjrs_c
  , x = dat_sjrs_d %>% select(sjrs_formula$model$name)
  , plot.matrix = FALSE
  , method = "lda"          # or "qda"
  , main   = "LDA partition"  # or "QDA partition"
  )
(3 p) Classify NEW observations
Now we’ll use the final model selected above to predict observations from the NEW population. These are untagged Razorback Suckers captured in the San Juan River which could be from any of the other sources.
# new observations to classify
summary(dat_sjrs_new$Source)
DEX GJH NAP NEW SJR
0 0 0 80 0
# predict the NEW data from the training data LDFs
pred_sjrs_new <-
  predict(
    lda_sjrs_final
  , newdata = dat_sjrs_new %>% select(sjrs_formula$model$name)
  )

# Create a table of classification and posterior probabilities for each observation
classify_sjrs_new <-
  data.frame(
    Source = dat_sjrs_new$Source_org
  , class  = pred_sjrs_new$class
  , error  = ""
  , round(pred_sjrs_new$posterior, 3)
  )
colnames(classify_sjrs_new) <-
  c(
    "Source"
  , "class"
  , "error"
  , paste("post", colnames(pred_sjrs_new$posterior), sep = "_")
  )
# error column
classify_sjrs_new$error <- as.character(classify_sjrs_new$error)
classify_agree_new <- as.character(as.numeric(dat_sjrs_new$Source_org) - as.numeric(pred_sjrs_new$class))
classify_sjrs_new$error[!(classify_agree_new == 0)] <- classify_agree_new[!(classify_agree_new == 0)]
# print table
#classify_sjrs_new

# A list of classification statistics
library(caret)
confusionMatrix(
    data      = pred_sjrs_new$class      # predictions
  , reference = dat_sjrs_new$Source_org  # true labels
  , mode      = "sens_spec"              # restrict output to relevant summaries
  )
Confusion Matrix and Statistics
Reference
Prediction DEX GJH NAP SJR
DEX 23 0 0 0
GJH 0 19 0 0
NAP 0 0 3 0
SJR 0 0 10 25
Overall Statistics
Accuracy : 0.875
95% CI : (0.7821, 0.9384)
No Information Rate : 0.3125
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.8259
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: DEX Class: GJH Class: NAP Class: SJR
Sensitivity 1.0000 1.0000 0.2308 1.0000
Specificity 1.0000 1.0000 1.0000 0.8182
Pos Pred Value 1.0000 1.0000 1.0000 0.7143
Neg Pred Value 1.0000 1.0000 0.8701 1.0000
Prevalence 0.2875 0.2375 0.1625 0.3125
Detection Rate 0.2875 0.2375 0.0375 0.3125
Detection Prevalence 0.2875 0.2375 0.0375 0.4375
Balanced Accuracy 1.0000 1.0000 0.6154 0.9091
# update unknown NEW with the class prediction
dat_sjrs_new$Class <- pred_sjrs_new$class
dat_sjrs_new_pred  <- cbind(dat_sjrs_new, pred_sjrs_new, classify_sjrs_new)
dat_sjrs$Class     <- dat_sjrs$Source
dat_sjrs$Label     <- "Known"
dat_sjrs_new$Label <- "NEW"
dat_sjrs_all       <- rbind(dat_sjrs, dat_sjrs_new)

# plot data by Source with clusters indicated
library(ggplot2)
p1 <- ggplot(dat_sjrs_all, aes(x = Ba138, y = Sr87, colour = Class))
p1 <- p1 + geom_point()  #size = 2)
p1 <- p1 + labs(title = "Known and new observations by source classification")
p1 <- p1 + facet_grid(Label ~ Source_org)
print(p1)
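A supplementary summary that may help with the discussion below (a sketch, not required): how strong are the posterior probabilities behind each NEW classification?

# maximum posterior probability per NEW fish, summarized by predicted class
data.frame(
    class    = pred_sjrs_new$class
  , post_max = apply(pred_sjrs_new$posterior, 1, max)
  ) %>%
  group_by(class) %>%
  summarize(
    n             = n()
  , mean_post_max = mean(post_max)
  )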
Discuss the main features of the NEW population.
Explain why you think the classification accuracy should be high or low in this case, and check your expectation against the observed accuracy, since we have the true labels for the NEW observations.