ADA2: Class 23, Chs 16+17, Discrimination for Classification

Advanced Data Analysis 2, Stat 428/528, Spring 2023, Prof. Erik Erhardt, UNM

Author

Your Name

Published

December 17, 2022

San Juan River Razorback Suckers

Peer mentor (Spring 2016) Adam Barkalow’s wife uses stable isotope ratios to analyze fish.

Images: San Juan River Basin, San Juan River Photo, Razorback Sucker, and Spiny Rays

Razorback Suckers were collected in 2014 on the San Juan River. Elemental isotopic ratios from finrays were analyzed for Ba (Barium 56), Ca (Calcium 20), Mg (Magnesium 12), and Sr (Strontium 38). Finrays are obtained non-lethally and are used to detect natal origin, since the material in the rays is partially developed early in life.

The issue is that hatchery fish can get into the river and lose their tags. It is important for environmental resource managers to know whether untagged fish are wild or hatchery fish. There are five fish sources in the dataset. The UNK fish are very strange, though, so we’re going to exclude them from our analysis for the purpose of this class and focus on a subset from the four Hatchery and Wild sources.

4 known sources, 1 mix of Hatchery and Wild sources

Hatchery
  DEX = Dexter National Fish Hatchery
  GJH = Ouray  National Fish Hatchery, Grand Valley Unit

Wild
  NAP = NAPI ponds
  SJR = San Juan River

Unknown (removed)
  UNK = untagged Razorback Suckers captured in the San Juan River
        these could be from any of the above sources

New (subset of Hatchery and Wild sources)
  NEW = Razorback Suckers

Our goal is to use the observations with linear or quadratic discriminant analysis to evaluate classification accuracy of the four known sources using the jackknife (leave-one-out cross-validation) and then to predict a set of observations whose source identity I have withheld.

NOTE: This assignment is slightly different from what is in the helper video. The helper video develops a classification model on all known fish to predict an unknown population; however, that unknown population is very different from the known fish and doesn’t demonstrate the classification method well for pedagogical purposes. Therefore, I made the following changes. I removed the unknown population and, instead, assigned 1 in every 10 observations to the NEW group, which we’ll use for prediction at the end of the assignment. Otherwise, the assignment is similar enough that I won’t record a new video for this version.

Clean and transform data

Looking at the scatterplot matrix below, clean and/or transform the data if you think it will be helpful. Note that measurement error can be an issue in complicated biological measurements. Furthermore, a transformation might help separate observations that are tightly grouped in space.
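As a toy illustration of why a log transform can help (below, the Ba isotopes are log10-transformed for exactly this reason), values bunched tightly near zero get spread apart on the log scale. This sketch uses made-up numbers, not the fish data:

```r
# Toy sketch (not the fish data): log10 spreads out values bunched near zero.
x <- c(0.001, 0.002, 0.004, 0.1, 0.5)
diff(range(x))   # raw range is dominated by the two large values
log10(x)         # on the log scale the three small values are well separated
```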

library(erikmisc)
── Attaching packages ─────────────────────────────────────── erikmisc 0.1.20 ──
✔ tibble 3.1.8     ✔ dplyr  1.1.0
── Conflicts ─────────────────────────────────────────── erikmisc_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
erikmisc, solving common complex data analysis workflows
  by Dr. Erik Barry Erhardt <erik@StatAcumen.com>
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ readr     2.1.4
✔ ggplot2   3.4.1     ✔ stringr   1.5.0
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# First, download the data to your computer,
#   save in the same folder as this Rmd file.

# read the data
dat_sjrs_full <-
  read_csv(
    "ADA2_CL_21_Clustering_SanJuanRazorbackSuckers_data2014.csv"
  )
Rows: 1512 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): Station, Source, Type
dbl (10): Sort Key, Ba137, Ba138, Ca43, Mg24, Mg25, Mg26, Sr86, Sr87, Sr88

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dim(dat_sjrs_full)
[1] 1512   13
# the last set of observations only have a few isotopes, so exclude
dat_sjrs <-
  dat_sjrs_full %>%
  na.omit()

dim(dat_sjrs)
[1] 1447   13
# no missing values
dat_sjrs %>%
  is.na() %>%
  sum()
[1] 0
#str(dat_sjrs)

dat_sjrs <-
  dat_sjrs %>%
  select(
    Source
  , Ba137:Sr88
  ) %>%
  filter(
    # Exclude unknown sources
    Source != "UNK"
  ) %>%
  mutate(
    Source_org = factor(Source) # original source
    # every 10th observation, assign to "NEW"
  , Source = ifelse((1:n() %% 10) == 0, "NEW", Source)
  , Source = factor(Source)
    # transforming the Ba values separates the tight clustering on the boundary
  , Ba137 = log10(Ba137)
  , Ba138 = log10(Ba138)
  )
names(dat_sjrs)
 [1] "Source"     "Ba137"      "Ba138"      "Ca43"       "Mg24"      
 [6] "Mg25"       "Mg26"       "Sr86"       "Sr87"       "Sr88"      
[11] "Source_org"
# 1/10 of the sources have been assigned to "NEW"
dat_sjrs %>% pull(Source_org) %>% table()
.
DEX GJH NAP SJR 
224 199 133 244 
dat_sjrs %>% pull(Source    ) %>% table()
.
DEX GJH NAP NEW SJR 
201 180 120  80 219 
## NOTE HERE
# NEW group to predict
dat_sjrs_new <-
  dat_sjrs %>%
  filter(
    Source == "NEW"
  )
# Known groups
dat_sjrs <-
  dat_sjrs %>%
  filter(
    Source != "NEW"
  ) %>%
  filter(
    # There are a few unusual observations; remove them, assuming measurement error
    # (two very small Ca43 values)
    Ca43 > 0.5
  ) %>%
  mutate(
    Source = factor(Source)  # to remove unused levels
  )

# data sizes
dat_sjrs_new %>% dim()
[1] 80 11
dat_sjrs     %>% dim()
[1] 719  11

Known fish scatterplot

Note that this plot can take a while to generate. You’re welcome to subset the data further for this plot if some of the variables are redundant (highly correlated). You could probably get away with 5 columns of data without any loss of interpretation. If you want to do this, replace the dat_sjrs in the ggpairs() function with dat_sjrs %>% select(col1, col2, ...) and specify the columns to plot.
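If you want a data-driven way to pick which columns are redundant, you can flag pairs with very high correlation. This is a minimal sketch on simulated data; with the fish data you would pass dat_sjrs %>% select(Ba137:Sr88) instead:

```r
# Sketch: flag variable pairs with |r| > 0.95 as candidates for dropping.
# Simulated data here; substitute the isotope columns of dat_sjrs in practice.
set.seed(42)
x    <- rnorm(100)
dat  <- data.frame(a = x, b = x + rnorm(100, sd = 0.01), c = rnorm(100))
cmat <- cor(dat)
idx  <- which(abs(cmat) > 0.95 & upper.tri(cmat), arr.ind = TRUE)
data.frame(var1 = rownames(cmat)[idx[, 1]], var2 = colnames(cmat)[idx[, 2]])
```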

# Scatterplot matrix
library(ggplot2)
library(GGally)
Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2
p <-
  ggpairs(
    dat_sjrs
  , mapping = ggplot2::aes(colour = Source, alpha = 0.5)
  , upper = list(continuous = "density", combo = "box")
  , lower = list(continuous = "points", combo = "dot")
  #, lower = list(continuous = "cor")
  , title = "Original data by source"
  )
print(p)

Classification

Goal: Develop a classification model using discriminant analysis that will reliably classify new observations from the known populations. Then, apply the classification model to the NEW population in hopes of providing insight into the untagged razorbacks.

(2 p) LDA or QDA assumptions

Below are some plots that will be helpful.

Correlation matrices (with graphical representation).

  • State the assumptions of both the LDA and QDA discriminant methods.
  • Interpret the plots above to evaluate whether the assumptions are met.
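One formal check of LDA's equal-covariance assumption is Box's M test. This is a hedged sketch assuming the heplots package is installed, illustrated on the built-in iris data; with the fish data you would pass dat_sjrs %>% select(Ba137:Sr88) and dat_sjrs$Source:

```r
# Sketch: Box's M test of homogeneity of covariance matrices across groups.
# A small p-value suggests the group covariances differ, favoring QDA over LDA.
# Illustrated on iris; substitute the isotope columns and Source in practice.
bm <- heplots::boxM(iris[, 1:4], iris$Species)
bm
```

Keep in mind that Box's M is very sensitive to departures from multivariate normality, so treat a significant result as suggestive rather than decisive.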

Solution

[answer]

(3 p) Stepwise selection for classification

If multivariate normality is violated, state so, then continue anyway.
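A quick base-R graphical check within one group: if the data are multivariate normal, the squared Mahalanobis distances should follow a chi-square distribution with degrees of freedom equal to the number of variables. Sketched on the built-in iris data; substitute one Source group of the isotope columns:

```r
# Sketch: chi-square QQ plot of squared Mahalanobis distances within a group.
# Points close to the reference line are consistent with multivariate normality.
x  <- as.matrix(iris[iris$Species == "setosa", 1:4])
d2 <- mahalanobis(x, colMeans(x), cov(x))
qqplot(qchisq(ppoints(length(d2)), df = ncol(x)), d2
     , xlab = "Chi-square quantiles", ylab = "Squared Mahalanobis distance")
abline(a = 0, b = 1)
```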

Above, if the assumptions of the LDA are met, then use LDA; otherwise use QDA.

Below I provide the code for both backward selection (starting with full model) and forward selection (starting with empty model).

You’ll need to specify LDA or QDA throughout the code (note that find/replace of “lda” to “qda”, or vice versa, is the quickest and safest way to do this).

dat_sjrs_d <- dat_sjrs %>% select(Ba137:Sr88) # the data
dat_sjrs_c <- dat_sjrs %>% pull(Source)       # the classes

# start random number generator in same place for everyone
# and so that random partitions are the same each time code is run
set.seed(7)

#library(klaR)  # don't run this since it does library(MASS) and breaks select() from dplyr
# Backward
step_sjrs_b <-
  klaR::stepclass(
    dat_sjrs_d
  , dat_sjrs_c
  , method = "lda"  # or "qda"
  , improvement = 0.001 # stop criterion: improvement less than
  , direction = "backward"
  , start.vars = colnames(dat_sjrs_d)
  )
 `stepwise classification', using 10-fold cross-validated correctness rate of method lda'.
719 observations of 9 variables in 4 classes; direction: backward
stop criterion: improvement less than 0.1%.
correctness rate: 0.83163;  starting variables (9): Ba137, Ba138, Ca43, Mg24, Mg25, Mg26, Sr86, Sr87, Sr88 
correctness rate: 0.83449;  out: "Ba137";  variables (8): Ba138, Ca43, Mg24, Mg25, Mg26, Sr86, Sr87, Sr88 
correctness rate: 0.83588;  out: "Mg26";  variables (7): Ba138, Ca43, Mg24, Mg25, Sr86, Sr87, Sr88 

 hr.elapsed min.elapsed sec.elapsed 
       0.00        0.00        1.04 
## NOTE HERE
step_sjrs_b$formula
dat_sjrs_c ~ Ba138 + Ca43 + Mg24 + Mg25 + Sr86 + Sr87 + Sr88
<environment: 0x000002773564aae8>
# estimated correct/error rates
step_sjrs_b$result.pm
crossval.rate      apparent 
    0.8358764     0.1641168 
# Forward
step_sjrs_f <-
  klaR::stepclass(
    dat_sjrs_d
  , dat_sjrs_c
  , method = "lda"  # or "qda"
  , improvement = 0.001 # stop criterion: improvement less than
  , direction = "forward"
  , start.vars = ""
  )
 `stepwise classification', using 10-fold cross-validated correctness rate of method lda'.
719 observations of 9 variables in 4 classes; direction: forward
stop criterion: improvement less than 0.1%.
Warning in cv.rate(vars = start.vars, data = data, grouping = grouping, :
error(s) in modeling/prediction step
correctness rate: 0;  starting variables (0): ,  
correctness rate: 0.81498;  in: "Ba137";  variables (1): Ba137 
correctness rate: 0.82336;  in: "Sr87";  variables (2): Ba137, Sr87 
correctness rate: 0.83032;  in: "Mg24";  variables (3): Ba137, Sr87, Mg24 
correctness rate: 0.83449;  in: "Mg25";  variables (4): Ba137, Sr87, Mg24, Mg25 

 hr.elapsed min.elapsed sec.elapsed 
       0.00        0.00        0.92 
## NOTE HERE
step_sjrs_f$formula
dat_sjrs_c ~ Ba137 + Mg24 + Mg25 + Sr87
<environment: 0x000002772bcbe220>
# estimated correct/error rates
step_sjrs_f$result.pm
crossval.rate      apparent 
    0.8344875     0.1655076 
op <- par(no.readonly = TRUE) # save the full list of settable par's
  # two panels side-by-side with no margins
  par(mfrow = c(1,2), mar = 0*rep(1, 4)) # order is c(bottom, left, top, right)
  plot(step_sjrs_f, ylim = c(0, 1), main = "empty model, forward")
  plot(step_sjrs_b, ylim = c(0, 1), main = "full model, backward")

par(op) # reset plotting options


## NOTE HERE
# set the formula you're using here, then it will be used throughout the rest
sjrs_formula <- step_sjrs_b

# Select and print the final model
#library(MASS)  # don't run library(MASS) because it breaks select() from dplyr
lda_sjrs_final <-
  MASS::lda(
    grouping = dat_sjrs_c
  , x = dat_sjrs_d %>% select(sjrs_formula$model$name)
  )
lda_sjrs_final
Call:
lda(dat_sjrs_d %>% select(sjrs_formula$model$name), grouping = dat_sjrs_c)

Prior probabilities of groups:
      DEX       GJH       NAP       SJR 
0.2795549 0.2503477 0.1668985 0.3031989 

Group means:
        Ba138      Ca43     Mg24      Mg25      Sr86      Sr87     Sr88
DEX -2.590916 0.6670463 2.137403 0.2777136 0.2716435 0.1929946 2.278532
GJH -2.004211 0.6634414 2.206508 0.2859888 0.1572110 0.1121768 1.314173
NAP -1.372585 0.6611007 2.344829 0.3037372 0.1558019 0.1110095 1.306017
SJR -1.348119 0.6611577 2.342173 0.3044489 0.1523088 0.1083400 1.271622

Coefficients of linear discriminants:
             LD1         LD2          LD3
Ba138  6.7477461   3.7374990   0.02830449
Ca43   0.9847371 -11.7501119   4.57703582
Mg24   0.1282485  -0.8420474 -23.84026016
Mg25  -4.7604035  16.2559298 179.00417542
Sr86  -5.2409948  29.8532503 121.02044737
Sr87  -6.9509635   4.6828063  18.08004175
Sr88  -3.3596631   0.9599753 -16.04215963

Proportion of trace:
   LD1    LD2    LD3 
0.9618 0.0380 0.0003 

Discuss the differences between the backward and forward selection results.

Specify the final model you’ll use for classification.

Indicate the expected classification accuracy of the selected final model.

Solution

[answer]

(2 p) Classification accuracy

Let’s look more closely at classification accuracy by evaluating the confusion matrix for classification, the table of how many observations from each population were classified into which populations. Numbers along the diagonal are correctly classified, and off-diagonals are errors.
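The accuracy arithmetic from such a table: overall accuracy is the sum of the diagonal divided by the table total, and per-class sensitivity is each diagonal entry divided by its column (true-class) total. Here that arithmetic is reproduced on the jackknife confusion matrix reported below by caret:

```r
# Sketch: accuracy and per-class sensitivity from a confusion matrix with
# rows = predicted and columns = true class (values copied from the
# jackknife table produced below by caret::confusionMatrix()).
cm <- matrix(
    c(200,   0,   0,   0
    ,   1, 180,   4,   2
    ,   0,   0,  11,  13
    ,   0,   0, 105, 203)
  , nrow = 4, byrow = TRUE
  , dimnames = list(pred = c("DEX", "GJH", "NAP", "SJR")
                  , true = c("DEX", "GJH", "NAP", "SJR"))
  )
sum(diag(cm)) / sum(cm)   # overall accuracy: 594/719, about 0.826
diag(cm) / colSums(cm)    # per-class sensitivity (e.g., NAP: 11/120)
```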

(Make sure you use the correct lda() or qda() function as selected above.)

# CV = TRUE does jackknife (leave-one-out) crossvalidation
#library(MASS)  # don't run library(MASS) because it breaks select() from dplyr
lda_sjrs_cv <-
  MASS::lda(
    grouping = dat_sjrs_c
  , x = dat_sjrs_d %>% select(sjrs_formula$model$name)
  , CV = TRUE
  )
#lda_sjrs_cv

# Create a table of classification and posterior probabilities for each observation
classify_sjrs <-
  data.frame(
    Source = dat_sjrs$Source
  , class = lda_sjrs_cv$class
  , error = ""
  , round(lda_sjrs_cv$posterior, 3)
  )
colnames(classify_sjrs) <-
  c(
    "Source"
  , "class"
  , "error"
  , paste("post", colnames(lda_sjrs_cv$posterior), sep="_")
  )

# error column
classify_sjrs$error <-
  as.character(classify_sjrs$error)
classify_agree <-
  as.character(as.numeric(dat_sjrs$Source) - as.numeric(lda_sjrs_cv$class))
classify_sjrs$error[!(classify_agree == 0)] <-
  classify_agree[!(classify_agree == 0)]
# print table
#classify_sjrs

# A list of classification statistics
library(caret)
Loading required package: lattice

Attaching package: 'caret'
The following object is masked from 'package:purrr':

    lift
confusionMatrix(
    data      = lda_sjrs_cv$class # predictions
  , reference = dat_sjrs$Source   # true labels
  , mode      = "sens_spec"       # restrict output to relevant summaries
)
Confusion Matrix and Statistics

          Reference
Prediction DEX GJH NAP SJR
       DEX 200   0   0   0
       GJH   1 180   4   2
       NAP   0   0  11  13
       SJR   0   0 105 203

Overall Statistics
                                          
               Accuracy : 0.8261          
                 95% CI : (0.7964, 0.8532)
    No Information Rate : 0.3032          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.7591          
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: DEX Class: GJH Class: NAP Class: SJR
Sensitivity              0.9950     1.0000    0.09167     0.9312
Specificity              1.0000     0.9870    0.97830     0.7904
Pos Pred Value           1.0000     0.9626    0.45833     0.6591
Neg Pred Value           0.9981     1.0000    0.84317     0.9635
Prevalence               0.2796     0.2503    0.16690     0.3032
Detection Rate           0.2782     0.2503    0.01530     0.2823
Detection Prevalence     0.2782     0.2601    0.03338     0.4284
Balanced Accuracy        0.9975     0.9935    0.53498     0.8608

Determine whether some populations are better classified than others, or whether each population seems to have roughly the same error rate.

Solution

[answer]

(0 p) View Classification

klaR::partimat(
    grouping    = dat_sjrs_c
  , x           = dat_sjrs_d %>% select(sjrs_formula$model$name)
  , plot.matrix = FALSE
  , method      = "lda"           # or "qda"
  , main        = "LDA partition" # or "QDA partition"
  )

(3 p) Classify NEW observations

Now we’ll use the final model selected above to predict observations from the NEW population. These are untagged Razorback Suckers captured in the San Juan River which could be from any of the other sources.

# new observations to classify
summary(dat_sjrs_new$Source)
DEX GJH NAP NEW SJR 
  0   0   0  80   0 
# predict the NEW data from the training data LDFs
pred_sjrs_new <-
  predict(
    lda_sjrs_final
  , newdata = dat_sjrs_new %>% select(sjrs_formula$model$name)
  )

# Create a table of classification and posterior probabilities for each observation
classify_sjrs_new <-
  data.frame(
    Source = dat_sjrs_new$Source_org
  , class = pred_sjrs_new$class
  , error = ""
  , round(pred_sjrs_new$posterior,3)
  )
colnames(classify_sjrs_new) <-
  c(
    "Source"
  , "class"
  , "error"
  , paste("post", colnames(pred_sjrs_new$posterior), sep="_")
  )

# error column
classify_sjrs_new$error <-
  as.character(classify_sjrs_new$error)
classify_agree_new <-
  as.character(as.numeric(dat_sjrs_new$Source_org) - as.numeric(pred_sjrs_new$class))
classify_sjrs_new$error[!(classify_agree_new == 0)] <-
  classify_agree_new[!(classify_agree_new == 0)]
# print table
#classify_sjrs_new

# A list of classification statistics
library(caret)
confusionMatrix(
    data      = pred_sjrs_new$class # predictions
  , reference = dat_sjrs_new$Source_org   # true labels
  , mode      = "sens_spec"       # restrict output to relevant summaries
)
Confusion Matrix and Statistics

          Reference
Prediction DEX GJH NAP SJR
       DEX  23   0   0   0
       GJH   0  19   0   0
       NAP   0   0   3   0
       SJR   0   0  10  25

Overall Statistics
                                          
               Accuracy : 0.875           
                 95% CI : (0.7821, 0.9384)
    No Information Rate : 0.3125          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.8259          
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: DEX Class: GJH Class: NAP Class: SJR
Sensitivity              1.0000     1.0000     0.2308     1.0000
Specificity              1.0000     1.0000     1.0000     0.8182
Pos Pred Value           1.0000     1.0000     1.0000     0.7143
Neg Pred Value           1.0000     1.0000     0.8701     1.0000
Prevalence               0.2875     0.2375     0.1625     0.3125
Detection Rate           0.2875     0.2375     0.0375     0.3125
Detection Prevalence     0.2875     0.2375     0.0375     0.4375
Balanced Accuracy        1.0000     1.0000     0.6154     0.9091
# update unknown NEW with the class prediction
dat_sjrs_new$Class <- pred_sjrs_new$class

dat_sjrs_new_pred <- cbind(dat_sjrs_new, pred_sjrs_new, classify_sjrs_new)
dat_sjrs$Class <- dat_sjrs$Source
dat_sjrs    $Label <- "Known"
dat_sjrs_new$Label <- "NEW"
dat_sjrs_all <- rbind(dat_sjrs, dat_sjrs_new)

# plot data by Source with clusters indicated
library(ggplot2)
p1 <- ggplot(dat_sjrs_all, aes(x = Ba138, y = Sr87, colour = Class))
p1 <- p1 + geom_point()#size = 2)
p1 <- p1 + labs(title = "Known and new observations by source classification")
p1 <- p1 + facet_grid(Label ~ Source_org)
print(p1)

Discuss the main features of the NEW population.

Explain why you think the classification accuracy is expected to be high or low in this case (and check your expectation against the known accuracy, since we have the true labels for the NEW observations).

Solution

[answer]