1 San Juan River Razorback Suckers

Peer mentor (Spring 2016) Adam Barkalow’s wife uses stable isotope ratios to analyze fish.

Images: San Juan River Basin, San Juan River Photo, Razorback Sucker, and Spiny Rays

Razorback Suckers were collected in 2014 on the San Juan River. Elemental isotopic ratios from finrays were analyzed for Ba (barium, atomic number 56), Ca (calcium, 20), Mg (magnesium, 12), and Sr (strontium, 38). Finrays are obtained non-lethally and are used to detect natal origin because the material in the rays is partially developed early in life.

The issue is that hatchery fish can get into the river and lose their tags. It is important for environmental resource managers to know whether untagged fish are wild or hatchery fish. There are five fish sources in the dataset. The UNK fish are very strange, though, so we’re going to exclude them from our analysis for the purpose of this class and focus on a subset from the four Hatchery and Wild sources.

4 known sources, 1 mix of Hatchery and Wild sources

Hatchery
  DEX = Dexter National Fish Hatchery
  GJH = Ouray  National Fish Hatchery, Grand Valley Unit

Wild
  NAP = NAPI ponds
  SJR = San Juan River

Unknown (removed)
  UNK = untagged Razorback Suckers captured in the San Juan River
        these could be from any of the above sources

New (subset of Hatchery and Wild sources)
  NEW = Razorback Suckers

Our goal is to use the observations with linear or quadratic discriminant analysis to evaluate classification accuracy of the four known sources using the jackknife (leave-one-out cross-validation), and then to predict the source of a set of observations whose source identity I have withheld.
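The workflow described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the built-in `iris` data stands in for the fish data, the object names (`dat_known`, `dat_new`, `lda_cv`, and so on) are hypothetical, and `MASS::lda()` is one common implementation of linear discriminant analysis, not necessarily the one the course functions use.

```r
# Sketch only: iris is a stand-in for the fish data (assumption).
# Functions are called as MASS::lda() so that attaching MASS does not
# mask dplyr::select().

# Withhold every 10th row, mirroring the idea of a "NEW" group
idx_new   <- seq_len(nrow(iris)) %% 10 == 0
dat_known <- iris[!idx_new, ]
dat_new   <- iris[ idx_new, ]

# CV = TRUE returns leave-one-out (jackknife) class predictions
lda_cv <- MASS::lda(Species ~ ., data = dat_known, CV = TRUE)

# Confusion table and jackknife accuracy for the known groups
tab_cv <- table(truth = dat_known$Species, predicted = lda_cv$class)
acc_cv <- sum(diag(tab_cv)) / sum(tab_cv)
print(tab_cv)
print(acc_cv)

# Refit without CV to get a model object that predict() can use
lda_fit  <- MASS::lda(Species ~ ., data = dat_known)
pred_new <- predict(lda_fit, newdata = dat_new)$class

# In this sketch the withheld labels are available, so accuracy can be checked
acc_new <- mean(pred_new == dat_new$Species)
print(acc_new)
```

Swapping `MASS::lda()` for `MASS::qda()` gives the quadratic version with the same calls; which of the two suits the fish data better is left for you to judge.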

NOTE: This assignment is slightly different than what is in the helper video. The helper video develops a classification model on all known fish to predict an unknown population; however, that unknown population is very different from the known fish and doesn’t demonstrate the classification method well for pedagogical purposes. Therefore, I made the following changes. I removed the unknown population and, instead, I assign 1 in every 10 observations to the NEW group which we’ll use to predict at the end of the assignment. Otherwise, the assignment is similar enough that I won’t record a new video for this version.

1.1 Clean and transform data

Looking at the scatterplot matrix below, clean and/or transform the data if you think it will be helpful. Note that measurement error can be an issue in complicated biological measurements. Furthermore, a transformation might help separate observations that are tightly grouped in space.
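To see why a transformation can help, here is a minimal sketch on simulated values (hypothetical, not the fish data). Two right-skewed groups crowd together near zero on the original scale; the helper `separation()` is an illustrative name for the mean difference divided by a pooled standard deviation.

```r
# Hypothetical simulated values (not the fish data): two right-skewed
# groups that are crowded near zero on the original scale
set.seed(1)
x_a <- rlnorm(100, meanlog = -4, sdlog = 0.5)
x_b <- rlnorm(100, meanlog = -2, sdlog = 0.5)

# Illustrative helper: between-group mean difference relative to pooled SD
separation <- function(a, b) {
  abs(mean(a) - mean(b)) / sqrt((var(a) + var(b)) / 2)
}

sep_raw <- separation(x_a, x_b)                # original scale
sep_log <- separation(log10(x_a), log10(x_b))  # log10 scale

print(c(raw = sep_raw, log10 = sep_log))
```

In this simulated case the groups are farther apart, relative to their spread, on the log10 scale. Whether the same holds for any of the isotope variables is something to check in the scatterplot matrix.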

library(tidyverse)

# load ada functions
source("ada_functions.R")

# First, download the data to your computer,
#   save in the same folder as this Rmd file.

# read the data
dat_sjrs_full <-
  read_csv(
    "ADA2_CL_23_Clustering_SanJuanRazorbackSuckers_data2014.csv"
  )

dim(dat_sjrs_full)
## [1] 1512   13
# the last set of observations only have a few isotopes, so exclude
dat_sjrs <-
  dat_sjrs_full %>%
  na.omit()

dim(dat_sjrs)
## [1] 1447   13
# no missing values
dat_sjrs %>%
  is.na() %>%
  sum()
## [1] 0
#str(dat_sjrs)

dat_sjrs <-
  dat_sjrs %>%
  select(
    Source
  , Ba137:Sr88
  ) %>%
  filter(
    # Exclude unknown sources
    Source != "UNK"
  ) %>%
  mutate(
    Source_org = factor(Source) # original source
    # every 10th observation, assign to "NEW"
  , Source = ifelse((1:n() %% 10) == 0, "NEW", Source)
  , Source = factor(Source)
    # transforming the Ba values separates the tight clustering on the boundary
  , Ba137 = log10(Ba137)
  , Ba138 = log10(Ba138)
  )
names(dat_sjrs)
##  [1] "Source"     "Ba137"      "Ba138"      "Ca43"       "Mg24"       "Mg25"       "Mg26"       "Sr86"       "Sr87"       "Sr88"       "Source_org"
# 1/10 of the observations have been assigned to "NEW"
dat_sjrs %>% pull(Source_org) %>% table()
## .
## DEX GJH NAP SJR
## 224 199 133 244
dat_sjrs %>% pull(Source    ) %>% table()
## .
## DEX GJH NAP NEW SJR
## 201 180 120  80 219
## NOTE HERE
# NEW group to predict
dat_sjrs_new <-
  dat_sjrs %>%
  filter(
    Source == "NEW"
  )
# Known groups
dat_sjrs <-
  dat_sjrs %>%
  filter(
    Source != "NEW"
  ) %>%
  filter(
    # There are a few unusual observations; remove them, assuming measurement error
    # remove two small Ca43 values
    Ca43 > 0.5
  ) %>%
  mutate(
    Source = factor(Source)  # to remove unused levels
  )

# data sizes
dat_sjrs_new %>% dim()
## [1] 80 11
dat_sjrs     %>% dim()
## [1] 719  11

1.2 Known fish scatterplot

Note that this plot can take a while to generate. You’re welcome to subset the data further for this plot if some of the variables are redundant (highly correlated). You could probably get away with 5 columns of data without any loss of interpretation. If you want to do this, replace the dat_sjrs in the ggpairs() function with dat_sjrs %>% select(col1, col2, ...) and specify the columns to plot.

# Scatterplot matrix
library(ggplot2)
library(GGally)
p <-
  ggpairs(
    dat_sjrs
  , mapping = ggplot2::aes(colour = Source, alpha = 0.5)
  , upper = list(continuous = "density", combo = "box")
  , lower = list(continuous = "points", combo = "dot")
  #, lower = list(continuous = "cor")
  , title = "Original data by source"
  )
print(p)
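If you do subset the columns for the scatterplot matrix, one possible way to choose them is to look at pairwise correlations first. The sketch below uses simulated data (hypothetical values; only the column names are borrowed from the dataset) in which the two Ba columns and the two Sr columns are built to be nearly redundant.

```r
library(dplyr)

# Hypothetical simulated stand-in (not the fish data); the Ba pair and the
# Sr pair each share an underlying signal, so they are highly correlated
set.seed(2)
n       <- 200
base_ba <- rnorm(n)
base_sr <- rnorm(n)
dat_toy <- data.frame(
    Source = factor(sample(c("DEX", "GJH", "NAP", "SJR"), n, replace = TRUE))
  , Ba137  = base_ba + rnorm(n, sd = 0.05)
  , Ba138  = base_ba + rnorm(n, sd = 0.05)
  , Ca43   = rnorm(n)
  , Sr86   = base_sr + rnorm(n, sd = 0.05)
  , Sr88   = base_sr + rnorm(n, sd = 0.05)
  )

# Pairwise correlations among the numeric columns
cor_mat <-
  dat_toy %>%
  select(-Source) %>%
  cor()
print(round(cor_mat, 2))

# Keep one column from each highly correlated pair for plotting
dat_plot <-
  dat_toy %>%
  select(Source, Ba138, Ca43, Sr88)
```

The reduced data frame (here `dat_plot`) would then be passed to `ggpairs()` in place of the full data. Which columns are actually redundant in the fish data has to be read from the real correlations, not from this toy example.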