Peer mentor (Spring 2016) Adam Barkalow’s wife uses stable isotope ratios to analyze fish.
Razorback Suckers were collected in 2014 on the San Juan River. Elemental isotopic ratios from finrays were analyzed for Ba (Barium 56), Ca (Calcium 20), Mg (Magnesium 12), and Sr (Strontium 38). Finrays are obtained non-lethally and are used to detect natal origin, since the material in the rays is partially developed early in life.
The issue is that hatchery fish can get into the river and lose their tags. It is important for environmental resource managers to know whether untagged fish are wild or hatchery fish. There are five fish sources in the dataset. The UNK fish are very strange, though, so we’re going to exclude them from our analysis for the purpose of this class and focus on a subset from the four Hatchery and Wild sources.
4 known sources, 1 mix of Hatchery and Wild sources

Hatchery
  DEX = Dexter National Fish Hatchery
  GJH = Ouray National Fish Hatchery, Grand Valley Unit
Wild
  NAP = NAPI ponds
  SJR = San Juan River
Unknown (removed)
  UNK = untagged Razorback Suckers captured in the San Juan River; these could be from any of the above sources
New (subset of Hatchery and Wild sources)
  NEW = Razorback Suckers
Our goal is to use the observations with linear or quadratic discriminant analysis to evaluate the classification accuracy of the four known sources using the jackknife (leave-one-out cross-validation), and then to predict the source of a set of observations whose source identity I have withheld.
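The workflow described above can be sketched as follows. This is a minimal illustration rather than the assignment solution: it assumes the MASS package and uses the built-in iris data as a stand-in for the fish data, and the names dat_known, dat_new, lda_cv, and lda_fit are hypothetical.

```r
library(MASS)

# Stand-in data (assumption): iris replaces the fish data so the sketch runs on its own.
# Hold out every 10th row to play the role of the "NEW" group.
is_new    <- seq_len(nrow(iris)) %% 10 == 0
dat_known <- iris[!is_new, ]
dat_new   <- iris[ is_new, ]

# Jackknife (leave-one-out) classification: CV = TRUE returns the
# leave-one-out predicted class for each observation in dat_known.
lda_cv   <- lda(Species ~ ., data = dat_known, CV = TRUE)
conf_mat <- table(truth = dat_known$Species, predicted = lda_cv$class)
acc_cv   <- sum(diag(conf_mat)) / sum(conf_mat)

# Refit without CV to obtain a model object, then predict the held-out rows.
lda_fit  <- lda(Species ~ ., data = dat_known)
pred_new <- predict(lda_fit, newdata = dat_new)$class

conf_mat
acc_cv
table(pred_new)
```

MASS::qda() accepts the same formula, data, and CV arguments, so the quadratic version can be compared by swapping the function name.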
NOTE: This assignment is slightly different from what is in the helper video. The helper video develops a classification model on all known fish to predict an unknown population; however, that unknown population is very different from the known fish and doesn’t demonstrate the classification method well for pedagogical purposes. Therefore, I made the following changes. I removed the unknown population and, instead, I assign 1 in every 10 observations to the NEW group, which we’ll use for prediction at the end of the assignment. Otherwise, the assignment is similar enough that I won’t record a new video for this version.
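As a small illustration of the "1 in every 10" relabeling, here is a sketch on a hypothetical toy table (dat_toy and its two made-up sources "A" and "B" are assumptions, not part of the fish data). The original label is kept in Source_org so the predictions can be checked later.

```r
library(dplyr)

# Hypothetical toy data: 20 rows from two made-up sources
dat_toy <- tibble(
  Source = rep(c("A", "B"), each = 10)
)

dat_toy <-
  dat_toy %>%
  mutate(
    Source_org = Source   # keep the true label
    # relabel every 10th row as "NEW"
  , Source     = ifelse(row_number() %% 10 == 0, "NEW", Source)
  )

table(dat_toy$Source)
```

Rows 10 and 20 become NEW, while Source_org still records that they came from A and B.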
Looking at the scatterplot matrix below, clean and/or transform the data if you think it will be helpful. Note that measurement error can be an issue in complicated biological measurements. Furthermore, a transformation might help separate observations that are tightly grouped in space.
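To see why a transformation can pull apart tightly grouped points, consider this sketch with made-up values (the vector x is hypothetical, not fish data). Four of the six values sit close together near zero, and a log10 transform spreads them out.

```r
# Hypothetical measurements: the first four are crowded near zero
x     <- c(0.001, 0.002, 0.005, 0.01, 0.5, 1)
x_log <- log10(x)

# Spread of the crowded group before and after the transform
spread_raw <- diff(range(x[1:4]))
spread_log <- diff(range(x_log[1:4]))

spread_raw
spread_log
```

On the raw scale the crowded group spans less than 1% of the full range; on the log10 scale it spans a full unit out of three.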
library(tidyverse)

# load ada functions
source("ada_functions.R")

# First, download the data to your computer,
#   save in the same folder as this Rmd file.

# read the data
dat_sjrs_full <-
  read_csv(
    "ADA2_CL_23_Clustering_SanJuanRazorbackSuckers_data2014.csv"
  )

dim(dat_sjrs_full)
 1512 13
# the last set of observations only have a few isotopes, so exclude
dat_sjrs <-
  dat_sjrs_full %>%
  na.omit()

dim(dat_sjrs)
 1447 13
# no missing values
dat_sjrs %>%
  is.na() %>%
  sum()

#str(dat_sjrs)

dat_sjrs <-
  dat_sjrs %>%
  select(
    Source
  , Ba137:Sr88
  ) %>%
  filter(
    # Exclude unknown sources
    Source != "UNK"
  ) %>%
  mutate(
    Source_org = factor(Source) # original source
    # every 10th observation, assign to "NEW"
  , Source = ifelse((1:n() %% 10) == 0, "NEW", Source)
  , Source = factor(Source)
    # transforming the Ba values separates the tight clustering on the boundary
  , Ba137 = log10(Ba137)
  , Ba138 = log10(Ba138)
  )

names(dat_sjrs)
 "Source" "Ba137" "Ba138" "Ca43" "Mg24" "Mg25" "Mg26" "Sr86" "Sr87" "Sr88" "Source_org"
# 1/10 of the sources have been assigned to "NEW"
dat_sjrs %>% pull(Source_org) %>% table()

.
DEX GJH NAP SJR
224 199 133 244

dat_sjrs %>% pull(Source) %>% table()

.
DEX GJH NAP NEW SJR
201 180 120  80 219
## NOTE HERE
# NEW group to predict
dat_sjrs_new <-
  dat_sjrs %>%
  filter(
    Source == "NEW"
  )

# Known groups
dat_sjrs <-
  dat_sjrs %>%
  filter(
    Source != "NEW"
  ) %>%
  filter(
    # There are a few unusual observations, remove those assuming measurement errors
    # remove two small Ca43 values
    Ca43 > 0.5
  ) %>%
  mutate(
    Source = factor(Source)  # to remove unused levels
  )

# data sizes
dat_sjrs_new %>% dim()
 80 11
dat_sjrs %>% dim()
 719 11
Note that this plot can take a while to generate. You’re welcome to subset the data further for this plot if some of the variables are redundant (highly correlated). You could probably get away with 5 columns of data without any loss of interpretation. If you want to do this, replace the dat_sjrs in the ggpairs() function with dat_sjrs %>% select(col1, col2, ...) and specify the columns to plot.
# Scatterplot matrix
library(ggplot2)
library(GGally)
p <-
  ggpairs(
    dat_sjrs
  , mapping = ggplot2::aes(colour = Source, alpha = 0.5)
  , upper = list(continuous = "density", combo = "box")
  , lower = list(continuous = "points", combo = "dot")
  #, lower = list(continuous = "cor")
  , title = "Original data by source"
  )
print(p)