Peer mentor (Spring 2016) Adam Barkalow’s wife uses stable isotope ratios to analyze fish.
Images: San Juan River Basin, San Juan River Photo, Razorback Sucker, and Spiney Rays
Razorback Suckers were collected in 2014 on the San Juan River. Elemental isotopic ratios from fin rays were analyzed for Ba (Barium 56), Ca (Calcium 20), Mg (Magnesium 12), and Sr (Strontium 38). Fin rays are obtained non-lethally and are used to detect natal origin, since the material in the rays is partially developed early in life.
The issue is that hatchery fish can get into the river and lose their tags. It is important for environmental resource managers to know whether untagged fish are wild or hatchery fish. There are five fish sources in the dataset. The UNK fish are very strange, though, so we’re going to exclude them from our analysis for the purpose of this class and focus on a subset from the four Hatchery and Wild sources.
4 known sources, 1 mix of Hatchery and Wild sources
Hatchery
  DEX = Dexter National Fish Hatchery
  GJH = Ouray National Fish Hatchery, Grand Valley Unit
Wild
  NAP = NAPI ponds
  SJR = San Juan River
Unknown (removed)
  UNK = untagged Razorback Suckers captured in the San Juan River
    (these could be from any of the above sources)
New (subset of Hatchery and Wild sources)
  NEW = Razorback Suckers
Our goal is to use the observations with linear or quadratic discriminant analysis to evaluate the classification accuracy of the four known sources using the jackknife (leave-one-out cross-validation), and then to predict a set of observations whose source identity I have withheld.
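As a rough illustration of that workflow, here is a minimal sketch that assumes the MASS package and uses the built-in iris data as a stand-in for the fish data (Species plays the role of Source). The object names dat_known, idx_new, lda_cv, and lda_fit are hypothetical and are not part of the assignment.

```r
library(MASS)  # provides lda() and qda()

# Stand-in data (assumption): iris replaces the known-source fish,
# and Species replaces Source.
dat_known <- iris

# CV = TRUE returns leave-one-out (jackknife) class predictions
# rather than a fitted model object.
lda_cv <- lda(Species ~ ., data = dat_known, CV = TRUE)

# Confusion matrix: rows are true classes, columns are jackknife predictions
tab_cv <- table(truth = dat_known$Species, predicted = lda_cv$class)
acc_cv <- sum(diag(tab_cv)) / sum(tab_cv)  # overall jackknife accuracy

# To predict withheld observations, refit without CV to get a model object.
idx_new  <- seq(10, nrow(dat_known), by = 10)  # every 10th row is withheld
lda_fit  <- lda(Species ~ ., data = dat_known[-idx_new, ])
pred_new <- predict(lda_fit, newdata = dat_known[idx_new, ])$class
```

Swapping lda() for qda() gives the quadratic version with the same arguments; which one is more appropriate depends on whether the groups appear to share a common covariance structure.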
NOTE: This assignment differs slightly from what is in the helper video. The helper video develops a classification model on all known fish to predict an unknown population; however, that unknown population is very different from the known fish and doesn’t demonstrate the classification method well for pedagogical purposes. Therefore, I made the following changes: I removed the unknown population and, instead, assigned 1 in every 10 observations to the NEW group, which we’ll predict at the end of the assignment. Otherwise, the assignment is similar enough that I won’t record a new video for this version.
Looking at the scatterplot matrix below, clean and/or transform the data if you think it will be helpful. Note that measurement error can be an issue in complicated biological measurements. Furthermore, a transformation might help separate observations that are tightly grouped in space.
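To see why a transformation can pull apart tightly grouped points, consider this small sketch. The values in x are made up for illustration and are not taken from the fish data.

```r
# Hypothetical measurements: three values bunched near zero plus two larger ones
x <- c(0.001, 0.002, 0.005, 0.5, 1)

# Share of the total range occupied by the three smallest values
share_raw   <- (x[3] - x[1]) / (x[5] - x[1])
share_log10 <- (log10(x[3]) - log10(x[1])) / (log10(x[5]) - log10(x[1]))

# On the raw scale the small values sit in well under 1% of the range;
# after log10 they spread over a much larger share of it.
c(raw = share_raw, log10 = share_log10)
```

The same idea applies to any strictly positive variable whose values pile up against a lower boundary.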
library(erikmisc)
library(tidyverse)
# First, download the data to your computer,
# save in the same folder as this Rmd file.
# read the data
dat_sjrs_full <-
  read_csv(
    "ADA2_CL_23_Clustering_SanJuanRazorbackSuckers_data2014.csv"
  )
dim(dat_sjrs_full)
[1] 1512 13
# the last set of observations only have a few isotopes, so exclude
dat_sjrs <-
  dat_sjrs_full %>%
  na.omit()
dim(dat_sjrs)
[1] 1447 13
# no missing values
dat_sjrs %>%
  is.na() %>%
  sum()
[1] 0
#str(dat_sjrs)
dat_sjrs <-
  dat_sjrs %>%
  select(
    Source
  , Ba137:Sr88
  ) %>%
  filter(
    # Exclude unknown sources
    Source != "UNK"
  ) %>%
  mutate(
    Source_org = factor(Source) # original source
    # every 10th observation, assign to "NEW"
  , Source = ifelse((1:n() %% 10) == 0, "NEW", Source)
  , Source = factor(Source)
    # transforming the Ba values separates the tight clustering on the boundary
  , Ba137 = log10(Ba137)
  , Ba138 = log10(Ba138)
  )
names(dat_sjrs)
[1] "Source" "Ba137" "Ba138" "Ca43" "Mg24" "Mg25" "Mg26" "Sr86" "Sr87" "Sr88"
[11] "Source_org"
# 1/10 of the sources have been assigned to "NEW"
dat_sjrs %>% pull(Source_org) %>% table()
.
DEX GJH NAP SJR
224 199 133 244
dat_sjrs %>% pull(Source) %>% table()
.
DEX GJH NAP NEW SJR
201 180 120 80 219
## NOTE HERE
# NEW group to predict
dat_sjrs_new <-
  dat_sjrs %>%
  filter(
    Source == "NEW"
  )
# Known groups
dat_sjrs <-
  dat_sjrs %>%
  filter(
    Source != "NEW"
  ) %>%
  filter(
    # There are a few unusual observations; remove those, assuming measurement errors
    # remove two small Ca43 values
    Ca43 > 0.5
  ) %>%
  mutate(
    Source = factor(Source) # to remove unused levels
  )
# data sizes
dat_sjrs_new %>% dim()
[1] 80 11
dat_sjrs %>% dim()
[1] 719 11
Note that this plot can take a while to generate. You’re welcome to subset the data further for this plot if some of the variables are redundant (highly correlated). You could probably get away with 5 columns of data without any loss of interpretation. If you want to do this, replace dat_sjrs in the ggpairs() function with dat_sjrs %>% select(col1, col2, ...) and specify the columns to plot.
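One way to decide which columns are redundant is to list the pairs of numeric variables with a high absolute correlation. The sketch below uses base R only, with the built-in iris data as a stand-in; the 0.9 cutoff and the names dat_num, cor_mat, and high_pairs are assumptions chosen for illustration.

```r
# Stand-in data (assumption): numeric columns of iris replace the isotope columns
dat_num <- iris[, 1:4]

# Pairwise correlation matrix of the numeric variables
cor_mat <- cor(dat_num)

# Locate pairs above the (arbitrary) 0.9 cutoff; upper.tri() skips the diagonal
# and avoids listing each pair twice.
idx <- which(abs(cor_mat) > 0.9 & upper.tri(cor_mat), arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx]
)
high_pairs
```

For each pair that shows up, keeping one of the two variables in the plot is usually enough to see the same pattern.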
# Scatterplot matrix
library(ggplot2)
library(GGally)
p <-
  ggpairs(
    dat_sjrs
  , mapping = ggplot2::aes(colour = Source, alpha = 0.5)
  , upper = list(continuous = "density", combo = "box")
  , lower = list(continuous = "points", combo = "dot")
  #, lower = list(continuous = "cor")
  , title = "Original data by source"
  )
print(p)