Later code chunks depend on computations I’d like you to do, so those have been commented out by moving the {R} inside the first line of the code chunk. Simply move the {R} back after the first triple-tick to let those compute.

1 San Juan River Razorback Suckers

Peer mentor (Spring 2016) Adam Barkalow’s wife uses stable isotope ratios to analyze fish.

Razorback Suckers were collected in 2014 on the San Juan River. Elemental isotopic ratios from finrays were analyzed for Ba (Barium 56), Ca (Calcium 20), Mg (Magnesium 12), and Sr (Strontium 38). Finrays are non-lethally obtained and are used to detect natal origin since the material in the reys are partially developed early in life.

The issue is that hatchery fish can get into the river and lose their tags. It is important for environmental resource managers to know whether untagged fish are wild or hatchery fish. There are five fish sources in the dataset.

4 known sources, 1 mix of unknown sources

Hatchery
  DEX = Dexter National Fish Hatchery
  GJH = Ouray  National Fish Hatchery, Grand Valley Unit

Wild
  NAP = NAPI ponds
  SJR = San Juan River

Unknown
  UNK = untagged Razorback Suckers captured in the San Juan River
        these could be from any of the above sources

Our goal is to use PCA to reduce the sets of isotopes of the same element to a single feature for each element. Then, using the binary response of “Pond” versus “River”, and we’ll fit a logistic regression and do model selection. Finally, we’ll assess the expected classification accuracy using logistic regression for classification, and predict some observations I included from the model building.

1.1 Clean and transform data

Looking at the scatterplot matrix below, clean and/or transform the data if you think it will be helpful. Note that measurement error can be an issue in complicated biological measurements. Furthermore, a transformation might help separate observations that are tightly grouped in space.

Please download the data into your working directory to save downloads from my website each time you knit this document.

sjrs.full <- read.csv("ADA2_HW_12_MANOVA_SanJuanRazorbackSuckers_data2014.csv")
dim(sjrs.full)
[1] 1512   13
# the last set of observations only have a few isotopes, so exclude
sjrs.full <- na.omit(sjrs.full)
dim(sjrs.full)
[1] 1447   13
# no missing values
sum(is.na(sjrs.full))
[1] 0
#str(sjrs)

sjrs <- sjrs.full
names(sjrs)
 [1] "Sort.Key" "Station"  "Source"   "Type"     "Ba137"    "Ba138"   
 [7] "Ca43"     "Mg24"     "Mg25"     "Mg26"     "Sr86"     "Sr87"    
[13] "Sr88"    
nrow(sjrs)
[1] 1447
# don't include UNK group
sjrs     <- subset(sjrs, subset = (Source != "UNK"), select = c(Source, Ba137:Sr88))

# Simplify Source to be "Pond" and "River"
sjrs$Source <- as.character(sjrs$Source)
sjrs     <- subset(sjrs, subset = (Source %in% c("NAP", "SJR")), select = c(Source, Ba137:Sr88))
sjrs$Source[(sjrs$Source %in% c("NAP"))] <- "Pond"
sjrs$Source[(sjrs$Source %in% c("SJR"))] <- "River"
sjrs$Source <- factor(sjrs$Source)

# transforming the Ba values separates the tight clustering on the boundary
sjrs$Ba137 <- log10(sjrs$Ba137)
sjrs$Ba138 <- log10(sjrs$Ba138)

## There are a few unual observations, remove those assuming measurement errors
# remove two small Ca43 values
sjrs <- subset(sjrs, Ca43 > 0.5)


## NOTE HERE
## Subset for classification later
# start random number generator in same place for everyone
# and so that random partitions are the same each time code is run
set.seed(3)
# sample a subset of observation indices to predict
pred.ind <- sort(sample.int(nrow(sjrs), size = 100))

# prediction subset
sjrs.pred <- sjrs[pred.ind,]
# remove observations to predict from data to develop the model
sjrs <- sjrs[-pred.ind,]

1.2 Known fish scatterplot

Note that this plot can take a while to generate. You’re welcome to subset the data further for this plot if some of the variables are redundant. You could probably get away with 5 columns of data without any loss of interpretation. If you want to do this, replace the sjrs in the ggpairs() function with subset(sjrs, select = c(col1, col2, ...)) and specify the columns to plot.

# Scatterplot matrix
library(ggplot2)
library(GGally)
p <- ggpairs(sjrs
            , mapping = ggplot2::aes(colour = Source, alpha = 0.5)
            , upper = list(continuous = "density", combo = "box")
            , lower = list(continuous = "points", combo = "dot")
            #, lower = list(continuous = "cor")
            , title = "Original data by source"
            )
print(p)