San Juan River Razorback Suckers
Peer mentor (Spring 2016) Adam Barkalow’s wife uses stable isotope ratios to analyze fish.


Razorback Suckers were collected in 2014 on the San Juan River. Elemental isotopic ratios from fin rays were analyzed for Ba (Barium, atomic number 56), Ca (Calcium, 20), Mg (Magnesium, 12), and Sr (Strontium, 38). Fin rays are non-lethally obtained and are used to detect natal origin, since the material in the rays is partly developed early in life.
The issue is that hatchery fish can get into the river and lose their tags. It is important for environmental resource managers to know whether untagged fish are wild or hatchery fish. There are five fish sources in the dataset.
5 Sources
- Hatchery
  - DEX = Dexter National Fish Hatchery
  - GJH = Ouray National Fish Hatchery, Grand Valley Unit
- Wild
  - NAP = NAPI ponds
  - SJR = San Juan River
- Unknown
  - UNK = untagged Razorback Suckers captured in the San Juan River; these could be from any of the above sources
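The data-preparation step is not shown in this excerpt. Below is a minimal sketch of what the later code assumes: the file name is hypothetical, and rescaling each isotope to [0, 1] is an assumption (it keeps any single element from dominating the distance calculations).

# Hypothetical data prep (assumed, not shown in the assignment):
# read the data, keep the four known sources, rescale isotopes to [0, 1]
library(dplyr)
dat_sjrs <-
  read.csv("sjrs_razorback_isotopes.csv") %>%           # hypothetical file name
  select(Source, Ba137, Ca43, Mg24, Sr86) %>%
  filter(Source %in% c("DEX", "GJH", "NAP", "SJR")) %>% # known sources only
  mutate(across(Ba137:Sr86, ~ (.x - min(.x)) / (max(.x) - min(.x))))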
Our goal is to use these observations with various clustering methods, and with indices for choosing the number of clusters, to try to find meaningful groups. We use only the fish from the four known sources so that at the end we can check whether our clusters are meaningful in terms of hatchery and wild fish. No cheating: do the clustering first and, once satisfied, reveal the answers. This will help emphasize how challenging clustering can be.
(2 p) Known fish scatterplot
Note that this plot can take a while to generate. You’re welcome to subset the data further for this plot if some of the variables are redundant; you could probably get away with 5 columns of data without any loss of interpretation. If you want to do this, replace dat_sjrs in the ggpairs() call with dat_sjrs %>% select(col1, col2, ...) and specify the columns to plot.
# Scatterplot matrix
library(ggplot2)
library(GGally)
library(dplyr)  # for %>% and select()
p <-
  ggpairs(
    dat_sjrs %>% select(Source, Ba137, Ca43, Mg24, Sr86)
  , mapping = ggplot2::aes(colour = Source, alpha = 0.5)
  , upper   = list(continuous = "density", combo = "box")
  , lower   = list(continuous = "points", combo = "dot")
  #, lower   = list(continuous = "cor")
  , title   = "Original data by source"
  )
print(p)
(The ggpairs() call emits repeated warnings, "stat_contour(): Zero contours were generated" and min()/max() warnings about no non-missing arguments; these come from the density panels and are harmless, so the repeats are omitted here.)
Describe the relationships between isotopes of the same element (same atomic number) and between different elements.
Source populations may or may not be different; describe the source differences you see.
(3 p) Clustering
You will probably spend most of your time in this part of the problem.
Below I’ve set up code so you can select a clustering method (clus_method_num) and an index for determining the number of clusters (clus_index_num). Each of these numbers selects from the respective list of methods and indices.
Your goal is to find a clustering method that creates a dendrogram with a relatively small number of clusters (2-8) that are different between clusters (long branches at a high level of the tree) but similar within clusters (short branches at a low level of the tree). Try a few indices to see whether any help you cut the tree at a number of clusters that you agree with upon visual inspection. Note that the i_clus <- number line at the bottom is for you to manually enter a number of clusters you prefer, if you differ with the index’s recommendation.
(Look below the plots for the question to answer.)
# Keep only the numeric isotope columns and convert to a matrix for NbClust
dat_sjrs_num <-
  dat_sjrs %>%
  select(
    -Source
  ) %>%
  as.matrix()
# These two numeric switches select the cluster method and the index
# used to choose the number of clusters.
clus_method_num <- 1  ## change this number to make selection
clus_method <-
  c("ward.D", "ward.D2", "single", "complete", "average"
  , "mcquitty", "median", "centroid")[clus_method_num]
clus_index_num <- 1   ## change this number to make selection
## (multiple are possible but some take a very long time to run)
clus_index <-
  c("kl"      , "ch"       , "hartigan"  , "ccc"       , "scott"
  , "marriot" , "trcovw"   , "tracew"    , "friedman"  , "rubin"
  , "cindex"  , "db"       , "silhouette", "duda"      , "pseudot2"
  , "beale"   , "ratkowsky", "ball"      , "ptbiserial", "gap"
  , "frey"    , "mcclain"  , "gamma"     , "gplus"     , "tau"
  , "dunn"    , "hubert"   , "sdindex"   , "dindex"    , "sdbw")[clus_index_num]
# estimate the number of clusters
library(NbClust)
NC_out <- NbClust(dat_sjrs_num, method = clus_method, index = clus_index)
# best number of clusters
NC_out$Best.nc
Number_clusters     Value_Index
         2.0000         10.0695
# number of clusters to identify with red boxes and ellipses
i_clus <- NC_out$Best.nc[1] #3
# manual number of clusters
#i_clus <- 1
# Number of clusters chosen
i_clus
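Since you’re encouraged to try several indices, the sketch below (not part of the assignment code) loops over a few of the faster ones and collects each recommended number of clusters for the currently selected method; ind_try is a name introduced here.

# Optional: compare recommended cluster counts across a few fast indices
ind_try <- c("kl", "ch", "hartigan", "silhouette", "db")
sapply(
  ind_try
, function(ix) NbClust(dat_sjrs_num, method = clus_method, index = ix)$Best.nc[1]
)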
Visualization of clustering hierarchy and resulting cluster membership.
#par(mfrow = c(1, 2))
# create distance matrix between points
dat_sjrs_dist <-
  dist(
    dat_sjrs[, -1]  # only use numeric columns, not labels
  )
# create dendrogram
dat_sjrs_hc_complete <-
  hclust(
    dat_sjrs_dist
  , method = clus_method
  )
# create a column with group membership
dat_sjrs <-
  dat_sjrs %>%
  mutate(
    cut_comp = factor(cutree(dat_sjrs_hc_complete, k = i_clus))
  )
plot(
  dat_sjrs_hc_complete
, hang   = -1
, main   = paste(clus_method, "using", i_clus, "clusters")
, labels = dat_sjrs$Source
)
rect.hclust(
  dat_sjrs_hc_complete
, k = i_clus
)
# create PCA scores plot with ellipses
library(cluster)
clusplot(
  dat_sjrs %>% select(Ba137, Ca43, Mg24, Sr86)  # numeric columns only
                                                # (not Source or cut_comp)
, cutree(dat_sjrs_hc_complete, k = i_clus)
, color   = TRUE
, labels  = 2
, lines   = 0
, cex     = 2
, cex.txt = 0.25
, col.txt = "gray20"
, col.p   = dat_sjrs$cut_comp
, main    = paste(clus_method, "using", i_clus, "clusters")
, sub     = NULL
)
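As an optional numeric companion to the dendrogram and the PCA plot, you could compute the average silhouette width for the chosen cut (values near 1 indicate well-separated clusters, values near 0 indicate overlap). This is a sketch, not part of the assignment code; it reuses the distance matrix and the cluster package loaded above, and si_comp is a name introduced here.

# Optional: silhouette summary of the chosen cut
si_comp <- silhouette(cutree(dat_sjrs_hc_complete, k = i_clus), dat_sjrs_dist)
summary(si_comp)$avg.width  # average silhouette width across all fish
plot(si_comp, main = paste("Silhouette:", clus_method, "with", i_clus, "clusters"))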
Discuss which method you are choosing and why. Did an index help you decide the number of clusters better than your human visual inspection did?
(2 p) Cluster quality compared to known sources
STOP! Only start this part after you’re satisfied with your clustering above.
In the scatterplots below, two isotopes were selected that show differences in the scatterplot matrix above (choose others if you prefer). The top plot shows how each source population was clustered. The bottom plot shows the cluster composition of the source populations.
I have set eval=FALSE so that the result is not revealed until you’ve completed the clustering exercise above. When you’re totally satisfied with your clusters, return here, remove the eval=FALSE option in the code chunk, and see how you did!
# eval=FALSE so this chunk doesn't evaluate until clustering complete above
# plot data by Source with clusters indicated
library(ggplot2)
p1 <- ggplot(dat_sjrs, aes(x = Ba137, y = Sr86, colour = cut_comp))
p1 <- p1 + geom_point(size = 2)
p1 <- p1 + labs(title = "By source with clusters indicated")
p1 <- p1 + facet_wrap( ~ Source, nrow = 1)
#print(p1)
# plot data by cluster with Source indicated
library(ggplot2)
p2 <- ggplot(dat_sjrs, aes(x = Ba137, y = Sr86, colour = Source))
p2 <- p2 + geom_point(size = 2)
p2 <- p2 + labs(title = "By cluster with source indicated")
p2 <- p2 + facet_wrap( ~ cut_comp, nrow = 1)
#print(p2)
library(gridExtra)
grid.arrange(
  grobs = list(p1, p2)
, ncol  = 1
, top   = paste("Clustering method \"", clus_method, "\" using ", i_clus
              , " clusters determined by \"", clus_index, "\"", sep = "")
)
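A quick numeric companion to these plots (optional, not part of the assignment code) is a contingency table of source against cluster membership, which quantifies how cleanly the clusters separate hatchery from wild fish.

# Optional: cross-tabulate source by cluster membership
table(Source = dat_sjrs$Source, Cluster = dat_sjrs$cut_comp)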
Describe how the clustering performed. How reliable does clustering seem to you?
(1 p) Cluster differences using different methods
Given your experience trying different methods to develop the clusters (think about your evaluation of the dendrograms) and the different numbers of clusters recommended by the indices, make a few comments about how robust clustering is. That is, if the clustering results are very sensitive to the options you select, then the procedure is not very robust.