ADA2: Class 21, Ch 14, Clustering

Advanced Data Analysis 2, Stat 428/528, Spring 2023, Prof. Erik Erhardt, UNM

Author

Your Name

Published

December 17, 2022

San Juan River Razorback Suckers

Peer mentor (Spring 2016) Adam Barkalow’s wife uses stable isotope ratios to analyze fish.

Razorback Suckers were collected in 2014 on the San Juan River. Elemental isotopic ratios from finrays were analyzed for Ba (Barium 56), Ca (Calcium 20), Mg (Magnesium 12), and Sr (Strontium 38). Finrays are non-lethally obtained and are used to detect natal origin since the material in the reys are partially developed early in life.

The issue is that hatchery fish can get into the river and lose their tags. It is important for environmental resource managers to know whether untagged fish are wild or hatchery fish. There are five fish sources in the dataset.

5 Sources

Hatchery
  DEX = Dexter National Fish Hatchery
  GJH = Ouray  National Fish Hatchery, Grand Valley Unit

Wild
  NAP = NAPI ponds
  SJR = San Juan River

Unknown
  UNK = untagged Razorback Suckers captured in the San Juan River
        these could be from any of the above sources

Our goal is to use the observations with various clustering methods and indexes to determine the number of clusters to try to find meaningful groups. We are using the fish from the four known sources so we can check at the end whether our clusters generated meaningful clusters in terms of hatchery and wild fish. No cheating … try to do the clustering first, and once satisfied, then reveal the answers. This will help emphasize how challenging clustering can be.

(2 p) Clean and transform data

Looking at the scatterplot matrix below, clean and/or transform the data if you think it will be helpful. Note that measurement error can be an issue in complicated biological measurements. Furthermore, a transformation might help separate observations that are tightly grouped in space.

Please download the data into your working directory to save downloads from my website each time you knit this document.

library(erikmisc)
── Attaching packages ─────────────────────────────────────── erikmisc 0.1.20 ──
✔ tibble 3.1.8     ✔ dplyr  1.1.0
── Conflicts ─────────────────────────────────────────── erikmisc_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
erikmisc, solving common complex data analysis workflows
  by Dr. Erik Barry Erhardt <erik@StatAcumen.com>
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ readr     2.1.4
✔ ggplot2   3.4.1     ✔ stringr   1.5.0
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# First, download the data to your computer,
#   save in the same folder as this Rmd file.

# read the data
dat_sjrs_full <-
  read_csv(
    "ADA2_CL_21_Clustering_SanJuanRazorbackSuckers_data2014.csv"
  )
Rows: 1512 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): Station, Source, Type
dbl (10): Sort Key, Ba137, Ba138, Ca43, Mg24, Mg25, Mg26, Sr86, Sr87, Sr88

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dim(dat_sjrs_full)
[1] 1512   13
# the last set of observations only have a few isotopes, so exclude
dat_sjrs <-
  dat_sjrs_full %>%
  na.omit()

dim(dat_sjrs)
[1] 1447   13
# no missing values
dat_sjrs %>%
  is.na() %>%
  sum()
[1] 0
#str(dat_sjrs)

dat_sjrs <-
  dat_sjrs %>%
  select(
    Source, Ba137:Sr88
  ) %>%
  filter(
    # Exclude unknown sources
    Source != "UNK"
  )
names(dat_sjrs)
 [1] "Source" "Ba137"  "Ba138"  "Ca43"   "Mg24"   "Mg25"   "Mg26"   "Sr86"  
 [9] "Sr87"   "Sr88"  
## There are a few unual observations, remove those assuming measurement errors

### SOLUTION

Add code above.

Solution

[answer]

(2 p) Known fish scatterplot

Note that this plot can take a while to generate. You’re welcome to subset the data further for this plot if some of the variables are redundant. You could probably get away with 5 columns of data without any loss of interpretation. If you want to do this, replace the dat_sjrs in the ggpairs() function with dat_sjrs %>% select(col1, col2, ...) and specify the columns to plot.

# Scatterplot matrix
library(ggplot2)
library(GGally)
Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2
p <-
  ggpairs(
    dat_sjrs %>% select(Source, Ba137, Ca43, Mg24, Sr86)
  , mapping = ggplot2::aes(colour = Source, alpha = 0.5)
  , upper = list(continuous = "density", combo = "box")
  , lower = list(continuous = "points", combo = "dot")
  #, lower = list(continuous = "cor")
  , title = "Original data by source"
  )
print(p)
Warning: `stat_contour()`: Zero contours were generated
Warning in min(x): no non-missing arguments to min; returning Inf
Warning in max(x): no non-missing arguments to max; returning -Inf
Warning: `stat_contour()`: Zero contours were generated
Warning in min(x): no non-missing arguments to min; returning Inf
Warning in max(x): no non-missing arguments to max; returning -Inf
Warning: `stat_contour()`: Zero contours were generated
Warning in min(x): no non-missing arguments to min; returning Inf
Warning in max(x): no non-missing arguments to max; returning -Inf
Warning: `stat_contour()`: Zero contours were generated
Warning in min(x): no non-missing arguments to min; returning Inf
Warning in max(x): no non-missing arguments to max; returning -Inf
Warning: `stat_contour()`: Zero contours were generated
Warning in min(x): no non-missing arguments to min; returning Inf
Warning in max(x): no non-missing arguments to max; returning -Inf
Warning: `stat_contour()`: Zero contours were generated
Warning in min(x): no non-missing arguments to min; returning Inf
Warning in max(x): no non-missing arguments to max; returning -Inf

Describe the relationships between isotopes of the same element (same atomic number) and between different elements.

Source populations may or may not be different, describe the source differences you see.

Solution

[answer]

(3 p) Clustering

You will probably spend most of your time in this part of the problem.

Below I’ve set up code so you can select a cluster method (clus_method_num) and an index for determining the number of clusters (clus_index_num). Both of these numbers will select from the respective list of methods and indices.

Your goal is to find a clustering method that creates a dendogram that seems to have a relatively small number of clusters (2-8) that are different between clusters (long branches at a high level of the tree) but similar within clusters (short branches at a low level of the tree). Try a few indices to see whether any help you cut the tree at a number of clusters that you agree with upon visual inspection. Note that the i.clus <- number line at the bottom is for you to manually enter a number of clusters you prefer, if you differ with the index’s recommendation.

(Look below the plots for the question to answer.)

# Change integer data type to numeric
dat_sjrs_num <-
  dat_sjrs %>%
  select(
    -Source
  ) %>%
  as.matrix()

# these two numeric switches select the cluster method and index for number of clusters
clus_method_num <- 1   ## change this number to make selection

clus_method <- c("ward.D", "ward.D2", "single", "complete", "average"
               , "mcquitty", "median", "centroid")[clus_method_num]

clus_index_num <- 1    ## change this number to make selection
                       ## (multiple are possible but some take a very long time to run)
clus_index <- c("kl"       , "ch"        , "hartigan"   , "ccc"        , "scott"
               , "marriot" , "trcovw"    , "tracew"     , "friedman"   , "rubin"
               , "cindex"  , "db"        , "silhouette" , "duda"       , "pseudot2"
               , "beale"   , "ratkowsky" , "ball"       , "ptbiserial" , "gap"
               , "frey"    , "mcclain"   , "gamma"      , "gplus"      , "tau"
               , "dunn"    , "hubert"    , "sdindex"    , "dindex"     , "sdbw")[clus_index_num]

# estimate the number of clusters
library(NbClust)
NC_out <- NbClust(dat_sjrs_num, method = clus_method, index = clus_index)
# best number of clusters
NC_out$Best.nc
Number_clusters     Value_Index 
         2.0000         10.0695 
# number of clusters to identify with red boxes and ellipses
i_clus <- NC_out$Best.nc[1] #3
# manual number of clusters
#i_clus <- 1

# Number of clusters chosen
i_clus
Number_clusters 
              2 

Visualization of clustering hierarchy and resulting cluster membership.

#par(mfrow=c(1,2))
# create distance matrix between points
dat_sjrs_dist <-
  dist(
    dat_sjrs[,-1] # only use numeric columns, not labels
  )

# create dendrogram
dat_sjrs_hc_complete  <-
  hclust(
    dat_sjrs_dist
  , method = clus_method
  )
# create a column with group membership
dat_sjrs <-
  dat_sjrs %>%
  mutate(
    cut_comp = factor(cutree(dat_sjrs_hc_complete, k = i_clus))
  )

plot(
    dat_sjrs_hc_complete
  , hang = -1
  , main = paste(clus_method, "using ", i_clus, "clusters")
  , labels = dat_sjrs$Source
  )
rect.hclust(
    dat_sjrs_hc_complete
  , k = i_clus
  )

# create PCA scores plot with ellipses
library(cluster)
clusplot(
    dat_sjrs[,-1]
  , cutree(dat_sjrs_hc_complete, k = i_clus)
  , color   = TRUE
  , labels  = 2
  , lines   = 0
  , cex     = 2
  , cex.txt = 0.25
  , col.txt = "gray20"
  , col.p = dat_sjrs$cut_comp
  , main = paste(clus_method, "using ", i_clus, "clusters"), sub = NULL
  )

Discuss which method you are choosing and why. Did an index help you decide the number of clusters better than your human visual inspection did?

Solution

[answer]

(2 p) Cluster quality compared to known sources

STOP! Only start this part after you’re satisfied with your clustering above.

In the scatterplots below, two isotopes were selected that show differences in the scatterplot matrix above (choose others if you prefer). The top plot shows how each source population were clustered. The bottom plot shows cluster composition of source populations.

I have set eval=FALSE so that the result is not revealed until you’ve completed the clustering exercise above. When you’re totally satisfied with your clusters, return here and remove the eval=FALSE option in the code chunk and see how you did!

# eval=FALSE so this chunk doesn't evaluate until clustering complete above

# plot data by Source with clusters indicated
library(ggplot2)
p1 <- ggplot(dat_sjrs, aes(x = Ba137, y = Sr86, colour = cut_comp))
p1 <- p1 + geom_point(size = 2)
p1 <- p1 + labs(title = "By source with clusters indicated")
p1 <- p1 + facet_wrap( ~ Source, nrow = 1)
#print(p1)

# plot data by cluster with Source indicated
library(ggplot2)
p2 <- ggplot(dat_sjrs, aes(x = Ba137, y = Sr86, colour = Source))
p2 <- p2 + geom_point(size = 2)
p2 <- p2 + labs(title = "By cluster with source indicated")
p2 <- p2 + facet_wrap( ~ cut_comp, nrow = 1)
#print(p2)

library(gridExtra)
grid.arrange(grobs = list(p1, p2), ncol=1, top = paste("Clustering method \"", clus_method, "\" using ", i_clus, " clusters determined by \"", clus_index, "\"", sep=""))

Describe how the clustering performed. How reliable does clustering seem to you?

Solution

[answer]

(1 p) Cluster differences using different methods

Given your experience trying different methods to develop the clusters (think about your evaluation of the dendograms) and the different results of indexes to determine the number of clusters, make a few comments about how robust clustering is. That is, if clustering results are very sensitive to the options you select, then it’s not very robust.

Solution

[answer]