ADA2: Class 22, Ch 15, Multivariate Analysis of Variance (MANOVA)

Advanced Data Analysis 2, Stat 428/528, Spring 2023, Prof. Erik Erhardt, UNM

Author

Your Name

Published

December 17, 2022

San Juan River Razorback Suckers

Peer mentor (Spring 2016) Adam Barkalow’s wife uses stable isotope ratios to analyze fish.

Razorback Suckers were collected in 2014 on the San Juan River. Elemental isotopic ratios from finrays were analyzed for Ba (Barium 56), Ca (Calcium 20), Mg (Magnesium 12), and Sr (Strontium 38). Finrays are non-lethally obtained and are used to detect natal origin since the material in the reys are partially developed early in life.

The issue is that hatchery fish can get into the river and lose their tags. It is important for environmental resource managers to know whether untagged fish are wild or hatchery fish. There are five fish sources in the dataset.

5 Sources

Hatchery
  DEX = Dexter National Fish Hatchery
  GJH = Ouray  National Fish Hatchery, Grand Valley Unit

Wild
  NAP = NAPI ponds
  SJR = San Juan River

Unknown
  UNK = untagged Razorback Suckers captured in the San Juan River
        these could be from any of the above sources

Our goal is to test whether the known source populations have different multivariate means, and if so, which pairs of populations differ.

(1 p) Clean and transform data

Looking at the scatterplot matrix below, clean and/or transform the data if you think it will be helpful. Note that measurement error can be an issue in complicated biological measurements. Furthermore, a transformation might help separate observations that are tightly grouped in space.

Please download the data into your working directory to save downloads from my website each time you knit this document.

library(erikmisc)
── Attaching packages ─────────────────────────────────────── erikmisc 0.1.20 ──
✔ tibble 3.1.8     ✔ dplyr  1.1.0
── Conflicts ─────────────────────────────────────────── erikmisc_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
erikmisc, solving common complex data analysis workflows
  by Dr. Erik Barry Erhardt <erik@StatAcumen.com>
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ readr     2.1.4
✔ ggplot2   3.4.1     ✔ stringr   1.5.0
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# First, download the data to your computer,
#   save in the same folder as this Rmd file.

# read the data
dat_sjrs_full <-
  read_csv(
    "ADA2_CL_21_Clustering_SanJuanRazorbackSuckers_data2014.csv"
  )
Rows: 1512 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): Station, Source, Type
dbl (10): Sort Key, Ba137, Ba138, Ca43, Mg24, Mg25, Mg26, Sr86, Sr87, Sr88

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dim(dat_sjrs_full)
[1] 1512   13
# the last set of observations only have a few isotopes, so exclude
dat_sjrs <-
  dat_sjrs_full %>%
  na.omit()

dim(dat_sjrs)
[1] 1447   13
# no missing values
dat_sjrs %>%
  is.na() %>%
  sum()
[1] 0
#str(dat_sjrs)

dat_sjrs <-
  dat_sjrs %>%
  select(
    Source
  , Ba137:Sr88
  ) %>%
  filter(
    # Exclude unknown sources
    Source != "UNK"
  )
names(dat_sjrs)
 [1] "Source" "Ba137"  "Ba138"  "Ca43"   "Mg24"   "Mg25"   "Mg26"   "Sr86"  
 [9] "Sr87"   "Sr88"  
### SOLUTION
# if necessary, some data modifications here informed from your analysis

Add code above.

Solution

[answer]

(1 p) Known fish scatterplot

Note that this plot can take a while to generate. You’re welcome to subset the data further for this plot if some of the variables are redundant. You could probably get away with 5 columns of data without any loss of interpretation. If you want to do this, replace the dat_sjrs in the ggpairs() function with subset(dat_sjrs, select = c(col1, col2, ...)) and specify the columns to plot. However, do the analysis using all the columns of data.

# Scatterplot matrix
library(ggplot2)
library(GGally)
Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2
p <- ggpairs(dat_sjrs
            , mapping = ggplot2::aes(colour = Source, alpha = 0.5)
            , upper = list(continuous = "density", combo = "box")
            , lower = list(continuous = "points", combo = "dot")
            #, lower = list(continuous = "cor")
            , title = "Original data by source"
            )
print(p)
Warning: `stat_contour()`: Zero contours were generated
Warning in min(x): no non-missing arguments to min; returning Inf
Warning in max(x): no non-missing arguments to max; returning -Inf
Warning: `stat_contour()`: Zero contours were generated
Warning in min(x): no non-missing arguments to min; returning Inf
Warning in max(x): no non-missing arguments to max; returning -Inf
Warning: `stat_contour()`: Zero contours were generated
Warning in min(x): no non-missing arguments to min; returning Inf
Warning in max(x): no non-missing arguments to max; returning -Inf
Warning: `stat_contour()`: Zero contours were generated
Warning in min(x): no non-missing arguments to min; returning Inf
Warning in max(x): no non-missing arguments to max; returning -Inf
Warning: `stat_contour()`: Zero contours were generated
Warning in min(x): no non-missing arguments to min; returning Inf
Warning in max(x): no non-missing arguments to max; returning -Inf
Warning: `stat_contour()`: Zero contours were generated
Warning in min(x): no non-missing arguments to min; returning Inf
Warning in max(x): no non-missing arguments to max; returning -Inf
Warning: `stat_contour()`: Zero contours were generated
Warning in min(x): no non-missing arguments to min; returning Inf
Warning in max(x): no non-missing arguments to max; returning -Inf
Warning: `stat_contour()`: Zero contours were generated
Warning in min(x): no non-missing arguments to min; returning Inf
Warning in max(x): no non-missing arguments to max; returning -Inf
Warning: `stat_contour()`: Zero contours were generated
Warning in min(x): no non-missing arguments to min; returning Inf
Warning in max(x): no non-missing arguments to max; returning -Inf
Warning: `stat_contour()`: Zero contours were generated
Warning in min(x): no non-missing arguments to min; returning Inf
Warning in max(x): no non-missing arguments to max; returning -Inf
Warning: `stat_contour()`: Zero contours were generated
Warning in min(x): no non-missing arguments to min; returning Inf
Warning in max(x): no non-missing arguments to max; returning -Inf
Warning: `stat_contour()`: Zero contours were generated
Warning in min(x): no non-missing arguments to min; returning Inf
Warning in max(x): no non-missing arguments to max; returning -Inf
Warning: `stat_contour()`: Zero contours were generated
Warning in min(x): no non-missing arguments to min; returning Inf
Warning in max(x): no non-missing arguments to max; returning -Inf
Warning: `stat_contour()`: Zero contours were generated
Warning in min(x): no non-missing arguments to min; returning Inf
Warning in max(x): no non-missing arguments to max; returning -Inf
Warning: `stat_contour()`: Zero contours were generated
Warning in min(x): no non-missing arguments to min; returning Inf
Warning in max(x): no non-missing arguments to max; returning -Inf
Warning: `stat_contour()`: Zero contours were generated
Warning in min(x): no non-missing arguments to min; returning Inf
Warning in max(x): no non-missing arguments to max; returning -Inf
Warning: `stat_contour()`: Zero contours were generated
Warning in min(x): no non-missing arguments to min; returning Inf
Warning in max(x): no non-missing arguments to max; returning -Inf
Warning: `stat_contour()`: Zero contours were generated
Warning in min(x): no non-missing arguments to min; returning Inf
Warning in max(x): no non-missing arguments to max; returning -Inf
Warning: `stat_contour()`: Zero contours were generated
Warning in min(x): no non-missing arguments to min; returning Inf
Warning in max(x): no non-missing arguments to max; returning -Inf
Warning: `stat_contour()`: Zero contours were generated
Warning in min(x): no non-missing arguments to min; returning Inf
Warning in max(x): no non-missing arguments to max; returning -Inf
Warning: `stat_contour()`: Zero contours were generated
Warning in min(x): no non-missing arguments to min; returning Inf
Warning in max(x): no non-missing arguments to max; returning -Inf
Warning: `stat_contour()`: Zero contours were generated
Warning in min(x): no non-missing arguments to min; returning Inf
Warning in max(x): no non-missing arguments to max; returning -Inf
Warning: `stat_contour()`: Zero contours were generated
Warning in min(x): no non-missing arguments to min; returning Inf
Warning in max(x): no non-missing arguments to max; returning -Inf
Warning: `stat_contour()`: Zero contours were generated
Warning in min(x): no non-missing arguments to min; returning Inf
Warning in max(x): no non-missing arguments to max; returning -Inf
Warning: `stat_contour()`: Zero contours were generated
Warning in min(x): no non-missing arguments to min; returning Inf
Warning in max(x): no non-missing arguments to max; returning -Inf
Warning: `stat_contour()`: Zero contours were generated
Warning in min(x): no non-missing arguments to min; returning Inf
Warning in max(x): no non-missing arguments to max; returning -Inf
Warning: `stat_contour()`: Zero contours were generated
Warning in min(x): no non-missing arguments to min; returning Inf
Warning in max(x): no non-missing arguments to max; returning -Inf
Warning: `stat_contour()`: Zero contours were generated
Warning in min(x): no non-missing arguments to min; returning Inf
Warning in max(x): no non-missing arguments to max; returning -Inf
Warning: `stat_contour()`: Zero contours were generated
Warning in min(x): no non-missing arguments to min; returning Inf
Warning in max(x): no non-missing arguments to max; returning -Inf

Describe the relationships between isotopes of the same element (same atomic number) and between different elements.

Source populations may or may not be different, describe the source differences you see.

Solution

[answer]

(2 p) MANOVA Assumptions

Below are Shapiro-Wilk test for multivariate normality, as well as QQ-plots comparing the Mahalanobis D2 distance to a chi-squared distribution. Keep in mind that we have large sample sizes, so the numeric tests may be too sensitive. We do expect some deviations for the extreme D2 values, but they should not be too systematic. If there are gross violations of normality, try a transformation(s) in a subset of the original variables that improve this.

# Test multivariate normality using the Shapiro-Wilk test for multivariate normality
library(mvnormtest)
# The data needs to be transposed t() so each variable is a row
#   with observations as columns.
mshapiro.test(dat_sjrs %>% filter(Source == "DEX") %>% select(Ba137:Sr88) %>% t())

    Shapiro-Wilk normality test

data:  Z
W = 0.65605, p-value < 2.2e-16
mshapiro.test(dat_sjrs %>% filter(Source == "GJH") %>% select(Ba137:Sr88) %>% t())

    Shapiro-Wilk normality test

data:  Z
W = 0.96454, p-value = 6.614e-05
mshapiro.test(dat_sjrs %>% filter(Source == "NAP") %>% select(Ba137:Sr88) %>% t())

    Shapiro-Wilk normality test

data:  Z
W = 0.84929, p-value = 2.469e-10
mshapiro.test(dat_sjrs %>% filter(Source == "SJR") %>% select(Ba137:Sr88) %>% t())

    Shapiro-Wilk normality test

data:  Z
W = 0.52774, p-value < 2.2e-16
par(mfrow=c(2,2))
e_plot_mnv_norm_qqplot(dat_sjrs %>% filter(Source == "DEX") %>% select(Ba137:Sr88), name = "DEX")
e_plot_mnv_norm_qqplot(dat_sjrs %>% filter(Source == "GJH") %>% select(Ba137:Sr88), name = "GJH")
e_plot_mnv_norm_qqplot(dat_sjrs %>% filter(Source == "NAP") %>% select(Ba137:Sr88), name = "NAP")
e_plot_mnv_norm_qqplot(dat_sjrs %>% filter(Source == "SJR") %>% select(Ba137:Sr88), name = "SJR")

Summarize the normality results after trying to address any deviations.

Solution

[answer]

(3 p) MANOVA

Below is the result of the MANOVA test.

State the null and alternative hypotheses in words and notation.

Select the robust test statistic and interpret the result in the context of the dataset.

# Multivariate MANOVA test
library(car)
Loading required package: carData

Attaching package: 'car'
The following object is masked from 'package:purrr':

    some
The following object is masked from 'package:dplyr':

    recode
lm_man <- lm(cbind(Ba137, Ba138, Ca43, Mg24, Mg25, Mg26, Sr86, Sr87, Sr88) ~ Source, data = dat_sjrs)
man_dat_sjrs <- Manova(lm_man)
man_dat_sjrs

Type II MANOVA Tests: Pillai test statistic
       Df test stat approx F num Df den Df    Pr(>F)    
Source  3    1.5054   88.409     27   2370 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Solution

[answer]

(2 p) Multiple comparisons

Below I wrote nested loops to compare all pairs of Sources.

The output gives a title for which comparisons are being made with the test result.

# Multivariate MANOVA test

# NOTE: I specify pairs of sources by identifying the sorted unique list of Sources
Source_list <- dat_sjrs %>% pull(Source) %>% unique() %>% sort()
Source_list
[1] "DEX" "GJH" "NAP" "SJR"
# then I pull out pairs by index, like this
Source_list[c(1,2)]
[1] "DEX" "GJH"
# I think this is much easier than typing all names for the pairs of Sources

# for each pair of Sources
for (i_1 in 1:(length(Source_list) - 1)) {
  for (i_2 in (i_1 + 1):length(Source_list)) {
    # print a header to indicate which comparisons are being made
    cat("\n\n")
    cat(paste("***** Comparison between", i_1, Source_list[i_1], "and", i_2, Source_list[i_2]))

    # perform pairwise comparison
    library(car)
    man_pair <- Manova(lm(cbind(Ba137, Ba138, Ca43, Mg24, Mg25, Mg26, Sr86, Sr87, Sr88) ~ Source
               , data = dat_sjrs %>% filter(Source %in% Source_list[c(i_1, i_2)])
           ))
    # print result
    print(man_pair)
  }
}


***** Comparison between 1 DEX and 2 GJH
Type II MANOVA Tests: Pillai test statistic
       Df test stat approx F num Df den Df    Pr(>F)    
Source  1   0.94497   787.97      9    413 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


***** Comparison between 1 DEX and 3 NAP
Type II MANOVA Tests: Pillai test statistic
       Df test stat approx F num Df den Df    Pr(>F)    
Source  1    0.9115   397.12      9    347 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


***** Comparison between 1 DEX and 4 SJR
Type II MANOVA Tests: Pillai test statistic
       Df test stat approx F num Df den Df    Pr(>F)    
Source  1   0.94844   936.02      9    458 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


***** Comparison between 2 GJH and 3 NAP
Type II MANOVA Tests: Pillai test statistic
       Df test stat approx F num Df den Df    Pr(>F)    
Source  1   0.68956    79.47      9    322 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


***** Comparison between 2 GJH and 4 SJR
Type II MANOVA Tests: Pillai test statistic
       Df test stat approx F num Df den Df    Pr(>F)    
Source  1   0.80356   196.81      9    433 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


***** Comparison between 3 NAP and 4 SJR
Type II MANOVA Tests: Pillai test statistic
       Df test stat approx F num Df den Df Pr(>F)
Source  1   0.02983   1.2538      9    367  0.261

Specify the Bonferroni-corrected \(\alpha\) level for pairwise tests.

Indicate which pairs differ at this \(\alpha\) level.

Solution

[answer]

(1 p) Canonical Discriminant functions (visualize differences)

The canonical discriminant analysis will indicate the directions that provide the greatest ability to distinguish between the groups.

# perform canonical discriminant analysis
library(candisc)
Loading required package: heplots
Loading required package: broom

Attaching package: 'candisc'
The following object is masked from 'package:stats':

    cancor
can_dat_sjrs <- candisc(lm_man)

## Scatterplot matrix
library(ggplot2)
#suppressMessages(suppressWarnings(library(GGally)))
library(GGally)
p <- ggpairs(can_dat_sjrs$scores
            , mapping = ggplot2::aes(colour = Source, alpha = 0.5)
            , upper = list(continuous = "density", combo = "box")
            , lower = list(continuous = "points", combo = "dot")
            #, lower = list(continuous = "cor")
            , title = "Canonical discriminant variables by source"
            )
print(p)
Warning: `stat_contour()`: Zero contours were generated
Warning in min(x): no non-missing arguments to min; returning Inf
Warning in max(x): no non-missing arguments to max; returning -Inf
Warning: `stat_contour()`: Zero contours were generated
Warning in min(x): no non-missing arguments to min; returning Inf
Warning in max(x): no non-missing arguments to max; returning -Inf

How do the MANOVA differences found above reveal themselves in the projections provided by the discriminant variables?

Which discriminant variables are useful for detecting which differences?

Solution

[answer]