---
title: "ADA1: Class 20, One- and two-way tables"
author: Your Name
date: last-modified
description: |
  [Advanced Data Analysis 1](https://StatAcumen.com/teach/ada1),
  Stat 427/527, Fall 2023, Prof. Erik Erhardt, UNM
format:
  html:
    theme: litera
    highlight-style: atom-one
    page-layout: full         # article, full; https://quarto.org/docs/output-formats/page-layout.html
    toc: true
    toc-location: body        # body, left, right
    number-sections: false
    self-contained: false     # !!! this can cause a render error
    code-overflow: scroll     # scroll, wrap
    code-block-bg: true
    code-block-border-left: "#30B0E0"
    code-copy: false          # true, false, hover shows a copy button in top-right of code block
    fig-width: 6
    fig-height: 4
---
# Rubric
```{r load-packages, message=FALSE}
library(erikmisc)
library(tidyverse)
ggplot2::theme_set(ggplot2::theme_bw()) # set theme_bw for all plots
```
# World Series expected number of games
One form of [sportsball](https://www.dictionary.com/e/slang/sportsball/)
is the game of [Baseball](http://xkcd.com/1593/).
Its championship, played exclusively in North America,
is called the "[World Series](https://en.wikipedia.org/wiki/List_of_World_Series_champions#Winners)".
How well does the distribution of number of games played during the Baseball World Series
follow the expected distribution?
This is a multinomial goodness-of-fit test.
There are four categories for the number of games (4, 5, 6, and 7)
and for each World Series the number of games must fall into one of these categories.
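As a minimal sketch of the mechanics (with made-up counts, not the World Series data), `chisq.test()` performs a multinomial goodness-of-fit test when given observed counts `x` and hypothesized probabilities `p`:

```r
# hypothetical counts in the four categories (4, 5, 6, 7 games)
obs <- c(10, 25, 30, 35)
# hypothesized category probabilities (must sum to 1)
p0  <- c(0.15, 0.25, 0.30, 0.30)
gof <- chisq.test(x = obs, p = p0)
gof$expected   # E = n * p0 = 15 25 30 30
gof$statistic  # X-squared = sum((O - E)^2 / E) = 2.5 on 3 df
```

The same call is used below once we have the World Series observed counts and expected proportions.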
## Calculate observed and expected frequencies
We read in a table summarizing the World Series history, and calculate the total number of games played.
```{R}
# read data
dat_mlb_ws <-
  read_csv(
    "ADA1_CL_19_Data-WorldSeries.csv"
  , skip = 1
  ) %>%
  # 1. split Games at the hyphen into two columns
  separate(
    col  = Games
  , into = c("g1", "g2")
  , sep  = "-"
  , convert = TRUE
  ) %>%
  # 2. add number of games
  mutate(
    n_Games = g1 + g2
    # Note that the 1903, 1919, 1920, and 1921 World Series were in a best-of-nine
    #   format (won by the first team to win five games).
    #   When more than 7 games were played, I assigned those to 7 games.
  , n_Games = ifelse(n_Games > 7, 7, n_Games)
  ) %>%
  drop_na(
    n_Games
  )
head(dat_mlb_ws)
head(dat_mlb_ws)
```
### Observed
Here's the observed number of games over the history of the World Series.
```{R}
#| fig-width: 8
#| fig-height: 1.5
library(ggplot2)
p <- ggplot(dat_mlb_ws, aes(x = Year, y = n_Games, group = 1))
p <- p + geom_point()
#p <- p + geom_line()
print(p)
```
```{R}
# observed proportions
dat_games <-
  dat_mlb_ws %>%
  group_by(
    n_Games
  ) %>%
  summarize(
    obs_games = n()
  , .groups = "drop_last"
  ) %>%
  mutate(
    obs_prop = round(obs_games / sum(obs_games), 3)
  , n_Games  = n_Games %>% factor()
  ) %>%
  ungroup()
dat_games
```
### Expected
Under an assumption of the probability of a particular team winning a single game,
we can calculate the probability of 4, 5, 6, or 7 games,
and therefore the expected number of World Series ending with each number of games.
__Below, set your own probability of a particular team winning each game.__
On average, one team is a little better than the other,
and this number should reflect the slight advantage of the team with superior ability.
For example $p=0.6$ means that the superior team has a per-game probability of winning of 0.6
and the inferior team has a probability of winning of 0.4.
```{R}
# probability of the superior team winning each game
p_win <- 0.6 ## Change this number to your own value
# Expected proportion for each number of games
dat_games$exp_prop <- NA
# 4 wins to 0 losses, through 4 wins to 3 losses
# in each sum, the first term is the probability the superior team wins the series
#   in that many games; the second term is for the inferior team
dat_games$exp_prop[1] <- dnbinom(0, 4, p_win) + dnbinom(0, 4, 1 - p_win)
dat_games$exp_prop[2] <- dnbinom(1, 4, p_win) + dnbinom(1, 4, 1 - p_win)
dat_games$exp_prop[3] <- dnbinom(2, 4, p_win) + dnbinom(2, 4, 1 - p_win)
dat_games$exp_prop[4] <- dnbinom(3, 4, p_win) + dnbinom(3, 4, 1 - p_win)
# checking that the proportions sum to 1
sum(dat_games$exp_prop)
# Expected number of games
dat_games <-
  dat_games %>%
  mutate(
    exp_games = sum(obs_games) * exp_prop
  )
dat_games
```
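Why the negative binomial applies: the series ends at game $4+k$ with the superior team winning exactly when that team wins the final game and $k$ of the first $3+k$ games were losses, which has probability $\binom{3+k}{k} p^4 (1-p)^k$, i.e., `dnbinom(k, 4, p)`. A quick self-contained check (with an assumed $p = 0.6$ for illustration):

```r
p <- 0.6  # assumed per-game win probability for this check
for (k in 0:3) {
  closed_form <- choose(3 + k, k) * p^4 * (1 - p)^k
  stopifnot(isTRUE(all.equal(dnbinom(k, size = 4, prob = p), closed_form)))
}
```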
## Goodness-of-fit test
That's a lot of work to set up the observed and expected frequencies,
but now we can perform the goodness-of-fit test.
### Perform $\chi^2$ test.
### (1 p) 1. Set up the __null and alternative hypotheses__ in words and notation.
* In words: "xxx"
* In notation: $H_0: xxx$ versus $H_A: xxx$
### (0 p) 2. Let the __significance level__ of the test be $\alpha=0.05$.
```{R}
# calculate chi-square goodness-of-fit test
x_summary <-
  chisq.test(
    x = dat_games$obs_games
  , correct = FALSE
  , p = dat_games$exp_prop
  )
# print result of test
x_summary
# use names(x_summary) to find the statistic and p.value objects to print in text below
# replace 000 below with the correct object name
```
### (0.25 p) 3. Compute the __test statistic__.
The test statistic is $X^2 = `r signif(000, 3)`$.
### (0.25 p) 4. Compute the __$p$-value__.
The p-value $= `r signif(000, 3)`$.
### (1 p) 5. State the __conclusion__ in terms of the problem.
* Reject $H_0$ in favor of $H_A$ if $p\textrm{-value} < \alpha$.
* Fail to reject $H_0$ if $p\textrm{-value} \ge \alpha$.
(Note: We DO NOT _accept_ $H_0$.)
### (0.5 p) 6. __Check assumptions__ of the test (see notes).
### Chi-sq statistic helps us understand the deviations
```{R}
# use output in x_summary and create table
x_table <-
  data.frame(
    n_Games = dat_games$n_Games
  , obs     = x_summary$observed
  , exp     = x_summary$expected
  , res     = x_summary$residuals
  , chisq   = x_summary$residuals^2
  , stdres  = x_summary$stdres
  )
x_table
```
Plot the observed vs expected frequencies, and each group's contribution to the chi-square statistic, to help identify the `n_Games` groups that deviate the most.
The term "Contribution to Chi-Square" (`chisq`) refers to the values of
$\frac{(O-E)^2}{E}$ for each category.
$\chi^2_{\textrm{s}}$ is the sum of those contributions.
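As a self-contained check on this relationship (with made-up counts, not the table above), the contributions $(O-E)^2/E$ are the squared Pearson residuals and sum to the $X^2$ statistic:

```r
obs <- c(12, 18, 30, 40)          # hypothetical observed counts
p0  <- c(0.10, 0.20, 0.30, 0.40)  # hypothetical expected proportions
fit <- chisq.test(x = obs, p = p0)
E       <- sum(obs) * p0          # expected counts
contrib <- (obs - E)^2 / E        # contribution to chi-square per category
stopifnot(isTRUE(all.equal(contrib, unname(fit$residuals)^2)))
stopifnot(isTRUE(all.equal(sum(contrib), unname(fit$statistic))))
```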
```{R}
#| fig-width: 8
#| fig-height: 4
x_table_long <-
  x_table %>%
  pivot_longer(
    cols      = c(obs, exp)
  , names_to  = "stat"
  , values_to = "value"
  )
x_table_long

# Observed vs Expected counts
library(ggplot2)
p1 <- ggplot(x_table_long, aes(x = n_Games, fill = stat, weight = value))
p1 <- p1 + geom_bar(position = "dodge")
p1 <- p1 + labs(title = "Observed and Expected frequencies")
p1 <- p1 + xlab("Number of games")
#print(p1)

p2 <- ggplot(x_table, aes(x = reorder(n_Games, -chisq), weight = chisq))
p2 <- p2 + geom_bar()
p2 <- p2 + labs(title = "Contribution to Chi-sq statistic")
p2 <- p2 + xlab("Sorted number of games")
p2 <- p2 + ylab("Chi-sq")
#print(p2)

p_arranged <-
  cowplot::plot_grid(
    plotlist = list(p1, p2)  # list of plots
  , nrow     = 1             # number of rows for grid of plots
  , ncol     = NULL          # number of columns, left unspecified
  , labels   = "AUTO"        # A and B panel labels
  )
print(p_arranged)
```
## Multiple Comparisons
The goodness-of-fit test suggests that at least one of the observed `n_Games` category
proportions differs from its expected proportion.
A reasonable next step in the analysis would be to __separately__ test
the four hypotheses:
$H_0: p_{4} = `r dat_games$exp_prop[1] %>% signif(3)`$,
$H_0: p_{5} = `r dat_games$exp_prop[2] %>% signif(3)`$,
$H_0: p_{6} = `r dat_games$exp_prop[3] %>% signif(3)`$, and
$H_0: p_{7} = `r dat_games$exp_prop[4] %>% signif(3)`$ to see which `n_Games` categories led to this conclusion.
A Bonferroni comparison with a Family Error Rate $\alpha=0.05$
sets each test's significance level to $\alpha = 0.05 / 4 = `r signif(0.05/4, 3)`$,
which corresponds to `r signif(100 * (1 - 0.05/4), 4)`% two-sided CIs
for the individual proportions.
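Comparing each raw p-value to $\alpha/k$ is equivalent to comparing Bonferroni-adjusted p-values (`p.adjust()`) to the family $\alpha$; a small sketch with hypothetical p-values:

```r
alpha_family <- 0.05
k            <- 4
alpha_each   <- alpha_family / k  # 0.0125 per-test significance level
conf_each    <- 1 - alpha_each    # 0.9875, so 98.75% CIs
# hypothetical raw p-values from four tests
p_raw <- c(0.004, 0.020, 0.200, 0.950)
p_adj <- p.adjust(p_raw, method = "bonferroni")  # pmin(1, k * p_raw)
# the two decision rules agree
stopifnot(all((p_raw < alpha_each) == (p_adj < alpha_family)))
```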
Below I perform exact binomial tests of proportion for each of the four `n_Games` categories
at the Bonferroni-adjusted significance level.
I save the p-values and confidence intervals along with the
observed and expected proportions to display in the table below.
```{R}
b_sum1 <- binom.test(dat_games$obs_games[1], sum(dat_games$obs_games), p = dat_games$exp_prop[1]
                   , alternative = "two.sided", conf.level = 1 - 0.05/4)
b_sum2 <- binom.test(dat_games$obs_games[2], sum(dat_games$obs_games), p = dat_games$exp_prop[2]
                   , alternative = "two.sided", conf.level = 1 - 0.05/4)
b_sum3 <- binom.test(dat_games$obs_games[3], sum(dat_games$obs_games), p = dat_games$exp_prop[3]
                   , alternative = "two.sided", conf.level = 1 - 0.05/4)
b_sum4 <- binom.test(dat_games$obs_games[4], sum(dat_games$obs_games), p = dat_games$exp_prop[4]
                   , alternative = "two.sided", conf.level = 1 - 0.05/4)
# put the p-value and CI into a data.frame
b_sum <-
  data.frame(
    rbind(
      c(b_sum1$p.value, b_sum1$conf.int)
    , c(b_sum2$p.value, b_sum2$conf.int)
    , c(b_sum3$p.value, b_sum3$conf.int)
    , c(b_sum4$p.value, b_sum4$conf.int)
    )
  )
names(b_sum) <- c("p.value", "CI.lower", "CI.upper")
b_sum$n_Games  <- dat_games$n_Games
b_sum$obs_prop <- x_table$obs / sum(x_table$obs)
b_sum$exp_prop <- dat_games$exp_prop
b_sum <-
  b_sum %>%
  select(
    n_Games, obs_prop, exp_prop
  , p.value, CI.lower, CI.upper
  )
b_sum
```
### (1 p) State the conclusions of the four hypothesis tests above.
(Note that the p-value is your guide here, not the CI.)
```{R}
#| fig-width: 6
#| fig-height: 4
# Observed vs Expected proportions
library(ggplot2)
p <- ggplot(b_sum, aes(x = n_Games, y = obs_prop))
# observed
p <- p + geom_point(size = 4)
p <- p + geom_line(aes(group = 1))
p <- p + geom_errorbar(aes(ymin = CI.lower, ymax = CI.upper), width = 0.1)
# expected
p <- p + geom_point(aes(y = exp_prop), colour = "red", shape = "x", size = 6)
p <- p + geom_line(aes(y = exp_prop, group = 1), colour = "red")
p <- p + labs(caption = "Black = Observed (w/ 98.75% CI)\nRed = Expected proportions")
p <- p + xlab("Number of games")
p <- p + expand_limits(y = 0)
print(p)
```
---
# Popular kids
Subjects were students in grades 4-6 from three school districts in Ingham and
Clinton Counties, Michigan. Chase and Dummer stratified their sample, selecting
students from urban, suburban, and rural school districts with approximately
1/3 of their sample coming from each district. __Students indicated whether good
grades, athletic ability, or popularity was most important to them.__ They also
ranked four factors: grades, sports, looks, and money, in order of their
importance for popularity. The questionnaire also asked for gender, grade
level, and other demographic information.
__[Reference](http://lib.stat.cmu.edu/DASL/Datafiles/PopularKids.html):__
Chase, M. A., and Dummer, G. M. (1992), "The Role of Sports as a Social
Determinant for Children," _Research Quarterly for Exercise and Sport_, 63,
418-424.
```{R}
# read data
dat_kids <-
  read_delim(
    "ADA1_CL_20_Data-PopularKids.dat"
  , delim = " "
  , skip = 39
  ) %>%
  filter(
    Area %in% c("Urban", "Rural", "Suburban")
  ) %>%
  select(
    Goals
  , Grade
  , Area
  ) %>%
  mutate(
    Grade = factor(Grade)
  , Goals = factor(Goals)
  , Area  = factor(Area)
  )
summary(dat_kids)
```
We will investigate the association between two pairs of categorical variables:
1. Goals (grades, popular, sports) vs Grade level (4, 5, 6) and
2. Goals (grades, popular, sports) vs School area (rural, suburban, urban).
---
## Hypothesis test for Homogeneity of Proportions (categorical association)
You'll perform this hypothesis test for both scenarios below.
1. Set up the __null and alternative hypotheses__ in words and notation.
* In words: "There is an association between [row variable] and [column variable]."
(Note that the statement in words is in terms of the alternative hypothesis.)
* In notation: $H_0: p(i \textrm{ and } j) = p(i)p(j)$ versus $H_A: p(i \textrm{ and } j) \ne p(i)p(j)$, for all row categories $i$ and column categories $j$.
This is another way of saying that the probability of row category $i$ conditional on column category $j$,
$p(i|j) = p(i)$, does not depend on which column category $j$ we consider.
2. Choose the __significance level__ of the test, such as $\alpha=0.05$.
3. Compute the __test statistic__, such as $\chi^2$.
4. Compute the __$p$-value__ from the test statistic
with degrees of freedom $df = (R - 1)(C - 1)$,
where $R$ and $C$ are the number of rows and columns.
5. State the __conclusion__ in terms of the problem.
* Reject $H_0$ in favor of $H_A$ if $p\textrm{-value} < \alpha$.
* Fail to reject $H_0$ if $p\textrm{-value} \ge \alpha$.
(Note: We DO NOT _accept_ $H_0$.)
6. __Check assumptions__ of the test
(expected count for each cell is at least 5,
or at least 1 provided the expected counts are not too variable).
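Under $H_0$, the expected count in cell $(i, j)$ is $E_{ij} = (\textrm{row } i \textrm{ total})(\textrm{column } j \textrm{ total}) / n$, which is what `chisq.test()` stores in `$expected`; a self-contained sketch with made-up counts:

```r
# hypothetical 2 x 3 table of counts
tab <- matrix(c(20, 30, 25,
                15, 35, 25), nrow = 2, byrow = TRUE)
n   <- sum(tab)
E   <- outer(rowSums(tab), colSums(tab)) / n  # E_ij = row_i * col_j / n
fit <- chisq.test(tab, correct = FALSE)
stopifnot(isTRUE(all.equal(E, fit$expected, check.attributes = FALSE)))
all(E >= 5)  # assumption check: every expected count at least 5
```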
---
## Goals (grades, popular, sports) vs Grade level (4, 5, 6)
We start by summarizing the frequencies and column proportions.
The column proportions are interpreted as the proportion of students
with each goal within (conditional on) each Grade.
If these column proportions are roughly the same for each Grade,
then that's an indication of no association.
```{R}
# Tabulate by two categorical variables:
tab_GoalsGrade <-
  xtabs(
    ~ Goals + Grade
  , data = dat_kids
  )
tab_GoalsGrade
# column proportions
prop.table(
  tab_GoalsGrade
, margin = 2
) %>%
  signif(2)
```
### (1 p) 1. Set up the __null and alternative hypotheses__ in words and notation.
* In words: "There is an association between [row variable] and [column variable]."
* In notation: $H_0: p(i \textrm{ and } j) = p(i)p(j)$ versus $H_A: p(i \textrm{ and } j) \ne p(i)p(j)$, for all row categories $i$ and column categories $j$.
### (0 p) 2. Choose the __significance level__ of the test, such as $\alpha=0.05$.
### (0 p) 3. Compute the __test statistic__, such as $X^2$.
```{R}
chisq_gg <-
  chisq.test(
    tab_GoalsGrade
  , correct = FALSE
  )
chisq_gg
names(chisq_gg) # for the objects to report
```
The test statistic is $X^2 = `r chisq_gg$statistic %>% signif(3)`$.
### (0 p) 4. The p-value $= `r chisq_gg$p.value %>% signif(3)`$.
### (0.5 p) 5. State the __conclusion__ in terms of the problem.
### (0.5 p) 6. __Check assumptions__ of the test
(expected count for each cell is at least 5,
or at least 1 provided the expected counts are not too variable).
```{R}
# table of expected frequencies:
chisq_gg$expected
# smallest expected frequency:
min(chisq_gg$expected)
```
Are the model assumptions met?
### (1 p) Interpret the Pearson residuals
If you rejected the null hypothesis above, the Pearson residuals are a way to
indicate which cells of the table were different from expected.
Residuals more extreme than roughly $\pm 2$ are considered "large".
```{R}
# The Pearson residuals
chisq_gg$residuals %>% signif(3)
# The sum of the squared residuals is the chi-squared statistic:
chisq_gg$residuals^2 %>% signif(3)
sum(chisq_gg$residuals^2)
```
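To confirm what `$residuals` contains, the Pearson residual for each cell is $(O - E)/\sqrt{E}$; a self-contained check on a made-up 2 x 2 table:

```r
tab <- matrix(c(40, 10, 20, 30), nrow = 2)  # hypothetical counts
fit <- chisq.test(tab, correct = FALSE)
res_manual <- (fit$observed - fit$expected) / sqrt(fit$expected)
stopifnot(isTRUE(all.equal(res_manual, fit$residuals)))
# $stdres additionally divides by sqrt((1 - row_i/n) * (1 - col_j/n))
```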
### (1 p) Interpret the mosaic plot.
The mosaic plot is a visual representation of the observed frequencies (areas of each box)
and the Pearson residual (color bar).
If rectangles are the same size along rows and columns (and gray),
then they're close to expected.
Differences between observed and expected frequencies are indicated by
rectangles of different sizes across rows or down columns, and
colors indicate substantial contributions to the $X^2$ statistic.
```{R}
#| fig-width: 6
#| fig-height: 4
# mosaic plot
library(vcd)
#mosaic(tab_GoalsGrade, shade=TRUE, legend=TRUE)
# this layout gives us the interpretation we want:
vcd::mosaic(
  ~ Grade + Goals
, data   = dat_kids
, main   = "Kids: Grade and Goals"
, shade  = TRUE
, legend = TRUE
, direction = "v"
)
```
---
## Goals vs School area (rural, suburban, urban)
_Repeat the analysis above, but compare School area instead of Grade level._
```{R}
# Tabulate by two categorical variables:
tab_GoalsArea <-
  xtabs(
    ~ Goals + Area
  , data = dat_kids
  )
tab_GoalsArea
# column proportions
prop.table(
  tab_GoalsArea
, margin = 2
) %>%
  signif(2)
```
### (0.5 p) 1. Set up the __null and alternative hypotheses__ in words and notation.
* In words: "There is an association between [row variable] and [column variable]."
* In notation: $H_0: p(i \textrm{ and } j) = p(i)p(j)$ versus $H_A: p(i \textrm{ and } j) \ne p(i)p(j)$, for all row categories $i$ and column categories $j$.
### (0 p) 2. Choose the __significance level__ of the test, such as $\alpha=0.05$.
### (0 p) 3. Compute the __test statistic__, such as $X^2$.
```{R}
chisq_ga <-
  chisq.test(
    tab_GoalsArea
  , correct = FALSE
  )
chisq_ga
# names(chisq_ga) # for the objects to report
```
The test statistic is $X^2 = `r chisq_ga$statistic %>% signif(3)`$.
### (0 p) 4. The p-value $= `r chisq_ga$p.value %>% signif(3)`$.
### (0.5 p) 5. State the __conclusion__ in terms of the problem.
### (0 p) 6. __Check assumptions__ of the test
(expected count for each cell is at least 5,
or at least 1 provided the expected counts are not too variable).
```{R}
# table of expected frequencies:
chisq_ga$expected
# smallest expected frequency:
min(chisq_ga$expected)
```
Are the model assumptions met?
### (0.5 p) Interpret the Pearson residuals.
If you rejected the null hypothesis above, the Pearson residuals are a way to
indicate which cells of the table were different from expected.
Residuals more extreme than roughly $\pm 2$ are considered "large".
```{R}
# The Pearson residuals
chisq_ga$residuals %>% signif(3)
# The sum of the squared residuals is the chi-squared statistic:
chisq_ga$residuals^2 %>% signif(3)
sum(chisq_ga$residuals^2)
```
### (0.5 p) Interpret the mosaic plot.
The mosaic plot is a visual representation of the observed frequencies (areas of each box)
and the Pearson residual (color bar).
If rectangles are the same size along rows and columns (and gray),
then they're close to expected.
Differences between observed and expected frequencies are indicated by
rectangles of different sizes across rows or down columns, and
colors indicate substantial contributions to the $X^2$ statistic.
```{R}
#| fig-width: 6
#| fig-height: 4
# mosaic plot
library(vcd)
#mosaic(tab_GoalsArea, shade=TRUE, legend=TRUE)
vcd::mosaic(
  ~ Area + Goals
, data   = dat_kids
, main   = "Kids: Area and Goals"
, shade  = TRUE
, legend = TRUE
, direction = "v"  # this layout gives us the interpretation we want
)
```