Include your answers in this document in the sections below the rubric where I have point values indicated (1 p).

Answer the questions with the data example.

One form of sportsball is the game of Baseball. Its championship, played exclusively in North America, is called the “World Series”.

How well does the distribution of number of games played during the Baseball World Series follow the expected distribution? This is a multinomial goodness-of-fit test. There are four categories for the number of games (4, 5, 6, and 7) and for each World Series the number of games must fall into one of these categories.

We read in a table summarizing the World Series history, and calculate the total number of games played.

`library(tidyverse)`

`## -- Attaching packages ------------------------------ tidyverse 1.3.0 --`

```
## v ggplot2 3.3.0 v purrr 0.3.4
## v tibble 3.0.1 v dplyr 0.8.5
## v tidyr 1.1.0 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
```

```
## -- Conflicts --------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
```

```
# read data
dat_mlb_ws <-
read_csv(
"ADA1_CL_24_WorldSeries.csv"
, skip = 1
) %>%
# 1. split Games at the hyphen into two columns
separate(
col = Games
, into = c("g1", "g2")
, sep = "-"
, convert = TRUE
) %>%
# 2. add number of games
mutate(
n_Games = g1 + g2
# Note that the 1903, 1919, 1920, and 1921 World Series were in a best-of-nine
# format (carried by the first team to win five games).
# When more than 7 games were played, I assigned those to 7 games.
, n_Games = ifelse(n_Games > 7, 7, n_Games)
) %>%
drop_na(
n_Games
)
```

```
## Warning: Duplicated column names deduplicated: 'Manager' => 'Manager_1' [7],
## 'League' => 'League_1' [8]
```

```
## Parsed with column specification:
## cols(
## Year = col_double(),
## `Winning team` = col_character(),
## Manager = col_character(),
## League = col_character(),
## Games = col_character(),
## `Losing team` = col_character(),
## Manager_1 = col_character(),
## League_1 = col_character(),
## Ref. = col_character()
## )
```

`head(dat_mlb_ws)`

```
## # A tibble: 6 x 11
## Year `Winning team` Manager League g1 g2 `Losing team` Manager_1
## <dbl> <chr> <chr> <chr> <int> <int> <chr> <chr>
## 1 1903 Boston Americ~ Jimmy ~ AL 5 3 Pittsburg Pi~ Fred Cla~
## 2 1905 New York Gian~ John M~ NL 4 1 Philadelphia~ Connie M~
## 3 1906 Chicago White~ Fielde~ AL 4 2 Chicago Cubs Frank Ch~
## 4 1907 Chicago Cubs Frank ~ NL 4 0 Detroit Tige~ Hugh Jen~
## 5 1908 Chicago Cubs Frank ~ NL 4 1 Detroit Tige~ Hugh Jen~
## 6 1909 Pittsburgh Pi~ Fred C~ NL 4 3 Detroit Tige~ Hugh Jen~
## # ... with 3 more variables: League_1 <chr>, Ref. <chr>, n_Games <dbl>
```

Here’s the observed number of games over the history of the World Series.

```
library(ggplot2)
p <- ggplot(dat_mlb_ws, aes(x = Year, y = n_Games, group = 1))
p <- p + geom_point()
#p <- p + geom_line()
print(p)
```

```
# observed proportions
dat_games <-
dat_mlb_ws %>%
group_by(
n_Games
) %>%
summarize(
obs_games = n()
, .groups = "drop_last"
) %>%
mutate(
obs_prop = round(obs_games / sum(obs_games), 3)
, n_Games = n_Games %>% factor()
) %>%
ungroup()
dat_games
```

```
## # A tibble: 4 x 4
## n_Games obs_games .groups obs_prop
## <fct> <int> <chr> <dbl>
## 1 4 21 drop_last 0.183
## 2 5 26 drop_last 0.226
## 3 6 24 drop_last 0.209
## 4 7 44 drop_last 0.383
```

Under an assumption of the probability of a particular team winning a single game, we can calculate the probability of 4, 5, 6, or 7 games, and therefore the expected number of World Series ending with each number of games.

**(1 p) Below, set your own probability of a particular team winning each game.**

On average, one team is a little better than the other, and this number should reflect the slight advantage of the team with superior ability. For example \(p=0.6\) means that the superior team has a per-game probability of winning of 0.6 and the inferior team has a probability of winning of 0.4.

```
# probability of the superior team winning each game
p_win <- 0.6 ## Change this number to your own value
# Expected proportion for each number of games
dat_games$exp_prop <- NA
# 4 wins to 0 losses, through 4 wins to 3 losses
# first value on each line is the superior team, the second is the inferior team
dat_games$exp_prop[1] <- pnbinom(0, 4, p_win) + pnbinom(0, 4, 1 - p_win)
dat_games$exp_prop[2] <- (pnbinom(1, 4, p_win) - pnbinom(0, 4, p_win)) + (pnbinom(1, 4, 1 - p_win) - pnbinom(0, 4, 1 - p_win))
dat_games$exp_prop[3] <- (pnbinom(2, 4, p_win) - pnbinom(1, 4, p_win)) + (pnbinom(2, 4, 1 - p_win) - pnbinom(1, 4, 1 - p_win))
dat_games$exp_prop[4] <- 1 - pnbinom(2, 4, p_win) - pnbinom(2, 4, 1 - p_win)
# checking that the proportions sum to 1
sum(dat_games$exp_prop)
```

`## [1] 1`

```
# Expected number of games
dat_games <-
dat_games %>%
mutate(
exp_games = sum(obs_games) * exp_prop
)
dat_games
```

```
## # A tibble: 4 x 6
## n_Games obs_games .groups obs_prop exp_prop exp_games
## <fct> <int> <chr> <dbl> <dbl> <dbl>
## 1 4 21 drop_last 0.183 0.155 17.8
## 2 5 26 drop_last 0.226 0.269 30.9
## 3 6 24 drop_last 0.209 0.300 34.4
## 4 7 44 drop_last 0.383 0.276 31.8
```

That’s a lot of work to set up the observed and expected frequencies, but now we can perform the goodness-of-fit test.

- (2 p) Set up the
**null and alternative hypotheses**in words and notation.- In words: “xxx”
- In notation: \(H_0: xxx\) versus \(H_A: xxx\)

- Let the
**significance level**of the test be \(\alpha=0.05\).

```
# calculate chi-square goodness-of-fit test
x_summary <-
chisq.test(
x = dat_games$obs_games
, correct = FALSE
, p = dat_games$exp_prop
)
# print result of test
x_summary
```

```
##
## Chi-squared test for given probabilities
##
## data: dat_games$obs_games
## X-squared = 9.1893, df = 3, p-value = 0.02688
```

```
# use names(x_summary) to find the statistic and p.value objects to print in text below
# replace 000 below with the object correct
```

- (1 p) Compute the
**test statistic**.

The test statistic is \(X^2 = 0\).

- (1 p) Compute the
**\(p\)-value**.

The p-value \(= 0\).

- (1 p) State the
**conclusion**in terms of the problem.- Reject \(H_0\) in favor of \(H_A\) if \(p\textrm{-value} < \alpha\).
- Fail to reject \(H_0\) if \(p\textrm{-value} \ge \alpha\). (Note: We DO NOT
*accept*\(H_0\).)

- (1 p)
**Check assumptions**of the test (see notes).

```
# use output in x_summary and create table
x_table <-
data.frame(
n_Games = dat_games$n_Games
, obs = x_summary$observed
, exp = x_summary$expected
, res = x_summary$residuals
, chisq = x_summary$residuals^2
, stdres = x_summary$stdres
)
x_table
```

```
## n_Games obs exp res chisq stdres
## 1 4 21 17.8480 0.7460904 0.5566508 0.8117354
## 2 5 26 30.9120 -0.8834761 0.7805300 -1.0331817
## 3 6 24 34.4448 -1.7796651 3.1672080 -2.1263778
## 4 7 44 31.7952 2.1644616 4.6848940 2.5446298
```

Plot observed vs expected values to help identify `n_Games`

groups that deviate the most. Plot contribution to chi-square values to help identify `n_Games`

groups that deviate the most. The term “Contribution to Chi-Square” (`chisq`

) refers to the values of \(\frac{(O-E)^2}{E}\) for each category. \(\chi^2_{\textrm{s}}\) is the sum of those contributions.

```
x_table_long <-
x_table %>%
pivot_longer(
cols = c(obs, exp)
, names_to = "stat"
, values_to = "value"
)
x_table_long
```

```
## # A tibble: 8 x 6
## n_Games res chisq stdres stat value
## <fct> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 4 0.746 0.557 0.812 obs 21
## 2 4 0.746 0.557 0.812 exp 17.8
## 3 5 -0.883 0.781 -1.03 obs 26
## 4 5 -0.883 0.781 -1.03 exp 30.9
## 5 6 -1.78 3.17 -2.13 obs 24
## 6 6 -1.78 3.17 -2.13 exp 34.4
## 7 7 2.16 4.68 2.54 obs 44
## 8 7 2.16 4.68 2.54 exp 31.8
```

```
# Observed vs Expected counts
library(ggplot2)
p1 <- ggplot(x_table_long, aes(x = n_Games, fill = stat, weight=value))
p1 <- p1 + geom_bar(position="dodge")
p1 <- p1 + labs(title = "Observed and Expected frequencies")
p1 <- p1 + xlab("Number of games")
#print(p1)
p2 <- ggplot(x_table, aes(x = reorder(n_Games, -chisq), weight = chisq))
p2 <- p2 + geom_bar()
p2 <- p2 + labs(title = "Contribution to Chi-sq statistic")
p2 <- p2 + xlab("Sorted number of games")
p2 <- p2 + ylab("Chi-sq")
#print(p2)
library(gridExtra)
```

```
##
## Attaching package: 'gridExtra'
```

```
## The following object is masked from 'package:dplyr':
##
## combine
```

`grid.arrange(p1, p2, nrow=1)`

The goodness-of-fit test suggests that at least one of the `n_Games`

category proportions for the observed number of games differs from the expected number of games. A reasonable next step in the analysis would be to **separately** test the four hypotheses: \(H_0: p_{4} = 0.155\), \(H_0: p_{5} = 0.269\), \(H_0: p_{6} = 0.3\), and \(H_0: p_{7} = 0.276\) to see which `n_Games`

categories led to this conclusion.

A Bonferroni comparison with a Family Error Rate \(\alpha=0.05\) sets each test’s significance level to \(\alpha = 0.05 / 4 = 0.0125\), which corresponds to 98.75% two-sided CIs for the individual proportions.

Below I perform exact binomial tests of proportion for each of the four `n_Games`

categories at the Bonferroni-adjusted significance level. I save the p-values and confidence intervals in a table along with the observed and census proportions in order to display the table below.

```
b_sum1 <- binom.test(dat_games$obs_games[1], sum(dat_games$obs_games), p = dat_games$exp_prop[1], alternative = "two.sided", conf.level = 1-0.05/4)
b_sum2 <- binom.test(dat_games$obs_games[2], sum(dat_games$obs_games), p = dat_games$exp_prop[2], alternative = "two.sided", conf.level = 1-0.05/4)
b_sum3 <- binom.test(dat_games$obs_games[3], sum(dat_games$obs_games), p = dat_games$exp_prop[3], alternative = "two.sided", conf.level = 1-0.05/4)
b_sum4 <- binom.test(dat_games$obs_games[4], sum(dat_games$obs_games), p = dat_games$exp_prop[4], alternative = "two.sided", conf.level = 1-0.05/4)
# put the p-value and CI into a data.frame
b_sum <-
data.frame(
rbind( c(b_sum1$p.value, b_sum1$conf.int)
, c(b_sum2$p.value, b_sum2$conf.int)
, c(b_sum3$p.value, b_sum3$conf.int)
, c(b_sum4$p.value, b_sum4$conf.int)
)
)
names(b_sum) <- c("p.value", "CI.lower", "CI.upper")
b_sum$n_Games <- dat_games$n_Games
b_sum$obs_prop <- x_table$obs / sum(x_table$obs)
b_sum$exp_prop <- dat_games$exp_prop
b_sum <-
b_sum %>%
select(
n_Games, obs_prop, exp_prop
, p.value, CI.lower, CI.upper
)
b_sum
```

```
## n_Games obs_prop exp_prop p.value CI.lower CI.upper
## 1 4 0.1826087 0.15520 0.43864745 0.1022933 0.2887758
## 2 5 0.2260870 0.26880 0.34416061 0.1367781 0.3376171
## 3 6 0.2086957 0.29952 0.03265599 0.1227887 0.3182686
## 4 7 0.3826087 0.27648 0.01596705 0.2716628 0.5030121
```

(2 p) State the conclusions of the four hypothesis tests above. (Note that the p-value is your guide here, not the CI.)

```
# Observed vs Expected proportions
library(ggplot2)
p <- ggplot(b_sum, aes(x = n_Games, y = obs_prop))
# observed
p <- p + geom_point(size = 4)
p <- p + geom_line(aes(group = 1))
p <- p + geom_errorbar(aes(ymin = CI.lower, ymax = CI.upper), width = 0.1)
# expected
p <- p + geom_point(aes(y = exp_prop), colour = "red", shape = "x", size = 6)
p <- p + geom_line(aes(y = exp_prop, group = 1), colour = "red")
p <- p + labs(caption= "Black = Observed (w/98.75% CI)\nRed = Expected proportions")
p <- p + xlab("Number of games")
p <- p + expand_limits(y = 0)
print(p)
```

(2 p) Write one or two sentences of discussion. Speculate as to why the deviations are the way they are.