Include your answers in this document in the sections below the rubric where I have point values indicated (1 p).

Rubric

Answer the questions with the data example.

World Series expected number of games

One form of sportsball is the game of Baseball. Its championship, played exclusively in North America, is called the “World Series”.

How well does the distribution of number of games played during the Baseball World Series follow the expected distribution? This is a multinomial goodness-of-fit test. There are four categories for the number of games (4, 5, 6, and 7) and for each World Series the number of games must fall into one of these categories.

Calculate observed and expected frequencies

We read in a table summarizing the World Series history, and calculate the total number of games played.

# read datadat_mlb_ws <-read_csv("ADA1_CL_24_WorldSeries.csv" , skip =1 ) %>%# 1. split Games at the hyphen into two columnsseparate(col = Games , into =c("g1", "g2") , sep ="-" , convert =TRUE ) %>%# 2. add number of gamesmutate(n_Games = g1 + g2# Note that the 1903, 1919, 1920, and 1921 World Series were in a best-of-nine# format (carried by the first team to win five games).# When more than 7 games were played, I assigned those to 7 games. , n_Games =ifelse(n_Games >7, 7, n_Games) ) %>%drop_na( n_Games )

New names:
Rows: 118 Columns: 9
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(8): Winning team, Manager...3, League...4, Games, Losing team, Manager.... dbl
(1): Year
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `Manager` -> `Manager...3`
• `League` -> `League...4`
• `Manager` -> `Manager...7`
• `League` -> `League...8`

head(dat_mlb_ws)

# A tibble: 6 × 11
Year `Winning team` Manag…¹ Leagu…² g1 g2 Losin…³ Manag…⁴ Leagu…⁵ Ref.
<dbl> <chr> <chr> <chr> <int> <int> <chr> <chr> <chr> <chr>
1 1903 Boston Americ… Jimmy … AL 5 3 Pittsb… Fred C… NL [9]
2 1905 New York Gian… John M… NL 4 1 Philad… Connie… AL [10]
3 1906 Chicago White… Fielde… AL 4 2 Chicag… Frank … NL [11]
4 1907 Chicago Cubs Frank … NL 4 0 Detroi… Hugh J… AL [12]
5 1908 Chicago Cubs Frank … NL 4 1 Detroi… Hugh J… AL [13]
6 1909 Pittsburgh Pi… Fred C… NL 4 3 Detroi… Hugh J… AL [14]
# … with 1 more variable: n_Games <dbl>, and abbreviated variable names
# ¹Manager...3, ²League...4, ³`Losing team`, ⁴Manager...7, ⁵League...8
# ℹ Use `colnames()` to see all variable names

Observed

Here’s the observed number of games over the history of the World Series.

library(ggplot2)p <-ggplot(dat_mlb_ws, aes(x = Year, y = n_Games, group =1))p <- p +geom_point()#p <- p + geom_line()print(p)

Under an assumption of the probability of a particular team winning a single game, we can calculate the probability of 4, 5, 6, or 7 games, and therefore the expected number of World Series ending with each number of games.

(1 p) Below, set your own probability of a particular team winning each game.

On average, one team is a little better than the other, and this number should reflect the slight advantage of the team with superior ability. For example \(p=0.6\) means that the superior team has a per-game probability of winning of 0.6 and the inferior team has a probability of winning of 0.4.

# probability of the superior team winning each gamep_win <-0.6## Change this number to your own value# Expected proportion for each number of gamesdat_games$exp_prop <-NA# 4 wins to 0 losses, through 4 wins to 3 losses# first value on each line is the superior team, the second is the inferior teamdat_games$exp_prop[1] <-pnbinom(0, 4, p_win) +pnbinom(0, 4, 1- p_win)dat_games$exp_prop[2] <- (pnbinom(1, 4, p_win) -pnbinom(0, 4, p_win)) + (pnbinom(1, 4, 1- p_win) -pnbinom(0, 4, 1- p_win))dat_games$exp_prop[3] <- (pnbinom(2, 4, p_win) -pnbinom(1, 4, p_win)) + (pnbinom(2, 4, 1- p_win) -pnbinom(1, 4, 1- p_win))dat_games$exp_prop[4] <-1-pnbinom(2, 4, p_win) -pnbinom(2, 4, 1- p_win)# checking that the proportions sum to 1sum(dat_games$exp_prop)

[1] 1

# Expected number of gamesdat_games <- dat_games %>%mutate(exp_games =sum(obs_games) * exp_prop )dat_games

Plot observed vs expected values to help identify n_Games groups that deviate the most. Plot contribution to chi-square values to help identify n_Games groups that deviate the most. The term “Contribution to Chi-Square” (chisq) refers to the values of \(\frac{(O-E)^2}{E}\) for each category. \(\chi^2_{\textrm{s}}\) is the sum of those contributions.

# Observed vs Expected countslibrary(ggplot2)p1 <-ggplot(x_table_long, aes(x = n_Games, fill = stat, weight=value))p1 <- p1 +geom_bar(position="dodge")p1 <- p1 +labs(title ="Observed and Expected frequencies")p1 <- p1 +xlab("Number of games")#print(p1)p2 <-ggplot(x_table, aes(x =reorder(n_Games, -chisq), weight = chisq))p2 <- p2 +geom_bar()p2 <- p2 +labs(title ="Contribution to Chi-sq statistic")p2 <- p2 +xlab("Sorted number of games")p2 <- p2 +ylab("Chi-sq")#print(p2)library(gridExtra)

Attaching package: 'gridExtra'

The following object is masked from 'package:dplyr':
combine

grid.arrange(p1, p2, nrow=1)

Multiple Comparisons

The goodness-of-fit test suggests that at least one of the n_Games category proportions for the observed number of games differs from the expected number of games. A reasonable next step in the analysis would be to separately test the four hypotheses: \(H_0: p_{4} = 0.155\), \(H_0: p_{5} = 0.269\), \(H_0: p_{6} = 0.3\), and \(H_0: p_{7} = 0.276\) to see which n_Games categories led to this conclusion.

A Bonferroni comparison with a Family Error Rate \(\alpha=0.05\) sets each test’s significance level to \(\alpha = 0.05 / 4 = 0.0125\), which corresponds to 98.75% two-sided CIs for the individual proportions.

Below I perform exact binomial tests of proportion for each of the four n_Games categories at the Bonferroni-adjusted significance level. I save the p-values and confidence intervals in a table along with the observed and census proportions in order to display the table below.

b_sum1 <-binom.test(dat_games$obs_games[1], sum(dat_games$obs_games), p = dat_games$exp_prop[1], alternative ="two.sided", conf.level =1-0.05/4)b_sum2 <-binom.test(dat_games$obs_games[2], sum(dat_games$obs_games), p = dat_games$exp_prop[2], alternative ="two.sided", conf.level =1-0.05/4)b_sum3 <-binom.test(dat_games$obs_games[3], sum(dat_games$obs_games), p = dat_games$exp_prop[3], alternative ="two.sided", conf.level =1-0.05/4)b_sum4 <-binom.test(dat_games$obs_games[4], sum(dat_games$obs_games), p = dat_games$exp_prop[4], alternative ="two.sided", conf.level =1-0.05/4)# put the p-value and CI into a data.frameb_sum <-data.frame(rbind( c(b_sum1$p.value, b_sum1$conf.int) , c(b_sum2$p.value, b_sum2$conf.int) , c(b_sum3$p.value, b_sum3$conf.int) , c(b_sum4$p.value, b_sum4$conf.int) ) )names(b_sum) <-c("p.value", "CI.lower", "CI.upper")b_sum$n_Games <- dat_games$n_Gamesb_sum$obs_prop <- x_table$obs /sum(x_table$obs)b_sum$exp_prop <- dat_games$exp_propb_sum <- b_sum %>%select( n_Games, obs_prop, exp_prop , p.value, CI.lower, CI.upper )b_sum