Include your answers in this document in the sections below the rubric where I have point values indicated (1 p).

# Rubric

Answer the questions with the data example.

# Goals vs School area (rural, suburban, urban)

Repeat the analysis above, but compare School area instead of Grade level.

# Tabulate by two categorical variables:
tab_GoalsArea <-
xtabs(
~ Goals + Area
, data = dat_kids
)
tab_GoalsArea
##          Area
## Goals     Rural Suburban Urban
##   Grades     57       87   103
##   Popular    50       42    49
##   Sports     42       22    26
# column proportions
prop.table(
tab_GoalsArea
, margin = 2
) %>%
signif(2)
##          Area
## Goals     Rural Suburban Urban
##   Grades   0.38     0.58  0.58
##   Popular  0.34     0.28  0.28
##   Sports   0.28     0.15  0.15
1. (1 p) Set up the null and alternative hypotheses in words and notation.

• In words: "There is an association between [row variable] and [column variable]."
• In notation: $$H_0: p(i \textrm{ and } j) = p(i)p(j)$$ for all row categories $$i$$ and column categories $$j$$, versus $$H_A: p(i \textrm{ and } j) \ne p(i)p(j)$$ for at least one pair of categories $$i$$ and $$j$$.
2. Choose the significance level of the test, such as $$\alpha=0.05$$.

3. Compute the test statistic, such as $$X^2$$.

chisq_ga <-
chisq.test(
tab_GoalsArea
, correct = FALSE
)
chisq_ga
##
##  Pearson's Chi-squared test
##
## data:  tab_GoalsArea
## X-squared = 18.828, df = 4, p-value = 0.0008497
  # names(chisq_ga) for the objects to report

The test statistic is $$X^2 = 18.8$$.
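
As a cross-check on the statistic and p-value, the following is a minimal sketch that recomputes them by hand. It assumes the observed counts are typed in as a matrix (here called `obs`, a hypothetical name) rather than read from `dat_kids`; the Grades row is back-calculated from the reported expected counts and residuals, so treat it as an assumption.

```r
# Observed counts entered by hand (assumption: Grades row is back-calculated).
obs <-
  matrix(
    c(57, 87, 103,
      50, 42,  49,
      42, 22,  26)
  , nrow     = 3
  , byrow    = TRUE
  , dimnames = list(
      Goals = c("Grades", "Popular", "Sports")
    , Area  = c("Rural", "Suburban", "Urban")
    )
  )

# Expected counts under independence: (row total * column total) / n
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Pearson chi-squared statistic, degrees of freedom, and upper-tail p-value
X2      <- sum((obs - expected)^2 / expected)
df      <- (nrow(obs) - 1) * (ncol(obs) - 1)
p_value <- pchisq(X2, df = df, lower.tail = FALSE)

c(X2 = X2, df = df, p_value = p_value)
```

If the typed-in counts match the data, these values should agree with the `chisq.test()` output (with `correct = FALSE`) reported above.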

4. The p-value $$= 8.5\times 10^{-4}$$.

5. (1 p) State the conclusion in terms of the problem.

6. Check assumptions of the test (expected count for each cell is at least 5, or at least 1 provided the expected counts are not too variable).

# table of expected frequencies:
chisq_ga$expected
##          Area
## Goals        Rural Suburban    Urban
##   Grades  76.99372 78.02720 91.97908
##   Popular 43.95188 44.54184 52.50628
##   Sports  28.05439 28.43096 33.51464
# smallest expected frequency:
min(chisq_ga$expected)
## [1] 28.05439
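
The expected counts can also be rebuilt from the table margins, which makes the assumption check explicit. This is a sketch only: `obs` is a hypothetical hand-typed copy of the observed table (the Grades row is back-calculated from the expected counts, so it is an assumption), and only the simpler "at least 5 in every cell" rule is checked.

```r
# Observed counts entered by hand (assumption: Grades row is back-calculated).
obs <-
  matrix(
    c(57, 87, 103,
      50, 42,  49,
      42, 22,  26)
  , nrow     = 3
  , byrow    = TRUE
  , dimnames = list(
      Goals = c("Grades", "Popular", "Sports")
    , Area  = c("Rural", "Suburban", "Urban")
    )
  )

# Expected count for each cell: (row total * column total) / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
round(expected, 2)

# Check the "at least 5 in every cell" rule
min_expected  <- min(expected)
assumption_ok <- all(expected >= 5)
min_expected
assumption_ok
```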

(1 p) Are the model assumptions met?

## Deviations details

If you rejected the null hypothesis above, the Pearson residuals are a way to indicate which cells of the table were different from expected. Residuals more extreme than roughly $$\pm 2$$ are considered "large".

# The Pearson residuals
chisq_ga$residuals %>% signif(3)
##          Area
## Goals      Rural Suburban  Urban
##   Grades  -2.280    1.020  1.150
##   Popular  0.912   -0.381 -0.484
##   Sports   2.630   -1.210 -1.300
# The sum of the squared residuals is the chi-squared statistic:
chisq_ga$residuals^2 %>% signif(3)
##          Area
## Goals     Rural Suburban Urban
##   Grades  5.190    1.030 1.320
##   Popular 0.832    0.145 0.234
##   Sports  6.930    1.450 1.680
sum(chisq_ga$residuals^2)
## [1] 18.82763
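
To see where the residuals come from, the sketch below computes them directly as (observed - expected) / sqrt(expected) and lists the cells beyond the rough $$\pm 2$$ guideline. The matrix `obs` is a hypothetical hand-typed copy of the observed table, with the Grades row back-calculated from the reported expected counts (an assumption), and the cutoff of 2 is the rule of thumb from the text, not a formal test.

```r
# Observed counts entered by hand (assumption: Grades row is back-calculated).
obs <-
  matrix(
    c(57, 87, 103,
      50, 42,  49,
      42, 22,  26)
  , nrow     = 3
  , byrow    = TRUE
  , dimnames = list(
      Goals = c("Grades", "Popular", "Sports")
    , Area  = c("Rural", "Suburban", "Urban")
    )
  )

# Expected counts under independence, then Pearson residuals per cell
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
pearson  <- (obs - expected) / sqrt(expected)
signif(pearson, 3)

# Row/column positions of cells with residuals more extreme than +/- 2
flagged <- which(abs(pearson) > 2, arr.ind = TRUE)
flagged

# The squared residuals add up to the chi-squared statistic
sum(pearson^2)
```

A negative residual means fewer kids than expected in that cell and a positive residual means more, which is the direction to report when interpreting the flagged cells.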

(1 p) Interpret the Pearson residuals.

The mosaic plot is a visual representation of the observed frequencies (the area of each box) and the Pearson residuals (the shading, keyed to the color bar). If the rectangles are the same size along rows and columns (and gray), then the observed frequencies are close to expected. Differences between observed and expected frequencies are indicated by differently sized rectangles across rows or down columns, and colors indicate substantial contributions to the $$X^2$$ statistic.

# mosaic plot
library(vcd)

# this layout gives us the interpretation we want:
mosaic(~ Area + Goals, data = dat_kids, shade=TRUE, legend=TRUE, direction = "v")

vcd::mosaic(
~ Area + Goals
, data      = dat_kids
, main      = "Kids: Area and Goals"
)