ADA1: Class 25, Two-way categorical tables

Advanced Data Analysis 1, Stat 427/527, Fall 2022, Prof. Erik Erhardt, UNM

Author

Your Name

Published

August 13, 2022

Include your answers in this document in the sections below the rubric where I have point values indicated (1 p).

Rubric

Answer the questions with the data example.


Goals vs School area (rural, suburban, urban)

Repeat the analysis above, but compare School area instead of Grade level.

# Tabulate by two categorical variables:
tab_GoalsArea <-
  xtabs(
    ~ Goals + Area
  , data = dat_kids
  )
tab_GoalsArea
         Area
Goals     Rural Suburban Urban
  Grades     57       87   103
  Popular    50       42    49
  Sports     42       22    26
# column proportions
prop.table(
    tab_GoalsArea
  , margin = 2
  ) %>%
  signif(2)
         Area
Goals     Rural Suburban Urban
  Grades   0.38     0.58  0.58
  Popular  0.34     0.28  0.28
  Sports   0.28     0.15  0.15
  1. (1 p) Set up the null and alternative hypotheses in words and notation.

    • In words: "There is no association between [row variable] and [column variable]" (null) versus "There is an association between [row variable] and [column variable]" (alternative).
    • In notation: \(H_0: p(i \textrm{ and } j) = p(i)p(j)\) for all row categories \(i\) and column categories \(j\), versus \(H_A: p(i \textrm{ and } j) \ne p(i)p(j)\) for at least one pair \((i, j)\).
  2. Choose the significance level of the test, such as \(\alpha=0.05\).

  3. Compute the test statistic, such as \(X^2\).

chisq_ga <-
  chisq.test(
    tab_GoalsArea
  , correct = FALSE
  )
chisq_ga

    Pearson's Chi-squared test

data:  tab_GoalsArea
X-squared = 18.828, df = 4, p-value = 0.0008497
  # names(chisq_ga) for the objects to report

The test statistic is \(X^2 = 18.8\).
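As a minimal sketch of how the reported quantities can be pulled out of the test object, the block below rebuilds the observed counts from the printed table (standing in for `dat_kids`, which is not reproduced here) and extracts the statistic, degrees of freedom, and p-value. The names `x2_stat`, `x2_df`, `x2_pval`, and `x2_pval_by_hand` are illustrative and not part of the original analysis.

```r
# Observed counts copied from the printed table above
# (stands in for xtabs(~ Goals + Area, data = dat_kids)).
tab_GoalsArea <-
  as.table(
    matrix(
      c(57, 87, 103,
        50, 42,  49,
        42, 22,  26)
    , nrow     = 3
    , byrow    = TRUE
    , dimnames = list(
        Goals = c("Grades", "Popular", "Sports")
      , Area  = c("Rural", "Suburban", "Urban")
      )
    )
  )

chisq_ga <- chisq.test(tab_GoalsArea, correct = FALSE)

# Components of the test object that are typically reported
x2_stat <- unname(chisq_ga$statistic)   # X^2
x2_df   <- unname(chisq_ga$parameter)   # (rows - 1) * (columns - 1)
x2_pval <- chisq_ga$p.value

# The p-value is the upper-tail chi-squared probability beyond X^2
x2_pval_by_hand <- pchisq(x2_stat, df = x2_df, lower.tail = FALSE)

round(c(X2 = x2_stat, df = x2_df), 3)
signif(c(p_value = x2_pval, by_hand = x2_pval_by_hand), 3)
```

Extracting the values this way, rather than retyping them from the printout, keeps the written answer in sync with the computed result.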

  4. The p-value \(= 8.5\times 10^{-4}\).

  5. (1 p) State the conclusion in terms of the problem.

  6. Check assumptions of the test (expected count for each cell is at least 5, or at least 1 provided the expected counts are not too variable).

# table of expected frequencies:
chisq_ga$expected
         Area
Goals        Rural Suburban    Urban
  Grades  76.99372 78.02720 91.97908
  Popular 43.95188 44.54184 52.50628
  Sports  28.05439 28.43096 33.51464
# smallest expected frequency:
min(chisq_ga$expected)
[1] 28.05439
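The expected frequencies above follow from the independence hypothesis: each expected count is (row total × column total) / grand total. The sketch below reproduces them by hand from the observed counts, again rebuilt from the printed table rather than from `dat_kids`. The names `n_total` and `expected_by_hand` are illustrative.

```r
# Observed counts copied from the printed table
tab_GoalsArea <-
  as.table(
    matrix(
      c(57, 87, 103,
        50, 42,  49,
        42, 22,  26)
    , nrow     = 3
    , byrow    = TRUE
    , dimnames = list(
        Goals = c("Grades", "Popular", "Sports")
      , Area  = c("Rural", "Suburban", "Urban")
      )
    )
  )

n_total <- sum(tab_GoalsArea)

# Expected count under independence: row total * column total / grand total
expected_by_hand <-
  outer(rowSums(tab_GoalsArea), colSums(tab_GoalsArea)) / n_total

round(expected_by_hand, 2)

# Rule-of-thumb check: is every expected count at least 5?
all(expected_by_hand >= 5)
min(expected_by_hand)
```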

(1 p) Are the model assumptions met?

Deviations details

If you rejected the null hypothesis above, the Pearson residuals indicate which cells of the table differ most from their expected counts. Residuals more extreme than roughly \(\pm 2\) are considered "large".

# The Pearson residuals
chisq_ga$residuals   %>% signif(3)
         Area
Goals      Rural Suburban  Urban
  Grades  -2.280    1.020  1.150
  Popular  0.912   -0.381 -0.484
  Sports   2.630   -1.210 -1.300
# The sum of the squared residuals is the chi-squared statistic:
chisq_ga$residuals^2 %>% signif(3)
         Area
Goals     Rural Suburban Urban
  Grades  5.190    1.030 1.320
  Popular 0.832    0.145 0.234
  Sports  6.930    1.450 1.680
sum(chisq_ga$residuals^2)
[1] 18.82763
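Each Pearson residual is (observed − expected) / √expected, so the residuals can also be computed directly and the "large" cells listed. This is a sketch under the same assumption as before, that the observed counts are rebuilt from the printed table. The names `pearson_resid`, `is_large`, and `large_cells` are illustrative.

```r
# Observed counts copied from the printed table
tab_GoalsArea <-
  as.table(
    matrix(
      c(57, 87, 103,
        50, 42,  49,
        42, 22,  26)
    , nrow     = 3
    , byrow    = TRUE
    , dimnames = list(
        Goals = c("Grades", "Popular", "Sports")
      , Area  = c("Rural", "Suburban", "Urban")
      )
    )
  )

expected <- chisq.test(tab_GoalsArea, correct = FALSE)$expected

# Pearson residual for each cell: (observed - expected) / sqrt(expected)
pearson_resid <- (tab_GoalsArea - expected) / sqrt(expected)
signif(pearson_resid, 3)

# Cells beyond the rough +/- 2 guideline, listed in long format
is_large <- abs(pearson_resid) > 2
large_cells <-
  as.data.frame(
    as.table(pearson_resid)
  , responseName = "residual"
  )[as.vector(is_large), ]
large_cells
```

A negative residual means fewer observations than expected under independence, and a positive residual means more.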

(1 p) Interpret the Pearson residuals.

The mosaic plot is a visual representation of the observed frequencies (the area of each box) and the Pearson residuals (the color scale). If the rectangles are the same size along rows and columns and are shaded gray, the observed counts are close to expected. Differences between observed and expected frequencies show up as different-sized rectangles across rows or down columns, and colored shading marks cells that contribute substantially to the \(X^2\) statistic.

# mosaic plot
library(vcd)
#mosaic(tab_GoalsArea, shade=TRUE, legend=TRUE)

# this layout gives us the interpretation we want:
mosaic(~ Area + Goals, data = dat_kids, shade=TRUE, legend=TRUE, direction = "v")

vcd::mosaic(
    ~ Area + Goals
  , data      = dat_kids
  , main      = "Kids: Area and Goals"
  , shade     = TRUE
  , legend    = TRUE
  , direction = "v"
  )
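If only the table of counts is available, or the vcd package is not installed, a comparable plot can be sketched with base R's `mosaicplot()`. This is an alternative to the `vcd::mosaic()` calls above, not a replacement for them; the layout and shading legend are expected to be broadly similar but will not match exactly. The counts are rebuilt from the printed table, and `tab_AreaGoals` is an illustrative name.

```r
# Observed counts copied from the printed table
tab_GoalsArea <-
  as.table(
    matrix(
      c(57, 87, 103,
        50, 42,  49,
        42, 22,  26)
    , nrow     = 3
    , byrow    = TRUE
    , dimnames = list(
        Goals = c("Grades", "Popular", "Sports")
      , Area  = c("Rural", "Suburban", "Urban")
      )
    )
  )

# Transpose so Area is the first variable and sets the column widths
tab_AreaGoals <- t(tab_GoalsArea)

# shade = TRUE colors tiles by residuals from the independence model
mosaicplot(
    tab_AreaGoals
  , main  = "Kids: Area and Goals"
  , shade = TRUE
  )
```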

(1 p) Interpret the mosaic plot.