---
title: "ADA1: Class 20, Two-way categorical tables"
author: anonymous
date: "11/01/2016"
output:
  html_document:
    toc: true
---
Include your answers in this document in the sections below the rubric where I have point values indicated (1 p).

# Rubric

Answer the questions with the data example.

---

# Popular kids
Subjects were students in grades 4-6 from three school districts in Ingham and
Clinton Counties, Michigan. Chase and Dummer stratified their sample, selecting
students from urban, suburban, and rural school districts with approximately
1/3 of their sample coming from each district. __Students indicated whether good
grades, athletic ability, or popularity was most important to them.__ They also
ranked four factors: grades, sports, looks, and money, in order of their
importance for popularity. The questionnaire also asked for gender, grade
level, and other demographic information.
__[Reference](http://lib.stat.cmu.edu/DASL/Datafiles/PopularKids.html):__
Chase, M. A., and Dummer, G. M. (1992), "The Role of Sports as a Social
Determinant for Children," _Research Quarterly for Exercise and Sport_, 63,
418-424.
```{R}
# read data
## original source: http://lib.stat.cmu.edu/DASL/Datafiles/PopularKids.html
fn.dat <- "http://statacumen.com/teach/ADA1/worksheet/ADA1_WS_20_PopularKids.dat"
dat <- read.csv(fn.dat, skip = 40, header = TRUE, sep = " ", stringsAsFactors = TRUE)
dat <- subset(dat
, subset = (Area %in% c("Urban", "Rural", "Suburban"))
, select = c("Goals", "Grade", "Area"))
dat$Grade <- factor(dat$Grade)
dat$Goals <- factor(dat$Goals)
summary(dat)
```
We will investigate the association between two pairs of categorical variables:
(1) Goals (grades, popular, sports) vs Grade level (4, 5, 6) and
(2) Goals (grades, popular, sports) vs School area (rural, suburban, urban).
## Hypothesis test for Homogeneity of Proportions (categorical association)
You'll perform this hypothesis test for both scenarios below.
1. Set up the __null and alternative hypotheses__ in words and notation.
* In words: "There is an association between [row variable] and [column variable]."
(Note that the statement in words is in terms of the alternative hypothesis.)
* In notation: $H_0: p(i \textrm{ and } j) = p(i)p(j)$ for all row categories $i$ and column categories $j$, versus $H_A: p(i \textrm{ and } j) \ne p(i)p(j)$ for at least one pair $(i, j)$.
Under $H_0$, the probability of row category $i$ conditional on column category $j$
satisfies $p(i|j) = p(i)$, so it does not depend on which column category $j$ we consider.
2. Choose the __significance level__ of the test, such as $\alpha=0.05$.
3. Compute the __test statistic__, such as Pearson's $X^2$.
4. Compute the __$p$-value__ from the test statistic
with degrees of freedom $df = (R-1)(C-1)$,
where $R$ and $C$ are the number of rows and columns.
5. State the __conclusion__ in terms of the problem.
* Reject $H_0$ in favor of $H_A$ if $p\textrm{-value} < \alpha$.
* Fail to reject $H_0$ if $p\textrm{-value} \ge \alpha$.
(Note: We DO NOT _accept_ $H_0$.)
6. __Check assumptions__ of the test
(expected count for each cell is at least 5,
or at least 1 provided the expected counts are not too variable).
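
The steps above can be sketched by hand on a small table. This is only an illustration under stated assumptions: the counts and the names `ex.obs`, `ex.exp`, `ex.X2`, `ex.df`, `ex.p`, and `ex.test` are hypothetical and are not part of the PopularKids data. The sketch computes the expected counts under $H_0$, the $X^2$ statistic, the degrees of freedom, and the upper-tail $p$-value, then compares the statistic with `chisq.test()`.

```{R}
# Hypothetical 2 x 3 table of counts (illustration only, not the PopularKids data)
ex.obs <- matrix(c(20, 30, 50,
                   30, 30, 40),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Row = c("A", "B"), Col = c("X", "Y", "Z")))
# expected counts under H0: (row total * column total) / grand total
ex.exp <- outer(rowSums(ex.obs), colSums(ex.obs)) / sum(ex.obs)
# Pearson chi-squared statistic: sum over cells of (observed - expected)^2 / expected
ex.X2 <- sum((ex.obs - ex.exp)^2 / ex.exp)
# degrees of freedom: (R - 1)(C - 1)
ex.df <- (nrow(ex.obs) - 1) * (ncol(ex.obs) - 1)
# p-value: upper-tail area of the chi-squared distribution
ex.p <- pchisq(ex.X2, df = ex.df, lower.tail = FALSE)
c(X2 = ex.X2, df = ex.df, p.value = ex.p)
# the hand computation should agree with chisq.test()
ex.test <- chisq.test(ex.obs, correct = FALSE)
all.equal(ex.X2, unname(ex.test$statistic))
```

The smallest value in `ex.exp` is the quantity the assumption check in step 6 refers to.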
# Goals (grades, popular, sports) vs Grade level (4, 5, 6)
We start by summarizing the frequencies and column proportions.
The column proportions are interpreted as the proportion of students
with each goal for (or conditional on) each Grade.
If these column proportions are roughly the same for each Grade,
then that's an indication of no association.
```{R}
# Tabulate by two categorical variables:
tab.GoalsGrade <- xtabs(~ Goals + Grade, data = dat)
tab.GoalsGrade
# column proportions
prop.table(tab.GoalsGrade, margin = 2)
```
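
As a small sketch of what `margin = 2` does, the chunk below uses a hypothetical table (the counts and the names `cp.obs` and `cp.prop` are made up for illustration). It shows that each column of the result sums to 1, because the proportions are conditional on the column category, and that the result matches dividing each column by its total.

```{R}
# Hypothetical table of counts (illustration only, not the PopularKids data)
cp.obs <- matrix(c(10, 30, 60,
                   40, 30, 30),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Row = c("A", "B"), Col = c("X", "Y", "Z")))
# column proportions: divide each column by its column total
cp.prop <- prop.table(cp.obs, margin = 2)
cp.prop
# each column sums to 1, since we condition on the column category
colSums(cp.prop)
# the same proportions computed explicitly
all.equal(cp.prop, sweep(cp.obs, 2, colSums(cp.obs), "/"))
```
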
1. __(1 p)__ Set up the __null and alternative hypotheses__ in words and notation.
* In words: "There is an association between [row variable] and [column variable]."
* In notation: $H_0: p(i \textrm{ and } j) = p(i)p(j)$ for all row categories $i$ and column categories $j$, versus $H_A: p(i \textrm{ and } j) \ne p(i)p(j)$ for at least one pair $(i, j)$.
2. Choose the __significance level__ of the test, such as $\alpha=0.05$.
3. Compute the __test statistic__, such as $X^2$.
```{R}
chisq.gg <- chisq.test(tab.GoalsGrade, correct=FALSE)
chisq.gg
# names(chisq.gg) for the objects to report
```
The test statistic is $X^2 = `r signif(chisq.gg$statistic, 3)`$.
4. The p-value $= `r signif(chisq.gg$p.value, 3)`$.
5. __(1 p)__ State the __conclusion__ in terms of the problem.
6. __Check assumptions__ of the test
(expected count for each cell is at least 5,
or at least 1 provided the expected counts are not too variable).
```{R}
# table of expected frequencies:
chisq.gg$expected
# smallest expected frequency:
min(chisq.gg$expected)
```
__(1 p)__ Are the model assumptions met?
## Deviations details
If you rejected the null hypothesis above, the Pearson residuals
indicate which cells of the table differ most from their expected counts.
Residuals more extreme than roughly $\pm 2$ are considered "large".
```{R}
# The Pearson residuals
chisq.gg$residuals
# The sum of the squared residuals is the chi-squared statistic:
chisq.gg$residuals^2
sum(chisq.gg$residuals^2)
```
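
For reference, a Pearson residual is (observed $-$ expected) $/ \sqrt{\textrm{expected}}$ for each cell. The chunk below is a minimal sketch on a hypothetical table (the counts and the names `res.obs`, `res.test`, and `res.hand` are assumptions for illustration only). It computes the residuals by hand, compares them with the `residuals` component of `chisq.test()`, and lists the cells beyond the rough $\pm 2$ guideline.

```{R}
# Hypothetical 2 x 3 table of counts (illustration only, not the PopularKids data)
res.obs <- matrix(c(10, 30, 60,
                    40, 30, 30),
                  nrow = 2, byrow = TRUE)
res.test <- chisq.test(res.obs, correct = FALSE)
# Pearson residual by hand: (observed - expected) / sqrt(expected)
res.hand <- (res.obs - res.test$expected) / sqrt(res.test$expected)
# should match the residuals stored in the test object
all.equal(res.hand, res.test$residuals, check.attributes = FALSE)
# row and column indices of cells with |residual| > 2
which(abs(res.hand) > 2, arr.ind = TRUE)
```
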
__(1 p)__ Interpret the Pearson residuals.
The mosaic plot is a visual representation of the observed frequencies (the area of each box)
and the Pearson residuals (the shading, keyed to the color bar).
If the rectangles are the same size along rows and columns (and gray),
then the observed frequencies are close to expected.
Differences between observed and expected frequencies appear as
different-sized rectangles across rows or down columns, and
colored rectangles mark cells that contribute substantially to the $X^2$ statistic.
```{R}
# mosaic plot
library(vcd)
#mosaic(tab.GoalsGrade, shade=TRUE, legend=TRUE)
# this layout gives us the interpretation we want:
mosaic(~ Grade + Goals, data = dat, shade=TRUE, legend=TRUE, direction = "v")
```
__(1 p)__ Interpret the mosaic plot.
# Goals vs School area (rural, suburban, urban)
_Repeat the analysis above, but compare School area instead of Grade level._
```{R}
# Tabulate by two categorical variables:
tab.GoalsArea <- xtabs(~ Goals + Area, data = dat)
tab.GoalsArea
# column proportions
prop.table(tab.GoalsArea, margin = 2)
```
1. __(1 p)__ Set up the __null and alternative hypotheses__ in words and notation.
* In words: "There is an association between [row variable] and [column variable]."
* In notation: $H_0: p(i \textrm{ and } j) = p(i)p(j)$ for all row categories $i$ and column categories $j$, versus $H_A: p(i \textrm{ and } j) \ne p(i)p(j)$ for at least one pair $(i, j)$.
2. Choose the __significance level__ of the test, such as $\alpha=0.05$.
3. Compute the __test statistic__, such as $X^2$.
```{R}
chisq.ga <- chisq.test(tab.GoalsArea, correct=FALSE)
chisq.ga
# names(chisq.ga) for the objects to report
```
The test statistic is $X^2 = `r signif(chisq.ga$statistic, 3)`$.
4. The p-value $= `r signif(chisq.ga$p.value, 3)`$.
5. __(1 p)__ State the __conclusion__ in terms of the problem.
6. __Check assumptions__ of the test
(expected count for each cell is at least 5,
or at least 1 provided the expected counts are not too variable).
```{R}
# table of expected frequencies:
chisq.ga$expected
# smallest expected frequency:
min(chisq.ga$expected)
```
__(1 p)__ Are the model assumptions met?
## Deviations details
If you rejected the null hypothesis above, the Pearson residuals
indicate which cells of the table differ most from their expected counts.
Residuals more extreme than roughly $\pm 2$ are considered "large".
```{R}
# The Pearson residuals
chisq.ga$residuals
# The sum of the squared residuals is the chi-squared statistic:
chisq.ga$residuals^2
sum(chisq.ga$residuals^2)
```
__(1 p)__ Interpret the Pearson residuals.
The mosaic plot is a visual representation of the observed frequencies (the area of each box)
and the Pearson residuals (the shading, keyed to the color bar).
If the rectangles are the same size along rows and columns (and gray),
then the observed frequencies are close to expected.
Differences between observed and expected frequencies appear as
different-sized rectangles across rows or down columns, and
colored rectangles mark cells that contribute substantially to the $X^2$ statistic.
```{R}
# mosaic plot
library(vcd)
#mosaic(tab.GoalsArea, shade=TRUE, legend=TRUE)
# this layout gives us the interpretation we want:
mosaic(~ Area + Goals, data = dat, shade=TRUE, legend=TRUE, direction = "v")
```
__(1 p)__ Interpret the mosaic plot.