Include your answers in this document in the sections below the rubric where I have point values indicated (1 p).
Rubric
Answer the questions with the data example.
Popular kids
Subjects were students in grades 4-6 from three school districts in Ingham and Clinton Counties, Michigan. Chase and Dummer stratified their sample, selecting students from urban, suburban, and rural school districts with approximately 1/3 of their sample coming from each district. Students indicated whether good grades, athletic ability, or popularity was most important to them. They also ranked four factors: grades, sports, looks, and money, in order of their importance for popularity. The questionnaire also asked for gender, grade level, and other demographic information.
Reference: Chase, M. A., and Dummer, G. M. (1992), “The Role of Sports as a Social Determinant for Children,” Research Quarterly for Exercise and Sport, 63, 418-424.
Rows: 478 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: " "
chr (5): Gender, Race, Area, School, Goals
dbl (6): Grade, Age, Grades, Sports, Looks, Money
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
We will investigate the association between two pairs of categorical variables:
Goals (grades, popular, sports) vs Grade level (4, 5, 6) and
Goals (grades, popular, sports) vs School area (rural, suburban, urban).
Hypothesis test for Homogeneity of Proportions (categorical association)
You’ll perform this hypothesis test for both scenarios below.
Set up the null and alternative hypotheses in words and notation.
In words: “There is an association between [row variable] and [column variable].” (Note that the statement in words is in terms of the alternative hypothesis.)
In notation: \(H_0: p(i \textrm{ and } j) = p(i)p(j)\) versus \(H_A: p(i \textrm{ and } j) \ne p(i)p(j)\), for all row categories \(i\) and column categories \(j\). This is another way of saying that the probability of row category \(i\) conditional on column category \(j\), \(p(i|j) = p(i)\), does not depend on which column category \(j\) we consider.
Choose the significance level of the test, such as \(\alpha=0.05\).
Compute the test statistic, such as \(\chi^2\).
Compute the \(p\)-value from the test statistic with degrees of freedom \(df = (R - 1)(C - 1)\), where \(R\) and \(C\) are the number of rows and columns.
State the conclusion in terms of the problem.
Reject \(H_0\) in favor of \(H_A\) if \(p\textrm{-value} < \alpha\).
Fail to reject \(H_0\) if \(p\textrm{-value} \ge \alpha\). (Note: We DO NOT accept \(H_0\).)
Check assumptions of the test (expected count for each cell is at least 5, or at least 1 provided the expected counts are not too variable).
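The steps above can be sketched by hand in base R. This is a minimal illustration, not the assignment's required solution: it uses the observed Goals by School area counts printed later in this document, and the object names `obs`, `expected`, `x2`, `p_value`, `assumption_ok`, and `reject_h0` are hypothetical names chosen for the example.

```r
# Observed counts copied from the Goals by Area table in this document.
obs <- matrix(
  c(57, 87, 103,
    50, 42,  49,
    42, 22,  26),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    Goals = c("Grades", "Popular", "Sports"),
    Area  = c("Rural", "Suburban", "Urban")
  )
)

alpha <- 0.05  # significance level, chosen before looking at the result

# Expected counts under H0: (row total * column total) / grand total.
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Pearson chi-squared statistic and its degrees of freedom (R - 1)(C - 1).
x2 <- sum((obs - expected)^2 / expected)
df <- (nrow(obs) - 1) * (ncol(obs) - 1)

# Upper-tail p-value from the chi-squared distribution.
p_value <- pchisq(x2, df = df, lower.tail = FALSE)

# Assumption check: every expected count should be at least 5.
assumption_ok <- all(expected >= 5)

# Decision rule: reject H0 only when the p-value is below alpha.
reject_h0 <- p_value < alpha

c(x2 = x2, df = df, p_value = p_value)
```

Computing the expected counts explicitly makes the assumption check a one-line comparison and shows where each term of the statistic comes from.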
Goals (grades, popular, sports) vs Grade level (4, 5, 6)
We start by summarizing the frequencies and column proportions. The column proportions are interpreted as the proportion of students with each goal for (or conditional on) each Grade. If these column proportions are roughly the same for each Grade, then that’s an indication of no association.
# Tabulate by two categorical variables:
tab_GoalsGrade <- xtabs(~ Goals + Grade, data = dat_kids)
tab_GoalsGrade
(1 p) Set up the null and alternative hypotheses in words and notation.
In words: “There is an association between [row variable] and [column variable].”
In notation: \(H_0: p(i \textrm{ and } j) = p(i)p(j)\) versus \(H_A: p(i \textrm{ and } j) \ne p(i)p(j)\), for all row categories \(i\) and column categories \(j\).
Choose the significance level of the test, such as \(\alpha=0.05\).
If you rejected the null hypothesis above, the Pearson residuals are a way to indicate which cells of the table were different from expected. Residuals more extreme than roughly \(\pm 2\) are considered “large”.
# The Pearson residuals
chisq_gg$residuals %>% signif(3)
The mosaic plot is a visual representation of the observed frequencies (area of each box) and the Pearson residuals (color bar). If the rectangles are the same size along rows and columns (and gray), then the observed frequencies are close to expected. Differences between observed and expected frequencies are indicated by differently sized rectangles across rows or down columns, and colors indicate substantial contributions to the \(X^2\) statistic.
# mosaic plot
library(vcd)
Loading required package: grid
# mosaic(tab_GoalsGrade, shade = TRUE, legend = TRUE)
# this layout gives us the interpretation we want:
vcd::mosaic(~ Grade + Goals, data = dat_kids, main = "Kids: Grade and Goals", shade = TRUE, legend = TRUE, direction = "v")
(1 p) Interpret the mosaic plot.
Goals vs School area (rural, suburban, urban)
Repeat the analysis above, but compare School area instead of Grade level.
# Tabulate by two categorical variables:
tab_GoalsArea <- xtabs(~ Goals + Area, data = dat_kids)
tab_GoalsArea
         Area
Goals     Rural Suburban Urban
  Grades     57       87   103
  Popular    50       42    49
  Sports     42       22    26
         Area
Goals     Rural Suburban Urban
  Grades   0.38     0.58  0.58
  Popular  0.34     0.28  0.28
  Sports   0.28     0.15  0.15
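The second table appears to hold the column proportions of the first (each column sums to 1). A minimal sketch of how such proportions can be computed with base R's `prop.table()` follows. Because `dat_kids` is not available here, the table is rebuilt from the counts printed above; `col_props` is a hypothetical name, and the assignment's own code for this step is not shown in this document.

```r
# Rebuild the observed table from the counts printed in this document.
tab_GoalsArea <- as.table(matrix(
  c(57, 87, 103,
    50, 42,  49,
    42, 22,  26),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    Goals = c("Grades", "Popular", "Sports"),
    Area  = c("Rural", "Suburban", "Urban")
  )
))

# margin = 2 divides each cell by its column total, giving the
# proportion of students with each goal conditional on Area.
col_props <- prop.table(tab_GoalsArea, margin = 2)

# Round to two decimals for display.
round(col_props, 2)
```

Reading across a row, the Rural column stands apart from Suburban and Urban, which is the kind of difference the test of association is designed to assess.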
(1 p) Set up the null and alternative hypotheses in words and notation.
In words: “There is an association between [row variable] and [column variable].”
In notation: \(H_0: p(i \textrm{ and } j) = p(i)p(j)\) versus \(H_A: p(i \textrm{ and } j) \ne p(i)p(j)\), for all row categories \(i\) and column categories \(j\).
Choose the significance level of the test, such as \(\alpha=0.05\).
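The test statistic, degrees of freedom, and p-value can be obtained in one call with base R's `chisq.test()`. The sketch below assumes that `chisq_ga`, which the code further down refers to, is the result of such a call; the document does not show how it was created, so treat that as an assumption. The table is rebuilt from the printed counts because `dat_kids` is not available here.

```r
# Observed Goals by Area counts, copied from the table in this document.
tab_GoalsArea <- as.table(matrix(
  c(57, 87, 103,
    50, 42,  49,
    42, 22,  26),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    Goals = c("Grades", "Popular", "Sports"),
    Area  = c("Rural", "Suburban", "Urban")
  )
))

alpha <- 0.05

# Pearson chi-squared test of association.
# (Assumed definition of chisq_ga; not shown in the original.)
chisq_ga <- chisq.test(tab_GoalsArea)

chisq_ga$statistic   # the chi-squared statistic
chisq_ga$parameter   # degrees of freedom, (R - 1)(C - 1)
chisq_ga$p.value     # p-value

# Expected counts, used to check the "at least 5" assumption.
chisq_ga$expected

# Decision at the chosen significance level.
chisq_ga$p.value < alpha
```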
If you rejected the null hypothesis above, the Pearson residuals are a way to indicate which cells of the table were different from expected. Residuals more extreme than roughly \(\pm 2\) are considered “large”.
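As an illustration of the rule of thumb, each Pearson residual is (observed - expected) / sqrt(expected), and the cells beyond roughly \(\pm 2\) can be flagged programmatically. This is a sketch under the assumption that the counts are those of the Goals by Area table in this document; `obs`, `expected`, `pearson_res`, and `large_cells` are hypothetical names.

```r
# Observed Goals by Area counts, copied from the table in this document.
obs <- matrix(
  c(57, 87, 103,
    50, 42,  49,
    42, 22,  26),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    Goals = c("Grades", "Popular", "Sports"),
    Area  = c("Rural", "Suburban", "Urban")
  )
)

# Expected counts under no association.
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Pearson residual for each cell: (observed - expected) / sqrt(expected).
pearson_res <- (obs - expected) / sqrt(expected)

# Row and column indices of cells whose residual exceeds 2 in absolute value.
large_cells <- which(abs(pearson_res) > 2, arr.ind = TRUE)

signif(pearson_res, 3)
large_cells
```

The sign of a flagged residual tells the direction: negative means fewer students than expected in that cell, positive means more.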
# The Pearson residuals
chisq_ga$residuals %>% signif(3)
         Area
Goals      Rural Suburban  Urban
  Grades  -2.280    1.020  1.150
  Popular  0.912   -0.381 -0.484
  Sports   2.630   -1.210 -1.300
# The sum of the squared residuals is the chi-squared statistic:
chisq_ga$residuals^2 %>% signif(3)
         Area
Goals     Rural Suburban Urban
  Grades  5.190    1.030 1.320
  Popular 0.832    0.145 0.234
  Sports  6.930    1.450 1.680
sum(chisq_ga$residuals^2)
[1] 18.82763
(1 p) Interpret the Pearson residuals.
The mosaic plot is a visual representation of the observed frequencies (area of each box) and the Pearson residuals (color bar). If the rectangles are the same size along rows and columns (and gray), then the observed frequencies are close to expected. Differences between observed and expected frequencies are indicated by differently sized rectangles across rows or down columns, and colors indicate substantial contributions to the \(X^2\) statistic.
# mosaic plot
library(vcd)
# mosaic(tab_GoalsArea, shade = TRUE, legend = TRUE)
# this layout gives us the interpretation we want:
mosaic(~ Area + Goals, data = dat_kids, shade = TRUE, legend = TRUE, direction = "v")
vcd::mosaic(~ Area + Goals, data = dat_kids, main = "Kids: Area and Goals", shade = TRUE, legend = TRUE, direction = "v")