Include your answers in this document in the sections below the rubric where I have point values indicated (1 p).

Answer the questions with the data example.

Subjects were students in grades 4-6 from three school districts in Ingham and Clinton Counties, Michigan. Chase and Dummer stratified their sample, selecting students from urban, suburban, and rural school districts with approximately 1/3 of their sample coming from each district. **Students indicated whether good grades, athletic ability, or popularity was most important to them.** They also ranked four factors: grades, sports, looks, and money, in order of their importance for popularity. The questionnaire also asked for gender, grade level, and other demographic information.

**Reference:** Chase, M. A., and Dummer, G. M. (1992), “The Role of Sports as a Social Determinant for Children,” *Research Quarterly for Exercise and Sport*, 63, 418-424.

```
library(tidyverse)
```

```
## -- Attaching packages ---------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.2.1     v purrr   0.3.2
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   1.0.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
```

```
## -- Conflicts ------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
```

```
# read data
## original source: http://lib.stat.cmu.edu/DASL/Datafiles/PopularKids.html
fn_dat <- "http://statacumen.com/teach/ADA1/worksheet/ADA1_WS_20_PopularKids.dat"
dat <- read_delim(fn_dat, skip = 39, delim = " ")
```

```
## Parsed with column specification:
## cols(
##   Gender = col_character(),
##   Grade = col_double(),
##   Age = col_double(),
##   Race = col_character(),
##   Area = col_character(),
##   School = col_character(),
##   Goals = col_character(),
##   Grades = col_double(),
##   Sports = col_double(),
##   Looks = col_double(),
##   Money = col_double()
## )
```

```
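# keep the three school areas and the three variables of interest,
# then convert each variable to a factor
dat <-
  dat %>%
  filter(
    Area %in% c("Urban", "Rural", "Suburban")
  ) %>%
  select(
    Goals
  , Grade
  , Area
  ) %>%
  mutate(
    Grade = factor(Grade)
  , Goals = factor(Goals)
  , Area  = factor(Area )
  )
summary(dat)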
```

```
##      Goals     Grade         Area
##  Grades :247   4:119   Rural   :149
##  Popular:141   5:176   Suburban:151
##  Sports : 90   6:183   Urban   :178
```

We will investigate the association between two pairs of categorical variables: (1) Goals (grades, popular, sports) vs Grade level (4, 5, 6) and (2) Goals (grades, popular, sports) vs School area (rural, suburban, urban).

You’ll perform this hypothesis test for both scenarios below.

1. Set up the **null and alternative hypotheses** in words and notation.
   - In words: “There is an association between [row variable] and [column variable].” (Note that the statement in words is in terms of the alternative hypothesis.)
   - In notation: \(H_0: p(i \textrm{ and } j) = p(i)p(j)\) versus \(H_A: p(i \textrm{ and } j) \ne p(i)p(j)\), for all row categories \(i\) and column categories \(j\). The null hypothesis is another way of saying that the probability of row category \(i\) conditional on column category \(j\), \(p(i|j) = p(i)\), does not depend on which column category \(j\) we consider.
2. Choose the **significance level** of the test, such as \(\alpha=0.05\).
3. Compute the **test statistic**, such as \(X^2\).
4. Compute the **\(p\)-value** from the test statistic with degrees of freedom \(df = (R-1)(C-1)\), where \(R\) and \(C\) are the number of rows and columns.
5. State the **conclusion** in terms of the problem.
   - Reject \(H_0\) in favor of \(H_A\) if \(p\textrm{-value} < \alpha\).
   - Fail to reject \(H_0\) if \(p\textrm{-value} \ge \alpha\). (Note: We DO NOT *accept* \(H_0\).)
6. **Check assumptions** of the test (expected count for each cell is at least 5, or at least 1 provided the expected counts are not too variable).
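To make the test statistic and \(p\)-value steps concrete, here is a minimal sketch that computes them directly from a table of counts, without `chisq.test()`. It is not part of the original worksheet: the object names (`obs`, `expected`, `X2`, `p_value`) are illustrative, and the counts are the Goals-by-Grade frequencies from this data set, entered by hand so the block runs on its own.

```r
# Observed counts (Goals by Grade), typed in by hand for a self-contained example
obs <- matrix(
  c(63, 88, 96,
    31, 55, 55,
    25, 33, 32),
  nrow = 3, byrow = TRUE,
  dimnames = list(Goals = c("Grades", "Popular", "Sports"),
                  Grade = c("4", "5", "6"))
)

n <- sum(obs)

# Expected counts under independence: (row total * column total) / n
expected <- outer(rowSums(obs), colSums(obs)) / n

# Pearson chi-squared statistic and degrees of freedom (R - 1)(C - 1)
X2 <- sum((obs - expected)^2 / expected)
df <- (nrow(obs) - 1) * (ncol(obs) - 1)

# p-value: upper-tail area of the chi-squared distribution with df degrees of freedom
p_value <- pchisq(X2, df = df, lower.tail = FALSE)

c(X2 = X2, df = df, p_value = p_value)
```

The values should agree with what `chisq.test(..., correct = FALSE)` reports for the same table.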

We start by summarizing the frequencies and column proportions. The column proportions are interpreted as the proportion of students with each goal for (or conditional on) each Grade. If these column proportions are roughly the same for each Grade, then that’s an indication of no association.

```
# Tabulate by two categorical variables:
tab_GoalsGrade <- xtabs(~ Goals + Grade, data = dat)
tab_GoalsGrade
```

```
##          Grade
## Goals      4  5  6
##   Grades  63 88 96
##   Popular 31 55 55
##   Sports  25 33 32
```

```
# column proportions
prop.table(tab_GoalsGrade, margin = 2)
```

```
##          Grade
## Goals             4         5         6
##   Grades  0.5294118 0.5000000 0.5245902
##   Popular 0.2605042 0.3125000 0.3005464
##   Sports  0.2100840 0.1875000 0.1748634
```
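If there is no association, each column of proportions should resemble the overall (marginal) distribution of Goals. A minimal sketch of that comparison follows. The counts are re-entered by hand so the block stands alone, and the names `obs`, `col_prop`, and `marg_prop` are illustrative rather than taken from the worksheet.

```r
# Observed counts (Goals by Grade), typed in by hand
obs <- matrix(
  c(63, 88, 96,
    31, 55, 55,
    25, 33, 32),
  nrow = 3, byrow = TRUE,
  dimnames = list(Goals = c("Grades", "Popular", "Sports"),
                  Grade = c("4", "5", "6"))
)

# Column proportions: distribution of Goals within each Grade
col_prop <- prop.table(obs, margin = 2)

# Marginal proportions: distribution of Goals ignoring Grade
marg_prop <- rowSums(obs) / sum(obs)

# Difference between each column and the marginal distribution;
# values near zero are what "no association" would look like
round(sweep(col_prop, 1, marg_prop), 3)
```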

**(1 p)** Set up the **null and alternative hypotheses** in words and notation.

- In words: “There is an association between [row variable] and [column variable].”
- In notation: \(H_0: p(i \textrm{ and } j) = p(i)p(j)\) versus \(H_A: p(i \textrm{ and } j) \ne p(i)p(j)\), for all row categories \(i\) and column categories \(j\).

Choose the **significance level** of the test, such as \(\alpha=0.05\).

Compute the **test statistic**, such as \(X^2\).

```
chisq_gg <- chisq.test(tab_GoalsGrade, correct=FALSE)
chisq_gg
```

```
##
## Pearson's Chi-squared test
##
## data: tab_GoalsGrade
## X-squared = 1.3121, df = 4, p-value = 0.8593
```

Use `names(chisq_gg)` to list the objects available to report.

The test statistic is \(X^2 = 1.31\).

The p-value \(= 0.859\).
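The reported values can be pulled from the test object instead of being copied by eye. This is a sketch, relying on the standard `statistic`, `parameter`, and `p.value` components of the object returned by `chisq.test()`. The table is re-entered by hand so the block is self-contained, and `obs` and `fit` are illustrative names.

```r
# Observed counts (Goals by Grade), typed in by hand
obs <- matrix(
  c(63, 88, 96,
    31, 55, 55,
    25, 33, 32),
  nrow = 3, byrow = TRUE,
  dimnames = list(Goals = c("Grades", "Popular", "Sports"),
                  Grade = c("4", "5", "6"))
)

fit <- chisq.test(obs, correct = FALSE)

names(fit)                     # components available to report

X2 <- unname(fit$statistic)    # test statistic
df <- unname(fit$parameter)    # degrees of freedom
p  <- fit$p.value              # p-value

# The same p-value, recomputed from the statistic and df
p_check <- pchisq(X2, df = df, lower.tail = FALSE)

round(c(X2 = X2, df = df, p = p), 3)
```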

**(1 p)** State the **conclusion** in terms of the problem.

**Check assumptions** of the test (expected count for each cell is at least 5, or at least 1 provided the expected counts are not too variable).

```
# table of expected frequencies:
chisq_gg$expected
```

```
##          Grade
## Goals            4        5        6
##   Grades  61.49163 90.94561 94.56276
##   Popular 35.10251 51.91632 53.98117
##   Sports  22.40586 33.13808 34.45607
```

```
# smallest expected frequency:
min(chisq_gg$expected)
```

`## [1] 22.40586`
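The expected-count rule of thumb can also be checked programmatically. This is a sketch only: the thresholds 5 and 1 come from the assumption statement above, the variable names are illustrative, and whether the expected counts are “not too variable” remains a judgment the code does not make.

```r
# Observed counts (Goals by Grade), typed in by hand
obs <- matrix(
  c(63, 88, 96,
    31, 55, 55,
    25, 33, 32),
  nrow = 3, byrow = TRUE,
  dimnames = list(Goals = c("Grades", "Popular", "Sports"),
                  Grade = c("4", "5", "6"))
)

expected <- chisq.test(obs, correct = FALSE)$expected

# Strict version of the rule: every expected count is at least 5
all_at_least_5 <- all(expected >= 5)

# Relaxed version: every expected count is at least 1
all_at_least_1 <- all(expected >= 1)

min(expected)
all_at_least_5
all_at_least_1
```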

**(1 p)** Are the model assumptions met?

If you rejected the null hypothesis above, the Pearson residuals indicate which cells of the table differ from what was expected. Residuals more extreme than roughly \(\pm 2\) are considered “large”.

```
# The Pearson residuals
chisq_gg$residuals
```

```
##          Grade
## Goals              4          5          6
##   Grades   0.1923532 -0.3088758  0.1477981
##   Popular -0.6924375  0.4279743  0.1386692
##   Sports   0.5480409 -0.0239857 -0.4184151
```
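Each Pearson residual is (observed − expected) / √expected for its cell. The sketch below recomputes the residuals by hand and compares them with the `residuals` component of the test object. The counts are re-entered so the block runs on its own, and `obs`, `fit`, and `resid_by_hand` are illustrative names.

```r
# Observed counts (Goals by Grade), typed in by hand
obs <- matrix(
  c(63, 88, 96,
    31, 55, 55,
    25, 33, 32),
  nrow = 3, byrow = TRUE,
  dimnames = list(Goals = c("Grades", "Popular", "Sports"),
                  Grade = c("4", "5", "6"))
)

fit <- chisq.test(obs, correct = FALSE)

# Pearson residual for each cell: (observed - expected) / sqrt(expected)
resid_by_hand <- (obs - fit$expected) / sqrt(fit$expected)

# Should match the residuals stored in the test object
all.equal(resid_by_hand, fit$residuals, check.attributes = FALSE)

# Squared residuals sum to the chi-squared statistic
sum(resid_by_hand^2)
```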

```
# The sum of the squared residuals is the chi-squared statistic:
chisq_gg$residuals^2
```

```
##          Grade
## Goals                4            5            6
##   Grades  0.0369997439 0.0954042654 0.0218442699
##   Popular 0.4794697546 0.1831619633 0.0192291383
##   Sports  0.3003488704 0.0005753138 0.1750711958
```

`sum(chisq_gg$residuals^2)`

`## [1] 1.312105`

**(1 p)** Interpret the Pearson residuals.

The mosaic plot is a visual representation of the observed frequencies (areas of each box) and the Pearson residuals (color bar). If rectangles are the same size along rows and columns (and gray), then the observed frequencies are close to expected. Differences between observed and expected frequencies are indicated by different-sized rectangles across rows or down columns, and colors indicate substantial contributions to the \(X^2\) statistic.

```
# mosaic plot
library(vcd)
```

`## Loading required package: grid`

```
#mosaic(tab_GoalsGrade, shade=TRUE, legend=TRUE)
# this layout gives us the interpretation we want:
mosaic(~ Grade + Goals, data = dat, shade=TRUE, legend=TRUE, direction = "v")
```

**(1 p)** Interpret the mosaic plot.

*Repeat the analysis above, but compare School area instead of Grade level.*

```
# Tabulate by two categorical variables:
tab_GoalsArea <- xtabs(~ Goals + Area, data = dat)
tab_GoalsArea
```

```
##          Area
## Goals     Rural Suburban Urban
##   Grades     57       87   103
##   Popular    50       42    49
##   Sports     42       22    26
```

```
# column proportions
prop.table(tab_GoalsArea, margin = 2)
```

```
##          Area
## Goals         Rural  Suburban     Urban
##   Grades  0.3825503 0.5761589 0.5786517
##   Popular 0.3355705 0.2781457 0.2752809
##   Sports  0.2818792 0.1456954 0.1460674
```

**(1 p)** Set up the **null and alternative hypotheses** in words and notation.

- In words: “There is an association between [row variable] and [column variable].”
- In notation: \(H_0: p(i \textrm{ and } j) = p(i)p(j)\) versus \(H_A: p(i \textrm{ and } j) \ne p(i)p(j)\), for all row categories \(i\) and column categories \(j\).

Choose the **significance level** of the test, such as \(\alpha=0.05\).

Compute the **test statistic**, such as \(X^2\).

```
chisq_ga <- chisq.test(tab_GoalsArea, correct=FALSE)
chisq_ga
```

```
##
## Pearson's Chi-squared test
##
## data: tab_GoalsArea
## X-squared = 18.828, df = 4, p-value = 0.0008497
```

Use `names(chisq_ga)` to list the objects available to report.

The test statistic is \(X^2 = 18.8\).

The p-value \(= 8.5\times 10^{-4}\).
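The decision rule from the procedure amounts to a single comparison of the \(p\)-value with \(\alpha\). This sketch assumes \(\alpha = 0.05\), the example level suggested earlier. The Goals-by-Area counts are re-entered by hand, and `obs_area`, `fit`, and `decision` are illustrative names. Stating the conclusion in terms of the problem is still left to the reader.

```r
# Observed counts (Goals by Area), typed in by hand
obs_area <- matrix(
  c(57, 87, 103,
    50, 42,  49,
    42, 22,  26),
  nrow = 3, byrow = TRUE,
  dimnames = list(Goals = c("Grades", "Popular", "Sports"),
                  Area  = c("Rural", "Suburban", "Urban"))
)

alpha <- 0.05   # assumed significance level
fit   <- chisq.test(obs_area, correct = FALSE)

# Reject H0 only when the p-value is strictly below alpha
decision <- if (fit$p.value < alpha) "Reject H0" else "Fail to reject H0"

signif(fit$p.value, 3)
decision
```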

**(1 p)** State the **conclusion** in terms of the problem.

**Check assumptions** of the test (expected count for each cell is at least 5, or at least 1 provided the expected counts are not too variable).

```
# table of expected frequencies:
chisq_ga$expected
```

```
##          Area
## Goals        Rural Suburban    Urban
##   Grades  76.99372 78.02720 91.97908
##   Popular 43.95188 44.54184 52.50628
##   Sports  28.05439 28.43096 33.51464
```

```
# smallest expected frequency:
min(chisq_ga$expected)
```

`## [1] 28.05439`

**(1 p)** Are the model assumptions met?

If you rejected the null hypothesis above, the Pearson residuals indicate which cells of the table differ from what was expected. Residuals more extreme than roughly \(\pm 2\) are considered “large”.

```
# The Pearson residuals
chisq_ga$residuals
```

```
##          Area
## Goals          Rural   Suburban      Urban
##   Grades  -2.2785892  1.0157928  1.1491411
##   Popular  0.9122869 -0.3808591 -0.4838832
##   Sports   2.6329158 -1.2060913 -1.2980491
```

```
# The sum of the squared residuals is the chi-squared statistic:
chisq_ga$residuals^2
```

```
##          Area
## Goals         Rural  Suburban     Urban
##   Grades  5.1919686 1.0318351 1.3205252
##   Popular 0.8322674 0.1450536 0.2341429
##   Sports  6.9322457 1.4546562 1.6849315
```

`sum(chisq_ga$residuals^2)`

`## [1] 18.82763`
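To pick out the cells driving the statistic, the residuals can be filtered with the rough \(\pm 2\) cutoff mentioned above. In this sketch the cutoff is that rule of thumb rather than a formal test, the counts are re-entered by hand, and `obs_area`, `res`, and `large` are illustrative names.

```r
# Observed counts (Goals by Area), typed in by hand
obs_area <- matrix(
  c(57, 87, 103,
    50, 42,  49,
    42, 22,  26),
  nrow = 3, byrow = TRUE,
  dimnames = list(Goals = c("Grades", "Popular", "Sports"),
                  Area  = c("Rural", "Suburban", "Urban"))
)

res <- chisq.test(obs_area, correct = FALSE)$residuals

# Row and column positions of residuals beyond +/- 2
large <- which(abs(res) > 2, arr.ind = TRUE)

# Label the flagged cells with their category names
data.frame(
  Goals    = rownames(res)[large[, 1]],
  Area     = colnames(res)[large[, 2]],
  residual = round(res[large], 2)
)
```

A positive residual means more students than expected in that cell, and a negative residual means fewer.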

**(1 p)** Interpret the Pearson residuals.

The mosaic plot is a visual representation of the observed frequencies (areas of each box) and the Pearson residuals (color bar). If rectangles are the same size along rows and columns (and gray), then the observed frequencies are close to expected. Differences between observed and expected frequencies are indicated by different-sized rectangles across rows or down columns, and colors indicate substantial contributions to the \(X^2\) statistic.

```
# mosaic plot
library(vcd)
#mosaic(tab_GoalsArea, shade=TRUE, legend=TRUE)
# this layout gives us the interpretation we want:
mosaic(~ Area + Goals, data = dat, shade=TRUE, legend=TRUE, direction = "v")
```

**(1 p)** Interpret the mosaic plot.