# 5  One-Way Analysis of Variance

You are reading the work-in-progress online edition of “Statistical Acumen: Advanced Data Analysis.”

This chapter should be readable but is currently undergoing final polishing.

### Learning objectives

After completing this topic, you should be able to:

1. select graphical displays that meaningfully compare independent populations.

2. assess the assumptions of the ANOVA visually and by formal tests.

3. decide whether the population means differ, and if so, how.

Achieving these goals contributes to mastery in these course learning outcomes:

1. organize knowledge.

5. define parameters of interest and hypotheses in words and notation.

6. summarize data visually, numerically, and descriptively.

8. use statistical software.

12. make evidence-based decisions.

Code
set.seed(76543)
library(erikmisc)
library(tidyverse)

## 5.1 ANOVA

The one-way analysis of variance (ANOVA) is a generalization of the two sample $$t$$-test to $$k \geq 2$$ groups. Assume that the populations of interest have the following (unknown) population means and standard deviations:

|         | population 1 | population 2 | $$\cdots$$ | population $$k$$ |
|---------|--------------|--------------|------------|------------------|
| mean    | $$\mu_1$$    | $$\mu_2$$    | $$\cdots$$ | $$\mu_k$$        |
| std dev | $$\sigma_1$$ | $$\sigma_2$$ | $$\cdots$$ | $$\sigma_k$$     |

A usual interest in ANOVA is whether $$\mu_1=\mu_2=\cdots=\mu_k$$. If not, then we wish to know which means differ, and by how much. To answer these questions we select samples from each of the $$k$$ populations, leading to the following data summary:

|         | sample 1      | sample 2      | $$\cdots$$ | sample $$k$$  |
|---------|---------------|---------------|------------|---------------|
| size    | $$n_1$$       | $$n_2$$       | $$\cdots$$ | $$n_k$$       |
| mean    | $$\bar{Y}_1$$ | $$\bar{Y}_2$$ | $$\cdots$$ | $$\bar{Y}_k$$ |
| std dev | $$s_1$$       | $$s_2$$       | $$\cdots$$ | $$s_k$$       |

A little more notation is needed for the discussion. Let $$Y_{ij}$$ denote the $$j^{\textrm{th}}$$ observation in the $$i^{\textrm{th}}$$ sample and define the total sample size $$n^* = n_1+n_2+\cdots+n_k$$. Finally, let $$\bar{\bar{Y}}$$ be the average response over all samples (combined), that is $\bar{\bar{Y}} = \frac{\sum_{ij} Y_{ij}}{n^*} = \frac{\sum_i n_i \bar{Y}_i}{n^*}.$ Note that $$\bar{\bar{Y}}$$ is not the average of the sample means, unless the sample sizes $$n_i$$ are equal.
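To see this last point concretely, here is a quick numerical check (a Python sketch with made-up data, not part of the chapter's R code): the grand mean equals the size-weighted average of the group means, and differs from the simple average of the means when the $$n_i$$ are unequal.

```python
# With unequal group sizes, the grand mean is the size-weighted average of the
# group means, not their simple average (made-up data: n1 = 3, n2 = 4).
groups = [
    [1.0, 2.0, 3.0],           # mean 2.0
    [10.0, 12.0, 14.0, 16.0],  # mean 13.0
]

all_obs = [y for g in groups for y in g]
grand_mean = sum(all_obs) / len(all_obs)        # sum_ij Y_ij / n*

weighted = sum(len(g) * (sum(g) / len(g)) for g in groups) / len(all_obs)
simple   = sum(sum(g) / len(g) for g in groups) / len(groups)

assert abs(grand_mean - weighted) < 1e-12  # identical
assert abs(grand_mean - simple) > 0.1      # differs when the n_i are unequal
```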

An $$F$$-statistic is used to test $$H_0:\mu_1=\mu_2=\cdots=\mu_k$$ against $$H_A:\;{\rm not}\;H_0$$ (that is, at least two means are different). The assumptions needed for the standard ANOVA $$F$$-test are analogous to the independent pooled two-sample $$t$$-test assumptions: (1) Independent random samples from each population. (2) The population frequency curves are normal. (3) The populations have equal standard deviations, $$\sigma_1=\sigma_2=\cdots =\sigma_k$$.

The $$F$$-test is computed from the ANOVA table, which breaks the spread in the combined data set into two components, or Sums of Squares (SS). The Within SS, often called the Residual SS or the Error SS, is the portion of the total spread due to variability within samples:

SS(Within) = $$(n_1-1)s^2_1 + (n_2-1)s^2_2 + \cdots + (n_k-1)s^2_k = \sum_{ij}(Y_{ij}-\bar{Y}_i)^2$$.

The Between SS, often called the Model SS, measures the spread between the sample means

SS(Between) = $$n_1(\bar{Y}_1-\bar{\bar{Y}})^2 + n_2(\bar{Y}_2-\bar{\bar{Y}})^2 + \cdots + n_k(\bar{Y}_k-\bar{\bar{Y}})^2=\sum_i n_i(\bar{Y}_i-\bar{\bar{Y}})^2$$,

weighted by the sample sizes. These two SS add to give

SS(Total) = SS(Between) $$+$$ SS(Within) = $$\sum_{ij}(Y_{ij}-\bar{\bar{Y}})^2$$.

Each SS has its own degrees of freedom ($$df$$). The $$df$$(Between) is the number of groups minus one, $$k-1$$. The $$df$$(Within) is the total number of observations minus the number of groups: $$(n_1-1) + (n_2-1)+ \cdots +(n_k-1) = n^* - k$$. These two $$df$$ add to give $$df$$(Total) $$= (k-1) + (n^*-k) = n^*-1$$.
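Both decompositions are easy to verify numerically. The following Python sketch (made-up data, not from this chapter) checks that the two Sums of Squares and the degrees of freedom add up as claimed:

```python
# Check of the sums-of-squares and degrees-of-freedom decompositions
# (made-up data: three groups of unequal size).
groups = [
    [3.0, 5.0, 4.0],
    [8.0, 9.0, 11.0, 12.0],
    [1.0, 2.0],
]
k = len(groups)
all_obs = [y for g in groups for y in g]
n_star = len(all_obs)
grand = sum(all_obs) / n_star

means = [sum(g) / len(g) for g in groups]
ss_within  = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
ss_total   = sum((y - grand) ** 2 for y in all_obs)

assert abs(ss_between + ss_within - ss_total) < 1e-9   # SST = SSB + SSW
assert (k - 1) + (n_star - k) == n_star - 1            # df add the same way
```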

The Sums of Squares and $$df$$ are neatly arranged in a table, called the ANOVA table:

| Source                 | $$df$$        | SS                                            | MS                | F           |
|------------------------|---------------|-----------------------------------------------|-------------------|-------------|
| Between Groups (Model) | $$dfM=k-1$$   | $$SSM=\sum_i n_i(\bar{Y}_i-\bar{\bar{Y}})^2$$ | $$MSM = SSM/dfM$$ | $$MSM/MSE$$ |
| Within Groups (Error)  | $$dfE=n^*-k$$ | $$SSE=\sum_i (n_i-1)s^2_i$$                   | $$MSE = SSE/dfE$$ |             |
| Total                  | $$dfT=n^*-1$$ | $$SST=\sum_{ij}(Y_{ij}-\bar{\bar{Y}})^2$$     | $$MST = SST/dfT$$ |             |

The Mean Square for each source of variation is the corresponding SS divided by its $$df$$. The Mean Squares can be easily interpreted.

The MS(Within) $\textrm{MS(Within)} = \frac{(n_1-1)s^2_1 + (n_2-1)s^2_2 + \cdots + (n_k-1)s^2_k} {n^*-k} = s^2_{\textrm{pooled}}$ is a weighted average of the sample variances. The MS(Within) is known as the pooled estimator of variance, and estimates the assumed common population variance. If all the sample sizes are equal, the MS(Within) is the average sample variance. The MS(Within) is identical to the pooled variance estimator in a two-sample problem when $$k=2$$.

The MS(Between) $\textrm{MS(Between)} = \frac{\sum_i n_i(\bar{Y}_i-\bar{\bar{Y}})^2}{k-1}$ is a measure of variability among the sample means. This MS is a multiple of the sample variance of $$\bar{Y}_1,\;\bar{Y}_2,\ldots,\bar{Y}_k$$ when all the sample sizes are equal.

The MS(Total) $\textrm{MS(Total)} = \frac{\sum_{ij}(Y_{ij}-\bar{\bar{Y}})^2}{n^*-1}$ is the variance in the combined data set.

The decision on whether to reject $$H_0:\mu_1=\mu_2=\cdots=\mu_k$$ is based on the ratio of the MS(Between) and the MS(Within): $F_{s} = \frac{\textrm{MS(Between)}}{\textrm{MS(Within)}}.$ Large values of $$F_{s}$$ indicate large variability among the sample means $$\bar{Y}_1,\;\bar{Y}_2,\ldots,\bar{Y}_k$$ relative to the spread of the data within samples. That is, large values of $$F_{s}$$ suggest that $$H_0$$ is false.

Formally, for a size $$\alpha$$ test, reject $$H_0$$ if $$F_{s} \geq F_{crit}$$, where $$F_{crit}$$ is the upper-$$\alpha$$ percentile from an $$F$$ distribution with numerator degrees of freedom $$k-1$$ and denominator degrees of freedom $$n^*-k$$ (i.e., the $$df$$ for the numerator and denominator of the $$F$$-ratio). The p-value for the test is the area under the $$F$$-probability curve to the right of $$F_{s}$$.
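The rejection rule and p-value can be computed directly from the $$F$$ distribution. A Python sketch, assuming SciPy; the values plugged in ($$F_s = 7.948$$ with 3 and 20 $$df$$) are the doughnut-example results reported later in this chapter:

```python
# Rejection rule and p-value for the one-way ANOVA F-test (a sketch assuming
# SciPy).  F_s = 7.948 with 3 and 20 df are the doughnut-example results
# reported later in this chapter.
from scipy import stats

F_s, df_num, df_den = 7.948, 3, 20

p_value = stats.f.sf(F_s, df_num, df_den)        # area to the right of F_s
F_crit  = stats.f.ppf(1 - 0.05, df_num, df_den)  # upper-0.05 percentile

assert F_s > F_crit                # reject H0 at the alpha = 0.05 level
assert 0.0008 < p_value < 0.0014   # consistent with the reported 0.0011
```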

For $$k=2$$ the ANOVA $$F$$-test is equivalent to the pooled two-sample $$t$$-test.
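This equivalence is easy to confirm numerically: with two groups, the $$F$$ statistic equals the square of the pooled $$t$$ statistic and the p-values coincide. A Python sketch assuming SciPy, with made-up data:

```python
# For k = 2 groups the ANOVA F-test and the pooled two-sample t-test agree:
# F = t^2 and the p-values coincide (a sketch assuming SciPy; made-up data).
from scipy import stats

a = [5.1, 4.9, 6.2, 5.8, 5.5]
b = [6.7, 7.1, 6.4, 7.5, 6.9]

F, p_f = stats.f_oneway(a, b)
t, p_t = stats.ttest_ind(a, b, equal_var=True)  # pooled two-sample t-test

assert abs(F - t ** 2) < 1e-8
assert abs(p_f - p_t) < 1e-10
```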

The specification of a one-way analysis of variance is analogous to a regression analysis (discussed later, though we’ve seen the specification in some plotting functions). The only difference is that the explanatory ($$x$$) variable needs to be a factor and not a numeric variable. We calculate a model object using lm() and extract the analysis of variance table with anova().

##### 5.1.0.0.1 Example: Comparison of Fats

During cooking, doughnuts absorb fat in various amounts. A scientist wished to learn whether the amount absorbed depends on the type of fat. For each of 4 fats, 6 batches of 24 doughnuts were prepared. The data are grams of fat absorbed per batch.

Let

$$\mu_i$$ = population mean grams of fat $$i$$ absorbed per batch of 24 doughnuts ($$-100$$).

The scientist wishes to test $$H_0: \mu_1=\mu_2=\mu_3=\mu_4$$ against $$H_A:\;{\rm not}\;H_0$$. There is no strong evidence against normality here. Furthermore the sample standard deviations (see output below) are close. The standard ANOVA appears to be appropriate here.

| Row | fat1 | fat2 | fat3 | fat4 |
|-----|------|------|------|------|
| 1   | 164  | 178  | 175  | 155  |
| 2   | 172  | 191  | 186  | 166  |
| 3   | 168  | 197  | 178  | 149  |
| 4   | 177  | 182  | 171  | 164  |
| 5   | 190  | 185  | 163  | 170  |
| 6   | 176  | 177  | 176  | 168  |

Let’s take a short detour to read the wide table and convert it into long format.

###### 5.1.0.0.1.1 R skills: wide to long table format

Many functions in R expect data to be in a long format rather than a wide format. Let’s use the fat data as an example of how to read a table as text, convert the wide format to long, and then back to wide format.

Code
#### Example: Comparison of Fats
fat <- read.table(text="
Row  fat1  fat2  fat3  fat4
1   164   178   175   155
2   172   191   186   166
3   168   197   178   149
4   177   182   171   164
5   190   185   163   170
6   176   177   176   168
", header=TRUE)
fat
  Row fat1 fat2 fat3 fat4
1   1  164  178  175  155
2   2  172  191  186  166
3   3  168  197  178  149
4   4  177  182  171  164
5   5  190  185  163  170
6   6  176  177  176  168

From wide to long: Use melt() from the reshape2 package.

Code
#### From wide to long format
library(reshape2)

Attaching package: 'reshape2'
The following object is masked from 'package:tidyr':

smiths
Code
fat.long <- melt(fat,
        # id.vars: ID variables
        #   all variables to keep but not split apart on
        id.vars = c("Row"),
        # measure.vars: The source columns
        #   (if unspecified then all other variables are measure.vars)
        measure.vars = c("fat1", "fat2", "fat3", "fat4"),
        # variable.name: Name of the destination column identifying each
        #   original column that the measurement came from
        variable.name = "type",
        # value.name: column name for values in table
        value.name = "amount"
      )
## naming variables manually, the variable.name and value.name not working 11/2012
#names(fat.long) <- c("Row", "type", "amount")
fat.long
   Row type amount
1    1 fat1    164
2    2 fat1    172
3    3 fat1    168
4    4 fat1    177
5    5 fat1    190
6    6 fat1    176
7    1 fat2    178
8    2 fat2    191
9    3 fat2    197
10   4 fat2    182
11   5 fat2    185
12   6 fat2    177
13   1 fat3    175
14   2 fat3    186
15   3 fat3    178
16   4 fat3    171
17   5 fat3    163
18   6 fat3    176
19   1 fat4    155
20   2 fat4    166
21   3 fat4    149
22   4 fat4    164
23   5 fat4    170
24   6 fat4    168
Code
# or as simple as:
# melt(fat, "Row")

If you don’t specify variable.name, it will name that column “variable”, and if you leave out value.name, it will name that column “value”.

From long to wide: Use dcast() from the reshape2 package.

Code
#### From long to wide format
fat.wide <- dcast(fat.long, Row ~ type, value.var = "amount")
fat.wide
  Row fat1 fat2 fat3 fat4
1   1  164  178  175  155
2   2  172  191  186  166
3   3  168  197  178  149
4   4  177  182  171  164
5   5  190  185  163  170
6   6  176  177  176  168
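For readers working outside R, the same wide-to-long round trip can be sketched in Python with pandas, where melt() and pivot() play the roles of reshape2’s melt() and dcast() (this aside is not part of the chapter’s R workflow):

```python
# pandas analogue of the reshape2 workflow above (a sketch, not the chapter's
# R code): melt() goes wide-to-long, pivot() goes back to wide.
import pandas as pd

fat = pd.DataFrame({
    "Row":  [1, 2, 3, 4, 5, 6],
    "fat1": [164, 172, 168, 177, 190, 176],
    "fat2": [178, 191, 197, 182, 185, 177],
    "fat3": [175, 186, 178, 171, 163, 176],
    "fat4": [155, 166, 149, 164, 170, 168],
})

# wide -> long: id_vars plays the role of id.vars; var_name/value_name
# play the roles of variable.name/value.name
fat_long = fat.melt(id_vars="Row", var_name="type", value_name="amount")
assert fat_long.shape == (24, 3)

# long -> wide: pivot() plays the role of dcast()
fat_wide = fat_long.pivot(index="Row", columns="type", values="amount").reset_index()
assert fat_wide.shape == (6, 5)
```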

Now that we’ve got our data in the long format, let’s return to the ANOVA.

##### 5.1.0.0.2 Back to ANOVA

Let’s look at the numerical summaries. We’ve seen other ways of computing these so I’ll show you another way.

Code
#### Back to ANOVA
# Calculate the mean, sd, n, and se for the four fats

# The plyr package is an advanced way to apply a function to subsets of data
#   "Tools for splitting, applying and combining data"
library(plyr)
---------------------------------------------------------------------------
You have loaded plyr after dplyr - this is likely to cause problems.
If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
library(plyr); library(dplyr)
---------------------------------------------------------------------------

Attaching package: 'plyr'
The following object is masked from 'package:purrr':

compact
The following objects are masked from 'package:dplyr':

arrange, count, desc, failwith, id, mutate, rename, summarise,
summarize
Code
# ddply "dd" means the input and output are both data.frames
fat.summary <- ddply(fat.long,
      "type",
      function(X) {
        data.frame( m = mean(X$amount), s = sd(X$amount),
                    n = length(X$amount) ) } )
# standard errors
fat.summary$se   <- fat.summary$s/sqrt(fat.summary$n)
# individual confidence limits
fat.summary$ci.l <- fat.summary$m - qt(1-.05/2, df=fat.summary$n-1) * fat.summary$se
fat.summary$ci.u <- fat.summary$m + qt(1-.05/2, df=fat.summary$n-1) * fat.summary$se
fat.summary
  type        m        s n       se     ci.l     ci.u
1 fat1 174.5000 9.027735 6 3.685557 165.0260 183.9740
2 fat2 185.0000 7.771744 6 3.172801 176.8441 193.1559
3 fat3 174.8333 7.626707 6 3.113590 166.8296 182.8371
4 fat4 162.0000 8.221922 6 3.356586 153.3716 170.6284

Let’s plot the data with boxplots, individual points, mean, and CI by fat type.

Code
# Plot the data using ggplot
library(ggplot2)
p <- ggplot(fat.long, aes(x = type, y = amount))
# plot a reference line for the global mean (assuming no groups)
p <- p + geom_hline(yintercept = mean(fat.long$amount),
                    colour = "black", linetype = "dashed", size = 0.3, alpha = 0.5)

Warning: Using size aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use linewidth instead.

Code
# boxplot, size=.75 to stand out behind CI
p <- p + geom_boxplot(size = 0.75, alpha = 0.5)
# points for observed data
p <- p + geom_point(position = position_jitter(w = 0.05, h = 0), alpha = 0.5)
# diamond at mean for each group
p <- p + stat_summary(fun.y = mean, geom = "point", shape = 18, size = 6,
                      colour = "red", alpha = 0.8)

Warning: The fun.y argument of stat_summary() is deprecated as of ggplot2 3.3.0.
ℹ Please use the fun argument instead.

Code
# confidence limits based on normal distribution
p <- p + stat_summary(fun.data = "mean_cl_normal", geom = "errorbar",
                      width = .2, colour = "red", alpha = 0.8)
p <- p + labs(title = "Doughnut fat absorption") + ylab("amount absorbed (g)")
print(p)

The p-value for the $$F$$-test is 0.001. The scientist would reject $$H_0$$ at any of the usual test levels (such as 0.05 or 0.01). The data suggest that the population mean absorption rates differ across fats in some way. The $$F$$-test does not say how they differ. The pooled standard deviation $$s_{\textrm{pooled}}=8.18$$ is the “Residual standard error”. We’ll ignore the rest of this output for now.

Code
fit.f <- aov(amount ~ type, data = fat.long)
summary(fit.f)
            Df Sum Sq Mean Sq F value Pr(>F)   
type         3   1596   531.8   7.948 0.0011 **
Residuals   20   1338    66.9                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
fit.f
Call:
   aov(formula = amount ~ type, data = fat.long)

Terms:
                    type Residuals
Sum of Squares  1595.500  1338.333
Deg. of Freedom        3        20

Residual standard error: 8.180261
Estimated effects may be unbalanced

## 5.2 Multiple Comparison Methods: Fisher’s Method

The ANOVA $$F$$-test checks whether all the population means are equal.
Multiple comparisons are often used as a follow-up to a significant ANOVA $$F$$-test to determine which population means are different. I will discuss Fisher’s, Bonferroni’s, and Tukey’s methods for comparing all pairs of means.

Fisher’s least significant difference method (LSD or FSD) is a two-step process:

1. Carry out the ANOVA $$F$$-test of $$H_0:\mu_1=\mu_2=\cdots=\mu_k$$ at the $$\alpha$$ level. If $$H_0$$ is not rejected, stop and conclude that there is insufficient evidence to claim differences among population means. If $$H_0$$ is rejected, go to step 2.

2. Compare each pair of means using a pooled two-sample $$t$$-test at the $$\alpha$$ level. Use $$s_{\textrm{pooled}}$$ from the ANOVA table and $$df = dfE$$ (Residual).

To see where the name LSD originated, consider the $$t$$-test of $$H_0:\mu_i =\mu_j$$ (i.e., populations $$i$$ and $$j$$ have the same mean). The $$t$$-statistic is $t_{s} = \frac{\bar{Y}_i-\bar{Y}_j}{s_{\textrm{pooled}} \sqrt{\frac{1}{n_i}+\frac{1}{n_j}}}.$ You reject $$H_0$$ if $$|t_{s}| \geq t_{\textrm{crit}}$$, or equivalently, if $|\bar{Y}_i-\bar{Y}_j| \geq t_{\textrm{crit}}s_{\textrm{pooled}} \sqrt{\frac{1}{n_i}+\frac{1}{n_j}}.$ The minimum absolute difference between $$\bar{Y}_i$$ and $$\bar{Y}_j$$ needed to reject $$H_0$$ is the LSD, the quantity on the right hand side of this inequality. If all the sample sizes are equal, $$n_1=n_2=\cdots=n_k$$, then the LSD is the same for each comparison: $LSD = t_{\textrm{crit}}s_{\textrm{pooled}} \sqrt{\frac{2}{n_1}},$ where $$n_1$$ is the common sample size.

I will illustrate Fisher’s method on the doughnut data, using $$\alpha=0.05$$. At the first step, you reject the hypothesis that the population mean absorptions are equal because p-value$$=0.001$$. At the second step, compare all pairs of fats at the 5% level. Here, $$s_{\textrm{pooled}}=8.18$$ and $$t_{\textrm{crit}}=2.086$$ for a two-sided test based on 20 $$df$$ (the $$dfE$$ for Residual SS).
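The critical value and the resulting LSD can be reproduced numerically. A Python sketch assuming SciPy, with $$s_{\textrm{pooled}}=8.18$$ and $$df=20$$ taken from the ANOVA above:

```python
# Reproducing the Fisher LSD ingredients for the doughnut data (a sketch
# assuming SciPy; s_pooled = 8.18 and df = 20 come from the ANOVA above).
from scipy import stats
import math

alpha    = 0.05
df_error = 20
s_pooled = 8.18
n = 6                      # common sample size per fat

t_crit = stats.t.ppf(1 - alpha / 2, df_error)   # two-sided critical value
lsd = t_crit * s_pooled * math.sqrt(2 / n)

assert abs(t_crit - 2.086) < 0.001
assert abs(lsd - 9.85) < 0.01
```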
Each sample has six observations, so the LSD for each comparison is $LSD = 2.086\times 8.18\times \sqrt{\frac{2}{6}} = 9.85.$

Any two sample means that differ by at least 9.85 in magnitude are significantly different at the 5% level. An easy way to compare all pairs of fats is to order the samples by their sample means. The samples can then be grouped easily, noting that two fats are in the same group if the absolute difference between their sample means is smaller than the LSD.

| Fats | Sample Mean |
|------|-------------|
| 2    | 185.00      |
| 3    | 174.83      |
| 1    | 174.50      |
| 4    | 162.00      |

There are six comparisons of two fats. From this table, you can visually assess which sample means differ by at least the LSD=9.85, and which ones do not. For completeness, the table below summarizes each comparison:

| Comparison   | Absolute difference in means | Exceeds LSD? |
|--------------|------------------------------|--------------|
| Fats 2 and 3 | 10.17                        | Yes          |
| Fats 2 and 1 | 10.50                        | Yes          |
| Fats 2 and 4 | 23.00                        | Yes          |
| Fats 3 and 1 | 0.33                         | No           |
| Fats 3 and 4 | 12.83                        | Yes          |
| Fats 1 and 4 | 12.50                        | Yes          |

The end product of the multiple comparisons is usually presented as a collection of groups, where a group is defined to be a set of populations with sample means that are not significantly different from each other. Overlap among groups is common, and occurs when one or more populations appears in two or more groups. Any overlap requires a more careful interpretation of the analysis.

There are three groups for the doughnut data, with no overlap. Fat 2 is in a group by itself, and so is Fat 4. Fats 3 and 1 are in a group together. This information can be summarized by ordering the samples from lowest to highest average, and then connecting the fats in the same group using an underscore:

    FAT 4  FAT 1  FAT 3  FAT 2
    -----  -------------  -----

The results of multiple comparisons must be interpreted carefully. At the 5% level, you have sufficient evidence to conclude that the population mean absorptions for Fat 2 and Fat 4 are each different from the other population means.
However, there is insufficient evidence to conclude that the population mean absorptions for Fats 1 and 3 differ.

#### 5.2.0.1 Be Careful with Interpreting Groups in Multiple Comparisons!

To see why you must be careful when interpreting groupings, suppose you obtain two groups in a three-sample problem. One group has samples 1 and 3. The other group has samples 3 and 2:

    1     3     2
    -----------
          -----------

This occurs, for example, when $$|\bar{Y}_1-\bar{Y}_2| \geq LSD$$, but both $$|\bar{Y}_1-\bar{Y}_3|$$ and $$|\bar{Y}_3-\bar{Y}_2|$$ are less than the LSD. There is a tendency to conclude, and please try to avoid this line of attack, that populations 1 and 3 have the same mean, populations 2 and 3 have the same mean, but populations 1 and 2 have different means. This conclusion is illogical. The groupings imply that we have sufficient evidence to conclude that population means 1 and 2 are different, but insufficient evidence to conclude that population mean 3 differs from either of the other population means.

### 5.2.1 FSD Multiple Comparisons in R

One way to get Fisher comparisons in R uses pairwise.t.test() with p.adjust.method = "none". The resulting summary of the multiple comparisons is in terms of p-values for all pairwise two-sample $$t$$-tests, using the pooled standard deviation from the ANOVA via pool.sd = TRUE. This output can be used to generate groupings. A summary of the p-values is given below. Let us see that we can recover the groups from this output.
Code
#### Multiple Comparisons
# all pairwise comparisons among levels of fat
# Fisher's LSD (FSD) uses "none"
pairwise.t.test(fat.long$amount, fat.long$type,
                pool.sd = TRUE, p.adjust.method = "none")

 Pairwise comparisons using t tests with pooled SD 

data:  fat.long$amount and fat.long$type 

     fat1    fat2    fat3 
fat2 0.038   -       -    
fat3 0.944   0.044   -    
fat4 0.015   9.3e-05 0.013

P value adjustment method: none 

#### 5.2.1.1 Discussion of the FSD Method: family error rate

There are $$c=k(k-1)/2$$ pairs of means to compare in the second step of the FSD method. Each comparison is done at the $$\alpha$$ level, where for a generic comparison of the $$i^{\textrm{th}}$$ and $$j^{\textrm{th}}$$ populations $$\alpha=$$ probability of rejecting $$H_0:\mu_i=\mu_j$$ when $$H_0$$ is true.

The individual error rate is not the only error rate that is important in multiple comparisons. The family error rate (FER), or the experimentwise error rate, is defined to be the probability of at least one false rejection of a true hypothesis $$H_0: \mu_i=\mu_j$$ over all comparisons. When many comparisons are made, you may have a large probability of making one or more false rejections of true null hypotheses. In particular, when all $$c$$ comparisons of two population means are performed, each at the $$\alpha$$ level, then $\alpha < FER < c \alpha.$

For example, in the doughnut problem where $$k=4$$, there are $$c=4(3)/2=6$$ possible comparisons of pairs of fats. If each comparison is carried out at the 5% level, then $$0.05 < FER < 0.30$$. At the second step of the FSD method, you could have up to a 30% chance of claiming one or more pairs of population means are different if no differences existed between population means. Most statistical packages do not evaluate the FER, so the upper bound is used.

The first step of the FSD method is the ANOVA “screening” test. The multiple comparisons are carried out only if the $$F$$-test suggests that not all population means are equal.
This screening test tends to deflate the FER for the two-step FSD procedure. However, the FSD method is commonly criticized for being extremely liberal (too many false rejections of true null hypotheses) when some, but not many, differences exist, especially when the number of comparisons is large. This conclusion is fairly intuitive. When you do a large number of tests, each, say, at the 5% level, then sampling variation alone will suggest differences in 5% of the comparisons where $$H_0$$ is true. The number of false rejections could be enormous with a large number of comparisons. For example, chance variation alone would account for an average of 50 significant differences in 1000 comparisons (about 45 populations), each at the 5% level.

### 5.2.2 Bonferroni Comparisons

The Bonferroni method controls the FER by reducing the individual comparison error rate. The FER is guaranteed to be no larger than a prespecified amount, say $$\alpha$$, by setting the individual error rate for each of the $$c$$ comparisons of interest to $$\alpha/c$$. Therefore, $$\alpha/c < FER < c \alpha/c = \alpha$$, so the upper bound for the FER is $$\alpha$$. Larger differences in the sample means are needed before declaring statistical significance using the Bonferroni adjustment than when using the FSD method at the $$\alpha$$ level. Assuming all comparisons are of interest, you can implement the Bonferroni adjustment in R by specifying p.adjust.method = "bonf". A by-product of the Bonferroni adjustment is that we have at least $$100(1-\alpha)$$% confidence that all pairwise $$t$$-test statements hold simultaneously!
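The effect of the adjustment on the critical value can be sketched numerically (Python with SciPy assumed; $$c=6$$, $$df=20$$, and $$s_{\textrm{pooled}}=8.18$$ are from the doughnut example):

```python
# Bonferroni-adjusted critical value and minimum significant difference for
# the doughnut data (a sketch assuming SciPy; c = 6 comparisons, df = 20, and
# s_pooled = 8.18 come from the ANOVA above).
from scipy import stats
import math

alpha, c = 0.05, 6
df_error, s_pooled, n = 20, 8.18, 6

t_fsd  = stats.t.ppf(1 - alpha / 2, df_error)        # FSD critical value
t_bonf = stats.t.ppf(1 - (alpha / c) / 2, df_error)  # individual level alpha/c

lsd_bonf = t_bonf * s_pooled * math.sqrt(2 / n)

assert t_bonf > t_fsd            # Bonferroni demands a larger difference
assert 13.7 < lsd_bonf < 13.95   # vs LSD = 9.85 for the FSD method
```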
Code
# Bonferroni 95% Individual p-values
# All Pairwise Comparisons among Levels of fat
pairwise.t.test(fat.long$amount, fat.long$type,
                pool.sd = TRUE, p.adjust.method = "bonf")

 Pairwise comparisons using t tests with pooled SD 

data:  fat.long$amount and fat.long$type 

     fat1    fat2    fat3   
fat2 0.22733 -       -      
fat3 1.00000 0.26241 -      
fat4 0.09286 0.00056 0.07960

P value adjustment method: bonferroni 

Looking at the output, can you create the groups? You should get the groups given below, which implies you have sufficient evidence to conclude that the population mean absorption for Fat 2 is different from that for Fat 4.

    FAT 4  FAT 1  FAT 3  FAT 2
    --------------------
           --------------------

The Bonferroni method tends to produce “coarser” groups than the FSD method, because the individual comparisons are conducted at a lower $$\alpha$$ level (here $$\alpha/c$$). Equivalently, the minimum significant difference is inflated for the Bonferroni method. For example, in the doughnut problem with $$FER \leq 0.05$$, the critical value for the individual comparisons at the 0.0083 level is $$t_{\textrm{crit}}=2.929$$ with $$df=20$$. The minimum significant difference for the Bonferroni comparisons is $LSD = 2.929\times 8.18\times \sqrt{\frac{2}{6}} = 13.824$ versus an LSD of 9.85 for the FSD method. Referring back to our table of sample means, we see that the sole comparison where the absolute difference between sample means exceeds 13.824 involves Fats 2 and 4.

##### 5.2.2.0.1 Example from Koopmans: glabella facial tissue thickness

In an anthropological study of facial tissue thickness for different racial groups, data were taken during autopsy at several points on the faces of deceased individuals. The Glabella measurements taken at the bony ridge for samples of individuals from three racial groups (cauc = Caucasian, afam = African American, and naaa = Native American and Asian) follow. The data values are in mm.
Code
#### Example from Koopmans: glabella facial tissue thickness
glabella <- read.table(text="
Row  cauc  afam  naaa
1   5.75  6.00  8.00
2   5.50  6.25  7.00
3   6.75  6.75  6.00
4   5.75  7.00  6.25
5   5.00  7.25  5.50
6   5.75  6.75  4.00
7   5.75  8.00  5.00
8   7.75  6.50  6.00
9   5.75  7.50  7.25
10  5.25  6.25  6.00
11  4.50  5.00  6.00
12  6.25  5.75  4.25
13    NA  5.00  4.75
14    NA    NA  6.00
", header=TRUE)

glabella.long <- melt(glabella,
        id.vars=c("Row"),
        variable.name = "pop",
        value.name = "thickness",
        # remove NAs
        na.rm = TRUE
      )
# naming variables manually, the variable.name and value.name not working 11/2012
names(glabella.long) <- c("Row", "pop", "thickness")
# another way to remove NAs:
#glabella.long <- subset(glabella.long, !is.na(thickness))

Code
# Plot the data using ggplot
library(ggplot2)
p <- ggplot(glabella.long, aes(x = pop, y = thickness))
# plot a reference line for the global mean (assuming no groups)
p <- p + geom_hline(yintercept = mean(glabella.long$thickness),
colour = "black", linetype = "dashed", size = 0.3, alpha = 0.5)
# boxplot, size=.75 to stand out behind CI
p <- p + geom_boxplot(size = 0.75, alpha = 0.5)
# points for observed data
p <- p + geom_point(position = position_jitter(w = 0.05, h = 0), alpha = 0.5)
# diamond at mean for each group
p <- p + stat_summary(fun.y = mean, geom = "point", shape = 18, size = 6,
colour = "red", alpha = 0.8)
# confidence limits based on normal distribution
p <- p + stat_summary(fun.data = "mean_cl_normal", geom = "errorbar",
width = .2, colour = "red", alpha = 0.8)
p <- p + labs(title = "Glabella thickness") + ylab("thickness (mm)")
print(p)

There are 3 groups, so there are 3 possible pairwise comparisons. If you want a Bonferroni analysis with FER of no greater than 0.05, you should do the individual comparisons at the $$0.05/3 = 0.0167$$ level. Except for the mild outlier in the Caucasian sample, the observed distributions are fairly symmetric, with similar spreads. I would expect the standard ANOVA to perform well here.

Let $$\mu_c$$ = population mean Glabella measurement for Caucasians, $$\mu_a$$ = population mean Glabella measurement for African Americans, and $$\mu_n$$ = population mean Glabella measurement for Native Americans and Asians.

Code
glabella.summary <- ddply(glabella.long, "pop",
      function(X) {
        data.frame( m = mean(X$thickness), s = sd(X$thickness),
                    n = length(X$thickness) ) } )
glabella.summary
   pop        m         s  n
1 cauc 5.812500 0.8334280 12
2 afam 6.461538 0.8946959 13
3 naaa 5.857143 1.1168047 14
Code
fit.g <- aov(thickness ~ pop, data = glabella.long)
summary(fit.g)
            Df Sum Sq Mean Sq F value Pr(>F)
pop          2   3.40  1.6991   1.828  0.175
Residuals   36  33.46  0.9295               
Code
fit.g
Call:
   aov(formula = thickness ~ pop, data = glabella.long)

Terms:
                     pop Residuals
Sum of Squares   3.39829  33.46068
Deg. of Freedom        2        36

Residual standard error: 0.9640868
Estimated effects may be unbalanced

At the 5% level, you would not reject the hypothesis that the population mean Glabella measurements are identical. That is, you do not have sufficient evidence to conclude that these racial groups differ with respect to their average Glabella measurement. This is the end of the analysis!

The Bonferroni intervals reinforce this conclusion; all the p-values are greater than 0.05. If you were to calculate CIs for the difference in population means, each would contain zero. You can think of the Bonferroni intervals as simultaneous CIs. We’re (at least) 95% confident that all of the following statements hold simultaneously: $$-1.62 \leq \mu_c - \mu_a \leq 0.32$$, $$-0.91 \leq \mu_n - \mu_c \leq 1.00$$, and $$-1.54 \leq \mu_n - \mu_a \leq 0.33$$. The individual CIs have level $$100(1-0.0167)\% = 98.33\%$$.

Code
# Bonferroni 95% Individual p-values
# All Pairwise Comparisons among Levels of glabella
pairwise.t.test(glabella.long$thickness, glabella.long$pop,
                pool.sd = TRUE, p.adjust.method = "bonf")

 Pairwise comparisons using t tests with pooled SD 

data:  glabella.long$thickness and glabella.long$pop 

     cauc afam
afam 0.30 -   
naaa 1.00 0.34

P value adjustment method: bonferroni 

## 5.3 Further Discussion of Multiple Comparisons

The FSD and Bonferroni methods comprise the ends of the spectrum of multiple comparisons methods.
Among multiple comparisons procedures, the FSD method is most likely to find differences, whether real or due to sampling variation, whereas Bonferroni is often the most conservative method. You can be reasonably sure that differences suggested by the Bonferroni method will be suggested by almost all other methods, whereas differences not significant under FSD will not be picked up using other approaches. The Bonferroni method is conservative, but tends to work well when the number of comparisons is small, say 4 or less. A smart way to use the Bonferroni adjustment is to focus attention only on the comparisons of interest (generated independently of looking at the data!), and ignore the rest. I will return to this point later.

A commonly-used alternative is John Tukey’s honest significant difference method (HSD), which rejects the equality of a pair of means based not on the $$t$$-distribution but on the studentized range distribution. To implement Tukey’s method with a FER of $$\alpha$$, reject $$H_0:\mu_i =\mu_j$$ when $|\bar{Y}_i-\bar{Y}_j| \geq \frac{q_{crit}}{\sqrt{2}} s_{\textrm{pooled}} \sqrt{\frac{1}{n_i}+\frac{1}{n_j}},$ where $$q_{crit}$$ is the $$\alpha$$ level critical value of the studentized range distribution. For the doughnut fats, the groupings based on Tukey and Bonferroni comparisons are identical.

Code
#### Tukey's honest significant difference method (HSD)
## Fat
# Tukey 95% Individual p-values
# All Pairwise Comparisons among Levels of fat
TukeyHSD(fit.f)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = amount ~ type, data = fat.long)

$type
diff        lwr        upr     p adj
fat2-fat1  10.5000000  -2.719028 23.7190277 0.1510591
fat3-fat1   0.3333333 -12.885694 13.5523611 0.9998693
fat4-fat1 -12.5000000 -25.719028  0.7190277 0.0679493
fat3-fat2 -10.1666667 -23.385694  3.0523611 0.1709831
fat4-fat2 -23.0000000 -36.219028 -9.7809723 0.0004978
fat4-fat3 -12.8333333 -26.052361  0.3856944 0.0590077
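As a cross-check (a Python sketch assuming SciPy 1.7+ for its studentized range distribution), the Tukey cutoff for the doughnut comparisons sits between the differences that TukeyHSD() flags and those it does not:

```python
# Reproducing the Tukey HSD cutoff for the doughnut data (a sketch assuming
# SciPy >= 1.7 for scipy.stats.studentized_range; k = 4 groups, df = 20,
# s_pooled = 8.18, and n = 6 per group come from the ANOVA above).
from scipy import stats
import math

k, df_error, s_pooled, n = 4, 20, 8.18, 6

q_crit = stats.studentized_range.ppf(0.95, k, df_error)
hsd = (q_crit / math.sqrt(2)) * s_pooled * math.sqrt(1 / n + 1 / n)

# fat2 - fat1 differ by 10.5 < hsd, consistent with p adj = 0.151 above;
# fat4 - fat2 differ by 23.0 > hsd, consistent with p adj = 0.0005.
assert 10.5 < hsd < 23.0
```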
Code
## Glabella
# Tukey 95% Individual p-values
# All Pairwise Comparisons among Levels of pop
TukeyHSD(fit.g)
  Tukey multiple comparisons of means
95% family-wise confidence level

Fit: aov(formula = thickness ~ pop, data = glabella.long)

$pop
                 diff        lwr       upr     p adj
afam-cauc  0.64903846 -0.2943223 1.5923993 0.2259806
naaa-cauc  0.04464286 -0.8824050 0.9716907 0.9923923
naaa-afam -0.60439560 -1.5120412 0.3032500 0.2472838

Another popular method controls the false discovery rate (FDR) instead of the FER. The FDR is the expected proportion of false discoveries amongst the rejected hypotheses. The false discovery rate is a less stringent condition than the family-wise error rate, so these methods are more powerful than the others, though with a higher FER. I encourage you to learn more about the methods by Benjamini, Hochberg, and Yekutieli.

Code
#### false discovery rate (FDR)
## Fat
# FDR
pairwise.t.test(fat.long$amount, fat.long$type,
                pool.sd = TRUE, p.adjust.method = "BH")

 Pairwise comparisons using t tests with pooled SD 

data:  fat.long$amount and fat.long$type 

     fat1    fat2    fat3   
fat2 0.05248 -       -      
fat3 0.94443 0.05248 -      
fat4 0.03095 0.00056 0.03095

P value adjustment method: BH 

Code
## Glabella
# FDR
pairwise.t.test(glabella.long$thickness, glabella.long$pop,
                pool.sd = TRUE, p.adjust.method = "BH")

 Pairwise comparisons using t tests with pooled SD 

data:  glabella.long$thickness and glabella.long$pop 

     cauc afam
afam 0.17 -   
naaa 0.91 0.17

P value adjustment method: BH 

## 5.4 Checking Assumptions in ANOVA Problems

The classical ANOVA assumes that the populations have normal frequency curves and the populations have equal variances (or spreads). You can test the normality assumption using multiple normal QQ-plots and normal scores tests, which we discussed in Chapter @ch:ADA1_04_CheckAssump. An alternative approach that is useful with three or more samples is to make a single normal scores plot for the entire data set. The samples must be centered at the same location for this to be meaningful; otherwise, differences among the group means would show up as apparent nonnormality in the combined sample. The centering is done by subtracting the sample mean from each observation in the sample, giving the so-called residuals.
A normal scores plot or histogram of the residuals should resemble a sample from a normal population. Both plots can be generated from the $residuals component of the aov() fit.
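The centering step is simple to state precisely: within each sample, subtract that sample's mean from every observation. As a language-neutral illustration, here is a minimal Python sketch (the function name `group_residuals` is ours, for exposition only; in R these values are simply the `$residuals` of the fitted model):

```python
from collections import defaultdict

def group_residuals(values, groups):
    """Subtract each group's sample mean from its observations (ANOVA residuals)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for v, g in zip(values, groups):
        sums[g] += v
        counts[g] += 1
    means = {g: sums[g] / counts[g] for g in sums}
    return [v - means[g] for v, g in zip(values, groups)]

# three tiny "samples" with very different locations
vals = [10.0, 12.0, 14.0, 1.0, 3.0, 100.0, 102.0]
grps = ["a", "a", "a", "b", "b", "c", "c"]
print(group_residuals(vals, grps))  # [-2.0, 0.0, 2.0, -1.0, 1.0, -1.0, 1.0]
```

After centering, all three toy samples sit at 0, so pooling them into one normal scores plot is meaningful.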

For the glabella residuals, there are a few observations outside the confidence bands, but the formal normality tests each have p-values $$>0.2$$, so there is only weak, unconvincing evidence of nonnormality.

Code
#### Checking Assumptions in ANOVA Problems
# plot of data
par(mfrow=c(3,1))
# Histogram overlaid with kernel density curve
hist(fit.g$residuals, freq = FALSE, breaks = 20)
points(density(fit.g$residuals), type = "l")
rug(fit.g$residuals)

# violin plot
library(vioplot)

Loading required package: sm
Package 'sm', version 2.2-5.7: type help(sm) for summary information
Loading required package: zoo

Attaching package: 'zoo'

The following objects are masked from 'package:base':

    as.Date, as.Date.numeric

Code
vioplot(fit.g$residuals, horizontal=TRUE, col="gray")

# boxplot
boxplot(fit.g$residuals, horizontal=TRUE)

Code
# QQ plot
par(mfrow=c(1,1))
library(car)

Loading required package: carData

Attaching package: 'car'

The following object is masked from 'package:purrr':

    some

The following object is masked from 'package:dplyr':

    recode

Code
qqPlot(fit.g$residuals, las = 1, id = list(n = 8, cex = 1),
       lwd = 1, main="QQ Plot")

29  8 34 40 21 25 27 37
26  8 31 37 19 23 25 34 
Code
shapiro.test(fit.g$residuals)

    Shapiro-Wilk normality test

data:  fit.g$residuals
W = 0.97693, p-value = 0.5927
Code
library(nortest)
ad.test(fit.g$residuals)

    Anderson-Darling normality test

data:  fit.g$residuals
A = 0.37731, p-value = 0.3926
Code
# lillie.test(fit.g$residuals)
cvm.test(fit.g$residuals)

    Cramer-von Mises normality test

data:  fit.g$residuals
W = 0.070918, p-value = 0.2648

In Chapter @ch:ADA1_04_CheckAssump, I illustrated the use of Bartlett's test and Levene's test for equal population variances, and showed how to evaluate these tests in R. R does the calculation for us, as illustrated below. Because the p-value $$>0.5$$, we fail to reject the null hypothesis that the population variances are equal. This result is not surprising given how close the sample variances are to each other.

Code
## Test equal variance
# Bartlett assumes populations are normal
bartlett.test(thickness ~ pop, data = glabella.long)

    Bartlett test of homogeneity of variances

data:  thickness by pop
Bartlett's K-squared = 1.1314, df = 2, p-value = 0.568

Levene's and Fligner's tests are consistent with Bartlett's.

Code
# Levene does not assume normality, requires car package
library(car)
leveneTest(thickness ~ pop, data = glabella.long)

Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)
group  2  0.5286 0.5939
      36

Code
# Fligner is a nonparametric test
fligner.test(thickness ~ pop, data = glabella.long)

    Fligner-Killeen test of homogeneity of variances

data:  thickness by pop
Fligner-Killeen:med chi-squared = 1.0311, df = 2, p-value = 0.5972

## 5.5 Example from the Child Health and Development Study (CHDS)

We consider data from the birth records of 680 live-born white male infants. The infants were born to mothers who reported for pre-natal care to three clinics of the Kaiser hospitals in northern California. As an initial analysis, we will examine whether maternal smoking has an effect on the birth weights of these children. To answer this question, we define 3 groups based on mother's smoking history: (1) mother does not currently smoke or never smoked, (2) mother smoked less than one pack of cigarettes a day during pregnancy, and (3) mother smoked at least one pack of cigarettes a day during pregnancy.
Let $$\mu_i$$ = pop mean birth weight (lb) for children in group $$i$$, $$(i=1,2,3)$$. We wish to test $$H_0: \mu_1=\mu_2=\mu_3$$ against $$H_A: \textrm{not } H_0$$.

We read in the data, create a smoke factor variable, and plot the data by smoking group.

Code
#### Example from the Child Health and Development Study (CHDS)
# description at http://statacumen.com/teach/ADA1/ADA1_notes_05-CHDS_desc.txt
# read data from website
chds <- read.csv("http://statacumen.com/teach/ADA1/ADA1_notes_05-CHDS.csv")
chds$smoke <- rep(NA, nrow(chds));
# no cigs
chds[(chds$m_smok == 0), "smoke"] <- "0 cigs";
# less than 1 pack (20 cigs = 1 pack)
chds[(chds$m_smok > 0) & (chds$m_smok < 20), "smoke"] <- "1-19 cigs";
# at least 1 pack (20 cigs = 1 pack)
chds[(chds$m_smok >= 20), "smoke"] <- "20+ cigs";
chds$smoke <- factor(chds$smoke)
Code
# histogram using ggplot
p1 <- ggplot(chds, aes(x = c_bwt))
p1 <- p1 + geom_histogram(binwidth = .4)
p1 <- p1 + geom_rug()
p1 <- p1 + facet_grid(smoke ~ .)
p1 <- p1 + labs(title = "Child birthweight vs maternal smoking") +
xlab("child birthweight (lb)")
#print(p1)

p2 <- ggplot(chds, aes(x = c_bwt, fill=smoke))
p2 <- p2 + geom_histogram(binwidth = .4, alpha = 1/3, position="identity")
p2 <- p2 + geom_rug(aes(colour = smoke), alpha = 1/3)
p2 <- p2 + labs(title = "Child birthweight vs maternal smoking") +
xlab("child birthweight (lb)")
#print(p2)

library(gridExtra)

Attaching package: 'gridExtra'
The following object is masked from 'package:dplyr':

combine
Code
grid.arrange(grobs = list(p1, p2), ncol=1)

Code
# Plot the data using ggplot
library(ggplot2)
p <- ggplot(chds, aes(x = smoke, y = c_bwt))
# plot a reference line for the global mean (assuming no groups)
p <- p + geom_hline(yintercept = mean(chds$c_bwt), colour = "black",
                    linetype = "dashed", size = 0.3, alpha = 0.5)
# boxplot, size=.75 to stand out behind CI
p <- p + geom_boxplot(size = 0.75, alpha = 0.5)
# points for observed data
p <- p + geom_point(position = position_jitter(w = 0.05, h = 0), alpha = 0.2)
# diamond at mean for each group
p <- p + stat_summary(fun.y = mean, geom = "point", shape = 18, size = 4,
                      colour = "red", alpha = 0.8)
# confidence limits based on normal distribution
p <- p + stat_summary(fun.data = "mean_cl_normal", geom = "errorbar",
                      width = .2, colour = "red", alpha = 0.8)
p <- p + labs(title = "Child birthweight vs maternal smoking") +
  ylab("child birthweight (lb)")
print(p)

Looking at the boxplots, there is some evidence of non-normality here. Although there are outliers in the no-smoking group, we need to recognize that the sample size for this group is fairly large (381). Given that boxplots are calibrated in such a way that 7 outliers per 1000 observations are expected when sampling from a normal population, 5 outliers (only 4 are visible!) out of 381 seems a bit excessive. A formal test rejects the hypothesis of normality in the no and low smoker groups. The normal probability plot and the histogram of the residuals also suggest that the population distributions are heavy-tailed.

Code
library(car)
par(mfrow=c(1,3))
qqPlot(subset(chds, smoke == "0 cigs" )$c_bwt, las = 1, id = list(n = 0, cex = 1),
lwd = 1, main="QQ Plot, 0 cigs")
qqPlot(subset(chds, smoke == "1-19 cigs")$c_bwt, las = 1, id = list(n = 0, cex = 1),
       lwd = 1, main="QQ Plot, 1-19 cigs")
qqPlot(subset(chds, smoke == "20+ cigs" )$c_bwt, las = 1, id = list(n = 0, cex = 1),
lwd = 1, main="QQ Plot, 20+ cigs")

Code
library(nortest)
# 0 cigs    --------------------
shapiro.test(subset(chds, smoke == "0 cigs"   )$c_bwt)

    Shapiro-Wilk normality test

data:  subset(chds, smoke == "0 cigs")$c_bwt
W = 0.98724, p-value = 0.00199
Code
ad.test(     subset(chds, smoke == "0 cigs"   )$c_bwt)

    Anderson-Darling normality test

data:  subset(chds, smoke == "0 cigs")$c_bwt
A = 0.92825, p-value = 0.01831
Code
cvm.test(    subset(chds, smoke == "0 cigs"   )$c_bwt)

    Cramer-von Mises normality test

data:  subset(chds, smoke == "0 cigs")$c_bwt
W = 0.13844, p-value = 0.03374
Code
# 1-19 cigs --------------------
shapiro.test(subset(chds, smoke == "1-19 cigs")$c_bwt)

    Shapiro-Wilk normality test

data:  subset(chds, smoke == "1-19 cigs")$c_bwt
W = 0.97847, p-value = 0.009926
Code
ad.test(     subset(chds, smoke == "1-19 cigs")$c_bwt)

    Anderson-Darling normality test

data:  subset(chds, smoke == "1-19 cigs")$c_bwt
A = 0.83085, p-value = 0.03149
Code
cvm.test(    subset(chds, smoke == "1-19 cigs")$c_bwt)

    Cramer-von Mises normality test

data:  subset(chds, smoke == "1-19 cigs")$c_bwt
W = 0.11332, p-value = 0.07317
Code
# 20+ cigs  --------------------
shapiro.test(subset(chds, smoke == "20+ cigs" )$c_bwt)

    Shapiro-Wilk normality test

data:  subset(chds, smoke == "20+ cigs")$c_bwt
W = 0.98127, p-value = 0.06962
Code
ad.test(     subset(chds, smoke == "20+ cigs" )$c_bwt)

    Anderson-Darling normality test

data:  subset(chds, smoke == "20+ cigs")$c_bwt
A = 0.40008, p-value = 0.3578
Code
cvm.test(    subset(chds, smoke == "20+ cigs" )$c_bwt)

    Cramer-von Mises normality test

data:  subset(chds, smoke == "20+ cigs")$c_bwt
W = 0.040522, p-value = 0.6694

Fit the ANOVA (we’ll look at the table below).

Code
fit.c <- aov(c_bwt ~ smoke, data = chds)

A formal test of normality on the residuals of the combined sample is marginally significant (SW p-value $$= 0.047$$, others $$> 0.10$$). However, I am not overly concerned about this, for two reasons: in large samples, small deviations from normality are often statistically significant; and, in my experience, deviations this small are unlikely to affect our conclusions, in the sense that non-parametric methods that do not require normality lead to the same conclusions.

Code
# plot of data
par(mfrow=c(3,1))
# Histogram overlaid with kernel density curve
hist(fit.c$residuals, freq = FALSE, breaks = 20)
points(density(fit.c$residuals), type = "l")
rug(fit.c$residuals)

# violin plot
library(vioplot)
vioplot(fit.c$residuals, horizontal=TRUE, col="gray")

# boxplot
boxplot(fit.c$residuals, horizontal=TRUE)

Code
# QQ plot
par(mfrow=c(1,1))
library(car)
qqPlot(fit.c$residuals, las = 1, id = list(n = 0, cex = 1),
       lwd = 1, main="QQ Plot")

Code
shapiro.test(fit.c$residuals)

    Shapiro-Wilk normality test

data:  fit.c$residuals
W = 0.99553, p-value = 0.04758
Code
library(nortest)
ad.test(fit.c$residuals)

    Anderson-Darling normality test

data:  fit.c$residuals
A = 0.62184, p-value = 0.1051
Code
cvm.test(fit.c$residuals)

    Cramer-von Mises normality test

data:  fit.c$residuals
W = 0.091963, p-value = 0.1449

Looking at the summaries, we see that the sample standard deviations are close. Formal tests of equal population variances are far from significant. The p-values for Bartlett’s test and Levene’s test are greater than 0.4. Thus, the standard ANOVA appears to be appropriate here.
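Bartlett's statistic makes this comparison concrete: it contrasts the log of the pooled variance with the weighted average of the group log-variances, and is approximately chi-squared with $$k-1$$ degrees of freedom under the null hypothesis (assuming normal populations). As a language-neutral sketch of what R's bartlett.test() computes (the function name `bartlett_k2` is ours, for illustration):

```python
import math

def bartlett_k2(*samples):
    """Bartlett's K^2 statistic for homogeneity of variances."""
    k = len(samples)
    n = [len(s) for s in samples]
    big_n = sum(n)
    # unbiased sample variance of each group
    var = []
    for s in samples:
        m = sum(s) / len(s)
        var.append(sum((x - m) ** 2 for x in s) / (len(s) - 1))
    # pooled variance
    sp2 = sum((ni - 1) * vi for ni, vi in zip(n, var)) / (big_n - k)
    num = (big_n - k) * math.log(sp2) - sum(
        (ni - 1) * math.log(vi) for ni, vi in zip(n, var)
    )
    corr = 1 + (sum(1 / (ni - 1) for ni in n) - 1 / (big_n - k)) / (3 * (k - 1))
    return num / corr

# groups with identical spreads give K^2 exactly 0
print(bartlett_k2([1, 2, 3], [4, 5, 6], [7, 8, 9]))  # 0.0
```

When the group variances are equal, the pooled variance equals each group variance and the statistic is zero; the closer the sample variances are to each other, the smaller $$K^2$$ is, which is why the near-equal CHDS standard deviations yield such a large p-value.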

Code
# calculate summaries
library(plyr)  # ddply() is in the plyr package
chds.summary <- ddply(chds, "smoke",
  function(X) { data.frame( m = mean(X$c_bwt), s = sd(X$c_bwt),
                            n = length(X$c_bwt) ) } )
chds.summary

      smoke        m        s   n
1    0 cigs 7.732808 1.052341 381
2 1-19 cigs 7.221302 1.077760 169
3  20+ cigs 7.266154 1.090946 130

Code
## Test equal variance
# assumes populations are normal
bartlett.test(c_bwt ~ smoke, data = chds)

    Bartlett test of homogeneity of variances

data:  c_bwt by smoke
Bartlett's K-squared = 0.3055, df = 2, p-value = 0.8583

Code
# does not assume normality, requires car package
library(car)
leveneTest(c_bwt ~ smoke, data = chds)

Levene's Test for Homogeneity of Variance (center = median)
       Df F value Pr(>F)
group   2  0.7591 0.4685
      677

Code
# nonparametric test
fligner.test(c_bwt ~ smoke, data = chds)

    Fligner-Killeen test of homogeneity of variances

data:  c_bwt by smoke
Fligner-Killeen:med chi-squared = 2.0927, df = 2, p-value = 0.3512

The p-value for the $$F$$-test is less than 0.0001. We would reject $$H_0$$ at any of the usual test levels (such as 0.05 or 0.01). The data suggest that the population mean birth weights differ across smoking status groups.

Code
summary(fit.c)

             Df Sum Sq Mean Sq F value   Pr(>F)    
smoke         2   40.7  20.351    17.9 2.65e-08 ***
Residuals   677  769.5   1.137                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Code
fit.c

Call:
   aov(formula = c_bwt ~ smoke, data = chds)

Terms:
                   smoke Residuals
Sum of Squares   40.7012  769.4943
Deg. of Freedom        2       677

Residual standard error: 1.066126
Estimated effects may be unbalanced

The Tukey multiple comparisons suggest that the mean birth weights are different (higher) for children born to mothers who did not smoke during pregnancy.

Code
## CHDS
# Tukey 95% Individual p-values
TukeyHSD(fit.c)

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = c_bwt ~ smoke, data = chds)

$smoke
diff        lwr        upr     p adj
1-19 cigs-0 cigs   -0.51150662 -0.7429495 -0.2800637 0.0000008
20+ cigs-0 cigs    -0.46665455 -0.7210121 -0.2122970 0.0000558
20+ cigs-1-19 cigs  0.04485207 -0.2472865  0.3369907 0.9308357
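For reference, the F value in the summary(fit.c) table above is the between-group mean square divided by the within-group mean square. A minimal Python sketch of that decomposition (the function name `one_way_f` is ours, purely illustrative of what aov() computes):

```python
def one_way_f(*samples):
    """One-way ANOVA F statistic: MS(between) / MS(within)."""
    k = len(samples)
    n_total = sum(len(s) for s in samples)
    grand_mean = sum(sum(s) for s in samples) / n_total
    means = [sum(s) / len(s) for s in samples]
    # between-group sum of squares, k-1 df
    ss_between = sum(len(s) * (m - grand_mean) ** 2 for s, m in zip(samples, means))
    # within-group sum of squares, N-k df
    ss_within = sum(sum((x - m) ** 2 for x in s) for s, m in zip(samples, means))
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

print(one_way_f([1, 2, 3], [4, 5, 6]))  # 13.5
```

Large F values arise when the group means spread out far relative to the within-group variability, which is exactly the pattern seen in the CHDS birth weight data.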