---
title: "ADA1: Class 16, Paired data, assumption assessment"
author: "Your Name Here"
date: "`r format(Sys.time(), '%B %d, %Y')`"
output:
  html_document:
    toc: true
---
Include your answers in this document in the sections below the rubric where I have point values indicated (1 p).
# Rubric
Answer the questions with the data examples.
---
# Paired Comparisons of Two Sleep Remedies
The following data give the amount of sleep gained in hours from two
sleep remedies, A and B, applied to 10 individuals who have trouble sleeping
an adequate amount. Negative values imply sleep loss.
Let $\mu_A$ = population mean sleep gain (among troubled sleepers) on remedy
A, and $\mu_B$ = population mean sleep gain (among troubled sleepers) on
remedy B. Consider testing $H_0: \mu_B - \mu_A = 0$ or equivalently
$\mu_D = 0$, where $\mu_D = \mu_B - \mu_A$.
```{R}
library(tidyverse)
# Data
dat_sleep <- read.table(text="
A B
0.7 1.9
-1.6 0.8
-0.2 1.1
-1.2 0.1
0.1 -0.1
3.4 4.4
3.7 5.5
0.8 1.6
0.0 4.6
2.0 3.0
", header = TRUE)
# calculate paired difference between the two remedies, D = B - A
dat_sleep <-
dat_sleep %>%
mutate(
D = B - A
)
summary(dat_sleep)
```
A plot of both sleep remedies, with a 1-to-1 reference line to indicate
which remedy is better for which people.
```{R}
# determine range of each axis to use most extreme of each for square plot below
axis.lim <- range(c(dat_sleep$A, dat_sleep$B))
library(ggplot2)
# scatterplot of A and B sleep times, with 1:1 line
p <- ggplot(dat_sleep, aes(x = A, y = B))
# draw a 1:1 line, dots above line indicate "B > A"
p <- p + geom_abline(intercept = 0, slope = 1, alpha = 0.4)
p <- p + geom_vline(xintercept = 0, alpha = 0.2)
p <- p + geom_hline(yintercept = 0, alpha = 0.2)
p <- p + geom_point()
p <- p + geom_rug()
# make the axes square so it's a fair visual comparison
p <- p + coord_equal()
p <- p + scale_x_continuous(limits=axis.lim)
p <- p + scale_y_continuous(limits=axis.lim)
p <- p + labs(title = "Sleep hours gained on two sleep remedies: A vs B")
print(p)
```
A plot of the paired differences.
```{R}
library(ggplot2)
p1 <- ggplot(dat_sleep, aes(x = D))
p1 <- p1 + scale_x_continuous(limits=c(-5,+5), breaks = seq(-6, 6, by=2))
p1 <- p1 + geom_histogram(aes(y = after_stat(density)), binwidth = 0.5)
p1 <- p1 + geom_density(alpha=0.1, fill="white", adjust = 2)
# vertical reference line at 0
p1 <- p1 + geom_vline(xintercept = 0, colour="red", linetype="dashed")
p1 <- p1 + geom_vline(xintercept = mean(dat_sleep$D), colour="blue", alpha = 0.5)
p1 <- p1 + geom_rug()
p1 <- p1 + labs(title = "Difference of sleep hours gained: D = B - A")
print(p1)
```
__Hypothesis test__
1. (2 p) State the hypotheses in words and in symbols.
* $H_0: $ versus $H_A: $
2. Let $\alpha = 0.05$, the significance level of the test and the Type-I error probability if the null hypothesis is true.
```{R}
# one-sample t-test of differences (paired t-test)
# t.summary <- t.test(dat_sleep$D) # this gives the same result
t.summary <- t.test(dat_sleep$B, dat_sleep$A, paired = TRUE)
t.summary
```
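As a sanity check on the output above, the paired $t$ statistic can be formed by hand: $t_s = \bar{D} / (s_D/\sqrt{n})$ with $n - 1$ degrees of freedom. This sketch re-enters the data so the chunk stands alone.

```{R}
# paired t statistic computed by hand (same data as dat_sleep above)
A <- c(0.7, -1.6, -0.2, -1.2,  0.1, 3.4, 3.7, 0.8, 0.0, 2.0)
B <- c(1.9,  0.8,  1.1,  0.1, -0.1, 4.4, 5.5, 1.6, 4.6, 3.0)
D <- B - A
n <- length(D)
t_stat <- mean(D) / (sd(D) / sqrt(n))       # t = mean difference / standard error
p_val  <- 2 * pt(-abs(t_stat), df = n - 1)  # two-sided p-value
c(t = t_stat, df = n - 1, p = p_val)
```

The result should match the `t.test()` output above to rounding.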
3. (1 p) $t_{s} = 99$.
4. (1 p) $p = 99$, this is the observed significance of the test.
5. (2 p) Because $p = 99$, ...
---
# Test assumptions
For one- and two-sample $t$-tests,
our assumptions are that we have a representative (random) sample from the population
(not something that can be tested very well),
and that the sampling distribution of our test statistic $t$ is normal.
Hidden in a code chunk below are
functions for one-sample and two-sample bootstrap approximations of the
sampling distribution of the mean to
compare with a normal distribution when assessing the normality assumption.
```{R, echo = FALSE}
#### Visual comparison of whether sampling distribution is close to Normal via Bootstrap
# a function to compare the bootstrap sampling distribution with
# a normal distribution with mean and SEM estimated from the data
bs.one.samp.dist <- function(dat, N = 1e4) {
n <- length(dat);
# resample from data
sam <- matrix(sample(dat, size = N * n, replace = TRUE), ncol=N);
# draw a histogram of the means
sam.mean <- colMeans(sam);
# save par() settings
old.par <- par(no.readonly = TRUE)
# make smaller margins
par(mfrow=c(2,1), mar=c(3,2,2,1), oma=c(1,1,1,1))
# Histogram overlaid with kernel density curve
hist(dat, freq = FALSE, breaks = 6
, main = "Plot of data with smoothed density curve")
points(density(dat), type = "l")
rug(dat)
hist(sam.mean, freq = FALSE, breaks = 25
, main = "Bootstrap sampling distribution of the mean"
, xlab = paste("Data: n =", n
, ", mean =", signif(mean(dat), digits = 5)
, ", se =", signif(sd(dat)/sqrt(n), digits = 5)))
# overlay a density curve for the sample means
points(density(sam.mean), type = "l")
# overlay a normal distribution, bold and red
x <- seq(min(sam.mean), max(sam.mean), length = 1000)
points(x, dnorm(x, mean = mean(dat), sd = sd(dat)/sqrt(n))
, type = "l", lwd = 2, col = "red")
# place a rug of points under the plot
rug(sam.mean)
# restore par() settings
par(old.par)
}
#### Visual comparison of whether sampling distribution is close to Normal via Bootstrap
# a function to compare the bootstrap sampling distribution
# of the difference of means from two samples with
# a normal distribution with mean and SEM estimated from the data
bs.two.samp.diff.dist <- function(dat1, dat2, N = 1e4) {
n1 <- length(dat1);
n2 <- length(dat2);
# resample from data
sam1 <- matrix(sample(dat1, size = N * n1, replace = TRUE), ncol=N);
sam2 <- matrix(sample(dat2, size = N * n2, replace = TRUE), ncol=N);
# calculate the means and take difference between populations
sam1.mean <- colMeans(sam1);
sam2.mean <- colMeans(sam2);
diff.mean <- sam1.mean - sam2.mean;
# save par() settings
old.par <- par(no.readonly = TRUE)
# make smaller margins
par(mfrow=c(3,1), mar=c(3,2,2,1), oma=c(1,1,1,1))
# Histogram overlaid with kernel density curve
hist(dat1, freq = FALSE, breaks = 6
, main = paste("Sample 1", "\n"
, "n =", n1
, ", mean =", signif(mean(dat1), digits = 5)
, ", sd =", signif(sd(dat1), digits = 5))
, xlim = range(c(dat1, dat2)))
points(density(dat1), type = "l")
rug(dat1)
hist(dat2, freq = FALSE, breaks = 6
, main = paste("Sample 2", "\n"
, "n =", n2
, ", mean =", signif(mean(dat2), digits = 5)
, ", sd =", signif(sd(dat2), digits = 5))
, xlim = range(c(dat1, dat2)))
points(density(dat2), type = "l")
rug(dat2)
hist(diff.mean, freq = FALSE, breaks = 25
, main = paste("Bootstrap sampling distribution of the difference in means", "\n"
, "mean =", signif(mean(diff.mean), digits = 5)
, ", se =", signif(sd(diff.mean), digits = 5)))
# overlay a density curve for the sample means
points(density(diff.mean), type = "l")
# overlay a normal distribution, bold and red
x <- seq(min(diff.mean), max(diff.mean), length = 1000)
points(x, dnorm(x, mean = mean(diff.mean), sd = sd(diff.mean))
, type = "l", lwd = 2, col = "red")
# place a rug of points under the plot
rug(diff.mean)
# restore par() settings
par(old.par)
}
```
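A quick numeric companion to the visual functions above (a sketch with a made-up sample, not part of the worksheet): the bootstrap standard error of the mean should closely track the analytic standard error, $s/\sqrt{n}$.

```{R}
# hypothetical toy sample; seed fixed so the comparison is reproducible
set.seed(42)
x <- rnorm(50, mean = 2, sd = 3)
n <- length(x)
# resample the data N times and compute the mean of each resample
boot_means <- replicate(1e4, mean(sample(x, n, replace = TRUE)))
se_boot     <- sd(boot_means)       # bootstrap SE of the mean
se_analytic <- sd(x) / sqrt(n)      # analytic SE of the mean
c(boot = se_boot, analytic = se_analytic)
```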
## Paired Comparisons of Two Sleep Remedies
6. __Check assumptions__ of the test.
__Description of plots:__
The top plot is of the data: a histogram with a smoothed density curve.
The bottom plot compares the bootstrap approximation of the
sampling distribution of the mean with a normal distribution.
If the histogram and smooth density curve are _very_ close (basically underneath)
the red normal distribution, then we can say the
sampling distribution is consistent with normality.
If there is systematic bias (as below) where one side is over and the other side is under the red curve,
then that's evidence against normality,
in which case we would pursue a nonparametric test (next week).
```{R}
bs.one.samp.dist(dat_sleep$D)
```
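A hedged numeric companion, not required by the worksheet: the Shapiro-Wilk test assesses normality of the observed differences themselves, whereas the bootstrap plot above addresses the sampling distribution of the mean, which is what the $t$-test actually relies on.

```{R}
# Shapiro-Wilk normality test on the paired differences (B - A, re-entered
# from the data above so the chunk stands alone)
D <- c(1.2, 2.4, 1.3, 1.3, -0.2, 1.0, 1.8, 0.8, 4.6, 1.0)
shapiro.test(D)
```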
(2 p) Interpret the plots above.
---
# African countries in the UN example
Check the assumptions of the two-sample $t$-test for this example
(explained in a previous worksheet).
```{R}
# install.packages("gsheet")
library(gsheet)
dat_UN_Africa_url <- "https://docs.google.com/spreadsheets/d/1PkjIygZdtlopH6kvyle-KeoECsycHWyCaya92erlvaE"
dat_UN_Africa <-
gsheet2tbl(dat_UN_Africa_url) %>%
as.data.frame() %>%
na.omit() %>%
mutate(
PrimingNumber = factor(PrimingNumber)
, HighLow = factor(HighLow)
) %>%
filter(
Class == "F19"
)
str(dat_UN_Africa)
```
Here are some summaries and plots.
```{R}
## If we create a summary data.frame with a similar structure as our data, then we
## can annotate our plot with those summaries.
# calculate the estimated mean for each priming group
est_mean <- as.numeric(by(dat_UN_Africa$UN_Percentage, dat_UN_Africa$PrimingNumber, mean))
# by() returns results in factor-level order, so pair the means with levels()
mean_UN_Africa <- data.frame(PrimingNumber = levels(dat_UN_Africa$PrimingNumber)
                           , UN_Percentage = est_mean)
mean_UN_Africa <-
mean_UN_Africa %>%
mutate(
PrimingNumber = factor(PrimingNumber)
)
mean_UN_Africa
# summary
by(dat_UN_Africa$UN_Percentage, dat_UN_Africa$PrimingNumber, summary)
# histogram using ggplot
p <- ggplot(dat_UN_Africa, aes(x = UN_Percentage))
p <- p + geom_histogram(binwidth = 4)
p <- p + geom_rug()
p <- p + geom_vline(data = mean_UN_Africa, aes(xintercept = UN_Percentage), colour = "red")
p <- p + facet_grid(PrimingNumber ~ .)
print(p)
```
## Questions to answer
_A priori_, before we observed the data, we hypothesized that those who were primed
with a larger number (65) would provide a higher percentage (`UN_Percentage`)
than those with the lower number (10).
Therefore, this is a one-sided test.
```{R}
# two-sample t-test
t_summary_UN <- t.test(UN_Percentage ~ PrimingNumber, data = dat_UN_Africa
, alternative = "less")
t_summary_UN
```
6. __Check assumptions__ of the test.
```{R}
bs.two.samp.diff.dist(subset(dat_UN_Africa, PrimingNumber == "10")$UN_Percentage
, subset(dat_UN_Africa, PrimingNumber == "65")$UN_Percentage)
```
(2 p) Interpret the plots above.