---
title: "ADA1: Class 13, Inference"
author: "Your Name Here"
date: "`r format(Sys.time(), '%B %d, %Y')`"
output:
html_document:
toc: true
---
Include your answers in this document in the sections below the rubric.
# Rubric
Answer the questions with the two data examples.
---
# Guess the ages example
## Read and reshape the data to produce summary table and plot
Erik has written a lot of code here to reshape, summarize, and combine the data
in order to create the table and plot.
You are encouraged to look at the code, run each line at a time and examine the results
to understand the steps.
The steps used are a common set of transformations for basic analysis.
```{R}
library(tidyverse)
# install.packages("gsheet")
# Read Ages data from google spreadsheet
library(gsheet)
dat_ages_url <- "docs.google.com/spreadsheets/d/1ALMmYN0AKafrOk0iO1AIbJPSdhrVPBfv27sOIo2wELk"
# convert the spreadsheet to csv-formatted test
dat_ages_all <- gsheet2text(dat_ages_url)
```
```{R, echo = FALSE}
# process data, reshape and organize
# read body of table as data (first two rows have other information
dat_ages <- read.table(text=dat_ages_all, skip = 2, header = TRUE, sep = ",", stringsAsFactors = TRUE)
# convert gender letters to capital
dat_ages$GenderOfGuesser <- toupper(dat_ages$GenderOfGuesser)
# extract top two rows for true ages and genders
dat_ages_true <- read.table(text=dat_ages_all, nrows = 2, header = FALSE, sep = ",", stringsAsFactors = FALSE)
colnames(dat_ages_true) <- c("Image", colnames(dat_ages)[-1])
#str(dat_ages_true)
# guessed ages
#str(dat_ages)
# reshape into long format
dat_ages_true_long <-
data.frame(Image = as.character(colnames(dat_ages_true[,-1]))
, Age = as.numeric(dat_ages_true[1,-1])
, Gender = as.character(dat_ages_true[2,-1]))
dat_ages_long <-
dat_ages %>%
gather(
key = "Image"
, value = "Age"
, X1:X10
)
# order images by age
dat_ages_true_long$Image <-
factor(dat_ages_true_long$Image
, ordered = TRUE
, levels = levels(reorder(dat_ages_true_long$Image, dat_ages_true_long$Age))
)
dat_ages_long$Image <-
factor(dat_ages_long$Image
, ordered = TRUE
, levels = levels(reorder(dat_ages_true_long$Image, dat_ages_true_long$Age))
)
# join datasets to include image Gender with guesses
dat_ages_long <-
dat_ages_long %>%
left_join(
dat_ages_true_long %>% select(-Age)
)
# calcuate mean, sd, and CI from data for each image
dat_ages_est_long <-
dat_ages_long %>%
group_by(Image) %>%
do(
data.frame(
SD_Age = sd(.$Age)
, Gender = .$Gender[1]
, CI_lower = t.test(.$Age, conf.level = 0.95)$conf.int[1]
, CI_upper = t.test(.$Age, conf.level = 0.95)$conf.int[2]
, Age = t.test(.$Age, conf.level = 0.95)$estimate
)
) %>%
ungroup()
# To create a table, create a "TrueAge" column, and join the true with est table
dat_ages_est_long <-
left_join(
dat_ages_est_long
, dat_ages_true_long %>% rename(TrueAge = Age)
) %>%
# Calculate the bias
mutate(
Bias = Age - TrueAge
) %>%
# sort by Gender and TrueAge so in same order as plot
arrange(
Gender, TrueAge
) %>%
# reorder columns
select(
Image, Gender, TrueAge, Age, Bias, SD_Age, CI_lower, CI_upper
)
```
Below is a summary table for each image, and the associated plot.
We will display the original images in class for comparisons.
```{R}
# finally, display beautiful table
dat_ages_est_long
```
```{R, fig.width = 7, fig.height = 7}
# plot data and create table
library(ggplot2)
p <- ggplot(dat_ages_long, aes(x = Age))
p <- p + geom_histogram(aes(fill = GenderOfGuesser), position="stack", binwidth = 2, alpha = 1)
p <- p + geom_rug(alpha = 1/8)
# true ages
p <- p + geom_vline(data = dat_ages_true_long, aes(xintercept = Age)
, colour = "red", linetype = "dotted", size = 1)
# est ages
p <- p + geom_vline(data = dat_ages_est_long, aes(xintercept = Age)
, colour = "blue", size = 1)
p <- p + geom_rect(data = dat_ages_est_long, aes(xmin = CI_lower, xmax = CI_upper, ymin = -1, ymax = 0)
, fill = "blue", alpha = 1)
p <- p + facet_grid(Gender + Image ~ ., space = "free")
# Legend: Put top-right corner of legend box in top -right corner of graph
p <- p + theme(legend.justification=c(1,1), legend.position=c(1,1))
p <- p + labs(title = "Age guesses", caption = "red dotted = true, blue = est +- 95% CI")
print(p)
```
## Questions to answer
1. (1 p) From the table or plot,
what is the overall pattern of age guesses based on the Gender and Age of the Image
based on the Gender of the guesser?
2. (1 p) The sample who took the Guess the Ages survey were
"Self-selected undergraduate and graduate students in
Stat 427/527 this semester".
Define the population this sample was taken from.
3. (1 p) The population parameter being tested is
"The mean age $\mu_j$ that people in the population would assess for each image $j=1, \ldots, 10$."
(Note, the population parameter is _not_ the true age, rather, it is the _mean assessed age_.
The bias difference of these two is an indicator of whether a person looks young or old for their age.)
The sample statistic is $\bar{Y}_j$, the sample mean for each image.
Give the name for and define the standard deviation for this sample statistic.
4. (2 p) Report and interpret the confidence interval for image X1.
---
# Lego example
## Draw samples, estimate the mean "cells" of 50 Lego assemblages
A "cell" is defined as a one-unit high square that includes a single lego circle.
Some assemblages have a 1/5-high base; ignore this, it is only for structure.
Procedure:
a. Select a "representative or random sample" of 5 assemblages out of the bag.
b. Count the number of cells for each assemblage.
c. Calculate the sample mean of the 5 (to estimate population mean of the 50); you can use this R code by replacing the numbers: `mean(c(1, 2, 3, 4, 5))`.
d. Write your mean cell-count estimate on a portable white board in big numbers (3 decimal places) and lay it on your table (so other tables don't see).
e. When all tables are done, hang up your board so everyone can see each table's cell-count estimate.
f. Record all the estimates in the R code chunk below and plot the estimate of the sampling distribution of the mean cell count based on $n=5$.
```{R}
# enter list of cells means here
#sam.lego <- data.frame(mean.cells = c(0, 0, 0, 0, 0, 0
# , 0, 0, 0, 0, 0, 0))
sam.lego <-
data.frame(
mean.cells =
c(14.2, 29.4, 27.8, 40.8, 29.0
, 47.0, 42.8, 50.0, 44.2, 56.4
, 42.8, 44.4, 46.0, 44.4
)
)
# we'll fill this in after data is collected
#true.mean.cells = 0
true.mean.cells = 12.06
summary(sam.lego$mean.cells)
```
```{R}
# plot data and create table
library(ggplot2)
p <- ggplot(sam.lego, aes(x = mean.cells))
p <- p + geom_histogram(binwidth = 5)
p <- p + geom_rug(alpha = 1/2)
# est mean cells
p <- p + geom_vline(aes(xintercept = mean(sam.lego$mean.cells))
, colour = "blue", size = 1)
# true mean cells
p <- p + geom_vline(aes(xintercept = true.mean.cells)
, colour = "red", linetype = "dotted", size = 1)
p <- p + labs(title = "Sampling distribution of mean cells of 50 assemblages\nn=5, red dotted = true, blue = est")
print(p)
```
## Questions to answer
5. (1 p) Adjust the histogram `binwidth=` to provide an informative representation of the distribution.
6. (1 p) Describe the sampling distribution: mention the center, spread, and any outliers in the plot.
7. (1 p) Are the estimates larger or smaller than the actual mean cells of the assemblages,
and what might cause this bias?
8. (1 p) What would happen to the standard error and the bias if we increased the sample size from 5 to 10 or 20?
For example, write: "As the sample size increases, we expect the standard error to ... and the bias to ...".
9. (1 p) Does the Central Limit Theorem apply to the lego example? Why or why not?