Include your answers in this document in the sections below the rubric.

# Rubric

Answer the questions with the two data examples.

# Guess the ages example

## Read and reshape the data to produce summary table and plot

Erik has written a lot of code here to reshape, summarize, and combine the data in order to create the table and plot. You are encouraged to look at the code, run it one line at a time, and examine the results to understand the steps. These steps are a common set of transformations for basic analysis.
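As a rough illustration of the kind of steps described above (reshape to long format, summarize per image, combine with the true ages), here is a minimal base-R sketch. The data values and the object names `dat.wide`, `dat.long`, `dat.true`, and `dat.est` are hypothetical; this is not Erik's code or the class data.

```r
# Hypothetical wide data: one row per guesser, one column per image
dat.wide <- data.frame(
  GenderOfGuesser = c("F", "M", "F", "M"),
  X1 = c(30, 28, 31, 27),
  X2 = c(45, 48, 50, 44)
)

# Reshape: wide -> long (one row per guess)
dat.long <- data.frame(
  GenderOfGuesser = rep(dat.wide$GenderOfGuesser, times = 2),
  Image           = rep(c("X1", "X2"), each = nrow(dat.wide)),
  Age             = c(dat.wide$X1, dat.wide$X2)
)

# Summarize: mean and SD of the guesses for each image
dat.est <- aggregate(Age ~ Image, data = dat.long, FUN = mean)
dat.sd  <- aggregate(Age ~ Image, data = dat.long, FUN = sd)
names(dat.sd)[2] <- "SD.Age"
dat.est <- merge(dat.est, dat.sd, by = "Image")

# Combine: join with (hypothetical) true ages and compute the bias
dat.true <- data.frame(
  Image   = c("X1", "X2"),
  Gender  = c("F", "M"),
  TrueAge = c(40, 50)
)
dat.est <- merge(dat.true, dat.est, by = "Image")
dat.est$Bias <- dat.est$Age - dat.est$TrueAge
dat.est
```

Running each statement on its own and printing the intermediate objects is a quick way to see what each transformation does.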

Below are a summary table for each image and the associated plot. We will display the original images in class for comparison.

```r
# finally, display beautiful table
dat.ages.est.long
##    Image Gender TrueAge      Age        Bias   SD.Age CI.lower CI.upper
## 1    X10      F      26 27.85246   1.8524590 4.296650 26.75204 28.95288
## 2     X4      F      26 26.90164   0.9016393 3.394628 26.03223 27.77104
## 3     X8      F      32 30.16393  -1.8360656 5.544307 28.74397 31.58390
## 4     X1      F      40 29.47541 -10.5245902 4.678342 28.27723 30.67359
## 5     X5      F      73 67.47541  -5.5245902 5.679221 66.02089 68.92993
## 6     X7      M      24 26.65574   2.6557377 4.392741 25.53070 27.78077
## 7     X3      M      32 28.01639  -3.9836066 5.129951 26.70255 29.33023
## 8     X9      M      43 31.57377 -11.4262295 5.349327 30.20374 32.94380
## 9     X2      M      50 46.45902  -3.5409836 6.235847 44.86194 48.05609
## 10    X6      M      76 61.33607 -14.6639344 6.036501 59.79005 62.88209
```
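The CI.lower and CI.upper columns in the table are consistent with a t-based 95% confidence interval for the mean, that is, mean ± t × SD/√n. The number of guessers n is not stated in this document. The sketch below assumes n = 61, a value chosen only because it appears to reproduce the printed interval for image X10, so treat it as an assumption rather than a reported fact.

```r
# Values for image X10, copied from the summary table
ybar <- 27.85246   # mean guessed age (Age)
s    <- 4.296650   # standard deviation of the guesses (SD.Age)
n    <- 61         # ASSUMED number of guessers; not given in this document

se     <- s / sqrt(n)             # standard error of the mean
t.crit <- qt(0.975, df = n - 1)   # t critical value for a 95% interval
ci     <- c(lower = ybar - t.crit * se, upper = ybar + t.crit * se)
round(ci, 3)
```

Compare the result with the CI.lower and CI.upper entries for X10. The same arithmetic applies to any other row once its mean and SD are substituted.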
```r
# plot data and create table
library(ggplot2)
p <- ggplot(dat.ages.long, aes(x = Age))
p <- p + geom_histogram(aes(fill = GenderOfGuesser), position="stack", binwidth = 2, alpha = 1)
p <- p + geom_rug(alpha = 1/8)
# true ages
p <- p + geom_vline(data = dat.ages.true.long, aes(xintercept = Age)
                    , colour = "red", linetype = "dotted", size = 1)
# est ages
p <- p + geom_vline(data = dat.ages.est.long, aes(xintercept = Age)
                    , colour = "blue", size = 1)
# 95% CI for each estimate, drawn as a bar just below the histogram
p <- p + geom_rect(data = dat.ages.est.long, aes(xmin = CI.lower, xmax = CI.upper, ymin = -1, ymax = 0)
                   , fill = "blue", alpha = 1)
p <- p + facet_grid(Gender + Image ~ ., space = "free")
# Legend: Put top-right corner of legend box in top-right corner of graph
p <- p + theme(legend.justification=c(1,1), legend.position=c(1,1))
p <- p + labs(title = "Age guesses\nred dotted = true, blue = est +- 95% CI")
print(p)
```

## Questions to answer

1. (1 p) From the table or plot, what is the overall pattern of age guesses in relation to the Gender and Age of the Image and the Gender of the guesser?

2. (1 p) The sample who took the Guess the Ages survey were “Self-selected undergraduate and graduate students in Stat 427/527 this semester”. Define the population this sample was taken from.

3. (1 p) The population parameter being tested is “The mean age $$\mu_j$$ that people in the population would assess for each image $$j=1, \ldots, 10$$.” (Note that the population parameter is not the true age; rather, it is the mean assessed age. The bias, the difference between the two, indicates whether a person looks young or old for their age.) The sample statistic is $$\bar{Y}_j$$, the sample mean for each image. Give the name of the standard deviation of this sample statistic and interpret it.

4. (2 p) Report and interpret the confidence interval for image X1.

# Lego example

## Draw samples, estimate the mean “cells” of 50 Lego assemblages

A “cell” is defined as a one-unit-high square that includes a single Lego circle. Some assemblages have a 1/5-high base; ignore it, since it is only there for structure.

Procedure:

1. Select a “representative or random sample” of 5 assemblages out of the bag.
2. Count the number of cells for each assemblage.
3. Calculate the sample mean of the 5 (to estimate population mean of the 50); you can use this R code by replacing the numbers: mean(c(1, 2, 3, 4, 5)).
4. Write your mean cell-count estimate on a portable white board in big numbers (3 decimal places) and lay it on your table (so other tables don’t see).
5. When all tables are done, hang up your board so everyone can see each table’s cell-count estimate.
6. Record all the estimates in the R code chunk below and plot the estimate of the sampling distribution of the mean cell count based on $$n=5$$.
```r
# enter list of cells means here
#sam.lego <- data.frame(mean.cells = c(0, 0, 0, 0, 0, 0
#                                    , 0, 0, 0, 0, 0, 0))
sam.lego <- data.frame(mean.cells =
                         c(33.2, 32.4, 29.2, 19.6, 18.4
                         , 32.4, 21.6, 41.2, 48, 53.6
                         , 58.8, 35.2))

# we'll fill this in after data is collected
#true.mean.cells = 0
true.mean.cells = 11.86

summary(sam.lego$mean.cells)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    18.4    27.3    32.8    35.3    42.9    58.8
# plot data and create table
library(ggplot2)
p <- ggplot(sam.lego, aes(x = mean.cells))
p <- p + geom_histogram(binwidth = 5)
p <- p + geom_rug(alpha = 1/2)
# est mean cells
p <- p + geom_vline(aes(xintercept = mean(sam.lego$mean.cells))
                    , colour = "blue", size = 1)
# true mean cells
p <- p + geom_vline(aes(xintercept = true.mean.cells)
                    , colour = "red", linetype = "dotted", size = 1)
p <- p + labs(title = "Sampling distribution of mean cells of 50 assemblages\nn=5, red dotted = true, blue = est")
print(p)
```

## Questions to answer

1. (1 p) Adjust the histogram binwidth= to provide an informative representation of the distribution.

2. (1 p) Describe the sampling distribution: mention the center, spread, and any outliers in the plot.

3. (1 p) Are the estimates larger or smaller than the actual mean cells of the assemblages, and what might cause this bias?

4. (1 p) What would happen to the standard error and the bias if we increased the sample size from 5 to 10 or 20? For example, write: “As the sample size increases, we expect the standard error to … and the bias to …”.

5. (1 p) Does the Central Limit Theorem apply to the Lego example? Why or why not?
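
One way to explore the effect of sample size is by simulation. The sketch below builds a hypothetical population of 50 cell counts (invented values, not the class Lego bag), draws many simple random samples of size 5, 10, and 20, and summarizes the resulting sample means. The names `pop.cells`, `sim.means`, and `sim.summary` are made up for this illustration. Because the samples here are drawn at random by the computer, the sketch says nothing about any bias that hand-picked samples might have.

```r
set.seed(427)  # arbitrary seed so the run is repeatable

# Hypothetical population of 50 assemblages (cell counts are invented)
pop.cells <- round(rexp(50, rate = 1 / 12)) + 1

# Draw `reps` simple random samples of size n (without replacement)
# and return the mean of each sample
sim.means <- function(n, reps = 2000) {
  replicate(reps, mean(sample(pop.cells, size = n)))
}

sizes <- c(5, 10, 20)
sims  <- lapply(sizes, sim.means)

sim.summary <- data.frame(
  n         = sizes,
  mean.est  = sapply(sims, mean),  # center of the simulated sampling distribution
  std.error = sapply(sims, sd)     # spread of the simulated sampling distribution
)
sim.summary
mean(pop.cells)  # population mean, for comparison with mean.est
```

Comparing the `std.error` column across rows, and `mean.est` against the population mean, gives an empirical check on your answer about standard error and bias under random sampling.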