---
title: "ADA1: Class 17, Sampling distributions"
author: Your Name
date: last-modified
description: |
  [Advanced Data Analysis 1](https://StatAcumen.com/teach/ada1),
  Stat 427/527, Fall 2023, Prof. Erik Erhardt, UNM
format:
  html:
    theme: litera
    highlight-style: atom-one
    page-layout: full # article, full # https://quarto.org/docs/output-formats/page-layout.html
    toc: true
    toc-location: body # body, left, right
    number-sections: false
    self-contained: false # !!! this can cause a render error
    code-overflow: scroll # scroll, wrap
    code-block-bg: true
    code-block-border-left: "#30B0E0"
    code-copy: false # true, false, hover a copy button in top-right of code block
fig-width: 6
fig-height: 4
---
# Rubric
The context of this assignment comes from:

* [Foundations for statistical inference - Sampling distributions](https://openintro.shinyapps.io/sampling_distributions/)

_This is a template for the assignment. Modify this and turn it in._

Some questions are answered by the code you've written. In those cases, write "see code" in your answer.
```{r load-packages, message=FALSE}
library(erikmisc)
library(tidyverse)
library(openintro)
library(infer)
set.seed(87131) # for repeatable random samples
```
# The data
A 2019 Gallup report states the following:

> The premise that scientific progress benefits people has been embodied in
> discoveries throughout the ages --- from the development of vaccinations to the
> explosion of technology in the past few decades, resulting in billions of
> supercomputers now resting in the hands and pockets of people worldwide. Still,
> not everyone around the world feels science benefits them personally.
>
> Source: [World Science Day: Is Knowledge Power?](https://news.gallup.com/opinion/gallup/268121/world-science-day-knowledge-power.aspx)

The Wellcome Global Monitor finds that 20% of people globally do not believe
that the work scientists do benefits people like them. In this lab, you will
assume this 20% is a true population proportion and learn about how sample
proportions can vary from sample to sample by taking smaller samples from the
population. We will first create our population assuming a population size of
100,000. This means 20,000 (20%) of the population think the work scientists do
does not benefit them personally and the remaining 80,000 think it does.
```{r}
# Population of 100,000: 80% "Benefits", 20% "Doesn't benefit"
global_monitor <-
  tibble(
    scientist_work =
      c(
        rep("Benefits"       , 80000)
      , rep("Doesn't benefit", 20000)
      )
  )

# Bar plot of the population responses
ggplot(global_monitor, aes(x = scientist_work)) +
  geom_bar() +
  labs(
    x = "", y = "",
    title = "Do you believe that the work scientists do benefits people like you?"
  ) +
  coord_flip()

# Population counts and proportions
global_monitor |>
  count(scientist_work) |>
  mutate(p = n / sum(n))
```
# The unknown sampling distribution
```{r}
samp1 <-
  global_monitor |>
  sample_n(50)
```
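
A note on the sampling call above: `sample_n()` still works, but current dplyr documents it as superseded by `slice_sample()`. The following is a minimal sketch of an equivalent draw, assuming dplyr 1.0.0 or later. The names `pop` and `samp_alt` are hypothetical and are used only so the block stands alone; it is shown for reference and is not part of the assignment code.

```r
library(dplyr)

# Hypothetical stand-alone copy of the population (same 80/20 split as above)
pop <- tibble(
  scientist_work = c(rep("Benefits", 80000), rep("Doesn't benefit", 20000))
)

# Draw 50 rows at random, without replacement (the default)
samp_alt <- pop |> slice_sample(n = 50)

# The sample keeps the same variable names as the population
nrow(samp_alt)
names(samp_alt)
```
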
## 1. (1 p) Exercise 1
*Describe the distribution of responses in this sample. How does it compare to
the distribution of responses in the population? **Hint:** Although the `sample_n`
function takes a random sample of observations (i.e., rows) from the dataset,
you can still refer to the variables in the dataset with the same names. Code
you presented earlier for visualizing and summarizing the population data will
still be useful for the sample; however, be careful not to label your proportion
$p$ since you're now calculating a sample statistic, not a population parameter.
You can customize the label of the statistic to indicate that it comes from
the sample.*
* Write the text of your answer here...
```{r}
# Sample counts and proportions (p_hat, since this is a sample statistic)
samp1_summary <-
  samp1 |>
  count(scientist_work) |>
  mutate(p_hat = n / sum(n))
samp1_summary

# Sample proportions, with the population proportions as reference lines
ggplot(samp1_summary, aes(x = scientist_work, y = p_hat)) +
  geom_col() +
  geom_hline(yintercept = c(0.20, 0.80), linetype = "dashed", color = "red") +
  labs(
    #x = "", y = "",
    title = "Do you believe that the work scientists do benefits people like you?"
  , caption = "Red dashed lines mark the population proportions, 0.20 and 0.80."
  ) +
  coord_flip()
```
## 2. (1 p) Exercise 2
*Would you expect your sample proportion to match the sample proportion
obtained using a different random number seed? Why, or why
not? If the answer is no, would you expect the proportions to be somewhat
different or very different?*
* Write the text of your answer here...
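
One way to explore this question empirically is to repeat the draw under several seeds. The sketch below is a base R illustration, not the assignment's method: it codes the population as 0/1 and uses a hypothetical helper, `p_hat_for_seed()`, that is not part of any package.

```r
# Population coded as 1 = "Doesn't benefit", 0 = "Benefits" (20% / 80%)
pop <- c(rep(0, 80000), rep(1, 20000))

# Hypothetical helper: sample proportion from one sample of size n
# drawn (without replacement) under a given seed
p_hat_for_seed <- function(seed, n = 50) {
  set.seed(seed)
  mean(sample(pop, size = n))
}

# Sample proportions under five different seeds
p_hats <- sapply(1:5, p_hat_for_seed)
p_hats

# The same seed always reproduces the same sample proportion
p_hat_for_seed(1) == p_hat_for_seed(1)
```
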
## 3. (1 p) Exercise 3
*Take a second sample, also of size 50, and call it `samp2`. How does the sample
proportion of `samp2` compare with that of `samp1`? Suppose we took two more
samples, one of size 100 and one of size 1000. Which would you think would
provide a more accurate estimate of the population proportion?*
* Write the text of your answer here...
## 4. (1 p) Exercise 4
```{r}
# 15,000 samples of size 50; keep p_hat for "Doesn't benefit" in each sample
sample_props50 <-
  global_monitor |>
  rep_sample_n(size = 50, reps = 15000, replace = TRUE) |>
  count(scientist_work) |>
  mutate(p_hat = n / sum(n)) |>
  filter(scientist_work == "Doesn't benefit")

# Histogram of the simulated sample proportions
ggplot(data = sample_props50, aes(x = p_hat)) +
  geom_histogram(binwidth = 0.02) +
  labs(
    x = "p_hat (Doesn't benefit)",
    title = "Sampling distribution of p_hat",
    subtitle = "Sample size = 50, Number of samples = 15000"
  )
```
*How many elements are there in `sample_props50`? Describe the sampling
distribution, and be sure to specifically note its center. Make sure to include
a plot of the distribution in your answer.*
* Write the text of your answer here...
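
As a cross-check on the simulated distribution, the sketch below uses base R `rbinom()` as a stand-in for `rep_sample_n()`. This is a swapped-in technique, not the code the assignment uses: it assumes independent draws with probability 0.20, which corresponds to sampling with replacement. The seed and variable names are arbitrary choices for illustration.

```r
set.seed(1)  # arbitrary seed, for repeatability of this sketch only

p    <- 0.20   # population proportion of "Doesn't benefit"
n    <- 50     # sample size
reps <- 15000  # number of simulated samples

# Each binomial count divided by n is one simulated sample proportion
p_hat <- rbinom(reps, size = n, prob = p) / n

c(
  mean_p_hat = mean(p_hat)            # center of the simulated distribution
, sd_p_hat   = sd(p_hat)              # spread of the simulated distribution
, theory_se  = sqrt(p * (1 - p) / n)  # theoretical standard error
)
```

The simulated center and spread can then be compared with the histogram of `sample_props50`.
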
# Interlude: Sampling distributions
## 5. (1 p) Exercise 5
*To make sure you understand how sampling distributions are built, and exactly
what the `rep_sample_n` function does, try modifying the code to create a
sampling distribution of **25 sample proportions** from **samples of size 10**,
and put them in a data frame named `sample_props_small`. Print the output. How
many observations are there in this object called `sample_props_small`? What
does each observation represent?*
* Write the text of your answer here...
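
To see what `rep_sample_n()` returns, it can help to look at a deliberately tiny case. The sketch below uses toy values (4 samples of size 3) rather than the sizes asked for in the exercise, and the names `pop`, `draws`, and `props` are hypothetical. It also computes each proportion with `summarise()`: in the count-then-filter pattern, a sample that happens to contain no "Doesn't benefit" responses has no row left after `filter()`, which is common for small samples (for size 10 the chance is 0.8^10, about 0.107).

```r
library(dplyr)
library(infer)

set.seed(1)  # arbitrary seed for this sketch

# Hypothetical stand-alone copy of the population
pop <- tibble(
  scientist_work = c(rep("Benefits", 80000), rep("Doesn't benefit", 20000))
)

# 4 samples of size 3: one row per sampled person, labeled by `replicate`
draws <- pop |> rep_sample_n(size = 3, reps = 4, replace = TRUE)
draws

# One proportion per sample; a sample with no "Doesn't benefit" responses
# is kept with p_hat = 0 instead of being dropped by a count-then-filter step
props <- draws |>
  summarise(p_hat = mean(scientist_work == "Doesn't benefit"))
props
```
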
# Sample size and the sampling distribution
```{r}
#ggplot(data = sample_props50, aes(x = p_hat)) +
# geom_histogram(binwidth = 0.02)
```
## 6. (1 p) Exercise 6
*Use the app below to create sampling distributions of proportions of
`Doesn't benefit` from samples of size 10, 50, and 100. Use 5,000 simulations.
What does each observation in the sampling distribution represent? How do the
mean, standard error, and shape of the sampling distribution change as the
sample size increases? How (if at all) do these values change if you increase
the number of simulations? (You do not need to include plots in your answer.)*
* Write the text of your answer here...
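
For reference when comparing the simulated values, the usual theoretical standard error of a sample proportion is given below. It assumes independent draws; with a population of 100,000 and samples this small, the finite population correction is negligible.

$$
\mathrm{SE}(\hat{p}) = \sqrt{\frac{p(1-p)}{n}}
$$

With $p = 0.20$, this gives approximately $0.126$ for $n = 10$, $0.057$ for $n = 50$, and $0.040$ for $n = 100$. The number of simulations does not appear in the formula.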
# More Practice
## 7. (1 p) Exercise 7
*Take a sample of size 15 from the population and calculate the proportion of
people in this sample who think the work scientists do enhances their lives.
Using this sample, what is your best point estimate of the population
proportion of people who think the work scientists do enhances their lives?*
* Write the text of your answer here...
## 8. (1 p) Exercise 8
*Since you have access to the population, simulate the sampling distribution of
the proportion of those who think the work scientists do enhances their lives for
samples of size 15 by taking 2000 samples of size 15 from the population and
computing 2000 sample proportions. Store these proportions as
`sample_props15`. Plot the data, then describe the shape of this sampling
distribution. Based on this sampling distribution, what would you guess the
true proportion of those who think the work scientists do enhances their lives
to be? Finally, calculate and report the population proportion.*
* Write the text of your answer here...
## 9. (1 p) Exercise 9
*Change your sample size from 15 to 150, then compute the sampling distribution
using the same method as above, and store these proportions in a new object
called `sample_props150`. Describe the shape of this sampling distribution and
compare it to the sampling distribution for a sample size of 15. Based on this
sampling distribution, what would you guess to be the true proportion of those
who think the work scientists do enhances their lives?*
* Write the text of your answer here...
## 10. (1 p) Exercise 10
(Typo, should be comparing Exercises 8 and 9.)
*Of the sampling distributions from 8 and 9, which has a smaller spread? If
you're concerned with making estimates that are more often close to the true
value, would you prefer a sampling distribution with a large or small spread?*
* Write the text of your answer here...
![Creative Commons License](https://i.creativecommons.org/l/by-sa/4.0/88x31.png){style="border-width:0"}

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.