1 Document overview

This document is organized by week and class number. The in-class assignments are indicated by the Tuesday and Thursday class numbers. Each week's homework is usually a combination of the Tuesday and Thursday work with a small extension. Therefore, "fleshing out" the Tuesday and Thursday sections with a little addition is often sufficient for your homework assignment; that is, you won't need a separate "homework" section for a week, you'll just extend the in-class assignments. Rarely, the homework assignment differs from the in-class assignments and requires its own section in this document.

Consider your readers (graders):

1.1 Global code options

# I set some GLOBAL R chunk options here.
#   (to hide this message add "echo=FALSE" to the code chunk options)
# In particular, see the fig.height and fig.width (in inches)
#   and notes about the cache option.

knitr::opts_chunk$set(comment = NA, message = FALSE, warning = FALSE, width = 100)
knitr::opts_chunk$set(fig.align = "center", fig.height = 4, fig.width = 6)

# Note: The "cache=TRUE" option will save the computations of code chunks
#   so R doesn't recompute everything every time you recompile.
#   This can save _tons of time_ if you're working on a small section of code
#   at the bottom of the document.
#   Code chunks will be recompiled only if they are edited.
#   The autodep=TRUE will also update dependent code chunks.
#   A folder is created with the cache in it -- if you delete the folder, then
#   all the code chunks will recompute again.
#   ** If things are working as expected, or I want to freshly compute everything,
#      I delete the *_cache folder.
knitr::opts_chunk$set(cache = FALSE) #, autodep=TRUE)  #$

1.2 Document

1.2.1 Naming

Note: Each class, save this file with a new name, updating the last two digits to the class number. Then you'll have a record of your progress, as well as of which files you turned in for grading.

  • ADA1_ALL_05.Rmd
  • ADA1_ALL_06.Rmd
  • ADA1_ALL_07.Rmd

A version that I prefer uses the date in Year-Month-Day (YYYYMMDD) format:

  • ADA1_ALL_20190903.Rmd
  • ADA1_ALL_20190905.Rmd
  • ADA1_ALL_20190910.Rmd

1.2.2 Structure

Starting in Week03, we will concatenate all our homework assignments together to retain the relevant information needed for subsequent classes. You will also have opportunities to revisit previous parts to make changes or improvements, such as updating your codebook, modifying your research questions, and improving tables and plots. I've provided an initial predicted organization of our sections and subsections using the # and ## symbols. A table of contents is generated automatically by "toc: true" in the YAML header, and the headings in the table of contents are clickable to jump down to each (sub)section.


2 Research Questions

2.1 Week01: Personal Codebook

2.1.1 Class 02 Rmd, codebook

2.1.2 Codebook

Copy your CODEBOOK here

Additional variables were created from the original variables:

CREATED VARIABLES

2.2 Week02: Literature Review

2.2.1 Tuesday ———

2.2.2 Class 03 Research questions

Copy your research question assignment here.

2.2.3 Thursday ———

2.2.4 Class 04 Citations and Literature review

2.2.5 Citations

Copy your citations assignment here.

2.2.6 Week 02 Homework Literature review

Copy your literature review assignment here.


3 Data Management

3.1 Week03: Data Subset, Univariate Summaries And Plots

3.1.1 Background

3.1.1.1 Purpose of study

3.1.1.2 Variables

3.1.2 Tuesday ———

3.1.3 Data subset

Starting today, work directly in this document so that you always have all your previous work here.

First, the dataset is loaded into the R workspace.

# data analysis packages
library(tidyverse)  # Data manipulation and visualization suite
library(forcats)    # Factor variables
library(lubridate)  # Dates

  ## 1. Download the ".RData" file for your dataset into your ADA Folder.
  ## 2. Use the load() statement for the dataset you want to use.
  ##
  ## load("AddHealth.RData")
  ## load("addhealth_public4.RData")
  ## load("NESARC.RData")

# read data example
#load("NESARC.RData")

#dim(NESARC)

3.1.4 Renaming Variables

3.1.5 Coding missing values

There are two steps. The first step is to recode any existing NAs to actual values, if necessary. The method for doing this differs for numeric and categorical variables. The second step is to recode any coded missing values, such as 9s or 99s, as actual NA.
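The two steps above can be sketched in a few lines; this is a minimal example with a hypothetical data frame `dat` and made-up variables (`CigsPerDay`, `SmokeStatus`), not your actual dataset:

```r
library(dplyr)

# hypothetical example data: NA means "never smoked" for CigsPerDay,
# and 9 is the coded missing value for SmokeStatus
dat <- data.frame(
  CigsPerDay  = c(10, NA, 5, NA, 20)
, SmokeStatus = c(1, 2, 9, 1, 2)
)

dat <-
  dat %>%
  mutate(
    # Step 1: an NA that means "never" becomes a meaningful value (0 cigarettes)
    CigsPerDay  = ifelse(is.na(CigsPerDay), 0, CigsPerDay)
    # Step 2: a coded missing value (9) becomes an actual NA
  , SmokeStatus = ifelse(SmokeStatus == 9, NA, SmokeStatus)
  )
```

Your variables will have their own codes for missingness; check your codebook before choosing which values to recode.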

3.1.5.1 Coding NAs as meaningful “missing”

First step: some of the existing blank (NA) values actually mean "never", and "never" has a meaning different from "missing". For each such variable, we need to decide what NA means and code it appropriately.
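For a categorical variable, one way to do this is to give the NAs an explicit factor level; this sketch uses a hypothetical variable `SmokeFreq` where NA is assumed to mean "never":

```r
library(forcats)

# hypothetical variable: NA here means "never smoked", not "missing"
SmokeFreq <- factor(c("Daily", NA, "Weekly", NA))

# recode the NAs as an explicit, meaningful level
SmokeFreq <- fct_explicit_na(SmokeFreq, na_level = "Never")

table(SmokeFreq)
```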

3.1.5.1.1 NAs recoded as numeric
3.1.5.1.2 NAs recoded as categorical

3.1.5.2 Coding 9s and 99s as NAs

3.1.6 Creating new variables

3.1.6.1 From categories to numeric

3.1.6.2 From numeric to numeric

3.1.6.3 From numeric to categories based on quantiles

3.1.6.4 From many categories to a few

3.1.6.5 Review results of new variables

3.1.7 Labeling Categorical variable levels

3.1.8 Data subset rows


4 Graphing and Tabulating

4.1 Thursday ———

4.2 Categorical variables

4.3 Tables for categorical variables

4.3.0.1 Graphing frequency tables

4.4 Numeric variables

4.5 Graphing numeric variables

4.5.0.1 Creating Density Plots


4.6 Week04: Bivariate graphs

4.6.1 Tuesday ———

4.6.2 Scatter plot (for regression): x = numerical, y = numerical

4.6.3 Box plots (for ANOVA): x = categorical, y = numerical

4.6.4 Thursday ———

4.6.5 Mosaic plot or bivariate bar plots (for contingency tables): x = categorical, y = categorical

4.6.6 Logistic scatter plot (for logistic regression): x = numerical, y = categorical (binary)

5 Statistical methods

5.1 Week05: Simple linear regression, logarithm transformation

5.1.1 Tuesday ———

5.1.2 1. Scatter plot, add a regression line.

5.1.3 2. Fit the linear regression, interpret slope and \(R^2\) (R-squared) values

5.1.4 3. Interpret the intercept. Does it make sense?

5.1.5 Thursday ———

5.1.6 4. Try plotting on log scale (\(x\)-only, \(y\)-only, both).

5.1.7 5. Does log transformation help?
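Steps 1-5 above can be sketched with simulated data (the variables here are hypothetical stand-ins for your own subset):

```r
# simulated data where y grows multiplicatively in x, so a log
# transformation of y should help (hypothetical example)
set.seed(76)
x <- runif(100, min = 1, max = 10)
y <- exp(0.5 + 0.3 * x + rnorm(100, sd = 0.4))

# 1. scatter plot with the fitted regression line added
plot(x, y)
fit <- lm(y ~ x)
abline(fit, col = "red")

# 2./3. slope, R-squared, and intercept are in the summary
summary(fit)

# 4./5. refit on the log scale and compare
fit_log <- lm(log(y) ~ x)
summary(fit_log)
```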

5.2 Week06: Correlation and Categorical contingency tables

5.2.1 Tuesday ———

5.2.2 Correlation

5.2.3 Interpretation of correlation

5.2.4 Thursday ———

5.2.5 Contingency table

5.3 Week07: Inference and Parameter estimation (one-sample)

5.3.1 Tuesday ———

5.3.2 Dataset description of sampling

#### Visual comparison of whether sampling distribution is close to Normal via Bootstrap
# a function to compare the bootstrap sampling distribution with
#   a normal distribution with mean and SEM estimated from the data
bs.one.samp.dist <- function(dat, N = 1e4) {
  n <- length(dat);
  # resample from data
  sam <- matrix(sample(dat, size = N * n, replace = TRUE), ncol=N);
  # draw a histogram of the means
  sam.mean <- colMeans(sam);
  # save par() settings
  old.par <- par(no.readonly = TRUE)
  # make smaller margins
  par(mfrow=c(2,1), mar=c(3,2,2,1), oma=c(1,1,1,1))
  # Histogram overlaid with kernel density curve
  hist(dat, freq = FALSE, breaks = 6
      , main = "Plot of data with smoothed density curve")
  points(density(dat), type = "l")
  rug(dat)

  hist(sam.mean, freq = FALSE, breaks = 25
      , main = "Bootstrap sampling distribution of the mean"
      , xlab = paste("Data: n =", n
                   , ", mean =", signif(mean(dat), digits = 5)
                   , ", se =", signif(sd(dat)/sqrt(n), digits = 5)))
  # overlay a density curve for the sample means
  points(density(sam.mean), type = "l")
  # overlay a normal distribution, bold and red
  x <- seq(min(sam.mean), max(sam.mean), length = 1000)
  points(x, dnorm(x, mean = mean(dat), sd = sd(dat)/sqrt(n))
       , type = "l", lwd = 2, col = "red")
  # place a rug of points under the plot
  rug(sam.mean)
  # restore par() settings
  par(old.par)
}
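A quick way to try the function is with simulated data standing in for your own variable (a small, skewed sample is a good test case):

```r
# usage sketch: call bs.one.samp.dist() (defined above) on simulated data
set.seed(76)
dat <- rexp(30, rate = 1/5)  # skewed sample, n = 30
bs.one.samp.dist(dat)        # data histogram + bootstrap distribution of the mean
```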

#### Visual comparison of whether sampling distribution is close to Normal via Bootstrap
# a function to compare the bootstrap sampling distribution
#   of the difference of means from two samples with
#   a normal distribution with mean and SEM estimated from the data
bs.two.samp.diff.dist <- function(dat1, dat2, N = 1e4) {
  n1 <- length(dat1);
  n2 <- length(dat2);
  # resample from data
  sam1 <- matrix(sample(dat1, size = N * n1, replace = TRUE), ncol=N);
  sam2 <- matrix(sample(dat2, size = N * n2, replace = TRUE), ncol=N);
  # calculate the means and take difference between populations
  sam1.mean <- colMeans(sam1);
  sam2.mean <- colMeans(sam2);
  diff.mean <- sam1.mean - sam2.mean;
  # save par() settings
  old.par <- par(no.readonly = TRUE)
  # make smaller margins
  par(mfrow=c(3,1), mar=c(3,2,2,1), oma=c(1,1,1,1))
  # Histogram overlaid with kernel density curve
  hist(dat1, freq = FALSE, breaks = 6
      , main = paste("Sample 1", "\n"
                    , "n =", n1
                    , ", mean =", signif(mean(dat1), digits = 5)
                    , ", sd =", signif(sd(dat1), digits = 5))
      , xlim = range(c(dat1, dat2)))
  points(density(dat1), type = "l")
  rug(dat1)

  hist(dat2, freq = FALSE, breaks = 6
      , main = paste("Sample 2", "\n"
                    , "n =", n2
                    , ", mean =", signif(mean(dat2), digits = 5)
                    , ", sd =", signif(sd(dat2), digits = 5))
      , xlim = range(c(dat1, dat2)))
  points(density(dat2), type = "l")
  rug(dat2)

  hist(diff.mean, freq = FALSE, breaks = 25
      , main = paste("Bootstrap sampling distribution of the difference in means", "\n"
                   , "mean =", signif(mean(diff.mean), digits = 5)
                   , ", se =", signif(sd(diff.mean), digits = 5)))
  # overlay a density curve for the sample means
  points(density(diff.mean), type = "l")
  # overlay a normal distribution, bold and red
  x <- seq(min(diff.mean), max(diff.mean), length = 1000)
  points(x, dnorm(x, mean = mean(diff.mean), sd = sd(diff.mean))
       , type = "l", lwd = 2, col = "red")
  # place a rug of points under the plot
  rug(diff.mean)
  # restore par() settings
  par(old.par)
}
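As with the one-sample version, the function can be tried on simulated data before using your own two samples:

```r
# usage sketch: call bs.two.samp.diff.dist() (defined above) on simulated data
set.seed(76)
dat1 <- rexp(25, rate = 1/5)  # skewed sample, n = 25
dat2 <- rexp(30, rate = 1/4)  # skewed sample, n = 30
bs.two.samp.diff.dist(dat1, dat2)
```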

5.3.3 Thursday ———

5.3.4 Numeric variable confidence interval for mean \(\mu\)

5.3.5 Categorical variable confidence interval for proportion \(p\)

5.4 Week08: Hypothesis testing (one- and two-sample)

5.4.1 Mechanics of a hypothesis test (review)

  1. Set up the null and alternative hypotheses in words and notation.

    • In words: "The population mean for [what is being studied] is different from [value of \(\mu_0\)]." (Note that the statement in words is in terms of the alternative hypothesis.)
    • In notation: \(H_0: \mu=\mu_0\) versus \(H_A: \mu \ne \mu_0\) (where \(\mu_0\) is specified by the context of the problem).
  2. Choose the significance level of the test, such as \(\alpha=0.05\).

  3. Compute the test statistic, such as \(t_{s} = \frac{\bar{Y}-\mu_0}{SE_{\bar{Y}}}\), where \(SE_{\bar{Y}}=s/\sqrt{n}\) is the standard error.

  4. Determine the tail(s) of the sampling distribution where the \(p\)-value from the test statistic will be calculated (for example, both tails, right tail, or left tail). (Historically, we would compare the observed test statistic, \(t_{s}\), with the critical value \(t_{\textrm{crit}}=t_{\alpha/2}\) in the direction of the alternative hypothesis from the \(t\)-distribution table with degrees of freedom \(df = n-1\).)

  5. State the conclusion in terms of the problem.

    • Reject \(H_0\) in favor of \(H_A\) if \(p\textrm{-value} < \alpha\).
    • Fail to reject \(H_0\) if \(p\textrm{-value} \ge \alpha\). (Note: We DO NOT accept \(H_0\).)
  6. Check assumptions of the test (for now we skip this).
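The mechanics above can be carried out by hand and checked against `t.test()`; the data here are simulated and hypothetical:

```r
# worked one-sample t-test following the steps above (simulated data)
set.seed(76)
y     <- rnorm(25, mean = 10.8, sd = 2)
mu0   <- 10    # 1. H0: mu = 10 versus HA: mu != 10
alpha <- 0.05  # 2. significance level

# 3. test statistic t_s = (ybar - mu0) / SE, with SE = s / sqrt(n)
n   <- length(y)
SE  <- sd(y) / sqrt(n)
t_s <- (mean(y) - mu0) / SE

# 4. two-tailed p-value from the t distribution with df = n - 1
p_value <- 2 * pt(-abs(t_s), df = n - 1)

# check the hand computation against R's built-in test
t.test(y, mu = mu0)

# 5. conclusion: reject H0 in favor of HA if p_value < alpha
p_value < alpha
```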

5.4.2 What do we do about “significance”?

Adapted from Significance Magazine.

Recent calls have been made to abandon the term "statistical significance". The American Statistical Association (ASA) issued its statement and recommendations on p-values (see the special issue on p-values for more).

In summary, the problem of “significance” is one of misuse, misunderstanding, and misinterpretation. The recommendation in this class is that it is no longer sufficient to say that a result is “statistically significant” or “non-significant” depending on whether a p-value is less than a threshold. Instead, we will be looking for wording as in the following paragraph.

“The difference between the two groups turns out to be small (8%), while the probability (\(p\)) of observing a result at least as extreme as this under the null hypothesis of no difference between the two groups is \(p = 0.003\) (that is, 0.3%). This p-value is statistically significant as it is below our pre-defined threshold (\(p < 0.05\)). However, the p-value tells us only that the 8% difference between the two groups is somewhat unlikely given our hypothesis and null model’s assumptions. More research is required, or other considerations may be needed, to conclude that the difference is of practical importance and reproducible.”

5.4.3 Tuesday ———

5.4.4 Two-sample \(t\)-test

5.4.5 Thursday ———

Enjoy your Fall Break!

5.5 Week09: ANOVA and Assessing Assumptions

5.5.1 Tuesday ———

5.5.2 Thursday ———

5.5.3 ANOVA: Total cigarettes smoked by Ethnicity

5.5.3.1 Transform the response variable to satisfy assumptions

5.5.3.2 ANOVA Hypothesis test

5.5.3.3 Check assumptions

5.5.3.4 Post Hoc pairwise comparison tests

5.6 Week10: Nonparametric methods and Binomial and multinomial proportion tests

5.6.1 Tuesday ———

5.6.2 Thursday ———

5.6.3 Multinomial goodness-of-fit

5.6.3.1 Observed

5.6.3.2 Expected

5.6.4 Perform \(\chi^2\) Goodness-of-fit test

5.6.4.1 Chi-sq statistic helps us understand the deviations

5.6.4.2 Multiple Comparisons

5.7 Week11: Two-way categorical tables and Simple linear regression, inference

5.7.1 Tuesday ———

5.7.2 Two-way categorical analysis.

5.7.3 Thursday ———

5.7.4 Simple linear regression.

5.8 Week12: Logistic regression and Experiments vs Observational studies

5.8.1 Tuesday ———

5.8.2 Logistic Regression

5.8.3 Thursday ———

5.8.4 Experiments and observational studies


6 Poster presentation

6.1 Week13: Complete poster in HW document

6.1.1 Tuesday ———

6.1.2 Thursday ———

6.1.2.1 Title

6.1.2.2 1. (1 p) Introduction

6.1.2.3 2. (1 p) Research Questions

6.1.2.4 3. (1 p) Methods

6.1.2.5 4. (1 p) Discussion (while this would follow your results, let’s put it here so you have a full column to show the results of the analysis of both research questions)

6.1.2.6 5. (1 p) Further directions, Future work, Next steps, or something else that indicates there is more to do and that you’ve thought about it.

6.1.2.7 6. (1 p) References

  • By citing sources in your introduction, this section will automatically have your bibliography.

  • [Bibliography will go here – it’s currently at the bottom of the document]

6.1.2.8 7. (2 p) Results for your first research question.

6.1.2.9 8. (2 p) Results for your second research question.

6.2 Week14: Posters, finishing up

6.2.1 Tuesday ———

6.2.2 Thursday ———

6.3 Week15: Posters, final touches

6.3.1 Tuesday ———

6.3.2 Thursday ———

Thanksgiving break. Remember to print your posters.

6.4 Week16: Posters, presentations

6.4.1 Tuesday ———

6.4.2 Thursday ———

7 References (from Week02)