ADA1: Cumulative project file

Name your project

Advanced Data Analysis 1, Stat 427/527, Fall 2022, Prof. Erik Erhardt, UNM

Author

Your Name

Published

September 27, 2022


1 Document overview

Important

Please don’t let the initial size and detail of this document intimidate you. You got this!

This document is organized by Week and Class number. The worksheet assignments are indicated by the Class numbers.

Consider your readers (graders):

  • organize the document clearly (use this document as an example)
  • label minor sections under each day (use this document as an example)
  • For each thing you do, always have these three parts:
    1. Say what you’re going to do and why.
    2. Do it with code, and document your code.
    3. Interpret the results.

1.1 Document

1.1.1 Naming

Note

Each class, save this file with a new name, updating the last two digits to the class number. Then, you’ll have a record of your progress, as well as which files you turned in for grading.

  • ADA1_ALL_05.qmd
  • ADA1_ALL_06.qmd
  • ADA1_ALL_07.qmd

A version that I prefer is to use a date using Year-Month-Day, YYYYMMDD:

  • ADA1_ALL_20220903.qmd
  • ADA1_ALL_20220905.qmd
  • ADA1_ALL_20220910.qmd

1.1.2 Structure

We will include all of our assignments together in this document to retain the relevant information needed for subsequent assignments since our analysis is cumulative. You will also have an opportunity to revisit previous parts to make changes or improvements, such as updating your codebook, recoding variables, and improving tables and plots. I’ve provided an initial predicted organization of our sections and subsections using the # and ## symbols. A table of contents is automatically generated using “toc: true” in the yaml, and the headings in the table of contents are clickable to jump down to each (sub)section.

1.1.3 Classes not appearing in this document

Some assignments are in a separate worksheet and are indicated with “(separate worksheet)”. For these, I’ll provide a dataset I want you to analyze. Typically, you’ll then return to this document and repeat the same type of analysis with your dataset.


2 Research Questions

2.1 Class 02, Personal Codebook

Rubric

  1. (1 p) Is there a topic of interest?

  2. (2 p) Are the variables relevant to a set of research questions?

  3. (4 p) Are there at least 2 categorical and 2 numerical variables (at least 4 “data” variables)?

    • 1 categorical variable with only 2 levels
    • 1 categorical variable with at least 3 levels
    • 2 numerical variables with many possible unique values
    • More variables are welcome and you’re likely to add to this later in the semester
  4. (3 p) For each variable, is there a variable description, a data type, and coded value descriptions?

  5. Compile this qmd file to an html, print/save to pdf, and upload to UNM Canvas.

2.1.1 Topic and research questions

Topic:

As you select variables from the bottom of this document, a general topic should reveal itself to you.

Research questions:

  1. Question 1

  2. Question 2

  3. Question 3

3 Codebook

National Epidemiologic Survey on Alcohol and Related Conditions-III (NESARC-III)

Dataset: NESARC
Primary association: nicotine dependence vs frequency and quantity of smoking

Key:
RenamedVarName
  VarName original in dataset
  Variable description
  Data type (Continuous, Discrete, Nominal, Ordinal)
  Frequency ItemValue Description

ID
  IDNUM
  UNIQUE ID NUMBER WITH NO ALPHABETICS
  Nominal
  43093 1-43093. Unique Identification number

Sex
  SEX
  SEX
  Nominal
  18518 1. Male
  24575 2. Female

Age
  AGE
  AGE
  Continuous
  43079 18-97. Age in years
     14 98. 98 years or older

Height_ft
  S1Q24FT
  HEIGHT: FEET
  42363 4-7. Feet
  730 99. Unknown
          * change 99 to NA

Height_in
  S1Q24IN
  HEIGHT: INCHES
  Continuous
  3572 0. None
  38760 1-11. Inches
  761 99. Unknown
          * change 99 to NA

Weight_lb
  S1Q24LB
  WEIGHT: POUNDS
  Continuous
  41717 62-500. Pounds
  1376 999. Unknown
          * change 999 to NA


ADD MORE HERE
ADD MORE HERE
ADD MORE HERE
ADD MORE HERE
ADD MORE HERE
ADD MORE HERE
ADD MORE HERE
ADD MORE HERE
ADD MORE HERE
ADD MORE HERE

Additional variables were created from the original variables:

CREATED VARIABLES

Height_inches
  Total height in inches
  Height_ft * 12 + Height_in


ADD MORE HERE
ADD MORE HERE
ADD MORE HERE
ADD MORE HERE (If you think you'll combine or transform any variables)
ADD MORE HERE
ADD MORE HERE
ADD MORE HERE

4 Data Management

4.1 Class 03, Data subset and numerical summaries

Rubric

  1. (4 p) The data are loaded and a data.frame subset is created by selecting only the variables in the personal codebook.

    • Scroll down to sections labeled “(Class 03)”.
  2. (1 p) Output confirms the subset is correct (e.g., using dim() and str()).

  3. (3 p) Rename your variables to descriptive names (e.g., from “S3AQ3B1” to “SmokingFreq”).

    • Scroll down to sections labeled “(Class 03)”.
  4. (2 p) Provide numerical summaries for all variables (e.g., using summary()).

    • Scroll down to sections labeled “(Class 03)”.

4.1.1 Data subset (Class 03)

First, the data is placed on the search path.

# data analysis packages
library(erikmisc)   # Helpful functions
── Attaching packages ─────────────────────────────────────── erikmisc 0.1.16 ──
✔ tibble 3.1.8     ✔ dplyr  1.0.9
── Conflicts ─────────────────────────────────────────── erikmisc_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
erikmisc, solving common complex data analysis workflows
  by Dr. Erik Barry Erhardt <erik@StatAcumen.com>
library(tidyverse)  # Data manipulation and visualization suite
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tidyr   1.2.0     ✔ stringr 1.4.1
✔ readr   2.1.2     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
library(lubridate)  # Dates

Attaching package: 'lubridate'

The following objects are masked from 'package:base':

    date, intersect, setdiff, union
  ## 1. Download the ".RData" file for your dataset into your ADA Folder.
  ## 2. Use the load() statement for the dataset you want to use.
# read data example
#load("NESARC.RData")

#dim(NESARC)
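
The subsetting step in the rubric can be sketched as follows. This is a minimal illustration, not the course solution: the data frame `dat_raw` is a small made-up stand-in for the loaded dataset, and the object names `dat_raw`, `vars_keep`, and `dat_sub` are assumptions. The column names are the original names from the codebook above.

```r
# Toy stand-in for the full dataset (hypothetical values, NOT real NESARC data)
dat_raw <- data.frame(
  IDNUM   = 1:4,
  SEX     = c(1, 2, 2, 1),
  AGE     = c(23, 45, 98, 31),
  S1Q24FT = c(5, 5, 99, 6),
  S1Q24IN = c(7, 2, 99, 0),
  S1Q24LB = c(150, 130, 999, 200),
  OTHER   = c(0, 1, 0, 1)   # a column that is not in the personal codebook
)

# Keep only the variables listed in the personal codebook
vars_keep <- c("IDNUM", "SEX", "AGE", "S1Q24FT", "S1Q24IN", "S1Q24LB")
dat_sub   <- dat_raw[, vars_keep]

# Confirm the subset: number of rows and columns, then each column's type
dim(dat_sub)
str(dat_sub)
```

With the real data, `dat_raw` would be replaced by the object created by `load()`.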

4.1.2 Renaming Variables (Class 03)
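
One way to rename variables in base R is sketched below, assuming a toy data frame with made-up values; `name_map` is a hypothetical helper object. The new names follow the codebook above.

```r
# Toy data with original dataset names (hypothetical values)
dat_sub <- data.frame(IDNUM = 1:3, SEX = c(1, 2, 2), S1Q24LB = c(150, 130, 999))

# Map of names, written as: new name = "old name"
name_map <- c(ID = "IDNUM", Sex = "SEX", Weight_lb = "S1Q24LB")

# Replace each matched old name with its new name; unmatched columns are unchanged
idx <- match(names(dat_sub), name_map)
names(dat_sub)[!is.na(idx)] <- names(name_map)[idx[!is.na(idx)]]

# Confirm the new names
names(dat_sub)
```

With the tidyverse loaded, `dplyr::rename()` does the same job using the same `new = old` order.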

4.1.3 Coding missing values (Class 04)

There are two steps. The first step is to recode any existing NAs to actual values, if necessary. The method for doing this differs for numeric and categorical variables. The second step is to recode any coded missing values, such as 9s or 99s, as actual NA.

4.1.3.1 Coding NAs as meaningful “missing”

First step: some existing blank values (stored as NA) actually mean “never”, and “never” has a meaning different from “missing”. For each such variable, we need to decide what “never” means and code it appropriately.

4.1.3.1.1 NAs recoded as numeric
4.1.3.1.2 NAs recoded as categorical

4.1.3.2 Coding 9s and 99s as NAs
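
The codebook notes “change 99 to NA” and “change 999 to NA”. A minimal sketch of that recoding is below, using made-up values in a toy data frame named `dat` (an assumption). `%in%` is used instead of `==` so that any NAs already present do not interfere with the indexing.

```r
# Toy data using the codebook's "Unknown" codes (hypothetical values)
dat <- data.frame(
  Height_ft = c(5, 99, 6),
  Height_in = c(7, 99, 0),
  Weight_lb = c(150, 999, 200)
)

# Replace each variable's "Unknown" code with NA
dat$Height_ft[dat$Height_ft %in% 99]  <- NA
dat$Height_in[dat$Height_in %in% 99]  <- NA
dat$Weight_lb[dat$Weight_lb %in% 999] <- NA

# Check: the coded values are gone and summary() reports the NA counts
summary(dat)
```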

4.1.4 Creating new variables (Class 04+)

4.1.4.1 From categories to numeric

4.1.4.2 From numeric to numeric
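
The created variable `Height_inches` from the codebook (Height_ft * 12 + Height_in) can be computed as in this sketch, which assumes the missing codes have already been set to NA; the data values are made up.

```r
# Toy data after missing codes were recoded to NA (hypothetical values)
dat <- data.frame(Height_ft = c(5, NA, 6), Height_in = c(7, NA, 0))

# Total height in inches; a row with NA in either part stays NA
dat$Height_inches <- dat$Height_ft * 12 + dat$Height_in

dat
```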

4.1.4.3 From numeric to categories based on quantiles
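
A sketch of one way to turn a numeric variable into quantile-based categories, using `quantile()` for the cut points and `cut()` for the grouping. The ages and the labels `Q1` to `Q4` are illustrative assumptions.

```r
age <- c(18, 22, 25, 31, 38, 44, 52, 60, 71, 85)  # hypothetical ages

# Cut points at the quartiles (0%, 25%, 50%, 75%, 100%)
breaks <- quantile(age, probs = seq(0, 1, by = 0.25), na.rm = TRUE)

# include.lowest = TRUE keeps the minimum value in the first group
age_q <- cut(age, breaks = breaks, include.lowest = TRUE,
             labels = c("Q1", "Q2", "Q3", "Q4"))

# Check how many observations fall in each group
table(age_q)
```

If a variable has many tied values, some quantiles can coincide and `cut()` will complain about non-unique breaks; in that case choose cut points by hand.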

4.1.4.4 From many categories to a few

4.1.4.5 Working with Dates

4.1.4.6 Review results of new variables

4.1.5 Labeling Categorical variable levels (Class 04)
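
Labeling factor levels can be sketched with `factor()`, here using the codes for Sex from the codebook (1 = Male, 2 = Female). The values in `sex_code` are made up.

```r
sex_code <- c(1, 2, 2, 1, 2)  # hypothetical coded values

# levels = the codes in the data, labels = the descriptions from the codebook
Sex <- factor(sex_code, levels = c(1, 2), labels = c("Male", "Female"))

# Confirm the labels were applied
table(Sex)
str(Sex)
```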

4.1.6 Data subset rows

4.2 Data is complete (Class 04)

4.2.1 Plot entire dataset, show missing values

4.2.2 Numerical summaries to assess correctness (Class 03)


5 Graphing and Tabulating

5.1 Class 04, Plotting univariate

Rubric

  1. (3 p) For one categorical variable, a barplot is plotted with axis labels and a title. Interpret the plot: describe the relationship between categories you observe.

  2. (3 p) For one numerical variable, a histogram or boxplot is plotted with axis labels and a title. Interpret the plot: describe the distribution (shape, center, spread, outliers).

  3. (2 p) Code missing values and indicate with R output that this was done correctly (e.g., table(), str(), dim(), summary()).

    • Scroll up to sections labeled “(Class 04)”.
  4. (2 p) Label levels of factor variables.

    • Scroll up to sections labeled “(Class 04)”.

5.2 Categorical variables

5.2.1 Tables for categorical variables

5.2.2 Graphing frequency tables

5.3 Numeric variables

5.3.1 Graphing numeric variables

5.3.2 Creating Density Plots


5.4 Class 05-1, Plotting bivariate, numeric response

Rubric

  1. Each of the following (2 p for plot, 2 p for labelled axes and title, 1 p for interpretation):

    1. Scatter plot (for regression): \(x\) = numerical, \(y\) = numerical, include axis labels and a title. Interpret the plot: describe the relationship.

    2. Box plots (for ANOVA): \(x\) = categorical, \(y\) = numerical, include axis labels and a title. Interpret the plot: describe the relationship.

5.4.1 Scatter plot (for regression): x = numerical, y = numerical

5.4.2 Box plots (for ANOVA): x = categorical, y = numerical

5.5 Class 05-2, Plotting bivariate, categorical response

Rubric

  1. Each of the following (2 p for plot, 2 p for labelled axes and title, 1 p for interpretation):

    1. Mosaic plot or bivariate bar plots (for contingency tables): \(x\) = categorical, \(y\) = categorical, include axis labels and a title. Interpret the plot: describe the relationship.

    2. Logistic scatter plot (for logistic regression): \(x\) = numerical, \(y\) = categorical (binary), include axis labels and a title. Interpret the plot: describe the relationship.

5.5.1 Mosaic plot or bivariate bar plots (for contingency tables): x = categorical, y = categorical

5.5.2 Logistic scatter plot (for logistic regression): x = numerical, y = categorical (binary)

5.6 Class 06, Figure arrangement, captions, cross-referencing

Rubric

  1. Reorganize your Class 05 bivariate plots above using plot_grid() and quarto, creating captions and cross-referencing them from the text.

    1. (3 p) For your numeric response plots, use cowplot::plot_grid() to create a single figure with separate plot panels.

    2. (3 p) For your categorical response plots, use quarto chunk options fig-cap, fig-subcap, and layout-ncol to create a single figure with separate plot panels.

    3. (2 p) Use captions to describe (not interpret) both sets of plots so a reader understands what is being plotted.

    4. (2 p) Use cross-referencing from the text to refer to the plots when you interpret them.

Note

Go above and reformat your plots and update your interpretations with cross-referencing.


6 Statistical methods

6.1 Class 07-1, Simple linear regression (separate worksheet)

Note

Find this assignment on the course website.

6.2 Class 07-2, Simple linear regression

Rubric

  1. With your previous (or new) bivariate scatter plot, add a regression line.
    • (2 p) plot with regression line,
    • (1 p) label axes and title.
  2. Use lm() to fit the linear regression and interpret slope and \(R^2\) (R-squared) values.
    • (2 p) lm summary() table is presented,
    • (2 p) slope is interpreted with respect to a per-unit increase of the \(x\) variable in the context of the variables in the plot,
    • (2 p) \(R^2\) is interpreted in a sentence.
  3. (1 p) Interpret the intercept. Does it make sense in the context of your study?

6.2.1 1. Scatter plot, add a regression line.

6.2.2 2. Fit the linear regression, interpret slope and \(R^2\) (R-squared) values
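
A minimal sketch of fitting a simple linear regression with `lm()` and pulling out the slope and \(R^2\). The data are made up; the names `dat`, `fit`, `slope`, and `r2` are assumptions.

```r
# Hypothetical data: x = height (inches), y = weight (pounds)
dat <- data.frame(
  x = c(60, 62, 65, 67, 70, 72, 74),
  y = c(115, 126, 140, 152, 160, 178, 185)
)

# Fit y = b0 + b1 * x by least squares
fit <- lm(y ~ x, data = dat)

# Coefficient table, residual standard error, and R-squared
summary(fit)

# Extract the pieces that need interpretation
slope <- coef(fit)[["x"]]          # estimated change in y per one-unit increase in x
r2    <- summary(fit)$r.squared    # proportion of variation in y explained by x
```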

6.2.3 3. Interpret the intercept. Does it make sense?

6.3 Class 08-1, Logarithm transformation (separate worksheet)

Note

Find this assignment on the course website.

6.4 Class 08-2, Logarithm transformation

Rubric

  1. Try plotting the data on a logarithmic scale
    • (6 p) Each of the logarithmic relationships is plotted, axes are labelled with scale.
    1. original scales
    2. \(\log(x)\)-only
    3. \(\log(y)\)-only
    4. both \(\log(x)\) and \(\log(y)\)
  2. What happened to your data when you transformed it?
    • (2 p) Describe what happened to the relationship after each log transformation (compare transformed scale to original scale; is the relationship more linear, more curved?).
    • (1 p) Choose the best scale for a linear relationship and explain why.
    • (1 p) Does your relationship benefit from a logarithmic transformation? Say why or why not.

6.4.1 1. Try plotting on log scale (original scale, \(\log(x)\)-only, \(\log(y)\)-only, both \(\log(x)\) and \(\log(y)\))

6.4.2 2. What happened to your data when you transformed it?

  • Describe what happened to the relationship after each log transformation (compare transformed scale to original scale).

  • Choose the best scale for a linear relationship and explain why.

  • Does your relationship benefit from a logarithmic transformation? Say why or why not.

6.5 Class 09, Correlation (separate worksheet)

Note

Find this assignment on the course website.

6.6 Class 10, Categorical contingency tables (separate worksheet)

Note

Find this assignment on the course website.

6.7 Class 11, Correlation and Categorical contingency tables

Rubric

  1. With your previous (or a new) bivariate scatter plot, calculate the correlation and interpret.
    • (1 p) plot is repeated here or the plot is referenced and easy to find from a plot above,
    • (1 p) correlation is calculated,
    • (2 p) correlation is interpreted (direction, strength of LINEAR relationship).
  2. With your previous (or a new) two- or three-variable categorical plot, calculate conditional proportions and interpret.
    • (1 p) frequency table of variables is given,
    • (2 p) conditional proportion tables are calculated of the outcome variable conditional on one or two other variables,
    • (1 p) a well-labelled plot of the proportion table is given,
    • (2 p) the conditional proportions are interpreted and compared between conditions.

6.7.1 Correlation
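
A sketch of calculating the Pearson correlation when some values are missing. The data are made up; `use = "complete.obs"` drops any pair with an NA before computing.

```r
# Hypothetical data with a few missing values
x <- c(60, 62, NA, 67, 70, 72, 74)
y <- c(115, 126, 140, 152, NA, 178, 185)

# Pearson correlation using only complete (x, y) pairs
r <- cor(x, y, use = "complete.obs")
r
```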

6.7.2 Interpretation of correlation

6.7.3 Contingency table
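
A frequency table and its conditional proportions can be sketched with `table()` and `prop.table()`. The data and the variable `Smoker` are made up for illustration; `margin = 1` conditions on the row variable, so each row sums to 1.

```r
# Hypothetical data
dat <- data.frame(
  Sex    = c("Male", "Male", "Male", "Female", "Female", "Female", "Female", "Male"),
  Smoker = c("Yes", "No", "Yes", "No", "No", "Yes", "No", "No")
)

# Frequency table: rows = Sex, columns = Smoker
tab <- table(dat$Sex, dat$Smoker, dnn = c("Sex", "Smoker"))
tab

# Proportion of Smoker conditional on Sex (each row sums to 1)
prop_row <- prop.table(tab, margin = 1)
prop_row
```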

6.7.4 Interpretation of conditional proportions

6.8 Class 12-1, Parameter estimation (one-sample) (separate worksheet)

Note

Find this assignment on the course website.

6.9 Class 12-2, Inference and Parameter estimation (one-sample)

Rubric

  1. Using a numerical variable, calculate and interpret a confidence interval for the population mean.
    • (1 p) Identify and describe the variable,
    • (1 p) use t.test() to calculate the mean and confidence interval, and
    • (1 p) interpret the confidence interval.
    • (2 p) Using plotting code from the last two classes, plot the data, estimate, and confidence interval in a single well-labelled plot.
  2. Using a two-level categorical variable, calculate and interpret a confidence interval for the population proportion.
    • (1 p) Identify and describe the variable,
    • (1 p) use binom.test() to calculate the proportion and confidence interval, and
    • (1 p) interpret the confidence interval.
    • (2 p) Using plotting code from the last two classes, plot the data, estimate, and confidence interval in a single well-labelled plot.

6.9.1 Numeric variable confidence interval for mean \(\mu\)
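
A sketch of the confidence interval for a mean using `t.test()`, with the same interval computed by hand as \(\bar{Y} \pm t_{\textrm{crit}} \, s/\sqrt{n}\) for comparison. The data are made up.

```r
y <- c(150, 132, 171, 145, 160, 128, 190, 155)  # hypothetical weights (lb)

# One-sample t procedure: sample mean and 95% confidence interval
tt     <- t.test(y, conf.level = 0.95)
y_mean <- unname(tt$estimate)
ci     <- as.numeric(tt$conf.int)

# The same interval by hand: mean +/- t_crit * SE
n       <- length(y)
ci_hand <- mean(y) + c(-1, 1) * qt(0.975, df = n - 1) * sd(y) / sqrt(n)

y_mean
ci
```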

6.9.2 Categorical variable confidence interval for proportion \(p\)
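
A sketch of the confidence interval for a proportion using `binom.test()`, which takes the number of “successes” and the number of trials. The variable `smoker` and its values are made up.

```r
# Hypothetical two-level variable
smoker <- c("Yes", "No", "No", "Yes", "No", "No", "No", "Yes", "No", "No")

n_yes <- sum(smoker == "Yes")  # number of "successes"
n     <- length(smoker)        # number of trials

# Exact binomial procedure: sample proportion and 95% confidence interval
bt    <- binom.test(n_yes, n, conf.level = 0.95)
p_hat <- unname(bt$estimate)
ci    <- as.numeric(bt$conf.int)

p_hat
ci
```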

6.10 Class 13, Hypothesis testing (one- and two-sample) (separate worksheet)

Note

Find this assignment on the course website.

6.11 Class 14, Paired data, assumption assessment (separate worksheet)

Note

Find this assignment on the course website.

6.12 Class 15, Hypothesis testing (one- and two-sample)

6.12.1 Mechanics of a hypothesis test (review)

  1. Set up the null and alternative hypotheses in words and notation.

    • In words: “The population mean for [what is being studied] is different from [value of \(\mu_0\)].” (Note that the statement in words is in terms of the alternative hypothesis.)
    • In notation: \(H_0: \mu=\mu_0\) versus \(H_A: \mu \ne \mu_0\) (where \(\mu_0\) is specified by the context of the problem).
  2. Choose the significance level of the test, such as \(\alpha=0.05\).

  3. Compute the test statistic, such as \(t_{s} = \frac{\bar{Y}-\mu_0}{SE_{\bar{Y}}}\), where \(SE_{\bar{Y}}=s/\sqrt{n}\) is the standard error.

  4. Determine the tail(s) of the sampling distribution where the \(p\)-value from the test statistic will be calculated (for example, both tails, right tail, or left tail). (Historically, we would compare the observed test statistic, \(t_{s}\), with the critical value \(t_{\textrm{crit}}=t_{\alpha/2}\) in the direction of the alternative hypothesis from the \(t\)-distribution table with degrees of freedom \(df = n-1\).)

  5. State the conclusion in terms of the problem.

    • Reject \(H_0\) in favor of \(H_A\) if \(p\textrm{-value} < \alpha\).
    • Fail to reject \(H_0\) if \(p\textrm{-value} \ge \alpha\). (Note: We DO NOT accept \(H_0\).)
  6. Check assumptions of the test (for now we skip this).
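
The steps above can be sketched for a two-sided one-sample \(t\)-test, computing \(t_{s}\) and the \(p\)-value by hand and then checking against `t.test()`. The sample and the value of `mu0` are made up.

```r
y     <- c(5.1, 4.8, 5.6, 5.9, 5.3, 4.7, 5.8, 5.5)  # hypothetical sample
mu0   <- 5      # null value, H0: mu = mu0 versus HA: mu != mu0
alpha <- 0.05   # significance level

n   <- length(y)
se  <- sd(y) / sqrt(n)        # standard error of the mean, s / sqrt(n)
t_s <- (mean(y) - mu0) / se   # test statistic

# Two-sided p-value: area in both tails beyond |t_s|, with df = n - 1
p_value <- 2 * pt(-abs(t_s), df = n - 1)

# Historical approach: compare |t_s| with the critical value
t_crit <- qt(1 - alpha / 2, df = n - 1)

# Decision: reject H0 if p-value < alpha, otherwise fail to reject H0
reject_H0 <- p_value < alpha

# The same test using t.test()
tt <- t.test(y, mu = mu0, alternative = "two.sided")
tt
```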

6.12.2 What do we do about “significance”?

Adapted from Significance Magazine.

Recent calls have been made to abandon the term “statistical significance”. The American Statistical Association (ASA) issued its statement and recommendation on p-values (see the special issue on p-values for more).

In summary, the problem of “significance” is one of misuse, misunderstanding, and misinterpretation. The recommendation in this class is that it is no longer sufficient to say that a result is “statistically significant” or “non-significant” depending on whether a p-value is less than a threshold. Instead, we will be looking for wording as in the following paragraph.

“The difference between the two groups turns out to be small (8%), while the probability (\(p\)) of observing a result at least as extreme as this under the null hypothesis of no difference between the two groups is \(p = 0.003\) (that is, 0.3%). This p-value is statistically significant as it is below our pre-defined threshold (\(p < 0.05\)). However, the p-value tells us only that the 8% difference between the two groups is somewhat unlikely given our hypothesis and null model’s assumptions. More research is required, or other considerations may be needed, to conclude that the difference is of practical importance and reproducible.”

6.12.3 Two-sample \(t\)-test

Rubric

  1. Using a numerical response variable and a two-level categorical variable (or a categorical variable you can reduce to two levels), specify a two-sample \(t\)-test associated with your research questions.
    • (2 p) Specify the hypotheses in words and notation (either one- or two-sided test),
    • (0 p) use t.test() to calculate the group means, test statistic, and p-value,
    • (3 p) state the significance level, test statistic, and p-value, and
    • (2 p) state the conclusion in the context of the problem.
    • (1 p) Given your conclusion, could you have committed a Type-I or Type-II error?
    • (2 p) Provide an appropriate plot of the data and sample estimates in a well-labelled plot.
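
A sketch of a two-sample \(t\)-test using the formula interface of `t.test()`. The data and group labels are made up. Note that `t.test()` uses the Welch (unequal variance) version by default.

```r
# Hypothetical data: numerical response y, two-level categorical variable group
dat <- data.frame(
  group = rep(c("A", "B"), each = 6),
  y     = c(12.1, 14.3, 11.8, 13.5, 12.9, 14.0,
            15.2, 16.1, 14.8, 17.0, 15.5, 16.4)
)

# H0: mu_A = mu_B versus HA: mu_A != mu_B (two-sided)
tt <- t.test(y ~ group, data = dat, alternative = "two.sided")
tt

# Pieces to report
tt$estimate    # sample mean of each group
tt$statistic   # test statistic
tt$p.value     # p-value
```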

6.13 Class 16, ANOVA, Pairwise comparisons (separate worksheet)

Note

Find this assignment on the course website.

6.14 Class 17, ANOVA and Assessing Assumptions

Rubric

  1. Using a numerical response variable and a categorical variable with three to five levels (or a categorical variable you can reduce to three to five levels), specify an ANOVA hypothesis associated with your research questions.
    • (1 p) Specify the ANOVA hypotheses in words and notation,
    • (1 p) plot the data in a way that is consistent with the hypothesis test (comparing means, assess equal variance assumption),
    • (1 p) use aov() to calculate the hypothesis test statistic and p-value,
    • (1 p) state the significance level, test statistic, and p-value,
    • (1 p) state the conclusion in the context of the problem,
    • (2 p) assess the normality assumption of the residuals using appropriate methods (QQ-plot and Anderson-Darling test), and
    • (1 p) assess the assumption of equal variance between your groups using an appropriate test (also mention standard deviations of each group).
    • (2 p) If you rejected the ANOVA null hypothesis, perform follow-up pairwise comparisons using Tukey’s HSD to indicate which groups have statistically different means and summarize the results.

6.14.1 Hypothesis and plot

6.14.2 Transform the response variable to satisfy assumptions

(If required.)

6.14.3 ANOVA Hypothesis test
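
A sketch of a one-way ANOVA with `aov()`, extracting the \(F\) statistic and \(p\)-value from the summary table. The data and group labels are made up.

```r
# Hypothetical data: numerical response y, three-level factor group
dat <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 5)),
  y     = c(10, 12, 11, 13, 12,
            14, 15, 13, 16, 15,
            20, 19, 21, 22, 20)
)

# H0: all group means are equal versus HA: at least one mean differs
fit <- aov(y ~ group, data = dat)
summary(fit)

# Extract the F statistic and p-value from the first row of the ANOVA table
aov_tab <- summary(fit)[[1]]
F_stat  <- aov_tab[["F value"]][1]
p_value <- aov_tab[["Pr(>F)"]][1]
```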

6.14.4 Check assumptions
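
A sketch of checking the ANOVA assumptions using base R only. The rubric asks for the Anderson-Darling test, which is not in base R; `shapiro.test()` is used here as a stand-in normality test, and Bartlett's test is one possible choice of equal-variance test (both are assumptions about what is acceptable, not the course's required method). The data are made up.

```r
# Hypothetical data: numerical response y, three-level factor group
dat <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 5)),
  y     = c(10, 12, 11, 13, 12,
            14, 15, 13, 16, 15,
            20, 19, 21, 22, 20)
)
fit <- aov(y ~ group, data = dat)
res <- residuals(fit)

# Normality of residuals: QQ-plot plus a formal test
qqnorm(res)
qqline(res)
sw <- shapiro.test(res)   # stand-in for the Anderson-Darling test
sw

# Equal variance between groups: standard deviations and Bartlett's test
sds <- tapply(dat$y, dat$group, sd)
sds
bt <- bartlett.test(y ~ group, data = dat)
bt
```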

6.14.5 Post Hoc pairwise comparison tests

(If required.)
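
A sketch of Tukey's HSD pairwise comparisons with `TukeyHSD()`, applied to an `aov()` fit. The data are made up; the grouping variable must be a factor.

```r
# Hypothetical data: numerical response y, three-level factor group
dat <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 5)),
  y     = c(10, 12, 11, 13, 12,
            14, 15, 13, 16, 15,
            20, 19, 21, 22, 20)
)
fit <- aov(y ~ group, data = dat)

# All pairwise differences in means with family-wise 95% confidence intervals
tk <- TukeyHSD(fit, conf.level = 0.95)
tk

# One row per pair: difference, lower and upper limits, adjusted p-value
tk$group
```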

6.15 Class 18, Nonparametric methods (separate worksheet)

Note

Find this assignment on the course website.

6.16 Class 19, Binomial and Multinomial tests (separate worksheet)

Note

Find this assignment on the course website.

6.17 Class 20-1, Two-way categorical tables (separate worksheet)

Note

Find this assignment on the course website.

6.18 Class 20-2, Simple linear regression (separate worksheet)

Note

Find this assignment on the course website.

6.19 Class 21, Two-way categorical and simple linear regression

Rubric

  1. Two-way categorical analysis.
    • Using two categorical variables with two to five levels each, specify a hypothesis test for homogeneity of proportions associated with your research questions.
    • (1 p) Specify the hypotheses in words and notation.
    • (1 p) State the conclusion of the test in the context of the problem.
    • (1 p) Plot a mosaic plot of the data and Pearson residuals.
    • (1 p) Interpret the mosaic plot with reference to the Pearson residuals.
  2. Simple linear regression.
    • Select two numerical variables.
    • (1 p) Plot the data and, if required, transform the variables so a roughly linear relationship is observed. All interpretations will be done on this scale of the variables.
    • (0 p) Fit the simple linear regression model.
    • (1 p) Assess the residuals for lack of fit (interpret plots of residuals vs fitted and \(x\)-value).
    • (1 p) Assess the residuals for normality (interpret QQ-plot and histogram).
    • (1 p) Assess the relative influence of points.
    • (1 p) Test whether the slope is different from zero, \(H_A: \beta_1 \ne 0\).
    • (1 p) Interpret the \(R^2\) value.

6.19.1 Two-way categorical analysis
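
A sketch of a chi-squared test for a two-way table with `chisq.test()`, including the Pearson residuals and a shaded mosaic plot from base R. The counts and the variable `Smoker` are made up.

```r
# Hypothetical two-way table of counts
tab <- as.table(matrix(
  c(30, 20,
    15, 35),
  nrow = 2, byrow = TRUE,
  dimnames = list(Sex = c("Male", "Female"), Smoker = c("Yes", "No"))
))

# Pearson chi-squared test (no continuity correction)
ct <- chisq.test(tab, correct = FALSE)
ct

ct$expected    # expected counts under H0
ct$residuals   # Pearson residuals: (observed - expected) / sqrt(expected)

# Mosaic plot shaded by Pearson residuals
mosaicplot(tab, shade = TRUE, main = "Smoker by Sex (hypothetical data)")
```

The squared Pearson residuals sum to the chi-squared statistic, so the cells with the largest residuals contribute most to the test.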

6.19.2 Simple linear regression

6.20 Class 22, Logistic regression (separate worksheet)

Note

Find this assignment on the course website.

6.21 Class 23, Logistic regression

Rubric

  1. Logistic regression.
    • Select a binary response and a continuous explanatory/predictor variable.
    • (1 p) Plot the data.
    • (1 p) Summarize the \(\hat{p}\) values for each value of the \(x\)-variable. Also, calculate the empirical logits.
    • (1 p) Plot the \(\hat{p}\) values vs the \(x\)-variable and plot the empirical logits vs the \(x\)-variable.
    • (1 p) Describe the logit-vs-\(x\) plot. Is it linear? If not, consider a transformation of \(x\) to improve linearity; describe the transformation you chose if you needed one.
    • (1 p) Fit the glm() model and assess the deviance lack-of-fit test.
    • (1 p) Calculate the confidence bands around the model fit/predictions. Plot on both the logit and \(\hat{p}\) scales.
    • (1 p) Interpret the sign (\(+\) or \(-\)) of the slope parameter and test whether the slope is different from zero, \(H_A: \beta_1 \ne 0\).
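
The steps in this rubric can be sketched with grouped data. The data are made up, and adding 0.5 in the empirical logit (to avoid taking the log of zero) is one common convention, assumed here rather than taken from this document.

```r
# Hypothetical grouped data: y "successes" out of n trials at each x
dat <- data.frame(
  x = c(20, 30, 40, 50, 60),
  n = c(20, 20, 20, 20, 20),
  y = c(3, 6, 9, 13, 16)
)

# Observed proportions and empirical logits
dat$p_hat     <- dat$y / dat$n
dat$emp_logit <- log((dat$y + 0.5) / (dat$n - dat$y + 0.5))

# Logistic regression of the success proportion on x
fit <- glm(cbind(y, n - y) ~ x, family = binomial, data = dat)
summary(fit)

# Deviance lack-of-fit test: a small p-value suggests the model does not fit
p_lof <- pchisq(deviance(fit), df = df.residual(fit), lower.tail = FALSE)

# Fitted values and 95% confidence bands on the logit scale
pred <- predict(fit, type = "link", se.fit = TRUE)
z    <- qnorm(0.975)
dat$fit_logit <- as.numeric(pred$fit)
dat$lwr_logit <- dat$fit_logit - z * as.numeric(pred$se.fit)
dat$upr_logit <- dat$fit_logit + z * as.numeric(pred$se.fit)

# Back-transform to the probability scale with the inverse logit
dat$fit_p <- plogis(dat$fit_logit)
dat$lwr_p <- plogis(dat$lwr_logit)
dat$upr_p <- plogis(dat$upr_logit)

dat
```

The sign of the `x` coefficient in `summary(fit)` gives the direction of the relationship, and its row in the coefficient table tests \(H_A: \beta_1 \ne 0\).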

7 Poster

7.1 Classes 24, 25, and 26: Poster Preparation

7.1.1 Class 24, Poster Preparation, research questions, data sources, analyses

See items under Class 26.

From the list in Class 26, complete Items 2, 3, 4, and 5.

7.1.2 Class 25, Poster Preparation, literature review, references, discussion, future work

See items under Class 26.

From the list in Class 26, complete Items 1, 6, 7, and 8.

Citation help is available at this page: https://quarto.org/docs/authoring/footnotes-and-citations.html

7.1.3 Class 26, Poster Preparation, complete content

Rubric

Organize the content of your poster.

Complete the content for each of these sections:

Title: A short title that reveals the main result of the poster.

  1. (Class 25) (3 p) Introduction

    • (Lit Review) 2-4 bullets describing the study, previous research.
  2. (Class 24) (2 p) Research Questions

    • (Class 02, Personal Codebook) 2 bullets, one for each research question, stated as the alternative hypothesis.
  3. (Class 24) (2 p) Methods

    • (Class 02, Personal Codebook) Data source(s).

    • (Class 02, Personal Codebook) Variables used.

    • (Various) Statistical methods used to answer the research questions.

  4. (Class 24) (3 p) Results for your first research question.

    • Plot and describe the data, as well as the statistical model. This can often be done in a single plot. Examples:

      • ANOVA: A mean with CI bars is the statistical model overlaid on the data points.

      • Contingency table: A mosaic plot with colored boxes relative to contribution to Pearson \(\chi^2\) shows the data with evidence towards the alternative hypothesis.

      • Simple linear regression: A regression line is the statistical model overlaid on the data points.

      • Logistic regression: The logistic curve is the statistical model overlaid on the top/bottom histograms of the data.

    • State the conclusion of the hypothesis test and interpret it in the context of the research question.

  5. (Class 24) (3 p) Results for your second research question.

    • Plot and describe the data, as well as the statistical model. This can often be done in a single plot.

    • State the conclusion of the hypothesis test and interpret it in the context of the research question.

  6. (Class 25) (4 p) Discussion

    • Put the results you found for each research question in the context you provided in your introduction.
  7. (Class 25) (1 p) Further directions or Future work or Next steps or something else that indicates there is more to do and you’ve thought about it.

    • What do these results lead you to want to investigate?
  8. (Class 25) (2 p) References

    • By citing sources in your introduction, this section will automatically have your bibliography.

References

References are supposed to appear here, but they may appear at the end of the document.

7.2 Class 27, Poster Preparation, into poster template

7.3 Class 28, Poster Preparation, reviewed by instructor

7.4 Class 29, Poster Presentations

  • Graduate students.

7.5 Class 30, Poster Presentations

  • Undergraduate students.

[End]
