UNM Stat 428/528: Advanced Data Analysis II (ADA2)
Spring 2017: the syllabus is below the timetable. Spring 2017 schedule; Time: TR 1530-1645; Location: CTLB 300; Stat 428, CRN 33933; Stat 528, CRN 33935. Peer mentors via UNM Stat 495/595: Statistics Education Practicum (SEP), Stat 495.001 or Stat 595.001, CRN 30543 or 41683.
Goal

Learn to produce beautiful (markdown) and reproducible (knitr) reports with informative plots (ggplot2) and tables (xtable) by writing code (R, RStudio) to answer questions using fundamental statistical methods (analysis of covariance, logistic regression, and multivariate methods), which you’ll be proud to present (poster).
News
3/1/17 – Data resources for poster: kaggle, drivendata, 538, agridat package, wise data sources, statsci datasets, vanderbilt datasets.
2/16/17 – How to get pairwise comparison plots from lsmeans(); there’s an object with a complicated name from lsmeans (but, below in the second line, start typing “lsm$” and press TAB for the suggested object), for example:
lsm <- lsmeans( lm.ml.full, list( pairwise ~ species | sex ), adjust = "tukey" )
plot( lsm$`pairwise differences of contrast, sex | sex` )
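As an alternative to TAB completion, the component names of a list can be printed with names(). The sketch below uses a hypothetical plain list as a stand-in for the lsmeans result (it is not a real lsmeans object, and the first component name and all numbers are made up for illustration); it only shows how backticks give access to a name that contains spaces.

```r
# Stand-in for the result of lsmeans(); a real result holds lsmeans objects,
# not numbers. The first name is hypothetical, the second is the one used above.
lsm_demo <- list(
  `lsmeans of species | sex` = c(1.2, 3.4),
  `pairwise differences of contrast, sex | sex` = c(-2.2)
)

# List the component names instead of relying on TAB completion
component_names <- names(lsm_demo)
print(component_names)

# Backticks are required after $ because the name contains spaces and symbols
by_name <- lsm_demo$`pairwise differences of contrast, sex | sex`

# Equivalent access by position, which avoids typing the long name
by_position <- lsm_demo[[2]]
```

With a real lsmeans result, the same idea would apply: print names(lsm) once, then copy the name you need inside backticks.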
2/3/17 – scatter3d() plot from library(rgl) for Mac users (Ch 02, class 03): Install XQuartz (X11), reboot, log out and log back in, then install.packages("rgl").
1/22/17 – RStudio, disabling notebook “inline” results, prepared by TA Geoff Schultz.
Classroom computers: Please reboot classroom laptops at the end of class by request of the IT staff.
Saving data: If you’re using classroom computers, use flash drives or UNM’s OneDrive (available in LoboMail) for saving files. I recommend using a very systematic folder structure, such as a main folder called Stat428_ADA2, with subfolders called homework, in-class, reading, poster, etc.
Course content
Weekly structure (also see Assessment below)
- Pre-class (Tuesday): Reading, Video, Quiz (due before class — solutions become available Tue 3:30, after the quiz is due)
- In-class: Activities start in class Tuesday; turn in the completed assignment to UNM Learn by Wed 5pm (evaluated by TA within 1 week). Thursday we will start the homework in class to allow you to struggle but get questions answered before finishing on your own.
- Post-class (Thursday): Homework submitted to UNM Learn the following Thursday (evaluated by TA within 1 week). Assignments will be common for all students.
Course notes, code, data, and video lectures
Notes from Spring 2016: ADA2_notes_S16.pdf includes all chapters in one document.
Ch | Chapter Title | Notes | R code | Datasets | Video lectures playlist |
---|---|---|---|---|---|
01 | R statistical software and review | | R | turkey.csv, rocket.dat | 01-1, 01-2 |
02 | Introduction to Multiple Linear Regression | | R | indian.dat, gce.dat | 02-1, 02-2 |
03 | A Taste of Model Selection for Multiple Regression | | R | ratliver.csv | 03-1, 03-2 |
04 | One Factor Designs and Extensions | | R | none | 04 |
05 | Paired Experiments and Randomized Block Experiments | | R | battery.dat, beetles.dat, itch.csv, ratinsulin.dat | 05-0 05-1 05-2 05-3 05-4 05-5 05-6 05-7 05-8 05-9 |
06 | A Short Discussion of Observational Studies | | R | sat.dat | 06 |
07 | Analysis of Covariance: Comparing Regression Lines | | R | tools.dat, toolsfake.dat, twins.dat | 07-1 07-2 07-3 HW helper video |
08 | Polynomial Regression | | R | cloudpoint.dat, mooney.dat | 08-1 08-2 |
09 | Discussion of Response Models with Factors and Predictors | | R | faculty.dat | 09-1 09-2 09-3 |
10 | Automated Model Selection for Multiple Regression | | R | oxygen.dat | 10-1 10-2 10-3 |
11 | Logistic Regression | | R | beetles.dat, leuk.dat, menarche.csv, shuttle.csv, trauma.dat | 11-1 11-2 11-3 11-4 |
12 | An Introduction to Multivariate Methods | | R | none | 12 |
13 | Principal Component Analysis | | R | bgs.dat, shells.dat, sparrows.dat, temperature.dat | 13-1 13-2 13-3 |
14 | Cluster Analysis | | R | birthdeath.dat, teeth.dat | 14-1 14-2 14-3 |
15 | Multivariate Analysis of Variance | | R | shells_mf.dat | 15 |
16 | Discriminant Analysis | | R | mower.dat | 16-1 16-2 |
17 | Classification | | R | business.dat | 17-1 17-2 17-3 |
18 | Data Cleaning | | R | conversions.txt, dalton.txt, dirty_iris.csv, edits.txt, people.txt, unnamed.txt | |
(I reserve the right to continue to improve the materials throughout the semester.)
Timetable
Wk-Date | Cl | Topic | Reading, Video, Quiz | In-class Worksheet, Data | Homework | Due before class |
---|---|---|---|---|---|---|
00-01/16 | 00 | Install software | See Step 0 video: 00 | |||
01-01/17 | 01 | 01 R, Review | read: Ch 01 video: 01-1, 01-2 | Note: numbers refer to week numbers | ||
01-01/19 | 02 | In-class quiz | In-class: 02 R Review Rmd html dat Videos: 1, 2, 3 | No HW 01 | ||
02-01/24 | 03 | 02 Introduction to Multiple Linear Regression | read: Ch 02 video: 02-1, 02-2 quiz: 02 | In-class: Rmd html dat Submit pdf with solutions by Wed 5pm. | Quiz 02 | |
02-01/26 | 04 | HW: 02 Mult LR Rmd html dat Submit your pdf to UNM Learn. 2/02 Submit | ||||
03-01/31 | 05 | 03 A Taste of Model Selection for Multiple Linear Regression | read: Ch 03, 04 video: 03-1, 03-2, 04 quiz: 03 (2 parts) | In-class: Rmd html dat | Quiz 03 | |
03-02/02 | 06 | 04 Experimental Design: One and Two Factor Designs | HW: 03 Taste Model Sel Rmd html dat 2/09 Submit | Turn in HW 02 | ||
04-02/07 | 07 | 05 Paired Experiments and Randomized Block Designs | read: Ch 05 (start – 5.2) video: 05-0 05-1 05-2 05-3 05-4 05-5 quiz: 04 | In-class: Rmd html | Quiz 04 | |
04-02/09 | 08 | HW: 04 Experiments 1 Rmd html 2/16 Submit | Turn in HW 03 | |||
05-02/14 | 09 | read: Ch 05 (5.3 – end) video: 05-6 05-7 05-8 05-9 quiz: 05 | In-class: Rmd html dat | Quiz 05 | ||
05-02/16 | 10 | HW: 05 Experiments 2 Rmd html dat 2/23 Submit | Turn in HW 04 | |||
06-02/21 | 11 | 06 Discussion of Observational Studies | read: Ch 06-07 video: 06 07-1 07-2 07-3 quiz: 06 (2 parts) | In-class: html turn in paper version | Quiz 06 | |
06-02/23 | 12 | 07 Analysis of Covariance: Comparing Regression Lines | HW: 06 ANCOVA 1 Rmd html dat 3/02 Submit Discuss Wald test matrix specification. | Turn in HW 05 | ||
07-02/28 | 13 | 08 Polynomial Regression | read: Ch 08-1 08-2 09-1 09-2 09-3 video: quiz: 07 (2 parts) | In-class: Rmd html dat | Quiz 07 | |
07-03/02 | 14 | 09 Response Models with Factors and Predictors | HW: 07 ANCOVA 2 Rmd html dat Helper video 3/09 Submit | Turn in HW 06 | ||
08-03/07 | 15 | 10 Model Selection for Multiple Regression | read: Ch 10 video: 10-1 10-2 10-3 quiz: 08 | HW 07 Continued in class | Quiz 08 | |
08-03/09 | 16 | HW 07 Continued in class, due by 5pm | Turn in HW 07 | |||
09-03/14 | 17 | Spring Break | ||||
09-03/16 | 18 | Spring Break | ||||
10-03/21 | 19 | 11 Logistic Regression | read: Ch 11 video: 11-1 11-2 11-3 11-4 quiz: 10 | In-class: Rmd html dat | Poster: Poster Planning Rmd html Due 3/28 Choose/define poster project requiring a method from class: ANCOVA, Logistic multiple regression, PCA, etc. | Quiz 10 |
10-03/23 | 20 | HW: 10 Logistic Regression Rmd html dat 3/30 Submit | ||||
11-03/28 | 21 | 12 An Introduction to Multivariate Methods | read: Ch 12-13 video: 12 13-1 13-2 13-3 quiz: 11 (2 parts) | In-class: Rmd html dat | Quiz 11 | |
11-03/30 | 22 | 13 Principal Components Analysis (PCA) | HW: 11 PCA Rmd html dat 4/06 Submit | Turn in HW 10 | ||
12-04/04 | 23 | 14 Cluster Analysis | read: Ch 14-15 video: 14-1 14-2 14-3 15 quiz: 12 (2 parts) | In-class: Clustering Rmd html dat | Quiz 12 | |
12-04/06 | 24 | 15 Multivariate Analysis of Variance (MANOVA) | HW: 12 MANOVA Rmd html dat 4/13 Submit | Turn in HW 11 | ||
13-04/11 | 25 | 16 Discriminant Analysis 17 Classification | read: Ch 16-17 video: 16-1 16-2 17-1 17-2 17-3 quiz: 13 (2 parts) | In-class: Discriminant analysis for classification Rmd html dat | Quiz 13, Grade HW 11 | |
13-04/13 | 26 | 13+11+17 PCA and logistic regression classification | HW: 13+11+17 PCA and logistic Classification Rmd html dat 4/20 Submit | Turn in HW 12 | ||
14-04/18 | 27 | Posters begin | HW: Poster document 1 of 2: Analysis, Due Friday Rmd html | |||
14-04/20 | 28 | 4/21 Submit | Turn in HW 13, Turn in Poster Doc 1/2 Fri 4/21 | |||
15-04/25 | 29 | HW: Poster document 2 of 2: Intro/Discuss/Bib, Due Friday Rmd html | ||||
15-04/27 | 30 | 4/28 Submit | Turn in Poster Doc 2/2 Fri 4/28 | |||
16-05/02 | 31 | Survey Poster finalize | Poster template pdf, Rnw, sty, bib, logo Example poster pdf, Rnw Transition from Markdown to LaTeX Video for poster transition | Poster printing ARI Graphix $9+tax poster printing Open Mon-Fri 7:30-5:30 4716 McLeod Rd NE Do not use their website! Email: plotting@abqrepro.com, Subject: ADA2 class poster Text: indicate to print “in color on bond paper”. Attach: Poster pdf with your name in the filename, such as “FirstLast_ADA2_poster.pdf”. Try to send by Tuesday 5 PM for the poster to be ready by Thursday (earlier is better). Arrange to pick up the poster. Price is $0.75/sq ft for Spring 2017. | ||
16-05/04 | 32 | POSTERS | Poster session in SMLC lobby 3:30-5:30pm | Poster: Submit poster pdf to UNM Learn Fri 5/9 5pm Poster reviewing rubric | ||
17-05/09 | FINALS WEEK | (no final) | Surveys Due — submit receipt or confirmation page to UNM Learn * Learning Studio * EvalKit in Learn | Surveys Due 5/11 5pm |
Syllabus
Description: A continuation of 427/527 that focuses on methods for analyzing multivariate data and categorical data. Topics include MANOVA, principal components, discriminant analysis, classification, factor analysis, analysis of contingency tables including log-linear models for multidimensional tables and logistic regression.
Prerequisite: Stat 427 (ADA1)
Semesters offered: Spring
Lecture: Stat 428/528.001 (CRN 33933 or 33935), TR 1530-1645, CTLB 300
Video email: “Erik B. Erhardt” <erike@stat.unm.edu>, please include “ADA2” in the subject line
Textbook: Peter Dalgaard, “Introductory Statistics with R”, Second Edition, 2008, ISBN: 978-0-387-79053-4. The book is not required, but it will provide a backup for what you learn in class.
Office hours: SMLC 312, TR 1400-1500
Laptops running R: I encourage you to bring a laptop to class each day so you can try the R programming exercises in class. If you don’t have one, no problem, there are some laptops in class and teamwork is encouraged: sit next to someone friendly who likes to share.
Teaching Assistants and Peer Mentors
Stat grad student TAs
Lindsey Pittington <lpittin@unm.edu>, SMLC 301, office hours Mon 12-2 PM, Wed 3-5 PM
Yiming Yang <yiming@unm.edu>, SMLC 319, office hours Tue 10 AM – 12 PM, Thu 10 AM – 12 PM
Geoffrey Dylan Schultz <gdschultz@unm.edu>, SMLC 345, office hours Wed 1-3 PM, Fri 1-3 PM
And Erik’s office hours are SMLC 312, Tue 2-3 PM and Thu 2-3 PM.
So many office hours! Mon 12-2, Tue 10-12, 2-3, Wed 1-5, Thu 10-12, 2-3, & Fri 1-3.
Peer Mentors
Alicia Dominguez, ADA course alumnus
Grace Mayer, ADA course alumnus
Student learning outcomes
Similar to ADA1, but at a higher level.
Assessment
- Quizzes will be due each Tuesday before class. Purpose: to assess reading and video comprehension and assure you’re prepared to actively participate in class activities with minimal lecture. (About 12, 20% of final grade, the lowest few are dropped.) Most weeks plan for 1-3 hours reading and video, 30-60 minute quiz.
- In-class assignments are due the following day by 5pm, submitted to UNM Learn. Purpose: to struggle and find success in class with the concepts and skills. (About 12, includes class participation, 20% of final grade, the lowest few are dropped.) Plan to start and finish in class, sometimes 1-2 hours beyond class.
- Homework (HW) assignments are assigned each Thursday and due the following Thursday, submitted to UNM Learn. Purpose: to apply concepts and skills to your class poster project. (About 12, 40% of final grade, the lowest few are dropped.) Most weeks plan on 2-12 hours per assignment.
- Poster will be developed and completed in the last weeks of the semester, and the last week we’ll have poster presentations. Purpose: to have an overarching set of questions to answer using methods learned in the course, with a deliverable you can be proud of! (16% of final grade in total: 2% preparation, 10% poster, 2% presentation, and 2% evaluations of others.) In the last couple weeks, assembling this poster may take 3-5 hours, using a template provided to you.
- Course surveys are to collect information to help facilitate the class or to encourage participation in course evaluations. Purpose: to participate in national project-based learning projects and improve the course. (About 2, 4% of final grade [and a simple way to go from B+ to A].)
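The category weights above sum to 100%, so a final grade can be sketched as a weighted mean of category averages. The example below is only an illustration under assumptions: the scores are invented, scores are taken to be on a 0-100 scale, and the number of dropped scores per category (`n_drop`) is a guess, since the syllabus says only that “the lowest few” are dropped.

```r
# Category weights taken from the Assessment list (they sum to 1)
weights <- c(quiz = 0.20, inclass = 0.20, homework = 0.40,
             poster = 0.16, survey = 0.04)

# Mean of the scores after dropping the n_drop lowest ones
mean_drop_lowest <- function(scores, n_drop = 0) {
  stopifnot(length(scores) > n_drop)
  kept <- tail(sort(scores), length(scores) - n_drop)
  mean(kept)
}

# Weighted final grade: category means combined using the weights
final_grade <- function(scores, weights, n_drop) {
  category_means <- sapply(names(weights), function(k) {
    mean_drop_lowest(scores[[k]], n_drop[[k]])
  })
  sum(weights * category_means)
}

# Hypothetical scores and drop counts, for illustration only
scores <- list(
  quiz     = c(90, 80, 100, 60),
  inclass  = c(100, 100, 90),
  homework = c(85, 95, 75),
  poster   = 92,
  survey   = c(100, 100)
)
n_drop <- c(quiz = 1, inclass = 1, homework = 1, poster = 0, survey = 0)

grade <- final_grade(scores, weights, n_drop)
print(grade)
```

Because homework carries 40% of the weight, a change in the homework average moves the final grade twice as much as the same change in the quiz or in-class average.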
Collaboration and citation
For homeworks I encourage you to work together. Please discuss the data, code, and problems with one another, but do your own exploration and write up. We expect everyone to hand in substantially different homeworks, and we will enforce this under the honor code. The small benefit you might get from plagiarism is not worth the severe penalty (of lost trust, being reported to the dean, no points for the assignment, etc.).
As in life, please use any resources available to you. Projects and some homeworks will explicitly encourage you to use resources on the internet, but showing extra initiative will always be appreciated. You may find R programming tough at first, so feel free to discuss your problems with other classmates or meet with or email questions to the TAs or me. I encourage you to use the ideas of others, but make them your own, giving credit. For projects have a formal bibliography, for homework cite casually, and for code simply copy the URL in as a comment (which is doubly helpful for finding the resource again).
Statements
Disability statement
If you have a documented disability that will impact your work in this class, please contact me to discuss your needs. You’ll also need to register with the Accessibility Resource Center in 2021 Mesa Vista Hall (building 56) across the courtyard east from the SUB.
Title IX statement
In an effort to meet obligations under Title IX, UNM faculty, Teaching Assistants, and Graduate Assistants are considered “responsible employees” by the Department of Education (see pg 15). This designation requires that any report of gender discrimination which includes sexual harassment, sexual misconduct and sexual violence made to a faculty member, TA, or GA must be reported to the Title IX Coordinator at the Office of Equal Opportunity. For more information, see the campus policy regarding sexual misconduct.
Our Classroom
We’re doing this because:
- We want you to be empowered with statistics.
- We believe everyone should get out of this course with awesome skills
- Real-time feedback promotes efficient learning
GAISE Connections
Our six recommendations include the following:
- Emphasize statistical literacy and develop statistical thinking
- Use real data
- Stress conceptual understanding, rather than mere knowledge of procedures
- Foster active learning in the classroom
- Use technology for developing conceptual understanding and analyzing data
- Use assessments to improve and evaluate student learning
Learning without thought is labor lost. What I hear, I forget. What I see, I remember. What I do, I understand. – Confucius
Archive
Did you receive a registration error? Send me an email with the following answers:
1. What registration error did you get (copy/paste is best)?
2. What is your UNM ID?
3. What is your Math/Stat background (that is, do you have the pre-reqs)?
If you are waitlisted and qualified and we have enough seats, I will override you into the course. Don’t worry.
Step 0: Before our first class (Tue 1/17) please read through the following and install the required software on your computer. If you don’t have a computer, there are classroom computers which will be of limited availability when the room is open.
- Install or upgrade R (Windows or Mac) then RStudio. Videos that may be helpful:
- Install R on Mac (2 min).
- Install R for Windows (3 min).
- Install R and RStudio on Windows (5 min).
- Install R packages, also update all packages within RStudio.
- Install Mendeley.
- Install LaTeX (for poster at end of semester).
Passion Driven Statistics (PDS) data
Install PDS package.
AddHealthW1 Sampling Design, Codebook, RData.
AddHealthW4 Sampling Design, Codebook, RData.
NESARC Sampling Design, Codebook, RData.
OutlookOnLife Sampling Design, Codebook, RData.
GapMinder Sampling Design, Codebook, RData.
Random stuff
innovationAcademy video.
UNM has a license for free online access to the definitive books for the Lattice and ggplot2 graphing platforms. Note you must be on campus or logged in through the UNM proxy to access these.
R is currently available in these UNM locations: DSH 141 and 143, Econ 1004, SMLC pods, and SUB IT-LoboLab Pod and IT-LoboLab Classroom.
R style matters.
There is a lot of online help on R, such as at UCLA, try-r, and Google’s Intro to R video series. Try searching for “R [mytopic]” and you’ll get lots of results.
ggplot2 plotting cookbook.
R reference card by Jonathan Baron.
Translate between MATLAB and R.
Figure checklist.
Choosing the right chart.
Nature Methods points of view on visualization.
Statistical consulting and collaboration slides.
Raster vs vector graphics.
Statistics pre-req refresher from Khan Academy.
Coursera has a free 4-week course on computing for data analysis with R.
Muddy points in perspective.
R+LaTeX+knitr for reproducible research. See my SC1 lecture notes (Ch01), and Mohammad Arbabshirani’s notes (pdf, rnw).
Asking smart questions
“Smart Questions” guide (note “hackers build things, crackers break them”).
Email Question Rubric:
- Send one email per question. Use “Reply” to continue the conversation on a question; send a new email for a new question.
- Include “ADA2” as the first word of the subject line in new emails (if replying, just use reply).
- Begin email with a short question summary.
- When possible, include commented code in email body.
  - Comments should indicate where the problem is, what the expected behavior is, and what steps are necessary to reproduce the problem.
  - Code should include a “Minimum representative test case” (http://www.catb.org/esr/faqs/
Why stats now?
140,000 analysts needed. Important enough to have a US Chief Data Scientist (1) (2).
Citing and using notes, including previous editions
Citing lecture notes, example: Erhardt EB, Bedrick EJ, and Schrader RM. (2016) Lecture notes for Advanced Data Analysis 2. Retrieved Mar 1, 2016, from statacumen.com/teach/ADA2/ADA2_notes.pdf, 136–144. Notes from Spring 2016 using R: ADA2_notes_S16.pdf includes all chapters in one document.




R tutorials: TryR (gentle), Kelly Black Cookbook for R for helpful examples, visualization tutorials, diagrams.
Table of selected statistical methods
The data and design determine which method you use: original or UCLA. Here’s a table of methods with the applicable semester of ADA and Chapter.
Number of Dependent Variables | Number of Independent Variables | Type of Dependent Variable(s) | Type of Independent Variable(s) | Measure | Test(s) | ADA-Ch |
---|---|---|---|---|---|---|
1 | 0 (1 population) | continuous normal | not applicable (none) | mean | one-sample t-test | 1-02 |
continuous non-normal | median | one-sample median | 1-06 | |||
categorical | proportions | Chi Square goodness-of-fit, binomial test | 1-07 | |||
1 (2 independent populations) | normal | 2 categories | mean | 2 independent sample t-test | 1-03 | |
non-normal | medians | Mann Whitney, Wilcoxon rank sum test | 1-06 | |||
categorical | proportions | Chi square test Fisher’s Exact test | 1-07 | |||
0 (1 population measured twice) or 1 (2 matched populations) | normal | not applicable/ categorical | means | paired t-test | 1-02 | |
non-normal | medians | Wilcoxon signed ranks test | 1-06 | |||
categorical | proportions | McNemar, Chi-square test | 1-07 | |||
1 (3 or more populations) | normal | categorical | means | one-way ANOVA | 1-05 | |
non-normal | medians | Kruskal Wallis | 1-06 | |||
categorical | proportions | Chi square test | 1-07 | |||
2 or more (e.g., 2-way ANOVA) | normal | categorical | means | Factorial ANOVA | 2-05 | |
non-normal | medians | Friedman test | not | |||
categorical | proportions | log-linear, logistic regression | 2-11 | |||
0 (1 population measured 3 or more times) | normal | not applicable | means | Repeated measures ANOVA | not | |
1 | normal | continuous | correlation, simple linear regression | 1-08 | ||
non-normal | non-parametric correlation | 1-08 | ||||
categorical | categorical or continuous | logistic regression | 2-11 | |||
continuous | discriminant analysis | 2-16 | ||||
2 or more | normal | continuous | multiple linear regression | 2-02 | ||
non-normal | ||||||
categorical | logistic regression | 2-11 | ||||
normal | mixed categorical and continuous | Analysis of Covariance, General Linear Models (regression) | 2-09 | |||
non-normal | ||||||
categorical | logistic regression | 2-11 | ||||
2 | 2 or more | normal | categorical | MANOVA | 2-15 | |
2 or more | 2 or more | normal | continuous | multivariate multiple linear regression | not | |
2 sets of 2 or more | 0 | normal | not applicable | canonical correlation | not | |
2 or more | 0 | normal | not applicable | factor analysis | not | |
0 or more | mixed categorical and continuous | principal component analysis (w/multiple regression) | 2-13 | |||
categorical | cluster analysis | 2-13 | ||||
discriminant analysis | 2-16 | |||||
classification | 2-17 |
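Several tests in the table map directly to functions in base R’s stats package. The sketch below pairs a few table rows with one such function each; the choice of built-in data sets (sleep, PlantGrowth, mtcars, USArrests) and of variables is an assumption made for illustration, not something taken from the course materials.

```r
# All functions below are in base R's stats package; the data sets ship with R.

# 2 independent populations, normal response: two-sample t-test
tt <- t.test(extra ~ group, data = sleep)

# 3 or more populations, normal response: one-way ANOVA
fit_aov <- aov(weight ~ group, data = PlantGrowth)

# 3 or more populations, non-normal response: Kruskal-Wallis test
kw <- kruskal.test(weight ~ group, data = PlantGrowth)

# 2 or more continuous predictors, normal response: multiple linear regression
fit_lm <- lm(mpg ~ wt + hp, data = mtcars)

# Categorical (0/1) response, continuous predictor: logistic regression
fit_logit <- glm(am ~ wt, data = mtcars, family = binomial)

# Several continuous variables, no response: principal component analysis
pca <- prcomp(USArrests, scale. = TRUE)

# Inspect any of the fits with summary(), for example:
summary(fit_lm)
```

Each call returns an object whose summary() or print() output reports the measure and test named in the corresponding table row.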