UNM Stat 428/528: Advanced Data Analysis II (ADA2)
The Spring 2017 syllabus is below the tables.
Spring 2017 schedule; Time: TR 15:30–16:45; Location: CTLB 300; Stat 428, CRN 33933; Stat 528, CRN 33935
+ Peer mentors via UNM Stat 495/595: Statistics Education Practicum (SEP) Stat 495.001 or Stat 595.001, CRN 30543 or 41683
Goal
Learn to produce beautiful (markdown) and reproducible (knitr) reports with informative plots (ggplot2) and tables (xtable) by writing code (R, RStudio) to answer questions using fundamental statistical methods (analysis of covariance, logistic regression, and multivariate methods), which you’ll be proud to present (poster).
News
3/1/17 – Data resources for poster:
kaggle
drivendata
538
agridat package
wise data sources
statsci datasets
vanderbilt datasets
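2/16/17 – How to get pairwise comparison plots from lsmeans(): the result of lsmeans() is a list whose elements have long names. On the plot() line below, start typing “lsm$” and press TAB to get the suggested element name. For example:
library(lsmeans)
# lm.ml.full is a previously fitted lm() model with factors species and sex
lsm <- lsmeans( lm.ml.full, list( pairwise ~ species | sex ), adjust = "tukey" )
# plot the Tukey-adjusted pairwise differences (TAB completion fills in the name)
plot( lsm$`pairwise differences of contrast, sex | sex` )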
2/3/17 – scatter3d() plot from library(rgl) for Mac users (Ch 02, class 03): Install XQuartz (X11), reboot (or log out and log back in), then run install.packages("rgl").
1/22/17 – RStudio, disabling notebook “inline” results, prepared by TA Geoff Schultz.
Classroom computers: Please reboot classroom laptops at the end of class by request of the IT staff.
Saving data: If you’re using classroom computers, use flash drives or UNM’s OneDrive (available in LoboMail) for saving files. I recommend using a very systematic folder structure, such as a main folder called Stat428_ADA2, with subfolders called homework, inclass, reading, poster, etc.
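One way to set up the suggested folder structure is from R itself. The sketch below is only an illustration: base_dir is an assumed location (a temporary directory here, so the code is safe to run anywhere) and should be changed to your flash drive or OneDrive folder. The folder names follow the suggestion above.

```r
# Sketch: create the suggested course folder tree.
# base_dir is an assumption; replace tempdir() with your own location.
base_dir   <- tempdir()
course_dir <- file.path(base_dir, "Stat428_ADA2")
subfolders <- c("homework", "inclass", "reading", "poster")

for (sub in subfolders) {
  # recursive = TRUE also creates the main course folder if it is missing
  dir.create(file.path(course_dir, sub), recursive = TRUE, showWarnings = FALSE)
}

# List the subfolders that now exist under the course folder
created <- list.dirs(course_dir, full.names = FALSE, recursive = FALSE)
print(created)
```

Running it a second time is harmless, because existing folders are left alone.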
Course content
Weekly structure (also see Assessment below)
 Pre-class (Tuesday): Reading, video, and quiz (due before class; solutions become available Tue 3:30, after the quiz is due).
 In-class: Tuesday's in-class activities are submitted to UNM Learn, with the completed assignment due Wed 5pm (evaluated by TA within 1 week). On Thursday we will start the homework in class so you can struggle with it but get questions answered before finishing on your own.
 Post-class (Thursday): Homework is submitted to UNM Learn the following Thursday (evaluated by TA within 1 week). Assignments will be common for all students.
UNM Learn for content, YouTube Video playlist (try 1.5 speed, then pause/rewatch as you need).
Video: Upgrading R on Windows.
Course notes, code, data, and video lectures
Notes from Spring 2016: ADA2_notes_S16.pdf includes all chapters in one document.
Lecture notes for Advanced Data Analysis 2 (ADA2) Stat 428/528 University of New Mexico is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. Based on a work at https://statacumen.com/teach/ADA2/ADA2_notes_S16.pdf.
Ch  Chapter Title  Notes  R code  Datasets  Video lectures playlist 

01  R statistical software and review  R  turkey.csv, rocket.dat  011, 012  
02  Introduction to Multiple Linear Regression  R  indian.dat, gce.dat  021, 022  
03  A Taste of Model Selection for Multiple Regression  R  ratliver.csv  031, 032  
04  One Factor Designs and Extensions  R  none  04  
05  Paired Experiments and Randomized Block Experiments  R  battery.dat, beetles.dat, itch.csv, ratinsulin.dat  050 051 052 053 054 055 056 057 058 059  
06  A Short Discussion of Observational Studies  R  sat.dat  06  
07  Analysis of Covariance: Comparing Regression Lines  R  tools.dat, toolsfake.dat, twins.dat  071 072 073 HW helper video 

08  Polynomial Regression  R  cloudpoint.dat, mooney.dat  081 082  
09  Discussion of Response Models with Factors and Predictors  R  faculty.dat  091 092 093  
10  Automated Model Selection for Multiple Regression  R  oxygen.dat  101 102 103  
11  Logistic Regression  R  beetles.dat, leuk.dat, menarche.csv, shuttle.csv, trauma.dat  111 112 113 114  
12  An Introduction to Multivariate Methods  R  none  12  
13  Principal Component Analysis  R  bgs.dat, shells.dat, sparrows.dat, temperature.dat  131 132 133  
14  Cluster Analysis  R  birthdeath.dat, teeth.dat  141 142 143  
15  Multivariate Analysis of Variance  R  shells_mf.dat  15  
16  Discriminant Analysis  R  mower.dat  161 162  
17  Classification  R  business.dat  171 172 173  
18  Data Cleaning  R  conversions.txt, dalton.txt, dirty_iris.csv, edits.txt, people.txt, unnamed.txt 
lm_diag_plots.R function for a large set of standard diagnostic plots
(I reserve the right to continue to improve the materials throughout the semester.)
Timetable
Wk  Date  Cl  Topic  Reading, Video, Quiz  In-class Worksheet, Data  Homework  Due before class
00  01/16  00  Install software  See Step 0; video: 00
01  01/17  01  01 R, Review  read: Ch 01; video: 011, 012  Note: numbers refer to week numbers
01  01/19  02  In-class quiz  In-class: 02 R Review Rmd html dat; Videos: 1, 2, 3  No HW 01
02  01/24  03  02 Introduction to Multiple Linear Regression  read: Ch 02; video: 021, 022; quiz: 02  In-class: Rmd html dat; submit pdf with solutions by Wed 5pm  Quiz 02
02  01/26  04  HW: 02 Mult LR Rmd html dat; submit your pdf to UNM Learn by 2/02
03  01/31  05  03 A Taste of Model Selection for Multiple Linear Regression  read: Ch 03, 04; video: 031, 032, 04; quiz: 03 (2 parts)  In-class: Rmd html dat  Quiz 03
03  02/02  06  04 Experimental Design: One and Two Factor Designs  HW: 03 Taste Model Sel Rmd html dat; submit 2/09  Turn in HW 02
04  02/07  07  05 Paired Experiments and Randomized Block Designs  read: Ch 05 (start – 5.2); video: 050 051 052 053 054 055; quiz: 04  In-class: Rmd html  Quiz 04
04  02/09  08  HW: 04 Experiments 1 Rmd html; submit 2/16  Turn in HW 03
05  02/14  09  read: Ch 05 (5.3 – end); video: 056 057 058 059; quiz: 05  In-class: Rmd html dat  Quiz 05
05  02/16  10  HW: 05 Experiments 2 Rmd html dat; submit 2/23  Turn in HW 04
06  02/21  11  06 Discussion of Observational Studies  read: Ch 06-07; video: 06 071 072 073; quiz: 06 (2 parts)  In-class: html; turn in paper version  Quiz 06
06  02/23  12  07 Analysis of Covariance: Comparing Regression Lines  HW: 06 ANCOVA 1 Rmd html dat; submit 3/02. Discuss Wald test matrix specification.  Turn in HW 05
07  02/28  13  08 Polynomial Regression  read: Ch 08-09; video: 081 082 091 092 093; quiz: 07 (2 parts)  In-class: Rmd html dat  Quiz 07
07  03/02  14  09 Response Models with Factors and Predictors  HW: 07 ANCOVA 2 Rmd html dat, Helper video; submit 3/09  Turn in HW 06
08  03/07  15  10 Model Selection for Multiple Regression  read: Ch 10; video: 101 102 103; quiz: 08  HW 07 continued in class  Quiz 08
08  03/09  16  HW 07 continued in class, due by 5pm  Turn in HW 07
09  03/14  17  Spring Break
09  03/16  18  Spring Break
10  03/21  19  11 Logistic Regression  read: Ch 11; video: 111 112 113 114; quiz: 10  In-class: Rmd html dat  Poster: Poster Planning Rmd html, due 3/28. Choose/define a poster project requiring a method from class: ANCOVA, logistic multiple regression, PCA, etc.  Quiz 10
10  03/23  20  HW: 10 Logistic Regression Rmd html dat; submit 3/30
11  03/28  21  12 An Introduction to Multivariate Methods  read: Ch 12-13; video: 12 131 132 133; quiz: 11 (2 parts)  In-class: Rmd html dat  Quiz 11
11  03/30  22  13 Principal Components Analysis (PCA)  HW: 11 PCA Rmd html dat; submit 4/06  Turn in HW 10
12  04/04  23  14 Cluster Analysis  read: Ch 14-15; video: 141 142 143 15; quiz: 12 (2 parts)  In-class: Clustering Rmd html dat  Quiz 12
12  04/06  24  15 Multivariate Analysis of Variance (MANOVA)  HW: 12 MANOVA Rmd html dat; submit 4/13  Turn in HW 11
13  04/11  25  16 Discriminant Analysis; 17 Classification  read: Ch 16-17; video: 161 162 171 172 173; quiz: 13 (2 parts)  In-class: Discriminant analysis for classification Rmd html dat  Quiz 13, Grade HW 11
13  04/13  26  13+11+17 PCA and logistic regression classification  HW: 13+11+17 PCA and logistic Classification Rmd html dat; submit 4/20  Turn in HW 12
14  04/18  27  Posters begin  HW: Poster document 1 of 2: Analysis, due Friday; Rmd html
14  04/20  28  Submit 4/21  Turn in HW 13; turn in Poster Doc 1/2 Fri 4/21
15  04/25  29  HW: Poster document 2 of 2: Intro/Discuss/Bib, due Friday; Rmd html
15  04/27  30  Submit 4/28  Turn in Poster Doc 2/2 Fri 4/28
16  05/02  31  Survey; Poster finalize  Poster template: pdf, Rnw, sty, bib, logo; Example poster: pdf, Rnw; Transition from Markdown to LaTeX; Video for poster transition
Poster printing: ARI Graphix, $9+tax poster printing. Open Mon-Fri 7:30-5:30, 4716 McLeod Rd NE. Do not use their website! Email: plotting@abqrepro.com. Subject: ADA2 class poster. Text: indicate to print “in color on bond paper”. Attach: poster pdf with your name in the filename, such as “FirstLast_ADA2_poster.pdf”. Try to send by Tuesday 5 PM for the poster to be ready by Thursday (earlier is better). Arrange to pick up the poster. Price is $0.75/sq ft for Spring 2017.
16  05/04  32  POSTERS  Poster session in SMLC lobby 3:30-5:30pm  Poster: submit poster pdf to UNM Learn Fri 5/9 5pm; Poster reviewing rubric
17  05/09  FINALS WEEK  (no final)  Surveys due: submit receipt or confirmation page to UNM Learn (Learning Studio; EvalKit in Learn)  Surveys due 5/11 5pm
Syllabus
Description: A continuation of 427/527 that focuses on methods for analyzing multivariate data and categorical data. Topics include MANOVA, principal components, discriminant analysis, classification, factor analysis, analysis of contingency tables including log-linear models for multidimensional tables and logistic regression.
Prerequisite: Stat 427 (ADA1)
Semesters offered: Spring
Lecture: Stat 428/528.001 (CRN 33933 or 33935), TR 15:30–16:45, CTLB 300 Video
email: “Erik B. Erhardt” <erike@stat.unm.edu>, please include “ADA2” in the subject line
Textbook: Peter Dalgaard, “Introductory Statistics with R”, Second Edition, 2008, ISBN: 9780387790534. The book is not required, but it will provide a backup for what you learn in class.
Office hours: SMLC 312, TR 14:00–15:00
Laptops running R: I encourage you to bring a laptop to class each day so you can try the R programming exercises in class. If you don’t have one, no problem, there are some laptops in class and teamwork is encouraged — sit next to someone friendly who likes to share.
Teaching Assistants and Peer Mentors
Stat graduate student TAs
Lindsey Pittington <lpittin@unm.edu>, SMLC 301 office hours Mon 12–2 PM, Wed 3–5 PM
Yiming Yang <yiming@unm.edu>, SMLC 319 office hours Tue 10 AM – 12 PM, Thu 10 AM – 12 PM
Geoffrey Dylan Schultz <gdschultz@unm.edu>, SMLC 345 office hours Wed 1–3 PM, Fri 1–3 PM
And Erik’s office hours are SMLC 312, Tue 2–3 PM and Thu 2–3 PM
So many office hours! Mon 12–2, Tue 10–12, 2–3, Wed 1–5, Thu 10–12, 2–3, & Fri 1–3.
Peer Mentors
Alicia Dominguez, ADA course alumnus
Grace Mayer, ADA course alumnus
Student learning outcomes
Similar to those in ADA1, but at a higher level.
Assessment
 Quizzes will be due each Tuesday before class. Purpose: to assess reading and video comprehension and ensure you’re prepared to actively participate in class activities with minimal lecture. (About 12, 20% of final grade, the lowest few are dropped.) Most weeks plan for 1–3 hours of reading and video and a 30–60 minute quiz.
 In-class assignments are due the following day by 5pm, submitted to UNM Learn. Purpose: to struggle and find success in class with the concepts and skills. (About 12, includes class participation, 20% of final grade, the lowest few are dropped.) Plan to start and finish in class, sometimes with 1–2 hours beyond class.
 Homework (HW) assignments are assigned each Thursday and due the following Thursday, submitted to UNM Learn. Purpose: to apply concepts and skills to your class poster project. (About 12, 40% of final grade, the lowest few are dropped.) Most weeks plan on 2–12 hours per assignment.
 Poster will be developed and completed in the last weeks of the semester, and in the last week we’ll have poster presentations. Purpose: to have an overarching set of questions to answer using methods learned in the course, with a deliverable you can be proud of! (16% of final grade in total for 1 poster and presentation: 2% preparation, 10% poster, 2% presentation, and 2% evaluations of others.) In the last couple of weeks, assembling this poster may take 3–5 hours, using a template provided to you.
 Course surveys collect information to help facilitate the class or to encourage participation in course evaluations. Purpose: to participate in national project-based learning projects and improve the course. (About 2, 4% of final grade [and a simple way to go from B+ to A].)
Final grade may include a small buffer at the discretion of the instructor. For example, final grade could be the total points earned divided by the total possible points times 0.95 for graduate students and 0.90 for undergraduate students. That is [Final Grade] = [Points Earned]/[Points possible * 0.95], so that your grade is slightly higher than you earned.
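As a worked illustration of the buffer arithmetic, here is a minimal sketch in R. The function name final_grade() and the point totals (850 of 1000) are made up for this example. The 0.95 and 0.90 factors are the ones mentioned above, and whether a buffer is applied at all remains at the instructor's discretion.

```r
# Hypothetical helper (not part of the course materials):
# percentage grade after shrinking the denominator by a buffer factor.
final_grade <- function(points_earned, points_possible, buffer = 0.95) {
  stopifnot(points_possible > 0, buffer > 0, buffer <= 1)
  100 * points_earned / (points_possible * buffer)
}

# Made-up example: 850 points earned out of 1000 possible
raw   <- 100 * 850 / 1000                        # no buffer
grad  <- final_grade(850, 1000, buffer = 0.95)   # graduate-student factor
ugrad <- final_grade(850, 1000, buffer = 0.90)   # undergraduate factor

print(round(c(raw = raw, grad = grad, ugrad = ugrad), 1))
# raw 85.0, grad 89.5, ugrad 94.4
```

Because the buffer only shrinks the denominator, the buffered grade is never lower than the raw percentage. This sketch does not cap the result at 100.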
Student Attendance: If a student has more than 3 absences, I reserve the right to assign to that student a WF and drop mid-semester or assign an F at the end of the semester without warning. Students in this situation need to speak with Erik immediately.
Late assignments will not be accepted.
Rubrics guide assessment (and self-assessment) of homework, code, projects, exams, and presentations. Each assignment will have its own specific rubric. Homework formatting example.
All R code for the assignment should be included with the part of the problem it addresses (for code and output use a fixed-width font, such as Courier).
Do NOT use your R code and output as your answer to the problem, but include them to show me how you arrived at your answer. Your prose solution (in a non-fixed-width font) should be provided in addition to R output.
Collaboration and citation
For homeworks I encourage you to work together. Please discuss the data, code, and problems with one another, but do your own exploration and write up. We expect everyone to hand in substantially different homeworks, and we will enforce this under the honor code. The small benefit you might get from plagiarism is not worth the severe penalty (of lost trust, being reported to the dean, no points for the assignment, etc.).
As in life, please use any resources available to you. Projects and some homeworks will explicitly encourage you to use resources on the internet, but showing extra initiative will always be appreciated. You may find R programming tough at first, so feel free to discuss your problems with other classmates or meet with or email questions to the TAs or me.
I encourage you to use the ideas of others, but make them your own, giving credit. For projects have a formal bibliography, for homework cite casually, and for code simply copy the URL in as a comment (which is doubly helpful for finding the resource again).
Statements
Disability statement
If you have a documented disability that will impact your work in this class, please contact me to discuss your needs. You’ll also need to register with the Accessibility Resource Center in 2021 Mesa Vista Hall (building 56) across the courtyard east from the SUB.
Title IX statement
In an effort to meet obligations under Title IX, UNM faculty, Teaching Assistants, and Graduate Assistants are considered “responsible employees” by the Department of Education (see pg 15). This designation requires that any report of gender discrimination, which includes sexual harassment, sexual misconduct, and sexual violence, made to a faculty member, TA, or GA must be reported to the Title IX Coordinator at the Office of Equal Opportunity. For more information, see the campus policy regarding sexual misconduct.
Our Classroom
We’re doing this because:
 We want you to be empowered with statistics.
 We believe everyone should get out of this course with awesome skills.
 Real-time feedback promotes efficient learning.
“It encourages me to engage actively with the course material and take responsibility for my learning.”
GAISE Connections
Our six recommendations include the following:
 Emphasize statistical literacy and develop statistical thinking
 Use real data
 Stress conceptual understanding, rather than mere knowledge of procedures
 Foster active learning in the classroom
 Use technology for developing conceptual understanding and analyzing data
 Use assessments to improve and evaluate student learning
Learning without thought is labor lost.
What I hear, I forget.
What I see, I remember.
What I do, I understand.
– Confucius
Archive
Did you receive a registration error? Send me an email with the following answers:
1. What registration error did you get (copy/paste is best)?
2. What is your UNM ID?
3. What is your Math/Stat background (that is, do you have the prereqs)?
If you are waitlisted and qualified and we have enough seats, I will override you into the course. Don’t worry.
Step 0: Before our first class (Tue 1/17) please read through the following and install the required software on your computer. If you don’t have a computer, there are classroom computers which will be of limited availability when the room is open.
 Install or upgrade R (Windows or Mac), then RStudio. Videos that may be helpful:
 Install R on Mac (2 min).
 Install R for Windows (3 min).
 Install R and RStudio on Windows (5 min).
 Install R packages, also update all packages within RStudio.
 Install Mendeley.
 Install LaTeX (for poster at end of semester).
Passion Driven Statistics (PDS) data
Install PDS package.
AddHealthW1 Sampling Design, Codebook, RData.
AddHealthW4 Sampling Design, Codebook, RData.
NESARC Sampling Design, Codebook, RData.
OutlookOnLife Sampling Design, Codebook, RData.
GapMinder Sampling Design, Codebook, RData.
Random stuff
innovationAcademy video
UNM has license for free online access to the definitive books for the Lattice and ggplot2 graphing platforms. Note you must be on campus or logged in through the UNM proxy to access these.
R is currently available in these UNM Locations: DSH 141 and 143, Econ 1004, SMLC pods, and SUB ITLoboLab Pod and ITLoboLab Classroom.
R style matters. There is a lot of online help on R, such as at UCLA, tryr, and Google’s Intro to R video series. Try searching for “R [mytopic]” and you’ll get lots of results.
R reference card by Jonathan Baron.
Translate between MATLAB and R.
Figure checklist. Choosing the right chart. Nature Methods points of view on visualization.
Statistical consulting and collaboration slides
Raster vs vector graphics.
Statistics prereq refresher from Khan Academy.
Coursera has a free 4-week course on computing for data analysis with R.
Muddy points in perspective.
R+LaTeX+knitr for reproducible research. See my SC1 lecture notes (Ch01), and Mohammad Arbabshirani’s notes (pdf, rnw).
Asking smart questions
“Smart Questions” guide (note “hackers build things, crackers break them”)
Email Question Rubric:
* Send one email per question.
  - Use “Reply” to continue the conversation on a question; send a new email for a new question.
* Include “ADA2” as the first word of the subject line in new emails (if replying, just use reply).
* Begin email with a short question summary.
* When possible, include commented code in the email body.
  - Comments should indicate where the problem is, what the expected behavior is, and what steps are necessary to reproduce the problem.
  - Code should include a “Minimum representative test case” (http://www.catb.org/esr/faqs/
* If attaching code, please include all the files necessary to run your code (data, etc.).
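To illustrate what commented code in an email might look like, here is a small made-up sketch following the rubric above. The data and the problem are invented for this example. It uses only base R and tiny inline data, so nothing needs to be attached.

```r
# ADA2 question: why does mean() return NA for my data?
#
# Expected behavior: the average of the non-missing values.
# Steps to reproduce: run these lines in a fresh R session.

x <- c(2.1, 3.4, NA, 5.0)   # tiny made-up data that still shows the problem

result <- mean(x)           # <-- problem is here: result is NA
print(result)

# What I have tried: reading ?mean, which mentions an na.rm argument,
# but I am not sure whether dropping missing values is appropriate here.
```

Keeping the example this small makes it quick for the TAs or instructor to run and answer.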
Help:
LaTeX wiki, lshort, Detexify LaTeX symbols (linux texlive package management)
R tutorials: TryR (gentle), Kelly Black
R style matters. There is a lot of online help on R, such as at UCLA. Usually try searching for “R [mytopic]” and you’ll get lots of results.
Knitr in Rstudio (knitr is modern version of Sweave intro, demo, guide)
xtable to produce LaTeX tabular environment from R data.frames
Cookbook for R for helpful examples, visualization tutorials, diagrams
Image formats: vector (pdf, eps) vs raster (jpeg, bmp, tiff, gif)
Why stats now?
Important enough to have a US Chief Data Scientist (1) (2).
Citing and using notes, including previous editions
Citing lecture notes, example: Erhardt EB, Bedrick EJ, and Schrader RM. (2016) Lecture notes for Advanced Data Analysis 2. Retrieved Mar 1, 2016, from statacumen.com/teach/ADA2/ADA2_notes.pdf, 136–144.
Notes from Spring 2016 using R: ADA2_notes_S16.pdf includes all chapters in one document.
Lecture notes for Advanced Data Analysis 2 (ADA2) Stat 428/528 University of New Mexico is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. Based on a work at https://statacumen.com/teach/ADA2/ADA2_notes_S16.pdf.
Notes from Spring 2015 using R: ADA2_notes_S15.pdf includes all chapters in one document.
Lecture notes for Advanced Data Analysis 2 (ADA2) Stat 428/528 University of New Mexico is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. Based on a work at https://statacumen.com/teach/ADA2/ADA2_notes_S15.pdf.
Notes from Spring 2014 using R: ADA2_notes_S14.pdf includes all chapters in one document.
Lecture notes for Advanced Data Analysis 2 (ADA2) Stat 428/528 University of New Mexico is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. Based on a work at https://statacumen.com/teach/ADA2/ADA2_notes_S14.pdf.
Notes from Spring 2013 using R: ADA2_notes_S13.pdf includes all chapters in one document.
Lecture notes for Advanced Data Analysis 2 (ADA2) Stat 428/528 University of New Mexico is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. Based on a work at https://statacumen.com/teach/ADA2/ADA2_notes_S13.pdf.
Notes from Spring 2012 using SAS: ADA2_notes_S12.pdf includes all chapters in one document.
Lecture notes for Advanced Data Analysis 2 (ADA2) Stat 428/528 University of New Mexico is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. Based on a work at https://statacumen.com/teach/ADA2/ADA2_notes_S12.pdf.
Table of selected statistical methods
The data and design determine which method you use: original or UCLA.
Here’s a table of methods with the applicable semester of ADA and Chapter.
Number of Dependent Variables  Number of Independent Variables  Type of Dependent Variable(s)  Type of Independent Variable(s)  Measure  Test(s)  ADA-Ch
1  0 (1 population)  continuous normal  not applicable (none)  mean  one-sample t-test  1-02
continuous non-normal  median  one-sample median  1-06
categorical  proportions  Chi-square goodness-of-fit, binomial test  1-07
1 (2 independent populations)  normal  2 categories  mean  2 independent sample t-test  1-03
non-normal  medians  Mann-Whitney, Wilcoxon rank sum test  1-06
categorical  proportions  Chi-square test, Fisher’s exact test  1-07
0 (1 population measured twice) or 1 (2 matched populations)  normal  not applicable/categorical  means  paired t-test  1-02
non-normal  medians  Wilcoxon signed ranks test  1-06
categorical  proportions  McNemar, Chi-square test  1-07
1 (3 or more populations)  normal  categorical  means  one-way ANOVA  1-05
non-normal  medians  Kruskal-Wallis  1-06
categorical  proportions  Chi-square test  1-07
2 or more (e.g., 2-way ANOVA)  normal  categorical  means  Factorial ANOVA  2-05
non-normal  medians  Friedman test  not
categorical  proportions  log-linear, logistic regression  2-11
0 (1 population measured 3 or more times)  normal  not applicable  means  Repeated measures ANOVA  not
1  normal  continuous  correlation, simple linear regression  1-08
non-normal  nonparametric correlation  1-08
categorical  categorical or continuous  logistic regression  2-11
continuous  discriminant analysis  2-16
2 or more  normal  continuous  multiple linear regression  2-02
non-normal
categorical  logistic regression  2-11
normal  mixed categorical and continuous  Analysis of Covariance, General Linear Models (regression)  2-09
non-normal
categorical  logistic regression  2-11
2  2 or more  normal  categorical  MANOVA  2-15
2 or more  2 or more  normal  continuous  multivariate multiple linear regression  not
2 sets of 2 or more  0  normal  not applicable  canonical correlation  not
2 or more  0  normal  not applicable  factor analysis  not
0 or more  mixed categorical and continuous  principal component analysis (w/multiple regression)  2-13
categorical  cluster analysis  2-13
discriminant analysis  2-16
classification  2-17