UNM Stat 427/527: Advanced Data Analysis I (ADA1)
Fall 2016 Syllabus is below table
Fall 2016 schedule; Time: TR 15:30-16:45; Location: CTLB 300 (building 55, northeast of Zimmerman); Stat 427.002, CRN 54725; Stat 527.002, CRN 54726
+ Peer mentors via UNM Stat 495/595: Statistics Education Practicum (SEP) Stat 495.002 or Stat 595.001, CRN 13764 or 55072 (named “Individual Study”)
Goal
Learn to produce beautiful (markdown) and reproducible (knitr) reports with informative plots (ggplot2) and tables (xtable) by writing code (R, RStudio) to answer questions using fundamental statistical methods (all one- and two-variable methods), which you’ll be proud to present (poster).
News
12/4 – Poster Schedule (be on time with your poster):
Tue 12/6 3:30-7pm …yes, a long session, but no Thurs class and no final :)
Location: SMLC Atrium (not our class building)
3:30-3:40 Organization
3:40-4:40 Group 1
4:45-5:45 Group 2
5:50-6:50 Group 3
Teaching award
If you are interested in nominating me for the 2016 – 2017 Outstanding Teacher of the Year Awards, then please visit the Nomination Form with these details:
 Erik Erhardt
 (option 1) Tenured
 (option 2) Teacher of the Year
 Mathematics and Statistics
 (your reasons) …
Thanks for a great semester. I gave you a lot to struggle with: Statistical methods and thinking, programming, visualization, data management, literature review, reproducible research, and more! You’ve grown a lot this semester and I see so much skill in working with data, statistical ideas, and code. I’m really impressed.
Course content
Weekly structure (also see Assessment below)
 Pre-class (Tuesday): Reading, Video, Quiz (due before class; solutions become available Tue 3:30, after the quiz is due)
 In-class: Activities in class Tuesday and Thursday due by 5pm the following day, submitted to UNM Learn (evaluated by TA within 1 week).
 Post-class (Thursday): Homework (crowdgrader, due following Thursday before class)
 Post-class (Following Thursday-Tuesday): Grading (crowdgrader, following 1 week + Tuesday before class)
UNM Learn for quizzes and in-class assignments.
YouTube Video playlist (try 1.5 speed, then pause as you need).
Course notes, code, data, and video lectures
Notes from Fall 2016: ADA1_notes_F16.pdf includes all chapters in one document.
Lecture notes for Advanced Data Analysis 1 (ADA1) Stat 427/527 University of New Mexico is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. Based on a work at http://statacumen.com/teach/ADA1/ADA1_notes_F16.pdf.
Ch  Chapter Title  Notes  R code  Datasets  Video lectures playlist  Helper videos 

00  Introduction to R, Rstudio, and ggplot  R  001 002  markdown, 01 PDS codebook, 01 HW codebook, 02 HW Lit review 

01  Summarizing and Displaying Data  R  011  03 HW 03 subset  
02  Estimation in One-Sample Problems  R  021 022 023  
03  Two-Sample Inferences  R  031 032 033  
04  Checking Assumptions  R  041  
05  One-Way Analysis of Variance  R  CHDS dat desc  051  
06  Nonparametric Methods  R  061 one-sample, 062 paired, 063 two-sample, 064 ANOVA, 065 perm test.  
07  Categorical Data Analysis  R  071 intro, 072 single prop, 073 GOFtest, 074 two prop & cond prob, …  
08  Correlation and Regression  R  081 corr/log, 082 corr hyp test, 083 LS reg eq, 084 085  
09  Introduction to the Bootstrap  R  091  
10  Power and Sample size  R  101  
11  Data Cleaning  R  111  14 HW to poster  
12  ADA2 Ch 11 Logistic Regression  R  121 122 123 124  Upgrading R on Windows 
lm_diag_plots.R function for a large set of standard diagnostic plots.
Notes from Fall 2015 using R: ADA1_notes_F15.pdf includes all chapters in one document.
Lecture notes for Advanced Data Analysis 1 (ADA1) Stat 427/527 University of New Mexico is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. Based on a work at http://statacumen.com/teach/ADA1/ADA1_notes_F15.pdf.
Passion Driven Statistics (PDS) data
I encourage you to use one of the AddHealth datasets. Use W1 if you want to understand adolescents when they were young and W4 if you want to understand adult relationships. NESARC is also interesting, but we had too many people choose many related questions in F15 from this dataset.
Install PDS package.
AddHealthW1 Sampling Design, Codebook, RData. Adolescents when they were young, unique ID “AID”.
AddHealthW4 Sampling Design, Codebook, RData. Same adolescents when they were older, unique ID “aid”.
NESARC Sampling Design, Codebook, RData. Alcohol abuse and related conditions, unique ID “IDNUM”.
OutlookOnLife Sampling Design, Codebook, RData. Interesting data, but not enough continuous variables to use, unique ID “CASEID”.
GapMinder Sampling Design, Codebook, RData. Country data, but it’s complicated to average large and small countries, unique ID “country”.
Erik’s example homework document:
NESARC data, nicotine and depression: .Rmd + .bib + EditRules = .html.
Timetable
Wk  Date  Cl  Topic  Reading, Video, Quiz  In-class Worksheet, Data  Homework  HW Submit Grading  Due before class

00  08/23  00  Install software, survey  Step 0 (above)
01  08/23  01  Intro, data, poster  read: PDS Chs 2-3; video: Rmd, Ch 2-3, Med records, crowdgrader
CTLB video, Active Learning, 01 Syllabus subset, 01a Medical records Rmd html Turn in assignment in Thursday’s class to learn how crowdgrader works. 
(Intro to using RMarkdown: Rmd html)  
01  08/25  02  Rmd, codebook, crowdgrader  video: 01 Personal codebook
Work as a group, each submit own copy (anonymously). 01a crowdgrader submit by 16:00 grading 16:00-16:30 FB
01 Personal codebook Rmd html
Choose from PDS datasets 
01 crowdgrader 09/01 Submit 09/06 Grade 

02  08/30  03  Research questions  read: PDS Ch 24; video: Lit Rev biblio & Mendeley; quiz: 02 codebook and lit review
In-class: Rmd html Turn in one question of variable association.
(UNM Google Scholar)  Quiz 02
02  09/01  04  Literature review  In-class: Rmd html Turn in one citation to a research question.
02 Literature review Rmd html bib
(While we won’t be doing a research proposal as part of this class, if we were covering more on research methods, then we might continue with a short research proposal (Rmd html).) 
02 crowdgrader 09/08 Submit 09/13 Grade 
Turn in HW 01  
03  09/06  05  R programming, data subset and numerical summaries  read: PDS Chs 5, 8, & 18, Ch 00 R, Ch 01 R; video: Ch 00 p1, Ch 00 p2, Ch 01; quiz: 03 programming, univariate
In-class: Rmd html Look at datasets in R, create subset of data, rename variables, numerical summaries.
Quiz 03, Grade HW 01
03  09/08  06  Plotting univariate  video: HW 03 vid
In-class: Rmd html Univariate plots of numerical and categorical variables.
03 Data subset, univariate summaries and plots Rmd html (See the link above the table “Erik’s NESARC data, nicotine and depression”.) 
03 crowdgrader 09/15 Submit 09/20 Grade 
Turn in HW 02 
04  09/13  07  Plotting bivariate  read: PDS Ch 9, Ch 00 R, Ch 11 R; video: 111; quiz: quiz
In-class: Rmd html Complete at least one bivariate coding relationship.
Quiz 04, Grade HW 02
04  09/15  08  Data cleaning  In-class: Rmd html Edit rules, run with dataset, assess exceptions, decide what to do with them. Erik’s EditRules.
04 Rmd html  04 crowdgrader 09/22 Submit 09/27 Grade 
Turn in HW 03  
05  09/20  09  Simple linear regression, intro  read: Ch 8.4, 8.2 R; video: 081 corr/log, 083 LS reg eq; quiz: quiz
In-class: Rmd html dat Build intuition using SLR App, interpret properties of linear regression fit.
Quiz 05, Grade HW 03
05  09/22  10  Logarithm transformation  (novel example)  In-class: Rmd html dat Plot, transform, plot, and interpret.
05 Rmd html  05 crowdgrader 09/29 Submit 10/04 Grade 
Turn in HW 04 
06  09/27  11  Correlation  read: Ch 8.1, 8.3.1 R, Ch 7.5.1 only sections on “conditional probability” and the following example R; video: 081 corr/log, 082 corr hyp test, 074 two prop & cond prob; quiz: quiz
In-class: Rmd html Data collection (hand span and word memory), correlation, regression to the mean.
Spurious Correlations  Quiz 06, Grade HW 04
06  09/29  12  Categorical contingency tables  quiz 06b, Guess Ages (for next in-class)  In-class: Rmd html d1 Interpret conditional proportions in two examples. Simpson’s Paradox
06 Rmd html  06 crowdgrader 10/06 Submit 10/11 Grade 
Turn in HW 05 
07  10/04  13  Inference, intro  read: Ch 2.1-2.2 R; video: see table above; quiz: quiz
In-class: Rmd html Guess Ages, Legos. (Legos part 2 Rmd html dat, diagram).
BBC Radio 4: More or Less, “sampling” 9 min audio  Quiz 07, Grade HW 05
07  10/06  14  Parameter estimation (one-sample)  In-class: Rmd html Water on Earth.
07 Rmd html PDS Data Sampling Designs: AddHealth, OOL, NESARC 
07 crowdgrader 10/13 Submit 10/18 Grade 
Turn in HW 06  
08  10/11  15  Hypothesis testing (two-sample)  read: Ch 2.3-end R Ch 3 R; video: see table above; quiz: quiz
In-class: Rmd html one- and two-sample tests using data we collected in class.
Quiz 08, Grade HW 06  
08  10/13  Fall Break  08 Rmd html  08 crowdgrader 10/20 Submit 10/25 Grade
Turn in HW 07  
09  10/18  16  Paired data, assumption assessment  read: Ch 2.2.1, Ch 3.4 & 3.6, Ch 4, Ch 5; video: see table above; quiz: quiz
In-class: Rmd html Paired data and checking model assumptions.
Quiz 09, Grade HW 07
09  10/20  17  ANOVA, post-hoc comparisons  In-class: Rmd html ANOVA, model assumptions, and paired comparisons.
09 Rmd html  09 crowdgrader 10/27 Submit 11/01 Grade 
Turn in HW 08  
10  10/25  18  Nonparametric methods  read: Ch 6, Ch 7.2-7.4, Ch 10; video: see table above; quiz: quiz
In-class: Rmd html NP one-sample tests and CIs, and ANOVA with pairwise comparisons.
Quiz 10, Grade HW 08
10  10/27  19  Binomial and multinomial proportion tests  In-class: Rmd html dat Multinomial: World series number of games.
10 Rmd html  10 crowdgrader 11/03 Submit 11/08 Grade 
Turn in HW 09  
11  11/01  20  Two-way categorical tables  read: Ch 7.8-end, Ch 8.5-8.7; video:; quiz: quiz
In-class: Rmd html dat Popular kids.
Quiz 11, Grade HW 09
11  11/03  21  Simple linear regression, inference  In-class: Rmd html Regression of height vs hand span using data from our class.
11 Rmd html  11 crowdgrader 11/10 Submit 11/15 Grade 
Turn in HW 10  
12  11/08  22  Logistic regression, intro  read: ADA2 Ch 11.1-3, 11.6, PDS Ch 16; video:; quiz: quiz
In-class: Rmd html AddHealth W4 Pregnancy.
Summary of Methods we’ve covered  Quiz 12, Grade HW 10
12  11/10  23  Experiments and observational studies  In-class: Rmd html Describing a study reported in the media.
12 Rmd html  12 crowdgrader 11/17 Submit 11/22 Grade 
Turn in HW 11  
13  11/15  24  Statistical communication  read: PDS Ch 18; video:; quiz: no quiz
In-class: Rmd html Key statistical principles, ethics. With additional time, clarify with a peer mentor which research questions you’ll present in your poster. (Null results are ok!)
Statistics is about communication, including writing and presenting.  Quiz 13, Grade HW 11
13  11/17  25  Poster Preparation  In-class: Rmd html Work on designing poster content at the bottom of your HW document.
13 Rmd html
Work on your poster content. Try to complete your poster planning in your HW document. 
13 crowdgrader 11/29 Submit (Tue) 12/01 Grade (Thu) unusual schedule 
Turn in HW 12  
14  11/22  26  Posters wrapping up  poster template pdf, Rnw, sty, bib, logo
Prof Erhardt’s example poster pdf, Rnw
Grade HW 12
14  11/24  Thanksgiving break
15  11/29  27  Show poster  In-class: Course evaluations, submit receipt (capture screen image) as in-class assignment.
See email for more details. 
14 Rmd html
Due next Wednesday 12/7. Complete and submit your poster in LaTeX pdf format. Transition from Markdown to LaTeX 
14 crowdgrader 12/07 Submit No grading 
Turn in HW 13  
15  12/01  28  Approve poster, final touches  ARI Graphix $9 poster printing Open Mon-Fri 7:30-5:30 Do not use their website! Email plotting@abqrepro.com, Subject: ADA1 class poster Text: indicate to print “in color on bond paper”. Attach: Poster pdf with your name in the filename, such as “FirstLast_ADA1_poster.pdf”. Try to send by Friday noon for the poster to be ready by Monday. Arrange to pick up the poster. Price is $0.75/sq ft for Fall 2016.
Have a peer mentor approve your poster for printing and presentation. Congratulations!  Grade HW 13  
16  12/06  29  POSTERS  Poster sessions in SMLC Atrium  Poster Schedule (be on time): 3:30-3:40 Organization 3:40-4:40 Group 1 4:45-5:45 Group 2 5:50-6:50 Group 3 Congratulations on a great semester!
Poster rubric  Turn in HW 14 tomorrow (Wed)  
16  12/08  30  Class finishes early  Erik travelling, no class  NO Grade HW 14
17  12/08  Finals week  (no final)
(I reserve the right to continue to improve the materials throughout the semester.)
Syllabus
Description: Statistical tools for scientific research, including parametric and nonparametric methods for ANOVA and group comparisons, simple linear and multiple linear regression and basic ideas of experimental design and analysis. Emphasis placed on the use of statistical packages such as R. Course cannot be counted in the hours needed for graduate degrees in Mathematics and Statistics.
Prerequisite: Stat 145 (or other intro stats course)
Semesters offered: Fall
Lecture: Stat 427.002, CRN 54725; Stat 527.002, CRN 54726, TR 15:30-16:45; Location: CTLB 300 (building 55, northeast of Zimmerman) Video
Office hours: Tue/Thu 14:00-14:50, and by appointment in SMLC 312
email: “Erik B. Erhardt” <erike@stat.unm.edu>, please include “ADA1” in subject line
Textbook: Required books will be provided free by pdf on UNM Learn. Optional: Peter Dalgaard, “Introductory Statistics with R“, Second Edition, 2008, ISBN: 9780387790534. The book is not required, but it will provide a backup for what you learn in class.
Laptops running R: I encourage you to bring a laptop to class each day so you can try the R programming exercises in class. If you don’t have one, no problem, there are some laptops in class and teamwork is encouraged — sit next to someone friendly who likes to share.
Saving data: If you’re using classroom computers, use Flashdrives or UNM’s OneDrive (available in LoboMail) for saving files. The CTLB computers do not connect to your standard UNM drive space.
Teaching Assistants and Peer Mentors
Stat grad students TAs
Lindsey Pittington <lpittin
Ernest AttaAsiamah <eatta@unm.edu>, office hours Tue, Wed, Thu, and Fri 9:30-10:30 in SMLC 305
Peer Mentors, SEP
Alicia Dominguez, former student.
Andrew Nathan Hollis, former student.
Ayed Alanzi, stat graduate student.
innovationAcademy video
Student learning outcomes
At the end of the course, you will be able to: (student results: R, all years, 2015, 2014, 2013, 2012)
General outcomes:
 Organize knowledge in graphs, tables, and code to support concise, comprehensible, and scientifically defensible written interpretations to produce knowledge within a reproducible research environment.
 Distinguish a testable scientific hypothesis or data-supported interpretation from an opinion.
 Understand from a data story the goals of the study and apply the correct statistical procedure.
 Explain the scientific aspects of a problem to nonscientists in a fashion that enhances understanding and decision making.
Topical outcomes:
 Define parameters of interest and hypotheses in words and notation.
 Summarize data visually, numerically, and descriptively and interpret the observed characteristics. Calculate and interpret numerical summaries such as mean, variance, five-number summary, confidence intervals, and p-values, and create visual summaries such as bar plots, scatter plots, and histograms. (Never pie charts!)
 Distinguish between statistical significance and scientific relevance.
 Use statistical software, such as R, to read and manage data, create informative plots, report numerical summaries, and apply statistical models, by recommended programming practice including abstraction and documentation.
 Understand the differences and limitations of controlled experiments and observational studies. Design experiments to infer causal treatment effects. Analyze observational data to infer associations between measured variables.
 Identify and explain the statistical methods, assumptions, and limitations used in reported studies in scientific literature or popular media.
 Evaluate and criticize published studies, the work of peers, and your own work and assess what was done well, what could be done better, and examine whether their conclusions are supported using statistical principles.
 Make evidence-based decisions by constructing and deciding between testable hypotheses using appropriate data and methods.
 Discover relationships and make predictions through model development and selection.
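As a small illustration of the numerical summaries named in the topical outcomes, here is a minimal base R sketch. The vector `x` and the null value 5 are made up for the example and do not come from any course dataset.

```r
# Hypothetical measurements; not from any course dataset
x <- c(4.1, 5.3, 6.0, 5.8, 4.9, 7.2, 5.5, 6.4)

x_mean <- mean(x)      # sample mean
x_var  <- var(x)       # sample variance (n - 1 denominator)
x_five <- fivenum(x)   # min, lower hinge, median, upper hinge, max

# One-sample t-test: 95% confidence interval for the mean and a p-value
# for H0: mu = 5 (the null value is an arbitrary choice for this sketch)
tt   <- t.test(x, mu = 5, conf.level = 0.95)
x_ci <- tt$conf.int
x_p  <- tt$p.value

print(c(mean = x_mean, var = x_var))
print(x_five)
print(x_ci)
```

The numbers alone are not the answer; a written interpretation in context should accompany them.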
Meeting the learning outcomes
You will acquire new information in this class, but the emphasis is comprehending, integrating, and applying information. Rote factual memorization is the lowest form of learning. Effective learning takes place by explaining, integrating, applying, and analyzing facts, hypotheses, and theories.
Learning in this class occurs by:
 Doing – completion of exercises that require analysis of data to answer questions and test hypotheses, or researching answers to reading assignments.
 Discussion – interaction with classmates to assemble and synthesize information, utilizing the collective skills and knowledge base of the group.
 Listening, acting, and reflecting – activities during class time provide insights into information not available in readings and include review of difficult material to aid comprehension. Note taking permits later reflection on lecture content. Listening to the professor lecture is the least effective learning tool for most students, however, so you should plan on coming to every class prepared to participate in active and reflective learning opportunities.
Assessment
 Quizzes will be due each Tuesday before class. Purpose: to assess reading and video comprehension and ensure you’re prepared to actively participate in class activities with minimal lecture. (About 12, 20% of final grade, the lowest few are dropped.) Most weeks plan for 1-2 hours of reading and video, plus a 20-minute quiz. Quizzes are not timed, they can be taken twice, and the higher of the two scores is used for grade calculation.
 Viewing quiz solutions after the due date in UNM Learn is not intuitive. Click on the “Begin” button (this is the non-intuitive part, since you are not actually beginning the quiz), then click “View All Attempts” to see the scores. Finally, click “Calculated Grade” to see the feedback for each question of the quiz.
 In-class assignments are due by 5pm the next day, submitted to UNM Learn. Purpose: to struggle and find success in class with the concepts and skills. (About 24, includes class participation, 20% of final grade, the lowest several are dropped.) Most weeks plan to finish in class.
 Homework (HW) assignments are assigned each Thursday and due the following Thursday, submitted to crowdgrader.org (75% of HW grade). Purpose: to apply concepts and skills to your class poster project. (About 12, 40% of final grade, the lowest few are dropped.) Most weeks plan on 1-4 hours per assignment.
 Peer grading is due by the following Tuesday after each homework is due (25% of HW grade). Purpose: to gain skill assessing the work of others, as well as see alternative strategies to answer questions. Most weeks this will take about 30 minutes to grade 5 other students’ HW.
 Poster will be developed throughout the semester (most HW assignments contribute to the poster); in the last couple of weeks we’ll complete them, and in the last week we’ll have poster presentations. Purpose: to have an overarching set of questions to answer using methods learned in the course, with a deliverable you can be proud of! (1 poster and presentation: 12% poster, 2% presentation, and 2% evaluations of others, for 16% of final grade.) In the last couple of weeks, assembling this poster may take 5-10 hours, using a template provided to you.
 Course surveys are due at the beginning and end of the course. Purpose: to participate in national project-based learning projects and improve the course. (About 2, 4% of final grade.)
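The percentages in the list above add up to 100%. The following R sketch shows one way the components could combine into a course score. The component scores are hypothetical, and dropping the lowest scores within a component is not modeled here.

```r
# Weights as stated in the Assessment list
weights <- c(quizzes = 0.20, inclass = 0.20, homework = 0.40,
             poster = 0.12, presentation = 0.02, poster_evals = 0.02,
             surveys = 0.04)
stopifnot(isTRUE(all.equal(sum(weights), 1)))

# Homework component: 75% from the submission, 25% from peer grading
hw_score <- function(submission, peer_grading) {
  0.75 * submission + 0.25 * peer_grading
}

# Hypothetical component scores on a 0-1 scale (made up for illustration)
scores <- c(quizzes = 0.90, inclass = 0.95,
            homework = hw_score(submission = 0.80, peer_grading = 1.00),
            poster = 0.85, presentation = 0.90, poster_evals = 1.00,
            surveys = 1.00)

# Weighted sum, matching scores to weights by name
course_score <- sum(weights * scores[names(weights)])
print(round(course_score, 4))
```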
Final grade may include a small buffer at the discretion of the instructor. For example, final grade could be the total points earned divided by the total possible points times 0.95 for graduate students and 0.90 for undergraduate students. That is [Final Grade] = [Points Earned]/[Points possible * 0.95], so that your grade is slightly higher than you earned.
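To make the buffer arithmetic concrete, here is a minimal R sketch of the formula as written above. The function name `buffered_grade` and the point totals are hypothetical, and the buffer itself remains at the instructor's discretion.

```r
# Final grade = points earned / (points possible * buffer)
buffered_grade <- function(points_earned, points_possible, buffer) {
  points_earned / (points_possible * buffer)
}

# Hypothetical student with 850 of 1000 points (85% before the buffer)
grad      <- buffered_grade(850, 1000, buffer = 0.95)  # graduate example
undergrad <- buffered_grade(850, 1000, buffer = 0.90)  # undergraduate example

print(round(c(graduate = grad, undergraduate = undergrad), 4))
```

Because the denominator shrinks, both values come out above the unbuffered 0.85.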
All homework assignments in this class are electronic, submitted to crowdgrader.org for grading, except for the final poster.
Crowdgrader
 Students usually get far more feedback on their work than they would get from overworked teaching assistants/faculty.
 Students get to see what other students are doing, and they can learn from the work of others (taking the best ideas, and leaving the rest).
 In exchange for this, they need to put in some amount of work in reviewing the work of others.
 It is important that students understand that their final grade is determined both by the quality of their work, and by the precision of the grades they give, and the helpfulness of the reviews they write.
Late assignments will not be accepted.
Rubrics guide assessment (and self-assessment) of homework, code, projects, exams, and presentations. Each assignment will have its own specific rubric.
All R code for the assignment should be included with the part of the problem it addresses (for code and output use a fixed-width font, such as Courier).
Do NOT use your R code and output as your answer to the problem, but include them to show me how you arrived at your answer. Your prose solution (in a non-fixed-width font) should be provided in addition to R output.
Collaboration and citation
For homeworks I encourage you to work together. Please discuss the data, code, and problems with one another, but do your own exploration and write up. We expect everyone to hand in substantially different homeworks, and we will enforce this under the honor code. The small benefit you might get from plagiarism is not worth the severe penalty (of lost trust, being reported to the dean, no points for the assignment, etc.).
As in life, please use any resources available to you. Projects and some homeworks will explicitly encourage you to use resources on the internet, but showing extra initiative will always be appreciated. You may find R programming tough at first, so feel free to discuss your problems with other classmates or meet with or email questions to the TAs or me.
I encourage you to use the ideas of others, but make them your own, giving credit. For projects have a formal bibliography, for homework cite casually, and for code simply copy the URL in as a comment (which is doubly helpful for finding the resource again).
Statements
Disability statement
If you have a documented disability that will impact your work in this class, please contact me to discuss your needs. You’ll also need to register with the Accessibility Resource Center in 2021 Mesa Vista Hall (building 56) across the courtyard east from the SUB.
Title IX statement
In an effort to meet obligations under Title IX, UNM faculty, Teaching Assistants, and Graduate Assistants are considered “responsible employees” by the Department of Education (see pg 15). This designation requires that any report of gender discrimination which includes sexual harassment, sexual misconduct and sexual violence made to a faculty member, TA, or GA must be reported to the Title IX Coordinator at the Office of Equal Opportunity. For more information, see the campus policy regarding sexual misconduct.
Our Classroom
We’re doing this because:
 We want you to be empowered with statistics.
 We believe everyone should get out of this course with awesome skills
 Real-time feedback promotes efficient learning
“It encourages me to engage actively with the course material and take responsibility for my learning.”
GAISE Connections
Our six recommendations include the following:
 Emphasize statistical literacy and develop statistical thinking
 Use real data
 Stress conceptual understanding, rather than mere knowledge of procedures
 Foster active learning in the classroom
 Use technology for developing conceptual understanding and analyzing data
 Use assessments to improve and evaluate student learning
Learning without thought is labor lost.
What I hear, I forget.
What I see, I remember.
What I do, I understand.
– Confucius
Archive
Course introduction materials
Pre-course to-dos
Did you receive a registration error for Fall 2016? Send me an email with the following answers:
1. What registration error did you get (copy/paste is best)?
2. What is your UNM ID?
3. What is your Math/Stat background (that is, do you have the prereqs)?
If you are waitlisted, as long as there are seats available I will override you into the course. Don’t worry.
Step 0: Before our first class (Tue 8/23) please read through the following actions and install the required software on your computer and complete the brief surveys. If you don’t have a computer, there are classroom computers which will be available only when the classroom is open. Video for this process.
 Create a Google account (if you don’t already have one) to use with crowdgrader.
 Sign up for crowdgrader (which uses gmail account).
 Complete a quick 5-question survey so I can link your crowdgrader gmail account with your UNM user ID for homework assignments.
 Complete a short Opinio survey required for classroom assessment.
 Install or upgrade R (windows or mac) then Rstudio. Videos that may be helpful:
 Install R on Mac (2 min).
 Install R for Windows (3 min).
 Install R and RStudio on Windows (5 min).
 Install R packages, also update all packages within RStudio.
 Install Mendeley.
 Install LaTeX (for poster at end of semester).
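After installing R and RStudio, a quick check like the sketch below can show which packages still need installing. The package list is an assumption taken from the tools named in the course goal (ggplot2, knitr, xtable); adjust it to what an assignment actually requires.

```r
# Packages assumed for this sketch (from the course goal statement)
wanted <- c("ggplot2", "knitr", "xtable")

# Names of all packages currently installed in your R libraries
installed <- rownames(installed.packages())
not_installed <- setdiff(wanted, installed)

if (length(not_installed) > 0) {
  message("Not yet installed: ", paste(not_installed, collapse = ", "))
  # install.packages(not_installed)  # uncomment to install from CRAN (needs internet)
} else {
  message("All listed packages are installed.")
}
```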
Problems installing PDS package? Solution.
If you had problems installing the PDS package, no problem; here’s how to get the data:
1. Download the “.RData” file above for your dataset.
2. Where I have “library(PDS)” in my code, change it to the two lines below. You’ll need to update the “PATH_TO_FILE” below to the path on your computer’s hard drive, and “filename” needs to be changed to the name of the file. This will directly read the data file.
# library(PDS)
setwd("/PATH_TO_FILE")
load("filename.RData")
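To see what `load()` does before pointing it at a real dataset, the sketch below saves a toy object to a temporary .RData file and loads it back. The object `toy_data` and the temporary path are made up for illustration and stand in for the downloaded file.

```r
# Stand-in for a downloaded dataset file; "toy_data" is a made-up object
toy_data <- data.frame(id = 1:3, score = c(10, 12, 9))
rdata_path <- file.path(tempdir(), "filename.RData")
save(toy_data, file = rdata_path)
rm(toy_data)  # remove it so we can see load() bring it back

# load() restores objects under their saved names and returns those names
loaded_names <- load(rdata_path)
print(loaded_names)
str(toy_data)
```

Passing a full path to `load()`, as here, is an alternative to calling `setwd()` first; either approach reads the same file.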
Joining AddHealth waves 1 and 4 together into a single dataset can be done if you want to use variables from when the participants were both adolescents and adults. See Erik’s example project for the code.
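As a rough idea of what such a join looks like, here is a minimal base R sketch with toy data. The ID column names “AID” and “aid” follow the dataset descriptions above, but the rows and the other variable names are invented, and this is not the code from Erik’s example project.

```r
# Toy stand-ins for two waves; only the ID column names follow the datasets
w1 <- data.frame(AID = c("001", "002", "003"),
                 age_w1 = c(15, 16, 14),
                 stringsAsFactors = FALSE)
w4 <- data.frame(aid = c("002", "003", "004"),
                 age_w4 = c(29, 27, 30),
                 stringsAsFactors = FALSE)

# Inner join: keeps only participants present in both waves
both <- merge(w1, w4, by.x = "AID", by.y = "aid")
print(both)
```

By default `merge()` drops participants missing from either wave; `all.x = TRUE` or `all = TRUE` would keep them with `NA` values instead.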
Saving data: If you’re using classroom computers, use Flashdrives or UNM’s OneDrive (available in LoboMail) for saving files. The CTLB computers do not connect to your standard UNM drive space. I recommend using a very systematic folder structure, such as ADA1/HW, ADA1/Class, ADA1/Reading, ADA1/Poster, etc. Do not just work on files in your downloads folder or your desktop; respect your data and code!
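The folder structure suggested above can be created from R in a few lines. This sketch builds it under a temporary directory so it is safe to run anywhere; replace `tempdir()` with the location where you actually keep course files.

```r
# Root folder for the course; tempdir() is used only so the sketch is harmless
root <- file.path(tempdir(), "ADA1")
subdirs <- c("HW", "Class", "Reading", "Poster")

# recursive = TRUE also creates the ADA1 folder if it does not exist yet
for (d in file.path(root, subdirs)) {
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
}

print(list.dirs(root, full.names = FALSE, recursive = FALSE))
```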
Unicode compile problems: If you knit to pdf you may get this error: “! Package inputenc Error: Unicode char”. ASCII is the small character set that we use to program in; Unicode is an extended character set that looks pretty (for example “straight quotes” become “curly quotes”) but causes code to break. You get unwanted Unicode when you copy/paste from a pdf or some other source into your code. To fix this, you have to find the Unicode and replace it with its ASCII equivalent. To do this: Ctrl-F to find, search for “[^\x00-\x7F]” (without quotes), select “Regex” for regular expressions, and find the “Next” one. As it finds instances, replace the characters manually until there are no more. These characters will typically be curly quotes or fancy dashes.
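The same search can also be done from R. This is a minimal sketch that flags lines containing any character outside the ASCII range; the helper name `has_non_ascii` and the sample lines are made up for illustration. For a real file, you could read its lines with `readLines("yourfile.Rmd", encoding = "UTF-8")` first.

```r
# TRUE for each line that contains at least one non-ASCII character
has_non_ascii <- function(lines) {
  vapply(lines, function(s) any(utf8ToInt(s) > 127L),
         logical(1), USE.NAMES = FALSE)
}

# Sample lines: the second has curly quotes, the third an en dash
code_lines <- c(
  'x <- "straight quotes"',
  "y <- \u{201c}curly quotes\u{201d}",
  "z <- 1 \u{2013} 2"
)

bad <- which(has_non_ascii(code_lines))
print(bad)  # line numbers to fix by hand
```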
Random stuff
UNM has license for free online access to the definitive books for the Lattice and ggplot2 graphing platforms. Note you must be on campus or logged in through the UNM proxy to access these.
R is currently available in these UNM Locations: DSH 141 and 143, Econ 1004, SMLC pods, and SUB ITLoboLab Pod and ITLoboLab Classroom.
R style matters. There is a lot of online help on R, such as at UCLA, tryr, and Google’s Intro to R video series. Usually try searching for “R [mytopic]” and you’ll get lots of results. ggplot2 plotting cookbook.
R reference card by Jonathan Baron.
Translate between MATLAB and R.
Figure checklist. Choosing the right chart. Nature Methods points of view on visualization.
Statistical consulting and collaboration slides
Raster vs vector graphics.
Statistics prereq refresher from Khan Academy.
Coursera has a free 4-week course on computing for data analysis with R.
Muddy points in perspective.
R+LaTeX+knitr for reproducible research. See my SC1 lecture notes (Ch01), and Mohammad Arbabshirani’s notes (pdf, rnw).
Asking smart questions
“Smart Questions” guide (note “hackers build things, crackers break them”)
Email Question Rubric:
* Send one email per question.
— Use “Reply” to continue conversation on a question; send a new email for a new question.
* Include “ADA1” as the first word of the subject line in new emails (if replying, just use reply).
* Begin email with a short question summary.
* When possible, include commented code in email body
— Comments should indicate where the problem is, what the expected behavior is, and what steps are necessary to reproduce problem.
— Code should include a “Minimum representative test case” (http://www.catb.org/esr/faqs/
* If attaching code, please include all the files necessary to run your code (data, etc.).
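A sketch of what commented code in a question email might look like is shown below. The problem itself (a mean that returns NA) is an invented example, chosen only to show the summary, the expected behavior, and the steps to reproduce.

```r
# Question summary: mean() returns NA for my variable; I expected a number.
# Steps to reproduce: run the lines below in a fresh R session.

x <- c(2, 4, NA, 6)   # small made-up vector that reproduces the issue
result <- mean(x)     # PROBLEM IS HERE: this gives NA
print(result)

# Expected behavior: 4, the mean of the non-missing values.
# What I tried: na.rm = TRUE gives 4, but I am unsure whether
# dropping the missing value is appropriate for my analysis.
fixed <- mean(x, na.rm = TRUE)
print(fixed)
```

A few lines with made-up data are usually easier to answer than a full homework file.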
Help:
LaTeX wiki, lshort, Detexify LaTeX symbols (linux texlive package management)
R tutorials: TryR (gentle), Kelly Black
Knitr in Rstudio (knitr is modern version of Sweave intro, demo, guide)
xtable to produce LaTeX tabular environment from R data.frames
Cookbook for R for helpful examples, visualization tutorials, diagrams
Image formats: vector (pdf, eps) vs raster (jpeg, bmp, tiff, gif)
R Errors and fixes
Rcpp error
The solution can be clearly described like this:
 At the CRAN Rcpp page, download the “Windows binaries” “r-release” zip file (currently: Rcpp_0.12.6.zip) onto your desktop or a temporary location.
 Unzip it, it will create a folder with a version number that includes a subfolder called “Rcpp”.
 Navigate into the folder and copy the “Rcpp” folder into your library folder (on my computer it’s “C:\Program Files\R\R-3.3.1\library”).
 Restart RStudio and try to compile a markdown file.
 If successful, then delete the zip file you downloaded and the folder it created.
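Before and after following the steps above, a quick check from the R console can show where your library folder is and whether R can find Rcpp. This is only a diagnostic sketch; it does not perform the fix itself.

```r
# Library folders R searches, in order; the first is usually where packages install
lib_dirs <- .libPaths()
print(lib_dirs)

# TRUE if Rcpp can be found and loaded from one of those folders
rcpp_ok <- requireNamespace("Rcpp", quietly = TRUE)

if (rcpp_ok) {
  message("Rcpp found, version ", as.character(utils::packageVersion("Rcpp")))
} else {
  message("Rcpp not found in: ", paste(lib_dirs, collapse = "; "))
}
```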
Why stats now?
Important enough to have a US Chief Data Scientist (1) (2).
Citing and using notes, including previous editions
Citing lecture notes, example: Erhardt EB, Bedrick EJ, and Schrader RM. (2016) Lecture notes for Advanced Data Analysis 1. Retrieved Mar 1, 2016, from statacumen.com/teach/ADA1/ADA1_notes.pdf, 136–144.
Notes from Fall 2015 using R: ADA1_notes_F15.pdf includes all chapters in one document.
Lecture notes for Advanced Data Analysis 1 (ADA1) Stat 427/527 University of New Mexico is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. Based on a work at http://statacumen.com/teach/ADA1/ADA1_notes_F15.pdf.
Notes from Fall 2014 using R: ADA1_notes_F14.pdf includes all chapters in one document.
Lecture notes for Advanced Data Analysis 1 (ADA1) Stat 427/527 University of New Mexico is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. Based on a work at http://statacumen.com/teach/ADA1/ADA1_notes_F14.pdf.
Notes from Fall 2013 using R: ADA1_notes_F13.pdf includes all chapters in one document.
Lecture notes for Advanced Data Analysis 1 (ADA1) Stat 427/527 University of New Mexico is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. Based on a work at http://statacumen.com/teach/ADA1/ADA1_notes_F13.pdf.
Notes from Fall 2012 using R: ADA1_notes_F12.pdf includes all chapters in one document.
Lecture notes for Advanced Data Analysis 1 (ADA1) Stat 427/527 University of New Mexico is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. Based on a work at http://statacumen.com/teach/ADA1/ADA1_notes_F12.pdf.
Notes from Fall 2011 using Minitab: ADA1_notes_F11.pdf includes all chapters in one document.
Lecture notes for Advanced Data Analysis 1 (ADA1) Stat 427/527 University of New Mexico is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. Based on a work at http://statacumen.com/teach/ADA1/ADA1_notes_F11.pdf.
Semester schedule narrative
Week 01: The poster gives the context of what we will learn in class. We get familiar with data by defining a codebook for a hypothetical dataset using RStudio and markdown. We use this assignment to practice peer review using crowdgrader in class. We review large dataset codebooks and choose an interesting subset of variables to use in our personalized projects throughout the semester.
Week 02: Each student chooses a large dataset, defines a few research questions, specifies variables that can address those questions, and performs a short literature review to provide context for their questions.
Week 03: Using personalized datasets, we start programming by creating personal data subsets and visualizing our univariate data.
Week 04: We plot bivariate relationships and implement data cleaning with editrules.
Week 05: We introduce simple linear regression and logarithmic transformations, and how to interpret each.
Week 06: Correlation and regression to the mean, including in-class student data collection, and spurious correlations. Then categorical contingency tables, interpreting conditional proportions, and Simpson’s paradox.
…
Table of selected statistical methods
The data and design determine which method you use: original or UCLA.
Here’s a table of methods with the applicable semester of ADA and Chapter.
| Number of Dependent Variables | Number of Independent Variables | Type of Dependent Variable(s) | Type of Independent Variable(s) | Measure | Test(s) | ADA-Ch |
|---|---|---|---|---|---|---|
| 1 | 0 (1 population) | continuous normal | not applicable (none) | mean | one-sample t-test | 1-02 |
| | | continuous non-normal | | median | one-sample median | 1-06 |
| | | categorical | | proportions | Chi-square goodness-of-fit, binomial test | 1-07 |
| | 1 (2 independent populations) | normal | 2 categories | mean | 2 independent sample t-test | 1-03 |
| | | non-normal | | medians | Mann-Whitney, Wilcoxon rank sum test | 1-06 |
| | | categorical | | proportions | Chi-square test, Fisher’s exact test | 1-07 |
| | 0 (1 population measured twice) or 1 (2 matched populations) | normal | not applicable/categorical | means | paired t-test | 1-02 |
| | | non-normal | | medians | Wilcoxon signed ranks test | 1-06 |
| | | categorical | | proportions | McNemar, Chi-square test | 1-07 |
| | 1 (3 or more populations) | normal | categorical | means | one-way ANOVA | 1-05 |
| | | non-normal | | medians | Kruskal-Wallis | 1-06 |
| | | categorical | | proportions | Chi-square test | 1-07 |
| | 2 or more (e.g., 2-way ANOVA) | normal | categorical | means | Factorial ANOVA | 2-05 |
| | | non-normal | | medians | Friedman test | not |
| | | categorical | | proportions | log-linear, logistic regression | 2-11 |
| | 0 (1 population measured 3 or more times) | normal | not applicable | means | Repeated measures ANOVA | not |
| | 1 | normal | continuous | | correlation, simple linear regression | 1-08 |
| | | non-normal | | | non-parametric correlation | 1-08 |
| | | categorical | categorical or continuous | | logistic regression | 2-11 |
| | | | continuous | | discriminant analysis | 2-16 |
| | 2 or more | normal | continuous | | multiple linear regression | 2-02 |
| | | non-normal | | | | |
| | | categorical | | | logistic regression | 2-11 |
| | | normal | mixed categorical and continuous | | Analysis of Covariance, General Linear Models (regression) | 2-09 |
| | | non-normal | | | | not |
| | | categorical | | | logistic regression | 2-11 |
| 2 | 2 or more | normal | categorical | | MANOVA | 2-15 |
| 2 or more | 2 or more | normal | continuous | | multivariate multiple linear regression | not |
| 2 sets of 2 or more | 0 | normal | not applicable | | canonical correlation | not |
| 2 or more | 0 | normal | not applicable | | factor analysis | not |
| | 0 or more | | mixed categorical and continuous | | principal component analysis (w/multiple regression) | 2-13 |
| | | categorical | | | cluster analysis | 2-13 |
| | | | | | discriminant analysis | 2-16 |
| | | | | | classification | 2-17 |
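
To make the table concrete, here is a minimal sketch of how a few of its rows map to base R functions. The built-in `mtcars` and `sleep` datasets and all object names are assumptions chosen for illustration; they are not taken from the course notes, which may use different functions or options.

```r
# 1 DV, 0 IVs (1 population), normal: one-sample t-test of mean mpg against 20.
t_one <- t.test(mtcars$mpg, mu = 20)

# 1 DV, 1 IV with 2 categories (2 independent populations).
t_two <- t.test(mpg ~ am, data = mtcars)                      # normal (Welch by default)
w_two <- wilcox.test(mpg ~ am, data = mtcars, exact = FALSE)  # non-normal

# 1 population measured twice: paired t-test on the sleep data.
t_pair <- t.test(sleep$extra[sleep$group == 1],
                 sleep$extra[sleep$group == 2],
                 paired = TRUE)

# 1 IV with 3 or more populations.
a_one <- aov(mpg ~ factor(cyl), data = mtcars)           # normal: one-way ANOVA
k_one <- kruskal.test(mpg ~ factor(cyl), data = mtcars)  # non-normal: Kruskal-Wallis

# Categorical DV and categorical IV: Fisher's exact test on a contingency table.
tab   <- table(mtcars$am, mtcars$cyl)
f_cat <- fisher.test(tab)

# 1 continuous IV, normal DV: correlation and simple linear regression.
cor_xy <- cor.test(mtcars$mpg, mtcars$wt)
lm_xy  <- lm(mpg ~ wt, data = mtcars)

# Categorical (binary) DV, continuous IV: logistic regression.
glm_am <- glm(am ~ wt, data = mtcars, family = binomial)
```

Printing any of these objects (for example `t_two`, or `summary(lm_xy)`) shows the test statistic and p-value; which row of the table applies still depends on your data and design.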