ADA1
UNM Stat 427/527: Advanced Data Analysis I (ADA1)
The Fall 2013 syllabus is below the table.
Fall 2013 schedule:
Time: TR 0930-1045
Location: Hibben 105
Stat 427, CRN 35812
Stat 527, CRN 35813
Did you receive a registration error for Fall 2013? Send me an email with the following answers:
1. What registration error did you get (copy/paste is best)?
2. What is your UNM ID?
3. What is your Math/Stat background (that is, do you have the pre-reqs)?
News:
5/16 The notes have been updated for Fall 2013.
Tentative Timetable
| Wk-Date | Ch | Topic | Slides Code Data | pts HW sol Data | Read ISWR | HW Due | Plot |
| 01-08/20 | 00 | Introduction to R, Rstudio, and ggplot | Ch 00 R | 10 HW00 | | 08/23 | day |
| 01-08/22 | | | | | 1.2, 1.3 | | Minard |
| 02-08/27 | | | | | Ch 2 | | pac,pi |
| 02-08/29 | 01 | Summarizing and Displaying Data | Ch 01 R | 60 HW01 sol d1 d2 | | 09/11 | crash |
| 03-09/03 | | | | | Ch 4 | | Nobel |
| 03-09/05 | 02 | Estimation in One-Sample Problems | Ch 02 R | 120 HW02 sol d1 d2 d3 | | 09/20 | Space |
| 04-09/10 | | | | | 5.1 | | 9/11 |
| 04-09/12 | | | | | | | baby |
| 05-09/17 | 03 | Two-Sample Inferences | Ch 03 R | 130 HW03 sol d1 d2 d3 | 5.3 | 10/02 | fx23456 |
| 05-09/19 | | | | | | | null |
| 06-09/24 | 04 | Checking Assumptions | Ch 04 R | (in HW3) | | | rad |
| 06-09/26 | 05 | One-way ANOVA | Ch 05 R CHDS dat desc | 80 HW05 sol | 7.1 | 10/16 | boyfr |
| 07-10/01 | | Assign Teams csv R | | Assign Proj | | | sig |
| 07-10/03 | | | | | | | worst,2 |
| 08-10/08 | | (Midterm Review) | | | | | Obudg,2 |
| 08-10/10 | | Fall Break | | | | | |
| 09-10/15 | 06 | Nonparametric Methods | Ch 06 R | 175 HW06 sol d1 d2 d3 d4 | 5.2, 5.5, 5.7, 7.2, 7.4 | 11/06 | bball |
| 09-10/17 | | | | | | | bball2 |
| 10-10/22 | | Midterm, Chs 1-5 | Bring: UNM ID, pen(cil), and 4×6″ handwritten “help” card | | | | |
| 10-10/24 | | | | | | Proj 1 | feel,2 |
| 11-10/29 | | | | | | | vote,2 |
| 11-10/31 | | (no class last year) | | | | | choc p |
| 12-11/05 | 07 | Categorical Data Analysis (election day) | Ch 07 R | 105 HW07 sol | Ch 8 | 11/27 | cause grid |
| 12-11/07 | | | | | | | work |
| 13-11/12 | | | | | | | occupy |
| 13-11/14 | | | | | | | food |
| 14-11/19 | 08 | Correlation and Regression | Ch 08 R | 80 HW08 sol d1 | Ch 6 | (12/13) Proj 2 | terr |
| 14-11/21 | | Thanksgiving break | | | | | extrap,2 |
| 15-11/26 | | | | | | | roulette |
| 15-11/28 | | | | | | | sodapop |
| 16-12/03 | | | | | | | text |
| 16-12/05 | 09 | Bootstrap | Ch 09 R | (no HW09) | | Proj 3 | insur |
| 17-12/12 | | Finals week | 10:00-12:00 (HW8 due, no final) | | | | |
| | 10 | Power and Sample size | Ch 10 R | | | | |
R functions written for these notes appearing in other chapters.
Statistical consulting and collaboration slides
Notes from Fall 2013 using R: ADA1_notes_F13.pdf includes all chapters in one document.
Lecture notes for Advanced Data Analysis 1 (ADA1) Stat 427/527 University of New Mexico is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. Based on a work at http://statacumen.com/teach/ADA1/ADA1_notes_F13.pdf.
Notes from Fall 2012 using R: ADA1_notes_F12.pdf includes all chapters in one document.
Lecture notes for Advanced Data Analysis 1 (ADA1) Stat 427/527 University of New Mexico is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. Based on a work at http://statacumen.com/teach/ADA1/ADA1_notes_F12.pdf.
Notes from Fall 2011 using Minitab: ADA1_notes_F11.pdf includes all chapters in one document.
Lecture notes for Advanced Data Analysis 1 (ADA1) Stat 427/527 University of New Mexico is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. Based on a work at http://statacumen.com/teach/ADA1/ADA1_notes_F11.pdf.
Syllabus
Description: Statistical tools for scientific research, including parametric and non-parametric methods for ANOVA and group comparisons, simple linear and multiple linear regression and basic ideas of experimental design and analysis. Emphasis placed on the use of statistical packages such as R. Course cannot be counted in the hours needed for graduate degrees in Mathematics and Statistics.
Prerequisite: Stat 145 (or other intro stats course)
Semesters offered: Fall
Lecture: Stat 427/527.001, TR 12:30–13:45, Hibben 105
Office hours: Tues 2-3pm and by appointment in SMLC 312
email: “Erik B. Erhardt” <erike@stat.unm.edu>, please include “ADA1” in subject line
Textbook: Peter Dalgaard, “Introductory Statistics with R”, Second Edition, 2008, ISBN: 978-0-387-79053-4. The book is not required, but it will provide a backup for what you learn in class.
i>Clickers: Yes, we’re going to use clickers! You don’t need to buy a new one, you can get a used one, or you can share with someone who isn’t also in our class. Please bring the same one to class each day. Sorry, though web clickers are an option, technically speaking, it is beyond what I can do for our class this first semester of my using clickers. There is also some expense for using the web clicker system rather than the simple iClicker system.
Laptops running R: I encourage you to bring a laptop to class each day so you can try the R programming exercises in class. If you don’t have one, no problem, teamwork is encouraged — sit next to someone friendly who likes to share.
Teaching Assistants
Claire Longo <clongo01@unm.edu> SMLC 323 (Math/Stat Graduate conference room) W 13:00-15:00 [unavailable 11/13-15 (no office hours), 11/26-27]
Mohammad Arbabshirani <mrezaarb@unm.edu> SMLC 323 (Math/Stat Graduate conference room) T 11:00-12:15, R 11:00-12:15
Sometimes the Math/Stat Graduate conference room has something else going on. In this case, look around … Claire or Mohammad are probably close by.
Student learning outcomes
At the end of the course, you will be able to: (student results: R plot)
General outcomes:
1. Organize knowledge in graphs, tables, and code to support concise, comprehensible, and scientifically defensible written interpretations to produce knowledge.
2. Distinguish a testable scientific hypothesis or data-supported interpretation from an opinion.
3. Understand from a data story the goals of the study and apply the correct statistical procedure.
4. Explain the scientific aspects of a problem to nonscientists in a fashion that enhances understanding and decision making.
Topical outcomes:
5. Define parameters of interest and hypotheses in words and notation.
6. Summarize data visually, numerically, and descriptively and interpret the observed characteristics. Calculate and interpret numerical summaries such as mean, variance, five-number summary, confidence intervals, and p-values, and create visual summaries such as bar plots, scatter plots, and histograms. (Never pie charts!)
7. Distinguish between statistical significance and scientific relevance.
8. Use statistical software, such as R, to read and manage data, create informative plots, report numerical summaries, and apply statistical models, following recommended programming practices including abstraction and documentation.
9. Understand the differences and limitations of controlled experiments and observational studies. Design experiments to infer causal treatment effects. Analyze observational data to infer associations between measured variables.
10. Identify and explain the statistical methods, assumptions, and limitations used in reported studies in scientific literature or popular media.
11. Evaluate and criticize published studies, the work of peers, and your own work: assess what was done well and what could be done better, and examine whether the conclusions are supported using statistical principles.
12. Make evidence-based decisions by constructing and deciding between testable hypotheses using appropriate data and methods.
13. Discover relationships and make predictions through model development and selection.
Meeting the learning outcomes
You will acquire new information in this class, but the emphasis is on comprehending, integrating, and applying information. Rote factual memorization is the lowest form of learning. Effective learning takes place by explaining, integrating, applying, and analyzing facts, hypotheses, and theories.
Learning in this class occurs by:
- Doing – completion of exercises that require analysis of data to answer questions and test hypotheses, or researching answers to reading assignments.
- Discussion – interaction with classmates to assemble and synthesize information, utilizing the collective skills and knowledge base of the group.
- Listening, acting, and reflecting – activities during class time provide insights into information not available in readings and include review of difficult material to aid comprehension. Note taking permits later reflection on lecture content. Listening to the professor lecture is the least effective learning tool for most students, however, and you should plan on coming to every class prepared to participate in active and reflective learning opportunities.
Assessment
Rubrics guide assessment (and self-assessment) of homework, code, projects, exams, and presentations.
Homework is due 1 week (or 2 classes, whichever is shorter) after we complete each chapter. The homework grade is based on rubrics for homework (75%) and code (25%).
Header for homework assignments for each part:
First Last
ADA1 Stat 427 (or 527)
HW ##, Part #
MM/DD/YYYY
All R code for the assignment should be included in an appendix at the end of the document.
Grading breakdown
Semi-weekly homework: 45%, lowest grade dropped
Team projects: 30%
Midterm exam: 20%
Class participation: 5% (i>Clicker)
Please hand in a physical version of your homework and projects – a TA will write comments on it and give it back to you. An electronic version will be accepted only under exceptional circumstances (almost never).
Late assignments will be penalized 20% if handed in by 5pm the following day, and will not be accepted after that. (In the past I have been loose about this, but with so many students I’m going to be firm.)
Grading scale
Assignments will be graded according to a rubric, which might be different from what you’re used to. This is how it works. Each component (typically skepticism, curiosity, and organization) is graded between 1 (F) and 5 (A+). To get a 5 for any component you will typically need to go above and beyond what I have covered in class and show me something new. Earning a 3 or 4 out of 5 indicates you’re doing very well.
If you find that on the first couple of homeworks you only receive grades of 5 or 6 out of 15, don’t worry too much. Perseverance will win the day. In the past, most students have ended up doing extremely well in the ADA courses.
A rough conversion between rubric grades and letter grades is:
rubric = gpa = letter
5 = 4.0–4.33 = A+
4 = 3.5–4.0 = A
3 = 2.5–3.5 = B
2 = 1.5–2.5 = C
1 = <1.5 = F
These are minimum guaranteed grades – that is, we may be more generous depending on the grading distribution this semester. Plusses and minuses will be awarded at our discretion.
Grade Calculation
Grading from HW0 to HW3: First you earn points relative to the rubric. The rubric grade is averaged between the HW (maximum 5+5+5=15) and Code (maximum 5+5+5=15) portions of your HW grade (0.75*HW + 0.25*Code), then is weighted by each assignment’s points (HW1 is 60, HW2 is 120, etc.). This grand weighted average over all your assignments is converted to a gpa score (as above) for a final letter grade.
In particular:
| Rubric of 15=5+5+5 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
| Percent | 0 | 3 | 6 | 10 | 30 | 50 | 70 | 73 | 77 | 80 | 83 | 87 | 90 | 94 | 98 | 100 |
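As a rough illustration of the calculation described above, the following R sketch converts rubric totals to percents using the table and then forms the points-weighted average. The function names (`rubric_to_percent`, `assignment_percent`) and the example scores are hypothetical, and the sketch assumes the 75/25 split is applied after each rubric total is converted to a percent; the actual grade sheet may differ.

```r
# Percent awarded for each rubric total 0..15 (copied from the table above).
rubric_percent <- c(0, 3, 6, 10, 30, 50, 70, 73, 77, 80, 83, 87, 90, 94, 98, 100)

# Look up the percent for a rubric total between 0 and 15.
rubric_to_percent <- function(total) {
  stopifnot(all(total %in% 0:15))
  rubric_percent[total + 1]  # +1 because R vectors are 1-indexed
}

# Combine the HW and Code portions of one assignment (75% HW, 25% Code).
# Assumption: the split is applied to percents, not to raw rubric totals.
assignment_percent <- function(hw_total, code_total) {
  0.75 * rubric_to_percent(hw_total) + 0.25 * rubric_to_percent(code_total)
}

# Hypothetical rubric totals for HW1 (60 pts) and HW2 (120 pts).
scores <- data.frame(
  points = c(60, 120),
  hw     = c(9, 12),
  code   = c(12, 15)
)
scores$percent <- assignment_percent(scores$hw, scores$code)

# Overall grade: average of the percents, weighted by assignment points.
overall <- weighted.mean(scores$percent, w = scores$points)
print(scores)
print(overall)
```

Weighting by points means a 120-point assignment counts twice as much as a 60-point one.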
Grading from HW 5 forward (including Midterm): HW grade is based on the points (rather than the rubric), though a rubric score will also be included to help guide where you can improve. If you earn a 5+5+5 on the rubric, then your grade will be increased by 10% of what you earned for doing exceptional work. Code will still be based on the rubric.
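A minimal sketch of the bonus rule, assuming the 10% is applied to the points earned on that assignment. `hw_grade_with_bonus` is a hypothetical helper written for illustration, not part of the course materials.

```r
# Points-based HW grade with a 10% bonus for a perfect 5+5+5 rubric.
# Assumption: the bonus is 10% of the points earned on this assignment.
hw_grade_with_bonus <- function(points_earned, rubric) {
  stopifnot(length(rubric) == 3, all(rubric %in% 1:5))
  bonus <- if (all(rubric == 5)) 0.10 * points_earned else 0
  points_earned + bonus
}

with_bonus    <- hw_grade_with_bonus(72, c(5, 5, 5))  # bonus applies
without_bonus <- hw_grade_with_bonus(72, c(5, 4, 5))  # no bonus
print(c(with_bonus = with_bonus, without_bonus = without_bonus))
```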
Semi-weekly homework
Homework is designed to encourage you to review the material we’ve learned, synthesize new information from the R help pages or the web, and apply (and learn!) your new skills. Expect to spend 4-5 hours a week (outside of class!) to do well, and maybe double that to do outstandingly well.
Team projects
Working in teams is how science gets done. Each member of the team is responsible for every part of the project. I know team projects can be frustrating, requiring maturity, mutual consideration, and professionalism throughout, but I hope to teach some skills that should make it less painful. More details will be provided when we start the first project, but expect to produce a 5-10 page report detailing the analysis of an existing data set or one you collect from a study you design.
Each project will receive a single grade, but individual grades will be weighted by effort as judged by the entire team.
Teams will be assigned by the TAs and myself. Teams can choose to fire team members who are not performing well (after meeting with me as a team), and individuals can choose to quit if they feel they are doing all the work.
Model answers
Some homeworks and projects are open ended, and there are no right answers. For each I will show model answers so you have a sense of the quality and content I’m looking for. Occasionally, I’ll publish (anonymously) a few of the best answers. If you don’t want yours to be published, please let me know.
Collaboration and citation
For homeworks (and obviously team projects) I encourage you to work together. Please discuss the data, code, and problems with one another, but do your own exploration and write up. We expect everyone to hand in substantially different homeworks, and we will enforce this under the honor code. The small benefit you might get from plagiarism is not worth the severe penalty.
As in life, please use any resources available to you. Projects and some homeworks will explicitly encourage you to use resources on the internet, but showing extra initiative will always be appreciated. You may find R programming tough at first, so feel free to discuss your problems with other classmates or meet with or email questions to the TAs or me.
I encourage you to use the ideas of others, but make them your own, giving credit. For projects, include a formal bibliography; for homework, cite casually; and for code, copy the URL in a comment (which is doubly helpful for finding the resource again).
Disability statement
If you have a documented disability that will impact your work in this class, please contact me to discuss your needs. You’ll also need to register with the Accessibility Resource Center in 2021 Mesa Vista Hall (building 56) across the courtyard east from the SUB.
Lecture notes from 2011, using Minitab
The book of lecture notes from Fall 2011 is available as ADA1_notes_F11.pdf and includes most of the statistical content we’ll cover this semester.
If you wish to cite these notes, here’s an example: Bedrick EJ, Schrader RM, and Erhardt EB. (2011) Lecture notes for Advanced Data Analysis 1. Retrieved Jan 1, 2012, from statacumen.com/teach/ADA1/ADA1_notes.pdf, 136–144.
Learning without thought is labor lost.
What I hear, I forget.
What I see, I remember.
What I do, I understand.
- Confucius
Random stuff:
UNM R programming group, organized and taught by Christian Gunning, meeting at 12:30pm on Friday (place TBD). From Christian: “This semester, I’d like to focus on graphing. The emphasis will be on producing publication-quality figures using small to medium datasets from participants own research. I suggest we spend the 1st half of the semester making figures with lattice, and the second half using ggplot2.”
UNM has license for free online access to the definitive books for the Lattice and ggplot2 graphing platforms. Note you must be on campus or logged in through the UNM proxy to access these.
R style matters. There is a lot of online help on R, such as at UCLA. Usually try searching for “R [mytopic]” and you’ll get lots of results. ggplot2 plotting cookbook.
R reference card by Jonathan Baron.
Translate between MATLAB and R.
Figure checklist. Choosing the right chart.
Raster vs vector graphics.
Statistics pre-req refresher from Khan Academy.
R errors and solutions:
Symptom: On Mac OS X Lion with R 2.15.1, R will not display fonts in the plotting windows and, when using ggplot2, we get an error.
Possible cause (would like confirmation if you also have this problem): Office for Mac 2011 fonts confuse the Mac Font Book, especially Arial.
The error
Error in grid.Call(L_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
Polygon edge not found
can be fixed by defining a font each time:
p <- p + theme_bw(base_family = 'Helvetica')
print(p)
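For context, a self-contained version of this workaround might look like the following. The scatterplot of the built-in `mtcars` data is only a stand-in for your own plot, and `theme_set()` is shown as an optional way to avoid repeating the font choice on every plot. Whether Helvetica resolves the error on a given machine is not guaranteed.

```r
library(ggplot2)

# Stand-in plot using a built-in data set (replace with your own plot).
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Name the font family explicitly for this plot.
p <- p + theme_bw(base_family = "Helvetica")
print(p)

# Optionally make this the default theme for the rest of the session,
# so later plots do not each need the base_family argument.
theme_set(theme_bw(base_family = "Helvetica"))
```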
Archive
Prior to first day:
Step 0a: Set up R and RStudio
(1) Download R for Windows or Mac, (2) install RStudio, and (3) install a package we’ll use with the following R command: install.packages("ggplot2").
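If you want to check the setup from a script, one possible sketch is shown below. The `needed` vector and the choice to install only from an interactive session are assumptions made for illustration; running install.packages("ggplot2") directly at the console works just as well.

```r
# Packages used in the course; extend this vector as needed.
needed <- c("ggplot2")

# Check which of the packages are already installed.
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
to_install <- needed[!is_installed]

# Install missing packages, but only from an interactive session
# (an assumption here, so the script is safe to run non-interactively).
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}

# Report the R version, which is useful to include when asking for help.
print(R.version.string)
```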
Step 0b: Register your i>Clicker
(1) Obtain your iClicker, (2) register online (First and Last name as on my.unm.edu, UNM ID number), and (3) bring it with you to class every day. (See my note in the syllabus about clickers). I will then be able to sync your data to the grade sheet.
I am drastically redesigning this course (conducting a major learning experiment with … you!) and will make the slides for the following section available as soon as I can, and always at least a day before class.
(Idea about Proj 2: Might skip “Ch 12 – Stat consulting” in favor of two days of Proj 2 presentations. Two 75-minute sessions with 90 students in 4-person teams gives 2*75/(90/4) ≈ 6.7-minute blocks, so strict 5-min presentations with a 1-min changeover, with pdfs prepared ahead of time and numbered by group. Each presentation would be assessed by the Prof and TAs, and self-assessed by the team using the rubric. Reports are also assessed.)
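The arithmetic behind the presentation slots can be checked with a few lines of R. The variable names are made up for this sketch, and rounding up to whole teams is an added assumption.

```r
sessions            <- 2   # class sessions set aside for presentations
minutes_per_session <- 75
students            <- 90
team_size           <- 4

# Number of teams if students split evenly (22.5, so 23 teams in practice).
teams <- students / team_size

# Minutes available per team across both sessions.
minutes_per_team <- sessions * minutes_per_session / teams
print(minutes_per_team)

# With a whole number of teams the slot is slightly shorter, but still
# leaves room for a 5-minute talk plus a 1-minute changeover.
minutes_per_whole_team <- sessions * minutes_per_session / ceiling(teams)
print(minutes_per_whole_team)
```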
Table of selected statistical methods
The data and design determine which method you use: original or UCLA.
Here’s a table of methods with the applicable semester of ADA and Chapter.
| Number of Dependent Variables | Number of Independent Variables | Type of Dependent Variable(s) | Type of Independent Variable(s) | Measure | Test(s) | ADA-Ch |
| 1 | 0 (1 population) | continuous normal | not applicable (none) | mean | one-sample t-test | 1-02 |
| 1 | 0 (1 population) | continuous non-normal | not applicable (none) | median | one-sample median | 1-06 |
| 1 | 0 (1 population) | categorical | not applicable (none) | proportions | Chi Square goodness-of-fit, binomial test | 1-07 |
| 1 | 1 (2 independent populations) | normal | 2 categories | mean | 2 independent sample t-test | 1-03 |
| 1 | 1 (2 independent populations) | non-normal | 2 categories | medians | Mann Whitney, Wilcoxon rank sum test | 1-06 |
| 1 | 1 (2 independent populations) | categorical | 2 categories | proportions | Chi square test, Fisher’s Exact test | 1-07 |
| 1 | 0 (1 population measured twice) or 1 (2 matched populations) | normal | not applicable/ categorical | means | paired t-test | 1-02 |
| 1 | 0 (1 population measured twice) or 1 (2 matched populations) | non-normal | not applicable/ categorical | medians | Wilcoxon signed ranks test | 1-06 |
| 1 | 0 (1 population measured twice) or 1 (2 matched populations) | categorical | not applicable/ categorical | proportions | McNemar, Chi-square test | 1-07 |
| 1 | 1 (3 or more populations) | normal | categorical | means | one-way ANOVA | 1-05 |
| 1 | 1 (3 or more populations) | non-normal | categorical | medians | Kruskal Wallis | 1-06 |
| 1 | 1 (3 or more populations) | categorical | categorical | proportions | Chi square test | 1-07 |
| 1 | 2 or more (e.g., 2-way ANOVA) | normal | categorical | means | Factorial ANOVA | 2-05 |
| 1 | 2 or more (e.g., 2-way ANOVA) | non-normal | categorical | medians | Friedman test | not |
| 1 | 2 or more (e.g., 2-way ANOVA) | categorical | categorical | proportions | log-linear, logistic regression | 2-11 |
| 1 | 0 (1 population measured 3 or more times) | normal | not applicable | means | Repeated measures ANOVA | not |
| 1 | 1 | normal | continuous | | correlation, simple linear regression | 1-08 |
| 1 | 1 | non-normal | continuous | | non-parametric correlation | 1-08 |
| 1 | 1 | categorical | categorical or continuous | | logistic regression | 2-11 |
| 1 | 1 | categorical | continuous | | discriminant analysis | 2-16 |
| 1 | 2 or more | normal | continuous | | multiple linear regression | 2-02 |
| 1 | 2 or more | non-normal | continuous | | | |
| 1 | 2 or more | categorical | continuous | | logistic regression | 2-11 |
| 1 | 2 or more | normal | mixed categorical and continuous | | Analysis of Covariance, General Linear Models (regression) | 2-09 |
| 1 | 2 or more | non-normal | mixed categorical and continuous | | | |
| 1 | 2 or more | categorical | mixed categorical and continuous | | logistic regression | 2-11 |
| 2 | 2 or more | normal | categorical | | MANOVA | 2-15 |
| 2 or more | 2 or more | normal | continuous | | multivariate multiple linear regression | not |
| 2 sets of 2 or more | 0 | normal | not applicable | | canonical correlation | not |
| 2 or more | 0 | normal | not applicable | | factor analysis | not |
| 2 or more | 0 or more | mixed categorical and continuous | | | principal component analysis (w/multiple regression) | 2-13 |
| 2 or more | 0 or more | categorical | | | cluster analysis | 2-13 |
| 2 or more | 0 or more | | | | discriminant analysis | 2-16 |
| 2 or more | 0 or more | | | | classification | 2-17 |
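To make a few of the ADA1 rows in this table concrete, here is a short sketch using functions from base R's stats package on built-in data sets. The choice of data sets (`mtcars`, `sleep`, `PlantGrowth`) and of variables is purely illustrative and is not taken from the course notes. In practice you would check each method's assumptions before relying on the output.

```r
# One sample, normal: one-sample t-test (is mean mpg equal to 20?)
one_sample <- t.test(mtcars$mpg, mu = 20)

# Two independent samples: t-test, and the Wilcoxon rank sum test as the
# nonparametric alternative (exact = FALSE because mpg contains ties).
two_sample_t <- t.test(mpg ~ am, data = mtcars)
two_sample_w <- wilcox.test(mpg ~ am, data = mtcars, exact = FALSE)

# Paired samples: paired t-test on the sleep data
# (10 subjects, each measured under two drugs).
extra1 <- sleep$extra[sleep$group == 1]
extra2 <- sleep$extra[sleep$group == 2]
paired <- t.test(extra1, extra2, paired = TRUE)

# Three or more groups: one-way ANOVA and Kruskal-Wallis.
anova_fit <- aov(weight ~ group, data = PlantGrowth)
kw <- kruskal.test(weight ~ group, data = PlantGrowth)

# Two categorical variables: chi-square test and Fisher's exact test.
tab <- table(am = mtcars$am, vs = mtcars$vs)
chi <- chisq.test(tab)
fisher <- fisher.test(tab)

# One continuous predictor: correlation and simple linear regression.
corr <- cor.test(mtcars$mpg, mtcars$wt)
lm_fit <- lm(mpg ~ wt, data = mtcars)

print(one_sample)
print(summary(anova_fit))
print(coef(lm_fit))
```

Each test object stores its p-value in `$p.value` (for example `paired$p.value`), which is handy when reporting several results together.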