Category Archives: Statistics

UNM Stats R Package Development “R-Hack-a-Pack”

Workshop at The UNM Dept of Math & Stats on Friday 8/17/2018 10AM-5PM. Erik Erhardt
Please click this link to RSVP.
No workshop fee.  Food and caffeine contribution: $10-15. Math & Stat event link.
I’m going to host an R Package Development “R-Hack-a-Pack” one-day workshop on Friday 8/17 from 10-5.  This is like a “hack-a-thon”, but instead of tackling a dataset analysis, the goal is to learn and use the skills to package R code.  The first hour or so will cover the basic ideas of packaging code using modern tools such as RStudio, devtools, and usethis.  The rest of the day is protected time in which you can turn your own code into a package.  We’ll cover the basics of creating a package, package testing, writing vignettes, and using GitHub (and CRAN) for version control and for making your package available.
To get the most out of this workshop, bring a package idea to start on.  A few ideas:
  1. If you already have a set of functions that you load with source("my_functions.R"), then you’re an afternoon away from making a great package.
  2. If you have a large script with repeated code, then you can start by turning the repeated code into functions, package those functions, and write a short vignette to perform the same analysis using your package.
  3. If you have an idea for a new package to develop, bring that idea with you and we can consider trying to develop it during the workshop.
  4. My project is to package all the code and data for my two-semester data analysis course while I’m not helping others.
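For the first idea above, packaging mostly amounts to moving each function into the package’s R/ directory with roxygen2 documentation comments. A minimal sketch follows; the function name and contents are made up for illustration:

```r
# R/center_scale.R -- a hypothetical function as it might live in a package.
# The roxygen2 comments (#') become the help page after devtools::document().

#' Center and scale a numeric vector
#'
#' @param x A numeric vector.
#' @return A numeric vector with mean 0 and standard deviation 1.
#' @export
center_scale <- function(x) {
  (x - mean(x)) / sd(x)
}

z <- center_scale(c(1, 2, 3, 4, 5))
round(mean(z), 10)  # 0
```

From there, usethis::create_package() sets up the package skeleton and devtools::document() generates the help files, which is roughly the workflow the first hour of the workshop covers.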
Food and helpers:
If you would like to volunteer as a helper for logistical issues during the day, please email me directly.  It would be nice to have lunch and an afternoon coffee break with a snack.

... more

R package: RmdNameChunk

Enumerate Rmd code chunks

I wrote the RmdNameChunk package to automatically name Rmd code chunks. This matters for my workflow because I don’t click “Knit” in RStudio; instead I run rmarkdown::render(fn). This has the advantage of using the console environment instead of a separate RMarkdown environment, so all the objects remain available for manipulation in the console. However, it is hard to debug when there are scores of unnamed chunks. Now I can easily name all the code chunks and quickly identify where issues are. Install from github:


Alternatively, install from source.
The package’s main function reads an Rmd file, updates existing prefixed code chunk names, and renumbers them; its key arguments are:
  fn_in  = "test_in.Rmd"
, fn_out = "test_out.Rmd"
, prefix_chunk_name = "chunk-"
Review the input and output files to see how the chunk header names have been updated. Below is an example of the two Rmd files in the vignette. test_in.Rmd was read in and test_out.Rmd was created. The chunk headers are shown below from each file.
8 : ```{r setup, include=FALSE}
22 : ```{r cars}
28 : ```{r}
32 : ```{ r }
36 : ```{r, echo=FALSE}
40 : ```{ r, eval=FALSE}
44 : ```{r , eval=FALSE}
48 : ```{r chunk-2, eval=FALSE}
52 : ```{r chunk-XXX , eval=FALSE}
56 : ```{r chunk-XXX2 , eval=FALSE}
60 : ```{r chunk-XXX3 , eval=FALSE}
64 : ```{r chunk-XXX4 , eval=FALSE}
68 : ```{r chunk-XXX5 , eval=FALSE}
72 : ```{r chunk-XXX6 , eval=FALSE}
81 : ```{r pressure, echo=FALSE}
These code chunk headers were changed to those below:
8 : ```{r setup, include=FALSE}
22 : ```{r cars}
28 : ```{r chunk-01}
32 : ```{r chunk-02}
36 : ```{r chunk-03, echo=FALSE}
40 : ```{r chunk-04, eval=FALSE}
44 : ```{r chunk-05, eval=FALSE}
48 : ```{r chunk-06, eval=FALSE}
52 : ```{r chunk-07, eval=FALSE}
56 : ```{r chunk-08, eval=FALSE}
60 : ```{r chunk-09, eval=FALSE}
64 : ```{r chunk-10, eval=FALSE}
68 : ```{r chunk-11, eval=FALSE}
72 : ```{r chunk-12, eval=FALSE}
81 : ```{r pressure, echo=FALSE}
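The core renaming step can be sketched in a few lines of base R. This is a simplified illustration of the idea, not the package’s actual implementation, and rename_chunks is a made-up name:

```r
# Number the unnamed {r} chunk headers in a vector of Rmd lines.
# A header counts as "unnamed" when nothing but options follows the r engine.
rename_chunks <- function(lines, prefix = "chunk-") {
  unnamed <- grepl("^```\\{\\s*r\\s*(,|\\})", lines)
  n <- 0
  for (i in which(unnamed)) {
    n <- n + 1
    label <- sprintf("%s%02d", prefix, n)
    # insert the label right after the engine name, keeping any options
    lines[i] <- sub("^```\\{\\s*r\\s*", paste0("```{r ", label), lines[i])
  }
  lines
}

rename_chunks(c("```{r setup}", "```{r}", "```{ r, echo=FALSE}"))
# named chunks are left alone; unnamed ones become chunk-01, chunk-02, ...
```

The real package also has to skip inline code, handle existing prefixed names when renumbering, and write the result back to a file, which is where most of the work is.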

... more


I’m enjoying Sabbatical during the Fall 2017 – Spring 2018 academic year.  My Sabbatical Leave Plan includes visiting both UC Irvine and the Mind Research Network to:
  • learn more about Bayesian graphical models,
  • learn more about Hamiltonian Monte Carlo (HMC),
  • learn about their statistical and computational implementations,
  • apply both to extend current models in the application to fMRI brain imaging data, and
  • continue UNM 100-level statistics and mathematics education initiatives to understand factors influencing student success and find strategies to increase success.

... more

Paper published: Stable Isotope Sourcing using Sampling

Stable Isotope Sourcing using Sampling
Erhardt, EB, BO Wolf, M Ben-David, and EJ Bedrick
Open Journal of Ecology 4(6), pp. 289–298. Online: May 2014. DOI: 10.4236/oje.2014.46027

Abstract: Stable isotope mixing models are used to estimate proportional contributions of sources to a mixture, such as in the analysis of animal diets, plant nutrient use, geochemistry, pollution, and forensics. We describe an algorithm implemented as SISUS software for providing a user-specified number of probabilistic exact solutions derived quickly from the extended mixing model. Our method outperforms IsoSource, a deterministic algorithm for providing approximate solutions to represent the solution polytope. Our method is an approximate Bayesian large sample procedure. SISUS software is freely available online and as an R package.
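The linear mixing model at the heart of this work can be illustrated with a tiny worked example in base R. The numbers here are made up; with two isotope ratios and three sources, adding the mass-balance constraint makes the system exactly determined:

```r
# Source isotope signatures (columns = sources A, B, C; rows = d13C, d15N)
src <- cbind(A = c(-28, 4), B = c(-20, 10), C = c(-12, 12))

# Observed mixture (e.g., consumer tissue) signature
mix <- c(-20, 9)

# Augment with the mass-balance constraint sum(p) = 1 and solve S p = m
S <- rbind(src, rep(1, 3))
m <- c(mix, 1)
p <- solve(S, m)
round(p, 2)
# proportions: A = 0.25, B = 0.50, C = 0.25
```

SISUS addresses the harder underdetermined case (more sources than isotope equations), where the constraints define a polytope of feasible diets rather than a single point, and samples solutions from it.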
... more

Paper published: Inference for stable isotope mixing models: a study of the diet of dunlin

Inference for stable isotope mixing models: a study of the diet of dunlin
Erhardt, EB and EJ Bedrick
Journal of the Royal Statistical Society: Series C, pp. 579–593. Online: February 4, 2014. DOI: 10.1111/rssc.12047

Abstract: Stable isotope sourcing is used to estimate proportional contributions of sources to a mixture, such as in the analysis of animal diets and plant nutrient use. Statistical methods for inference on the diet proportions by using stable isotopes have focused on the linear mixing model. Existing frequentist methods assume that the diet proportion vector can be uniquely solved for in terms of one or two isotope ratios. We develop large sample methods that apply to an arbitrary number of isotope ratios, assuming that the linear mixing model has a unique solution or is overconstrained. We generalize these methods to allow temporal modelling of the population mean diet, assuming that isotope ratio response data are collected over time. The methodology is motivated by a study of the diet of dunlin, a small migratory seabird.
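The overconstrained case mentioned in the abstract can be sketched in base R with made-up numbers: with two isotope ratios, two sources, and the sum-to-one constraint, the stacked system has three equations in two unknowns, and a least-squares fit applies:

```r
# Two sources, two isotopes: with the sum-to-one constraint the stacked
# system has three equations in two unknowns (overconstrained).
S <- rbind(c(-28, -12),  # d13C of sources A, B
           c(  4,  12),  # d15N of sources A, B
           c(  1,   1))  # mass-balance row: p_A + p_B = 1
m <- c(-20.2, 8.1, 1)    # observed mixture, slightly inconsistent with the sources

p <- qr.solve(S, m)      # least-squares solution for the proportions
round(p, 2)
# both proportions come out near 0.5; their sum is close to, not exactly, 1
```

The paper's methods go well beyond this sketch, handling sampling variability in the source and mixture signatures and temporal modelling of the mean diet, but the least-squares structure is the starting point.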
... more

TV: KOB4, Police cadet test scores under investigation

Tonight KOB-TV4 aired the NM Law Enforcement Academy “Police cadet test scores under investigation” story on the Eyewitness News 4 at 10 P.M., for which I gave a short interview to Gadi Schwartz using a plot I created from the test score data. I gave the information and interview out of a personal desire to be helpful and was not acting on the University’s behalf. I did not speculate on the cause for the outlying class’s scores. I value the men and women who risk their lives daily serving our communities.
... more

Plot improved: NM Registered voters 2008

Showing party affiliation by age group can be made more informative by representing voter power with census data.  Note that in the “before” plot, ages 60+ appear to span almost half the plot width, while in the “after” plot we see that 60+ represents only 25% of the voting pool.  R code to create the “after” plot follows.
# Erik B. Erhardt
# 4/28/2012

# Recreating this plot as a Marimekko mosaic chart
# NM Registered Voters - Party by Age Line Chart (Oct 2008)

library(reshape2) # melt()
library(plyr)     # ddply()
library(ggplot2)

# Census population sizes
# NM population numbers

ages <- c("15-19", "20-24", "25-29", "30-34", "35-39", "40-44", "45-49", "50-54",
          "55-59", "60-64", "65-69", "70-74", "75-79", "80-84", "85-89", "90-99")
pop.ages <- c(149861,142370,139678,127567,123303,125220,144839,147170,
              136799,120137, 87890, 65904, 50230, 36238, 21622, 10371)

age <- seq(18,99)
pop <- c(rep(pop.ages/5,each=5)[c(4:75)], rep(pop.ages[length(pop.ages)]/10,10))
pop.prop <- pop/sum(pop)

# Party proportions by age, digitized from the original plot with DataThief
# (only the leading values survive here; each full vector has one entry per age 18-99)

dem   <- c(0.46,0.46,0.45,0.44,0.40,0.41,0.42,0.43,0.43,0.44,0.46)
rep   <- c(0.24,0.25,0.26,0.28,0.28,0.27,0.27,0.27,0.27,0.27,0.26)
dts   <- c(0.26,0.25,0.25,0.24,0.30,0.29,0.28)
other <- c(0.05,0.05,0.05,0.05,0.03,0.03,0.04,0.04,0.04,0.04,0.04)

all <- data.frame(dem, rep, dts, other)

# correct rounding errors from DataThief
for (i in 1:length(age)) {
  all[i, ] <- all[i, ] / sum(all[i, ])
}
## getting data list above
# x <- scan()
# [datathief numbers]
# round(matrix(x,ncol=2,byrow=TRUE)[,2],2)
# plot(round(matrix(x,ncol=2,byrow=TRUE)[,1],0))

# following example from

df <- data.frame(
          segment = age
        , segpct  = pop.prop * 100
        , Other   = all$other * 100
        , DTS     = all$dts   * 100
        , Rep     = all$rep   * 100
        , Dem     = all$dem   * 100
        )

df$xmax <- cumsum(df$segpct)
df$xmin <- df$xmax - df$segpct
df$segpct <- NULL


dfm <- melt(df, id = c("segment", "xmin", "xmax"))

dfm1 <- ddply(dfm , .(segment), transform, ymax = cumsum(value))
dfm1 <- ddply(dfm1, .(segment), transform, ymin = ymax - value)

dfm1$xtext <- with(dfm1, xmin + (xmax - xmin)/2)
dfm1$ytext <- with(dfm1, ymin + (ymax - ymin)/2)

dfm1$segmentlabel <- rep("",length(dfm1$segment))
ss <- ((dfm1$segment %% 5) == 0)  # every 5 years, display the age
dfm1$segmentlabel[ss] <- dfm1$segment[ss]
dfm1$segmentlabel[(dfm1$segment==18)] <- "age"

p <- ggplot(dfm1, aes(ymin = ymin, ymax = ymax, xmin = xmin, xmax = xmax, fill = variable))

p <- p + geom_rect(colour = I("grey"), alpha=0.75, size=.01) +
      xlab("Percentage age distribution") +
      ylab("Percent registered voter for party by age") +
      labs(title="NM Registered Voters - Party by Age (Oct 2008)")

p <- p + geom_text(aes(x = xtext, y = ytext,
     label = ifelse(segment == 20, paste(variable), " ")), size = 3.5)

p <- p + geom_text(aes(x = xtext, y = -3, label = paste(dfm1$segmentlabel)), size = 3)

print(p)

... more

Invited talks: Neuroimaging and Statistics at Wright State University, Dayton, OH

I just returned from a fun, event-filled couple of days at Wright State University in Dayton, Ohio, visiting statistician Harry Khamis.  Harry invited me to give two talks on Friday, November 2nd, 2012: one in Statistics and a second in Neuroscience, arranged by Thomas N. Hangartner.  Harry was a model host; I always felt taken care of, my needs met. I was excited to meet two people at my talks who could use the methods I presented. Prof Nasser H Kashou develops models for HRF functions, for which SimTB might be helpful. Prof Yvonne Vadeboncoeur uses stable isotopes to study freshwater ecosystems, and we had some exciting discussion about collaborative opportunities. Links to the papers the talks draw on are at the bottom. My morning neuroimaging talk (10:15) in the Department of Biomedical, Industrial & Human Factors Engineering (BIE) included two-and-a-half topics: SimTB, subject variability with GICA, and a little data visualization.
Title: Capturing inter-subject variability with group independent component analysis of fMRI data: a simulation study
Abstract: A key challenge in functional neuroimaging is the meaningful combination of results across subjects. Even in a sample of healthy participants, brain morphology and functional organization exhibit considerable variability, such that no two individuals have the same neural activation at the same location in response to the same stimulus. This inter-subject variability limits inferences at the group-level as average activation patterns may fail to represent the patterns seen in individuals. A promising approach to multi-subject analysis is group independent component analysis (GICA), which identifies group components and reconstructs activations at the individual level. GICA has gained considerable popularity, particularly in studies where temporal response models cannot be specified. However, a comprehensive understanding of the performance of GICA under realistic conditions of inter-subject variability is lacking. In this study we use simulated functional magnetic resonance imaging (fMRI) data to determine the capabilities and limitations of GICA under conditions of spatial, temporal, and amplitude variability. Simulations, generated with the SimTB toolbox, address questions that commonly arise in GICA studies, such as: (1) How well can individual subject activations be estimated and when will spatial variability preclude estimation? (2) Why does component splitting occur and how is it affected by model order? (3) How should we analyze component features to maximize sensitivity to inter-subject differences? Overall, our results indicate an excellent capability of GICA to capture between-subject differences and we make a number of recommendations regarding analytic choices for application to functional imaging data.
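The group-decomposition-then-back-reconstruction structure of GICA can be sketched in base R. Here SVD/PCA stands in for ICA, so this is only a structural illustration of the pipeline, not GICA itself, and all names and data are made up:

```r
set.seed(1)
n_time <- 60; n_vox <- 300; k <- 2   # time points, voxels, components

# Two spatial maps shared across subjects, subject-specific time courses
maps <- matrix(rnorm(k * n_vox), k, n_vox)
make_subject <- function() {
  tc <- matrix(rnorm(n_time * k), n_time, k)  # subject time courses
  tc %*% maps + 0.05 * matrix(rnorm(n_time * n_vox), n_time, n_vox)
}
X1 <- make_subject(); X2 <- make_subject()

# "Group" step: temporally concatenate subjects, extract k spatial components
Xg <- rbind(X1, X2)
v  <- svd(Xg, nu = 0, nv = k)$v     # n_vox x k spatial basis for the group

# Back-reconstruction: subject time courses, then subject-specific maps
TC1 <- X1 %*% v                               # n_time x k
S1  <- solve(crossprod(TC1), t(TC1) %*% X1)   # k x n_vox, least squares

# The subject maps recover the true maps up to rotation and sign:
summary(lm(maps[1, ] ~ t(S1)))$r.squared      # close to 1
```

The simulation questions in the abstract concern what happens to this reconstruction when the subject maps are no longer identical across subjects, which the toy above assumes away.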
My afternoon statistics talk (3:00) in the Department of Mathematics and Statistics to a packed room (they had to bring in additional chairs!) included work that extends my published stable isotope sourcing work.
Title: An extended Bayesian stable isotope mixing model for trophic level inference
Abstract: You are what and where you eat on the food web. We developed an extended Bayesian mixing model to jointly infer organic matter utilization and isotopic enrichment of organic matter sources in order to infer the trophic levels of several numerically abundant fish species (consumers) present in Apalachicola Bay, FL, USA. Bayesian methods apply for arbitrary numbers of isotopes and diet sources, but existing models are somewhat limited as they assume that trophic fractionation is estimated without error or that isotope ratios are uncorrelated. The model uses stable isotope ratios of carbon, nitrogen, and sulfur, isotopic fractionations, elemental concentrations, elemental assimilation efficiencies, as well as prior information (expert opinion) to inform the diet and trophic level parameters. The model appropriately accounts for uncertainty and prior information at all levels of the analysis.
Neuroscience talk (summary of both SimTB papers):
  • SimTB, a simulation toolbox for fMRI data under a model of spatiotemporal separability. Erik B. Erhardt, Elena A. Allen, Yonghua Wei, Tom Eichele, Vince D. Calhoun. NeuroImage 59 (2012), pp. 4160–4167.
  • Capturing inter-subject variability with group independent component analysis of fMRI data: A simulation study. Elena A. Allen, Erik B. Erhardt, Yonghua Wei, Tom Eichele, Vince D. Calhoun. NeuroImage.
  • Data visualization in the neurosciences: overcoming the curse of dimensionality. Elena A. Allen, Erik B. Erhardt, Vince D. Calhoun. Neuron.
Statistics talk:
  • A Bayesian framework for stable isotope mixing models. Erik B. Erhardt and Edward J. Bedrick. Environmental and Ecological Statistics.
Bio: Erik Barry Erhardt, PhD, is an Assistant Professor of Statistics at the University of New Mexico Department of Mathematics and Statistics, where he serves as Director of the statistics consulting clinic. His research interests include Bayesian and frequentist statistical methods for stable isotope sourcing and brain imaging. Erik is a Howard Hughes Medical Institute Interfaces Scholar collaborating in interdisciplinary research and consulting.
... more