Category Archives: Statistics

Paper published: Bayesian Simultaneous Intervals for Small Areas: An Application to Variation in Maps

Bayesian Simultaneous Intervals for Small Areas: An Application to Variation in Maps Erik Barry Erhardt, Balgobin Nandram, Jai Won Choi International Journal of Statistics and Probability Vol 1, No 2, pp. 229–243 Received: September 19, 2012 Accepted: October 24, 2012 Online: October 29, 2012 http://www.ccsenet.org/journal/index.php/ijsp/article/view/20714 doi:10.5539/ijsp.v1n2p229 Abstract Bayesian inference about small areas is of considerable current interest, and simultaneous intervals for the parameters for the areas are needed because these parameters are correlated. This is not usually pursued because with many areas the problem becomes difficult. We describe a method for finding simultaneous credible intervals for a relatively large number of parameters, each corresponding to a single area. Our method is model based, it uses a hierarchical Bayesian model, and it starts with either the 100(1-alpha)% (e.g., alpha=0.05 for 95%) credible interval or highest posterior density (HPD) interval for each area. As in the construction of the HPD interval, our method is the result of the solution of two simultaneous equations, an equation that accounts for the probability content, 100(1-alpha)% of all the intervals combined, and an equation that contains an optimality condition like the “equal ordinates” condition in the HPD interval. We compare our method with one based on a nonparametric method, which as expected under a parametric model, does not perform as well as ours, but is a good competitor. We illustrate our method and compare it with the nonparametric method using an example on disease mapping which utilizes a standard Poisson regression model.
... more

Paper published: A Bayesian framework for stable isotope mixing models

A Bayesian framework for stable isotope mixing models Erik B. Erhardt and Edward J. Bedrick Environmental and Ecological Statistics Submitted 19 February 2011 Accepted 28 September 2012 Online 23 October 2012 http://www.springerlink.com/content/vg4v62j8717671p3/ DOI 10.1007/s10651-012-0224-1 Abstract: Stable isotope sourcing is used to estimate proportional contributions of sources to a mixture, such as in the analysis of animal diets and plant nutrient use. Statistical methods for inference on the diet proportions using stable isotopes have focused on the linear mixing model. Existing frequentist methods provide inferences when the diet proportion vector can be uniquely solved for in terms of the isotope ratios. Bayesian methods apply for arbitrary numbers of isotopes and diet sources but existing models are somewhat limited as they assume that trophic fractionation or discrimination is estimated without error or that isotope ratios are uncorrelated. We present a Bayesian model for the estimation of mean diet that accounts for uncertainty in source means and discrimination and allows correlated isotope ratios. This model is easily extended to allow the diet proportion vector to depend on covariates, such as time. Two data sets are used to illustrate the methodology. Code is available for selected analyses.
... more

Paper published: Data visualization in the neurosciences: overcoming the curse of dimensionality

Data visualization in the neurosciences: overcoming the curse of dimensionality
Elena A. Allen, Erik B. Erhardt, Vince D. Calhoun Neuron Accepted 7 May 2012 Online 24 May 2012 doi:10.1016/j.neuron.2012.05.001 www.cell.com/neuron/retrieve/pii/S089662731200428X Abstract: In publications, presentations, and popular media, scientific results are predominantly communicated through graphs. But are these figures clear and honest, or misleading? We examine current practices in data visualization and discuss improvements, advocating design choices which reveal data rather than hide it.
... more

Awarded: 2011-12 UNM Math & Stat Outstanding Undergraduate Instructor

I’m grateful to my students who voted for me as UNM Math & Stat Outstanding Undergraduate Instructor for 2011-12 (certificate).  I was tied for UNM Math & Stat Outstanding Graduate Instructor, as well.  I work hard to give my students the best experience, to give them time before and after class to ask questions, to respond promptly to their email, and to reach them where they are and pull them up or show them how to keep climbing. I keep adding to my Teaching Dossier, reflecting on my experience and accomplishments.
... more

Mega Millions $540M jackpot, or How to NOT LOSE at the lottery

I quickly prepared the fun slides below for a short interview with KOAT Channel 7 on the Mega Millions $540M jackpot (estimated at $640 at 3:30pm on the day of the drawing), since it is greatly surpassing the previous record of $390 million. A ticket’s probability of winning the jackpot is roughly the ratio of the length of one of your fingers to the diameter of the earth, so unchangeably near 0 (0.00000000569). It is interesting for the jackpot to be large enough that the expected value is a few times larger than the cost of a ticket, which makes it a sensible time to buy from an expected value point-of-view. In fact, now’s a good time to purchase EVERY ticket combination — hurry, and hope you don’t have to split it with another winner!











... more

Funded: UNM Math&Stat Travel award to WNAR 2012

The UNM Department of Mathematics and Statistics funded my travel request to go to 2012 WNAR – Graybill June 17-20, 2012 at Colorado State University – Fort Collins, Colorado. I intend to participate with a talk on my trophic-level modeling in the “Biostatistics and systems biology” session, meet and discuss research with those working in similar areas, and look for opportunities for cross-collaboration for students at these other western universities.
... more

Paper published: The 5.1 ka Aridization Event, Expansion of Piñon-Juniper Woodlands

The 5.1 ka Aridization Event, Expansion of Piñon-Juniper Woodlands, and the Introduction of Maize (Zea mays) in the American Southwest Brandon L. Drake, W. H. Wills, Erik B. Erhardt The Holocene Published online before print July 9, 2012, doi: 10.1177/0959683612449758 Accepted 2/13/2012 Lee Drake (UNM Anthropology) exemplifies excellence and I will make every opportunity to work with him again. Abstract Pollen analysis is frequently used to build climate and environmental histories. A distinct Holocene pollen series exists for Chaco Canyon, New Mexico. This study reports linear modeling and hypothesis testing of long distance dispersal pollen from radiocarbon-dated packrat middens which reveal strong relationships between piñon pine (Pinus edulis) and ponderosa pine (Pinus ponderosa). Ponderosa pollen dominates midden pollen assemblages during the early Holocene, while a rapid shift to a much higher proportion of piñon to ponderosa pine pollen between 5,440 and 5,100 BP points to an aridization episode. This shift is associated with higher δ18O values in Southwest speleothem records relative to the preceding millenium.  The period of aridization is followed by a sharp increase in El Niño/Southern Oscillation events that would have caused highly variable precipitation and lasted until 4,200 BP. Bayesian changepoint analysis suggests that this aridization episode led to stable ecotonal boundaries for at least 3,000 years. The piñon/ponderosa transition may have been caused by punctuated multi-year droughts, analogous to those in the 20th century. The earliest documented instance of Zea mays cultivation on the Colorado Plateau is around ca. 4,290 BP. The introduction of this laborintensive cultigen from Mesoamerica may have been facilitated by changes in the regional ecosystems, specifically by an increase in piñon trees, that promoted increasing human territoriality. Linear modeling and hypothesis testing can complement traditional palynological techniques by adding greater resolution in vegetation patterning to climate/environmental histories.
... more

Paper published: A morphometric analysis of Actaea racemosa L. (Ranunculaceae)

A morphometric analysis of Actaea racemosa L. (Ranunculaceae) Z. Gardner, L. Lueck, E.B. Erhardt, L.E. Craker Journal of Medicinally Active Plants scholarworks.umass.edu/jmap/vol1/iss2/3/ Abstract Actaea racemosa L. (syn. Cimicifuga racemosa [L.] Nutt.), Ranunculaceae, commonly known as black cohosh, is an herbaceous, perennial, medicinal plant native to the deciduous woodlands of eastern North America. Historical texts and current sales data indicate the continued popularity of this plant as an herbal remedy for over 175 years. Much of the present supply of A. racemosa is harvested from the wild. Diversity within and between populations of the species has not been well characterized. The purpose of this study was to assess the morphological variation of A. racemosa and identify patterns of variation at the population and species levels. A total of twentysix populations representative of a significant portion of the natural range of the species were surveyed and plant material was collected for the morphological analysis of 37 leaflet, flower, and whole plant characteristics. In total, 511 leaflet samples and 83 flower samples were examined. Several of the populations surveyed had sets of relatively unique characteristics (large leaflet measurements, tall leaves and flowers, and a large number of stamen) and Tukey-Kramer multiple comparisons revealed significant differences between specific populations for 20 different characteristics. However, no unique phenotype was found. Considerable morphological plasticity was noted in the apices of the staminodia. Cluster analyses showed that the morphological variation within populations is not smaller than between population and that this variation in not influenced by their geographic distribution.
... more

Funded: UNM RAC grant Erhardt/Hanson, Modeling (photo)respiration

We got one! Research Allocation Committee (RAC) Grants are for supporting new research or creative works. The RAC is particularly supportive of projects that may lead to outside funding and/or larger related projects. PIs: Erik Erhardt and David Hanson Title: “Frequentist (bootstrap) and Bayesian modeling of (photo)respiration in plants” Amount: $3982.63, RAC 12-04 Use: To hire statistics graduate student, Mohammad Hattab, to implement and develop modeling that I did last summer in Switzerland. Purpose: We are requesting $3982.63 to develop statistical models to estimate (photo)respiration in plants, accounting for sources of uncertainty and prior information. Because current models provide estimates without meaningful assessments of uncertainty, our model will have broad application in understanding photosynthetic pathways and carbon usage in plants, clarifying the precision of our knowledge, conditional on what is already believed. This modeling is an important step towards developing more comprehensive models of photosynthetic parameters. Support from the Resource Allocation Committee will allow us to: (1) develop frequentist (bootstrap) and Bayesian models to analyze existing experimental data, providing inferences on the set of parameters related in the model; (2) design experiments and acquire additional data to distinguish and estimate respiration and photorespiration under a set of scientifically relevant conditions; (3) conduct validations using pre-existing data and estimates; (4) publish our model with results; and (5) develop grant proposals to apply this model more broadly.
... more

Another look at New Mexico suicide statistics: conditional probability and data visualization

This article was printed in the Daily Lobo on 11/10/2011. Presenting information in a way that clearly answers interesting questions is challenging. Every plot has an implicit question (hypothesis) that it helps you answer. Therefore, it is important to align a visual display of information with the intended interesting question(s). Collaboration or consultation with a statistician can clarify interesting questions and lead to answers through appropriate data analysis (visit UNM’s free statistics consulting clinic, www.stat.unm.edu/~clinic). Suicide was the topic of the front cover story in the Daily Lobo on Thurs, Nov 3rd. With the story, two pie charts displayed average annual proportions of “successful” and “unsuccessful” suicides by method in NM. The “successful” pie chart answers this statement of conditional probability (their implied question): “given a successful suicide, what percentage used certain methods?” A question I consider more interesting reverses the conditioning (my question): “given an attempted suicide with a certain method, what percentage were successful?” Furthermore, I want to know the overall frequency and percentage of each method attempted. How can we present the information in a way that simultaneously answers these questions? The Suicide Prevention Resource Center (SPRC.org) maintains national and state suicide fact sheets, last updated September 2008, describing “deaths by suicide, estimated hospitalized attempts, and data on medical costs, work loss costs, gender, race/ethnicity, age, and method of suicide.” The pie charts in Thursday’s Daily Lobo were reproductions of those found on the NM fact sheet. From their NM summaries, below is the SPRC table for estimated mean frequencies by method for “successful” and “unsuccessful” suicides.
Method Successful Unsuccessful Total
Cut/Pierce 4 229 233
Firearms 191 16 207
Poisoning 60 1097 1157
Suffocation 73 23 96
Other/Unspecified 13 91 104
Total 341 1456 1797
Their question and pie charts (below) consider percentages down columns. When the data are reduced to row percentages for “successful” and “unsuccessful” attempts separately, you lose the relative frequency of attempts. The percentage of firearms “successes” (56%), for example, depends on all the other “successful” attempts. Because proportions for “successful” and “unsuccessful” attempts are separate, you can’t learn about how successful firearm attempts are. [caption id="" align="alignnone" width="549" caption="Original pie charts of proportions of method conditional on attempt "success", which doesn't ask/answer the interesting/relavant question."]Original pie chart[/caption] It is critical to consider the temporal process: a person first chooses a method, then makes an attempt, and is either “successful” or not. The data display and questions should follow these temporal steps. The pie chart displays ignore this process. My question and plot (below) considers the temporal process of attempting suicide, considering percentages across rows, including row total information. First, the relative use of various methods is clear; almost two-thirds of attempts are by poisoning, and firearm and cut/pierce are each just above one in ten. However, though attempts by firearms (12%) and cut/pierce (13%) are relatively rare, the “success” rates are extremely different (92% versus 2%)! The plot has been sorted by the numbers of “successes” to emphasize the relative risk of the methods in terms of lives, information which is lost in the pie charts. Also, the area of each box is relative to the frequency in each box. The Agora Crisis Center (505-277-3013, 9am-midnight, every day) plays a critical role in our community, and our education as individuals around these issues can save someone. Using statistics and visualization to tell and understand the important story in the data can lead to improvements in strategies and resource allocation for treatment and prevention. [caption id="" align="alignnone" width="368" caption="Improved visualization has relative use of methods across the horizontal and proportion of successes along the vertical. Area is proportional to people."]Improved visualization[/caption] R code follows to produce plot above (with modest post-production necessary).
# following example from http://learnr.wordpress.com/2009/03/29/ggplot2_marimekko_mosaic_chart/
setwd("F:\\Dropbox\\UNM\\seminar\\DailyLobo\\20111103_SuicideStatistics")

################################################################################
df <- data.frame(
          M = c(
            "Cut/Pierce",       # "C/P",
            "Firearms",         # "F",
            "Poisoning",        # "P",
            "Suffocation",      # "S",
            "Other/Unspecified")# "O/U")       # Method
        , S = c(4,191,60,73,13)     # Successful
        , U = c(229,16,1097,23,91)  # Unsuccessful
      )

df$T  <- df$S+df$U;                 # Total across method
#df$pS <- df$S/sum(df$S);            # prop S
#df$pU <- df$U/sum(df$U);            # prop U
df$Successful <- 100*round(df$S/df$T, digits = 2);     # prop S by M
df$Unsuccessful <- 100*round(df$U/df$T, digits = 2);   # prop S by M
df$pT  <- 100*round(df$T/sum(df$T), digits = 2);       # prop total
df <- df[order(-df$S),];  # sort so largest method is first

# proportions on x-axis
df$xmax <- cumsum(df$pT);
df$xmin <- df$xmax - df$pT;

#Data looks like this before the long-format conversion:
df

library(ggplot2)
dfm <- melt(df, id = c("M", "T", "pT", "S", "U", "xmin", "xmax"))
dfm

#Now we need to determine how the columns are stacked and where to position the text labels.

#Calculate ymin and ymax:
dfm1 <- ddply(dfm , .(M), transform, ymax = cumsum(value))
dfm1 <- ddply(dfm1, .(M), transform, ymin = ymax - value)
n <- dim(dfm1)[1];
dfm1$F <- as.vector(t(matrix(c(dfm1$S[seq(1,n-1,by=2)],dfm1$U[seq(2,n,by=2)]),ncol=2)))

# Positioning of text:
dfm1$xtext <- with(dfm1, xmin + (xmax - xmin)/2)
dfm1$ytext <- with(dfm1, ymin + (ymax - ymin)/2)

# Finally, we are ready to start the plotting process:
p <- ggplot(dfm1, aes(ymin = ymin, ymax = ymax, xmin = xmin, xmax = xmax, fill = variable))

# Use grey border to distinguish between the segments:
p1 <- p + geom_rect(colour = I("grey"))

# The explanation of different fill colours will be included in the text label of Segment A using the ifelse function.
p2 <- p1 + geom_text(aes(x = xtext, y = ytext,
      label = ifelse(M == df$M[1],
                     paste(variable, "\n", value, "%\n", F, sep = ""),
                     paste(value, "%\n", F, sep = "")))
      , size = 3.5)

# The maximum y-axes value is 100 (as in 100%), and to add the segment description above each column I manually specify the text position.
p3 <- p2 + geom_text(aes(x = xtext, y = 107, label = paste(M,"\n",pT,"%\n",T,sep="")), size = 4)
#p3

# Some last-minute changes to the default formatting: remove axis labels, legend and gridlines.
p4 <- p3 + theme_bw() + labs(x = "Percent Attempts by Method", y = "Percent Success by Method",
     fill = NULL) + opts(legend.position = "none",
     panel.grid.major = theme_line(colour = NA),
     panel.grid.minor = theme_line(colour = NA))
p4

pdf("fromR.pdf")
p4
dev.off()

... more