Another look at New Mexico suicide statistics: conditional probability and data visualization

This article was printed in the Daily Lobo on 11/10/2011.

Presenting information in a way that clearly answers interesting questions is challenging. Every plot has an implicit question (hypothesis) that it helps you answer. Therefore, it is important to align a visual display of information with the intended interesting question(s). Collaboration or consultation with a statistician can clarify interesting questions and lead to answers through appropriate data analysis (visit UNM’s free statistics consulting clinic, www.stat.unm.edu/~clinic).

Suicide was the topic of the front cover story in the Daily Lobo on Thursday, Nov. 3. With the story, two pie charts displayed average annual proportions of “successful” and “unsuccessful” suicides by method in NM. The “successful” pie chart answers this conditional probability question (their implied question): “given a successful suicide, what percentage used each method?” A question I consider more interesting reverses the conditioning (my question): “given an attempted suicide with a certain method, what percentage were successful?” Furthermore, I want to know the overall frequency and percentage of each method attempted. How can we present the information in a way that simultaneously answers these questions?

The Suicide Prevention Resource Center (SPRC.org) maintains national and state suicide fact sheets, last updated September 2008, describing “deaths by suicide, estimated hospitalized attempts, and data on medical costs, work loss costs, gender, race/ethnicity, age, and method of suicide.” The pie charts in Thursday’s Daily Lobo were reproductions of those found on the NM fact sheet. From their NM summaries, below is the SPRC table for estimated mean frequencies by method for “successful” and “unsuccessful” suicides.

Method             Successful  Unsuccessful  Total
Cut/Pierce                  4           229    233
Firearms                  191            16    207
Poisoning                  60          1097   1157
Suffocation                73            23     96
Other/Unspecified          13            91    104
Total                     341          1456   1797

Their question and pie charts (below) consider percentages down columns. When the data are reduced to column percentages for “successful” and “unsuccessful” attempts separately, you lose the relative frequency of attempts. The percentage of firearms “successes” (56%), for example, depends on how many “successful” attempts used each of the other methods. Because the proportions for “successful” and “unsuccessful” attempts are computed separately, you can’t learn how often firearm attempts are successful.

Original pie chart
Original pie charts of proportions of method conditional on attempt "success", which don't ask or answer the interesting, relevant question.
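The column percentages behind the pie charts can be reproduced with a few lines of base R. This is only a sketch using the SPRC counts from the table above; the object names `counts` and `col_pct` are chosen here for illustration and are not part of the original analysis.

```r
# Counts from the SPRC table (rows = method, columns = outcome).
# `counts` and `col_pct` are illustrative names, not from the original script.
counts <- matrix(
  c(  4,  229,
    191,   16,
     60, 1097,
     73,   23,
     13,   91),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    Method  = c("Cut/Pierce", "Firearms", "Poisoning",
                "Suffocation", "Other/Unspecified"),
    Outcome = c("Successful", "Unsuccessful")))

# Column percentages: percent of each outcome that used each method,
# i.e. what each pie chart displays. Each column sums to 100.
col_pct <- 100 * prop.table(counts, margin = 2)
print(round(col_pct))
```

Each column is normalized on its own, so the totals by method (how many people attempted with each method) no longer appear anywhere in the result.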

It is critical to consider the temporal process: a person first chooses a method, then makes an attempt, and is either “successful” or not. The data display and questions should follow these temporal steps. The pie chart displays ignore this process.

My question and plot (below) follow the temporal process of attempting suicide by considering percentages across rows while retaining the row totals. First, the relative use of the various methods is clear: almost two-thirds of attempts are by poisoning, and firearms and cut/pierce each account for just over one in ten. However, though attempts by firearms (12%) and cut/pierce (13%) are similarly uncommon, their “success” rates are extremely different (92% versus 2%)! The plot is sorted by the number of “successes” to emphasize the relative risk of the methods in terms of lives, information that is lost in the pie charts. Also, the area of each box is proportional to the number of people it represents.

The Agora Crisis Center (505-277-3013, 9am-midnight, every day) plays a critical role in our community, and our education as individuals around these issues can save someone. Using statistics and visualization to tell and understand the important story in the data can lead to improvements in strategies and resource allocation for treatment and prevention.

Improved visualization
Improved visualization has relative use of methods across the horizontal and proportion of successes along the vertical. Area is proportional to people.
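The row percentages and attempt shares quoted above can be checked with a short base R sketch, separate from the plotting code. It assumes the SPRC counts from the table; the names `counts`, `row_pct`, and `attempt_share` are illustrative.

```r
# Counts from the SPRC table (rows = method, columns = outcome).
# Object names are illustrative, not from the original script.
counts <- matrix(
  c(  4,  229,
    191,   16,
     60, 1097,
     73,   23,
     13,   91),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    Method  = c("Cut/Pierce", "Firearms", "Poisoning",
                "Suffocation", "Other/Unspecified"),
    Outcome = c("Successful", "Unsuccessful")))

# Row percentages: percent of attempts with each method that were
# successful or unsuccessful. Each row sums to 100.
row_pct <- 100 * prop.table(counts, margin = 1)

# Share of all attempts made with each method (the box widths in the plot).
attempt_share <- 100 * rowSums(counts) / sum(counts)

print(round(cbind(row_pct, Share = attempt_share)))
```

Keeping `attempt_share` next to `row_pct` is what lets a single display answer both questions: how often each method is used, and how often it is fatal.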

R code follows to produce plot above (with modest post-production necessary).

# following example from http://learnr.wordpress.com/2009/03/29/ggplot2_marimekko_mosaic_chart/
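# Optionally set the working directory where the output PDF should be written.
# The path below is specific to the author's machine, so it is commented out.
# setwd("F:\\Dropbox\\UNM\\seminar\\DailyLobo\\20111103_SuicideStatistics")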

################################################################################
df <- data.frame(
          M = c(
            "Cut/Pierce",       # "C/P",
            "Firearms",         # "F",
            "Poisoning",        # "P",
            "Suffocation",      # "S",
            "Other/Unspecified")# "O/U")       # Method
        , S = c(4,191,60,73,13)     # Successful
        , U = c(229,16,1097,23,91)  # Unsuccessful
      )

df$T  <- df$S+df$U;                 # Total across method
#df$pS <- df$S/sum(df$S);            # prop S
#df$pU <- df$U/sum(df$U);            # prop U
df$Successful <- 100*round(df$S/df$T, digits = 2);     # percent S within M
df$Unsuccessful <- 100*round(df$U/df$T, digits = 2);   # percent U within M
df$pT  <- 100*round(df$T/sum(df$T), digits = 2);       # percent of all attempts
df <- df[order(-df$S),];  # sort so the method with the most successes is first

# proportions on x-axis
df$xmax <- cumsum(df$pT);
df$xmin <- df$xmax - df$pT;

#Data looks like this before the long-format conversion:
df

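library(ggplot2)
library(reshape2)  # provides melt(), which current ggplot2 does not attach
library(plyr)      # provides ddply(), used below
dfm <- melt(df, id = c("M", "T", "pT", "S", "U", "xmin", "xmax"))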
dfm

#Now we need to determine how the columns are stacked and where to position the text labels.

#Calculate ymin and ymax:
dfm1 <- ddply(dfm , .(M), transform, ymax = cumsum(value))
dfm1 <- ddply(dfm1, .(M), transform, ymin = ymax - value)
# Frequency shown in each box: S for "Successful" rows, U for "Unsuccessful" rows
dfm1$F <- ifelse(dfm1$variable == "Successful", dfm1$S, dfm1$U)

# Positioning of text:
dfm1$xtext <- with(dfm1, xmin + (xmax - xmin)/2)
dfm1$ytext <- with(dfm1, ymin + (ymax - ymin)/2)

# Finally, we are ready to start the plotting process:
p <- ggplot(dfm1, aes(ymin = ymin, ymax = ymax, xmin = xmin, xmax = xmax, fill = variable))

# Use grey border to distinguish between the segments:
p1 <- p + geom_rect(colour = I("grey"))

# The explanation of the fill colours is included in the text label of the first (left-most) method using the ifelse function.
p2 <- p1 + geom_text(aes(x = xtext, y = ytext,
      label = ifelse(M == df$M[1],
                     paste(variable, "\n", value, "%\n", F, sep = ""),
                     paste(value, "%\n", F, sep = "")))
      , size = 3.5)

# The maximum y-axis value is 100 (as in 100%), and to add the method description above each column I manually specify the text position.
p3 <- p2 + geom_text(aes(x = xtext, y = 107, label = paste(M,"\n",pT,"%\n",T,sep="")), size = 4)
#p3

# Some last-minute changes to the default formatting: set axis labels, remove legend and gridlines.
# (opts() and theme_line() are defunct in current ggplot2; theme() and element_blank() replace them.)
p4 <- p3 + theme_bw() + labs(x = "Percent Attempts by Method", y = "Percent Success by Method",
     fill = NULL) + theme(legend.position = "none",
     panel.grid.major = element_blank(),
     panel.grid.minor = element_blank())
p4

pdf("fromR.pdf")
print(p4)  # explicit print() so the plot is also drawn when the script is run with source()
dev.off()
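A rougher version of the same display can be sketched with `mosaicplot()` from base R's graphics package, with no add-on packages. This is an alternative under stated assumptions, not the code that produced the figure above: it omits the percentage and frequency labels, and the object name `counts`, the title, and the grey shades are illustrative choices.

```r
# Counts from the SPRC table (rows = method, columns = outcome).
# `counts` is an illustrative name, not from the original script.
counts <- matrix(
  c(  4,  229,
    191,   16,
     60, 1097,
     73,   23,
     13,   91),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    Method  = c("Cut/Pierce", "Firearms", "Poisoning",
                "Suffocation", "Other/Unspecified"),
    Outcome = c("Successful", "Unsuccessful")))

# Sort methods by number of successes, as in the ggplot2 version.
counts <- counts[order(-counts[, "Successful"]), ]

# Column widths follow the share of attempts by method; the split within
# each column follows the outcome proportions for that method.
mosaicplot(counts,
           main  = "NM suicide attempts by method and outcome",
           color = c("grey40", "grey85"),
           las   = 1)
```

Because the box areas come straight from the counts, this version avoids the rounding done before plotting in the ggplot2 code, at the cost of less control over labels.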
