# 21 Logistic Regression

You are reading the work-in-progress online edition of “Statistical Acumen: Advanced Data Analysis.”

This chapter should be readable but is currently undergoing final polishing.

Logistic regression analysis is used for predicting the outcome of a categorical dependent variable based on one or more predictor variables. The probabilities describing the possible outcomes of a single trial are modeled, as a function of the explanatory (predictor) variables, using a logistic function. The term logistic regression usually refers to the problem in which the dependent variable is binary, that is, the number of available categories is two. Problems with more than two categories are referred to as multinomial logistic regression or, if the multiple categories are ordered, as ordered logistic regression.

Logistic regression measures the relationship between a categorical dependent variable and usually (but not necessarily) one or more continuous independent variables by converting the dependent variable to probability scores. As such, it treats the same set of problems as probit regression, using similar techniques.

## 21.1 Generalized linear model variance and link families

The generalized linear model (GLM) is a flexible generalization of ordinary linear regression that allows for response variables that have other than a normal distribution. The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value.

In R, the basic tool for fitting generalized linear models is the `glm()` function, which has the general structure `glm(formula, family = ..., data = ..., ...)`, where "`...`" stands for additional options. The key parameter here is `family`, which is a simple way of specifying a choice of variance and link functions. Some choices of family are listed in the table below. As can be seen, each of the first five choices has an associated variance function (for binomial, the binomial variance \(\mu(1-\mu)\)) and one or more choices of link functions (for binomial: the logit, probit, or complementary log-log).

Family | Variance | Link |
---|---|---|
gaussian | gaussian | identity |
binomial | binomial | logit, probit, or cloglog |
poisson | poisson | log, identity, or sqrt |
Gamma | Gamma | inverse, identity, or log |
inverse.gaussian | inverse.gaussian | \(1/\mu^2\) |
quasi | user-defined | user-defined |

As long as you want the default link, all you have to specify is the family name. If you want an alternative link, you must add a `link` argument; for example, probit regression uses `family = binomial(link = "probit")`.
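A minimal sketch with simulated data (the response `y` and predictor `x` here are made up for illustration):

```
# simulated binary data, purely for illustration
set.seed(1)
x <- rnorm(100)
y <- rbinom(100, size = 1, prob = pnorm(-0.5 + x))

# default logit link
fit_logit  <- glm(y ~ x, family = binomial)
# alternative probit link
fit_probit <- glm(y ~ x, family = binomial(link = "probit"))
coef(fit_probit)
```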

The last family on the list, quasi, is there to allow fitting user-defined models by maximum quasi-likelihood.

The rest of this chapter concerns logistic regression with a binary response variable.

## 21.2 Example: Age of Menarche in Warsaw

The data^{1} below are from a study conducted by Milicer and Szczotka on pre-teen and teenage girls in Warsaw, Poland in 1965. The subjects were classified into 25 age categories. The number of girls in each group (Total) and the number that reached menarche (Menarche) at the time of the study were recorded. The age for a group corresponds to the midpoint for the age interval.

^{1} Milicer, H. and Szczotka, F. (1966) Age at Menarche in Warsaw girls in 1965. Human Biology 38, 199–203.

## Code

```
#### Example: Menarche
# menarche dataset is available in the MASS package
#library(MASS)
library(dplyr)  # for mutate()
dat_menarche <-
  MASS::menarche |>
  mutate(
    # these frequencies look better in the table as integers
    Total    = Total    |> as.integer()
  , Menarche = Menarche |> as.integer()
    # create estimated proportion of girls reaching menarche for each age group
  , p_hat    = Menarche / Total
  )
str(dat_menarche)
```

```
'data.frame': 25 obs. of 4 variables:
$ Age : num 9.21 10.21 10.58 10.83 11.08 ...
$ Total : int 376 200 93 120 90 88 105 111 100 93 ...
$ Menarche: int 0 0 0 2 2 5 10 17 16 29 ...
$ p_hat : num 0 0 0 0.0167 0.0222 ...
```

The researchers were curious about how the proportion of girls that reached menarche (\(\hat{p}=\textrm{Menarche}/\textrm{Total}\)) varied with age. One could perform a test of homogeneity (Multinomial goodness-of-fit test) by arranging the data as a 2-by-25 contingency table with columns indexed by age and two rows: ROW1 \(=\) Menarche, and ROW2 \(=\) number that have not reached menarche \(=\) (Total \(-\) Menarche). A more powerful approach treats these as regression data, using the proportion of girls reaching menarche as the response and age as a predictor.

A plot of the observed proportion \(\hat{p}\) of girls that have reached menarche shows that the proportion increases as age increases, but that the relationship is nonlinear. The observed proportions, which are bounded between zero and one, have a lazy \(S\)-shape (a **sigmoidal function**) when plotted against age. The change in the observed proportions for a given change in age is much smaller when the proportion is near 0 or 1 than when the proportion is near \(1/2\). This phenomenon is common with regression data where the response is a proportion.

The trend is nonlinear so linear regression is inappropriate. A sensible alternative might be to transform the response or the predictor to achieve near linearity. A common transformation of response proportions following a sigmoidal curve is to the **logit scale** \(\hat{\mu} = \log_e\{\hat{p}/(1-\hat{p})\}\). This transformation is the basis for the **logistic regression model**. The natural logarithm (base \(e\)) is traditionally used in logistic regression.

The logit transformation is undefined when \(\hat{p}=0\) or \(\hat{p}=1\). To overcome this problem, researchers use the **empirical logits**, defined by \(\log \{ (\hat{p}+0.5/n) / (1-\hat{p}+0.5/n) \}\), where \(n\) is the sample size or the number of observations on which \(\hat{p}\) is based.

A plot of the empirical logits against age is roughly linear, which supports a logistic transformation for the response.
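A sketch of such a plot in base R, recomputing the empirical logits directly from `MASS::menarche` using the definition above:

```
# empirical logits: log{(p_hat + 0.5/n) / (1 - p_hat + 0.5/n)}
dat   <- MASS::menarche
n     <- dat$Total
p_hat <- dat$Menarche / n
emp_logit <- log((p_hat + 0.5 / n) / (1 - p_hat + 0.5 / n))

plot(dat$Age, emp_logit,
     xlab = "Age", ylab = "Empirical logit",
     main = "Empirical logits vs age")
```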


## 21.3 Simple logistic regression model

The simple logistic regression model expresses the population proportion \(p\) of individuals with a given attribute (called the probability of success) as a function of a single predictor variable \(X\). The model assumes that \(p\) is related to \(X\) through \[\log \left( \frac{p}{1-p} \right) = \beta_0 + \beta_1 X\] or, equivalently, as \[p = \frac{ \exp( \beta_0 + \beta_1 X ) }{ 1 + \exp( \beta_0 + \beta_1 X ) }.\] The logistic regression model is a **binary response model**, where the response for each case falls into one of two exclusive and exhaustive categories, success (cases with the attribute of interest) and failure (cases without the attribute of interest).

The odds of success are \(p/(1-p)\). For example, when \(p=1/2\) the odds of success are 1 (or 1 to 1). When \(p=0.9\) the odds of success are 9 (or 9 to 1). The logistic model assumes that the log-odds of success is linearly related to \(X\). Graphs of the logistic model relating \(p\) to \(X\) are given below. The sign of the slope refers to the sign of \(\beta_1\).
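These curves can be sketched with `plogis()`, the logistic function \(e^{u}/(1+e^{u})\); the intercept and slope values below are arbitrary illustrative choices:

```
# positive slope: p increases with X; negative slope: p decreases with X
curve(plogis(-10 + 1.5 * x), from = 0, to = 14,
      xlab = "X", ylab = "p", main = "Logistic curves")
curve(plogis( 10 - 1.5 * x), from = 0, to = 14, add = TRUE, lty = 2)
```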

I should write \(p=p(X)\) to emphasize that \(p\) is the proportion of all individuals with score \(X\) that have the attribute of interest. In the menarche data, \(p=p(X)\) is the population proportion of girls at age \(X\) that have reached menarche.

The data in a logistic regression problem are often given in summarized or **aggregate** form:

\(X\) | \(n\) | \(y\) |
---|---|---|

\(X_1\) | \(n_1\) | \(y_1\) |

\(X_2\) | \(n_2\) | \(y_2\) |

\(\vdots\) | \(\vdots\) | \(\vdots\) |

\(X_m\) | \(n_m\) | \(y_m\) |

where \(y_i\) is the number of individuals with the attribute of interest among \(n_i\) randomly selected or representative individuals with predictor variable value \(X_i\). For **raw data** on individual cases, \(y_i=1\) or 0, depending on whether the case at \(X_i\) is a success or failure, and the sample size column \(n\) is omitted with raw data.

For logistic regression, a plot of the sample proportions \(\hat{p}_i=y_i/n_i\) against \(X_i\) should be roughly sigmoidal, and a plot of the empirical logits against \(X_i\) should be roughly linear. If not, then some other model is probably appropriate. I find the second plot easier to calibrate, but neither plot is very informative when the sample sizes are small, say 1 or 2. (Why?).

There are a variety of other binary response models that are used in practice. The **probit** regression model or the **complementary log-log** regression model might be appropriate when the logistic model does not fit the data.

The following sections describe a naive least-squares strategy and then the standard maximum likelihood strategy for estimating the logistic regression parameters.

### 21.3.1 Estimating Regression Parameters via LS of empirical logits

*(This is a naive method; we will discuss a MUCH better way in the next section. In practice, never do this! This is only for illustration.)*

There are two unknown population parameters in the logistic regression model \[\log \left( \frac{p}{1-p} \right) = \beta_0 + \beta_1 X.\] A simple way to estimate \(\beta_0\) and \(\beta_1\) is by least squares (LS), using the empirical logits as responses and the \(X_i\)s as the predictor values.

Below we use standard regression to calculate the LS fit between the empirical logits and age.

## Code
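A sketch of this computation (the empirical logits follow the definition given earlier; the column name `emp_logit` is my choice):

```
# least-squares fit of the empirical logits on age (illustration only)
dat   <- MASS::menarche
n     <- dat$Total
p_hat <- dat$Menarche / n
dat$emp_logit <- log((p_hat + 0.5 / n) / (1 - p_hat + 0.5 / n))

lm_fit <- lm(emp_logit ~ Age, data = dat)
coef(lm_fit)
```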

```
(Intercept) Age
-22.027933 1.676395
```

The LS estimates for the menarche data are \(b_0=-22.03\) and \(b_1=1.68\), which gives the fitted relationship \[\log \left( \frac{\tilde{p}}{1-\tilde{p}} \right) = -22.03 + 1.68 \textrm{ Age}\] or \[\tilde{p} = \frac{ \exp( -22.03 + 1.68 \textrm{ Age}) }{ 1 + \exp(-22.03 + 1.68 \textrm{ Age}) },\] where \(\tilde{p}\) is the predicted proportion (under the model) of girls having reached menarche at the given age. I used \(\tilde{p}\) to identify a predicted probability, in contrast to \(\hat{p}\) which is the observed proportion at a given age.

The power of the logistic model versus the contingency table analysis discussed earlier is that the model gives estimates for the population proportion reaching menarche at all ages within the observed age range. The observed proportions allow you to estimate only the population proportions at the observed ages.

### 21.3.2 Maximum Likelihood Estimation for Logistic Regression Model

There are better ways to fit the logistic regression model than LS, which assumes that the responses are normally distributed with constant variance. A deficiency of the LS fit to the logistic model is that the observed counts \(y_i\) have a **Binomial distribution** under random sampling. The Binomial distribution is a discrete probability model associated with counting the number of successes in a fixed size sample, and other equivalent experiments such as counting the number of heads in repeated flips of a coin. The distribution of the empirical logits depends on the \(y_i\)s, so they are not normal (though approximately normal in large samples), and they are extremely skewed when the sample sizes \(n_i\) are small. The response variability depends on the population proportions, and is not roughly constant when the observed proportions or the sample sizes vary appreciably across groups.

The differences in variability among the empirical logits can be accounted for using **weighted least squares** (WLS) when the sample sizes are large. An alternative approach called maximum likelihood uses the exact Binomial distribution of the responses \(y_i\) to generate optimal estimates of the regression coefficients. Software for maximum likelihood estimation is widely available, so LS and WLS methods are not really needed.

In **maximum likelihood estimation** (MLE), the regression coefficients are estimated iteratively by minimizing the **deviance** function (also called the likelihood ratio chi-squared statistic) \[D = 2 \sum_{i=1}^m \left\{ y_i \log \left( \frac{ y_i }{n_i p_i} \right)
+ (n_i-y_i) \log \left( \frac{ n_i-y_i }{n_i - n_i p_i} \right) \right\}\] over all possible values of \(\beta_0\) and \(\beta_1\), where the \(p_i\)s satisfy the logistic model \[\log \left( \frac{p_i}{1-p_i} \right) = \beta_0 + \beta_1 X_i.\] The ML method also gives standard errors and significance tests for the regression estimates.

The deviance is an analog of the residual sums of squares in linear regression. The choices for \(\beta_0\) and \(\beta_1\) that minimize the deviance are the parameter values that make the observed and fitted proportions as close together as possible in a “likelihood sense”.

Suppose that \(b_0\) and \(b_1\) are the MLEs of \(\beta_0\) and \(\beta_1\). The deviance evaluated at the MLEs, \[D = 2 \sum_{i=1}^m \left\{ y_i \log \left( \frac{ y_i }{n_i \tilde{p}_i} \right) + (n_i-y_i) \log \left( \frac{ n_i-y_i }{n_i - n_i \tilde{p}_i} \right) \right\},\] where the fitted probabilities \(\tilde{p}_i\) satisfy \[\log \left( \frac{\tilde{p}_i}{1-\tilde{p}_i} \right) = b_0 + b_1 X_i,\] is used to test the adequacy of the model. The deviance is small when the data fits the model, that is, when the observed and fitted proportions are close together. Large values of \(D\) occur when one or more of the observed and fitted proportions are far apart, which suggests that the model is inappropriate.

If the logistic model holds, then **\(D\) has a chi-squared distribution with \(m - r\) degrees of freedom**, where \(m\) is the number of groups and \(r\) (here 2) is the number of estimated regression parameters. A p-value for the deviance is given by the area under the chi-squared curve to the right of \(D\). A small p-value indicates that the data does not fit the model.

Alternatively, the fit of the model can be evaluated using the chi-squared approximation to the Pearson \(X^2\) statistic: \[X^2 = \sum_{i=1}^m \left\{ \frac{( y_i - n_i \tilde{p}_i )^2}{n_i \tilde{p}_i} + \frac{ ( (n_i-y_i) - n_i(1-\tilde{p}_i) )^2 }{ n_i(1-\tilde{p}_i)} \right\} = \sum_{i=1}^m \frac{ ( y_i - n_i \tilde{p}_i )^2}{n_i \tilde{p}_i (1- \tilde{p}_i)}.\]

### 21.3.3 Fitting the Logistic Model by Maximum Likelihood, Menarche

## Code

```
# For our summarized data (with frequencies and totals for each age)
# The left-hand side of our formula binds two columns together with cbind():
# the columns are the number of "successes" and "failures".
# For logistic regression with logit link we specify family = binomial,
# where logit is the default link function for the binomial family.
glm_m_a <-
glm(
cbind(Menarche, Total - Menarche) ~ Age
, family = binomial
, data = dat_menarche
)
```

The `glm()` statement creates an object which we can use to create the fitted probabilities and 95% CIs for the population proportions at the ages in **menarche**. The fitted probabilities and the limits are stored in columns labeled `fit_p`, `fit_p_lower`, and `fit_p_upper`, respectively.

## Code

```
# predict() returns the fitted values and standard errors on the link (logit) scale
fit_logit_pred <-
predict(
glm_m_a
, type = "link"
, se.fit = TRUE
) |>
as_tibble()
# put the fitted values in the data.frame
dat_menarche <-
dat_menarche |>
mutate(
fit_logit = fit_logit_pred$fit
, fit_logit_se = fit_logit_pred$se.fit
# fitted probabilities via the inverse logit transformation
, fit_p = exp(fit_logit) / (1 + exp(fit_logit))
# CI for p fitted values
, fit_p_lower = exp(fit_logit - 1.96 * fit_logit_se) / (1 + exp(fit_logit - 1.96 * fit_logit_se))
, fit_p_upper = exp(fit_logit + 1.96 * fit_logit_se) / (1 + exp(fit_logit + 1.96 * fit_logit_se))
)
#dat_menarche |> round(3) |> head() |> kable()
```

This printed summary information is easily interpreted. For example, the estimated population proportion of girls aged 15.08 (more precisely, among girls in the age interval with midpoint 15.08) that have reached menarche is 0.967. You are 95% confident that the population proportion is between 0.958 and 0.975. A variety of other summaries and diagnostics can be produced.
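As a check, the 0.967 figure can be reproduced by refitting the model and predicting at age 15.08 (`type = "response"` returns probabilities):

```
fit <- glm(cbind(Menarche, Total - Menarche) ~ Age,
           family = binomial, data = MASS::menarche)
# fitted probability at the age-group midpoint 15.08
p_15 <- predict(fit, newdata = data.frame(Age = 15.08), type = "response")
p_15
```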

The summary table gives MLEs and standard errors for the regression parameters. The z-value column is the parameter estimate divided by its standard error. The p-values are used to test whether the corresponding parameters of the logistic model are zero.

```
Call:
glm(formula = cbind(Menarche, Total - Menarche) ~ Age, family = binomial,
data = dat_menarche)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -21.22639 0.77068 -27.54 <2e-16 ***
Age 1.63197 0.05895 27.68 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 3693.884 on 24 degrees of freedom
Residual deviance: 26.703 on 23 degrees of freedom
AIC: 114.76
Number of Fisher Scoring iterations: 4
```

If the model is correct and when sample sizes are large, the residual deviance \(D\) has an approximate chi-square distribution, \[\textrm{residual } D \sim \chi^2_{\textrm{residual df}}.\] If \(D\) is too large, or the \(p\)-value is too small, then the model does not capture all the features in the data.

The deviance statistic is \(D=26.70\) on \(25-2=23\) df. The large p-value for \(D\) suggests no gross deficiencies with the logistic model. The observed and fitted proportions (`p_hat` and `fit_p` in the output table above) are reasonably close at each observed age. Also, `emp_logit` and `fit_logit` are close. This is consistent with \(D\) being fairly small. The data fits the logistic regression model reasonably well.
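The three quantities below (deviance, degrees of freedom, and goodness-of-fit p-value) can be computed from the fitted model, for example:

```
fit <- glm(cbind(Menarche, Total - Menarche) ~ Age,
           family = binomial, data = MASS::menarche)
D   <- deviance(fit)      # residual deviance
dfr <- df.residual(fit)   # m - r = 25 - 2
c(D, dfr, pchisq(D, dfr, lower.tail = FALSE))
```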

```
[1] 26.70345
[1] 23
[1] 0.2687953
```

The MLEs \(b_0=-21.23\) and \(b_1=1.63\) for the intercept and slope are close to the LS estimates \(b_{\textrm{LS}0}=-22.03\) and \(b_{\textrm{LS}1}=1.68\), respectively, from the least-squares fit above. The two estimation methods give similar predicted probabilities here. The MLEs of the predicted probabilities satisfy \[\log \left( \frac{\tilde{p}}{1-\tilde{p}} \right) = -21.23 + 1.63 \textrm{ Age}\] or \[\tilde{p} = \frac{ \exp( -21.23 + 1.63 \textrm{ Age}) }{ 1 + \exp(-21.23 + 1.63 \textrm{ Age}) }.\]

## Code

```
library(ggplot2)
p <- ggplot(dat_menarche, aes(x = Age, y = p_hat))
# predicted curve and point-wise 95% CI
p <- p + geom_ribbon(aes(x = Age, ymin = fit_p_lower, ymax = fit_p_upper), alpha = 0.2)
p <- p + geom_line(aes(x = Age, y = fit_p), colour = "red")
# fitted values
p <- p + geom_point(aes(y = fit_p), size = 2, colour = "red")
# observed values
p <- p + geom_point(size = 2)
p <- p + scale_y_continuous(limits = c(0, 1))
p <- p + labs(title = paste("Observed and predicted probability of girls reaching menarche,\n",
"Warsaw, Poland in 1965", sep="")
, y = "Probability"
)
print(p)
```

If the model holds, then a slope of \(\beta_1=0\) implies that \(p\) does not depend on AGE, i.e., the proportion of girls that have reached menarche is identical across age groups. The Wald p-value for the slope is \(<0.0001\), which leads to rejecting \(H_0:\beta_1=0\) at any of the usual test levels. The proportion of girls that have reached menarche is not constant across age groups. Again, the power of the model is that it gives you a simple way to quantify the effect of age on the proportion reaching menarche. This is more appealing than testing homogeneity across age groups followed by multiple comparisons.

Wald tests can be performed to test the global null hypothesis, that all non-intercept \(\beta\)s are equal to zero. This is the logistic regression analog of the overall model F-test in ANOVA and regression. The only predictor is AGE, so the implied test is that the slope of the regression line is zero. The Wald test statistic and p-value reported here are identical to the Wald test and p-value for the AGE effect given in the parameter estimates table. The Wald test can also be used to test specific contrasts between parameters.

## Code

```
# Testing Global Null Hypothesis
# Typically, we are interested in testing whether individual parameters or
# sets of parameters are simultaneously equal to 0.
# However, any null hypothesis values can be included in the vector coef_test_values.
coef_test_values <-
rep(0, length(coef(glm_m_a)))
library(aod) # for wald.test()
test_wald <-
wald.test(
b = coef(glm_m_a) - coef_test_values
, Sigma = vcov(glm_m_a)
, Terms = c(2)
)
test_wald
```

```
Wald test:
----------
Chi-squared test:
X2 = 766.3, df = 1, P(> X2) = 0.0
```

## 21.4 Example: Leukemia white blood cell types

Feigl and Zelen^{2} reported the survival time in weeks and the white cell blood count (WBC) at time of diagnosis for 33 patients who eventually died of acute leukemia. Each person was classified as AG\(+\) or AG\(-\), indicating the presence or absence of a certain morphological characteristic in the white cells. Four variables are given in the data set: WBC, a binary factor or **indicator variable** AG (1 for AG\(+\), 0 for AG\(-\)), NTOTAL (the number of patients with the given combination of AG and WBC), and NRES (the number of NTOTAL that survived at least one year from the time of diagnosis).

^{2} Feigl, P., Zelen, M. (1965) Estimation of exponential survival probabilities with concomitant information. Biometrics 21, 826–838. Survival times are given for 33 patients who died from acute myelogenous leukaemia. Also measured was the patient’s white blood cell count at the time of diagnosis. The patients were also factored into 2 groups according to the presence or absence of a morphologic characteristic of white blood cells. Patients termed AG positive were identified by the presence of Auer rods and/or significant granulation of the leukaemic cells in the bone marrow at the time of diagnosis.

The researchers are interested in modelling the probability \(p\) of surviving at least one year as a function of WBC and AG. They believe that WBC should be transformed to a log scale, given the skewness in the WBC values.

## Code

```
#### Example: Leukemia
## Leukemia white blood cell types example
# ntotal = number of patients with the AG and WBC combination
# nres   = number surviving at least one year
# ag     = 1 for AG+, 0 for AG-
# wbc    = white cell blood count
# lwbc   = log white cell blood count
# p_hat  = empirical probability of surviving at least one year
# (read_table2() was deprecated in readr 2.0.0; read_table() replaces it)
dat_leuk <-
  read_table("http://statacumen.com/teach/ADA2/notes/ADA2_notes_Ch11_leuk.dat") |>
  mutate(
    ag = factor(ag)
  , lwbc = log(wbc)
  , p_hat = nres / ntotal
  )
str(dat_leuk)
```

```
tibble [30 × 6] (S3: tbl_df/tbl/data.frame)
$ ntotal: num [1:30] 1 1 1 1 1 1 1 1 3 1 ...
$ nres : num [1:30] 1 1 1 1 1 1 1 1 1 1 ...
$ ag : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 1 ...
$ wbc : num [1:30] 75 230 260 430 700 940 1000 1050 10000 300 ...
$ lwbc : num [1:30] 4.32 5.44 5.56 6.06 6.55 ...
$ p_hat : num [1:30] 1 1 1 1 1 ...
```

As an initial step in the analysis, consider the following model: \[\log \left( \frac{p}{1-p} \right) = \beta_0 + \beta_1 \textrm{ LWBC} + \beta_2 \textrm{ AG},\] where \(\textrm{LWBC}=\log(\textrm{WBC})\). The model is best understood by separating the AG\(+\) and AG\(-\) cases. For AG\(-\) individuals, AG=0 so the model reduces to \[\log \left( \frac{p}{1-p} \right) = \beta_0 + \beta_1 \textrm{ LWBC} + \beta_2 \cdot 0 = \beta_0 + \beta_1 \textrm{ LWBC}.\] For AG\(+\) individuals, AG=1 and the model implies \[\log \left( \frac{p}{1-p} \right) = \beta_0 + \beta_1 \textrm{ LWBC} + \beta_2 \cdot 1 = (\beta_0 + \beta_2) + \beta_1 \textrm{ LWBC}.\]

The model without AG (i.e., \(\beta_2=0\)) is a simple logistic model where the log-odds of surviving one year is linearly related to LWBC, and is independent of AG. The reduced model with \(\beta_2=0\) implies that there is no effect of the AG level on the survival probability once LWBC has been taken into account.

Including the **binary predictor** AG in the model implies that there is a linear relationship between the log-odds of surviving one year and \(\textrm{LWBC}\), with a constant slope for the two AG levels. This model includes an effect for the AG morphological factor, but more general models are possible. A natural extension would be to include a product or interaction effect, a point that I will return to momentarily.

The parameters are easily interpreted: \(\beta_0\) and \(\beta_0+\beta_2\) are intercepts for the population logistic regression lines for AG\(-\) and AG\(+\), respectively. The lines have a common slope, \(\beta_1\). The \(\beta_2\) coefficient for the AG indicator is the difference between intercepts for the AG\(+\) and AG\(-\) regression lines. A picture of the assumed relationship is given below for \(\beta_1 < 0\). The population regression lines are parallel on the logit scale only, but the order between AG groups is preserved on the probability scale.

Before looking at output for the equal slopes model, note that the data set has 30 distinct AG and LWBC combinations, or 30 “groups” or samples. Only two samples have more than 1 observation. The majority of the observed proportions surviving at least one year (number surviving \(\geq 1\) year/group sample size) are 0 (i.e., 0/1) or 1 (i.e., 1/1). This sparseness of the data makes it difficult to graphically assess the suitability of the logistic model (because the estimated proportions are almost all 0 or 1). Although significance tests on the regression coefficients do not require large group sizes, the chi-squared approximation to the deviance statistic is suspect in sparse data settings. With small group sizes as we have here, most researchers would not interpret the p-value for \(D\) literally. Instead, they would use the p-values to informally check the fit of the model. Diagnostics would be used to highlight problems with the model.

## Code
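The fitted model here is `glm_i_l <- glm(cbind(nres, ntotal - nres) ~ ag + lwbc, family = binomial, data = dat_leuk)`, matching the summary shown later. A self-contained sketch of the goodness-of-fit p-value, using the residual deviance of 23.014 on 27 df reported in that summary:

```
# chi-squared upper-tail p-value for the residual deviance
D   <- 23.014   # residual deviance from the model summary
dfr <- 27       # residual degrees of freedom
pchisq(D, dfr, lower.tail = FALSE)
```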

`[1] 0.6842804`

The large p-value for \(D\) indicates that there are no gross deficiencies with the model. Recall that the Testing Global Null Hypothesis gives p-values for testing the hypothesis that the regression coefficients are zero for each predictor in the model. The two predictors are LWBC and AG, so the small p-values indicate that LWBC or AG, or both, are important predictors of survival. The p-values in the estimates table suggest that LWBC and AG are both important. If either predictor was insignificant, I would consider refitting the model omitting the least significant effect, as in regression.

```
(Intercept) ag1 lwbc
5.543349 2.519562 -1.108759
```

## Code


```
Wald test:
----------
Chi-squared test:
X2 = 8.2, df = 2, P(> X2) = 0.017
```

Given that the model fits reasonably well, a test of \(H_0:\beta_2=0\) might be a primary interest here. This checks whether the regression lines are identical for the two AG levels, which is a test for whether AG affects the survival probability, after taking LWBC into account. This test is rejected at any of the usual significance levels, suggesting that the AG level affects the survival probability (assuming a very specific model).

```
Call:
glm(formula = cbind(nres, ntotal - nres) ~ ag + lwbc, family = binomial,
data = dat_leuk)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 5.5433 3.0224 1.834 0.0666 .
ag1 2.5196 1.0907 2.310 0.0209 *
lwbc -1.1088 0.4609 -2.405 0.0162 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 38.191 on 29 degrees of freedom
Residual deviance: 23.014 on 27 degrees of freedom
AIC: 30.635
Number of Fisher Scoring iterations: 5
```

A plot of the predicted survival probabilities as a function of LWBC, using AG as the plotting symbol, indicates that the probability of surviving at least one year from the time of diagnosis is a decreasing function of LWBC. For a given LWBC the survival probability is greater for AG\(+\) patients than for AG\(-\) patients. This tendency is consistent with the observed proportions, which show little information about the exact form of the trend.

## Code

```
# predict() returns the fitted values and standard errors on the link (logit) scale
fit_logit_pred <-
predict(
glm_i_l
, type = "link"
, se.fit = TRUE
) |>
as_tibble()
# put the fitted values in the data.frame
dat_leuk <-
dat_leuk |>
mutate(
fit_logit = fit_logit_pred$fit
, fit_logit_se = fit_logit_pred$se.fit
# fitted probabilities via the inverse logit transformation
, fit_p = exp(fit_logit) / (1 + exp(fit_logit))
# CI for p fitted values
, fit_p_lower = exp(fit_logit - 1.96 * fit_logit_se) / (1 + exp(fit_logit - 1.96 * fit_logit_se))
, fit_p_upper = exp(fit_logit + 1.96 * fit_logit_se) / (1 + exp(fit_logit + 1.96 * fit_logit_se))
)
library(ggplot2)
p <- ggplot(dat_leuk, aes(x = lwbc, y = p_hat, colour = ag, fill = ag))
# predicted curve and point-wise 95% CI
p <- p + geom_ribbon(aes(x = lwbc, ymin = fit_p_lower, ymax = fit_p_upper), alpha = 0.2)
p <- p + geom_line(aes(x = lwbc, y = fit_p))
# fitted values
p <- p + geom_point(aes(y = fit_p), size=2)
# observed values
p <- p + geom_point(size = 2, alpha = 0.5)
p <- p + labs(title = "Observed and predicted probability of 1+ year survival"
, y = "Probability"
)
print(p)
```

The estimated survival probabilities satisfy \[\log \left( \frac{\tilde{p}}{1-\tilde{p}} \right) = 5.54 - 1.11\textrm{ LWBC} + 2.52 \textrm{ AG}.\] For AG\(-\) individuals with AG=0, this reduces to \[\log \left( \frac{\tilde{p}}{1-\tilde{p}} \right) = 5.54 - 1.11\textrm{ LWBC},\] or equivalently, \[\tilde{p} = \frac{ \exp(5.54 - 1.11 \textrm{ LWBC} ) } { 1 + \exp(5.54 - 1.11 \textrm{ LWBC} ) }.\] For AG\(+\) individuals with AG=1, \[\log \left( \frac{\tilde{p}}{1-\tilde{p}} \right) = 5.54 - 1.11\textrm{ LWBC} + 2.52(1) = 8.06 - 1.11\textrm{ LWBC},\] or \[\tilde{p} = \frac{ \exp(8.06 - 1.11 \textrm{ LWBC} ) } { 1 + \exp(8.06 - 1.11 \textrm{ LWBC} ) }.\]

Using the **logit scale**, the difference between AG\(+\) and AG\(-\) individuals in the estimated log-odds of surviving at least one year, at a fixed but arbitrary LWBC, is the estimated AG regression coefficient \[( 8.06 - 1.11 \textrm{ LWBC} ) - ( 5.54 - 1.11 \textrm{ LWBC} ) = 2.52.\] Using properties of exponential functions, the odds that an AG\(+\) patient lives at least one year is \(\exp(2.52)=12.42\) times larger than the odds that an AG\(-\) patient lives at least one year, regardless of LWBC.

This summary, and a CI for the AG odds ratio, is given in the **Odds Ratio** table. Similarly, the estimated odds ratio of 0.33 for LWBC implies that the odds of surviving at least one year are reduced by a factor of about 3 (that is, \(1/0.33\)) for each unit increase in LWBC.

We can use the `confint()` function to obtain confidence intervals for the coefficient estimates. Note that for logistic models, confidence intervals are based on the profiled log-likelihood function.
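A self-contained sketch of the calls involved, using simulated data in place of the leukemia fit (the actual intervals for the leukemia model appear below):

```
# simulated binary data, purely for illustration
set.seed(2)
x <- rnorm(80)
y <- rbinom(80, size = 1, prob = plogis(0.5 + x))
fit <- glm(y ~ x, family = binomial)

confint(fit)               # profile-likelihood intervals
confint.default(fit)       # Wald intervals from standard errors
exp(confint.default(fit))  # on the odds-ratio scale
```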

`Waiting for profiling to be done...`

```
2.5 % 97.5 %
(Intercept) 0.1596372 12.4524409
ag1 0.5993391 5.0149271
lwbc -2.2072275 -0.3319512
```

We can also get CIs based on just the standard errors by using the default method.

```
2.5 % 97.5 %
(Intercept) -0.3804137 11.467112
ag1 0.3818885 4.657236
lwbc -2.0121879 -0.205330
```

You can also exponentiate the coefficients and confidence interval bounds and interpret them as odds-ratios.

`Waiting for profiling to be done...`

```
OR 2.5 % 97.5 %
(Intercept) 5.543349 0.1596372 12.4524409
ag1 2.519562 0.5993391 5.0149271
lwbc -1.108759 -2.2072275 -0.3319512
```

`Waiting for profiling to be done...`

```
OR 2.5 % 97.5 %
(Intercept) 255.5323676 1.1730851 2.558741e+05
ag1 12.4231582 1.8209149 1.506452e+02
lwbc 0.3299682 0.1100052 7.175224e-01
```

Although the equal slopes model appears to fit well, a more general model might fit better. A natural generalization here would be to add an **interaction**, or product term, \(\textrm{AG}\times\textrm{LWBC}\) to the model. The logistic model with an AG effect and the \(\textrm{AG}\times\textrm{LWBC}\) interaction is equivalent to fitting separate logistic regression lines to the two AG groups. This interaction model provides an easy way to test whether the slopes are equal across AG levels. I will note that the interaction term is not needed here.

### 21.4.1 Interpreting odds ratios in logistic regression

Let’s begin with probability^{3}. Let’s say that the probability of success is 0.8, thus \(p = 0.8\). Then the probability of failure is \(q = 1 - p = 0.2\). The odds of success are defined as \(\textrm{odds(success)} = p/q = 0.8/0.2 = 4\), that is, the odds of success are 4 to 1. The odds of failure would be \(\textrm{odds(failure)} = q/p = 0.2/0.8 = 0.25\), that is, the odds of failure are 1 to 4. Next, let’s compute the odds ratio by \(\textrm{OR} = \textrm{odds(success)/odds(failure)} = 4/0.25 = 16\). The interpretation of this odds ratio would be that the odds of success are 16 times greater than for failure. Now if we had formed the odds ratio the other way around with odds of failure in the numerator, we would have gotten something like this, \(\textrm{OR} = \textrm{odds(failure)/odds(success)} = 0.25/4 = 0.0625\).
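
These computations are easy to reproduce:

```r
# Odds and the odds ratio from a success probability of 0.8
p <- 0.8
q <- 1 - p
odds_success <- p / q              # 4
odds_failure <- q / p              # 0.25
or <- odds_success / odds_failure  # 16; its reciprocal is 0.0625
```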

^{3} Borrowed graciously from UCLA Academic Technology Services at http://www.ats.ucla.edu/stat/sas/faq/oratio.htm

Interestingly enough, the interpretation of the second odds ratio is nearly the same as the first. Here the interpretation is that the odds of failure are one-sixteenth the odds of success. In fact, if you take the reciprocal of the first odds ratio you get \(1/16 = 0.0625\).

###### 21.4.1.0.0.1 Another example

This example is adapted from Pedhazur (1997). Suppose that seven out of 10 males are admitted to an engineering school while three of 10 females are admitted. The probabilities for admitting a male are, \(p = 7/10 = 0.7\) and \(q = 1 - 0.7 = 0.3\). Here are the same probabilities for females, \(p = 3/10 = 0.3\) and \(q = 1 - 0.3 = 0.7\). Now we can use the probabilities to compute the admission odds for both males and females, \(\textrm{odds(male)} = 0.7/0.3 = 2.33333\) and \(\textrm{odds(female)} = 0.3/0.7 = 0.42857\). Next, we compute the odds ratio for admission, \(\textrm{OR} = 2.3333/0.42857 = 5.44\). Thus, the odds of a male being admitted are 5.44 times greater than for a female.
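
In R, a direct transcription of the arithmetic above:

```r
# Admission odds and odds ratio for males vs females
p_male   <- 7 / 10
p_female <- 3 / 10
odds_male   <- p_male / (1 - p_male)       # about 2.33
odds_female <- p_female / (1 - p_female)   # about 0.43
or_mf <- odds_male / odds_female           # about 5.44
```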

###### 21.4.1.0.0.2 Leukemia example

In the example above, the odds of surviving at least one year are 12.43 times greater for AG\(+\) than for AG\(-\), and are multiplied by 0.33 (that is, decreased) for each unit increase in LWBC.

## 21.5 Example: Mortality of confused flour beetles

This example illustrates a quadratic logistic model.

The aim of an experiment originally reported by Strand (1930) and quoted by Bliss (1935) was to assess the response of the confused flour beetle, *Tribolium confusum*, to gaseous carbon disulphide (CS\(_2\)). In the experiment, prescribed volumes of liquid carbon disulphide were added to flasks in which a tubular cloth cage containing a batch of about thirty beetles was suspended. Duplicate batches of beetles were used for each concentration of CS\(_2\). At the end of a five-hour period, the proportion killed was recorded and the actual concentration of gaseous CS\(_2\) in the flask, measured in mg/l, was determined by a volumetric analysis. The mortality data are given in the table below.

Plot the observed probability of mortality and the empirical logits with linear and quadratic LS fits (which are not the same as the logistic MLE fits).

## Code

```
library(ggplot2)
p <- ggplot(dat_beetles, aes(x = conc, y = emp_logit))
p <- p + geom_smooth(method = "lm", colour = "red", se = FALSE)
#p <- p + geom_smooth(method = "lm", formula = y ~ poly(x, 2), colour = "blue", se = FALSE)
p <- p + geom_smooth(method = "lm", formula = y ~ x + I(x^2), colour = "blue", se = FALSE)
# observed values
p <- p + geom_point(aes(shape = rep), color = "black", size = 3, alpha = 0.5)
p <- p + labs(title = "Empirical logit with 'naive' LS fits (not MLE)")
print(p)
```

`` `geom_smooth()` using formula = 'y ~ x' ``

In a number of articles that refer to these data, the responses from the first two concentrations are omitted because of apparent non-linearity. Bliss himself remarks that

> …in comparison with the remaining observations, the two lowest concentrations gave an exceptionally high kill. Over the remaining concentrations, the plotted values seemed to form a moderately straight line, so that the data were handled as two separate sets, only the results at 56.91 mg of CS\(_2\) per litre being included in both sets.

However, there does not appear to be any biological motivation for this and so here they are retained in the data set.

Combining the data from the two replicates and plotting the empirical logit of the observed proportions against concentration gives a relationship that is better fit by a quadratic than a linear relationship, \[\begin{aligned} \log \left( \frac{p}{1-p} \right) & = & \beta_0 + \beta_1 X + \beta_2 X^2 . \nonumber \end{aligned}\] The right plot below shows the linear and quadratic model fits to the observed values with point-wise 95% confidence bands on the logit scale, and on the left is the same on the proportion scale.

## Code

```
# fit logistic regression to create lines on plots below
# linear
glm_beetles1 <-
glm(
cbind(y, n - y) ~ conc
, family = binomial
, data = dat_beetles
)
# quadratic
glm_beetles2 <-
glm(
cbind(y, n - y) ~ conc + conc2
, family = binomial
, data = dat_beetles
)
## put model fits for two models together
# First, linear relationship
fit_logit_pred <-
predict(
glm_beetles1
, type = "link"
, se.fit = TRUE
) |>
as_tibble()
dat_beetles1 <-
dat_beetles |>
mutate(
fit_logit = fit_logit_pred$fit
, fit_logit_se = fit_logit_pred$se.fit
# added "fit_p" to make predictions
, fit_p = exp(fit_logit) / (1 + exp(fit_logit))
# CI for p fitted values
, fit_p_lower = exp(fit_logit - 1.96 * fit_logit_se) / (1 + exp(fit_logit - 1.96 * fit_logit_se))
, fit_p_upper = exp(fit_logit + 1.96 * fit_logit_se) / (1 + exp(fit_logit + 1.96 * fit_logit_se))
, modelorder = "linear"
)
# Second, quadratic relationship
fit_logit_pred <-
predict(
glm_beetles2
, type = "link"
, se.fit = TRUE
) |>
as_tibble()
dat_beetles2 <-
dat_beetles |>
mutate(
fit_logit = fit_logit_pred$fit
, fit_logit_se = fit_logit_pred$se.fit
# added "fit_p" to make predictions
, fit_p = exp(fit_logit) / (1 + exp(fit_logit))
# CI for p fitted values
, fit_p_lower = exp(fit_logit - 1.96 * fit_logit_se) / (1 + exp(fit_logit - 1.96 * fit_logit_se))
, fit_p_upper = exp(fit_logit + 1.96 * fit_logit_se) / (1 + exp(fit_logit + 1.96 * fit_logit_se))
, modelorder = "quadratic"
)
dat_beetles_all <-
dat_beetles1 |>
bind_rows(dat_beetles2) |>
mutate(
modelorder = factor(modelorder)
)
# plot on logit and probability scales
library(ggplot2)
p <- ggplot(dat_beetles_all, aes(x = conc, y = p_hat, shape = rep, colour = modelorder, fill = modelorder))
# predicted curve and point-wise 95% CI
p <- p + geom_ribbon(aes(x = conc, ymin = fit_p_lower, ymax = fit_p_upper), linetype = 0, alpha = 0.1)
p <- p + geom_line(aes(x = conc, y = fit_p, linetype = modelorder), size = 1)
```

```
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
```

## Code

```
library(ggplot2)
p <- ggplot(dat_beetles_all, aes(x = conc, y = emp_logit, shape = rep, colour = modelorder, fill = modelorder))
# predicted curve and point-wise 95% CI
p <- p + geom_ribbon(aes(x = conc, ymin = fit_logit - 1.96 * fit_logit_se
, ymax = fit_logit + 1.96 * fit_logit_se), linetype = 0, alpha = 0.1)
p <- p + geom_line(aes(x = conc, y = fit_logit, linetype = modelorder), size = 1)
# fitted values
p <- p + geom_point(aes(y = fit_logit), size=2)
# observed values
p <- p + geom_point(color = "black", size = 3, alpha = 0.5)
p <- p + labs(title = "Observed and predicted mortality, logit scale")
print(p)
```

## 21.6 Example: The UNM Trauma Data

The data to be analyzed here were collected on 3132 patients admitted to The University of New Mexico Trauma Center between the years 1991 and 1994. For each patient, the attending physician recorded their age, their revised trauma score (RTS), their injury severity score (ISS), whether their injuries were blunt (i.e., the result of a car crash: BP=0) or penetrating (i.e., gunshot/knife wounds: BP=1), and whether they eventually survived their injuries (SURV=0 if not, SURV=1 if survived). Approximately 10% of patients admitted to the UNM Trauma Center eventually die from their injuries.

The ISS is an overall index of a patient’s injuries, based on the approximately 1300 injuries cataloged in the Abbreviated Injury Scale. The ISS can take on values from 0 for a patient with no injuries to 75 for a patient with 3 or more life threatening injuries. The ISS is the standard injury index used by trauma centers throughout the U.S. The RTS is an index of physiologic injury, and is constructed as a weighted average of an incoming patient’s systolic blood pressure, respiratory rate, and Glasgow Coma Scale. The RTS can take on values from 0 for a patient with no vital signs to 7.84 for a patient with normal vital signs.

Champion et al. (1981) proposed a logistic regression model to estimate the probability of a patient’s survival as a function of RTS, the injury severity score ISS, and the patient’s age, which is used as a surrogate for physiologic reserve. Subsequent survival models included the binary effect BP as a means to differentiate between blunt and penetrating injuries.

*We will develop a logistic model for predicting survival from ISS, AGE, BP, RTS, and the nine body regions.* Data on the number of severe injuries in each of the nine body regions are also included in the database, so we will also assess whether these features have any predictive power. The following labels were used to identify the number of severe injuries in the nine regions: AS = head, BS = face, CS = neck, DS = thorax, ES = abdomen, FS = spine, GS = upper extremities, HS = lower extremities, and JS = skin.

## Code

```
#### Example: UNM Trauma Data
## Variables
# surv = survival (1 if survived, 0 if died)
# rts = revised trauma score (range: 0 no vital signs to 7.84 normal vital signs)
# iss = injury severity score (0 no injuries to 75 for 3 or more life threatening injuries)
# bp = blunt or penetrating injuries (e.g., car crash BP=0 vs gunshot/knife wounds BP=1)
# Severe injuries: add the severe injuries 3--6 to make summary variables
dat_trauma <-
read_table2("http://statacumen.com/teach/ADA2/notes/ADA2_notes_Ch11_trauma.dat") |>
mutate(
as = a3 + a4 + a5 + a6 # as = head
, bs = b3 + b4 + b5 + b6 # bs = face
, cs = c3 + c4 + c5 + c6 # cs = neck
, ds = d3 + d4 + d5 + d6 # ds = thorax
, es = e3 + e4 + e5 + e6 # es = abdomen
, fs = f3 + f4 + f5 + f6 # fs = spine
, gs = g3 + g4 + g5 + g6 # gs = upper extremities
, hs = h3 + h4 + h5 + h6 # hs = lower extremities
, js = j3 + j4 + j5 + j6 # js = skin
) |>
# keep only columns of interest
select(
id, surv, prob, age, as:js, iss:rts
)
head(dat_trauma)
```

```
# A tibble: 6 × 17
id surv prob age as bs cs ds es fs gs hs
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1238385 1 0.991 13 0 0 0 1 0 0 0 0
2 1238393 1 0.995 23 0 0 0 0 0 0 0 0
3 1238898 1 0.995 43 0 0 0 0 0 0 2 0
4 1239516 1 0.962 17 1 0 0 0 0 0 0 0
5 1239961 1 0.934 20 1 0 0 0 0 0 0 0
6 1240266 1 0.995 32 0 0 0 0 0 0 0 1
# ℹ 5 more variables: js <dbl>, iss <dbl>, iciss <dbl>, bp <dbl>, rts <dbl>
```

I made side-by-side boxplots of the distributions of ISS, AGE, RTS, and AS through JS for the survivors and non-survivors. In several body regions the number of injuries is limited, so these boxplots are not very enlightening. Survivors tend to have lower ISS scores, tend to be slightly younger, tend to have higher RTS scores, and tend to have fewer severe head (AS) and abdomen injuries (ES) than non-survivors. The importance of the effects individually towards predicting survival is directly related to the separation between the survivors' and non-survivors' scores.

## Code

```
# Create boxplots for each variable by survival
dat_trauma_long <-
dat_trauma |>
pivot_longer(
cols = -one_of("id", "surv", "prob")
, names_to = "variable"
, values_to = "value"
)
# Plot the data using ggplot
library(ggplot2)
p <- ggplot(dat_trauma_long, aes(x = factor(surv), y = value))
# boxplot, size=.75 to stand out behind CI
p <- p + geom_boxplot(size = 0.75, alpha = 0.5)
# points for observed data
p <- p + geom_point(position = position_jitter(w = 0.05, h = 0), alpha = 0.1)
# diamond at mean for each group
p <- p + stat_summary(fun.y = mean, geom = "point", shape = 18, size = 6,
alpha = 0.75, colour = "red")
```

```
Warning: The `fun.y` argument of `stat_summary()` is deprecated as of ggplot2 3.3.0.
ℹ Please use the `fun` argument instead.
```

## Code

### 21.6.1 Selecting Predictors in the Trauma Data

The same automated methods for model building are available for the `glm()` procedure, including backward elimination, forward selection, and stepwise methods, among others. Below we perform a stepwise selection using AIC, starting with a full model having 13 effects: ISS, BP, RTS, AGE, and AS–JS. Revisit Chapter @ch:ADA2_notes_10_ModelSelection for more information.

In our previous logistic regression analyses, the cases in the data set were pre-aggregated into groups of observations having identical levels of the predictor variables. The numbers of cases in the success category and the group sample sizes were specified in the model statement, along with the names of the predictors. The trauma data set, which is not reproduced here, is **raw data** consisting of one record per patient (i.e., 3132 lines). The logistic model is fitted to data on individual cases by specifying the binary response variable (SURV) with successes and \(1-\textrm{SURV}\) failures with the predictors on the right-hand side of the formula. Keep in mind that we are defining the logistic model to model the **success** category, so we are modeling the probability of surviving.

As an aside, there are two easy ways to model the probability of dying (which we don't do below). The first is to swap the order in which the response is specified in the formula: `cbind(1 - surv, surv)`. The second is to convert a model for the log-odds of surviving to a model for the log-odds of dying by simply changing the sign of each regression coefficient in the model.
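
The sign-flip can be sketched numerically (using the rounded coefficients of the selected model reported later in this section):

```r
# Hedged sketch: flipping signs converts the log-odds-of-surviving model to a
# log-odds-of-dying model (rounded coefficients from the selected model below)
beta_surv <- c(intercept = 0.3558, es = -0.4613, iss = -0.0569,
               rts = 0.8431, age = -0.0497, bp = -0.6351)
beta_die  <- -beta_surv                # coefficients for modeling death
or_die_bp <- exp(beta_die[["bp"]])     # odds ratio for dying, BP=1 vs BP=0
```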

I only included the summary table from the stepwise selection, and information on the fit of the selected model.

## Code

```
glm_tr <-
glm(
cbind(surv, 1 - surv) ~
as + bs + cs + ds + es + fs + gs + hs + js + iss + rts + age + bp
, family = binomial
, data = dat_trauma
)
# option: trace = 0 doesn't show each step of the automated selection
glm_tr_red_AIC <-
step(
glm_tr
, direction = "both"
, trace = 0
)
# the anova object provides a summary of the selection steps in order
glm_tr_red_AIC$anova
```

```
Step Df Deviance Resid. Df Resid. Dev AIC
1 NA NA 3118 869.4309 897.4309
2 - as 1 0.04911674 3119 869.4800 895.4800
3 - bs 1 0.12242766 3120 869.6024 893.6024
4 - fs 1 0.15243093 3121 869.7549 891.7549
5 - cs 1 0.78703127 3122 870.5419 890.5419
6 - ds 1 0.81690443 3123 871.3588 889.3588
7 - gs 1 1.18272567 3124 872.5415 888.5415
8 - hs 1 1.17941462 3125 873.7209 887.7209
9 - js 1 1.71204029 3126 875.4330 887.4330
```

The final model includes effects for ES (number of severe abdominal injuries), ISS, RTS, AGE, and BP. All of the effects in the selected model are significant at the 5% level.

```
Call:
glm(formula = cbind(surv, 1 - surv) ~ es + iss + rts + age +
bp, family = binomial, data = dat_trauma)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.355845 0.442943 0.803 0.4218
es -0.461317 0.110098 -4.190 2.79e-05 ***
iss -0.056920 0.007411 -7.680 1.59e-14 ***
rts 0.843143 0.055339 15.236 < 2e-16 ***
age -0.049706 0.005291 -9.394 < 2e-16 ***
bp -0.635137 0.249597 -2.545 0.0109 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1825.37 on 3131 degrees of freedom
Residual deviance: 875.43 on 3126 degrees of freedom
AIC: 887.43
Number of Fisher Scoring iterations: 7
```

The p-value for \(D\) is large, indicating no gross deficiencies with the selected model.

## Code

`[1] 1`

Letting \(p\) be the probability of survival, the estimated survival probability is given by \[\begin{aligned} \log \left( \frac{\tilde p}{1-\tilde p} \right) & = & 0.3558 -0.4613 \textrm{ ES} -0.6351 \textrm{ BP} -0.0569 \textrm{ ISS} \nonumber\\ %=== & & +0.8431 \textrm{ RTS} -0.0497 \textrm{ AGE}. \nonumber \end{aligned}\] Let us interpret the sign of the coefficients, and the odds ratios, in terms of the impact that individual predictors have on the survival probability.
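
For instance, plugging in values for a hypothetical patient (ES \(=0\), BP \(=0\), ISS \(=10\), RTS \(=7.84\), AGE \(=30\); these values are illustrative, not from the data) and inverting the logit:

```r
# Estimated survival probability for a hypothetical patient, using the
# rounded coefficients above (ES=0, BP=0, ISS=10, RTS=7.84, AGE=30)
eta <- 0.3558 - 0.4613 * 0 - 0.6351 * 0 - 0.0569 * 10 +
       0.8431 * 7.84 - 0.0497 * 30
p_hat <- exp(eta) / (1 + exp(eta))   # inverse logit; close to 1 here
```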

`Waiting for profiling to be done...`

```
OR 2.5 % 97.5 %
(Intercept) 0.35584499 -0.51977300 1.21869015
es -0.46131679 -0.67603693 -0.24307991
iss -0.05691973 -0.07159539 -0.04249502
rts 0.84314317 0.73817886 0.95531089
age -0.04970641 -0.06020882 -0.03943822
bp -0.63513735 -1.12051508 -0.14005288
```

`Waiting for profiling to be done...`

```
OR 2.5 % 97.5 %
(Intercept) 1.4273863 0.5946555 3.3827539
es 0.6304529 0.5086287 0.7842088
iss 0.9446699 0.9309075 0.9583952
rts 2.3236592 2.0921220 2.5994786
age 0.9515087 0.9415679 0.9613293
bp 0.5298627 0.3261118 0.8693123
```

### 21.6.2 Checking Predictions for the Trauma Model

To assess the ability of the selected model to accurately predict survival, assume that the two types of errors (a false positive prediction of survival, and a false negative prediction) have equal costs. Under this assumption, the optimal classification rule is to predict survival for an individual with an estimated survival probability of 0.50 or larger.

In the below script, a table is generated that gives classifications for cutoff probabilities `thresh` \(= 0.0, 0.1, \ldots, 1.0\) based on the selected model. While I do not do this below, it is common to use the **jackknife** method for assessing classification accuracy. To implement the jackknife method, each observation is temporarily held out and the selected model is fitted to the remaining (3131) cases. This leads to an estimated survival probability for the case, which is compared to the actual result. The results are summarized in terms of total number of correct and incorrect classifications for each possible outcome. In the table below, the columns labeled **Event** (a survival) and **Non-Event** (a death) refer to the classification of observations, and not to the actual outcomes. The columns labeled **Correct** and **Incorrect** identify whether the classifications are accurate.
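
While not run on the trauma data here, the jackknife idea can be sketched on a small simulated data set (the data and object names are illustrative):

```r
# Leave-one-out (jackknife) classification check: refit the model without
# case i, then predict case i from the held-out fit
set.seed(2)
dat_loo   <- data.frame(x = rnorm(60))
dat_loo$y <- rbinom(60, 1, plogis(dat_loo$x))
loo_pred <- vapply(seq_len(nrow(dat_loo)), function(i) {
  fit_i <- glm(y ~ x, family = binomial, data = dat_loo[-i, ])
  predict(fit_i, newdata = dat_loo[i, , drop = FALSE], type = "response")
}, numeric(1))
# classification accuracy at the 0.50 cutoff
loo_acc <- mean((loo_pred >= 0.5) == (dat_loo$y == 1))
```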

## Code

```
# thresholds for classification given model proportion predictions for each observation
thresh <- seq(0, 1, by = 0.1)
# predicted probabilities
Yhat <- fitted(glm_tr_red_AIC)
# Name: lower (0) = NonEvent, higher (1) = Event
YObs <-
cut(
dat_trauma$surv
, breaks = c(-Inf, mean(dat_trauma$surv), Inf)
, labels = c("NonEvent", "Event")
)
classify_table <-
data.frame(
Thresh = rep(NA, length(thresh))
, Cor_Event = rep(NA, length(thresh))
, Cor_NonEv = rep(NA, length(thresh))
, Inc_Event = rep(NA, length(thresh))
, Inc_NonEv = rep(NA, length(thresh))
, Cor_All = rep(NA, length(thresh))
, Sens = rep(NA, length(thresh))
, Spec = rep(NA, length(thresh))
, Fal_P = rep(NA, length(thresh))
, Fal_N = rep(NA, length(thresh))
)
for (i_thresh in 1:length(thresh)) {
# choose a threshold for dichotomizing according to predicted probability
YhatPred <-
cut(
Yhat
, breaks = c(-Inf, thresh[i_thresh], Inf)
, labels = c("NonEvent", "Event")
)
# contingency table and marginal sums
cTab <- table(YhatPred, YObs)
addmargins(cTab)
# Classification Table
classify_table$Thresh [i_thresh] <- thresh[i_thresh] # Prob_Level
classify_table$Cor_Event[i_thresh] <- cTab[2,2] # Correct_Event
classify_table$Cor_NonEv[i_thresh] <- cTab[1,1] # Correct_NonEvent
classify_table$Inc_Event[i_thresh] <- cTab[2,1] # Incorrect_Event
classify_table$Inc_NonEv[i_thresh] <- cTab[1,2] # Incorrect_NonEvent
classify_table$Cor_All [i_thresh] <- 100 * sum(diag(cTab)) / sum(cTab) # Correct_Overall
classify_table$Sens [i_thresh] <- 100 * cTab[2,2] / sum(cTab[,2]) # Sensitivity
classify_table$Spec [i_thresh] <- 100 * cTab[1,1] / sum(cTab[,1]) # Specificity
classify_table$Fal_P [i_thresh] <- 100 * cTab[2,1] / sum(cTab[2,]) # False_Pos
classify_table$Fal_N [i_thresh] <- 100 * cTab[1,2] / sum(cTab[1,]) # False_Neg
}
classify_table |> round(2)
```

```
Thresh Cor_Event Cor_NonEv Inc_Event Inc_NonEv Cor_All Sens Spec
1 0.0 2865 0 267 0 91.48 100.00 0.00
2 0.1 2861 79 188 4 93.87 99.86 29.59
3 0.2 2856 105 162 9 94.54 99.69 39.33
4 0.3 2848 118 149 17 94.70 99.41 44.19
5 0.4 2837 125 142 28 94.57 99.02 46.82
6 0.5 2825 139 128 40 94.64 98.60 52.06
7 0.6 2805 157 110 60 94.57 97.91 58.80
8 0.7 2774 174 93 91 94.13 96.82 65.17
9 0.8 2727 196 71 138 93.33 95.18 73.41
10 0.9 2627 229 38 238 91.19 91.69 85.77
11 1.0 0 267 0 2865 8.52 0.00 100.00
Fal_P Fal_N
1 8.52 NaN
2 6.17 4.82
3 5.37 7.89
4 4.97 12.59
5 4.77 18.30
6 4.33 22.35
7 3.77 27.65
8 3.24 34.34
9 2.54 41.32
10 1.43 50.96
11 NaN 91.48
```


The data set has 2865 survivors and 267 people that died. Using a 0.50 cutoff, 2825 of the survivors would be correctly identified, and 40 misclassified. Similarly, 139 of the patients that died would be correctly classified and 128 would not. The overall percentage of cases correctly classified is \((2825+139)/3132 = 94.6\%\). The **sensitivity** is the percentage of survivors that are correctly classified, \(2825/(2825+40)=98.6\%\). The **specificity** is the percentage of patients that died that are correctly classified, \(139/(139+128)=52.1\%\). The **false positive rate**, which is the % of those predicted to survive that did not, is \(128/(128+2825)=4.3\%\). The **false negative rate**, which is the % of those predicted to die that did not, is \(40/(40+139)=22.3\%\).
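
These rates can be recomputed directly from the 0.50-row counts of the classification table:

```r
# Classification summaries at the 0.50 cutoff, from the table above
cor_event <- 2825; cor_nonev <- 139    # correct classifications
inc_event <- 128;  inc_nonev <- 40     # incorrect classifications
overall <- (cor_event + cor_nonev) / 3132        # about 0.946
sens    <- cor_event / (cor_event + inc_nonev)   # about 0.986
spec    <- cor_nonev / (cor_nonev + inc_event)   # about 0.521
```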

## Code

```
YObs
YhatPred NonEvent Event Sum
NonEvent 139 40 179
Event 128 2825 2953
Sum 267 2865 3132
```

```
Thresh Cor_Event Cor_NonEv Inc_Event Inc_NonEv Cor_All Sens Spec Fal_P
1 0.5 2825 139 128 40 94.64 98.6 52.06 4.33
Fal_N
1 22.35
```

The misclassification rate seems small, but you should remember that approximately 10% of patients admitted to UNM eventually die from their injuries. Given this historical information only, you could achieve a 10% misclassification rate by completely ignoring the data and classifying each admitted patient as a survivor. Using the data reduces the misclassification rate by about half (from 10% down to 5.4%), which is an important reduction in this problem.

## 21.7 Historical Example: O-Ring Data

The table below presents data from Dalal, Fowlkes, and Hoadley (1989) on field O-ring failures in the 23 pre-*Challenger* space shuttle launches. Temperature at lift-off and O-ring joint pressure are predictors. The binary version of the data views each flight as an independent trial. The result of a trial (\(y\)) is a 1 if at least one field O-ring failure occurred on the flight and a 0 if all six O-rings functioned properly. A regression for the binary data would model the probability that **any** O-ring failed as a function of temperature and pressure. The binomial version of these data is also presented. It views the six field O-rings on each flight as independent trials. A regression for the binomial data would model the probability that **a particular** O-ring on a given flight failed as a function of temperature and pressure. The two regressions model different probabilities. The *Challenger* explosion occurred during a takeoff at 31 degrees Fahrenheit.

The regression model for the binomial data (i.e., six trials per launch) is suspect on substantive grounds. The binomial model makes no allowance for an effect due to having the six O-rings on the same flight. If these data were measurements rather than counts, they would almost certainly be treated as dependent repeated measures. If interest is focused on whether one or more O-rings failed, the simplest, most direct data are the binary data.

Consider fitting a logistic regression model using temperature and pressure as predictors. Let \(p_i\) be the probability that any O-ring fails in case \(i\) and model this as \[\log \left( \frac{p_i}{1-p_i} \right) = \beta_0 + \beta_1 \textrm{ Temp}_i + \beta_2 \textrm{ Pressure}_i.\]

Logistic histogram plots of the data show a clear marginal relationship of failure with temp but not with pressure. We still need to assess the model with both variables together.

## Code

We fit the logistic model below using \(Y=1\) if at least one O-ring failed, and 0 otherwise. We are modeling the chance of one or more O-ring failures as a function of temperature and pressure.

The \(D\) goodness-of-fit statistic suggests no gross deviations from the model. Furthermore, the test of \(H_0: \beta_1=\beta_2=0\) (no regression effects) based on the Wald test has a p-value of 0.1, which suggests that temperature and pressure together are not strong predictors of the probability of O-ring failure. The z-test p-values for testing \(H_0:\beta_1=0\) and \(H_0:\beta_2=0\) individually are 0.037 and 0.576, respectively, which indicates that pressure is not important (when added last to the model), but that temperature is. This conclusion might be anticipated by looking at the data plots above.

## Code

`[1] 0.4589415`

```
(Intercept) temp pressure
16.385319489 -0.263404073 0.005177602
```

## Code

```
Wald test:
----------
Chi-squared test:
X2 = 4.6, df = 2, P(> X2) = 0.1
```

```
Call:
glm(formula = cbind(y, 1 - y) ~ temp + pressure, family = binomial,
data = dat_shuttle)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 16.385319 8.027474 2.041 0.0412 *
temp -0.263404 0.126371 -2.084 0.0371 *
pressure 0.005178 0.009257 0.559 0.5760
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 28.267 on 22 degrees of freedom
Residual deviance: 19.984 on 20 degrees of freedom
AIC: 25.984
Number of Fisher Scoring iterations: 5
```

A reasonable next step would be to refit the model, after omitting pressure as a predictor. The target model is now \[\log \left( \frac{p_i}{1-p_i} \right) = \beta_0 + \beta_1 \textrm{ Temp}_i.\]

## Code

`[1] 0.5013827`

```
Call:
glm(formula = cbind(y, 1 - y) ~ temp, family = binomial, data = dat_shuttle)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 15.0429 7.3786 2.039 0.0415 *
temp -0.2322 0.1082 -2.145 0.0320 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 28.267 on 22 degrees of freedom
Residual deviance: 20.315 on 21 degrees of freedom
AIC: 24.315
Number of Fisher Scoring iterations: 5
```

Our conclusions on the overall fit of the model and the significance of the effect of temperature on the probability of O-ring failure are consistent with the results for the two-predictor model. The model estimates the log-odds of (at least one) O-ring failure to be \[\log \left( \frac{\tilde{p}}{1-\tilde{p}} \right) = 15.043 - 0.2322 \textrm{ Temp}.\] The estimated probability of (at least one) O-ring failure is \[\tilde{p} = \frac{ \exp( 15.043 - 0.2322 \textrm{ Temp}) } { 1 + \exp( 15.043 - 0.2322 \textrm{ Temp}) }.\] This is a decreasing function of temperature.
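
The second expression can be coded directly; evaluating it inside and outside the observed temperature range shows how sharply the estimated failure probability rises (31 degrees is an extrapolation):

```r
# Estimated probability of at least one O-ring failure from the fitted model
p_fail <- function(temp) {
  eta <- 15.043 - 0.2322 * temp   # fitted log-odds
  exp(eta) / (1 + exp(eta))       # inverse logit
}
p_fail(70)   # within the observed temperature range
p_fail(31)   # extrapolated to the Challenger launch temperature (near 1)
```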

The *Challenger* was set to launch on a morning where the predicted temperature at lift-off was 31 degrees. Given that temperature appears to affect the probability of O-ring failure (a point that NASA missed), what can/should we say about the potential for O-ring failure?

Clearly, the launch temperature is outside the region for which data were available. Thus, we really have no prior information about what is likely to occur. If we assume the logistic model holds, and we can extrapolate back to 31 degrees, what is the estimated probability of O-ring failure?

The following answers this question. I augmented the original data set to obtain predicted probabilities for temperatures not observed in the data set. Note that fitting the model to the data with the appended (missing-response) observations gives the same model fit because `glm()` excludes observations with missing values. Predictions are then made for all temperatures in the dataset and the resulting table and plot are reported.

## Code

```
# append values to dataset for which we wish to make predictions
dat_shuttle_pred <-
tibble(
case = rep(NA, 5)
, flight = rep(NA, 5)
, y = rep(NA, 5)
, six = rep(NA, 5)
, temp = c(31, 35, 40, 45, 50) # temp values to predict
, pressure = rep(NA, 5)
)
dat_shuttle <-
dat_shuttle_pred |>
bind_rows(dat_shuttle)
# fit model
glm_sh <-
glm(
cbind(y, 1 - y) ~ temp
, family = binomial
, data = dat_shuttle
)
# Note: same model fit as before since glm() does not use missing values
summary(glm_sh)$coefficients |> round(3)
```

```
Estimate Std. Error z value Pr(>|z|)
(Intercept) 15.043 7.379 2.039 0.041
temp -0.232 0.108 -2.145 0.032
```

## Code

```
# predict() uses all the temp values in dataset, including appended values
fit_logit_pred <-
predict(
glm_sh
, newdata = dat_shuttle |> select(temp) # use temp from dat_shuttle
, type = "link"
, se.fit = TRUE
) |>
as_tibble()
# put the fitted values in the data.frame
dat_shuttle <-
dat_shuttle |>
mutate(
fit_logit = fit_logit_pred$fit
, fit_logit_se = fit_logit_pred$se.fit
# added "fit_p" to make predictions at appended Load values
, fit_p = exp(fit_logit) / (1 + exp(fit_logit))
# CI for p fitted values
, fit_p_lower = exp(fit_logit - 1.96 * fit_logit_se) / (1 + exp(fit_logit - 1.96 * fit_logit_se))
, fit_p_upper = exp(fit_logit + 1.96 * fit_logit_se) / (1 + exp(fit_logit + 1.96 * fit_logit_se))
)
```

## Code

```
library(ggplot2)
p <- ggplot(dat_shuttle, aes(x = temp, y = y))
# predicted curve and point-wise 95% CI
p <- p + geom_ribbon(aes(x = temp, ymin = fit_p_lower, ymax = fit_p_upper), alpha = 0.2)
p <- p + geom_line(aes(x = temp, y = fit_p), colour="red")
# fitted values
p <- p + geom_point(aes(y = fit_p), size=2, colour="red")
# observed values
p <- p + geom_point(size = 2)
p <- p + ylab("Probability")
p <- p + labs(title = "Observed events and predicted probability of 1+ O-ring failures")
print(p)
```

`Warning: Removed 5 rows containing missing values (`geom_point()`).`