21  Logistic Regression

You are reading the work-in-progress online edition of “Statistical Acumen: Advanced Data Analysis.”

This chapter should be readable but is currently undergoing final polishing.

Code
set.seed(76543)
library(erikmisc)
library(tidyverse)

Logistic regression analysis is used for predicting the outcome of a categorical dependent variable based on one or more predictor variables. The probabilities describing the possible outcomes of a single trial are modeled, as a function of the explanatory (predictor) variables, using a logistic function. Logistic regression is frequently used to refer to the problem in which the dependent variable is binary — that is, the number of available categories is two — and problems with more than two categories are referred to as multinomial logistic regression or, if the multiple categories are ordered, as ordered logistic regression.

Logistic regression measures the relationship between a categorical dependent variable and usually (but not necessarily) one or more continuous independent variables, by converting the dependent variable to probability scores. As such it treats the same set of problems as does probit regression using similar techniques.

21.2 Example: Age of Menarche in Warsaw

The data1 below are from a study conducted by Milicer and Szczotka on pre-teen and teenage girls in Warsaw, Poland in 1965. The subjects were classified into 25 age categories. The number of girls in each group (Total) and the number that reached menarche (Menarche) at the time of the study were recorded. The age for a group corresponds to the midpoint for the age interval.

  • 1 Milicer, H. and Szczotka, F. (1966) Age at Menarche in Warsaw girls in 1965. Human Biology 38, 199–203.

  • Code
    #### Example: Menarche
    # menarche dataset is available in MASS package
    #library(MASS)
    
    dat_menarche <-
      MASS::menarche |>
      mutate(
        # these frequencies look better in the table as integers
        Total    = Total    |> as.integer()
      , Menarche = Menarche |> as.integer()
        # create estimated proportion of girls reaching menarche for each age group
      , p_hat    = Menarche / Total
      )
    str(dat_menarche)
    'data.frame':   25 obs. of  4 variables:
     $ Age     : num  9.21 10.21 10.58 10.83 11.08 ...
     $ Total   : int  376 200 93 120 90 88 105 111 100 93 ...
     $ Menarche: int  0 0 0 2 2 5 10 17 16 29 ...
     $ p_hat   : num  0 0 0 0.0167 0.0222 ...

    The researchers were curious about how the proportion of girls that reached menarche (\(\hat{p}=\textrm{Menarche}/\textrm{Total}\)) varied with age. One could perform a test of homogeneity (Multinomial goodness-of-fit test) by arranging the data as a 2-by-25 contingency table with columns indexed by age and two rows: ROW1 \(=\) Menarche, and ROW2 \(=\) number that have not reached menarche \(=\) (Total \(-\) Menarche). A more powerful approach treats these as regression data, using the proportion of girls reaching menarche as the response and age as a predictor.

    A plot of the observed proportion \(\hat{p}\) of girls that have reached menarche shows that the proportion increases as age increases, but that the relationship is nonlinear. The observed proportions, which are bounded between zero and one, have a lazy \(S\)-shape (a sigmoidal function) when plotted against age. The change in the observed proportions for a given change in age is much smaller when the proportion is near 0 or 1 than when the proportion is near \(1/2\). This phenomenon is common with regression data where the response is a proportion.

    The trend is nonlinear so linear regression is inappropriate. A sensible alternative might be to transform the response or the predictor to achieve near linearity. A common transformation of response proportions following a sigmoidal curve is to the logit scale \(\hat{\mu} = \log_e\{\hat{p}/(1-\hat{p})\}\). This transformation is the basis for the logistic regression model. The natural logarithm (base \(e\)) is traditionally used in logistic regression.

    The logit transformation is undefined when \(\hat{p}=0\) or \(\hat{p}=1\). To overcome this problem, researchers use the empirical logits, defined by \(\log \{ (\hat{p}+0.5/n) / (1-\hat{p}+0.5/n) \}\), where \(n\) is the sample size or the number of observations on which \(\hat{p}\) is based.

    A plot of the empirical logits against age is roughly linear, which supports a logistic transformation for the response.

    Code
    library(ggplot2)
    p <- ggplot(dat_menarche, aes(x = Age, y = p_hat))
    p <- p + geom_point()
    p <- p + labs(title = paste("Observed probability of girls reaching menarche,\n",
                                "Warsaw, Poland in 1965", sep=""))
    print(p)

    Code
    dat_menarche <-
      dat_menarche |>
      mutate(
        # empirical logits
        emp_logit = log((p_hat + 0.5 / Total) / (1 - p_hat + 0.5 / Total))
      )
    
    library(ggplot2)
    p <- ggplot(dat_menarche, aes(x = Age, y = emp_logit))
    p <- p + geom_point()
    p <- p + labs(title = "Empirical logits")
    print(p)

    21.3 Simple logistic regression model

    The simple logistic regression model expresses the population proportion \(p\) of individuals with a given attribute (called the probability of success) as a function of a single predictor variable \(X\). The model assumes that \(p\) is related to \(X\) through \[\log \left( \frac{p}{1-p} \right) = \beta_0 + \beta_1 X\] or, equivalently, as \[p = \frac{ \exp( \beta_0 + \beta_1 X ) }{ 1 + \exp( \beta_0 + \beta_1 X ) }.\] The logistic regression model is a binary response model, where the response for each case falls into one of two exclusive and exhaustive categories, success (cases with the attribute of interest) and failure (cases without the attribute of interest).

    The odds of success are \(p/(1-p)\). For example, when \(p=1/2\) the odds of success are 1 (or 1 to 1). When \(p=0.9\) the odds of success are 9 (or 9 to 1). The logistic model assumes that the log-odds of success is linearly related to \(X\). Graphs of the logistic model relating \(p\) to \(X\) are given below. The sign of the slope refers to the sign of \(\beta_1\).
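To make the odds and log-odds arithmetic concrete, here is a small numerical check (in Python rather than this chapter's R, purely as an illustration):

```python
import math

def odds(p):
    """Odds of success corresponding to probability p."""
    return p / (1 - p)

def logit(p):
    """Log-odds of success (natural log, as in the logistic model)."""
    return math.log(odds(p))

def inv_logit(eta):
    """Invert the logit: recover p from the log-odds eta."""
    return math.exp(eta) / (1 + math.exp(eta))

print(odds(0.5))   # p = 1/2 gives odds of 1 (1 to 1)
print(odds(0.9))   # p = 0.9 gives odds of 9 (9 to 1)

# the logit and its inverse undo each other
p = 0.37
print(inv_logit(logit(p)))  # recovers 0.37 (up to floating-point error)
```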

    I should write \(p=p(X)\) to emphasize that \(p\) is the proportion of all individuals with score \(X\) that have the attribute of interest. In the menarche data, \(p=p(X)\) is the population proportion of girls at age \(X\) that have reached menarche.

    [Figure: logistic curves relating \(p\) to \(X\) for positive and negative slopes \(\beta_1\)]

    The data in a logistic regression problem are often given in summarized or aggregate form:

    \(X\) \(n\) \(y\)
    \(X_1\) \(n_1\) \(y_1\)
    \(X_2\) \(n_2\) \(y_2\)
    \(\vdots\) \(\vdots\) \(\vdots\)
    \(X_m\) \(n_m\) \(y_m\)

    where \(y_i\) is the number of individuals with the attribute of interest among \(n_i\) randomly selected or representative individuals with predictor variable value \(X_i\). For raw data on individual cases, \(y_i=1\) or 0, depending on whether the case at \(X_i\) is a success or failure, and the sample size column \(n\) is omitted with raw data.

    For logistic regression, a plot of the sample proportions \(\hat{p}_i=y_i/n_i\) against \(X_i\) should be roughly sigmoidal, and a plot of the empirical logits against \(X_i\) should be roughly linear. If not, then some other model is probably appropriate. I find the second plot easier to calibrate, but neither plot is very informative when the sample sizes are small, say 1 or 2. (Why?).

    There are a variety of other binary response models that are used in practice. The probit regression model or the complementary log-log regression model might be appropriate when the logistic model does not fit the data.

    The following sections first illustrate a naive least-squares strategy and then describe the standard MLE strategy for estimating the logistic regression parameters.

    21.3.1 Estimating Regression Parameters via LS of empirical logits

    (This is a naive method; we will discuss a MUCH better way in the next section. In practice, never do this! This is only for illustration.)

    There are two unknown population parameters in the logistic regression model \[\log \left( \frac{p}{1-p} \right) = \beta_0 + \beta_1 X.\] A simple way to estimate \(\beta_0\) and \(\beta_1\) is by least squares (LS), using the empirical logits as responses and the \(X_i\)s as the predictor values.

    Below we use standard regression to calculate the LS fit between the empirical logits and age.

    Code
    lm_menarche_e_a <-
      lm(
        emp_logit ~ Age
      , data = dat_menarche
      )
    # LS coefficients
    coef(lm_menarche_e_a)
    (Intercept)         Age 
     -22.027933    1.676395 

    The LS estimates for the menarche data are \(b_0=-22.03\) and \(b_1=1.68\), which gives the fitted relationship \[\log \left( \frac{\tilde{p}}{1-\tilde{p}} \right) = -22.03 + 1.68 \textrm{ Age}\] or \[\tilde{p} = \frac{ \exp( -22.03 + 1.68 \textrm{ Age}) }{ 1 + \exp(-22.03 + 1.68 \textrm{ Age}) },\] where \(\tilde{p}\) is the predicted proportion (under the model) of girls having reached menarche at the given age. I used \(\tilde{p}\) to identify a predicted probability, in contrast to \(\hat{p}\) which is the observed proportion at a given age.
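As a quick numerical check of this fitted relationship (a Python sketch, not part of the chapter's R workflow; age 13 is just an illustrative input):

```python
import math

# LS estimates from the fit above
b0, b1 = -22.03, 1.68

def p_tilde(age):
    """Predicted proportion reaching menarche under the fitted logistic relationship."""
    eta = b0 + b1 * age                      # fitted log-odds
    return math.exp(eta) / (1 + math.exp(eta))

# predicted proportion at age 13 (fitted log-odds -0.19, slightly below 1/2)
print(round(p_tilde(13), 3))  # → 0.453
```

The predictions rise with age, as the sigmoid requires: `p_tilde(10) < p_tilde(13) < p_tilde(16)`.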

    The power of the logistic model versus the contingency table analysis discussed earlier is that the model gives estimates for the population proportion reaching menarche at all ages within the observed age range. The observed proportions allow you to estimate only the population proportions at the observed ages.

    21.3.2 Maximum Likelihood Estimation for Logistic Regression Model

    There are better ways to fit the logistic regression model than LS, which assumes that the responses are normally distributed with constant variance. A deficiency of the LS fit to the logistic model is that the observed counts \(y_i\) have a Binomial distribution under random sampling. The Binomial distribution is a discrete probability model associated with counting the number of successes in a fixed-size sample, and other equivalent experiments such as counting the number of heads in repeated flips of a coin. The distribution of the empirical logits depends on the \(y_i\)s, so they are not normal (but are approximately normal in large samples), and they are extremely skewed when the sample sizes \(n_i\) are small. The response variability depends on the population proportions, and is not roughly constant when the observed proportions or the sample sizes vary appreciably across groups.

    The differences in variability among the empirical logits can be accounted for using weighted least squares (WLS) when the sample sizes are large. An alternative approach called maximum likelihood uses the exact Binomial distribution of the responses \(y_i\) to generate optimal estimates of the regression coefficients. Software for maximum likelihood estimation is widely available, so LS and WLS methods are not really needed.

    In maximum likelihood estimation (MLE), the regression coefficients are estimated iteratively by minimizing the deviance function (also called the likelihood ratio chi-squared statistic) \[D = 2 \sum_{i=1}^m \left\{ y_i \log \left( \frac{ y_i }{n_i p_i} \right) + (n_i-y_i) \log \left( \frac{ n_i-y_i }{n_i - n_i p_i} \right) \right\}\] over all possible values of \(\beta_0\) and \(\beta_1\), where the \(p_i\)s satisfy the logistic model \[\log \left( \frac{p_i}{1-p_i} \right) = \beta_0 + \beta_1 X_i.\] The ML method also gives standard errors and significance tests for the regression estimates.
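The deviance formula translates directly into code. This Python sketch (using the usual convention \(0 \log 0 = 0\) for groups where \(y_i = 0\) or \(y_i = n_i\), and toy numbers rather than the menarche data) evaluates \(D\) for given model probabilities:

```python
import math

def deviance(y, n, p):
    """Deviance for grouped binomial data: y successes out of n, model probabilities p.
    Uses the convention 0 * log(0) = 0 for empty cells."""
    D = 0.0
    for yi, ni, pi in zip(y, n, p):
        if yi > 0:
            D += yi * math.log(yi / (ni * pi))
        if yi < ni:
            D += (ni - yi) * math.log((ni - yi) / (ni - ni * pi))
    return 2.0 * D

# when the model probabilities equal the observed proportions, D = 0
print(deviance([3, 7], [10, 10], [0.3, 0.7]))             # ~0
# probabilities away from the observed proportions inflate D
print(round(deviance([3, 7], [10, 10], [0.5, 0.5]), 3))   # → 3.291
```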

    The deviance is an analog of the residual sums of squares in linear regression. The choices for \(\beta_0\) and \(\beta_1\) that minimize the deviance are the parameter values that make the observed and fitted proportions as close together as possible in a “likelihood sense”.

    Suppose that \(b_0\) and \(b_1\) are the MLEs of \(\beta_0\) and \(\beta_1\). The deviance evaluated at the MLEs, \[D = 2 \sum_{i=1}^m \left\{ y_i \log \left( \frac{ y_i }{n_i \tilde{p}_i} \right) + (n_i-y_i) \log \left( \frac{ n_i-y_i }{n_i - n_i \tilde{p}_i} \right) \right\},\] where the fitted probabilities \(\tilde{p}_i\) satisfy \[\log \left( \frac{\tilde{p}_i}{1-\tilde{p}_i} \right) = b_0 + b_1 X_i,\] is used to test the adequacy of the model. The deviance is small when the data fits the model, that is, when the observed and fitted proportions are close together. Large values of \(D\) occur when one or more of the observed and fitted proportions are far apart, which suggests that the model is inappropriate.

    If the logistic model holds, then \(D\) has a chi-squared distribution with \(m - r\) degrees of freedom, where \(m\) is the number of groups and \(r\) (here 2) is the number of estimated regression parameters. A p-value for the deviance is given by the area under the chi-squared curve to the right of \(D\). A small p-value indicates that the data does not fit the model.

    Alternatively, the fit of the model can be evaluated using the chi-squared approximation to the Pearson \(X^2\) statistic: \[X^2 = \sum_{i=1}^m \left\{ \frac{( y_i - n_i \tilde{p}_i )^2}{n_i \tilde{p}_i} + \frac{ ( (n_i-y_i) - n_i(1-\tilde{p}_i) )^2 }{ n_i(1-\tilde{p}_i)} \right\} = \sum_{i=1}^m \frac{ ( y_i - n_i \tilde{p}_i )^2}{n_i \tilde{p}_i (1- \tilde{p}_i)}.\]
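The second equality holds because the failure-cell residual \((n_i - y_i) - n_i(1-\tilde{p}_i)\) is the negative of the success-cell residual \(y_i - n_i\tilde{p}_i\), so the two cells combine over a common denominator. A Python sketch with toy numbers (not the menarche data) confirms the two forms agree:

```python
def pearson_X2_long(y, n, p):
    """Pearson statistic, summing success and failure cells separately."""
    return sum((yi - ni * pi) ** 2 / (ni * pi)
               + ((ni - yi) - ni * (1 - pi)) ** 2 / (ni * (1 - pi))
               for yi, ni, pi in zip(y, n, p))

def pearson_X2_short(y, n, p):
    """Equivalent collapsed form of the Pearson statistic."""
    return sum((yi - ni * pi) ** 2 / (ni * pi * (1 - pi))
               for yi, ni, pi in zip(y, n, p))

# toy grouped data: the two forms give identical values
y, n, p = [3, 6], [10, 10], [0.4, 0.5]
print(pearson_X2_long(y, n, p))
print(pearson_X2_short(y, n, p))
```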

    21.3.3 Fitting the Logistic Model by Maximum Likelihood, Menarche

    Code
    # For our summarized data (with frequencies and totals for each age)
    # The left-hand side of our formula binds two columns together with cbind():
    #   the columns are the number of "successes" and "failures".
    # For logistic regression with logit link we specify family = binomial,
    #   where logit is the default link function for the binomial family.
    glm_m_a <-
      glm(
        cbind(Menarche, Total - Menarche) ~ Age
      , family = binomial
      , data = dat_menarche
      )

    The glm() statement creates an object which we can use to create the fitted probabilities and 95% CIs for the population proportions at the ages in the menarche data. The fitted probabilities and the limits are stored in columns labeled fit_p, fit_p_lower, and fit_p_upper, respectively.

    Code
    # predict() on the link (logit) scale returns fitted logits and standard errors
    fit_logit_pred <-
      predict(
        glm_m_a
      , type   = "link"
      , se.fit = TRUE
      ) |>
      as_tibble()
    
    # put the fitted values in the data.frame
    dat_menarche <-
      dat_menarche |>
      mutate(
        fit_logit    = fit_logit_pred$fit
      , fit_logit_se = fit_logit_pred$se.fit
      # fitted probability: back-transform the fitted logit to the probability scale
      , fit_p        = exp(fit_logit) / (1 + exp(fit_logit))
      # CI for p fitted values
      , fit_p_lower  = exp(fit_logit - 1.96 * fit_logit_se) / (1 + exp(fit_logit - 1.96 * fit_logit_se))
      , fit_p_upper  = exp(fit_logit + 1.96 * fit_logit_se) / (1 + exp(fit_logit + 1.96 * fit_logit_se))
      )
    
    #dat_menarche |> round(3) |> head() |> kable()

    These fitted probabilities and interval estimates are easily interpreted. For example, the estimated population proportion of girls aged 15.08 (more precisely, among girls in the age interval with midpoint 15.08) that have reached menarche is 0.967. You are 95% confident that the population proportion is between 0.958 and 0.975. A variety of other summaries and diagnostics can be produced.
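The age-15.08 figure can be reproduced by hand from the MLEs reported by summary() later in this section (a Python check, with the coefficients copied from the glm() output):

```python
import math

# MLEs from the glm() summary: intercept and Age slope
b0, b1 = -21.22639, 1.63197

eta = b0 + b1 * 15.08                      # fitted log-odds at age 15.08
p = math.exp(eta) / (1 + math.exp(eta))    # back-transform to a probability
print(round(p, 3))  # → 0.967
```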


    The summary table gives MLEs and standard errors for the regression parameters. The z-value column is the parameter estimate divided by its standard error. The p-values are used to test whether the corresponding parameters of the logistic model are zero.

    Code
    summary(glm_m_a)
    
    Call:
    glm(formula = cbind(Menarche, Total - Menarche) ~ Age, family = binomial, 
        data = dat_menarche)
    
    Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
    (Intercept) -21.22639    0.77068  -27.54   <2e-16 ***
    Age           1.63197    0.05895   27.68   <2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    (Dispersion parameter for binomial family taken to be 1)
    
        Null deviance: 3693.884  on 24  degrees of freedom
    Residual deviance:   26.703  on 23  degrees of freedom
    AIC: 114.76
    
    Number of Fisher Scoring iterations: 4

    If the model is correct and when sample sizes are large, the residual deviance \(D\) has an approximate chi-square distribution, \[\textrm{residual } D \sim \chi^2_{\textrm{residual df}} .\] If \(D\) is too large, or the \(p\)-value is too small, then the model does not capture all the features in the data.

    The deviance statistic is \(D=26.70\) on \(25-2=23\) df. The large p-value for \(D\) suggests no gross deficiencies with the logistic model. The observed and fitted proportions (p_hat and fit_p in the output above) are reasonably close at each observed age. Also, emp_logit and fit_logit are close. This is consistent with \(D\) being fairly small. The data fits the logistic regression model reasonably well.

    Code
    # Test residual deviance for lack-of-fit (if > 0.10, little-to-no lack-of-fit)
    glm_m_a$deviance
    [1] 26.70345
    Code
    glm_m_a$df.residual
    [1] 23
    Code
    dev_p_val <- 1 - pchisq(glm_m_a$deviance, glm_m_a$df.residual)
    dev_p_val
    [1] 0.2687953

    The MLEs \(b_0=-21.23\) and \(b_1=1.63\) for the intercept and slope are close to the LS estimates of \(b_{\textrm{LS}0}=-22.03\) and \(b_{\textrm{LS}1}=1.68\), respectively, from the LS fit in the earlier section. The two estimation methods give similar predicted probabilities here. The ML predicted probabilities satisfy \[\log \left( \frac{\tilde{p}}{1-\tilde{p}} \right) = -21.23 + 1.63 \textrm{ Age}\] or \[\tilde{p} = \frac{ \exp( -21.23 + 1.63 \textrm{ Age}) }{ 1 + \exp(-21.23 + 1.63 \textrm{ Age}) }.\]

    Code
    library(ggplot2)
    p <- ggplot(dat_menarche, aes(x = Age, y = p_hat))
    # predicted curve and point-wise 95% CI
    p <- p + geom_ribbon(aes(x = Age, ymin = fit_p_lower, ymax = fit_p_upper), alpha = 0.2)
    p <- p + geom_line(aes(x = Age, y = fit_p), colour = "red")
    # fitted values
    p <- p + geom_point(aes(y = fit_p), size = 2, colour = "red")
    # observed values
    p <- p + geom_point(size = 2)
    p <- p + scale_y_continuous(limits = c(0, 1))
    p <- p + labs(title = paste("Observed and predicted probability of girls reaching menarche,\n",
                                "Warsaw, Poland in 1965", sep="")
                , y = "Probability"
                  )
    print(p)

    If the model holds, then a slope of \(\beta_1=0\) implies that \(p\) does not depend on AGE, i.e., the proportion of girls that have reached menarche is identical across age groups. The Wald p-value for the slope is \(<0.0001\), which leads to rejecting \(H_0:\beta_1=0\) at any of the usual test levels. The proportion of girls that have reached menarche is not constant across age groups. Again, the power of the model is that it gives you a simple way to quantify the effect of age on the proportion reaching menarche. This is more appealing than testing homogeneity across age groups followed by multiple comparisons.

    Wald tests can be performed to test the global null hypothesis, that all non-intercept \(\beta\)s are equal to zero. This is the logistic regression analog of the overall model F-test in ANOVA and regression. The only predictor is AGE, so the implied test is that the slope of the regression line is zero. The Wald test statistic and p-value reported here are identical to the Wald test and p-value for the AGE effect given in the parameter estimates table. The Wald test can also be used to test specific contrasts between parameters.

    Code
    # Testing Global Null Hypothesis
    
    # Typically, we are interested in testing whether individual parameters or
    #   sets of parameters are simultaneously equal to 0
    # However, any null hypothesis values can be included in the vector coef_test_values.
    
    coef_test_values <-
      rep(0, length(coef(glm_m_a)))
    
    library(aod) # for wald.test()
    test_wald <-
      wald.test(
        b     = coef(glm_m_a) - coef_test_values
      , Sigma = vcov(glm_m_a)
      , Terms = c(2)
      )
    test_wald
    Wald test:
    ----------
    
    Chi-squared test:
    X2 = 766.3, df = 1, P(> X2) = 0.0
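With a single term being tested, the Wald chi-square is just the squared z-value from the summary table, so the 766.3 can be checked directly (a Python check; the small discrepancy reflects rounding of the displayed estimate and SE):

```python
# Age coefficient and standard error from the summary table
est, se = 1.63197, 0.05895

z = est / se
print(round(z, 2))       # matches the reported z value, 27.68
print(round(z ** 2, 1))  # close to the reported Wald chi-square, 766.3
```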

    21.4 Example: Leukemia white blood cell types

    Feigl and Zelen2 reported the survival time in weeks and the white cell blood count (WBC) at time of diagnosis for 33 patients who eventually died of acute leukemia. Each person was classified as AG\(+\) or AG\(-\), indicating the presence or absence of a certain morphological characteristic in the white cells. Four variables are given in the data set: WBC, a binary factor or indicator variable AG (1 for AG\(+\), 0 for AG\(-\)), NTOTAL (the number of patients with the given combination of AG and WBC), and NRES (the number of NTOTAL that survived at least one year from the time of diagnosis).

  • 2 Feigl, P., Zelen, M. (1965) Estimation of exponential survival probabilities with concomitant information. Biometrics 21, 826–838. Survival times are given for 33 patients who died from acute myelogenous leukaemia. Also measured was the patient’s white blood cell count at the time of diagnosis. The patients were also factored into 2 groups according to the presence or absence of a morphologic characteristic of white blood cells. Patients termed AG positive were identified by the presence of Auer rods and/or significant granulation of the leukaemic cells in the bone marrow at the time of diagnosis.

  • The researchers are interested in modelling the probability \(p\) of surviving at least one year as a function of WBC and AG. They believe that WBC should be transformed to a log scale, given the skewness in the WBC values.

    Code
    #### Example: Leukemia
    ## Leukemia white blood cell types example
    # ntotal = number of patients with IAG and WBC combination
    # nres   = number surviving at least one year
    # ag     = 1 for AG+, 0 for AG-
    # wbc    = white cell blood count
    # lwbc   = log white cell blood count
    # p_hat  = Empirical Probability
    dat_leuk <-
      read_table("http://statacumen.com/teach/ADA2/notes/ADA2_notes_Ch11_leuk.dat") |>
      mutate(
        ag    = factor(ag)
      , lwbc  = log(wbc)
      , p_hat = nres / ntotal
      )
    Code
    str(dat_leuk)
    tibble [30 × 6] (S3: tbl_df/tbl/data.frame)
     $ ntotal: num [1:30] 1 1 1 1 1 1 1 1 3 1 ...
     $ nres  : num [1:30] 1 1 1 1 1 1 1 1 1 1 ...
     $ ag    : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 1 ...
     $ wbc   : num [1:30] 75 230 260 430 700 940 1000 1050 10000 300 ...
     $ lwbc  : num [1:30] 4.32 5.44 5.56 6.06 6.55 ...
     $ p_hat : num [1:30] 1 1 1 1 1 ...

    As an initial step in the analysis, consider the following model: \[\log \left( \frac{p}{1-p} \right) = \beta_0 + \beta_1 \textrm{ LWBC} + \beta_2 \textrm{ AG},\] where \(\textrm{LWBC}=\log(\textrm{WBC})\). The model is best understood by separating the AG\(+\) and AG\(-\) cases. For AG\(-\) individuals, AG=0 so the model reduces to \[\log \left( \frac{p}{1-p} \right) = \beta_0 + \beta_1 \textrm{ LWBC} + \beta_2*0 = \beta_0 + \beta_1 \textrm{ LWBC}.\] For AG\(+\) individuals, AG=1 and the model implies \[\log \left( \frac{p}{1-p} \right) = \beta_0 + \beta_1 \textrm{ LWBC} + \beta_2*1 = (\beta_0 + \beta_2) + \beta_1 \textrm{ LWBC}.\]

    The model without AG (i.e., \(\beta_2=0\)) is a simple logistic model where the log-odds of surviving one year is linearly related to LWBC, and is independent of AG. The reduced model with \(\beta_2=0\) implies that there is no effect of the AG level on the survival probability once LWBC has been taken into account.

    Including the binary predictor AG in the model implies that there is a linear relationship between the log-odds of surviving one year and \(\textrm{LWBC}\), with a constant slope for the two AG levels. This model includes an effect for the AG morphological factor, but more general models are possible. A natural extension would be to include a product or interaction effect, a point that I will return to momentarily.

    The parameters are easily interpreted: \(\beta_0\) and \(\beta_0+\beta_2\) are intercepts for the population logistic regression lines for AG\(-\) and AG\(+\), respectively. The lines have a common slope, \(\beta_1\). The \(\beta_2\) coefficient for the AG indicator is the difference between intercepts for the AG\(+\) and AG\(-\) regression lines. A picture of the assumed relationship is given below for \(\beta_1 < 0\). The population regression lines are parallel on the logit scale only, but the order between AG groups is preserved on the probability scale.

    [Figure: parallel population logistic regression lines on the logit scale for AG\(-\) and AG\(+\), with common slope \(\beta_1 < 0\)]
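The picture can be reproduced numerically: with a hypothetical \(\beta_1 < 0\) and \(\beta_2 > 0\) (the values below are made up for illustration only), the two groups give parallel lines on the logit scale, and the AG\(+\) curve stays above the AG\(-\) curve on the probability scale at every \(X\):

```python
import math

def inv_logit(eta):
    """Back-transform a log-odds value to a probability."""
    return math.exp(eta) / (1 + math.exp(eta))

# hypothetical parameters: negative common slope, positive AG offset
b0, b1, b2 = 2.0, -0.8, 1.5

for x in range(8):
    p_minus = inv_logit(b0 + b1 * x)        # AG- curve
    p_plus  = inv_logit(b0 + b2 + b1 * x)   # AG+ curve: intercept shifted by b2
    # log-odds differ by exactly b2 at every x; probabilities keep the same order
    assert p_plus > p_minus
print("AG+ curve lies above AG- curve at every x")
```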

    Before looking at output for the equal slopes model, note that the data set has 30 distinct AG and LWBC combinations, or 30 “groups” or samples. Only two samples have more than 1 observation. The majority of the observed proportions surviving at least one year (number surviving \(\geq 1\) year/group sample size) are 0 (i.e., 0/1) or 1 (i.e., 1/1). This sparseness of the data makes it difficult to graphically assess the suitability of the logistic model (because the estimated proportions are almost all 0 or 1). Although significance tests on the regression coefficients do not require large group sizes, the chi-squared approximation to the deviance statistic is suspect in sparse data settings. With small group sizes as we have here, most researchers would not interpret the p-value for \(D\) literally. Instead, they would use the p-values to informally check the fit of the model. Diagnostics would be used to highlight problems with the model.

    Code
    glm_i_l <-
      glm(
        cbind(nres, ntotal - nres) ~ ag + lwbc
      , family = binomial
      , data = dat_leuk
      )
    
    # Test residual deviance for lack-of-fit (if > 0.10, little-to-no lack-of-fit)
    dev_p_val <- 1 - pchisq(glm_i_l$deviance, glm_i_l$df.residual)
    dev_p_val
    [1] 0.6842804

    The large p-value for \(D\) indicates that there are no gross deficiencies with the model. Recall that the Testing Global Null Hypothesis gives p-values for testing the hypothesis that the regression coefficients are zero for each predictor in the model. The two predictors are LWBC and AG, so the small p-values indicate that LWBC or AG, or both, are important predictors of survival. The p-values in the estimates table suggest that LWBC and AG are both important. If either predictor was insignificant, I would consider refitting the model omitting the least significant effect, as in regression.

    Code
    # Testing Global Null Hypothesis
    coef(glm_i_l)
    (Intercept)         ag1        lwbc 
       5.543349    2.519562   -1.108759 
    Code
    # specify which coefficients to test = 0  (Terms = 2:3 is for terms 2 and 3)
    coef_test_values <-
      rep(0, length(coef(glm_i_l)))
    
    library(aod) # for wald.test()
    test_wald <-
      wald.test(
        b     = coef(glm_i_l) - coef_test_values
      , Sigma = vcov(glm_i_l)
      , Terms = c(2, 3)
      )
    test_wald
    Wald test:
    ----------
    
    Chi-squared test:
    X2 = 8.2, df = 2, P(> X2) = 0.017

    Given that the model fits reasonably well, a test of \(H_0:\beta_2=0\) might be a primary interest here. This checks whether the regression lines are identical for the two AG levels, which is a test for whether AG affects the survival probability, after taking LWBC into account. This test is rejected at any of the usual significance levels, suggesting that the AG level affects the survival probability (assuming a very specific model).

    Code
    summary(glm_i_l)
    
    Call:
    glm(formula = cbind(nres, ntotal - nres) ~ ag + lwbc, family = binomial, 
        data = dat_leuk)
    
    Coefficients:
                Estimate Std. Error z value Pr(>|z|)  
    (Intercept)   5.5433     3.0224   1.834   0.0666 .
    ag1           2.5196     1.0907   2.310   0.0209 *
    lwbc         -1.1088     0.4609  -2.405   0.0162 *
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    (Dispersion parameter for binomial family taken to be 1)
    
        Null deviance: 38.191  on 29  degrees of freedom
    Residual deviance: 23.014  on 27  degrees of freedom
    AIC: 30.635
    
    Number of Fisher Scoring iterations: 5

    A plot of the predicted survival probabilities as a function of LWBC, using AG as the plotting symbol, indicates that the probability of surviving at least one year from the time of diagnosis is a decreasing function of LWBC. For a given LWBC the survival probability is greater for AG\(+\) patients than for AG\(-\) patients. This tendency is consistent with the observed proportions, which show little information about the exact form of the trend.

    Code
    # predict() on the link (logit) scale returns fitted logits and standard errors
    fit_logit_pred <-
      predict(
        glm_i_l
      , type = "link"
      , se.fit = TRUE
      ) |>
      as_tibble()
    
    # put the fitted values in the data.frame
    dat_leuk <-
      dat_leuk |>
      mutate(
        fit_logit    = fit_logit_pred$fit
      , fit_logit_se = fit_logit_pred$se.fit
      # fitted probability: back-transform the fitted logit to the probability scale
      , fit_p        = exp(fit_logit) / (1 + exp(fit_logit))
      # CI for p fitted values
      , fit_p_lower  = exp(fit_logit - 1.96 * fit_logit_se) / (1 + exp(fit_logit - 1.96 * fit_logit_se))
      , fit_p_upper  = exp(fit_logit + 1.96 * fit_logit_se) / (1 + exp(fit_logit + 1.96 * fit_logit_se))
      )
    
    
    library(ggplot2)
    p <- ggplot(dat_leuk, aes(x = lwbc, y = p_hat, colour = ag, fill = ag))
    # predicted curve and point-wise 95% CI
    p <- p + geom_ribbon(aes(x = lwbc, ymin = fit_p_lower, ymax = fit_p_upper), alpha = 0.2)
    p <- p + geom_line(aes(x = lwbc, y = fit_p))
    # fitted values
    p <- p + geom_point(aes(y = fit_p), size=2)
    # observed values
    p <- p + geom_point(size = 2, alpha = 0.5)
    p <- p + labs(title = "Observed and predicted probability of 1+ year survival"
                , y = "Probability"
                  )
    print(p)


    The estimated survival probabilities satisfy \[\log \left( \frac{\tilde{p}}{1-\tilde{p}} \right) = 5.54 - 1.11\textrm{ LWBC} + 2.52 \textrm{ AG}.\] For AG\(-\) individuals with AG=0, this reduces to \[\log \left( \frac{\tilde{p}}{1-\tilde{p}} \right) = 5.54 - 1.11\textrm{ LWBC},\] or equivalently, \[\tilde{p} = \frac{ \exp(5.54 - 1.11 \textrm{ LWBC} ) } { 1 + \exp(5.54 - 1.11 \textrm{ LWBC} ) }.\] For AG\(+\) individuals with AG=1, \[\log \left( \frac{\tilde{p}}{1-\tilde{p}} \right) = 5.54 - 1.11\textrm{ LWBC} + 2.52(1) = 8.06 - 1.11\textrm{ LWBC},\] or \[\tilde{p} = \frac{ \exp(8.06 - 1.11 \textrm{ LWBC} ) } { 1 + \exp(8.06 - 1.11 \textrm{ LWBC} ) }.\]

    Using the logit scale, the difference between AG\(+\) and AG\(-\) individuals in the estimated log-odds of surviving at least one year, at a fixed but arbitrary LWBC, is the estimated AG regression coefficient \[( 8.06 - 1.11 \textrm{ LWBC} ) - ( 5.54 - 1.11 \textrm{ LWBC} ) = 2.52.\] Using properties of exponential functions, the odds that an AG\(+\) patient lives at least one year is \(\exp(2.52)=12.42\) times larger than the odds that an AG\(-\) patient lives at least one year, regardless of LWBC.

    This summary, and a CI for the AG odds ratio, are given in the odds ratio table below. Similarly, the estimated odds ratio of 0.33 for LWBC implies that the odds of surviving at least one year are reduced by a factor of about 3 (\(1/0.33\)) for each unit increase of LWBC.
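    The arithmetic behind these odds ratios is just exponentiation of the estimated coefficients. A minimal check in R, with the coefficient values copied from the fitted model above:

    ```r
    # Recovering the odds ratios quoted above from the coefficient
    # estimates (values taken from the fitted model in this section)
    b_ag   <- 2.52    # AG coefficient (log-odds scale)
    b_lwbc <- -1.11   # LWBC coefficient

    exp(b_ag)         # ~12.43: odds of 1+ year survival, AG+ vs AG-
    exp(b_lwbc)       # ~0.33:  multiplicative change in odds per unit LWBC
    1 / exp(b_lwbc)   # ~3.0:   equivalently, odds reduced ~3-fold per unit LWBC
    ```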

    We can use the confint() function to obtain confidence intervals for the coefficient estimates. Note that for logistic models, confidence intervals are based on the profiled log-likelihood function.

    Code
    ## CIs using profiled log-likelihood
    confint(glm_i_l)
    Waiting for profiling to be done...
                     2.5 %     97.5 %
    (Intercept)  0.1596372 12.4524409
    ag1          0.5993391  5.0149271
    lwbc        -2.2072275 -0.3319512

    We can also get CIs based on just the standard errors by using the default method.

    Code
    ## CIs using standard errors
    confint.default(glm_i_l)
                     2.5 %    97.5 %
    (Intercept) -0.3804137 11.467112
    ag1          0.3818885  4.657236
    lwbc        -2.0121879 -0.205330

    You can also exponentiate the coefficients and confidence interval bounds and interpret them as odds-ratios.

    Code
    ## coefficients and 95% CI
    cbind(OR = coef(glm_i_l), confint(glm_i_l))
    Waiting for profiling to be done...
                       OR      2.5 %     97.5 %
    (Intercept)  5.543349  0.1596372 12.4524409
    ag1          2.519562  0.5993391  5.0149271
    lwbc        -1.108759 -2.2072275 -0.3319512
    Code
    ## odds ratios and 95% CI
    exp(cbind(OR = coef(glm_i_l), confint(glm_i_l)))
    Waiting for profiling to be done...
                         OR     2.5 %       97.5 %
    (Intercept) 255.5323676 1.1730851 2.558741e+05
    ag1          12.4231582 1.8209149 1.506452e+02
    lwbc          0.3299682 0.1100052 7.175224e-01

    Although the equal slopes model appears to fit well, a more general model might fit better. A natural generalization here would be to add an interaction, or product term, \(\textrm{AG}\times\textrm{LWBC}\) to the model. The logistic model with an AG effect and the \(\textrm{AG}\times\textrm{LWBC}\) interaction is equivalent to fitting separate logistic regression lines to the two AG groups. This interaction model provides an easy way to test whether the slopes are equal across AG levels. I will note that the interaction term is not needed here.

    21.4.1 Interpreting odds ratios in logistic regression

    Let’s begin with probability3. Let’s say that the probability of success is 0.8, thus \(p = 0.8\). Then the probability of failure is \(q = 1 - p = 0.2\). The odds of success are defined as \(\textrm{odds(success)} = p/q = 0.8/0.2 = 4\), that is, the odds of success are 4 to 1. The odds of failure would be \(\textrm{odds(failure)} = q/p = 0.2/0.8 = 0.25\), that is, the odds of failure are 1 to 4. Next, let’s compute the odds ratio by \(\textrm{OR} = \textrm{odds(success)/odds(failure)} = 4/0.25 = 16\). The interpretation of this odds ratio would be that the odds of success are 16 times greater than for failure. Now if we had formed the odds ratio the other way around with odds of failure in the numerator, we would have gotten something like this, \(\textrm{OR} = \textrm{odds(failure)/odds(success)} = 0.25/4 = 0.0625\).

  • 3 Borrowed graciously from UCLA Academic Technology Services at http://www.ats.ucla.edu/stat/sas/faq/oratio.htm

    Interestingly enough, the interpretation of the second odds ratio conveys the same information as the first. Here the interpretation is that the odds of failure are one-sixteenth the odds of success. In fact, if you take the reciprocal of the first odds ratio you get \(1/16 = 0.0625\).
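    These hand calculations are easy to reproduce in R:

    ```r
    # The odds computations from the paragraph above, as a quick check
    p <- 0.8                      # probability of success
    q <- 1 - p                    # probability of failure
    odds_success <- p / q         # 4
    odds_failure <- q / p         # 0.25
    odds_success / odds_failure   # OR = 16
    odds_failure / odds_success   # reciprocal OR = 0.0625
    ```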

    21.4.1.0.0.1 Another example

    This example is adapted from Pedhazur (1997). Suppose that seven out of 10 males are admitted to an engineering school while three of 10 females are admitted. The probabilities for admitting a male are, \(p = 7/10 = 0.7\) and \(q = 1 - 0.7 = 0.3\). Here are the same probabilities for females, \(p = 3/10 = 0.3\) and \(q = 1 - 0.3 = 0.7\). Now we can use the probabilities to compute the admission odds for both males and females, \(\textrm{odds(male)} = 0.7/0.3 = 2.33333\) and \(\textrm{odds(female)} = 0.3/0.7 = 0.42857\). Next, we compute the odds ratio for admission, \(\textrm{OR} = 2.3333/0.42857 = 5.44\). Thus, the odds of a male being admitted are 5.44 times greater than for a female.
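    The same calculation in R:

    ```r
    # Admission odds ratio from the Pedhazur example above
    p_male   <- 7 / 10
    p_female <- 3 / 10
    odds_male   <- p_male / (1 - p_male)       # 2.3333
    odds_female <- p_female / (1 - p_female)   # 0.42857
    OR <- odds_male / odds_female
    round(OR, 2)                               # 5.44
    ```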

    21.4.1.0.0.2 Leukemia example

    In the example above, the odds of surviving at least one year are 12.43 times greater for AG\(+\) than for AG\(-\) patients, and are multiplied by 0.33 (that is, decreased) for each unit increase in LWBC.

    21.5 Example: Mortality of confused flour beetles

    This example illustrates a quadratic logistic model.

    The aim of an experiment originally reported by Strand (1930) and quoted by Bliss (1935) was to assess the response of the confused flour beetle, Tribolium confusum, to gaseous carbon disulphide (CS\(_2\)). In the experiment, prescribed volumes of liquid carbon disulphide were added to flasks in which a tubular cloth cage containing a batch of about thirty beetles was suspended. Duplicate batches of beetles were used for each concentration of CS\(_2\). At the end of a five-hour period, the proportion killed was recorded and the actual concentration of gaseous CS\(_2\) in the flask, measured in mg/l, was determined by a volumetric analysis. The mortality data are given in the table below.

    Code
    #### Example: Beetles
    ## Beetles data set
    # conc = CS2 concentration
    # y    = number of beetles killed
    # n    = number of beetles exposed
    # rep  = Replicate number (1 or 2)
    dat_beetles <-
      read_table2("http://statacumen.com/teach/ADA2/notes/ADA2_notes_Ch11_beetles.dat") |>
      mutate(
        rep = factor(rep)
      )
    Code
    dat_beetles <-
      dat_beetles |>
      mutate(
        conc2 = conc^2 # for quadratic term (making coding a little easier)
      , p_hat = y / n # observed proportion of successes
        # empirical logits
      , emp_logit = log((p_hat + 0.5 / n) / (1 - p_hat + 0.5 / n))
      )
    
    #str(dat_beetles)

    Plot the observed probability of mortality and the empirical logits with linear and quadratic LS fits (which are not the same as the logistic MLE fits).

    Code
    library(ggplot2)
    p <- ggplot(dat_beetles, aes(x = conc, y = p_hat, shape = rep))
    # observed values
    p <- p + geom_point(color = "black", size = 3, alpha = 0.5)
    p <- p + labs(title = "Observed mortality, probability scale")
    print(p)

    Code
    library(ggplot2)
    p <- ggplot(dat_beetles, aes(x = conc, y = emp_logit))
    p <- p + geom_smooth(method = "lm", colour = "red", se = FALSE)
    #p <- p + geom_smooth(method = "lm", formula = y ~ poly(x, 2), colour = "blue", se = FALSE)
    p <- p + geom_smooth(method = "lm", formula = y ~ x + I(x^2), colour = "blue", se = FALSE)
    # observed values
    p <- p + geom_point(aes(shape = rep), color = "black", size = 3, alpha = 0.5)
    p <- p + labs(title = "Empirical logit with 'naive' LS fits (not MLE)")
    print(p)
    `geom_smooth()` using formula = 'y ~ x'

    In a number of articles that refer to these data, the responses from the first two concentrations are omitted because of apparent non-linearity. Bliss himself remarks that

    …in comparison with the remaining observations, the two lowest concentrations gave an exceptionally high kill. Over the remaining concentrations, the plotted values seemed to form a moderately straight line, so that the data were handled as two separate sets, only the results at 56.91 mg of CS\(_2\) per litre being included in both sets.

    However, there does not appear to be any biological motivation for this and so here they are retained in the data set.

    Combining the data from the two replicates and plotting the empirical logit of the observed proportions against concentration gives a relationship that is better fit by a quadratic than a linear relationship, \[\begin{aligned} \log \left( \frac{p}{1-p} \right) & = & \beta_0 + \beta_1 X + \beta_2 X^2 . \nonumber \end{aligned}\] The plots below show the linear and quadratic model fits to the observed values with point-wise 95% confidence bands, first on the probability scale and then on the logit scale.

    Code
    # fit logistic regression to create lines on plots below
    # linear
    glm_beetles1 <-
      glm(
        cbind(y, n - y) ~ conc
      , family = binomial
      , data = dat_beetles
      )
    # quadratic
    glm_beetles2 <-
      glm(
        cbind(y, n - y) ~ conc + conc2
      , family = binomial
      , data = dat_beetles
      )
    
    ## put model fits for two models together
    
    # First, linear relationship
    fit_logit_pred <-
      predict(
        glm_beetles1
      , type = "link"
      , se.fit = TRUE
      ) |>
      as_tibble()
    
    dat_beetles1 <-
      dat_beetles |>
      mutate(
        fit_logit    = fit_logit_pred$fit
      , fit_logit_se = fit_logit_pred$se.fit
      # added "fit_p" to make predictions
      , fit_p        = exp(fit_logit) / (1 + exp(fit_logit))
      # CI for p fitted values
      , fit_p_lower  = exp(fit_logit - 1.96 * fit_logit_se) / (1 + exp(fit_logit - 1.96 * fit_logit_se))
      , fit_p_upper  = exp(fit_logit + 1.96 * fit_logit_se) / (1 + exp(fit_logit + 1.96 * fit_logit_se))
      , modelorder   = "linear"
      )
    
    # Second, quadratic relationship
    fit_logit_pred <-
      predict(
        glm_beetles2
      , type = "link"
      , se.fit = TRUE
      ) |>
      as_tibble()
    
    dat_beetles2 <-
      dat_beetles |>
      mutate(
        fit_logit    = fit_logit_pred$fit
      , fit_logit_se = fit_logit_pred$se.fit
      # added "fit_p" to make predictions
      , fit_p        = exp(fit_logit) / (1 + exp(fit_logit))
      # CI for p fitted values
      , fit_p_lower  = exp(fit_logit - 1.96 * fit_logit_se) / (1 + exp(fit_logit - 1.96 * fit_logit_se))
      , fit_p_upper  = exp(fit_logit + 1.96 * fit_logit_se) / (1 + exp(fit_logit + 1.96 * fit_logit_se))
      , modelorder   = "quadratic"
      )
    
    
    dat_beetles_all <-
      dat_beetles1 |>
      bind_rows(dat_beetles2) |>
      mutate(
        modelorder = factor(modelorder)
      )
    
    # plot on logit and probability scales
    library(ggplot2)
    p <- ggplot(dat_beetles_all, aes(x = conc, y = p_hat, shape = rep, colour = modelorder, fill = modelorder))
    # predicted curve and point-wise 95% CI
    p <- p + geom_ribbon(aes(x = conc, ymin = fit_p_lower, ymax = fit_p_upper), linetype = 0, alpha = 0.1)
    p <- p + geom_line(aes(x = conc, y = fit_p, linetype = modelorder), size = 1)
    Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
    ℹ Please use `linewidth` instead.
    Code
    # fitted values
    p <- p + geom_point(aes(y = fit_p), size=2)
    # observed values
    p <- p + geom_point(color = "black", size = 3, alpha = 0.5)
    p <- p + labs(title = "Observed and predicted mortality, probability scale")
    print(p)

    Code
    library(ggplot2)
    p <- ggplot(dat_beetles_all, aes(x = conc, y = emp_logit, shape = rep, colour = modelorder, fill = modelorder))
    # predicted curve and point-wise 95% CI
    p <- p + geom_ribbon(aes(x = conc, ymin = fit_logit - 1.96 * fit_logit_se
                                     , ymax = fit_logit + 1.96 * fit_logit_se), linetype = 0, alpha = 0.1)
    p <- p + geom_line(aes(x = conc, y = fit_logit, linetype = modelorder), size = 1)
    # fitted values
    p <- p + geom_point(aes(y = fit_logit), size=2)
    # observed values
    p <- p + geom_point(color = "black", size = 3, alpha = 0.5)
    p <- p + labs(title = "Observed and predicted mortality, logit scale")
    print(p)

    21.6 Example: The UNM Trauma Data

    The data to be analyzed here were collected on 3132 patients admitted to The University of New Mexico Trauma Center between the years 1991 and 1994. For each patient, the attending physician recorded their age, their revised trauma score (RTS), their injury severity score (ISS), whether their injuries were blunt (i.e., the result of a car crash: BP=0) or penetrating (i.e., gunshot/knife wounds: BP=1), and whether they eventually survived their injuries (SURV=0 if not, SURV=1 if survived). Approximately 10% of patients admitted to the UNM Trauma Center eventually die from their injuries.

    The ISS is an overall index of a patient’s injuries, based on the approximately 1300 injuries cataloged in the Abbreviated Injury Scale. The ISS can take on values from 0 for a patient with no injuries to 75 for a patient with 3 or more life threatening injuries. The ISS is the standard injury index used by trauma centers throughout the U.S. The RTS is an index of physiologic injury, and is constructed as a weighted average of an incoming patient’s systolic blood pressure, respiratory rate, and Glasgow Coma Scale. The RTS can take on values from 0 for a patient with no vital signs to 7.84 for a patient with normal vital signs.

    Champion et al. (1981) proposed a logistic regression model to estimate the probability of a patient’s survival as a function of RTS, the injury severity score ISS, and the patient’s age, which is used as a surrogate for physiologic reserve. Subsequent survival models included the binary effect BP as a means to differentiate between blunt and penetrating injuries.

    We will develop a logistic model for predicting survival from ISS, AGE, BP, and RTS, and nine body regions. Data on the number of severe injuries in each of the nine body regions is also included in the database, so we will also assess whether these features have any predictive power. The following labels were used to identify the number of severe injuries in the nine regions: AS = head, BS = face, CS = neck, DS = thorax, ES = abdomen, FS = spine, GS = upper extremities, HS = lower extremities, and JS = skin.

    Code
    #### Example: UNM Trauma Data
    
    ## Variables
    # surv = survival (1 if survived, 0 if died)
    # rts  = revised trauma score (range: 0 no vital signs to 7.84 normal vital signs)
    # iss  = injury severity score (0 no injuries to 75 for 3 or more life threatening injuries)
    # bp   = blunt or penetrating injuries (e.g., car crash BP=0 vs gunshot/knife wounds BP=1)
    # Severe injuries: add the severe injuries 3--6 to make summary variables
    
    dat_trauma <-
      read_table2("http://statacumen.com/teach/ADA2/notes/ADA2_notes_Ch11_trauma.dat") |>
      mutate(
        as = a3 + a4 + a5 + a6    # as = head
      , bs = b3 + b4 + b5 + b6    # bs = face
      , cs = c3 + c4 + c5 + c6    # cs = neck
      , ds = d3 + d4 + d5 + d6    # ds = thorax
      , es = e3 + e4 + e5 + e6    # es = abdomen
      , fs = f3 + f4 + f5 + f6    # fs = spine
      , gs = g3 + g4 + g5 + g6    # gs = upper extremities
      , hs = h3 + h4 + h5 + h6    # hs = lower extremities
      , js = j3 + j4 + j5 + j6    # js = skin
      ) |>
      # keep only columns of interest
      select(
        id, surv, prob, age, as:js, iss:rts
      )
    head(dat_trauma)
    # A tibble: 6 × 17
           id  surv  prob   age    as    bs    cs    ds    es    fs    gs    hs
        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
    1 1238385     1 0.991    13     0     0     0     1     0     0     0     0
    2 1238393     1 0.995    23     0     0     0     0     0     0     0     0
    3 1238898     1 0.995    43     0     0     0     0     0     0     2     0
    4 1239516     1 0.962    17     1     0     0     0     0     0     0     0
    5 1239961     1 0.934    20     1     0     0     0     0     0     0     0
    6 1240266     1 0.995    32     0     0     0     0     0     0     0     1
    # ℹ 5 more variables: js <dbl>, iss <dbl>, iciss <dbl>, bp <dbl>, rts <dbl>
    Code
    #str(dat_trauma)

    I made side-by-side boxplots of the distributions of ISS, AGE, RTS, and AS through JS for the survivors and non-survivors. In several body regions the number of injuries is limited, so these boxplots are not very enlightening. Survivors tend to have lower ISS scores, tend to be slightly younger, tend to have higher RTS scores, and tend to have fewer severe head (AS) and abdomen (ES) injuries than non-survivors. The importance of each effect individually for predicting survival is directly related to the separation between the survivors' and non-survivors' scores.

    Code
    # Create boxplots for each variable by survival
    dat_trauma_long <-
      dat_trauma |>
      pivot_longer(
        cols = -one_of("id", "surv", "prob")
      , names_to  = "variable"
      , values_to = "value"
      )
    
    # Plot the data using ggplot
    library(ggplot2)
    p <- ggplot(dat_trauma_long, aes(x = factor(surv), y = value))
    # boxplot, size=.75 to stand out behind CI
    p <- p + geom_boxplot(size = 0.75, alpha = 0.5)
    # points for observed data
    p <- p + geom_point(position = position_jitter(w = 0.05, h = 0), alpha = 0.1)
    # diamond at mean for each group
    p <- p + stat_summary(fun.y = mean, geom = "point", shape = 18, size = 6,
                          alpha = 0.75, colour = "red")
    Warning: The `fun.y` argument of `stat_summary()` is deprecated as of ggplot2 3.3.0.
    ℹ Please use the `fun` argument instead.
    Code
    # confidence limits based on normal distribution
    p <- p + stat_summary(fun.data = "mean_cl_normal", geom = "errorbar",
                          width = .2, alpha = 0.8)
    p <- p + facet_wrap( ~ variable, scales = "free_y", ncol = 4)
    p <- p + labs(title = "Boxplots of variables by survival")
    print(p)

    21.6.1 Selecting Predictors in the Trauma Data

    The same automated methods for model building are available for the glm() procedure, including backward elimination, forward selection, and stepwise methods, among others. Below we perform a stepwise selection using AIC, starting from the full model with 13 effects: ISS, BP, RTS, AGE, and AS–JS. Revisit Chapter @ch:ADA2_notes_10_ModelSelection for more information.

    In our previous logistic regression analyses, the cases in the data set were pre-aggregated into groups of observations having identical levels of the predictor variables. The numbers of cases in the success category and the group sample sizes were specified in the model statement, along with the names of the predictors. The trauma data set, which is not reproduced here, is raw data consisting of one record per patient (i.e., 3132 lines). The logistic model is fitted to the individual cases by specifying the response as the pair of success and failure counts, cbind(surv, 1 - surv), with the predictors on the right-hand side of the formula. Keep in mind that we are defining the logistic model to model the success category, so we are modeling the probability of surviving.

    As an aside, there are two easy ways to model the probability of dying (which we don’t do below). The first is to swap the order the response is specified in the formula: cbind(1 - surv, surv). The second is to convert a model for the log-odds of surviving to a model for the log-odds of dying by simply changing the sign of each regression coefficient in the model.
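    A quick check of the sign-change claim on simulated data (not the trauma data): swapping the response order in cbind() converts the model for surviving into the model for dying, negating every coefficient.

    ```r
    # Simulated check (not the trauma data): swapping the response order in
    # cbind() negates every coefficient of the fitted logistic model.
    set.seed(42)
    n <- 200
    x <- rnorm(n)
    y <- rbinom(n, 1, plogis(0.5 + 1.2 * x))

    fit_surv <- glm(cbind(y, 1 - y) ~ x, family = binomial)  # log-odds of y = 1
    fit_die  <- glm(cbind(1 - y, y) ~ x, family = binomial)  # log-odds of y = 0

    coef(fit_surv) + coef(fit_die)   # ~ (0, 0): coefficients are negatives
    ```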

    I only included the summary table from the backward elimination, and information on the fit of the selected model.

    Code
    glm_tr <-
      glm(
        cbind(surv, 1 - surv) ~
          as + bs + cs + ds + es + fs + gs + hs + js + iss + rts + age + bp
      , family = binomial
      , data = dat_trauma
      )
    
    # option: trace = 0 doesn't show each step of the automated selection
    glm_tr_red_AIC <-
      step(
        glm_tr
      , direction = "both"
      , trace = 0
      )
    
    # the anova object provides a summary of the selection steps in order
    glm_tr_red_AIC$anova
      Step Df   Deviance Resid. Df Resid. Dev      AIC
    1      NA         NA      3118   869.4309 897.4309
    2 - as  1 0.04911674      3119   869.4800 895.4800
    3 - bs  1 0.12242766      3120   869.6024 893.6024
    4 - fs  1 0.15243093      3121   869.7549 891.7549
    5 - cs  1 0.78703127      3122   870.5419 890.5419
    6 - ds  1 0.81690443      3123   871.3588 889.3588
    7 - gs  1 1.18272567      3124   872.5415 888.5415
    8 - hs  1 1.17941462      3125   873.7209 887.7209
    9 - js  1 1.71204029      3126   875.4330 887.4330

    The final model includes effects for ES (number of severe abdominal injuries), ISS, RTS, AGE, and BP. All of the effects in the selected model are significant at the 5% level.

    Code
    summary(glm_tr_red_AIC)
    
    Call:
    glm(formula = cbind(surv, 1 - surv) ~ es + iss + rts + age + 
        bp, family = binomial, data = dat_trauma)
    
    Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
    (Intercept)  0.355845   0.442943   0.803   0.4218    
    es          -0.461317   0.110098  -4.190 2.79e-05 ***
    iss         -0.056920   0.007411  -7.680 1.59e-14 ***
    rts          0.843143   0.055339  15.236  < 2e-16 ***
    age         -0.049706   0.005291  -9.394  < 2e-16 ***
    bp          -0.635137   0.249597  -2.545   0.0109 *  
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    (Dispersion parameter for binomial family taken to be 1)
    
        Null deviance: 1825.37  on 3131  degrees of freedom
    Residual deviance:  875.43  on 3126  degrees of freedom
    AIC: 887.43
    
    Number of Fisher Scoring iterations: 7

    The p-value for \(D\) is large, indicating no gross deficiencies with the selected model.

    Code
    # Test residual deviance for lack-of-fit (if > 0.10, little-to-no lack-of-fit)
    dev_p_val <- 1 - pchisq(glm_tr_red_AIC$deviance, glm_tr_red_AIC$df.residual)
    dev_p_val
    [1] 1

    Letting \(p\) be the probability of survival, the estimated survival probability is given by \[\begin{aligned} \log \left( \frac{\tilde p}{1-\tilde p} \right) & = & 0.3558 -0.4613 \textrm{ ES} -0.6351 \textrm{ BP} -0.0569 \textrm{ ISS} \nonumber\\ %=== & & +0.8431 \textrm{ RTS} -0.0497 \textrm{ AGE}. \nonumber \end{aligned}\] Let us interpret the sign of the coefficients, and the odds ratios, in terms of the impact that individual predictors have on the survival probability.

    Code
    ## coefficients and 95% CI
    cbind(OR = coef(glm_tr_red_AIC), confint(glm_tr_red_AIC))
    Waiting for profiling to be done...
                         OR       2.5 %      97.5 %
    (Intercept)  0.35584499 -0.51977300  1.21869015
    es          -0.46131679 -0.67603693 -0.24307991
    iss         -0.05691973 -0.07159539 -0.04249502
    rts          0.84314317  0.73817886  0.95531089
    age         -0.04970641 -0.06020882 -0.03943822
    bp          -0.63513735 -1.12051508 -0.14005288
    Code
    ## odds ratios and 95% CI
    exp(cbind(OR = coef(glm_tr_red_AIC), confint(glm_tr_red_AIC)))
    Waiting for profiling to be done...
                       OR     2.5 %    97.5 %
    (Intercept) 1.4273863 0.5946555 3.3827539
    es          0.6304529 0.5086287 0.7842088
    iss         0.9446699 0.9309075 0.9583952
    rts         2.3236592 2.0921220 2.5994786
    age         0.9515087 0.9415679 0.9613293
    bp          0.5298627 0.3261118 0.8693123
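    As an illustration, the fitted equation above can be evaluated for a hypothetical patient using plogis(), R's inverse-logit function; the covariate values below are made up, and the coefficients are the rounded values from the fitted equation.

    ```r
    # Estimated survival probability for a hypothetical patient, using the
    # rounded coefficients from the fitted equation above.
    # Hypothetical values: ES = 0, BP = 0 (blunt), ISS = 10, RTS = 7.84, AGE = 30
    logit_p <- 0.3558 - 0.4613 * 0 - 0.6351 * 0 - 0.0569 * 10 +
      0.8431 * 7.84 - 0.0497 * 30
    plogis(logit_p)   # inverse logit; about 0.99
    ```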

    21.6.2 Checking Predictions for the Trauma Model

    To assess the ability of the selected model to accurately predict survival, assume that the two types of errors (a false positive prediction of survival, and a false negative prediction) have equal costs. Under this assumption, the optimal classification rule is to predict survival for an individual with an estimated survival probability of 0.50 or larger.

    In the script below, a table is generated that gives classifications for cutoff probabilities thresh\(= 0.0, 0.1, \ldots, 1.0\) based on the selected model. While I do not do this below, it is common to use the jackknife method for assessing classification accuracy. To implement the jackknife method, each observation is temporarily held out and the selected model is fitted to the remaining (3131) cases. This leads to an estimated survival probability for the case, which is compared to the actual result. The results are summarized in terms of total number of correct and incorrect classifications for each possible outcome. In the table below, the columns labeled Event (a survival) and Non-Event (a death) refer to the classification of observations, and not to the actual outcomes. The columns labeled Correct and Incorrect identify whether the classifications are accurate.
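    The jackknife procedure described above can be sketched as follows, here on simulated data rather than the trauma data:

    ```r
    # Sketch of jackknife (leave-one-out) classification on simulated data
    set.seed(1)
    n <- 60
    x <- rnorm(n)
    y <- rbinom(n, 1, plogis(-0.5 + 1.5 * x))
    dat <- data.frame(x = x, y = y)

    loo_pred <- numeric(n)
    for (i in seq_len(n)) {
      # temporarily hold out case i and refit the model to the remaining cases
      fit_i <- glm(y ~ x, family = binomial, data = dat[-i, ])
      loo_pred[i] <- predict(fit_i, newdata = dat[i, , drop = FALSE], type = "response")
    }
    # classify with the 0.5 cutoff, as in the text, and summarize accuracy
    loo_class <- as.integer(loo_pred >= 0.5)
    acc <- mean(loo_class == dat$y)
    acc
    ```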

    Code
    # thresholds for classification given model proportion predictions for each observation
    thresh  <- seq(0, 1, by = 0.1)
    
    # predicted probabilities
    Yhat <- fitted(glm_tr_red_AIC)
    
    # Name: lower (0) = NonEvent, higher (1) = Event
    YObs <-
      cut(
        dat_trauma$surv
      , breaks = c(-Inf, mean(dat_trauma$surv), Inf)
      , labels = c("NonEvent", "Event")
      )
    
    classify_table <-
      data.frame(
        Thresh     = rep(NA, length(thresh))
      , Cor_Event  = rep(NA, length(thresh))
      , Cor_NonEv  = rep(NA, length(thresh))
      , Inc_Event  = rep(NA, length(thresh))
      , Inc_NonEv  = rep(NA, length(thresh))
      , Cor_All    = rep(NA, length(thresh))
      , Sens       = rep(NA, length(thresh))
      , Spec       = rep(NA, length(thresh))
      , Fal_P      = rep(NA, length(thresh))
      , Fal_N      = rep(NA, length(thresh))
      )
    
    for (i_thresh in 1:length(thresh)) {
      # choose a threshold for dichotomizing according to predicted probability
      YhatPred <-
        cut(
          Yhat
        , breaks = c(-Inf, thresh[i_thresh], Inf)
        , labels = c("NonEvent", "Event")
        )
    
      # contingency table and marginal sums
      cTab <- table(YhatPred, YObs)
      addmargins(cTab)
    
      # Classification Table
      classify_table$Thresh   [i_thresh] <- thresh[i_thresh]                  # Prob_Level
      classify_table$Cor_Event[i_thresh] <- cTab[2,2]                         # Correct_Event
      classify_table$Cor_NonEv[i_thresh] <- cTab[1,1]                         # Correct_NonEvent
      classify_table$Inc_Event[i_thresh] <- cTab[2,1]                         # Incorrect_Event
      classify_table$Inc_NonEv[i_thresh] <- cTab[1,2]                         # Incorrect_NonEvent
      classify_table$Cor_All  [i_thresh] <- 100 * sum(diag(cTab)) / sum(cTab) # Correct_Overall
      classify_table$Sens     [i_thresh] <- 100 * cTab[2,2] / sum(cTab[,2])   # Sensitivity
      classify_table$Spec     [i_thresh] <- 100 * cTab[1,1] / sum(cTab[,1])   # Specificity
      classify_table$Fal_P    [i_thresh] <- 100 * cTab[2,1] / sum(cTab[2,])   # False_Pos
      classify_table$Fal_N    [i_thresh] <- 100 * cTab[1,2] / sum(cTab[1,])   # False_Neg
    }
    classify_table |> round(2)
       Thresh Cor_Event Cor_NonEv Inc_Event Inc_NonEv Cor_All   Sens   Spec
    1     0.0      2865         0       267         0   91.48 100.00   0.00
    2     0.1      2861        79       188         4   93.87  99.86  29.59
    3     0.2      2856       105       162         9   94.54  99.69  39.33
    4     0.3      2848       118       149        17   94.70  99.41  44.19
    5     0.4      2837       125       142        28   94.57  99.02  46.82
    6     0.5      2825       139       128        40   94.64  98.60  52.06
    7     0.6      2805       157       110        60   94.57  97.91  58.80
    8     0.7      2774       174        93        91   94.13  96.82  65.17
    9     0.8      2727       196        71       138   93.33  95.18  73.41
    10    0.9      2627       229        38       238   91.19  91.69  85.77
    11    1.0         0       267         0      2865    8.52   0.00 100.00
       Fal_P Fal_N
    1   8.52   NaN
    2   6.17  4.82
    3   5.37  7.89
    4   4.97 12.59
    5   4.77 18.30
    6   4.33 22.35
    7   3.77 27.65
    8   3.24 34.34
    9   2.54 41.32
    10  1.43 50.96
    11   NaN 91.48


    The data set has 2865 survivors and 267 people that died. Using a 0.50 cutoff, 2825 of the survivors would be correctly identified, and 40 misclassified. Similarly, 139 of the patients that died would be correctly classified and 128 would not. The overall percentage of cases correctly classified is \((2825+139)/3132 = 94.6\%\). The sensitivity is the percentage of survivors that are correctly classified, \(2825/(2825+40)=98.6\%\). The specificity is the percentage of patients that died that are correctly classified, \(139/(139+128)=52.1\%\). The false positive rate, which is the % of those predicted to survive that did not, is \(128/(128+2825)=4.3\%\). The false negative rate, which is the % of those predicted to die that did not, is \(40/(40+139)=22.3\%\).
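    These rates can be recomputed directly from the counts in the threshold-0.5 confusion table:

    ```r
    # Classification summaries from the threshold-0.5 confusion table
    # (counts taken from the classification table above)
    cor_event <- 2825; inc_nonev <- 40    # survivors: correct / misclassified
    cor_nonev <- 139;  inc_event <- 128   # deaths:    correct / misclassified

    (cor_event + cor_nonev) / 3132        # overall accuracy, ~0.946
    cor_event / (cor_event + inc_nonev)   # sensitivity, ~0.986
    cor_nonev / (cor_nonev + inc_event)   # specificity, ~0.521
    inc_event / (inc_event + cor_event)   # false positive rate, ~0.043
    inc_nonev / (inc_nonev + cor_nonev)   # false negative rate, ~0.223
    ```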

    Code
    # Thresh = 0.5 classification table
    YhatPred <-
      cut(
        Yhat
      , breaks = c(-Inf, 0.5, Inf)
      , labels = c("NonEvent", "Event")
      )
    # contingency table and marginal sums
    cTab <- table(YhatPred, YObs)
    addmargins(cTab)
              YObs
    YhatPred   NonEvent Event  Sum
      NonEvent      139    40  179
      Event         128  2825 2953
      Sum           267  2865 3132
    Code
    classify_table |> filter(Thresh == 0.5) |> round(2)
      Thresh Cor_Event Cor_NonEv Inc_Event Inc_NonEv Cor_All Sens  Spec Fal_P
    1    0.5      2825       139       128        40   94.64 98.6 52.06  4.33
      Fal_N
    1 22.35

    The misclassification rate seems small, but you should remember that approximately 10% of patients admitted to UNM eventually die from their injuries. Given this historical information only, you could achieve a 10% misclassification rate by completely ignoring the data and classifying each admitted patient as a survivor. Using the data cuts the misclassification rate roughly in half (from about 10% down to 5.4%), which is an important reduction in this problem.

    21.7 Historical Example: O-Ring Data

    The table below presents data from Dalal, Fowlkes, and Hoadley (1989) on field O-ring failures in the 23 pre-Challenger space shuttle launches. Temperature at lift-off and O-ring joint pressure are predictors. The binary version of the data views each flight as an independent trial. The result of a trial (\(y\)) is a 1 if at least one field O-ring failure occurred on the flight and a 0 if all six O-rings functioned properly. A regression for the binary data would model the probability that any O-ring failed as a function of temperature and pressure. The binomial version of these data is also presented. It views the six field O-rings on each flight as independent trials. A regression for the binomial data would model the probability that a particular O-ring on a given flight failed as a function of temperature and pressure. The two regressions model different probabilities. The Challenger explosion occurred during a takeoff at 31 degrees Fahrenheit.

    Code
    #### Example: Shuttle O-ring data
    dat_shuttle <-
      read_csv("http://statacumen.com/teach/ADA2/notes/ADA2_notes_Ch11_shuttle.csv")

    The regression model for the binomial data (i.e., six trials per launch) is suspect on substantive grounds. The binomial model makes no allowance for an effect due to having the six O-rings on the same flight. If these data were measurements rather than counts, they would almost certainly be treated as dependent repeated measures. If interest is focused on whether one or more O-rings failed, the simplest, most direct data are the binary data.
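    The two response codings described above can be illustrated with a toy launch table (hypothetical numbers, not the actual O-ring data), where each launch carries six O-rings and `fail` of them failed:

    ```r
    # Toy illustration of the binomial vs binary response codings
    dat_toy <- data.frame(temp = c(66, 70, 53), fail = c(0, 1, 2), n = 6)

    # binomial coding: each O-ring is a trial
    #   glm(cbind(fail, n - fail) ~ temp, family = binomial, data = dat_toy)

    # binary coding: each launch is a single trial; y = 1 if any O-ring failed
    dat_toy$y <- as.integer(dat_toy$fail > 0)
    dat_toy$y   # 0 1 1
    ```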

    Consider fitting a logistic regression model using temperature and pressure as predictors. Let \(p_i\) be the probability that any O-ring fails in case \(i\) and model this as \[\log \left( \frac{p_i}{1-p_i} \right) = \beta_0 + \beta_1 \textrm{ Temp}_i + \beta_2 \textrm{ Pressure}_i.\]

    Logistic histogram plots of the data show a clear marginal relationship of failure with temp but not with pressure. We still need to assess the model with both variables together.

    Code
    # plot logistic plots of response to each predictor individually
    #library(popbio)
    popbio::logi.hist.plot(
        dat_shuttle$temp
      , dat_shuttle$y
      , boxp   = FALSE
      , type   = "hist"
      , rug    = TRUE
      , col    = "gray"
      , ylabel = "Probability"
      , xlabel = "Temp"
      )

    Code
    popbio::logi.hist.plot(
        dat_shuttle$pressure
      , dat_shuttle$y
      , boxp   = FALSE
      , type   = "hist"
      , rug    = TRUE
      , col    = "gray"
      , ylabel = "Probability"
      , xlabel = "Pressure"
      )

We fit the logistic model below using \(Y=1\) if at least one O-ring failed, and 0 otherwise. We are modeling the chance of one or more O-ring failures as a function of temperature and pressure.

The \(D\) goodness-of-fit statistic suggests no gross deviations from the model. However, the Wald test of \(H_0: \beta_1=\beta_2=0\) (no regression effects) has a p-value of 0.1, which provides little evidence that temperature and pressure jointly are useful predictors of the probability of O-ring failure. The z-test p-values for testing \(H_0:\beta_1=0\) and \(H_0:\beta_2=0\) individually are 0.037 and 0.576, respectively, which indicates that pressure is not important (when added last to the model) but that temperature is. This conclusion might be anticipated from the data plots above.

    Code
    glm_sh <-
      glm(
        cbind(y, 1 - y) ~ temp + pressure
      , family = binomial
      , data = dat_shuttle
      )
    
    # Test residual deviance for lack-of-fit (if > 0.10, little-to-no lack-of-fit)
    dev_p_val <- 1 - pchisq(glm_sh$deviance, glm_sh$df.residual)
    dev_p_val
    [1] 0.4589415
    Code
    # Testing Global Null Hypothesis
    coef(glm_sh)
     (Intercept)         temp     pressure 
    16.385319489 -0.263404073  0.005177602 
    Code
    # specify which coefficients to test = 0  (Terms = 2:3 is for terms 2 and 3)
    coef_test_values <-
      rep(0, length(coef(glm_sh)))
    
    library(aod) # for wald.test()
    test_wald <-
      wald.test(
        b     = coef(glm_sh) - coef_test_values
      , Sigma = vcov(glm_sh)
      , Terms = c(2, 3)
      )
    test_wald
    Wald test:
    ----------
    
    Chi-squared test:
    X2 = 4.6, df = 2, P(> X2) = 0.1
    Code
    # Model summary
    summary(glm_sh)
    
    Call:
    glm(formula = cbind(y, 1 - y) ~ temp + pressure, family = binomial, 
        data = dat_shuttle)
    
    Coefficients:
                 Estimate Std. Error z value Pr(>|z|)  
    (Intercept) 16.385319   8.027474   2.041   0.0412 *
    temp        -0.263404   0.126371  -2.084   0.0371 *
    pressure     0.005178   0.009257   0.559   0.5760  
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    (Dispersion parameter for binomial family taken to be 1)
    
        Null deviance: 28.267  on 22  degrees of freedom
    Residual deviance: 19.984  on 20  degrees of freedom
    AIC: 25.984
    
    Number of Fisher Scoring iterations: 5

    A reasonable next step would be to refit the model, after omitting pressure as a predictor. The target model is now \[\log \left( \frac{p_i}{1-p_i} \right) = \beta_0 + \beta_1 \textrm{ Temp}_i.\]

    Code
    glm_sh <-
      glm(
        cbind(y, 1 - y) ~ temp
      , family = binomial
      , data = dat_shuttle
      )
    
    # Test residual deviance for lack-of-fit (if > 0.10, little-to-no lack-of-fit)
    dev_p_val <- 1 - pchisq(glm_sh$deviance, glm_sh$df.residual)
    dev_p_val
    [1] 0.5013827
    Code
    # Model summary
    summary(glm_sh)
    
    Call:
    glm(formula = cbind(y, 1 - y) ~ temp, family = binomial, data = dat_shuttle)
    
    Coefficients:
                Estimate Std. Error z value Pr(>|z|)  
    (Intercept)  15.0429     7.3786   2.039   0.0415 *
    temp         -0.2322     0.1082  -2.145   0.0320 *
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    (Dispersion parameter for binomial family taken to be 1)
    
        Null deviance: 28.267  on 22  degrees of freedom
    Residual deviance: 20.315  on 21  degrees of freedom
    AIC: 24.315
    
    Number of Fisher Scoring iterations: 5

Our conclusions on the overall fit of the model and the significance of the effect of temperature on the probability of O-ring failure are consistent with the results for the two-predictor model. The model estimates the log-odds of (at least one) O-ring failure to be \[\log \left( \frac{\tilde{p}}{1-\tilde{p}} \right) = 15.043 - 0.2322 \textrm{ Temp}.\] The estimated probability of (at least one) O-ring failure is \[\tilde{p} = \frac{ \exp( 15.043 - 0.2322 \textrm{ Temp}) } { 1 + \exp( 15.043 - 0.2322 \textrm{ Temp}) }.\] This is a decreasing function of temperature.
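The fitted equation can be evaluated directly with plogis(), base R's inverse logit \(e^{\eta}/(1+e^{\eta})\). A quick sketch with the rounded coefficient estimates from the summary above confirms that the estimated probability decreases with temperature:

```r
# rounded coefficient estimates from the temperature-only model above
b0 <-  15.0429
b1 <- - 0.2322

# estimated probability of (at least one) O-ring failure at a few temperatures
temps <- c(55, 65, 75)
p_hat <- plogis(b0 + b1 * temps)  # inverse logit: exp(eta) / (1 + exp(eta))
round(p_hat, 3)                   # probabilities decrease as temp increases
```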

The Challenger was set to launch on a morning when the predicted temperature at lift-off was 31 degrees. Given that temperature appears to affect the probability of O-ring failure (a point that NASA missed), what can/should we say about the potential for O-ring failure?

    Clearly, the launch temperature is outside the region for which data were available. Thus, we really have no prior information about what is likely to occur. If we assume the logistic model holds, and we can extrapolate back to 31 degrees, what is the estimated probability of O-ring failure?
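Using the rounded coefficients from the summary above, the extrapolated estimate at 31 degrees can be computed directly with plogis(), base R's inverse logit:

```r
# rounded coefficients from the temperature-only model fit above
b0 <-  15.0429
b1 <- - 0.2322

# extrapolated probability of (at least one) O-ring failure at 31 degrees F
p_31 <- plogis(b0 + b1 * 31)
p_31  # essentially 1 (about 0.9996), though far outside the observed data
```

The point estimate is alarming, but remember it rests on extrapolating the logistic form well below the coldest observed launch temperature.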

The following answers this question. I augmented the original data set to obtain predicted probabilities for temperatures not observed in the data set. Note that fitting the model to the augmented data gives the same model fit because glm() excludes observations with missing values. Predictions are then made for all temperatures in the dataset, and the resulting table and plot are reported.

    Code
    # append values to dataset for which we wish to make predictions
    dat_shuttle_pred <-
      tibble(
        case     = rep(NA, 5)
      , flight   = rep(NA, 5)
      , y        = rep(NA, 5)
      , six      = rep(NA, 5)
      , temp     = c(31, 35, 40, 45, 50) # temp values to predict
      , pressure = rep(NA, 5)
      )
    dat_shuttle <-
      dat_shuttle_pred |>
      bind_rows(dat_shuttle)
    # fit model
    glm_sh <-
      glm(
        cbind(y, 1 - y) ~ temp
      , family = binomial
      , data = dat_shuttle
      )
    # Note: same model fit as before since glm() does not use missing values
    summary(glm_sh)$coefficients |> round(3)
                Estimate Std. Error z value Pr(>|z|)
    (Intercept)   15.043      7.379   2.039    0.041
    temp          -0.232      0.108  -2.145    0.032
    Code
    # predict() uses all the temp values in dataset, including appended values
    fit_logit_pred <-
      predict(
        glm_sh
      , newdata = dat_shuttle |> select(temp)  # use temp from dat_shuttle
      , type = "link"
      , se.fit = TRUE
      ) |>
      as_tibble()
    
    # put the fitted values in the data.frame
    dat_shuttle <-
      dat_shuttle |>
      mutate(
        fit_logit    = fit_logit_pred$fit
      , fit_logit_se = fit_logit_pred$se.fit
      # added "fit_p" to make predictions at appended Load values
      , fit_p        = exp(fit_logit) / (1 + exp(fit_logit))
      # CI for p fitted values
      , fit_p_lower  = exp(fit_logit - 1.96 * fit_logit_se) / (1 + exp(fit_logit - 1.96 * fit_logit_se))
      , fit_p_upper  = exp(fit_logit + 1.96 * fit_logit_se) / (1 + exp(fit_logit + 1.96 * fit_logit_se))
      )
    Code
    library(ggplot2)
    p <- ggplot(dat_shuttle, aes(x = temp, y = y))
    # predicted curve and point-wise 95% CI
    p <- p + geom_ribbon(aes(x = temp, ymin = fit_p_lower, ymax = fit_p_upper), alpha = 0.2)
    p <- p + geom_line(aes(x = temp, y = fit_p), colour="red")
    # fitted values
    p <- p + geom_point(aes(y = fit_p), size=2, colour="red")
    # observed values
    p <- p + geom_point(size = 2)
    p <- p + ylab("Probability")
    p <- p + labs(title = "Observed events and predicted probability of 1+ O-ring failures")
    print(p)
    Warning: Removed 5 rows containing missing values (`geom_point()`).