ADA2: Class 13, Ch 08, polynomial regression

Advanced Data Analysis 2, Stat 428/528, Spring 2024, Prof. Erik Erhardt, UNM


Your Name


December 20, 2023

Hooker’s Himalayan boiling point altitude data

Dr. Joseph Hooker collected the following data in the 1840s on the boiling point of water and the atmospheric pressure at 31 locations in the Himalayas. Boiling point is measured in degrees Fahrenheit. The pressure is recorded in inches of mercury, adjusted for the difference between the ambient air temperature when he took the measurements and a standard temperature.

The goal was to develop a model to predict the atmospheric pressure from the boiling point.

Historical note: Hooker really wanted to estimate altitude above sea level from measurements of the boiling point of water. He knew that the altitude could be determined from the atmospheric pressure, measured with a barometer, with lower pressures corresponding to higher altitudes. His interest in the above modelling problem was motivated by the difficulty of transporting the fragile barometers of the 1840s. Measuring the boiling point would give travelers a quick way to estimate elevation, using the known relationship between elevation and barometric pressure, and the above model relating pressure to boiling point.

dat_boil <-
  read_csv(
    "..."   # data file path is not shown in the source
  , skip = 2
  ) %>%
  mutate(
    boilingF_cen = boilingF - mean(boilingF)
  )
Rows: 31 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (2): boilingF, pressure

# x-variable mean for centering
dat_boil$boilingF %>% mean()
[1] 191.7871
str(dat_boil)
tibble [31 × 3] (S3: tbl_df/tbl/data.frame)
 $ boilingF    : num [1:31] 211 210 208 202 201 ...
 $ pressure    : num [1:31] 29.2 28.6 28 24.7 23.7 ...
 $ boilingF_cen: num [1:31] 19.01 18.41 16.61 10.71 8.81 ...
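
Centering matters mostly once a squared term enters the model: on the raw scale, boilingF and its square are almost perfectly correlated, while the centered versions are close to uncorrelated. A minimal sketch of that effect follows, assuming a hypothetical evenly spaced grid of boiling points (x_sim is made up for illustration and is not Hooker's data).

```r
# Hypothetical grid of boiling points (deg F); illustration only, not Hooker's data
x_sim     <- seq(180, 212, length.out = 31)
x_sim_cen <- x_sim - mean(x_sim)

# Correlation between the linear and quadratic terms, raw vs centered
cor_raw <- cor(x_sim,     x_sim^2)
cor_cen <- cor(x_sim_cen, x_sim_cen^2)

round(c(raw = cor_raw, centered = cor_cen), 4)
```

Lower correlation between the two terms makes the individual coefficient estimates and their standard errors easier to interpret; the fitted values themselves are unchanged by centering.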

(2 p) Plot the data.

Using ggplot, try to implement these features in a plot. Overlay both a straight-line regression line in blue (geom_smooth(method = lm, col = "blue", ...)), as well as a loess smooth (default) dashed line in red (geom_smooth(method = loess, col = "red", linetype = 2, ...)). Using alpha=1/5 will make the confidence bands more transparent. Also, if you plot the points last, they’ll lie on top of the lines.

Describe the key features of this plot.


I’ll give you this first plot to help get started, in particular to illustrate a nice use of the caption and the annotation of a second x-axis for the centered version of the boilingF variable.

p <- ggplot(dat_boil, aes(x = boilingF, y = pressure))
p <- p + scale_x_continuous(sec.axis = sec_axis(~ . - mean(dat_boil$boilingF), name = "boilingF centered"))
p <- p + geom_vline(xintercept = mean(dat_boil$boilingF), alpha = 1/4)
p <- p + geom_smooth(method = lm, se = TRUE, col = "blue", fill = "blue", alpha = 1/5)
p <- p + geom_smooth(method = loess, se = TRUE, col = "red", fill = "red", linetype = 2, alpha = 1/5)
p <- p + geom_point(size = 2)
p <- p + labs(title = "Simple linear model"
            , caption = "Blue solid = line, Red dashed = loess smooth curve"
            )
print(p)
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'


(3 p) Fit a simple linear regression, assess assumptions.

Fit a simple linear regression model for predicting pressure from boiling point. Provide output for examining residuals, outliers, and influential cases.

Looking at the plots, are there any indications that the mean pressure is not linearly related to boiling point? Are there any observations that appear to be highly influencing the fit of this model? Are there certain points or regions of the data where the model does not appear to fit well? Discuss.

Which, if any, of the standard linear regression model assumptions appears to be violated in this analysis? If you believe that some of the assumptions are violated, does it appear that deleting one or two points would dramatically improve the fit? Would you use this model for predicting pressure from boiling point? Discuss and carry out any needed analysis to support your position.
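
One possible way to produce the residual, outlier, and influence output asked for above is sketched here with base R only. The data frame dat_sim is a synthetic stand-in with a curved trend (it is not Hooker's data), and the object names and cutoffs are assumptions chosen for illustration; swap in dat_boil and your own model object.

```r
# Synthetic stand-in data with a curved trend (NOT Hooker's measurements)
set.seed(428)
dat_sim <- data.frame(boilingF = seq(180, 212, length.out = 31))
dat_sim$pressure <- 29.9 * exp(0.0206 * (dat_sim$boilingF - 212)) + rnorm(31, sd = 0.1)

# Simple linear regression of pressure on boiling point
lm_sim_b1 <- lm(pressure ~ boilingF, data = dat_sim)

# Default diagnostic panels: residuals vs fitted, normal QQ,
# scale-location, and residuals vs leverage
op <- par(mfrow = c(2, 2))
plot(lm_sim_b1)
par(op)

# Numeric case diagnostics
diag_sim <-
  data.frame(
    rstudent = rstudent(lm_sim_b1)        # studentized (deleted) residuals
  , leverage = hatvalues(lm_sim_b1)       # diagonal of the hat matrix
  , cooksd   = cooks.distance(lm_sim_b1)  # Cook's distance
  )

# Rule-of-thumb cutoffs (conventions, not hard thresholds)
n_obs  <- nrow(dat_sim)
n_coef <- length(coef(lm_sim_b1))
subset(diag_sim, abs(rstudent) > 2 | leverage > 2 * n_coef / n_obs | cooksd > 4 / n_obs)
```

A systematic bow in the residuals-vs-fitted panel points to a nonlinear mean, which is a different problem from one or two unusual cases flagged by the table.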



(1 p) Interpret \(R^2\)

Interpret \(R^2\) in the previous simple linear regression model.
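
As a reminder of what the number measures, the sketch below computes \(R^2\) three equivalent ways for a simple linear regression: from summary(), as one minus the ratio of residual to total sum of squares, and as the squared correlation between the two variables. It uses synthetic stand-in data (dat_sim, not Hooker's data) and illustrative object names.

```r
# Synthetic stand-in data (NOT Hooker's measurements)
set.seed(428)
dat_sim <- data.frame(boilingF = seq(180, 212, length.out = 31))
dat_sim$pressure <- 29.9 * exp(0.0206 * (dat_sim$boilingF - 212)) + rnorm(31, sd = 0.1)

lm_sim_b1 <- lm(pressure ~ boilingF, data = dat_sim)

# 1. As reported by summary()
r2_summary <- summary(lm_sim_b1)$r.squared

# 2. By hand: proportion of total variation in pressure explained by the line
ss_resid  <- sum(resid(lm_sim_b1)^2)
ss_total  <- sum((dat_sim$pressure - mean(dat_sim$pressure))^2)
r2_manual <- 1 - ss_resid / ss_total

# 3. In simple linear regression only: squared sample correlation
r2_cor <- cor(dat_sim$boilingF, dat_sim$pressure)^2

c(summary = r2_summary, manual = r2_manual, cor_squared = r2_cor)
```

A high \(R^2\) does not by itself confirm that a straight line is the right shape; the residual plots answer that question.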



(2 p) A better model.

Decide whether a transformation, or a polynomial model in boiling point, is needed to adequately summarize the relationship between pressure and boiling point. If so, perform a complete analysis of the data on this scale (that is, check for influential observations, outliers, non-normality, etc.).
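
Two candidate remedies can be set up as sketched below: a quadratic in the centered predictor, and a log transformation of the response. This is an illustration on synthetic stand-in data (dat_sim, not Hooker's data) with assumed object names; which remedy, if either, suits the real data is for you to decide from the diagnostics.

```r
# Synthetic stand-in data with a curved trend (NOT Hooker's measurements)
set.seed(428)
dat_sim <- data.frame(boilingF = seq(180, 212, length.out = 31))
dat_sim$pressure     <- 29.9 * exp(0.0206 * (dat_sim$boilingF - 212)) + rnorm(31, sd = 0.1)
dat_sim$boilingF_cen <- dat_sim$boilingF - mean(dat_sim$boilingF)

# Straight line and quadratic, both in the centered predictor
lm_sim_b1 <- lm(pressure ~ boilingF_cen, data = dat_sim)
lm_sim_b2 <- lm(pressure ~ boilingF_cen + I(boilingF_cen^2), data = dat_sim)

# F-test of the nested models: does the squared term improve the fit?
aov_b1_b2 <- anova(lm_sim_b1, lm_sim_b2)
aov_b1_b2

# Coefficient table for the quadratic model
summary(lm_sim_b2)$coefficients

# Alternative: straight line on the log(pressure) scale
lm_sim_log <- lm(log(pressure) ~ boilingF, data = dat_sim)

# Compare residual patterns rather than R^2, because the response scales differ
op <- par(mfrow = c(1, 2))
plot(fitted(lm_sim_b2),  resid(lm_sim_b2),  main = "Quadratic");     abline(h = 0, lty = 2)
plot(fitted(lm_sim_log), resid(lm_sim_log), main = "log(pressure)"); abline(h = 0, lty = 2)
par(op)
```

Whichever scale is chosen, the same case diagnostics used for the straight-line fit (studentized residuals, leverage, Cook's distance) apply to the new model object.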



(2 p) Final model.

Regardless of which scale you choose for the analysis, provide an equation to predict pressure from boiling point. Write a short summary, pointing out any limitations of your analysis.
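
Once a model is chosen, predictions can be read off the fitted equation or obtained with predict(). The sketch below assumes, purely for illustration, a quadratic in the centered predictor fit to synthetic stand-in data (dat_sim, not Hooker's data); the new boiling points 185 and 200 are arbitrary values inside the simulated range.

```r
# Synthetic stand-in data (NOT Hooker's measurements)
set.seed(428)
dat_sim <- data.frame(boilingF = seq(180, 212, length.out = 31))
dat_sim$pressure     <- 29.9 * exp(0.0206 * (dat_sim$boilingF - 212)) + rnorm(31, sd = 0.1)
x_bar                <- mean(dat_sim$boilingF)
dat_sim$boilingF_cen <- dat_sim$boilingF - x_bar

lm_sim_b2 <- lm(pressure ~ boilingF_cen + I(boilingF_cen^2), data = dat_sim)

# New boiling points must be centered with the ORIGINAL sample mean
new_obs <- data.frame(boilingF = c(185, 200))
new_obs$boilingF_cen <- new_obs$boilingF - x_bar

# Point predictions with 95% prediction intervals
pred <- predict(lm_sim_b2, newdata = new_obs, interval = "prediction", level = 0.95)
cbind(new_obs, pred)

# Same fitted values from the written-out equation b0 + b1 * x_cen + b2 * x_cen^2
b <- unname(coef(lm_sim_b2))
fit_by_hand <- b[1] + b[2] * new_obs$boilingF_cen + b[3] * new_obs$boilingF_cen^2
```

Predictions outside the observed range of boiling points are extrapolations, and a polynomial can behave poorly there; that is one limitation worth stating in the summary.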



Example based on the first linear model

Assuming you called your linear model object lm_p_b1, the inline code in the equation below inserts the estimated intercept and slope. Just add an r before each of the signif(...) inline code chunks to make the numbers appear. Then use this example to write your final model here.

\[ \widehat{\textrm{pressure}} = ` signif(coef(lm_p_b1)[1], 3)` + ` signif(coef(lm_p_b1)[2], 3)` \textrm{ boilingF} \]