ADA1: Class 04, Study Design and Sampling

Advanced Data Analysis 1, Stat 427/527, Fall 2023, Prof. Erik Erhardt, UNM


Your Name


August 24, 2023


Answer each question and specify the supporting evidence.

(6 p) 1. IRS Sampling

In this case study there are two errors in the methodology: one regarding sampling and the other regarding calculation. Focus on the sampling issue.

Case Study 1: The stated case of the IRS [1]

The entire details of the lawsuit brought by the IRS against the defendant will not be covered in this paper. However, parts of this case are statistically interesting. The defendant was the owner of a tax preparation firm with several locations, and he was directly or indirectly responsible for the preparation and filing of at least 24,399 federal income tax returns for the tax years 2003 through 2007. The IRS stated that they reviewed 345 returns of the 24,399 identified. Of the 345 which the IRS reviewed, 313 resulted in needing additional tax assessment. This means that 91% of the original sample had returns that owed additional tax to the IRS, and the additional tax was owed for a variety of reasons. The IRS calculated from these 345 returns that the actual tax loss directly due to these returns being improperly prepared by the defendant(s) was in excess of $1.1 million (United States v. Brier, et. al., pg. 3). The IRS further stated that if this rate loss were applied to all 24,399 returns, then the estimated loss to the United States government would be in excess of $85 million for the years 2003 through 2007 (United States v. Brier, et. al., pg. 5). Thus the IRS was looking for damages close to 85 million dollars.
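The arithmetic in the passage above can be reproduced with a short sketch. It uses only the counts and dollar figures quoted in the case study. Treating "in excess of $1.1 million" as exactly $1.1 million is an assumption made here for illustration, and the variable names are hypothetical. The sketch only scales the per-return loss in the reviewed sample up to all returns; it says nothing about whether the reviewed returns are representative of all returns.

```python
# Figures quoted in the case study.
total_returns = 24_399   # returns attributed to the defendant
reviewed = 345           # returns the IRS reviewed
assessed = 313           # reviewed returns needing additional assessment

# Assumption: "in excess of $1.1 million" is treated as exactly $1.1M.
sample_loss = 1_100_000

# Share of reviewed returns that needed additional assessment.
share_assessed = assessed / reviewed

# Average loss per reviewed return, then scaled to all returns.
loss_per_return = sample_loss / reviewed
scaled_loss = loss_per_return * total_returns

print(f"Share assessed:        {share_assessed:.1%}")
print(f"Loss per reviewed:     ${loss_per_return:,.2f}")
print(f"Scaled to all returns: ${scaled_loss:,.0f}")
```

Comparing the scaled figure with the $85 million stated in the case study is one way to look at the calculation issue mentioned in the problem statement.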

1. (2 p) Explain how the sample was selected. How do you know?


2. (2 p) What sampling method should have been used and what is that method called?


3. (2 p) What problems with the inferential results might have been introduced because of the poor sampling method?


(4 p) 2. Electronic health records (EHR)

Background [2]: A substantial portion of the US population remains uninsured, and an even larger group uses healthcare only rarely. Although the trend is toward greater use of EHRs, only about 40% of patients currently have their information recorded in EHRs.

Case Study 2: Nurses' Health Study

The large Nurses' Health Study followed 48,470 postmenopausal women, 30–63 years of age, for 10 years (337,854 person-years). The study [3] concluded that use of hormone replacement therapy cut the rate of serious coronary heart disease nearly in half.

1. (2 p) Despite the enormous sample size, why should we be skeptical of this result? What sampling issue is present? How could that affect the results?

Case Study 3: Estimating disease prevalence

A young MD wanted to predict the number of patients she could get in her specialized field. She obtained EHRs from her university hospital for the previous year and calculated the proportion of all people admitted to the hospital who had a particular ailment. She concluded that a high proportion of people would need her services in a clinic.

1. (2 p) What mistake did she make? Is her estimated proportion too high or too low? Should she open a clinic?



  1. Kennedy, K., & Bishop, J. (2014). Random sampling issues in a federal court case, a case study. Case Studies In Business, Industry And Government Statistics, 5(2), 111-114.

  2. Kaplan, Robert M., David A. Chambers, and Russell E. Glasgow. “Big data and large sample size: a cautionary note on the potential for bias.” Clinical and translational science 7.4 (2014): 342-346.

  3. Stampfer MJ, Colditz GA, Willett WC, Manson JE, Rosner B, Speizer FE, Hennekens CH. Postmenopausal estrogen therapy and cardiovascular disease. Ten-year follow-up from the nurses’ health study. N Engl J Med. 1991; 325: 756–762.