Two Sample Proportions - Lister and the Germ Theory

Produced by: Dr. DeWayne Derryberry

Idaho State University Department of Mathematics

Key ideas:

Practical Importance and Statistical Significance, Observational Studies, Relative Risk, Scope of Inference.

Background

“In the 1860s the notion that there were millions of tiny invisible organisms responsible for infection and death seemed a little odd. We should pause for a minute and appreciate how bizarre this claim was at the time. Much of what we now take for granted sounds quite implausible, on first hearing. We should not be surprised if the germ theory was initially met with much skepticism.

“Nevertheless, if the germ theory is true, there is an advantage in sterilizing surgical instruments between operations. In fact, Joseph Lister, a surgeon, performed one of the first natural experiments to demonstrate the power of antiseptics (a solution of carbolic acid and water), presumably in killing these mysterious germs.”

– From Lord Joseph Lister: The Rise of Antiseptic Surgery and the Modern Place of Antiseptics in Wound Care, David Leaper. European Wound Management Association, 17th annual conference. May 2-4, 2007.

The Task

Lister worked in a hospital where amputations were performed. The hospital performed 35 amputations with 16 deaths immediately before instituting sterilization of instruments, and 40 amputations with six deaths immediately after the institution of sterilization.

Is there evidence that the sterilization process reduces deaths when amputations are performed? If sterilization is effective, can we quantify the level of effectiveness?

The Data: Lister.jmp

The data set contains the results of the experiment.

Death: Yes or no.

Intervention: Whether the count was for the pre- or post-sterilization period.

Count: The number of patients in each classification. For example, 16 patients were in the pre-sterilization period and did not survive amputation. Count has been assigned the “Freq” role in the JMP data table.

Exhibit 1 The data

Analysis

Exhibit 2 is a more visual display of the data. An apparent effect of sterilization is visible in this graph. In the pre-sterilization period, almost half (45.7%, or 16 of 35) of the patients died. In the post-sterilization period, only 15% (6 of 40) died. But is the result statistically significant?
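The two death rates follow directly from the counts given in The Task (16 deaths in 35 amputations, then 6 deaths in 40). A minimal Python sketch of that arithmetic is shown here; the variable names are ours and are not part of the JMP data table.

```python
# Counts taken from the text: 16 deaths in 35 amputations before
# sterilization, 6 deaths in 40 amputations after.
deaths = {"pre": 16, "post": 6}
amputations = {"pre": 35, "post": 40}

# Sample proportion of deaths in each period.
death_rate = {period: deaths[period] / amputations[period] for period in deaths}

for period, rate in death_rate.items():
    print(f"{period}-sterilization death rate: {rate:.1%}")
```

The same proportions are what the mosaic plot displays once cell percents are turned on.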

Exhibit 2 A Visual Representation of the Data

(Analyze > Fit Y by X, select Death as Y, Response and Intervention as X, Factor. Count, which was previously assigned the “Freq” role, appears automatically in the Freq field. Click OK.

To show percents, right-click on the Mosaic Plot and select Cell Labeling > Show Percents.

To assign the “Freq” role before analysis, right-click on the variable in the data table and select Preselect Role > Freq.)

These are very small samples, so we must rule out chance as an explanation for what we are seeing. Even if the intervention is completely ineffective, the post-intervention group will show some improvement about half the time, just by chance.
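The claim that chance alone produces an apparent improvement roughly half the time can be checked with a short calculation. The sketch below assumes, purely for illustration, that sterilization has no effect and that every patient in both periods has the same death probability, taken here to be the pooled sample rate 22/75. Both that model and the helper name binom_pmf are our assumptions, not part of the original analysis.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k deaths among n independent patients."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n_pre, n_post = 35, 40
p_null = 22 / 75  # assumed common death probability (pooled sample rate)

# Add up the probability of every outcome in which the post-period death
# rate is strictly lower than the pre-period death rate. Rates are
# compared by cross-multiplying so that ties are detected exactly.
p_improve = sum(
    binom_pmf(x, n_pre, p_null) * binom_pmf(y, n_post, p_null)
    for x in range(n_pre + 1)
    for y in range(n_post + 1)
    if y * n_pre < x * n_post
)
print(f"P(post rate < pre rate | no effect) = {p_improve:.3f}")
```

The result is close to, but not exactly, one half, because the two periods can occasionally tie and the group sizes differ.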

We want first to test the claim that sterilization is effective:

Ho: the sterilization process does not impact deaths

Ha: the sterilization process reduces deaths

The one-sided Fisher’s Exact test (right-tailed, probability of death higher in pre-intervention than post-intervention) in Exhibit 3 directly addresses this question.
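For readers who want to see what the one-sided exact test is computing, here is a minimal sketch using only the Python standard library. It treats the table margins as fixed (35 pre-period patients, 22 deaths overall) and adds up the hypergeometric probabilities of seeing 16 or more deaths in the pre-sterilization group. The function name hypergeom_pmf is ours, and the printed value is meant to be compared with Exhibit 3, not to replace it.

```python
from math import comb

# 2 x 2 table from the text.
deaths_pre, n_pre = 16, 35
deaths_post, n_post = 6, 40

total = n_pre + n_post                   # 75 patients
total_deaths = deaths_pre + deaths_post  # 22 deaths

def hypergeom_pmf(k):
    """P(exactly k of the 22 deaths fall in the pre-sterilization group)
    when the margins are fixed and death is unrelated to period."""
    return (
        comb(total_deaths, k)
        * comb(total - total_deaths, n_pre - k)
        / comb(total, n_pre)
    )

# Right tail: 16 or more deaths in the pre-sterilization group.
max_k = min(total_deaths, n_pre)
p_one_sided = sum(hypergeom_pmf(k) for k in range(deaths_pre, max_k + 1))
print(f"One-sided Fisher exact p-value: {p_one_sided:.4f}")
```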

Exhibit 3 Testing the Main Hypothesis

The two-sided tests make a less informative comparison, but it will be easy to see how these tests could be used to draw the same conclusions.

Ho: the sterilization process does not impact deaths

Ha: the sterilization process impacts deaths

The two-sided Pearson and likelihood ratio tests (Exhibit 3) also produce small p-values, providing strong evidence the intervention is effective. The next section explains why two-sided p-values can support this one-sided conclusion.
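Both two-sided statistics can be computed by hand from the observed and expected counts. The sketch below is a plain Python illustration under the usual chi-square approximation with one degree of freedom; the helper chi2_1df_pvalue is our own name, and it relies on the fact that a chi-square variable with one degree of freedom is a squared standard normal.

```python
from math import erfc, log, sqrt

# Observed counts: rows are periods, columns are (died, survived).
observed = [[16, 19],   # pre-sterilization
            [6, 34]]    # post-sterilization

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

pearson = 0.0
likelihood_ratio = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        # Expected count if death were unrelated to period.
        expected = row_totals[i] * col_totals[j] / n
        pearson += (obs - expected) ** 2 / expected
        likelihood_ratio += 2 * obs * log(obs / expected)

def chi2_1df_pvalue(stat):
    """Upper-tail probability of a chi-square variable with 1 df."""
    return erfc(sqrt(stat / 2))

print(f"Pearson chi-square:  {pearson:.3f}, p = {chi2_1df_pvalue(pearson):.4f}")
print(f"Likelihood ratio G2: {likelihood_ratio:.3f}, "
      f"p = {chi2_1df_pvalue(likelihood_ratio):.4f}")
```

The two statistics differ slightly but lead to the same conclusion here, which is typical when no expected count is very small.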

A Note About One- and Two-Sided Tests

The likelihood ratio test is rarely covered in introductory textbooks, which typically discuss only Pearson’s test statistic. But Pearson’s test is two-sided. How should we interpret two-sided tests when our alternative is one-sided?

When an alternative hypothesis is one-sided, the best situation is a correct one-sided p-value. However, often it is the case that only two-sided p-values appear in computer output. Even when both one- and two-sided tests appear, the two-sided tests contribute to the narrative, and should not be ignored. However, a tedious conversion of two-sided p-values into one-sided p-values is often both confusing and unnecessary. If the data does not support the alternative hypothesis (in this case, if deaths increased in the intervention group), there is really no need to look at p-values in detail – the data cannot possibly support the alternative hypothesis. If a conversion to one-sided p-values is done correctly, the p-values will be large.

On the other hand, if the data appears to support the alternative hypothesis (in this case, if deaths did indeed decrease in the intervention group), the two-sided p-value usually leads to the same conclusion as the one-sided p-value. Usually both p-values are large or both are small. For a symmetric test statistic, the one-sided p-value is half the two-sided p-value, so at the 0.05 significance level the only time the two might lead to different conclusions is when the two-sided p-value is between 0.05 and 0.10.
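The rule of thumb in the last two paragraphs can be written as a tiny helper. This is a sketch that assumes a symmetric test statistic (a z statistic, for example); for other tests the halving is only approximate. The function name and its arguments are hypothetical.

```python
def one_sided_p(two_sided_p, effect_in_predicted_direction):
    """Convert a two-sided p-value from a symmetric test statistic
    into the one-sided p-value for a directional alternative."""
    half = two_sided_p / 2
    return half if effect_in_predicted_direction else 1 - half

# Data in the predicted direction: a two-sided p-value of 0.08 is halved,
# which moves it from just above 0.05 to just below.
p_supporting = one_sided_p(0.08, True)

# Data in the opposite direction: the one-sided p-value is large,
# whatever the two-sided value was.
p_opposing = one_sided_p(0.08, False)

print(p_supporting, p_opposing)
```

A two-sided p-value below 0.05 stays below 0.05 when halved, and one above 0.10 stays above 0.05, which is why only the range between 0.05 and 0.10 can change the conclusion.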

Of course, in writing a final report, every detail should be correct, and it costs nothing to report the appropriate p-value, one- or two-sided. However, when scanning computer output and thinking about the data, it is important to not get caught up in minutiae and to have a sense of what the data, and the computer output, are saying. In this example, all of the computer output makes it clear that the intervention seems to have had a huge impact.

Certainly a statistically trained Lister would be ecstatic about the results, even if no one-sided p-values were displayed.

Two-Sample Test for Proportions

When you have a one-sided alternative, there is usually a way to get a one-sided p-value. In this case, a two-sample test for proportions is available (Exhibit 4).

The two-sided p-value for this test almost exactly matches the Pearson p-value (Exhibit 3), and this test does have one-sided versions. The one-sided p-value of 0.0017 provides convincing evidence that deaths were reduced in the sterilization period.
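The text does not say which formula JMP uses for this test. As an assumption, the sketch below uses an adjusted Wald statistic (one death and one survivor added to each group, often called the Agresti-Caffo adjustment). For these counts it gives a one-sided p-value of about 0.0017, which agrees with the reported value, but that agreement should not be read as a statement of JMP's documented method.

```python
from math import sqrt
from statistics import NormalDist

deaths_pre, n_pre = 16, 35
deaths_post, n_post = 6, 40

# Adjusted proportions: add one death and one survivor to each group
# (an assumed adjustment, see the paragraph above).
p_pre = (deaths_pre + 1) / (n_pre + 2)
p_post = (deaths_post + 1) / (n_post + 2)

diff = p_pre - p_post
std_err = sqrt(p_pre * (1 - p_pre) / (n_pre + 2)
               + p_post * (1 - p_post) / (n_post + 2))
z = diff / std_err

# Ha: death is more likely before sterilization, so use the upper tail.
p_one_sided = 1 - NormalDist().cdf(z)
p_two_sided = 2 * p_one_sided

print(f"z = {z:.3f}, one-sided p = {p_one_sided:.4f}, "
      f"two-sided p = {p_two_sided:.4f}")
```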

Let’s think about the implications of these results. We are not just saying deaths were reduced in the sample in the sterilization period – we can see that from the data. We are really saying that, if this procedure were adopted by other hospitals, deaths would be reduced in the future in these hospitals. The test allows us to generalize beyond the sample.

Exhibit 4 Two-Sample Test of Proportions

(From the Fit Y by X output, click on the top red triangle and select Two Sample Test for Proportions. This option is only available for 2 x 2 tables.

The default category of interest, in this case, is no. Since the hypotheses are stated in terms of deaths (Death = yes), change the category of interest to yes using the radio button at the bottom of the output.)

The confidence interval provided with the two-sample test of proportions tells us a lot more: in similar circumstances we would estimate that the death rate can be reduced by anywhere from 9.7 to 48.9 percentage points. This is obviously a huge reduction in deaths! If the germ theory is true, sterilization of instruments can be expected to have a huge impact in a variety of circumstances.
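The interval for the difference in death probabilities can be approximated by hand. The sketch below assumes a 95% adjusted Wald (Agresti-Caffo) interval, which adds one death and one survivor to each group before applying the usual formula. That choice is our assumption; it is used here because it gives limits of roughly 0.097 and 0.489 for these counts.

```python
from math import sqrt
from statistics import NormalDist

def adjusted_wald_interval(x1, n1, x2, n2, confidence=0.95):
    """Adjusted Wald interval for p1 - p2 (one success and one failure
    are added to each group before computing the usual Wald limits)."""
    p1 = (x1 + 1) / (n1 + 2)
    p2 = (x2 + 1) / (n2 + 2)
    std_err = sqrt(p1 * (1 - p1) / (n1 + 2) + p2 * (1 - p2) / (n2 + 2))
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p1 - p2
    return diff - z_crit * std_err, diff + z_crit * std_err

# Pre-sterilization: 16 deaths of 35. Post-sterilization: 6 deaths of 40.
lower, upper = adjusted_wald_interval(16, 35, 6, 40)
print(f"Reduction in death probability: {lower:.3f} to {upper:.3f}")
```

Because the whole interval lies above zero, it tells the same story as the hypothesis test while also showing how imprecise the estimate is with samples this small.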

Relative Risk

Another way of quantifying the gains from sterilization is in terms of the ratio of probabilities, known as the relative risk. This computation (Exhibit 5, and below) tells us that our best estimate is that deaths are about three times more likely without sterilization than with (wow!).

Relative risk = (16/35) / (6/40) = 0.457 / 0.150 ≈ 3.05

A 95% confidence interval (1.34, 6.93) provides an interval estimate for the relative risk. We can say that deaths are between 34% (1.34 – 1) and 593% (6.93 – 1) more likely without sterilization than with sterilization. Note that this interval is very wide because the sample size is quite small.
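The relative risk and its interval can also be reproduced outside JMP. The sketch below assumes the common large-sample interval built on the log scale; the text does not state which method JMP uses, but for these counts this approach gives limits of about 1.34 and 6.93.

```python
from math import exp, log, sqrt
from statistics import NormalDist

deaths_pre, n_pre = 16, 35
deaths_post, n_post = 6, 40

risk_pre = deaths_pre / n_pre      # risk of death without sterilization
risk_post = deaths_post / n_post   # risk of death with sterilization
relative_risk = risk_pre / risk_post

# Large-sample standard error of log(relative risk).
se_log_rr = sqrt(1 / deaths_pre - 1 / n_pre + 1 / deaths_post - 1 / n_post)
z_crit = NormalDist().inv_cdf(0.975)  # two-sided 95% interval

lower = exp(log(relative_risk) - z_crit * se_log_rr)
upper = exp(log(relative_risk) + z_crit * se_log_rr)

print(f"Relative risk: {relative_risk:.2f}")
print(f"95% interval:  ({lower:.2f}, {upper:.2f})")
print(f"Excess risk:   {lower - 1:.0%} to {upper - 1:.0%}")
```

The interval is symmetric on the log scale, not on the original scale, which is why the upper limit sits much farther from the estimate than the lower limit does.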

Exhibit 5 Relative Risk

(From the Fit Y by X output, click on the top red triangle and select Relative Risk. This option is also only available for 2 x 2 tables.)

What does all of this mean?

There is overwhelming evidence that those in the post-intervention period had a much lower death rate than those in the pre-intervention period, and sterilization is a likely reason for this difference. We also know that the difference was large (of great practical importance), so we could expect this change to produce big positive results in other hospital and clinical settings. However, we cannot assume that the reduction in deaths would exactly mirror that found in Lister’s hospital.

Summary

Statistical Insights – Scope of Inference

How far can we generalize these results? Two types of generalization are usually of interest. Can we infer beyond this sample to some larger population? Can we infer cause-effect (did sterilization cause the changes we saw)?

Causation

The question of cause-effect is a bit murky. The study was not a true experiment. Patients were not randomly assigned to receive amputations with or without sterilization, and the study was not double blind. Of course, with the benefit of hindsight we now believe the germ theory and we know sterilization plays a critical role in all hospitals and clinics. If we, for a moment, forget this knowledge, we can see that there were limitations to this study at that time. Although we don’t want to overstate the point, the probable reason for the difference in death rates is sterilization.

Nevertheless, the variable we are manipulating, sterilization, is confounded with time. Other changes in the hospital (who operates, changes in procedures, time of year, etc.) are a rich source of potential alternative explanations of any changes in outcomes from the pre- to post-intervention periods. Further, if Lister is performing some of the surgeries, he may be (quite unintentionally) taking more care that patients survive in the post-intervention period.

Nothing here should be understood as a criticism of Lister. He did what he could do. In a number of circumstances it is not possible to actually employ random assignment of subjects to treatments. A researcher should never miss an opportunity to use random assignment, but when that cannot be done, important studies still occur. Lister was working in 1864-1866, and R.A. Fisher did not publish his influential work on randomized experiments until about 1925. So it is possible that Lister knew nothing about the notion of random assignment!

Generalizability

So suppose sterilization is effective; how does this work generalize to other settings? Obviously the effect of sterilization is huge, and we would expect other doctors to immediately adopt this method with great results. However, the precise way the results generalize is unclear. The interval estimates would apply best in circumstances almost identical to those of Lister’s hospital. For example, hospitals performing less invasive surgeries might expect lower death rates overall, and might expect fewer benefits from sterilization (although they might still experience great benefits). On the other hand, a hospital catering to those with already compromised immune systems might experience even greater benefits. Perhaps just the opposite is the case. We are only trying to give a flavor of the kinds of issues that come up when trying to generalize a result to a new setting, when random sampling from a population did not occur. A medical expert could say a lot more; it is a matter of subject matter expertise.

JMP® Features and Hints

This case used the Fit Y by X platform to produce a mosaic plot and contingency table for Death versus Intervention. Count was assigned the Freq role in the data table, which allowed it to automatically populate the Freq field in the Fit Y by X dialog window (a good practice, in general).

The two-sided Pearson and likelihood ratio tests and one-sided Fisher’s exact test were used to test the claim that sterilization is effective. Since we were interested in conducting a one-sided test, we used the two-sample proportions test and confidence interval for the difference in the probability of death. Finally, we used the risk ratio (relative risk) and confidence intervals to quantify the reduction in deaths from sterilization.

Exercises

It has often been claimed that great basketball players such as Larry Bird perform especially well in the playoffs.

Here are Larry Bird’s shooting “numbers” for the 1979 regular season and 1979 playoffs. This was his last year of college.

Shot	When	Count
made	season	376
missed	season	331
made	playoff	52
missed	playoff	43

The numbers can be verified at http://www.sports-reference.com/cbb/players/larry-bird-1.html.

Enter the data in JMP (remember to set Count to “Freq”).

Analyze this data, following the example provided in this case.

Does the data provide evidence Larry Bird shot better in the playoffs than in the regular season?
State the null and alternative hypotheses.
State and interpret the confidence intervals for a difference in proportions. Explain how the confidence intervals above reflect the conclusions of the hypothesis test.
State the relative risk, and interpret the confidence interval.
Which of the following statements is most correct?

There is strong evidence that Larry Bird shot better in the playoffs than in the regular season.
There is moderate evidence Larry Bird shot better in the playoffs than in the regular season.
There is little evidence Larry Bird shot better in the playoffs than in the regular season.
There is strong evidence Larry Bird shot the same in the playoffs as in the regular season.
There is moderate evidence Larry Bird shot the same in the playoffs as in the regular season.