If you missed the March episode of Statistically Speaking, you can watch it without registering! We feel the perspectives and examples presented by Jessica Utts and Dick De Veaux – brilliant award-winning statisticians, professors, authors, and ASA Fellows – are so compelling that their talks deserve to be viewed far and wide.
In addition, we ran out of time before we could address every follow-up question. Jessica and Dick have kindly provided thorough answers, and we are grateful to them for taking the time to delve further into this important topic.
Question: What about incurring unnecessary waste in setting up an analysis? We see so many people still doing one-factor-at-a-time experiments, which is really just wasteful trial and error. Does this go back to Jessica’s point about teaching statistical literacy?
Jessica: I don’t think a discussion of how to design an experiment or analysis would fit with my idea of statistical literacy, because I’m mostly concerned with the literacy of statistical consumers. But that leads to a very important point. The problem of doing one-factor-at-a-time experiments goes beyond being wasteful. The problem is that results can be very misleading if interactions exist but are not taken into account. And interactions can only be measured by studying multiple factors in the same experiment.
For example, some studies show that consuming caffeine before working out helps with athletic performance. But other studies show the opposite. One explanation that has been proposed is that it depends on whether someone is genetically a fast or a slow caffeine metabolizer. So, if a study of this sort did not take that genetic difference into account, the results could be quite misleading.
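To make the interaction point concrete, here is a minimal simulation sketch in Python (with made-up effect sizes, group sizes, and noise levels, purely for illustration) showing how a crossover interaction between caffeine and metabolizer type can cancel out when only the caffeine factor is studied:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200  # participants per cell (made-up)

def performance(caffeine, fast_metabolizer):
    """Simulated performance scores: caffeine helps fast metabolizers (+2)
    but hurts slow metabolizers (-2), an assumed crossover interaction."""
    effect = 2.0 if fast_metabolizer else -2.0
    return 50 + (effect if caffeine else 0.0) + rng.normal(0, 3, n)

# One-factor-at-a-time view: pool over metabolizer type
caffeine_group = np.concatenate([performance(True, True), performance(True, False)])
control_group  = np.concatenate([performance(False, True), performance(False, False)])
print("Pooled caffeine effect:", round(caffeine_group.mean() - control_group.mean(), 2))  # near 0

# Two-factor view: the caffeine effect reverses sign across metabolizer types
for fast in (True, False):
    diff = performance(True, fast).mean() - performance(False, fast).mean()
    print(f"Caffeine effect (fast_metabolizer={fast}): {diff:+.2f}")
```

Pooling over the genetic factor makes the caffeine effect look negligible, even though it is real and opposite in the two subgroups.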
When consumers read about the results of a study that don’t seem to make sense (or even results that do!), a good strategy is to think about what additional variables might lead to interactions. For example, in this webinar I discussed the results of the Women’s Health Initiative, which seemed to show that taking hormone replacement therapy after menopause increased the risk of coronary heart disease. But follow-up research showed that the picture is not so straightforward. According to the Committee on Gynecological Practice of The American College of Obstetricians and Gynecologists:
“Recent analyses suggest that HT [hormone therapy] does not increase CHD [coronary heart disease] risk for healthy women who have recently experienced menopause. There is some evidence that lends support to the timing hypothesis, which posits that cardiovascular benefit may be derived when ET [estrogen therapy] or HT is used close to the onset of menopause.”
In other words, not only are hormones unlikely to cause coronary heart disease in women who start taking them at menopause, but the hormones might in fact be beneficial for reducing heart disease! The problem is that the average age of women in the original study was 63, and most of them were many years past the menopause transition and had never used hormones. A physician friend explained to me that this means they may have had time to build up plaque in their arteries after menopause, and when they started the hormones many years later the plaque was loosened, causing coronary heart disease. If they had taken hormones immediately after going through menopause, the plaque would not have built up, and the hormones would have protected them from heart disease. In this example, the timing of starting hormones, in terms of the number of years after the menopause transition, was an important interacting variable. If consumers had consulted the original research article, they would have noticed that the women who participated in the study were not representative of how hormone therapy is used in practice. In the real world, hormone therapy is almost always prescribed immediately upon reaching menopause.
If you are designing experiments or observational studies, it is very important to think about how the combination of explanatory variables might affect the outcome. If interactions are likely, it is not only wasteful, but also potentially misleading to investigate factors one at a time.
Question: Should practical significance be defined by a specific effect size cutoff, such as 0.5, which psychologists define as a "medium effect"?
Jessica: No! Practical significance should be viewed in context. Effect sizes are useful, though, especially for determining whether a finding replicates from one study to another. This is because, unlike p-values, effect sizes do not depend on sample size. (However, as with any statistical estimate, the larger the sample, the more accurate the estimate of the true effect size will be.) So if a particular outcome variable consistently shows the same magnitude of effect size from study to study, that is a better indicator of replication than if studies of widely varying size consistently find similar p-values.
A simple way to think about an effect size is that it’s the difference between the means of two groups in terms of number of standard deviations. For instance, average heights of men and women differ by about 5 inches, and within each group the standard deviation is about 2.5 inches. So the difference of 5 inches has an effect size of 5/2.5 = 2, which is the difference in mean heights in terms of number of “within group” standard deviations. Effect sizes have been classified as small if they are around 0.2, medium around 0.5, and large if they are 1.0 or more. Roughly, it’s said that a large effect size is noticeable to anyone, a medium effect is noticeable to an observant expert, and a small one is not really observable without using statistics, even though it may be important. Using the example here, a mean height difference in two groups of 2.5 inches or more (effect size of 1) should be noticeable to any observer, a mean height difference of 1.25 inches (effect size of 0.5) probably wouldn’t be noticeable unless someone often measures heights for some purpose, and a difference of half an inch (effect size of 0.2) would be very hard to detect without statistical summaries.
But practical significance is always context-dependent. For example, suppose a study is done to determine whether taking a particular herbal supplement with no side effects might help reduce cholesterol. Various sources report that the population standard deviation for total plasma cholesterol is about 40 mg/dl. So even a relatively small effect size of 0.25 (¼ of a standard deviation) represents an average cholesterol reduction of 10 mg/dl, which could have meaningful clinical significance. On the other hand, suppose a cholesterol-reducing drug had serious side effects. In that case, an average reduction of only 10 mg/dl as a result of taking the drug probably would not be enough to make it worth risking the side effects. And as illustrated by the height and cholesterol examples, the numerical value of an effect size depends on the natural within-group variability. If the natural variability in cholesterol levels in the population was only 5 mg/dl instead of 40 mg/dl, then a reduction of 10 mg/dl would represent an effect size of 10/5 = 2.0, meeting the criterion of a “large effect.”
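For readers who want to verify the arithmetic, here is a small Python sketch that reproduces the numbers quoted above (the rounded values are used only for illustration):

```python
def effect_size(mean_difference, within_group_sd):
    """Standardized effect size: the mean difference in units of within-group SDs."""
    return mean_difference / within_group_sd

# Heights: ~5 inch difference, ~2.5 inch within-group SD
print(effect_size(5, 2.5))   # 2.0  -> "large"; noticeable to any observer

# Cholesterol: 10 mg/dl reduction against a 40 mg/dl population SD
print(effect_size(10, 40))   # 0.25 -> numerically "small", yet clinically meaningful

# The same 10 mg/dl reduction if the population SD were only 5 mg/dl
print(effect_size(10, 5))    # 2.0  -> now a "large" effect
```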
Another example arises when the question is whether an effect exists at all. Laboratory studies to test whether extrasensory perception is possible have consistently found small effect sizes, generally between 0.15 and 0.2. But any consistent non-zero effect size in this context would be meaningful, because it would indicate that extrasensory perception is possible.
So, to summarize, determining whether a study result shows practical significance requires contextual expertise. Effect sizes can be used as a guide, but what is considered to be of practical significance always will depend on other factors, beyond statistical measures.
Question: How do you define practical/biological significance, compared to statistical significance?
Dick: This is a great question, and one that comes up frequently. I’ll answer the question in general first and then give a couple of examples. Statistical “significance” is a narrow term that describes the situation when the observed data fall “far enough” away from what the proposed model predicts. So, if we assume a coin is fair, flip it 100 times, and find that the observed frequency of heads is far enough away from 0.5, we’ll say the result is statistically significant, meaning that the chance of seeing a result at least that extreme under that model is small (this is the p-value). When the p-value is small enough, we say that there is evidence against the model (or hypothesis), whether or not the actual difference from 0.5 is large enough to be practically interesting. Practical/biological significance describes a result that is meaningful to the scientist or other investigator. A result can be practically significant but not statistically significant or, even worse, vice versa.
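As a rough illustration of the coin example, the sketch below runs an exact binomial test on a hypothetical count of 61 heads in 100 flips (numbers chosen only for illustration):

```python
from scipy.stats import binomtest

# Hypothetical data: 61 heads in 100 flips of a coin assumed fair (p = 0.5)
result = binomtest(k=61, n=100, p=0.5, alternative="two-sided")
print(f"p-value = {result.pvalue:.3f}")  # roughly 0.035: "significant" at the 0.05 level

# Statistical significance only says these data are surprising under the fair-coin
# model; whether a 61% heads rate matters in practice is a separate, contextual call.
```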
A student came to me yesterday with a theory about how insects visit plants. Do they tend to come independently, or do they come in groups? We used a Poisson model for the occurrences of 0, 1, 2, 3 … insects arriving at the flower and then used a Chi-squared test to see whether the actual occurrences fit the Poisson. By making the time period very small, and thus making the number of time periods large, we were able to reject the Poisson model (and thus independent arrivals) fairly easily. But that didn’t tell us whether the departures were large enough to really answer the biological question: how different from the Poisson-expected numbers would the counts have to be to truly give evidence that the insects are “pairing up”? That’s not a question that a statistician can answer.
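For readers who want to see the mechanics, here is a minimal sketch of that style of goodness-of-fit test, with invented counts standing in for the insect data:

```python
import numpy as np
from scipy.stats import poisson, chisquare

# Invented counts of time periods in which 0, 1, 2, or 3+ insects arrived
observed = np.array([120, 60, 35, 25])
total = observed.sum()

# Estimate the Poisson mean from the data (treating the "3+" bin as 3 for simplicity)
lam = (observed * np.array([0, 1, 2, 3])).sum() / total

# Expected counts under the Poisson model; the last bin collects P(X >= 3)
probs = [poisson.pmf(k, lam) for k in (0, 1, 2)]
probs.append(1 - sum(probs))
expected = total * np.array(probs)

# One extra degree of freedom is lost for estimating lambda from the same data
stat, p = chisquare(observed, expected, ddof=1)
print(f"chi-square = {stat:.2f}, p = {p:.4f}")
```

A small p-value here says only that the counts do not fit a Poisson model; deciding whether the departure is biologically meaningful is still up to the scientist.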
Another example comes from a lab at Princeton that was trying to show that humans can influence (or at least predict) the outcome of a random number generator. They generated a random (uniform) number between 0 and 1 and then asked subjects to tell whether it was less than or greater than 0.5. They collected tens of thousands of these observations and announced that the success rate was clearly statistically significant (with a p-value < 0.00001). The actual success rate (around 0.5002) was such a small deviation that at best it would influence two to three outcomes out of 10,000. Of course, the director of the lab argued that any difference, no matter how small, would be interesting. Even if the results were real, and not the product of experimental anomalies or random error, I’ll leave it to the reader to decide whether they are practically meaningful.
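The arithmetic behind that contrast is easy to reproduce. The sketch below uses a one-sample z-test for a proportion with illustrative sample sizes (not the lab’s actual trial counts) to show how the p-value for a fixed 0.5002 success rate depends entirely on the number of trials, while the practical effect stays tiny:

```python
from scipy.stats import norm

p_hat, p0 = 0.5002, 0.5

# Illustrative sample sizes only; not the lab's actual trial counts
for n in (50_000, 5_000_000, 500_000_000):
    z = (p_hat - p0) / (p0 * (1 - p0) / n) ** 0.5   # one-sample z for a proportion
    p_value = 2 * norm.sf(abs(z))
    print(f"n = {n:>11,}: z = {z:6.2f}, p = {p_value:.2e}")

# Whatever the sample size, the practical effect is fixed:
print(f"extra successes per 10,000 trials: {(p_hat - p0) * 10_000:.0f}")
```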
If a drug has a success rate that’s enough greater than a placebo’s to be unlikely to have occurred by chance, we’ll say it’s a statistically significant result. But if the actual difference were, say, curing 31% of patients instead of the placebo’s 30%, I suspect that the marketing department would have a tough time advertising that improvement, even if the difference were statistically significant.
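A quick back-of-the-envelope check (with a made-up trial size) shows how a one-percentage-point difference can clear the bar of statistical significance in a large trial even though it corresponds to roughly one extra cure per 100 patients treated:

```python
import numpy as np
from scipy.stats import chi2_contingency

n = 50_000  # patients per arm (made-up)
cured_drug, cured_placebo = int(0.31 * n), int(0.30 * n)

# 2x2 table of cured vs. not cured for drug and placebo arms
table = np.array([[cured_drug, n - cured_drug],
                  [cured_placebo, n - cured_placebo]])
chi2, p, dof, _ = chi2_contingency(table)
print(f"p-value = {p:.4g}")  # well below 0.05 at this sample size

# Number needed to treat: patients treated per additional cure
print(f"number needed to treat: {1 / (0.31 - 0.30):.0f}")  # about 100
```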
Question: Do you think anyone who has an interest or stake in an outcome of a study should always be scrutinized or identified in a study?
Dick: More than ever, data are easy to find on almost any topic that we care to research. But, of course, all data are not equal. A result from a double-blind, placebo-controlled clinical trial is always preferable to anecdotal information. When the source of the data cannot be identified, it might be impossible to distinguish the two scenarios. The pedigree of the data, including how they were collected, under what conditions, by whom, and for what purpose, is essential before making claims about what the data imply about the world at large.
We are interested in hearing other examples you may have on ethics in data science, where careful consideration was given for an ethical analysis, an analysis where ethics were clearly lacking, or an ethical dilemma where it is unclear how best to proceed. Thanks again to Jessica and Dick for sharing their wisdom and expertise on this topic, which is so relevant for these times.