JerryFish
Staff

Why Is This a Cringeworthy Statistics Statement? #1

I'm going to start a few discussion threads over the next few weeks, and I invite any JMP user to chime in with their thoughts.  These discussions will revolve around "Cringeworthy Statistics Statements."

 

Please consider contributing.  After a few days, I'll summarize thoughts in a blog post, and introduce another Cringeworthy Statement!

 

Here is Cringeworthy Statement #1:

 

We run a t-test to compare the means of two populations.  We want 95% confidence in the results.  We run the test, and the p-value comes back at 0.63.  We make the statement "Since p is not less than 0.05, we conclude that there is no difference in the means of these populations."

 

Why is this statement Cringeworthy????


Re: Why Is This a Cringeworthy Statistics Statement? #1

In addition to other problems mentioned in previous posts,

 

1) When testing whether two means are equal, the sidedness of the test must be specified a priori. If not otherwise stated, a test of mean equality usually implies a 2-sided test, in which case alpha, the 5% type I error probability, is the sum of the areas in the upper and lower tails of the null distribution. In almost all cases, this probability is divided equally between the tails, so a single-tail area should be compared to alpha/2, which is 0.025, not 0.05.

** EDIT: JMP takes care of this for you by doubling the tail area it finds for the observed test statistic, but if you were doing this "by hand" with a single tail, you would end up rejecting for a single-tail area of < 0.025. 
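To make the doubling concrete, here is a minimal sketch in Python (scipy) on simulated data; the sample sizes and means are made up for illustration, and the `alternative=` keyword assumes a reasonably recent scipy (1.6 or later).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(10.0, 2.0, size=30)   # hypothetical sample 1
b = rng.normal(10.5, 2.0, size=30)   # hypothetical sample 2

res_two = stats.ttest_ind(a, b)                                   # two-sided by default
p_lower_tail = stats.ttest_ind(a, b, alternative='less').pvalue   # single-tail area

# The two-sided p-value is twice the smaller tail area, so comparing it to 0.05
# is equivalent to comparing the single-tail area to 0.025.
print(res_two.pvalue)
print(2 * min(p_lower_tail, 1 - p_lower_tail))
```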

 

2) "95% confidence in the results" is cringeworthy because it misunderstands the main idea behind a hypothesis test. You can be 100% confident that, in running a hypothesis test at the 0.05 level of significance, you will have employed a certain procedure (if you've done things correctly). When the means of the 2 distributions are actually the same, over the long term (and assuming all assumptions hold), 5% of the time that procedure will produce a test statistic which results in a conclusion of differing means, due to sampling variability.

 

3) There are only 2 possible conclusions for a hypothesis test of this nature: 1) "The sample data provides (compelling) evidence that the null hypothesis is false" and 2) "The sample data does not provide compelling evidence that the null hypothesis is false". Notably absent is the statement "Based on the sample data we conclude that the null hypothesis is true." Generally when you want to prove something, you frame what you are trying to prove as the alternative hypothesis, hoping that the data supports rejection of the null in favor of the alternative. If the researcher is trying to demonstrate equivalence in means, then two one-sided tests would be more appropriate.
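For anyone curious what "two one-sided tests" (TOST) looks like mechanically, here is a hand-rolled sketch in Python assuming equal variances and a hypothetical equivalence margin `delta`; the data are simulated for illustration and this is not JMP's implementation.

```python
import numpy as np
from scipy import stats

def tost_equal_var(x, y, delta):
    """TOST p-value for H1: |mean(x) - mean(y)| < delta (pooled variances)."""
    nx, ny = len(x), len(y)
    diff = np.mean(x) - np.mean(y)
    sp2 = ((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    se = np.sqrt(sp2 * (1 / nx + 1 / ny))
    df = nx + ny - 2
    t_lower = (diff + delta) / se        # test H0: diff <= -delta
    t_upper = (diff - delta) / se        # test H0: diff >= +delta
    p_lower = stats.t.sf(t_lower, df)    # one-sided p for the lower test
    p_upper = stats.t.cdf(t_upper, df)   # one-sided p for the upper test
    return max(p_lower, p_upper)         # both one-sided tests must reject to claim equivalence

rng = np.random.default_rng(7)
x = rng.normal(100.0, 5.0, size=40)
y = rng.normal(100.5, 5.0, size=40)
print(tost_equal_var(x, y, delta=2.0))   # small value supports equivalence within +/- 2
```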

 

4) Others have mentioned this but it bears repeating: statistical significance and practical significance need not be, and usually are not, the same thing. Practical significance should be discussed and agreed upon from square one, if not square zero.

 

@dale_lehman mentions that "The problem with the correct interpretation is that it doesn't leave us able to say anything about the sample we actually have." I'll disagree, but I know where he is going with this. Every statistical technique has a particular question it is designed to answer. Particularly in the case of a null hypothesis that the data has failed to reject, one of the main issues non-statisticians (and Bayesian statisticians, for that matter) have with frequentist hypothesis tests is that a p-value answers a question that feels unnatural to ask, namely: "Under the null hypothesis, what is the probability that I would observe results as extreme as, or even more extreme than, the results I actually did observe?". Bayesians will say that their methods answer more natural questions, but Bayesian techniques have their own issues--there is no panacea. There is no substitute for understanding which techniques are better equipped to answer certain questions, and choosing a technique that is well-suited to the question you are trying to answer.

The recent(ish) backlash against the p-value, especially in the medical research arena, strikes me a bit like a backlash against forks when they're used as eye patches. A p-value is a tool, like any other. Use it the right way, for the right job, and you'll be fine... use it the wrong way and "you'll shoot your eye out!".

dale_lehman
Level VII

Re: Why Is This a Cringeworthy Statistics Statement? #1

I don't think my example really has anything to do with Bayesian vs Frequentist analysis or p values (at least directly).  You are correct, of course, that saying I'm 95% confident the true value lies in this 95% confidence interval is incorrect:  the confidence is in the procedure, not any one particular confidence interval.  The true value either is or is not in that one particular interval.

 

But my point is that 95% is the best estimate I can give of whether or not my one particular interval is one of the 95% of "good ones" that contains the true value or one of the 5% that would be giving the wrong conclusion.  You may find the wrong interpretation cringeworthy, but I do not.  In fact, I find the correct interpretation somewhat cringeworthy, as it is a bunch of words that makes most people's eyes glaze over.  And I think it contributes to the far worse (and more cringeworthy) practice of ignoring the uncertainty altogether and treating the point estimate as the true value.

Re: Why Is This a Cringeworthy Statistics Statement? #1

To me it just doesn't reflect an adequate level of understanding, but at least it's more or less the "correct interpretation."

 

A few folks discussed something along the lines of what I like to call the "p-value method" and the "confidence interval method" of running a t-test.  I like to think of these two ways of analysis as in essence the same thing, and I will propose here (and please feel free to disagree with me) that they are in fact different ways of mathematically expressing the same thing.  JMP is very smart about giving you the statistical presentation both ways in most cases, and I really like that!  Interpretation using confidence intervals (in the context of plain vanilla t-testing, non-inferiority testing, equivalence testing, and superiority testing) has always been more intuitive to me.
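Here is a small sketch of that duality on simulated data (Python/scipy, pooled-variance case, values made up for illustration): the 95% confidence interval for the mean difference excludes 0 exactly when the two-sided p-value falls below 0.05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(50.0, 4.0, size=35)
y = rng.normal(52.0, 4.0, size=35)

nx, ny = len(x), len(y)
diff = np.mean(x) - np.mean(y)
sp2 = ((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
se = np.sqrt(sp2 * (1 / nx + 1 / ny))
df = nx + ny - 2

t_crit = stats.t.ppf(0.975, df)
ci_low, ci_high = diff - t_crit * se, diff + t_crit * se
p_value = stats.ttest_ind(x, y).pvalue      # pooled-variance (equal_var=True) by default

print(f"95% CI for the mean difference: ({ci_low:.3f}, {ci_high:.3f})")
print(f"two-sided p-value:              {p_value:.4f}")
print("methods agree:", (p_value < 0.05) == (not (ci_low <= 0 <= ci_high)))
```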

 

A dear friend of mine gave me a nice textbook reference that I think speaks to this: 

Bluman, Elementary Statistics (7th edition), 2015