Learning from my mistakes -- Part 2: What does the t-test really tell us?

JerryFish · Jan 29, 2019 10:14 AM

T-tests are easy to run, but results can be confusing!

In my previous blog post, I outlined a scenario where an engineer was tasked with determining whether a process change affected the strength of a part. The engineer chose to sample three parts with the new treatment (Treatment A) and compare them to three standard (Control) parts. The results were analyzed with a standard t-test, and the resulting p-value was larger than 0.05 (the cutoff for 95% confidence). Therefore, the new process was abandoned.

In my career as an engineer, I’ve seen this scenario several times, and I contend that the engineer may have abandoned a good process improvement.

Why do I say that? To start, I think the test may have been poorly planned and executed due to a lack of understanding about t-tests, and a lack of clarity about what we were trying to accomplish. In other words, Treatment A could have been a viable solution to our desire to increase part strength!

Let’s look at the t-test, what we can expect from it, and how results are often misinterpreted.

What Can We Expect from a t-test?

What are we trying to do with a t-test? Engineers often use t-tests to try to detect differences between the averages (or means) of two large groups of parts (or populations). The populations are too large for us to measure all parts, so we draw a limited number of samples from each of the populations, hoping that they are representative of the whole. But since the sample sizes are limited, the sample means will almost always be different from those of their parent populations. In fact, they can be quite far off! (For example, we might accidentally draw three high samples from one population, so the sample mean might be much higher than the population mean.) Comparing just the sample means can certainly be misleading, so we apply some statistics (i.e., the t-test) to allow some "Kentucky windage" that compensates for the uncertainty in the sample means.
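To get a feel for how far off a three-part sample mean can be, here is a small simulation sketch. The population (mean strength 100, standard deviation 5) is purely an assumption for illustration, not data from the scenario. Theory says the sample means should scatter with a standard deviation of about 5/√3 ≈ 2.9.

```python
import random
import statistics

# Hypothetical population: part strength with mean 100 and standard deviation 5.
# These numbers are illustrative assumptions, not data from the post.
POP_MEAN = 100.0
POP_SD = 5.0
SAMPLE_SIZE = 3
N_TRIALS = 10_000

rng = random.Random(42)  # fixed seed so the run is repeatable

# Draw many independent samples of three parts and record each sample mean.
sample_means = [
    statistics.fmean([rng.gauss(POP_MEAN, POP_SD) for _ in range(SAMPLE_SIZE)])
    for _ in range(N_TRIALS)
]

# How widely do the sample means scatter around the true population mean?
spread = statistics.stdev(sample_means)
off_by_3 = sum(abs(m - POP_MEAN) > 3.0 for m in sample_means) / N_TRIALS

print(f"lowest sample mean:  {min(sample_means):.1f}")
print(f"highest sample mean: {max(sample_means):.1f}")
print(f"std. dev. of sample means: {spread:.2f}")
print(f"fraction of sample means more than 3 units off: {off_by_3:.1%}")
```

Every one of those samples came from exactly the same population, yet a sizable share of the sample means land several units away from 100.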

A Typical (and Uninformed) t-test Interpretation

In our scenario, the engineer drew three samples of each type of part. (We don’t know why he did that… perhaps it was just tradition at his company.) We tell our software package how many samples have been collected and what confidence level (often 95%) we want to have in the results. We then let our software run the t-test calculation, and we get a “p-value.” This p-value tells us whether we should “reject the null hypothesis.” (Why can’t statisticians talk in plain English, right?) In our scenario, for 95% confidence we need a p-value of less than 0.05, which we did not achieve. So we abandoned the improvement.
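For readers who want to see what the software is doing, here is a minimal sketch of a pooled (equal-variance) two-sample t-test with three parts per group. The strength values are made up for illustration; the scenario never gave actual measurements. The helper `t_cdf_df4` uses the closed-form t distribution that holds only for 4 degrees of freedom (3 + 3 - 2), so this sketch is not a general-purpose t-test.

```python
import math
import statistics

# Hypothetical strength measurements (illustrative only, not from the post).
control = [50.1, 52.3, 48.9]
treatment_a = [53.0, 55.8, 51.2]

def pooled_t_statistic(a, b):
    """Student's two-sample t statistic, assuming equal population variances."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    std_err = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (statistics.fmean(b) - statistics.fmean(a)) / std_err

def t_cdf_df4(t):
    """Closed-form CDF of Student's t distribution, valid only for 4 degrees of freedom."""
    u = t * t / (1 + t * t / 4)
    return 0.5 + 0.375 * (t / math.sqrt(1 + t * t / 4)) * (1 - u / 12)

t_stat = pooled_t_statistic(control, treatment_a)
p_two_sided = 2 * (1 - t_cdf_df4(abs(t_stat)))  # df = 3 + 3 - 2 = 4

print(f"t = {t_stat:.3f}, two-sided p-value = {p_two_sided:.3f}")
print("significant at 0.05" if p_two_sided < 0.05 else "not significant at 0.05")
```

With these made-up numbers the Treatment A sample mean is nearly 6% higher than the Control sample mean, and the test still comes back "not significant" at the 0.05 level.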

The Problem with This Interpretation

Confidence levels, p-values, null hypotheses… this is statistical jargon that is often difficult to understand and can lead to improper conclusions. Let’s unwrap it a bit in layman’s terms.

What the t-test (with a p-value greater than 0.05) is telling us is that, given our sample data (including sample size), we CANNOT SAY with 95% confidence that there is a difference in the means of the two large populations.
What the t-test IS NOT SAYING is that the populations share the same mean! This is very subtle but extremely important! We haven’t been able to prove that they are different, but they may in fact still be different! The variance (or noise) in our data made it impossible to detect the difference in this test. But perhaps if we had collected five samples each (or 10, or 100) we would have been able to detect a difference!
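The effect of sample size can be sketched with a simulation. Here I assume, purely for illustration, that Treatment A really is stronger by one standard deviation (105 versus 100, with a standard deviation of 5), and I count how often a two-sided pooled t-test at the 0.05 level detects that real difference. The cutoffs in `T_CRIT` are standard t-table values for 2n - 2 degrees of freedom.

```python
import math
import random
import statistics

# Two-sided 5% critical values of Student's t for df = 2n - 2 (standard table values).
T_CRIT = {3: 2.776, 5: 2.306, 10: 2.101}

# Hypothetical populations: Treatment A really is stronger by one standard deviation.
CONTROL_MEAN, TREATMENT_MEAN, SD = 100.0, 105.0, 5.0

def detects_difference(rng, n):
    """Run one simulated experiment with n parts per group; True if |t| exceeds the cutoff."""
    a = [rng.gauss(CONTROL_MEAN, SD) for _ in range(n)]
    b = [rng.gauss(TREATMENT_MEAN, SD) for _ in range(n)]
    pooled_var = (statistics.variance(a) + statistics.variance(b)) / 2  # equal group sizes
    t = (statistics.fmean(b) - statistics.fmean(a)) / math.sqrt(pooled_var * 2 / n)
    return abs(t) > T_CRIT[n]

def detection_rate(n, trials=5000, seed=1):
    """Fraction of simulated experiments in which the real difference is detected."""
    rng = random.Random(seed)
    return sum(detects_difference(rng, n) for _ in range(trials)) / trials

rates = {n: detection_rate(n) for n in T_CRIT}
for n, rate in rates.items():
    print(f"{n:>2} parts per group: difference detected in {rate:.0%} of experiments")
```

Under these assumptions, the three-part experiment misses a perfectly real improvement most of the time, and the detection rate climbs as the groups get larger.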

So, the first mistake that I have seen is concluding from the failed (i.e., high p-value) t-test that population means are the same. Remember that t-tests can only provide evidence that population means are different; they can never show that the means are the same!

(Another mistake that I glossed over was that we didn’t explicitly define what level of confidence was needed in the results. A confidence level of 95% is typical in the engineering world, but much higher confidence is often needed in medical studies, etc.)

On to the next mistake…

What Is an Important Difference to You?

Recall that the engineer launched into the experiment simply looking for whether a difference existed between the two population means. But in most practical scenarios, we are interested in whether Treatment A (in our example) gives us at least a certain amount of improvement over the standard. In our scenario, is a 0.00001% improvement in strength important to us? Maybe, but probably not. Would an improvement of 1% be important to us? Or 10%?

This “important difference” really boils down to the Voice of the Customer, as (generally) a process improvement must be judged against what the customer wants, and therefore what the customer will pay for. It could also involve several other considerations to the production company.

In our scenario, nothing was said about what difference in population means is important to us. In fact, we didn’t even explicitly say that we wanted the mean of the Treatment A population to be higher than the Control! (In this case, it seems obvious that we want to increase strength of the parts, but perhaps the part is overdesigned, and we want to reduce strength to save cost.) Whether we are looking for an increase or a decrease also factors into the statistical calculations, so it is important to specify this. The engineering team, along with management, needs to decide what difference is important, and whether it is an increase or a decrease, before any test planning can commence.

So, the second mistake that I often see is that the size of the important difference was not specified, nor was the direction (increasing or decreasing) of the difference specified.
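To show how much the size and direction of the important difference drive the test plan, here is a rough sketch using the common normal-approximation formula for parts per group, n ≈ 2·((z_alpha + z_power)·σ/δ)². The function name `parts_per_group`, the standard deviation of 5, and the 80% chance of detecting the difference are all my assumptions for illustration; none of them come from the scenario. The approximation also runs a little low for very small groups, so treat the output as a ballpark.

```python
import math
from statistics import NormalDist

def parts_per_group(important_diff, sd, alpha=0.05, power=0.80, one_sided=True):
    """Approximate parts needed per group to detect `important_diff` (normal approximation)."""
    z = NormalDist().inv_cdf
    # A one-sided test (we only care about an increase, say) puts all of alpha in one tail.
    z_alpha = z(1 - alpha) if one_sided else z(1 - alpha / 2)
    z_power = z(power)
    n = 2 * ((z_alpha + z_power) * sd / important_diff) ** 2
    return math.ceil(n)

# Hypothetical standard deviation of 5 strength units; vary the important difference.
for diff in (10.0, 5.0, 2.5):
    n_one = parts_per_group(diff, sd=5.0)
    n_two = parts_per_group(diff, sd=5.0, one_sided=False)
    print(f"important difference {diff:>4}: "
          f"{n_one} per group (one-sided), {n_two} per group (two-sided)")
```

Halving the important difference roughly quadruples the number of parts needed, and asking only about an increase (one-sided) takes fewer parts than asking about any difference at all (two-sided). That is why both need to be decided before the test is planned.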

How did these mistakes affect our test scenario?

If we had given thought to these points, we might have redesigned our test and come to an entirely different conclusion about Treatment A. But because of our lack of knowledge about the t-test, and our poor definition of what we were trying to accomplish, we passed over Treatment A as a viable solution.

We’ll talk more about other aspects of test planning in the next blog post, culminating in Ask the Complete Question, which can help avoid many of these mistakes. I hope you can tune in for Part 3 in about a week!

Looking Ahead

I’m planning to touch on all of the following subjects as we go along. Hope you can join me for the whole series!

Laying Out a Typical Scenario
What Does the t-test Really Tell Us? (this post)
Other Test Considerations, and Ask the Complete Question
Is Your Gauge Capable (and What to Do if It Isn’t)?
Planning and Running the Test: From Sample Sizes to Best Practices
Analyzing the Results (Including Garbage In = Garbage Out)
Final Wrap-Up