Mar 30, 2012

Fundamentals of power analysis in experiment design

When I took my first course in linear models and design of experiments, my professor told the class that the most common question that he encountered in his statistical consulting was, “How many samples do I need [for my results to be statistically significant]?” This question comes out of a desire to obtain useful results from a study without wasting time or resources.

Although the above question is simple, the answer is not. That is because the answer depends on the values of several quantities that the consultant does not know yet. Worse yet, often the client does not know these values either.

What are these problematic quantities?

The easiest and least important quantity is the significance level of the hypothesis test that the client plans to use. Historically, 0.05 has been the standard significance level. I could envision a blog post on this one choice alone, but let us move on.

The signal is the most important quantity. That is the smallest effect large enough to be practically important. You would like to detect such an effect with a high probability, and that probability is the power of the test. In my experience, it is surprisingly difficult to get a scientist or engineer to commit to a value for the smallest practically important effect.

The noise is another required piece of information. The noise in this case is the standard deviation of the experimental error. An engineering team might reasonably ask how they are supposed to know this value before they run the experiment to estimate it.

Statistical power calculations depend on the above three quantities as well as assumptions about the distribution of the response. Standard assumptions are that the responses are independently distributed according to a normal distribution and that the variance is constant for all possible factor combinations.

Given the significance level, the signal and the noise as well as the above assumptions, a statistician (or statistical software) can tell you the power of your experimental plan.
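To make that concrete, here is a minimal Python sketch of such a calculation. It is not what any particular statistical package does internally. It assumes the simplest possible setting: a two-sided test of a single mean with a known error standard deviation, so the normal distribution stands in for the t distribution. The function name approx_power and the example numbers are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(signal, noise, n_runs, alpha=0.05):
    """Approximate power of a two-sided z-test (illustrative assumption:
    the noise standard deviation is known, so no t distribution is needed)."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)   # rejection threshold
    shift = signal / (noise / sqrt(n_runs))      # signal in standard-error units
    # Probability that the test statistic lands in either rejection region
    upper = 1 - std_normal.cdf(z_crit - shift)
    lower = std_normal.cdf(-z_crit - shift)
    return upper + lower

# Hypothetical plan: signal equal to the noise, 8 runs, significance level 0.05
print(round(approx_power(signal=1.0, noise=1.0, n_runs=8), 3))
```

With no signal at all, the function returns the significance level itself, which is a handy sanity check on any power calculation.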

What can an investigator do to increase the power of an experiment?

Here are the things you can do to increase the power of your experiments:

1) Increase the size of the signal

2) Decrease the noise

3) Do more runs

4) Use a bigger significance level (say, 0.1 instead of 0.05)

How do you increase the size of the signal?

Obviously, the effect of a factor depends on nature. However, you can decide to make the minimum practically important effect larger. That may sound like cheating, but on reflection you may decide that you can actually be satisfied as long as you can detect effects that are two or three times bigger than what you specified at first.
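A rough sketch shows why relaxing the minimum effect pays off so quickly. Under a normal approximation (an assumption made here for simplicity, with a known noise standard deviation and a single mean being tested), the number of runs needed scales with the square of the noise-to-signal ratio. The function required_runs and the target power of 0.8 are illustrative choices, not something prescribed above.

```python
from math import ceil
from statistics import NormalDist

def required_runs(signal, noise, alpha=0.05, power=0.80):
    """Approximate runs needed for a two-sided z-test to reach the target
    power (illustrative assumption: known noise, normal approximation)."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_power = std_normal.inv_cdf(power)
    return ceil(((z_alpha + z_power) * noise / signal) ** 2)

# Hypothetical minimum effects: the original one, then two and three times bigger
for signal in (0.5, 1.0, 1.5):
    print(signal, required_runs(signal, noise=1.0))
```

In this approximation, doubling the minimum effect from 0.5 to 1.0 cuts the requirement from 32 runs to 8, and tripling it brings the figure down to 4. For run counts this small the t distribution matters, so treat the numbers as a guide to the scaling rather than as a plan.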

How do you decrease the noise?

The experimental error is the result of random variation in the experimental conditions from run to run. The best way to reduce this variation is to identify sources of random variability and include them in the experiment as blocking factors. For example, if there is day-to-day variation and an experiment is going to require five days to run, then you can remove this variation from the error variance by making Day a blocking factor.
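The Day example can be sketched with a small simulation. Everything in it is hypothetical: the day-to-day shifts, the treatment effect, the noise level and the layout (five days, two runs of each of two treatments per day). The point is only to compare the estimated error standard deviation when Day is ignored with the estimate when Day is in the model as a block.

```python
import random
from statistics import mean

random.seed(1)
day_shift = [-4.0, -2.0, 0.0, 2.0, 4.0]   # hypothetical day-to-day shifts
effect = {"A": 0.0, "B": 1.0}             # hypothetical treatment effect
noise_sd = 1.0                            # hypothetical run-to-run noise

# Balanced layout: each treatment appears twice on every day
runs = []
for day, shift in enumerate(day_shift):
    for trt in ["A", "A", "B", "B"]:
        y = 10.0 + shift + effect[trt] + random.gauss(0.0, noise_sd)
        runs.append((day, trt, y))

grand = mean(y for _, _, y in runs)
trt_mean = {t: mean(y for _, tr, y in runs if tr == t) for t in effect}
day_mean = {d: mean(y for dd, _, y in runs if dd == d)
            for d in range(len(day_shift))}

# Residual sums of squares without and with Day in the model
ss_unblocked = sum((y - trt_mean[t]) ** 2 for _, t, y in runs)
ss_blocked = sum((y - trt_mean[t] - day_mean[d] + grand) ** 2
                 for d, t, y in runs)

n = len(runs)
sd_unblocked = (ss_unblocked / (n - 2)) ** 0.5                       # 2 treatment means
sd_blocked = (ss_blocked / (n - 2 - (len(day_shift) - 1))) ** 0.5    # plus 4 day effects
print(round(sd_unblocked, 2), round(sd_blocked, 2))
```

Blocking costs four degrees of freedom here, but the error estimate no longer carries the day-to-day shifts, so the noise that enters the power calculation is much smaller.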

More experimental runs are expensive. How much difference will a few runs make?

In screening experiments, there are often nearly as many factors as there are experimental runs. In the extreme case of a saturated design, there are exactly as many unknowns as runs. In this case, you cannot estimate the error variance for the full model, and therefore you cannot even perform a significance test.

An example of a saturated design is a seven-factor experiment for a main-effects model in eight runs. You have eight runs, and you are estimating an overall mean (or intercept term) and seven factor effects, so there is no information left over to estimate the error variance.

If we add one run and perform a t-test for one of the main effects, the signal has to be more than 12 times the noise in order to reject the null hypothesis. The table below shows how big the signal-to-noise ratio in the t-test must be in order to reject the null hypothesis. Obviously, you cannot detect an effect unless you reject the hypothesis that there is no effect. So, with one extra run, the barrier to detection is pretty high. A good rule of thumb is to have four or five more experimental runs than unknown parameters.

How the required signal-to-noise ratio for the t-test drops as the number of extra runs increases.
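The thresholds can be recomputed with a short script. The sketch below assumes that each entry is the two-sided 0.05 critical value of the t statistic, with error degrees of freedom equal to the number of extra runs. That assumption matches the "more than 12 times" figure for one extra run. The helper names are made up, and the t distribution is integrated numerically so that only the Python standard library is needed.

```python
import math

def _cdf_from_angle(theta, df, steps=2000):
    """Student-t CDF at t = sqrt(df) * tan(theta), for theta in [0, pi/2].
    The substitution turns the t density into cos(theta) ** (df - 1)."""
    k = math.gamma((df + 1) / 2) / (math.sqrt(math.pi) * math.gamma(df / 2))
    h = theta / steps
    f = lambda a: math.cos(a) ** (df - 1)
    total = f(0.0) + f(theta)
    for i in range(1, steps):              # composite Simpson's rule
        total += (4 if i % 2 else 2) * f(i * h)
    return 0.5 + k * total * h / 3

def t_critical(df, alpha=0.05):
    """Two-sided critical value of the t statistic, found by bisection."""
    target = 1 - alpha / 2
    lo, hi = 0.0, math.pi / 2
    for _ in range(60):
        mid = (lo + hi) / 2
        if _cdf_from_angle(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return math.sqrt(df) * math.tan((lo + hi) / 2)

for extra_runs in range(1, 6):
    print(extra_runs, round(t_critical(extra_runs), 2))
```

Under these assumptions the threshold falls from about 12.7 with one extra run to about 4.3 with two, 3.2 with three, 2.8 with four and 2.6 with five, which is where the rule of thumb of four or five extra runs comes from.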

What is the cost of increasing the significance level from 0.05 to 0.1?

When your significance level is 0.05 and a factor truly has no effect, you will still declare that effect significant once in every 20 tests on average. Raising the level to 0.1 increases this rate of spurious detection to one in 10 tests.
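A quick simulation illustrates those rates. The sketch assumes that the test statistic is standard normal when there is no real effect (that is, a z-test with known noise); the seed, the number of simulated tests and the function name are arbitrary choices.

```python
import random
from statistics import NormalDist

random.seed(2012)
std_normal = NormalDist()

def false_detection_rate(alpha, n_tests=20000):
    """Fraction of null tests (no real effect) that are declared significant."""
    hits = 0
    for _ in range(n_tests):
        z = random.gauss(0.0, 1.0)                       # statistic under the null
        p_value = 2 * (1 - std_normal.cdf(abs(z)))       # two-sided p-value
        if p_value < alpha:
            hits += 1
    return hits / n_tests

rate_05 = false_detection_rate(0.05)
rate_10 = false_detection_rate(0.10)
print(rate_05, rate_10)
```

The two printed rates should land close to 0.05 and 0.10, that is, roughly one spurious detection in 20 tests and one in 10.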

What’s next?

In my next post, I will show the new interface for power calculations in JMP 10. So, this post is basically motivation for next week's.

I would like to wrap up by reminding you that you will not know the error standard deviation until after you run the experiment. Therefore, power calculations are based on a guess about what that number will be. Though statistical software reports the power to three decimal places, you should take these numbers with a large grain of salt.


Michael Anderson wrote:

I would like to wrap up by reminding you that you will not know the error standard deviation until after you run the experiment. Therefore, power calculations are based on a guess about what that number will be.

Unless you're studying some phenomenon that was just discovered (or manufactured) this morning, there's bound to be some descriptive statistics lurking around somewhere. So dig 'em up. If there really aren't any, you need to squeeze a pilot study into your plans before the experimental Big Event, so you can get an estimate (with a confidence interval) for that pesky standard deviation.

If every teacher of basic statistics drilled the idea "Pilot, then Experiment" into their students, we'd rescue countless grad students from ill-defined, underpowered experiments and surveys. It only took me 10 years of teaching to figure that out...


Steve Figard wrote:

Thank you for a very lucid and concise discussion of this important topic! I look forward to further installments.

Bradley Jones wrote:

There is certainly some value in looking at the data you already have or running a small pilot study if there is no data. However, the data you already have may have variability that you can avoid in an experiment either through blocking or more careful control of the experimental conditions. Thus, the error standard deviation you obtain using the current data may be unnecessarily large and lead you to think that your experiment will not have enough power. That, in turn, may cause you to do more runs than necessary thereby increasing the cost.


Power Analysis in Custom Design for a designed experiment in JMP 10 wrote:

[...] my previous post, I talked about the fundamental quantities that affect the ability of a designed experiment to [...]


Jay Lee wrote:

In addition to changing the level of significance, a researcher can test a specific (directional) hypothesis, so that the test is one-tailed rather than two-tailed.


Andrew Ekstrom wrote:

I have been struggling with this issue. I need to determine 'n' for a multiple logistic regression at different powers for a system where the response rate might be as high as 5% and probably closer to 0.10%. I've asked around and have been told that I need to run a simulation(s). Does JMP have the ability to do these types of simulations?


Bradley Jones wrote:

While it is not trivial to write a script to simulate power for a logistic regression analysis, JMP can definitely do it. In fact, I often compare analytical results with the outputs of such simulations.