Fundamentals of power analysis in experiment design
Apr 2, 2012 9:28 AM
When I took my first course in linear models and design of experiments, my professor told the class that the most common question that he encountered in his statistical consulting was, “How many samples do I need [for my results to be statistically significant]?” This question comes out of a desire to obtain useful results from a study without wasting time or resources.
Although the above question is simple, the answer is not. That is because the answer depends on the values of several quantities that the consultant does not know yet. Worse yet, often the client does not know these values either.
What are these problematic quantities?
The easiest and least important quantity is the significance level of the hypothesis test that the client plans to use. Historically, 0.05 has been the standard significance level. I could envision a blog post on this one choice alone, but let us move on.
The signal is the most important quantity: it is the smallest effect that is large enough to be practically important. You would like to detect such an effect with high probability, and that probability is the power of the test. In my experience, it is surprisingly difficult to get a scientist or engineer to commit to a value for the smallest practically important effect.
The noise is another required piece of information. The noise in this case is the standard deviation of the experimental error. An engineering team might reasonably ask how they are supposed to know this value before they run the experiment to estimate it.
Statistical power calculations depend on the above three quantities as well as assumptions about the distribution of the response. Standard assumptions are that the responses are independently distributed according to a normal distribution and that the variance is constant for all possible factor combinations.
Given the significance level, the signal and the noise as well as the above assumptions, a statistician (or statistical software) can tell you the power of your experimental plan.
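To make that calculation concrete, here is a minimal sketch in Python. It is not how JMP or any other package computes power. It assumes a balanced two-level design, so that the estimated effect (mean at the high setting minus mean at the low setting) has standard error 2 * noise / sqrt(runs), and it treats the noise as known so that a normal test can stand in for the t-test. The function name approx_power and the example numbers are my own illustrative choices.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(signal, noise, n_runs, alpha=0.05):
    """Approximate power of a two-sided test for one two-level factor effect.

    Assumptions (illustrative, not from the post): a balanced two-level
    design, so the effect estimate has standard error
    2 * noise / sqrt(n_runs), and a known noise, so a normal (z) test
    stands in for the t-test.
    """
    z = NormalDist()
    se = 2.0 * noise / sqrt(n_runs)
    shift = signal / se                    # true effect in standard-error units
    z_crit = z.inv_cdf(1.0 - alpha / 2.0)  # two-sided rejection threshold
    # Probability that the test statistic lands in either rejection tail.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

# Hypothetical example: a signal twice the size of the noise.
for n in (8, 16, 32):
    print(n, round(approx_power(signal=2.0, noise=1.0, n_runs=n), 3))
```

Because the sketch treats the noise as known, it is optimistic when there are only a few degrees of freedom left for estimating the error variance. A t-based calculation would report lower power in that situation.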
What can an investigator do to increase the power of an experiment?
Here are the things you can do to increase the power of your experiments:
1) Increase the size of the signal
2) Decrease the noise
3) Do more runs
4) Use a bigger significance level – say, 0.1 instead of 0.05
How do you increase the size of the signal?
Obviously, the effect of a factor depends on nature. However, you can decide to make the minimum practically important effect larger. That may sound like cheating, but on reflection you may decide that you can actually be satisfied as long as you can detect effects that are two or three times bigger than what you specified at first.
How do you decrease the noise?
The experimental error is the result of random variation in the experimental conditions from run to run. The best way to reduce this variation is to identify sources of random variability and include them in the experiment as blocking factors. For example, if there is day-to-day variation and an experiment is going to require five days to run, then you can remove this variation from the error variance by making Day a blocking factor.
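A small simulation can illustrate the Day example. This is only a sketch under assumptions of my own: five days with four runs each, a random day-to-day shift with standard deviation 2, run-to-run noise with standard deviation 1, and no treatment factors at all, so that the only thing being estimated is the error variance. All function names are hypothetical.

```python
import random
from statistics import mean, variance

def simulate_days(rng, days=5, runs_per_day=4, day_sd=2.0, run_sd=1.0):
    """Responses grouped by day: a shared random day shift plus run noise."""
    data = []
    for _ in range(days):
        day_shift = rng.gauss(0.0, day_sd)
        data.append([day_shift + rng.gauss(0.0, run_sd)
                     for _ in range(runs_per_day)])
    return data

def variance_ignoring_days(data):
    """Error variance estimate when Day is left out of the model."""
    return variance([y for day in data for y in day])

def variance_with_day_blocks(data):
    """Error variance estimate after removing each day's mean (Day as a block)."""
    ss_within = 0.0
    for day in data:
        day_mean = mean(day)
        ss_within += sum((y - day_mean) ** 2 for y in day)
    n_total = sum(len(day) for day in data)
    return ss_within / (n_total - len(data))  # one df spent per day mean

def average_estimates(replicates=200, seed=2012):
    """Average both estimates over many simulated experiments."""
    rng = random.Random(seed)
    unblocked, blocked = [], []
    for _ in range(replicates):
        data = simulate_days(rng)
        unblocked.append(variance_ignoring_days(data))
        blocked.append(variance_with_day_blocks(data))
    return mean(unblocked), mean(blocked)

unblocked_var, blocked_var = average_estimates()
print(f"ignoring days: {unblocked_var:.2f}   blocking on Day: {blocked_var:.2f}")
```

Under these assumptions, the blocked estimate should sit near the run-to-run variance of 1, while the unblocked estimate also absorbs the day-to-day variation. The price of blocking is one degree of freedom for each day mean.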
More experimental runs are expensive. How much difference will a few runs make?
In screening experiments, there are often nearly as many factors as there are experimental runs. In the extreme case of a saturated design, there are exactly as many unknowns as runs. In this case, you cannot estimate the error variance for the full model, and therefore you cannot even perform a significance test.
An example of a saturated design is a seven-factor experiment for a main-effects model in eight runs. You have eight runs, and you are estimating an overall mean (or intercept term) and seven factor effects, so there is no information left over to estimate the error variance.
If we add one run and perform a t-test for one of the main effects, the signal has to be more than 12 times the noise in order to reject the null hypothesis. More precisely, with a single degree of freedom for error, the estimated effect must exceed about 12.7 times its standard error at the 0.05 level. The table below shows how big the signal-to-noise ratio in the t-test must be in order to reject the null hypothesis. Obviously, you cannot detect an effect unless you reject the hypothesis that there is no effect. So, with one extra run, the barrier to detection is pretty high. A good rule of thumb is to have four or five more experimental runs than unknown parameters.
How the required signal-to-noise ratio for the t-test drops as the number of extra runs increases.
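The thresholds in question are the two-sided critical values of Student's t distribution, with one error degree of freedom for each run beyond the saturated design. The sketch below recomputes them with only the Python standard library, using Simpson's rule for the distribution function and bisection for the quantile. The function names, the search bracket of 0 to 100, and the step counts are my own choices. A statistics package would give the same numbers more directly.

```python
from math import gamma, pi, sqrt

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_prob_zero_to(t, df, steps=2000):
    """P(0 < T < t) by Simpson's rule; steps must be even."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return total * h / 3

def t_critical(df, alpha=0.05):
    """Two-sided critical value: the t with P(|T| > t) = alpha.

    Found by bisection on [0, 100], which is assumed wide enough for
    the significance levels discussed here.
    """
    target = 0.5 - alpha / 2
    lo, hi = 0.0, 100.0
    for _ in range(40):
        mid = (lo + hi) / 2
        if t_prob_zero_to(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Extra runs beyond a saturated design = degrees of freedom for error.
for extra_runs in range(1, 6):
    print(extra_runs, round(t_critical(extra_runs), 3))
```

With one extra run the threshold is about 12.7. By five extra runs it has dropped to roughly 2.6, which is why four or five extra runs is a sensible rule of thumb.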
What is the cost of increasing the significance level from 0.05 to 0.1?
When your significance level is 0.05 and a factor truly has no effect, you will declare a spurious effect once in every 20 tests on average. Raising the level to 0.1 increases spurious detection to one in 10 tests.
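These rates are easy to check by simulation. The sketch below repeatedly tests a true effect of zero and counts how often the test rejects anyway. To keep it short it uses a z-test with a known noise standard deviation of 1 rather than a t-test. That choice, the function name, the sample size of 10 and the number of trials are all assumptions of mine.

```python
import random
from math import sqrt
from statistics import NormalDist

def false_positive_rate(alpha, trials=20000, n=10, seed=7):
    """Fraction of two-sided z-tests that reject when the true effect is zero."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Noise sd is known to be 1, so mean * sqrt(n) is a z statistic.
        z = (sum(sample) / n) * sqrt(n)
        if abs(z) > z_crit:
            rejections += 1
    return rejections / trials

for alpha in (0.05, 0.1):
    print(alpha, false_positive_rate(alpha))
```

The observed fractions should land close to 0.05 and 0.1, up to simulation error.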
In my next post, I will show the new interface for power calculations in JMP 10. So, this post is basically motivation for next week's.
I would like to wrap up by reminding you that you will not know the error standard deviation until after you run the experiment. Therefore, power calculations are based on a guess about what that number will be. Though statistical software reports the power to three decimal places, you should take these numbers with a large grain of salt.