This question concerns the chance of Type I error in hypothesis testing. The data come from a series of experiments. To make it easier to describe clearly, we could consider the experiment as having 3 parameters that could be varied. We could call these A, B, and C. A has five levels, B has four levels, and C has three levels. Some combinations of A, B, and C don't go together, so when we remove those we have 42 distinct combinations left.
For each of the 42 combinations, we test the combination of treatment parameters on each of seven subjects to yield a single real number as a result. We compute the mean for each of these 42 combinations and test the hypothesis that the mean is equal to one. So far, this analysis seems to avoid problems.
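To make the per-combination test concrete, here is a minimal sketch of the one-sample t statistic against a hypothesized mean of one. The seven values are invented for illustration only and are not data from the experiment; the function name is also hypothetical. Converting the statistic to a p-value needs the t distribution with n - 1 degrees of freedom, which a statistics package such as JMP reports directly.

```python
import math
from statistics import mean, stdev


def one_sample_t(values, mu0=1.0):
    """Return (t statistic, degrees of freedom) for H0: mean == mu0."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    standard_error = stdev(values) / math.sqrt(n)  # stdev uses the n - 1 denominator
    t_stat = (mean(values) - mu0) / standard_error
    return t_stat, n - 1


# Hypothetical ratios for one combination, one value per subject (made-up numbers).
example_ratios = [1.2, 0.9, 1.4, 1.1, 1.3, 1.0, 1.5]
t_stat, df = one_sample_t(example_ratios, mu0=1.0)
print(f"t = {t_stat:.3f} on {df} degrees of freedom")
```

The same calculation would be repeated for each of the 42 combinations, which is what raises the multiple-testing question.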
However, the professor says we must now stratify according to levels of A to perform pairwise t-tests between those levels. (He did not specify how to handle levels of B and C. He's not a statistician.) The null hypothesis is that the means are equal. Is some correction necessary to avoid Type I error here? Statistics professors have rightly cautioned against "data dredging" and performing too many statistical tests on the same set of data to try to find something significant. If it were the pairwise t-tests only, perhaps the Bonferroni adjustment would help, but performing one-sample t-tests then performing pairwise t-tests on the same data is troubling. What would you say? It is not possible to gather more data in the lab at this point, so is some correction possible?
My apologies for the slow reply. I've been thinking about this question and I want to make sure I understand your study correctly so I can provide the most specific recommendation. In this study, it sounds like you have 7 subjects whom you have measurements on for each of the 42 conditions, and that you aren't comparing each combination to each other, but rather testing whether across the 7 subjects you have evidence the mean (within any particular combination) is different from 1. So, 42 one-sample t-tests? Or, are you taking the mean across the 42 conditions for each subject (so, one analyzable data point for each subject)? If it's the former, you will probably need to take explicit measures to control for alpha escalation. I wouldn't recommend Bonferroni correcting with that many comparisons -- it will be overly conservative. False discovery rate correction will be a better approach, and if this is the situation you find yourself in I can help you carry that out in JMP.
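To give a rough sense of the alpha escalation with 42 tests, here is a small sketch. It assumes the tests are independent and every null hypothesis is true, which is a simplification (the 42 tests here share the same seven subjects), so treat the number as a guide rather than an exact rate. The function names are just for illustration.

```python
def familywise_error_rate(m, alpha=0.05):
    """Chance of at least one false positive among m independent true-null tests."""
    return 1.0 - (1.0 - alpha) ** m


def bonferroni_alpha(m, alpha=0.05):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / m


m = 42
print(f"Uncorrected familywise error rate for {m} tests: {familywise_error_rate(m):.3f}")
print(f"Bonferroni per-test threshold: {bonferroni_alpha(m):.5f}")
```

The very small per-test threshold is the reason Bonferroni costs so much power with this many comparisons.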
As for the second part of your post, can you explain a little more what you mean by "stratify according to levels of A"? What are the pairwise tests you are talking about here? One particular combination vs. another? Or the average of all combinations involving A1 vs. all combinations involving A2, or something like that? It sounds like you'll find yourself with a lot of t-tests in either case, so again I would recommend false discovery rate correction over Bonferroni. With regard to alpha-controlling methods in general, independence of the tests isn't strictly necessary for them to work. Bonferroni corrections, for instance, swing more conservative in the presence of non-independence. This is still a problem, of course, because such procedures attenuate statistical power, but at least you will not be understating your familywise error rate when the tests are non-independent.
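If the pairwise tests turn out to be every level of A against every other level, the count is easy to enumerate. This is only a sketch under that assumption; the labels A1 through A5 are placeholders for the five levels.

```python
from itertools import combinations

# Placeholder labels for the five levels of A.
levels = ["A1", "A2", "A3", "A4", "A5"]

# Every unordered pair of levels is one pairwise test.
pairs = list(combinations(levels, 2))
n_tests = len(pairs)

# Bonferroni per-test threshold for this family of comparisons.
per_test_alpha = 0.05 / n_tests

for first, second in pairs:
    print(f"{first} vs {second}")
print(f"{n_tests} pairwise tests, Bonferroni threshold {per_test_alpha:.4f}")
```

If the pairwise tests are repeated within each combination of B and C, the family grows accordingly, and the number of p-values to correct should reflect every test actually run.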
I hope this helps!
Yes, you are exactly correct. For the first part there are 42 one-sample t-tests. You mentioned that JMP can help, but I don't know how to do this without Bonferroni.
Your questions about the levels of "A" show that I must provide a better description of the challenge at hand. We use one metric, and the construction of that metric is described here. The research question and experiment are about optimizing direct stimulation of muscle, preferably not indirectly through the motor nerve. (We control this sometimes by modulating the current level and sometimes with tetrodotoxin.) Ideally we would try to maximize the force output for the chosen amount of power input. Toward this end, we issue a one-second control pulse, one experimental pulse train, then another control pulse. This is done to prevent temperature fluctuations from causing large changes in the measure of interest over the course of the experiment. For each control pulse or experimental pulse, we measure the peak force output. Then a ratio of the peak force from the experiment train to the average of peak forces from control pulses indicates whether the experimental pulse train is better. The hypothesized mean for each is one, which would indicate that experimental pulse and control pulse give the same peak force output and there is no benefit to the experimental pulse. A ratio of two would indicate an experimental train that produced double the peak force of the control pulse.
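As a sketch of how the metric is built from one control/experiment/control triplet (the function and argument names are made up for this illustration, and the numbers are not real measurements):

```python
from statistics import mean


def force_ratio(control_before, experimental, control_after):
    """Peak force of the experimental train divided by the mean control peak force."""
    control_mean = mean([control_before, control_after])
    if control_mean <= 0:
        raise ValueError("control peak forces must average to a positive value")
    return experimental / control_mean


# Invented peak forces in arbitrary units: the experimental train doubles the control.
print(force_ratio(control_before=2.0, experimental=4.0, control_after=2.0))
```

A ratio of one means the experimental train matched the bracketing control pulses; averaging the two controls is what absorbs slow drift over the course of the experiment.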
There are 5 pulse widths, 4 frequencies, and 3 electric current levels. The 42 one-sample t-tests show whether each combination of pulse width, frequency, and current level is better than the control pulse. Like most biological examples, the muscle is nonlinear in its response, so the results are not what you would expect by linear thinking. That is, doubling the electric current does not, as a rule, double the force output. Given that a parametric math model for this tissue is not yet available, the statistical tests help to reveal the truth so the math modelers can work on developing a coupled differential equations model for this particular gastrointestinal tissue.
For this part, we consider groups of results which differ only in pulse width (i.e. levels of "A"). The pairwise comparison is desired this time because it could indicate which pulse width is best. We tried comparing the five pulse widths against each other using JMP's Wilcoxon Signed Rank test with promising results. These five are a subset of the 42 combinations. All five had the same frequency and electric current level. Your idea of False Discovery Rate Correction sounds great, but I did not learn this in class. Could you tell me which of the many JMP resources might show how it is done?
JMP is an amazing suite of tools, and a novice like me is completely lost with so many possibilities. Thank you very much for your advice!
Here is a link to an easy False Discovery Rate tool: False Discovery Rate PValue
What you have to do for this to work is create a table with the exact p-values in one column, and a second column identifying (labeling) the tests. For example, see the final table below I made by running some one-sample tests using the Iris sample dataset.
To make that final table of p-values, I used Analyze > Distribution, then used the four numeric columns in the dataset. Then, I did a one-sample test of the mean for each column, which yields an output like below:
To get the p-values in one table, I right-clicked one of the tables of p-values and selected "Make Combined Data Table." This produces the table below:
I prefer two-tailed, non-directional tests, so I deleted all but the "Prob > |t|" rows, which yields the following:
Now we want to "correct" these p-values for the number of tests we made, such that we control the proportion of the time we would make a "false discovery." Using that add-in I linked you to above, we can generate the FDR p-values by specifying the p-value column, and label column.
Which will return the following table:
Notice that the FDR p-values are slightly larger than the original p-values (in this case they're all ludicrously small, but the more tests you have, the larger the correction will be, and it could make a substantive difference). For your first 42 tests, I think this would be a good way to control for multiple comparisons.
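For anyone who wants to see the arithmetic behind an FDR adjustment, here is a minimal sketch of the Benjamini-Hochberg procedure. That the add-in uses this particular procedure is an assumption on my part (it is the most common FDR adjustment), and the p-values below are made up for illustration.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, scaling by m / rank and
    # enforcing that adjusted values never increase as rank decreases.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Made-up raw p-values with hypothetical test labels.
labels = ["test 1", "test 2", "test 3", "test 4"]
raw = [0.01, 0.04, 0.03, 0.005]
for label, p, q in zip(labels, raw, benjamini_hochberg(raw)):
    print(f"{label}: raw p = {p:.3f}, FDR-adjusted p = {q:.3f}")
```

Each adjusted value is at least as large as its raw p-value, and the gap widens as the number of tests grows, which is the pattern in the table above.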
Regarding your elaboration and pairwise tests, this sounds a lot like TMS studies I have helped out on. The same FDR correction procedure could be used in this case, just by making that table of p-values and labels. JMP also has built-in platforms that support FDR correction, including Modeling > Response Screening and Analyze > Fit Model with the "Response Screening" personality. These platforms are set up more for screening a large number of outcome variables for an effect than for controlling multiple comparisons with the same DV, so I think it might be more straightforward to use the add-in I linked to above.
I hope this helps!
Thank you very much!
This helps tremendously because no one knew how to do this when I posed this question on a popular statistics forum (not affiliated with SAS) some time ago. It's great that you are here helping JMP users! Also, thank you for the screenshots since they make it so much easier to follow your excellent explanation.