Hi @34South,
I wouldn't say this is a "naive question" whatsoever! It actually gets fairly deep into the theoretical underpinnings and assumptions of our usual parametric testing. Let me peel apart each of your points to make some comments, and then hopefully end with some practical suggestions:
- What the central limit theorem asserts: as you stated, the CLT asserts that the sampling distribution of a statistic (let's assume the mean going forward) approaches a normal distribution as the sample size (n) approaches infinity. The CLT does not assert (it doesn't need to assert) the unbiasedness of the mean (that the sampling distribution is centered on the population parameter) or the consistency of the mean (that as the sample size increases, the variance of the sampling distribution shrinks, so the expected error between the estimate and the parameter shrinks with it). Those properties hold for the mean as an estimator whether or not the CLT applies; the CLT is about distributional form, and what it asserts is magical enough without adding in unbiasedness and consistency.
- Why the CLT matters: in the usual parametric hypothesis test we form a test statistic from the sample data and then generate a p-value to assess how unlikely a statistic as extreme as ours would be by chance alone. Our way of knowing what happens by "chance alone" is informed by the CLT, because we presume the sampling distribution of our statistic (the mean in this case) is normally distributed. We know the sampling distribution of the mean is normal in two cases: the population itself is normal (so the sampling distribution is exactly normal at any n), or the population is non-normal but our sample size is large enough that the sampling distribution is approximately normal (what "large enough" means we'll tackle next). So, because we trust what the CLT asserts, we can work out p-values from the assumption that the sampling distribution is normal without actually knowing the distributional form of the population. And since we know our statistic is unbiased (centered at 0 under the null) and we know how to calculate its variance (due in part to the variance sum law), we can locate our sample within the null sampling distribution and find the proportion of samples with statistics more extreme than ours (the usual p-value; the first sketch after this list walks through the calculation). If we didn't have the CLT, or didn't believe it, we would either need to know the distributional form of the population ahead of time (so we could derive the form of the sampling distribution and the p-value), or we would need simulation to generate an estimate of the sampling distribution and an empirical p-value.
- What is a "large enough" sample for the CLT to do its magic: this is a hard question to answer in the abstract since it depends entirely on how non-normal the population is. If there is a minor departure from normality in the population, or a departure but still good symmetry, the CLT draws the sampling distribution toward the normal rather quickly. The consequence of this is that the p-value you generate based on the normal assumption is relatively close to the true p-value you can't know. We care about this because a) we do not want to false alarm more than our stated alpha proportion fo the time, and b) we do not want our power to be less than we assume it is. In some cases, the population is shaped such that our tests become hyper-conservative (obtained p-values > true p-value), or the opposite (obtained p-values < true p-values). Whatever the case, we don't want that. So what is a large enough sample? I can tell you it's not 35 -- there is nothing magical about that number, it was just a convenient cut-off to put in college textbooks. If your population is severely kurtotic (heavy-tailed) or of a class of forms that the CLT does not work on (Cauchy Distribution, you jerk), you are going to need very, very large samples (n> 1000) before your obtained p-values are an acceptable distance from the true p-values.
- Hypothesis testing for whether you have evidence the population is non-normal: I think there is a fundamental issue with these tests (not that they're inaccurate, but that their power profile is the reverse of what we need). As your sample size grows, normality of the population matters less and less for your inferences ("matters less" in the sense that your type I and type II error rates are less affected by the population distribution the larger your sample is, because your obtained p-values more closely approximate the true, unknowable p-value). Consider that tests for non-normality (Kolmogorov-Smirnov, Shapiro-Wilk, etc.) are also hypothesis tests… and rejecting the null for one is tantamount to detecting a departure from normality in the population. Their power behaves the same way as in all hypothesis tests… larger samples, higher power… which means that with a very large sample (exactly the kind where departures from population normality cause the fewest problems) a K-S or S-W test will have very high power to detect even tiny departures from normality. So… in true absurdity, with a very large sample we can detect the tiniest departure from normality, one that wouldn't have caused a problem even with a small sample, and certainly isn't causing us any issue with a sample that large (the third sketch after this list demonstrates this).
- Reporting of exact p-values: The American Statistician piece makes excellent points, and reporting exact p-values has great merits, but I believe it also lulls us into a false sense of security that they really mean what they state. I don't mean in terms of the misconceptions about p-values (of which there are many), but that because they're an "exact" figure, they get treated as if they were known without error, which they are not. An obtained p-value is a sample-based estimate of our sample's location in a sampling distribution we can't possibly know for sure. Reporting a number to 5 decimal places encourages a belief in its precision that I think is unjustified.
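To make the p-value mechanics from the "why the CLT matters" bullet concrete, here's a minimal sketch in Python (the exponential sample and the null value of 1.0 are invented purely for illustration) that locates a sample mean in its presumed-normal sampling distribution and compares the result with scipy's one-sample t-test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Invented example: one sample from a skewed population, testing H0: mu = 1.0
x = rng.exponential(scale=1.0, size=50)
mu0 = 1.0

# Locate the sample mean in its presumed-normal sampling distribution:
# standard error estimated from the sample, then a two-sided tail area
se = x.std(ddof=1) / np.sqrt(len(x))
z = (x.mean() - mu0) / se
p_normal = 2 * stats.norm.sf(abs(z))

# scipy's one-sample t-test does the same thing with a t reference distribution
t_stat, p_t = stats.ttest_1samp(x, mu0)

print(f"z = {z:.3f}, normal-theory p = {p_normal:.4f}")
print(f"t = {t_stat:.3f}, t-test p = {p_t:.4f}")
```

The two p-values land close together here because at n = 50 the t and normal reference distributions nearly coincide.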
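On the "large enough" question, here's a rough simulation sketch (the lognormal population, the sample sizes, and the replication count are arbitrary choices of mine) that estimates how often a nominal alpha = .05 one-sample t-test rejects a true null when the population is strongly skewed:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05
n_sims = 5000

# A strongly skewed population whose mean we know, so the null is actually true
true_mean = np.exp(0.5)  # mean of a lognormal with mu=0, sigma=1

for n in (10, 30, 100, 500):
    rejections = 0
    for _ in range(n_sims):
        sample = rng.lognormal(mean=0.0, sigma=1.0, size=n)
        _, p = stats.ttest_1samp(sample, true_mean)
        if p < alpha:
            rejections += 1
    print(f"n = {n:4d}: empirical type I error rate = {rejections / n_sims:.3f}")
```

The thing to watch is how far the empirical rejection rate sits from the nominal .05 at small n, and how it drifts back toward .05 as n grows.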
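And for the backwards power profile of normality tests, a quick sketch along the same lines (the t distribution with 15 df is just my stand-in for a population with a mild, practically harmless departure from normality):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha = 0.05
n_sims = 500

# Mild departure from normality: t with 15 df (slightly heavier tails than normal)
for n in (50, 500, 5000):
    rejections = 0
    for _ in range(n_sims):
        sample = rng.standard_t(df=15, size=n)
        _, p = stats.shapiro(sample)
        if p < alpha:
            rejections += 1
    print(f"n = {n:4d}: Shapiro-Wilk rejects normality {rejections / n_sims:.0%} of the time")
```

In runs like this you would expect the small samples to rarely flag the departure and the very large samples to flag it routinely, which is exactly the reversed power profile described above.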
So what's an analyst to do? Simulation/permutation tests, for one. We don't have to rest everything on our assumptions and the p-values obtained under them. Simulation and permutation tests, where we resample from our own data to generate an empirical sampling distribution, let us relax certain assumptions, like the presumption of a normally distributed sampling distribution (there's a small sketch just below). They're not a fix for every problem (we still need to honor assumptions like exchangeability/independence/equal variance, etc.), but they're perhaps a step forward.
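Here's a minimal sketch of that idea for a two-group mean comparison, with invented example data; under the null the group labels are treated as exchangeable, so shuffling them builds the empirical sampling distribution of the mean difference:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Invented example data: two small groups drawn from skewed populations
group_a = rng.exponential(scale=1.0, size=20)
group_b = rng.exponential(scale=1.5, size=20)

observed = group_a.mean() - group_b.mean()

# Under the null the group labels are exchangeable, so shuffle them repeatedly
# to build an empirical sampling distribution of the mean difference
pooled = np.concatenate([group_a, group_b])
n_a = len(group_a)
n_perms = 10_000
perm_diffs = np.empty(n_perms)
for i in range(n_perms):
    shuffled = rng.permutation(pooled)
    perm_diffs[i] = shuffled[:n_a].mean() - shuffled[n_a:].mean()

# Empirical two-sided p-value: share of shuffles at least as extreme as observed
p_perm = np.mean(np.abs(perm_diffs) >= abs(observed))

# Normal-theory comparison
_, p_t = stats.ttest_ind(group_a, group_b)

print(f"observed difference = {observed:.3f}")
print(f"permutation p = {p_perm:.4f}, t-test p = {p_t:.4f}")
```

With real data you'd swap in your own two groups; the t-test line is only there so the two p-values can be compared side by side.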
I hope some of this has been helpful! Also, I'm sorry I didn't catch your question sooner; I always love a chance to talk about the Central Limit Theorem, it's truly magical.
@julian