Hi @34South,
I wouldn't say this is a "naive question" whatsoever! It actually gets fairly deep into the theoretical underpinnings and assumptions of our usual parametric testing. Let me peel apart each of your points to make some comments, and then hopefully end with some practical suggestions:
- What the central limit theorem asserts: as you stated, the CLT asserts that the sampling distribution of a statistic (let's assume the mean going forward) approaches a normal distribution as the sample size (n) approaches infinity. The CLT does not assert (it doesn't need to assert) the unbiasedness of the mean (that the sampling distribution is centered on the population parameter) or the consistency of the mean (that as the sample size increases, the variance of the sampling distribution shrinks, so the expected error between the estimate and the parameter shrinks with it). Those properties hold for the mean as an estimator whether or not the CLT applies; the CLT is about distributional form, and what it asserts is magical enough without adding in unbiasedness and consistency.
- Why the CLT matters: in the usual parametric hypothesis test we form a test statistic from the sample data and then generate a p-value to assess how unlikely a statistic as extreme as ours would be by chance alone. Our way of knowing what happens by "chance alone" is informed by the CLT, because we presume the sampling distribution of our statistic (the mean in this case) is normally distributed. We know the sampling distribution of the mean is normal in two cases: the population itself is normal (so the sampling distribution is exactly normal at any n), or the population is non-normal but our sample size is large enough that the sampling distribution is approximately normal (what "large enough" means we'll tackle next). So, because we trust what the CLT asserts, we can work out p-values from the assumption that the sampling distribution is normal without actually knowing the distributional form of the population. And since we know our statistic is unbiased (centered at 0 under the null) and we know how to calculate its variance (due in part to the variance sum law), we can locate our sample within the null sampling distribution and find the proportion of samples with statistics more extreme than ours (the usual p-value; the first sketch after this list walks through the calculation). If we didn't have the CLT, or didn't believe it, we would either need to know the distributional form of the population ahead of time (so we could derive the form of the sampling distribution and the p-value), or we would need simulation to generate an estimate of the sampling distribution and an empirical p-value.
- What is a "large enough" sample for the CLT to do its magic: this is a hard question to answer in the abstract since it depends entirely on how non-normal the population is. If there is a minor departure from normality in the population, or a departure but still good symmetry, the CLT draws the sampling distribution toward the normal rather quickly. The consequence of this is that the p-value you generate based on the normal assumption is relatively close to the true p-value you can't know. We care about this because a) we do not want to false alarm more than our stated alpha proportion fo the time, and b) we do not want our power to be less than we assume it is. In some cases, the population is shaped such that our tests become hyper-conservative (obtained p-values > true p-value), or the opposite (obtained p-values < true p-values). Whatever the case, we don't want that. So what is a large enough sample? I can tell you it's not 35 -- there is nothing magical about that number, it was just a convenient cut-off to put in college textbooks. If your population is severely kurtotic (heavy-tailed) or of a class of forms that the CLT does not work on (Cauchy Distribution, you jerk), you are going to need very, very large samples (n> 1000) before your obtained p-values are an acceptable distance from the true p-values.
- Hypothesis testing for whether you have evidence the population is non-normal: I think there is a fundamental issue with these tests (not that they're inaccurate, but that their power profile is the reverse of what we need). As your sample size grows, normality of the population matters less and less for your inferences ("matters less" in the sense that your type I and type II error rates are less affected by the population distribution the larger your sample is, because your obtained p-values more closely approximate the true, unknowable p-value). Consider that tests for non-normality (Kolmogorov-Smirnov, Shapiro-Wilk, etc.) are also hypothesis tests… and rejecting the null for one is tantamount to detecting a departure from normality in the population. Their power behaves the same way as in all hypothesis tests… larger samples, higher power… which means that with a very large sample (exactly the kind where departures from population normality cause the fewest problems) a K-S or S-W test will have very high power to detect even tiny departures from normality. So… in true absurdity, with a very large sample we can detect the tiniest departure from normality, one that wouldn't have caused a problem even with a small sample, and certainly isn't causing us any issue with a sample that large (the third sketch after this list demonstrates this).
- Reporting of exact p-values: The American Statistician piece makes excellent points, and reporting exact p-values has great merits, but I believe it also lulls us into a false sense of security that they really mean what they state. I don't mean in terms of the misconceptions about p-values (of which there are many), but that because they're an "exact" figure, they get treated as if they were known without error, which they are not. An obtained p-value is a sample-based estimate of our sample's location in a sampling distribution we can't possibly know for sure. Reporting a number to 5 decimal places encourages a belief in its precision that I think is unjustified.
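To make the p-value mechanics from the "why the CLT matters" bullet concrete, here's a minimal sketch in Python (the exponential sample and the null value of 1.0 are invented purely for illustration) that locates a sample mean in its presumed-normal sampling distribution and compares the result with scipy's one-sample t-test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Invented example: one sample from a skewed population, testing H0: mu = 1.0
x = rng.exponential(scale=1.0, size=50)
mu0 = 1.0

# Locate the sample mean in its presumed-normal sampling distribution:
# standard error estimated from the sample, then a two-sided tail area
se = x.std(ddof=1) / np.sqrt(len(x))
z = (x.mean() - mu0) / se
p_normal = 2 * stats.norm.sf(abs(z))

# scipy's one-sample t-test does the same thing with a t reference distribution
t_stat, p_t = stats.ttest_1samp(x, mu0)

print(f"z = {z:.3f}, normal-theory p = {p_normal:.4f}")
print(f"t = {t_stat:.3f}, t-test p = {p_t:.4f}")
```

The two p-values land close together here because at n = 50 the t and normal reference distributions nearly coincide.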
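On the "large enough" question, here's a rough simulation sketch (the lognormal population, the sample sizes, and the replication count are arbitrary choices of mine) that estimates how often a nominal alpha = .05 one-sample t-test rejects a true null when the population is strongly skewed:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05
n_sims = 5000

# A strongly skewed population whose mean we know, so the null is actually true
true_mean = np.exp(0.5)  # mean of a lognormal with mu=0, sigma=1

for n in (10, 30, 100, 500):
    rejections = 0
    for _ in range(n_sims):
        sample = rng.lognormal(mean=0.0, sigma=1.0, size=n)
        _, p = stats.ttest_1samp(sample, true_mean)
        if p < alpha:
            rejections += 1
    print(f"n = {n:4d}: empirical type I error rate = {rejections / n_sims:.3f}")
```

The thing to watch is how far the empirical rejection rate sits from the nominal .05 at small n, and how it drifts back toward .05 as n grows.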
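And for the backwards power profile of normality tests, a quick sketch along the same lines (the t distribution with 15 df is just my stand-in for a population with a mild, practically harmless departure from normality):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha = 0.05
n_sims = 500

# Mild departure from normality: t with 15 df (slightly heavier tails than normal)
for n in (50, 500, 5000):
    rejections = 0
    for _ in range(n_sims):
        sample = rng.standard_t(df=15, size=n)
        _, p = stats.shapiro(sample)
        if p < alpha:
            rejections += 1
    print(f"n = {n:4d}: Shapiro-Wilk rejects normality {rejections / n_sims:.0%} of the time")
```

In runs like this you would expect the small samples to rarely flag the departure and the very large samples to flag it routinely, which is exactly the reversed power profile described above.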
So what's an analyst to do? Simulation/permutation tests, for one. We don't have to rest everything on our assumptions and the p-values obtained under them. Simulation and permutation tests, where we resample from our own data to generate an empirical sampling distribution, let us relax certain assumptions, like the presumption of a normally distributed sampling distribution (there's a small sketch just below). They're not a fix for every problem (we still need to honor assumptions like exchangeability/independence/equal variance, etc.), but they're perhaps a step forward.
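Here's a minimal sketch of that idea for a two-group mean comparison, with invented example data; under the null the group labels are treated as exchangeable, so shuffling them builds the empirical sampling distribution of the mean difference:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Invented example data: two small groups drawn from skewed populations
group_a = rng.exponential(scale=1.0, size=20)
group_b = rng.exponential(scale=1.5, size=20)

observed = group_a.mean() - group_b.mean()

# Under the null the group labels are exchangeable, so shuffle them repeatedly
# to build an empirical sampling distribution of the mean difference
pooled = np.concatenate([group_a, group_b])
n_a = len(group_a)
n_perms = 10_000
perm_diffs = np.empty(n_perms)
for i in range(n_perms):
    shuffled = rng.permutation(pooled)
    perm_diffs[i] = shuffled[:n_a].mean() - shuffled[n_a:].mean()

# Empirical two-sided p-value: share of shuffles at least as extreme as observed
p_perm = np.mean(np.abs(perm_diffs) >= abs(observed))

# Normal-theory comparison
_, p_t = stats.ttest_ind(group_a, group_b)

print(f"observed difference = {observed:.3f}")
print(f"permutation p = {p_perm:.4f}, t-test p = {p_t:.4f}")
```

With real data you'd swap in your own two groups; the t-test line is only there so the two p-values can be compared side by side.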
I hope some of this has been helpful! Also, I'm sorry I didn't catch your question sooner; I always love a chance to talk about the Central Limit Theorem, it's truly magical.
@julian