How do I interpret Central Limit Theorem?

AAzad · Jan 4, 2024 10:11 AM

The central limit theorem says that the sampling distribution of the mean will be approximately normal, as long as the sample size is large enough. Regardless of whether the population has a normal, Poisson, binomial, or any other distribution with finite variance, the sampling distribution of the mean approaches a normal distribution as the sample size grows.

For continued process verification (CPV) or other statistical analysis, we assume that n = 30 or more will ensure a normal distribution, and this n (the sample size) is generally taken to be the number of batches. For example, the assay of an API (active pharmaceutical ingredient) in a drug product from 30 different batches is considered normally distributed. According to the above definition, the assay value for each batch would have to be the mean of an adequate number of samples from that batch, not the assay of a composite sample or of a single tablet or capsule. Hence the assumption of a normal distribution is not justified when each batch value comes from one individual assay. In this regard, dissolution seems a better candidate for applying the central limit theorem: for each mean value at a dissolution time point, we use at least 6 units, or more (L2, L3). Any comments will be appreciated.

Mark_Bailey · Jan 4, 2024 8:16 AM

The CLT specifically says that the suitably standardized sum of independent random variables asymptotically approaches a normal distribution. The arithmetic mean is the sum of the variables divided by their count, so the CLT applies to the mean as well.

The assumption you cite is a common 'rule of thumb' about the minimum sample size for asymptotic behavior. The sample size actually required depends on the third moment, skewness. The literature contains articles that state a formula to calculate n, given the skewness. You could also use JMP to simulate the sampling distribution for different n.
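
The simulation idea can also be sketched outside JMP. The following is a minimal Python/NumPy sketch, not JMP script. The exponential population (skewness 2), the 20,000 replications, the chosen values of n, and the function names are all assumptions made for illustration.

```python
import numpy as np

def skewness(x):
    """Sample skewness computed as the mean of cubed z-scores."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

def simulate_mean_skewness(n, reps=20_000, seed=0):
    """Skewness of the simulated sampling distribution of the mean.

    Assumption: the population is exponential with scale 1 (skewness 2).
    """
    rng = np.random.default_rng(seed)
    samples = rng.exponential(scale=1.0, size=(reps, n))
    means = samples.mean(axis=1)  # one sample mean per replication
    return skewness(means)

# Compare the simulated skewness of the mean with the value 2 / sqrt(n)
# expected for an exponential population.
results = {n: simulate_mean_skewness(n) for n in (5, 30, 100)}
for n, s in results.items():
    print(f"n={n:4d}  simulated={s:.3f}  expected={2 / np.sqrt(n):.3f}")
```

With a population this skewed, the sampling distribution of the mean is still visibly asymmetric at n = 30, which is why a fixed cutoff of 30 is only a rule of thumb.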

dlehman1 · Jan 5, 2024 07:04 AM

One addition to Mark_Bailey's comment. The rule of thumb for sample size and the CLT also depends on kurtosis. When kurtosis is high, much larger sample sizes are required before the sampling distribution of the mean approaches a normal distribution. As with most rules of thumb in statistics, they are reliable until they aren't. Simulating data to explore this is often a good way to see how the sample size is related to the sampling distribution.

I want to reiterate a complaint I have about the redesign of the Community (since it doesn't seem to belong anywhere else). To comment now requires that I get a code first. I used to save my login information and could quickly comment. Now, I had to create a new login because my university email quarantines the code for a day, and then go to my other email to get the code in order to comment. For me, at least, this is a more cumbersome process and makes me less likely to engage in the discussion. Also, I really don't see the importance of multi-factor authentication here; it seems like an unnecessary concern. That is part of another ongoing complaint I have about IT departments in general: MFA has become the mantra of IT departments everywhere, regardless of how users feel about it. Frankly, there are many websites where I don't care whether my identity is exposed.

Mark_Bailey · Jan 5, 2024 10:48 AM

I do not mean to argue with your point about kurtosis. My understanding is that the issue arises due to asymmetry in the distribution. Kurtosis alone does not affect symmetry. Thanks for bringing up this point.
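
For reference, a standard result for independent, identically distributed observations with a finite fourth moment relates both shape measures of the sample mean to the sample size. The symbols are introduced here for illustration and are not from the thread: \gamma_1 is the population skewness, \gamma_2 is the population excess kurtosis, and \bar{X}_n is the mean of n observations.

```latex
\[
\operatorname{skew}(\bar{X}_n) = \frac{\gamma_1}{\sqrt{n}},
\qquad
\operatorname{kurt}_{\text{excess}}(\bar{X}_n) = \frac{\gamma_2}{n}
\]
```

Both quantities shrink toward the normal values of zero as n grows. Excess kurtosis decays faster (1/n) than skewness (1/\sqrt{n}), but a symmetric population with very heavy tails can still need a large n.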

hogi · Jan 5, 2024 11:06 AM

It's still single factor: the password is replaced by a verification code (sent via email).
Nevertheless - compared to a password that is saved in the browser settings - it feels like MFA due to the multiple steps: open email app, copy code, paste code.

@Ryan_Gilmore , any chance to specify a secondary (private) email address to get the verification code 2x in parallel?

I hope a Cookie will save the credentials for a loooong long time (> 1 year?)

Same for JMP activation via the My JMP verification code ...

Ryan_Gilmore · Jan 5, 2024 01:27 PM

For Community related issues, you can start a discussion in the Community Discussions board.

My JMP ID is the method being used throughout the JMP ecosystem, of which the Community is one part. This transition is the first step. Based on feedback such as this, we will investigate methods for improving the experience.

AAzad · Jan 5, 2024 09:28 AM

Thanks a lot Mark. Could you please add the literature reference that you mentioned?

Mark_Bailey · Jan 5, 2024 10:45 AM

One specific reference is Sugden, Smith, and Jones (2000) “Cochran’s rule for simple random sampling.” Journal of the Royal Statistical Society: Series B (Statistical Methodology). 62: 787–793.

dlehman1 · Jan 5, 2024 11:05 AM

Here is one paper I found on kurtosis and the CLT: https://www.umass.edu/remp/Papers/Smith&Wells_NERA06.pdf. The most usable reference I know of is in a textbook (Statistics for Business, Robert Stine and Dean Foster, Addison Wesley). In the 2011 edition of their book, they provide sample size conditions referring to skewness and kurtosis which I have found useful:

n > 10 × (skewness)² and n > 10 × |kurtosis|, where the skewness and kurtosis are measured from z scores of the data. I once queried them about the origins of these conditions and the best they could recollect was that they based these on running many simulations. If anybody has a more precise reference, I'd be interested to see it.
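
The two conditions can be checked on a data set along these lines. This is a minimal Python/NumPy sketch under stated assumptions: "kurtosis" is taken to mean excess kurtosis (mean of z⁴ minus 3), z scores use the sample standard deviation, and the function name and simulated data are hypothetical.

```python
import numpy as np

def clt_sample_size_check(data):
    """Check n > 10 * skewness^2 and n > 10 * |kurtosis| for one sample.

    Assumptions: skewness is the mean of z^3, kurtosis is the mean of z^4
    minus 3 (excess kurtosis), and z uses the sample standard deviation.
    """
    x = np.asarray(data, dtype=float)
    n = x.size
    z = (x - x.mean()) / x.std(ddof=1)
    k3 = float(np.mean(z ** 3))
    k4 = float(np.mean(z ** 4) - 3.0)
    n_required = max(10 * k3 ** 2, 10 * abs(k4))
    return {
        "n": n,
        "skewness": k3,
        "kurtosis": k4,
        "n_required": n_required,
        "ok": n > n_required,
    }

# Hypothetical data: 30 roughly symmetric values and 30 right-skewed values.
rng = np.random.default_rng(1)
symmetric = rng.normal(loc=100.0, scale=1.5, size=30)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=30)

for name, data in (("symmetric", symmetric), ("skewed", skewed)):
    print(name, clt_sample_size_check(data))
```

Because the skewness and kurtosis are themselves estimated from the sample, the result is noisy for small n and is best read as a rough screen rather than a formal test.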

statman · Jan 5, 2024 12:55 PM

Pardon my comments and please ignore if you prefer. Perhaps I don't understand your situation, but doesn't your situation (CPV) call for analytic statistical thinking, not enumerative (see Deming, W. Edwards (1975), "On Probability As a Basis For Action," The American Statistician, 29(4), pp. 146-152)? I suppose your concern for normality is because you want to use the appropriate estimates for central tendency and variation? I would think what you want is a sample that appropriately represents the central tendency and variation of the API of the process for the period of time you are assessing (over the variables in the process that change during that time period). How you sample the process can have a huge effect on the conclusions you draw about the process central tendency and variation. I realize I may be in the minority here, but a sample size of 30 seems quite arbitrary. Consider these possible hypothetical situations:

30 samples randomly taken from the process making batches over multiple shifts and lots of raw material
30 samples from 1 batch, 1 shift and 1 lot of raw material
30 samples, 3 samples from 10 batches and 1 shift
30 samples, 1 sample from 30 consecutive batches over 3 shifts
30 samples, 1 from 30 batches, 10 batches from each of 3 lots of raw materials
30 samples, each sample measured 3 times for 10 batches

Each may give completely different estimates of mean and variation. Each will confound or separate different components of variation.
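
The effect of the sampling scheme can be illustrated with a small simulation. This is a minimal Python/NumPy sketch, and every number in it is an assumption: a process mean of 100, a batch-to-batch standard deviation of 2.0, and a within-batch standard deviation of 0.5. Only three of the schemes above are shown, and shifts and raw material lots are not modeled.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical variance components (assumptions, not from the thread).
PROCESS_MEAN = 100.0  # assay, % of label claim
SD_BETWEEN = 2.0      # batch-to-batch standard deviation
SD_WITHIN = 0.5       # unit-to-unit standard deviation within a batch

def sample_batches(n_batches, n_per_batch):
    """Simulate assay values with one row per batch and one column per unit."""
    batch_means = rng.normal(PROCESS_MEAN, SD_BETWEEN, size=n_batches)
    noise = rng.normal(0.0, SD_WITHIN, size=(n_batches, n_per_batch))
    return batch_means[:, None] + noise

# Three ways of collecting 30 observations.
one_batch = sample_batches(1, 30).ravel()       # 30 samples from 1 batch
ten_by_three = sample_batches(10, 3).ravel()    # 3 samples from each of 10 batches
thirty_batches = sample_batches(30, 1).ravel()  # 1 sample from each of 30 batches

for name, x in (
    ("30 from 1 batch", one_batch),
    ("3 from 10 batches", ten_by_three),
    ("1 from 30 batches", thirty_batches),
):
    print(f"{name:18s} mean={x.mean():7.2f}  sd={x.std(ddof=1):5.2f}")
```

In this sketch the single-batch sample reflects only within-batch variation, while one sample per batch mixes within-batch and batch-to-batch variation into a single number, so the same n = 30 answers different questions.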

From the above referenced paper:

“Analysis of variance, t-test, confidence intervals, and other statistical techniques taught in the books, however interesting, are inappropriate because they provide no basis for prediction and because they bury the information contained in the order of production. Most if not all computer packages for analysis of data, as they are called, provide flagrant examples of inefficiency.”

"All models are wrong, some are useful" G.E.P. Box
