Testing for Normality
I wanted to obtain the Shapiro-Wilk test for a distribution, but I cannot get it. I tried the method posted in the discussion about legacy fitters, but that only gives me the KSL test. If I do not use the legacy fitters, I get the Anderson-Darling test instead. My sample size is extremely large (>30k); I am not sure if that has anything to do with it. Thank you for your help.
Accepted Solutions
Re: Testing for Normality
For the legacy distributions, the Shapiro-Wilk test for normality is reported when the sample size is less than or equal to 2000. The KSL test is computed for samples larger than 2000. When the sample size is large, the Shapiro-Wilk test has high power, so even a small difference between your distribution and the normal distribution assumed under the null hypothesis is detected and leads to a rejection of the null hypothesis.
Re: Testing for Normality
Thank you for your message. So between KSL and Anderson-Darling, which test is a good substitute for Shapiro-Wilk?
Re: Testing for Normality
Anderson-Darling is a more powerful test and is the one we suggest be used.
Re: Testing for Normality
I am confused about how to interpret it. All my Anderson-Darling p-values for ~50 variables are significant, but the distributions of many of them look normal.
Re: Testing for Normality
Your large sample (N > 30,000) leads to the detection of non-normal features (departures from the ideal normal CDF) that might be statistically significant but practically unimportant.
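To make this concrete, here is a rough Python sketch (not JMP, and not the KSL or Anderson-Darling tests that JMP reports). It assumes a plain one-sample Kolmogorov-Smirnov test against a fully specified N(0, 1) with the asymptotic p-value, and an invented, mildly skewed distribution (`slightly_skewed`, with a hypothetical skew parameter `c`) whose CDF differs from the normal CDF by only about 0.02. All names are illustrative only.

```python
import math
import random


def normal_cdf(x):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def ks_statistic(sample):
    """Largest gap between the empirical CDF and the N(0, 1) CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = normal_cdf(x)
        # Check the gap just after and just before each jump of the ECDF.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def ks_p_value(d, n, terms=100):
    """Asymptotic Kolmogorov p-value for statistic d and sample size n."""
    lam = math.sqrt(n) * d
    total = sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        for k in range(1, terms + 1)
    )
    return min(1.0, max(0.0, 2.0 * total))


def slightly_skewed(n, rng, c=0.05):
    """Assumed toy data: a standard normal draw plus a small skew term."""
    out = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        out.append(z + c * (z * z - 1.0))
    return out


rng = random.Random(1)
results = {}
for n in (200, 30000):
    d = ks_statistic(slightly_skewed(n, rng))
    p = ks_p_value(d, n)
    # Approximate 5% critical value of D; it shrinks like 1 / sqrt(n).
    critical_d = 1.358 / math.sqrt(n)
    results[n] = (d, p, critical_d)
    print(f"n={n:6d}  D={d:.4f}  critical D={critical_d:.4f}  p={p:.3g}")
```

The departure from normality is the same in both runs; only the threshold for calling it significant changes, because the critical value of D shrinks in proportion to 1/sqrt(n). At N = 30,000 a gap of roughly 0.02 between the two CDFs is far beyond that threshold, even though a histogram of the data would look close to normal.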
Re: Testing for Normality
Thank you @Mark_Bailey for this comment. This is a very important point for all statistical significance testing. Unfortunately, too many people are fixated on p-values without looking at visuals or the size of the coefficients. In some research fields this is partially avoided by using the minimum sample size needed to detect a practically meaningful difference.
Yet it is very difficult to explain to some people, including reviewers and researchers, that it is very easy to get statistical significance that is meaningless.
Re: Testing for Normality
Most of the statistics we are taught were developed 80-100 years ago, when small samples were the rule. So the issue was, "What can I convincingly learn from or decide with a small sample?" Inferential techniques started then. Such situations still exist (e.g., no data available, need to experiment...) but a new situation arose 25 years ago. We now have massive databases from which to learn. Inference is pointless. Everything is statistically significant. That is, even the smallest differences or smallest parameter estimates are statistically significant. So, predictive modeling does not try to infer significance. Instead, the focus is on predictive performance through feature / model selection. It is a different challenge and requires a different mindset.
Also, in this specific case, why is a normal distribution assumed? What is supposed to be normally distributed? Data? An estimate? And what are the consequences of non-normality? How much of a departure from normal is necessary to compromise the result or decision? How robust is the method?
Re: Testing for Normality
@tonya_mauldin explained this very well in a JMP Blog post when the changes were first made.