
Predictive Modeling using Normal Distribution


Occasional Contributor


Jul 5, 2017

Quick question. I've created a model that predicts lightweight blow-off and overweight product using Excel's normal distribution function. The purpose of this model is to assign cost to the various categories of product in the distribution (i.e. lightweight, under target, on target, above target, heavy blow-off). As the name implies, the model uses the normal distribution function, which of course assumes that the data set comes from a normally distributed process. My question has to do with the results that JMP gives me when I evaluate the data to see if it is "Normal."


When I take samples from the line, form subgroups, and aggregate the data, JMP supports the conclusion that the process is "Normal." However, when I look at the data from our X-ray system, which scans every Xi and so gives me access to the whole population, JMP would have me reject the normal distribution.


Is this something that happens often? I suspect the issue has to do with the size of the data files. When sampling the line I usually pull 25 subgroups of 10, whereas any particular population would be made up of 600K to 1.5 million units. In your opinion, aside from knowing my mean will shift over time and all of that, is it safe to use my model going forward, or should I reconsider?


Community Trekker


Jun 5, 2014

Hi, mmeewes!

The answer to your question is not so quick or straightforward.


Titans like Gauss, Pearson, Bessel, Huber, Hampel and Tukey have all noted that real-world data are hardly ever strictly Normal...they seem to most often resemble a Student's t distribution with between 2 and 10 degrees of freedom.  That means that real-world distributions are leptokurtic, where the tails are heavier than Normal.


Depending on how you aggregate your data, the Central Limit Theorem may be throwing you the Normality bone.
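To make the Central Limit Theorem point concrete, here is a minimal Python sketch. It is only an illustration under assumptions of my own: the individual values are simulated from a Student's t distribution with 5 degrees of freedom (not your data), the subgroup size of 10 matches what you described, and excess kurtosis stands in for a formal normality test. The helper names student_t and excess_kurtosis are hypothetical.

```python
import random
from statistics import fmean

def student_t(rng, df):
    """Draw one Student's t variate as Z / sqrt(chi2 / df)."""
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(df / 2.0, 2.0)  # chi-square with df degrees of freedom
    return z / (chi2 / df) ** 0.5

def excess_kurtosis(xs):
    """Sample excess kurtosis; roughly 0 for Normal data, positive for heavy tails."""
    m = fmean(xs)
    n = len(xs)
    m2 = sum((x - m) ** 2 for x in xs) / n
    m4 = sum((x - m) ** 4 for x in xs) / n
    return m4 / m2 ** 2 - 3.0

rng = random.Random(1)  # fixed seed so the run is repeatable

# Assumption: individual units follow a heavy-tailed t distribution with 5 df.
raw = [student_t(rng, 5) for _ in range(100_000)]

# Aggregate into subgroups of 10 and keep only the subgroup means.
means = [fmean(raw[i:i + 10]) for i in range(0, len(raw), 10)]

raw_kurt = excess_kurtosis(raw)
means_kurt = excess_kurtosis(means)
print(f"excess kurtosis, individual values: {raw_kurt:.2f}")
print(f"excess kurtosis, subgroup means:    {means_kurt:.2f}")
```

The subgroup means should come out much closer to Normal than the individual values, even though both come from the same heavy-tailed process. That is one way aggregated samples can pass a normality check that the full population fails.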


Since assigning costs to various categories of products would seem to involve assessing probabilities of observations in the tails of your distribution, I think it would be safer for you to discard the model based on the Normal distribution, and perhaps explore an empirical approach.
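As a rough sketch of what an empirical approach could look like, the Python below compares the fraction of units under a lower limit as predicted by a fitted Normal model with the fraction simply counted from the data. Everything numeric here is an assumption for illustration: the simulated weights (target 100, scale 2, t distribution with 5 df), the limit placed 4 standard deviations below the mean, and the helper name student_t. With your X-ray data you would replace the simulated list with the measured weights and use your real category limits.

```python
import random
from statistics import NormalDist, fmean

def student_t(rng, df):
    """Draw one Student's t variate as Z / sqrt(chi2 / df)."""
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(df / 2.0, 2.0)  # chi-square with df degrees of freedom
    return z / (chi2 / df) ** 0.5

rng = random.Random(7)  # fixed seed so the run is repeatable

# Hypothetical stand-in for the full-population weights from the X-ray system.
TARGET, SCALE = 100.0, 2.0
weights = [TARGET + SCALE * student_t(rng, 5) for _ in range(200_000)]

# Fit the Normal model the usual way: sample mean and standard deviation.
mu = fmean(weights)
sigma = (sum((w - mu) ** 2 for w in weights) / len(weights)) ** 0.5

# Hypothetical lightweight limit, 4 standard deviations below the mean.
lower_limit = mu - 4.0 * sigma

# Tail fraction according to the Normal model versus a direct count.
normal_frac = NormalDist(mu, sigma).cdf(lower_limit)
empirical_frac = sum(w < lower_limit for w in weights) / len(weights)

print(f"Normal model fraction below limit: {normal_frac:.6f}")
print(f"Empirical fraction below limit:    {empirical_frac:.6f}")
```

With heavy-tailed data like this, the counted fraction in the far tail is typically many times what the Normal model predicts, and the blow-off categories are exactly where that gap would distort the cost assignment.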


Good luck.