Solved: Re: How to transfer a non-normal data set into normal

mujahida · Jan 10, 2016 07:35 AM

Dear JMP fans and experts

In process studies, we often encounter a case where the data set collected for a process capability study turns out not to be normally distributed. Calculating Cpk assumes normally distributed data, so the first question that comes to mind is how to transform the non-normal data into normal data.

Here is what I have in mind in JMP:

1. Analyze > Distribution

2. Under the red triangle of Distribution, select Continuous Fit > Normal

3. Under the red triangle of Fitted Normal, select Save Fitted Quantiles

Then all the data in the selected column are transformed into normal values in a new column.

Is that right?

mclayton200 · Jan 10, 2016 09:45 PM

Most data will test as non-normal if you have enough samples and the metrology is very precise and accurate.

It's like most p-values: easily overwhelmed by lots of data.

And if it's based on SPC chart data, watch out for sampling plans, as automatic gaging has given us lots of autocorrelated data, which shows up as run rule violations and can lead to SPC overcontrol.
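A quick way to check for the autocorrelation mentioned above, outside JMP, is the sample lag-1 autocorrelation. The following is a minimal Python sketch; the function name and the example readings are made up for illustration and do not come from this thread.

```python
from statistics import fmean


def lag1_autocorrelation(values):
    """Sample lag-1 autocorrelation of readings taken in time order.

    Values near 0 are consistent with independent readings; values
    near +1 or -1 suggest consecutive readings are related.
    """
    n = len(values)
    if n < 3:
        raise ValueError("need at least 3 observations")
    mean = fmean(values)
    denom = sum((x - mean) ** 2 for x in values)
    if denom == 0:
        raise ValueError("all values are identical")
    num = sum((values[i] - mean) * (values[i + 1] - mean) for i in range(n - 1))
    return num / denom


# Hypothetical gage readings, in the order they were measured.
drifting = [10.0, 10.1, 10.2, 10.3, 10.4, 10.5, 10.6, 10.7]
alternating = [10.0, 10.4, 10.0, 10.4, 10.0, 10.4, 10.0, 10.4]

print("drifting:", round(lag1_autocorrelation(drifting), 3))
print("alternating:", round(lag1_autocorrelation(alternating), 3))
```

A strongly positive value (slow drift) or strongly negative value (alternation) is a hint that standard run rules may fire for reasons unrelated to real process shifts.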

And remember for capability studies use the RAW data not the SPC Stats data....in my opinion...since spec limits are about individual parts usually. But different industries have very different metrology and "critical" variables that are monitored. Talk to industry-specific experts before you waste time with fitting distributions to get a Cpk or Ppk index. Many industries avoid spec limits and use TARGETS and Cpm measures of Taguchi-style capability. Stay on target. Continuously tighten variation.
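For reference, the indices named above can be written down in a few lines. This is a sketch in Python using the usual textbook formulas under a normality assumption; it uses the overall sample standard deviation of the raw data, which many texts label Ppk (Cpk would use a within-subgroup sigma estimate). The function names, data, and limits are illustrative assumptions.

```python
from math import sqrt
from statistics import fmean, stdev


def ppk(values, lsl, usl):
    """min(USL - mean, mean - LSL) / (3 * s), with s the overall sample SD."""
    mean = fmean(values)
    sigma = stdev(values)
    return min(usl - mean, mean - lsl) / (3 * sigma)


def cpm(values, lsl, usl, target):
    """Taguchi-style index: penalizes distance of the mean from the target."""
    mean = fmean(values)
    sigma = stdev(values)
    return (usl - lsl) / (6 * sqrt(sigma ** 2 + (mean - target) ** 2))


# Hypothetical raw measurements of individual parts and made-up limits.
parts = [9.8, 10.0, 10.2, 10.0, 9.9, 10.1]
print("Ppk:", round(ppk(parts, lsl=9.4, usl=10.6), 3))
print("Cpm on target:", round(cpm(parts, 9.4, 10.6, target=10.0), 3))
print("Cpm off target:", round(cpm(parts, 9.4, 10.6, target=10.1), 3))
```

Cpm drops as the mean moves away from the target even when the spread is unchanged, which is the "stay on target" idea in formula form.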

If you graph the data and study the outliers and possibly remove those you understand, that helps.

Then if the data is highly SKEWED you may need to transform it, but if NOT you may simply CALL it normal and get the index values including the sample size and confidence interval data using JMP.

Then IF the index values are very high...that's enough.

If they are very low...that's enough.

It's only when the index is between 1 and 2 that you may need to "torture" it to pass or fail some reporting limit.

Long-term process capability wanders around on a monthly basis in many factories, and this kind of study is used mostly to decide on major project, tool, or metrology investments (rather than fine tuning, which is usually done with SPC charts and OCAPs), so there is often little need to torture the data by transforming it, but some need to study outliers. And even if the data is highly skewed...the log or other transformations may mislead you. Better to study the outliers to see where they come from and look at the economics of doing something about those outliers before you worry about data transforms. Tortured data will tell you anything you want to hear!

Better perhaps to spend your exploratory data time on Variance Components Studies to infer root cause categories, then run some screening DOEs to validate. I spent decades working with non-normal data very effectively by graphing it many ways and focusing on variance components hints for actions. Multi-Vari plots help illuminate. Sharing the graphs gets lots of feedback.

BUT if you have hundreds of parameters to watch as possible yield problems, then the JMP Capability Platform is useful for a Pareto look, and the Distribution platform for looking at long tails and fitting other distributions AFTER sharing the graphs with operators, engineers, and maintenance folks to get input on known variability issues and the cost of doing anything about them. Some factory process variables have NATURAL distributions which are not NORMAL but are well understood. Particle counts, for example, are rarely normal, and some follow Poisson or Reynolds distributions, but again, you have to study the cost of trimming those tails vs living with them.

One final point.

Capability uses ENG OR CUSTOMER SPEC LIMITS, not statistical limits, and those are OFTEN WRONG, as they may have been set during the R&D period and not updated once volume manufacturing gave information on how these parameters impact yield, reliability, or other costs. It is a huge waste of time and money to report capabilities based on bogus limits.

txnelson · Jan 10, 2016 11:03 AM

My preferred method to convert to a normal distribution is to use Continuous Fit > All.

This provides you with an ordered list of which distributions are the best fit for the data. You can then choose from there which transform is best, such as GLog, and save the transformed data directly, without having to force the data to quantiles.
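To illustrate the idea of ranking candidate distributions, here is a minimal Python sketch that fits only two candidates (normal and lognormal) by maximum likelihood and orders them by AICc. It assumes AICc as the ranking criterion and is not a reproduction of JMP's output; the function names and sample data are made up.

```python
from math import log, pi
from statistics import fmean


def normal_loglik(values):
    """Log-likelihood of a normal fit evaluated at the ML estimates."""
    n = len(values)
    mean = fmean(values)
    var = sum((x - mean) ** 2 for x in values) / n  # ML variance (divide by n)
    return -0.5 * n * (log(2 * pi * var) + 1)


def lognormal_loglik(values):
    """Log-likelihood of a lognormal fit; requires strictly positive data."""
    if min(values) <= 0:
        raise ValueError("lognormal needs strictly positive values")
    logs = [log(x) for x in values]
    # Normal log-likelihood on the log scale plus the Jacobian term.
    return normal_loglik(logs) - sum(logs)


def aicc(loglik, k, n):
    """Small-sample corrected AIC for a model with k parameters."""
    return 2 * k - 2 * loglik + (2 * k * (k + 1)) / (n - k - 1)


def rank_fits(values):
    """Return (name, AICc) pairs, best (lowest AICc) first."""
    n = len(values)
    scores = {
        "normal": aicc(normal_loglik(values), 2, n),
        "lognormal": aicc(lognormal_loglik(values), 2, n),
    }
    return sorted(scores.items(), key=lambda item: item[1])


# Hypothetical right-skewed measurements.
skewed = [1.0, 1.2, 1.5, 1.1, 1.3, 2.0, 1.4, 6.0, 1.6, 3.5, 1.25, 9.0]
for name, score in rank_fits(skewed):
    print(name, round(score, 2))
```

A lower AICc only says one candidate fits better than another; it does not say the best candidate fits well, so the graphs still need a look.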

Jim

David_Burnham · Jan 11, 2016 10:35 AM

... JMP also allows you to calculate process capability statistics for non-normal distributions without having to perform an explicit transformation.
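One common way to get a capability index without transforming the data is the percentile method: replace the mean with the fitted median and the 3-sigma distances with the distances to the fitted 0.135% and 99.865% quantiles. The Python sketch below does this for a lognormal fit. It is an illustration of that general method under stated assumptions, not a claim about what JMP computes internally; the function name, data, and limits are hypothetical.

```python
from math import exp, log
from statistics import NormalDist, fmean, stdev


def lognormal_percentile_ppk(values, lsl, usl):
    """Percentile-method Ppk using quantiles of a fitted lognormal."""
    if min(values) <= 0:
        raise ValueError("lognormal needs strictly positive values")
    logs = [log(x) for x in values]
    fitted = NormalDist(fmean(logs), stdev(logs))  # fit on the log scale
    # Map the fitted quantiles back to the original measurement scale.
    q_low = exp(fitted.inv_cdf(0.00135))
    q_med = exp(fitted.inv_cdf(0.5))
    q_high = exp(fitted.inv_cdf(0.99865))
    upper = (usl - q_med) / (q_high - q_med)
    lower = (q_med - lsl) / (q_med - q_low)
    return min(upper, lower)


# Hypothetical skewed measurements with made-up spec limits.
readings = [1.0, 1.2, 1.5, 1.1, 1.3, 2.0, 1.4, 6.0, 1.6, 3.5, 1.25, 9.0]
print(round(lognormal_percentile_ppk(readings, lsl=0.2, usl=30.0), 3))
```

For normally distributed data the same formula reduces to the usual index, because the 0.135% and 99.865% quantiles sit about 3 standard deviations from the median.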

-Dave

louv · Jan 11, 2016 11:37 AM

This capability is found in the Distribution platform (where you can also examine the underlying distribution), under Capability Analysis.