Each statistic should have a graph to go with it – not!
Sep 24, 2013 10:56 AM
When we thought about starting a new software system many years ago, we were very enamored of an article by Frank Anscombe, “Graphs in Statistical Analysis,” published in 1973 in The American Statistician. As you can read in Wikipedia, Anscombe cleverly devised four sets of data that had nearly identical summary statistics (means, variances, correlation and fitted regression line) but were in fact very different. You couldn’t tell the differences from the statistics alone; you had to look at the graphs to see them.
The first was a random scatter of data across a sloped line (the usual).
The second was a perfect fit for a parabola (systematic lack of fit).
The third was a perfect fit for a line except for one outlier (outlier point).
The fourth was a fit in which all the data had the same X except one (influential point).
The lesson was that you ALWAYS have to look at your data, and in particular at graphs of your data. Without the graphs, you would never know that the four data sets presented completely different situations to consider.
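To see Anscombe's point in numbers, here is a minimal Python sketch that computes the summary statistics for the four data sets. The data values are the ones commonly reproduced from Anscombe's paper; the helper name `summarize` is just for illustration, and none of this is JMP code.

```python
from statistics import mean, variance

# Anscombe's four data sets, as commonly reproduced from the 1973 paper.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return means, variances, correlation and least-squares line for (x, y)."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return {
        "mean_x": mx,
        "mean_y": my,
        "var_x": variance(x),
        "var_y": variance(y),
        "r": sxy / (sxx * syy) ** 0.5,
        "slope": slope,
        "intercept": my - slope * mx,
    }


summaries = {name: summarize(x, y) for name, (x, y) in quartet.items()}
for name, stats in summaries.items():
    # Rounded to two decimals, the four rows are nearly indistinguishable.
    print(name, {k: round(v, 2) for k, v in stats.items()})
```

The printed rows agree to about two decimal places, which is exactly why the statistics alone cannot separate the four cases.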
From that day on, we made a rule. For every statistical fit or test we do, there has to be a meaningful graph to go with it.
The graph should be automatic – you shouldn’t have to ask for it.
The graph should reveal patterns or outliers.
The graph should be interactive – especially to find the identity of each point.
And so we have the rule that has lived on for more than 20 years, serving millions of users with graphs as well as statistics.
It wasn’t easy. In several cases, we had to invent new graphs to tell the story of a model test (leverage plot), a means comparison (comparison circles) or recursive partitioning (partition graph). In other cases, the graphs had been invented, but had not yet become popular, such as mosaic plots and PCA biplots. Generally, we have succeeded very well with this rule, and generations of users are better informed for it.
But now we have to reconsider and qualify the rule. Why? Big Data. When you have hundreds, thousands or even millions of things to look at, you can’t look at every one. You have to use something faster.
Suppose that you are responsible for keeping a process in order in a factory, and you have lots of monitoring systems watching many aspects of the process. Open Semiconductor Capability.jmp in the JMP Sample Data for this example. You have data on 128 sensor variables on the process. Each variable has specification limits to guide you on whether to worry about that part of the process. So you do capability analysis in the Distribution platform on all of them. Here are the first 15 analyses, lined up to fit on one large screen. There are eight more pages of output for all the 128 variables.
The display illustrates each capability index, Cpk, in great detail, but there is too much of it. It takes a few minutes to see which parts of the process keep at least three standard deviations between the mean and the nearer specification limit and which do not. Furthermore, many of these plots are not worth looking at if the process is capable. You want to look only at the variables you need to worry about.
You dream that you could see all the capabilities in one graph and instantly locate the problem variables.
How are we going to do that? You could extract all the Cpk values into a table, and then show the distribution of that table, highlighting the bad capabilities under 1. You could sort the Cpks to see the worst ones. But neither approach shows in what way a variable is incapable. Is it off target to the right? Off target to the left? Or on target but with high variance? A single capability index doesn’t tell you this. To find out, you need to take the index apart: one axis to show whether the process is off target, the other to show its variance.
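The "extract, sort and flag" approach can be sketched in a few lines of Python. This is only an illustration: it uses the standard definition Cpk = min((xbar - LSL)/(3s), (USL - xbar)/(3s)), and the sensor names, measurements and spec limits are made up for the example, not taken from the JMP sample data.

```python
from statistics import mean, stdev


def cpk(values, lsl, usl):
    """Cpk = min((xbar - LSL) / (3s), (USL - xbar) / (3s))."""
    xbar = mean(values)
    s = stdev(values)  # sample standard deviation
    return min((xbar - lsl) / (3 * s), (usl - xbar) / (3 * s))


# Hypothetical sensors: (measurements, lower spec limit, upper spec limit).
processes = {
    "sensor_a": ([9.9, 10.0, 10.1, 10.0, 9.95, 10.05], 9.0, 11.0),   # on target, tight
    "sensor_b": ([10.8, 10.9, 11.0, 10.9, 10.85, 10.95], 9.0, 11.0),  # off target (high)
    "sensor_c": ([8.5, 10.0, 11.5, 9.5, 10.5, 10.0], 9.0, 11.0),      # on target, wide
}

# One row per variable, worst capability first.
table = sorted(
    ((name, cpk(values, lsl, usl)) for name, (values, lsl, usl) in processes.items()),
    key=lambda row: row[1],
)

for name, value in table:
    flag = "  <-- below 1" if value < 1 else ""
    print(f"{name}: Cpk = {value:.2f}{flag}")
```

Both `sensor_b` and `sensor_c` are flagged, but the sorted list gives no hint that one is off target and the other is simply too variable.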
So in order to make the means and standard deviations comparable in the same graph, you need to normalize by the specification range, Upper Spec Limit – Lower Spec Limit. So you set up a graph where the X coordinate will be (xbar-target)/(USL-LSL) and the Y coordinate will be s/(USL-LSL), where xbar is the mean estimate, and s is the standard deviation estimate from the data. We have the definition:
Cpk = min( (xbar - LSL)/(3s) , (USL - xbar)/(3s) )
By splitting this into its two parts, one for the lower and one for the upper spec limit, and solving each for Cpk = 1, we have the following diagram:
If a process’s normalized mean and standard deviation land exactly on the left or right black line, its Cpk is exactly 1. If the point is inside the triangle, the Cpk is greater than 1. If it is outside, the Cpk is less than 1. If we overlay a contour showing all the combinations of means and standard deviations for which the probability of being outside the spec limits is .0013 (about one in a thousand), we get the red contour line, which very closely matches the Cpk=1 triangle. So now we also understand the relationship between Cpk and defect probabilities.
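The geometry above can be sketched in Python as follows. This assumes the target sits at the midpoint of the spec limits (the post does not say where the target is) and that the data are normally distributed. The function names are hypothetical helpers for illustration, not JMP functions. Under that assumption, the two Cpk = 1 lines are y = (0.5 + x)/3 and y = (0.5 - x)/3, meeting at the apex (0, 1/6).

```python
from statistics import NormalDist


def goal_point(xbar, s, lsl, usl, target=None):
    """Return goal-plot coordinates ((xbar - target)/(USL - LSL), s/(USL - LSL))."""
    if target is None:
        target = (lsl + usl) / 2  # assumption: target is the midpoint of the limits
    width = usl - lsl
    return (xbar - target) / width, s / width


def cpk_from_goal_point(x, y):
    """Cpk rewritten in goal-plot coordinates (midpoint-target assumption)."""
    return min((0.5 + x) / (3 * y), (0.5 - x) / (3 * y))


def inside_triangle(x, y):
    """True when the point is on or under both Cpk = 1 lines."""
    return y <= (0.5 - abs(x)) / 3


def fraction_out_of_spec(x, y):
    """Normal-theory probability of falling outside the spec limits.

    In normalized units the limits sit at -0.5 and +0.5.
    """
    dist = NormalDist(mu=x, sigma=y)
    return dist.cdf(-0.5) + (1 - dist.cdf(0.5))


# Hypothetical process: mean 10.4, standard deviation 0.2, limits 9 and 11.
x, y = goal_point(10.4, 0.2, 9.0, 11.0)
print(f"goal point: ({x:.3f}, {y:.3f})")
print(f"Cpk: {cpk_from_goal_point(x, y):.3f}")
print(f"inside triangle: {inside_triangle(x, y)}")
print(f"fraction out of spec: {fraction_out_of_spec(x, y):.5f}")
```

For a point on one of the side lines, the nearer limit is exactly three standard deviations away, so the out-of-spec probability is about .00135 when the far tail is negligible. That is why a contour near .0013 hugs the sides of the triangle. At the apex both tails count, and the probability is about .0027.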
Putting all our processes in a goal plot, we see that many processes are incapable. We have process variables like INM2 that are on target but have very high variance. We have process variables like LYA1 that don’t have too high a variance but are off target. And we have process variables like VIA1 that have both problems.
So we have switched from looking at 128 reports, to looking at one graph. We still see important features of each process variable, but now we see them in a bigger context, and we can easily pick out the ones to worry about.
This is the story for capability plots. As we move to looking at process change and testing across many responses, we also need a way to look at many one-way ANOVAs in one plot instead of many plots. Doing many fits and tests -- and seeing them well in one graph -- is the topic of my next blog post.
By the way, the Capability platform has supported the Goal plot for a number of releases now in JMP. Every new release of JMP will have more features for evaluating across many responses or groups.
Note: This is part of a Big Statistics series of blog posts by John Sall. Read all of his Big Statistics posts.