Each statistic should have a graph to go with it – not!

When we thought about starting a new software system many years ago, we were very enamored of an article published in 1973 in The American Statistician by Frank Anscombe, called “Graphs in Statistical Analysis.” As you can read in Wikipedia, Anscombe cleverly devised four sets of data that had identical statistics but were in fact very different. You couldn’t tell the differences from the statistics alone; you had to look at the graph to see them.

1. The first was a random scatter of data about a sloped line (the usual case).
2. The second was a perfect fit to a parabola (systematic lack of fit).
3. The third was a perfect fit to a line except for one outlier (outlier point).
4. The fourth had all of the data at the same X except for one point (influential point).

The lesson was that you ALWAYS have to look at your data, to look at graphs of your data. If you hadn’t looked at the graphs, you wouldn’t have known that the four data sets presented completely different situations.

From that day on, we made a rule: For every statistical fit or test we do, there has to be a meaningful graph to go with it.

• The graph should be automatic; you shouldn’t have to ask for it.
• The graph should reveal patterns or outliers.
• The graph should be interactive, especially for finding the identity of each point.

And so we have a rule that has lived on for more than 20 years, serving millions of users with graphs as well as statistics.

It wasn’t easy. In several cases, we had to invent new graphs to tell the story of model tests (leverage plots), means comparisons (comparison circles) or recursive partitioning (partition graphs). In other cases, the graphs had been invented but had not yet become popular, such as mosaic plots and PCA biplots. Generally, we have succeeded very well with this rule, and generations of users are better informed for it.

But now we have to reconsider and qualify the rule. Why? Big Data. When you have hundreds, thousands or even millions of things to look at, you can’t look at every one. You have to use something faster.

Suppose that you are responsible for keeping a process in order in a factory, and you have many monitoring systems watching many aspects of the process. For this example, open Semiconductor in the JMP Sample Data. You have data on 128 sensor variables in the process. Each variable has specification limits to guide you on whether to worry about that part of the process. So you do capability analysis in the Distribution platform on all of them. Here are the first 15 analyses, lined up to fit on one large screen. There are eight more pages of output for all 128 variables.

The display illustrates each capability index, Cpk, in great detail, but in too much detail. It takes a few minutes to see which parts of the process are nicely within three standard deviations of the specification limits and which are not. Furthermore, many of these plots are not worth looking at if the process is capable. You want to look at only the variables you need to worry about.

You dream that you could see all the capabilities in one graph and instantly locate the problem variables.

How are we going to do that? You could extract all the Cpk values into a table and then show the distribution of that table, highlighting the bad capabilities under 1. You could sort the Cpks to see the worst ones. But none of these approaches shows in what way a variable is incapable. Is it off target to the right? Off target to the left? Or on target but with high variance? One capability index doesn’t tell you this. You need to take the index apart to do that: one axis to show whether the process is off target, the other to show its variance.
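As a sketch of that first idea, here is how one might compute and sort Cpk values outside JMP. The variable names, spec limits, and simulated measurements below are hypothetical stand-ins, not the Semiconductor sample data.

```python
# Hypothetical sketch: compute Cpk for several process variables and
# sort to find the least capable ones. All names and values are made up.
import random
import statistics

random.seed(1)

# (LSL, USL) spec limits per hypothetical sensor variable
specs = {"P1": (0.0, 10.0), "P2": (0.0, 10.0), "P3": (0.0, 10.0)}

# Simulated measurements: P1 on target, P2 off target, P3 high variance
data = {
    "P1": [random.gauss(5.0, 1.0) for _ in range(200)],
    "P2": [random.gauss(8.5, 1.0) for _ in range(200)],
    "P3": [random.gauss(5.0, 3.0) for _ in range(200)],
}

def cpk(values, lsl, usl):
    xbar = statistics.fmean(values)
    s = statistics.stdev(values)
    return min((xbar - lsl) / (3 * s), (usl - xbar) / (3 * s))

cpks = {name: cpk(vals, *specs[name]) for name, vals in data.items()}

# Worst (least capable) variables first
for name, value in sorted(cpks.items(), key=lambda kv: kv[1]):
    print(f"{name}: Cpk = {value:.2f}")
```

Sorting surfaces the problem variables quickly, but as the paragraph above notes, a single Cpk number cannot say whether the problem is an off-target mean or a high variance.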

So, to make the means and standard deviations comparable in the same graph, you need to normalize by the specification range, USL − LSL (Upper Spec Limit minus Lower Spec Limit). You set up a graph where the X coordinate is (xbar − target)/(USL − LSL) and the Y coordinate is s/(USL − LSL), where xbar is the mean estimate and s is the standard deviation estimate from the data. We have the definition:

Cpk = min( (xbar − LSL)/(3s) , (USL − xbar)/(3s) )

By taking this definition apart into its two parts for the lower and upper spec limits, and solving for Cpk = 1, we have the following diagram:

If a process’s normalized mean and standard deviation land right on the left or right black line, it solves to a Cpk of exactly 1. If the point is inside the triangle, the Cpk is greater than 1; if it is outside, the Cpk is less than 1. If we inscribe a contour showing all the combinations of means and standard deviations for which the probability of being outside the spec limits is 0.0013 (about one in a thousand), we get the red contour line, which very closely matches the Cpk = 1 triangle. So now we also understand the relationship between Cpk and defect probabilities.
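The algebra behind the triangle can be checked directly. In the normalized coordinates, with the target at the midpoint of the spec range, the Cpk definition collapses to (0.5 − |x|)/(3y). This sketch (my construction, not JMP code) verifies that points on the triangle's slanted edges solve to Cpk exactly 1, points inside give more, and points outside give less.

```python
# Cpk in goal-plot coordinates x = (xbar - target)/(USL - LSL),
# y = s/(USL - LSL), assuming target = (USL + LSL)/2. Then
#   xbar - LSL = (x + 0.5)(USL - LSL),  USL - xbar = (0.5 - x)(USL - LSL),
# so Cpk = min(x + 0.5, 0.5 - x) / (3 y) = (0.5 - |x|) / (3 y).
def cpk_from_goal_coords(x, y):
    return (0.5 - abs(x)) / (3 * y)

# Any point on a slanted edge of the triangle, y = (0.5 - |x|)/3,
# solves to Cpk exactly 1:
for x in (-0.3, 0.0, 0.2, 0.4):
    y = (0.5 - abs(x)) / 3
    print(f"x = {x:+.1f}: Cpk = {cpk_from_goal_coords(x, y):.3f}")

# Inside the triangle (smaller variance) Cpk > 1; outside, Cpk < 1:
print(cpk_from_goal_coords(0.0, 0.10) > 1)  # prints True (inside)
print(cpk_from_goal_coords(0.0, 0.25) < 1)  # prints True (outside)
```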

Putting all our processes in a goal plot, we see that many of them are incapable. We have process variables like INM2 that are on target but have very high variance. We have process variables like LYA1 that don’t have too high a variance but are off target. And we have process variables like VIA1 that have both problems.

So we have switched from looking at 128 reports to looking at one graph. We still see the important features of each process variable, but now we see them in a bigger context, and we can easily pick out the ones to worry about.

This is the story for capability plots. As we move on to looking at process change and testing across many responses, we also need a way to look at many one-way ANOVAs in one plot instead of many plots. Doing many fits and tests, and seeing them well in one graph, is the topic of my next blog post.

By the way, the Capability platform has supported the Goal Plot for a number of JMP releases now. Every new release of JMP will have more features for evaluating across many responses or groups.

Note: This is part of a Big Statistics series of blog posts by John Sall. Read all of his Big Statistics posts.


David Muller wrote:

Thanks for this valuable post.

Where can we download the "Semiconductor" file?

Arati Mejdal wrote:

The file is in the sample data that comes with JMP.


Mike Clayton wrote:

Thanks for all that effort over the past decade.

We have been using the newer capability graphs to compare two suppliers of the same IC, which has over 400 tests (and the raw data is available for dozens of lots from each supplier). The trick, of course, is to deal with the far outliers and very non-normal data in a meaningful way, and still rank and rate the suppliers. Luckily, your support team has great semiconductor specialists experienced with such issues. A major issue has been with the scrap limits for each test, versus the control limits, which tell a very different story.

So using the new SPC charts with Phase to represent different suppliers is one way to augment the capability comparisons graphically. That gives supply chain managers a better understanding of stability differences vs. capability differences.


Dave Garbutt wrote:

I had not seen this capability plot before, but I think it is great and would be perfect for assessing lab data or vital-sign abnormalities as seen in clinical data (for which we also have normal ranges).

Now we need an animated version to plot over time and not drop the subject-level variation.


Rick Wicklin wrote:

Your Goal Plot is reminiscent of the funnel plot, which is often used in health statistics. The idea behind both is to order the points according to uncertainty. You can then overlay confidence limits if you know the distribution of the errors. See my articles on funnel plots for proportions and funnel plots for normally distributed errors.