The Sampling Tree: A Strategic Sampling and Analysis Tool (2020-US-30MP-542)
Dave Sartori, Sr. Data Scientist, PPG
A sampling tree is a simple graphical depiction of the data in a prospective sampling plan or one for which data has already been collected. In variation studies such as gage, general measurement system evaluations, or components of variance studies, the sampling tree can be a great tool for facilitating strategic thinking on: What sources of process variance can or should be included? How many levels within each factor or source of variation should be included? How many measurements should be taken for each combination of factors and settings? Strategically considering these questions before collecting any data helps define the limitations of the study, what can be learned from it, and what the overall effort to execute it will be. What’s more, there is an intimate link between the structure of the sampling plan and the associated variance component model. By way of examples, this talk will illustrate how inspection of the sampling tree facilitates selecting the correct variance component model in JMP’s variability chart platform: Crossed, Nested, Nested then Crossed or Crossed then Nested. In addition, the application will be extended to the interpretation of variance structures in control charts and split-plot experiments.
Speaker | Transcript |
Dave | Hi, everybody. Thanks for joining me here today. I'd like to share with you a topic that has been part of our Six Sigma Black Belt program since 1997. |
So I think this is one of the tools that people really enjoy, and I think you'll enjoy it, too, and find it informative in terms of how it interfaces with | |
some of the tools available in JMP. | |
The first quick slide or two in terms of a message from our sponsor. I'm with PPG Industries outside of Pittsburgh, Pennsylvania, in our Monroeville Business and Technical Center. | |
I've been a data scientist there on and off for over 30 years, moved in and out of technical management. And now, back to what I truly enjoy, which is working with data and JMP in particular. | |
So PPG has been around for a while, was founded in 1883. Last year we ranked 180th on the Fortune 500. | |
And we make mostly paints; although people think that PPG stands for Pittsburgh Plate Glass, that hasn't been the case since about 1968. It's just PPG now, and it's primarily a coatings company. | |
Performance coatings and industrial coatings for | |
cars, airplanes and, of course, houses. You may have bought PPG paint, or one of PPG's brands, to use on your home. | |
But it's also used inside of packaging, so | |
if you don't have a coating inside of a beer can, the beer gets skunky quite quickly. My particular business is specialty coatings and materials. | |
So in my segment we make OLED phosphors for Universal Display Corporation, which you find in Samsung phones, and also the photochromic dyes that go into | |
Transitions lenses, which turn dark when you head outside. | |
So what I'm going to talk to you about today is this tool called the sampling tree. What it is, really, is just a simple graphical depiction | |
of the data that you're either planning to collect or maybe that you've already collected. | |
And so in variation studies like Gage R&R, general measurement system evaluations, or components of variance studies (or COV, as we sometimes call them), | |
the sampling tree is a great tool for thinking strategically about a number of things. So, for example, what sources of variance can or should be included in this study? | |
How many levels within each factor or source of variation can you include? And how many measurements to take for each combination of factors and settings? So you're kind of getting to a | |
sample size question here. So strategically considering these questions before you collect any data also helps you define the limitations of the study, what you can learn from it, | |
and what the overall effort is going to be to execute. So we put this in a classification of tools that we teach in our Six Sigma program, | |
what we call critical thinking tools, because it helps you think up front. And it is a nice sort of exercise that you can work on paper or the whiteboard to think prospectively about the data you might collect. | |
It's also really useful for understanding the structure of factorial designs, | |
especially when you have restrictions on randomization. So I'll give you one sort of conceptual example, towards the end here, where | |
you can describe on a sampling tree, a line of restricted randomization. | |
And so that tells you where the whole plot factors are and where the split plot factors are. So it can provide you, again up front, with a better understanding of the data that you're planning to collect. They're also useful | |
where, and I'll share another conceptual example, we've combined a factorial design with a components of variance study. This is really cool because it accelerates the learning about the system under study. So we're simultaneously trying to manipulate factors that we think | |
impact the level of the response, and at the same time understanding the components of variation which we think contribute variation to the response. | |
So once the data is acquired, the sampling tree can really help you facilitate the analysis of the data. And this is especially true when you're trying to select the variance component model | |
within the variability chart that you have available in JMP. And so if you've ever used that tool (and I'll demonstrate it for you here with a couple of examples), | |
if you're asking JMP to calculate the variance components for you, you have to make a decision as to what kind of model you want. Is it nested? Is it crossed? Maybe it's crossed then nested. | |
Maybe it's nested then crossed. So figuring out what the correct variance component model is, is really well facilitated by a good sampling tree. The other place that we've used them is where we are thinking about control charts. So the | |
the control chart application really helps you | |
see what's changing within subgroups and what's changing between subgroups. So it helps you think critically about what you're actually seeing in the control charts. So as I mentioned, they're good for | |
showing the lines of restriction in split plots, but they're less useful for the analysis of designed experiments; so again, for DOE types of applications they're more of an up-front tool. | |
So let's jump into it here with an example. So here's what I would call a general components of variance study. | |
And so in this case, this is actually from the literature. This is from Box Hunter and Hunter, "Statistics for Experimenters," and you'll find it towards the back of the book | |
where they are talking about components of variance study and it happens to be on a paint process. And so what they have in this particular study are 15 batches of | |
pigment paste. They're sampling each batch twice and then they're taking two moisture measurements on each of those samples. | |
So the first sample in the first batch is physically different from the first sample in the second batch, and the first | |
sample out of the second batch is physically different from any of the other samples. And so one practice that we try to use and teach is that | |
for nested factors, it's often helpful to number them sequentially. That again emphasizes that you have physically different experimental units as you go from sample to sample throughout. | |
And so this is a nested sampling plan; the sample is nested under the batch. So let's see how that plays out in the variability chart within JMP. | |
Okay, so here's the data, and we find the variability chart under Quality and Process. | |
And then we're going to list here as the x variables the batch and then the sample. | |
And one thing that's very important in a nested sampling plan is that the factors get loaded in here in the same order that you have them in a sampling tree. So this is hierarchical. | |
So, otherwise the results will be a little bit | |
confusing. So we can decide here in this launch window what kind of variance component model we want to specify. We said this is a nested sampling plan. | |
And so now we're ready to go. We leave the measurement out of the list of Xs, because the measurement really just defines where the subgroups are. | |
So we just leave that out, and that's going to be what goes into the variance component that JMP refers to as the within variation. | |
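For reference, here is a JSL sketch that approximately reproduces this launch; the column names :Batch, :Sample, and :Moisture are assumed, and the option names follow the items in the launch dialog.

```jsl
// Sketch of the nested launch: Batch above Sample, matching the sampling tree.
// The Moisture measurement is left out of X(), so it becomes the "Within" component.
dt = Current Data Table();
dt << Variability Chart(
	Y( :Moisture ),
	X( :Batch, :Sample ),   // order must follow the tree: Batch first, then Sample
	Model( "Nested" )
);
```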
Okay, so here's the variability chart. One of the nice things too with the variability chart is there's an option to add some | |
some graphical information. So here I've connected the cell means. And so this is really indicating, visually, what kind of variation you have between the samples within a batch. | |
And then we have two measurements per sample, as indicated on our sampling tree. And so the distance between the two points within the batch and sample indicates the | |
within-subgroup variation. So you can see, just right off the bat, it looks like there's a good bit of sample-to-sample variation. And the other thing we might want to show here are the group means. | |
And so that shows us the batch-to-batch variation. So the purple line here is the average on a batch-to-batch basis. Okay. Now, what about the actual breakdown of the variation here? Well, that's nicely done in JMP under Variance Components. | |
And once we get that up there, we can see it; then I'll collapse this part of the report. | |
As we saw graphically, it looked like the sample to sample variation within a batch was a major contributor to the overall variation in the data. And in fact, the calculations confirm that. So we have | |
about 78% of the total variation coming from the samples, about 20% of the variation coming batch to batch, and | |
only about 2.5% of the variation coming from the measurement-to-measurement variation within the batch and sample. Notice here, too, in the variance components table, the | |
notation that's used here. This indicates that the sample is within the batch, so this is a nested study. And again, it's important that we load the factors into the | |
variability chart in the order indicated here in the tree. It wouldn't make any sense to say that within sample one we have batches one and two; that just doesn't make any physical sense. And so the tree reflects that. | |
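As a reminder of how those percentages are computed, the total variance is just the sum of the estimated components, and each percentage is that component's share of the total; the rough 20% / 78% / 2.5% split quoted above adds up to about 100%.

$$
\hat\sigma^2_{\text{total}} \;=\; \hat\sigma^2_{\text{Batch}} \;+\; \hat\sigma^2_{\text{Sample[Batch]}} \;+\; \hat\sigma^2_{\text{Within}},
\qquad
\%\text{ of total} \;=\; 100 \times \frac{\hat\sigma^2_{\text{component}}}{\hat\sigma^2_{\text{total}}}
$$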
Now let's compare that with something a little bit different. I call this a traditional Gage R&R study. And so what you have in a traditional Gage R&R study is | |
a number of parts, samples, or batches that are being tested. And then you have a number of operators who are testing each one of those, and each one tests the same sample or batch multiple times. So in this particular example we're showing five | |
parts or samples or batches, three operators measuring each one twice. Now in this case, operator one | |
for batch number one is the same as operator number one for batch or sample or part number five. | |
So you can think of this as saying, well, the operator kind of crosses over between the parts, samples, batches, whatever the thing is that's getting | |
measured. So this is referred to as a crossed study. And it's important that they measure the same article, because one of the things that | |
comes into play in a crossed study that you don't have in a nested study is a potential interaction between the operators and what they're measuring. So that's going to be reflected in the variance component analysis that we see from JMP. Now let's have a look here | |
at this particular set of data. So again, we go to the handy variability chart, which is found under Quality and Process. And in this case, I'll start by using the same order for the X variables as shown on the sampling tree. | |
But, as I'll show you, one of the features of a crossed study is that we're no longer stuck with the hierarchical structure of the tree. We can flip these around. | |
And so this is crossed. I'm going to be careful to change that here. Remember that we had a nested study from before. And I'm going to go ahead and click okay. | |
And I'm going to put our | |
cell means and group means on there. | |
So the group means in this case are for the samples, three of them, and we've got three operators. | |
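Again for reference, a JSL sketch of this crossed launch; the column names :Sample, :Operator, and :Measurement are assumed.

```jsl
// Crossed Gage R&R launch: every operator measures every sample,
// so Sample and Operator are crossed and their order in X() is flexible.
dt = Current Data Table();
dt << Variability Chart(
	Y( :Measurement ),
	X( :Sample, :Operator ),   // X( :Operator, :Sample ) gives the same variance components
	Model( "Crossed" ),
	Connect Cell Means( 1 ),
	Show Group Means( 1 )
);
```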
And now let's ask for the variance components. | |
Notice that we don't have that "sample within" notation like we had in the nested study. What we have in this case is a sample-by-operator interaction. | |
And it makes sense that that's a possibility in this case, because again, they're measuring the same sample. | |
So Matt is measuring the same sample A as the QC lab is, as is Tim. So an interaction in this case really reflects how different this pattern is as you go from one sample to the other. So you can see that it's generally the same. | |
It looks like Matt and QC tend to measure things perhaps a little bit lower overall than Tim; part C is a little bit of an exception. So the interaction | |
variation contribution here is relatively small. There is some operator-to-operator variation, and the within variation really | |
is the largest contributor. And that's easy to see here because we've got some pretty wide bars. But again, this is a crossed study, so | |
we should be able to change the order in which we load these factors and get the same results. So that's my proposition here; let's test it. | |
So I'm just going to relaunch this analysis and I'm going to switch these up. I'm going to put the operator first and the sample second. Leave everything else the same. | |
And let's go ahead and put our cell means | |
and group means on there. | |
And now let's ask for the variance components. | |
So how do they compare? I'm going to collapse that part of the report. | |
So in the graphical part, and this is a cool thing to recognize with a crossed study, | |
because again, we're not stuck with the hierarchy that we have in a nested study, we can kind of change the perspective on how we look at the data. | |
So that perspective, with the operator loaded in first, gives us sort of a direct operator-to-operator comparison here in terms of the group means. And again, that interaction is reflected in how this pattern changes between the | |
operators here as we go across parts A, B, and C. | |
What about the numbers in terms of the variance components? Well, we see that the variance components table here reflects the order in which we loaded these | |
factors into the dialog box. | |
But the numbers come out very much the same. So for the sample, on the left-hand side here, the standard deviation is 1.7. | |
The standard deviation due to the operator is about 2.3, and it's the same value over here. | |
The sample-by-operator, or operator-by-sample interaction if you like, is exactly the same. And the within is exactly the same. So, | |
with a crossed study, we have some flexibility in how we load those factors in and then the interpretation is a little bit different. If these were different samples, | |
we might expect this pattern, going from operator to operator, to be somewhat random, because they're measuring different things. So there's no reason to expect that the pattern would repeat. | |
If you do see a significant interaction term in a traditional Gage R&R study like we have here, well, then you've got a real | |
issue to deal with because that's telling you that | |
the nature of the sample is causing the operators to measure differently. So that's a bit harder of a problem to solve than if you have a no-interaction situation. | |
OK. | |
Dave | So again, for your reference, I have this listed out here. |
Um, so now let's get to something a little bit more juicy. So here we have sort of a blended study where we've got both crossed and nested factors. This is from the business that I work in. | |
The purity of the materials that we make is really important, and a workhorse methodology for measuring purity is high performance liquid chromatography, or HPLC for short. | |
So this was a product that was getting used in an FDA-approved application, so getting the purity right was really important. | |
So this is a slice from a larger study. But what I'm showing is the case where we had three samples; I'm labeling them here S1, S2, S3. We have two analysts in the study. | |
And so each analyst is going to measure the same sample in each case. So you can see that's similar to what we had in the previous example, what I call the traditional Gage R&R, | |
where each operator or analyst in this case is measuring exactly the same part or sample. So that part is crossed. When you get down under the analyst, each analyst then takes the material and preps it two different times. | |
And then they measure each prep twice; they do two injections into the HPLC with each preparation. So preparation one is different from preparation two, and that's physically different from the first preparation for the next analyst over here. And so again, we try to remember to | |
label these nested factors sequentially to indicate that they're physically different units here. | |
It doesn't really make any difference from JMP's point of view, it'll handle it fine, if you were to go 1-2, 1-2, 1-2, and so on down the line, that's fine, | |
as long as you tell it the proper variance component model to start with. So this would be crossed and then nested. So let's see how that works out in JMP. | |
So here's our data: sample, analyst, prep number, and then just an injection number, which is really kind of the within-subgroup part. | |
So once again we go to analyze, quality and process. We go to the variability chart. | |
And here we're going to put in | |
the factors in the same order as they're shown on the sampling tree. | |
And then we're going to put the area in there, as the percent area is the response. And we said this was crossed then nested, so we have | |
a couple of other things to choose from here. And in this case, again, the sampling tree is really helpful for convincing us that this is the case and selecting the right model. This is crossed then nested. | |
Let's click OK. | |
I'm going to put the cell means and group means on there. Again, we have a second factor involved above the within. So let's pick both of them. | |
And let's again ask for the variance components. | |
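A JSL sketch of this launch, with assumed column names :Sample, :Analyst, :Prep, and :Area; Prep must stay below the two crossed factors, just as it sits below them on the tree.

```jsl
// Crossed then nested: Sample and Analyst are crossed, Prep is nested under both,
// and the two injections per prep become the Within component.
dt = Current Data Table();
dt << Variability Chart(
	Y( :Area ),
	X( :Sample, :Analyst, :Prep ),   // same top-to-bottom order as the sampling tree
	Model( "Crossed then Nested" ),
	Connect Cell Means( 1 ),
	Show Group Means( 1 ),
	Variance Components( 1 )
);
```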
And I'm going to just collapse this part, hopefully, and maybe I'm going to collapse the standard deviation chart, just | |
to bring it a little bit further up onto the screen. | |
So what we can see in the graph, as we go across, is a good bit of sample-to-sample variation. | |
The within variation doesn't look too bad. | |
But we do maybe see a little bit of | |
variation within the preparation. | |
So, um, the sample in this case is by far the biggest | |
component of variation, which is really what we were hoping for. | |
The analyst is really below even the within-subgroup variation. And so this lays it out for us very nicely. So in terms of | |
what the variance components table is showing, in terms of the components, it's sample, analyst, and then, because these two are crossed, we've got a potential interaction to consider in this case. | |
It doesn't seem to be contributing a whole lot to the overall variation. And again, that's how the pattern changes as we go from analyst to analyst and sample to sample. | |
Now, the claim I made before with the fully crossed study was that we could swap the | |
crossed factors in terms of their order, and it would be okay. So let's try that in this case. | |
So I'm just going to redo this, relaunch it, and I think I can swap the | |
crossed factors here, but again I have to be careful to leave the nested factor where it is in the tree. Notice over here in the variance components table, the way we would read this is that we have the prep | |
nested within the sample and the analyst. So that means it has to go below those on the tree. | |
So let's go ahead and connect some things up here. | |
I'm going to take the standard deviation chart off and ask for the variance components. | |
Okay, so just like we saw in the traditional Gage R&R example we've got | |
the analyst and the sample switching. | |
But their values, if we look at the standard deviation over here in the last column, are identical. | |
We have again the identical value for the interaction term, and the identical value for the prep | |
term, which again is nested within the sample and the analyst. So again, here's where the sampling tree helps us really fully understand the structure of the data, and it complements nicely what we see in the | |
variance components table in JMP. | |
So those are a couple of examples that are geared towards components of variance studies. | |
One thing you might notice too, which I forgot to point out earlier: look at the sampling tree here, if I bring this back up. | |
Dave | It's interesting if you look at the horizontal axis in the variability chart, it's actually the sampling tree upside down. |
So that's another way to confirm that you're looking at the right structure when you're trying to decide what variance component model to apply. | |
So again, here are the screenshots for that. | |
Here's an example where | |
the sampling tree can help you understand sources of variation in a control chart, of all things. So in this particular case, | |
over a number of hours, samples are being pulled off the line. These are actually lens samples. I mentioned that we make photochromic dyes to go into the Transitions lenses, and | |
they will periodically check the film thickness on the lenses, and that's a destructive test. So when they take that lens and measure the film thickness, well, they're done with that sample. And so | |
what we would see, if we were to construct an Xbar and R chart for this, is that the Xbar chart is going to show the hour-to-hour average. | |
And then the within-subgroup variation is going to be made up of what's going on sample to sample and in the thickness measurement. | |
Now in this case, notice that there are vertical lines in the sampling tree, so the tree doesn't branch at that level. | |
So when you see vertical lines drawn on the sampling tree, that's an indication that the variability between those two levels of the tree is confounded. | |
So, I can't really separate the inherent measurement variation in the film thickness from the sample-to-sample variation. I'm kind of stuck with those, in terms of how this measurement system works. So let's whip up a control chart with this data. | |
And for that, again, we're going to go to Quality and Process, and I'm going to jump into the Control Chart Builder. So again, our measurement variable here is the film thickness. | |
And we're doing that on an hour-to-hour basis. So when we get it set up by doing that, we see that JMP smartly recognizes that the subgroup size is 3, just as indicated on our sampling tree. | |
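A JSL sketch of this Control Chart Builder launch; the column names :Hour and :Film Thickness are assumed. With a subgrouping column and a continuous Y, the default chart pair is XBar and R.

```jsl
// Film thickness subgrouped by hour; three lenses per hour give subgroup size 3.
dt = Current Data Table();
dt << Control Chart Builder(
	Variables( Subgroup( :Hour ), Y( :Film Thickness ) )
);
```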
But what's interesting in this example is that you might, at first glance, be tempted to be concerned, because we have so many points out of control on the Xbar chart. | |
But let's think about that for a minute in terms of what the sampling tree is telling us. | |
So the sampling tree again is telling us that what's changing within the subgroup, what's contributing to the average range, is the | |
measurement-to-measurement variation in film thickness, along with the sample-to-sample variation. And remember how the control limits are constructed on an Xbar chart. | |
They are constructed from the average range: we take the overall average, and then we go plus or minus | |
the average range times a factor related to the subgroup sample size, so the width of these control limits is driven by the magnitude of the average range. | |
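In other words, for this subgroup size of n = 3, the Xbar chart limits are the grand mean plus or minus a multiple of the average range, so everything that varies inside the subgroup, here the confounded sample and measurement variation, sets the width of the limits.

$$
\text{UCL/LCL} \;=\; \overline{\overline{X}} \;\pm\; A_2\,\overline{R},
\qquad A_2 \approx 1.023 \text{ for } n = 3
$$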
And so really what this chart is comparing is, let's consider this measurement variation down here at the bottom of the tree. So it's comparing measurement variation | |
to the hour-to-hour variation that we're getting from the line. So that's actually a good thing, because it's telling us that we can see | |
variation that rises above the noise that we see within the subgroup. So in this case, that's actually desirable. | |
And so, again, a sampling tree is really helpful for reminding us what's going on in the Xbar chart in terms of the within-subgroup and between-subgroup variation. | |
Now, just a couple of conceptual examples in the world of designed experiments. A split plot experiment is an experiment in which you have a restriction on the run order of the experiment. | |
And what that does is it ends up giving you a couple of different error structures, and JMP does a great job now of designing experiments for that situation, where we have restrictions on randomization, and also | |
analyzing them. Nevertheless, it's sometimes helpful | |
to understand where those error structures might be | |
splitting. In a split plot design, you get into what are called whole plot factors and subplot factors. And the reason you have a restriction on randomization is typically because one or more of the factors is hard to vary. So in this particular | |
scenario, we have a controlled environmental room where we can spray paints at different temperatures and humidities. | |
But the issue there is you just can't randomly change the humidity in the room because it just takes too long to stabilize and it makes the experiment rather impractical. So what's shown in this sampling tree is you really have | |
three factors here: humidity, resin, and solvent. These are shown in blue. And so we only change humidity once, because it's a difficult-to-change variable. That's how you set up a split plot experiment in JMP: you | |
specify how hard the factors are to change. So in this case, humidity is a hard, or very hard, to change factor. | |
And so, JMP will take that into account when it designs the experiment and when you go to analyze it. | |
But what this shows us is that the humidity would be considered a whole plot factor, because it's above the line of restriction, and then the resin and the solvent are | |
subplot factors; they're below the line of restriction. So there's a different error structure above the line of restriction, for whole plot factors, than there is for subplot factors. | |
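One common way to sketch that error structure for this humidity / resin / solvent example is shown below; the whole-plot error goes with humidity above the line of restriction, and the subplot error goes with resin and solvent below it. The notation here is illustrative, not JMP output.

$$
y_{ijk} \;=\; \mu \;+\; H_i \;+\; \underbrace{\omega_{i}}_{\text{whole-plot error}} \;+\; R_j \;+\; S_k \;+\; (\text{interactions}) \;+\; \underbrace{\varepsilon_{ijk}}_{\text{subplot error}}
$$

where H, R, and S are the humidity, resin, and solvent effects; the whole plot factor is judged against the whole-plot error, and the subplot factors against the subplot error.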
In this case we have a whole bunch of other factors that are shown here, which really affect how a formulation | |
which is made up of a resin and a solvent, gets put into a coating. So this is actually a 2 to the 3 experiment with a restriction on randomization. | |
It's got eight different formulations in there. Each one is applied to a panel and then that panel is measured once so that | |
what we see in terms of the measurement-to-measurement variation is confounded with the panel-to-panel coating variation. As I said before, when we have vertical lines | |
on the sampling tree, we have | |
some confounding at those levels. So that's an example where we're using it to show us where the splitting is in the split plot design. | |
This particular example, again, is conceptual, but it actually comes | |
from the days when PPG was making fiberglass; we're no longer in the fiberglass business. But in this case, what was being sought was an optimization, or at least an understanding of the impact of four controllable variables, on what was called loss ???, so they basically took coated | |
fiber mats and then measured the amount of coating that's lost when they basically burn up the sample. So what we have at the top of the tree is actually a 2 to the 4 design. So there are 16 combinations of the four factors in this case, and for each run in the design, | |
the mat was split into 12 different lanes, as they're referred to here. So you're going across the mat | |
from lane 1 to lane 12, and then we're taking out three sections within each one of those lanes, and then we're doing a destructive measurement on each one of those. So this actually combines a factorial designed experiment | |
with a components of variance study. | |
And so again, we've got vertical lines here at the bottom of the tree indicating that the measurement to measurement variation is confounded with the | |
section-to-section variation. And so what we ended up doing here, in terms of the analysis, was we treated the data from each DOE run sort of like the batches we had in the | |
moisture example from Box, Hunter, and Hunter; instead of batches, here you have DOE runs 1, 2, 3, and so on through 16, and then we're subsampling below that. And so | |
we treat this part as a components of variance study, and then we basically averaged up all the data to look and see what would be the best settings for the | |
four controllable factors involved here. So this is really a good study because it got to a lot of questions that we had about this process in a very efficient manner. | |
So again, combining a COV with a DOE, a design of experiments with a components of variance study. | |
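A JSL sketch of the components-of-variance side of that analysis, treating each DOE run like a batch; the column names :DOE Run, :Lane, and :Loss are assumptions for illustration. Because each section is measured only once, the section and measurement variation land together in the Within component.

```jsl
// Nested COV within the factorial: run above lane, with the single measurement per
// section forming the Within (section and measurement variation confounded).
dt = Current Data Table();
dt << Variability Chart(
	Y( :Loss ),
	X( :DOE Run, :Lane ),
	Model( "Nested" ),
	Variance Components( 1 )
);
```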
So in summary, I hope you've gotten an appreciation for sampling trees. They're pretty simple. | |
They're easy to understand and easy to construct, yet they're great for helping us talk through what we're thinking about in terms of sampling a process or understanding a measurement system. | |
And they also help us | |
decide what's the best variance components model when we look to get the variance components from JMP's variability chart | |
platform. We get a lot of use out of that particular tool; I like to say that it's worth the price of admission to JMP | |
for that tool in and of itself. So I've shown you some examples here where it's nested, where it's crossed, where it's crossed then nested, and then also where we've applied this kind of thinking to | |
control charts, to help us understand what's varying within subgroups versus what's varying between subgroups. And then also, perhaps to a lesser extent, | |
we can use those with designed experiments as well. So thanks for sharing a few minutes with me here and my email's on the cover slide so if you have any questions, I'd be happy to converse with you on that. So thank you. |