21st Century Screening Designs (2021-EU-45MP-772)
Bradley Jones, JMP Distinguished Research Fellow, SAS
JMP has been at the forefront of innovation in screening experiment design and analysis. Developments in the last decade include Definitive Screening Designs, A-optimal designs for minimizing the average variance of the coefficient estimates, and Group-orthogonal Supersaturated Designs. This tutorial will give examples of each of these approaches to screening many factors and provide rules of thumb for choosing which to apply for any specific problem type.
Auto-generated transcript...
Speaker | Transcript |
Brad Jones | Hello, thanks for joining me. My name is Bradley Jones and I work in the development department of JMP and today we're going to talk about 21st Century screening designs. |
And JMP has been an innovator in screening experiments over the last decade or so. I'm going to talk about three different kinds of screening designs and tell you what they are, how to use them, and when you should use one in preference to the others. |
So the three screening designs I'm going to talk about are A-optimal screening designs, definitive screening designs (or DSDs), and group orthogonal supersaturated designs (or GO SSDs). | |
So let me first introduce A-optimal screening designs. An A-optimal design minimizes the average variance of the parameter estimates for the model. The way to remember that: the A in A-optimal stands for average. By contrast, the D-optimal design minimizes the volume of a confidence ellipsoid around the parameter estimates by maximizing the determinant of the information matrix. That's a lot of words, and it's hard to see from the determinant why all of that is true, but you can remember that the D in D-optimal stands for determinant. |
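Both criteria are simple functions of the information matrix. As a minimal sketch of the definitions (my own illustration, not JMP's implementation), the two criteria can be computed from a model matrix X like this:

```python
import numpy as np

def a_criterion(X):
    """Average variance of the parameter estimates, in units of sigma^2:
    trace((X'X)^-1) / p for p parameters. A-optimal designs minimize this."""
    p = X.shape[1]
    return np.trace(np.linalg.inv(X.T @ X)) / p

def d_criterion(X):
    """Determinant criterion on a per-parameter scale: det(X'X)^(1/p).
    D-optimal designs maximize this, which minimizes the volume of the
    confidence ellipsoid for the parameters."""
    p = X.shape[1]
    return np.linalg.det(X.T @ X) ** (1.0 / p)
```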
So why am I saying that we should do A-optimal screening designs? D-optimal designs have been the default in the custom designer for 20 years. Why should we change? | |
Well, my first reason is that the A-optimality criterion is easier to understand than the D criterion. Everybody knows what an average is. |
The average variance of the parameter estimates is directly related to your capability of detecting a particular active effect. The D criterion, talking about a global confidence ellipsoid, doesn't tell you anything about your capability to estimate any one parameter with precision. So I think the A-optimality criterion is easier to understand. |
But more than that, when the A-optimal design is different from the D-optimal design, the A-optimal design often has better statistical properties. It is true that a lot of the time the A- and D-optimal screening designs are going to be the same. But there are times when they're different, and when they are different, boy, I really like the A-optimal design better. |
So I want to motivate you with an example. Suppose I have four continuous factors, and I want to fit a model with all the main effects of these factors and all their two-factor interactions. So there are four main effects and six two-factor interactions, but I can only afford 12 runs. So I'm going to do a JMP demonstration. In that demonstration, I'm going to make a D-optimal design for this case and an A-optimal design, and then compare the two designs. |
So I'm going to open up my JMP journal here and go to the first case: four factors with two-factor interactions in 12 runs. This is a script that builds the designs so I don't have to actually create the D-optimal design live; just believe that it is D-optimal. Here's the D-optimal design. You notice that all of the values in the design are either plus or minus one. And here's the A-optimal design, and there's a surprise here: in the first column, you see that X1 has four zeros in it, whereas X2, X3, and X4 have all plus and minus ones. That's a bit of a surprise, and it's a good thing that all these factors are continuous. But let's now compare these two designs. |
So to start the comparison, let's look at the relative efficiencies of the estimates of the parameters. | |
This is the efficiency of the A-optimal design relative to the D-optimal design, so when a number is bigger than one, the A-optimal design is more efficient. There's one parameter that's less efficiently estimated by the A-optimal design, but all the rest of the parameters are estimated at least as well as with the D-optimal design, and many of them are up to 15% better estimated by the A-optimal design. |
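As a hedged sketch of how such a comparison could be reproduced outside JMP (the design arrays `D_a` and `D_d` are hypothetical numpy arrays holding the two 12-run designs), the per-parameter relative efficiency is a ratio of diagonal entries of the two inverse information matrices:

```python
import numpy as np
from itertools import combinations

def model_matrix(D):
    """Expand a design (runs x factors) into an intercept, main-effects,
    and two-factor-interaction model matrix."""
    n, k = D.shape
    cols = [np.ones(n)] + [D[:, i] for i in range(k)]
    cols += [D[:, i] * D[:, j] for i, j in combinations(range(k), 2)]
    return np.column_stack(cols)

def relative_efficiency(D_a, D_d):
    """Variance of each parameter estimate under the D-optimal design
    divided by its variance under the A-optimal design; values greater
    than 1 mean the A-optimal design estimates that parameter better."""
    Xa, Xd = model_matrix(D_a), model_matrix(D_d)
    va = np.diag(np.linalg.inv(Xa.T @ Xa))
    vd = np.diag(np.linalg.inv(Xd.T @ Xd))
    return vd / va
```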
I'd like to show you one other thing, and that is this. | |
This is the correlation cell plot of the two designs, and you can see that the D-optimal design has all kinds of off-diagonal correlations, up to 0.3333, that is, a third. The A-optimal design only has three pairwise correlations that are not zero. Everything else in this set of pairwise correlations is actually zero: the main effects are all orthogonal to each other, the first main effect is orthogonal to all of the two-factor interactions, and many of the two-factor interactions are also orthogonal to each other. All of this orthogonality means that it's going to be easier to make model selection decisions for the A-optimal design than for the D-optimal design. One other thing that I think you might be interested in is that the G efficiency of the A-optimal design is more than 87% better than the D-optimal design's. G efficiency is a measure of the maximum variance of prediction within the design region, so the maximum prediction variance for the D-optimal design is nearly twice as big as for the A-optimal design. |
Also, the I efficiency of the A-optimal design is more than 14% better than the D-optimal design's. So for all these reasons, I think it's pretty clear-cut that the A-optimal design is better than the D-optimal design in this case. |
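A similar hedged sketch for those last two summaries (reusing the hypothetical `model_matrix` helper and `D_a` array from the sketch above): G efficiency is driven by the maximum relative prediction variance over the design region, and I efficiency by the average, so both can be approximated on a grid over the cube:

```python
import numpy as np
from itertools import product

def prediction_variance(X, F):
    """Relative prediction variance f(x)' (X'X)^-1 f(x) for each
    model-expanded point (the rows of F)."""
    XtX_inv = np.linalg.inv(X.T @ X)
    return np.einsum('ij,jk,ik->i', F, XtX_inv, F)

# Grid over [-1, 1]^4, expanded to the main-effects + 2FI model.
grid = np.array(list(product(np.linspace(-1, 1, 5), repeat=4)))
F = model_matrix(grid)
v = prediction_variance(model_matrix(D_a), F)
print(v.max(), v.mean())   # the max relates to G efficiency, the mean to I
```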
So let me clean up a bit. | |
Going back to the slides, this is just the picture that you just saw, but I wanted to show it to you in JMP instead. So let's move on to another reason why I like A-optimal designs. |
A-optimal designs allow for putting differential weight on groups of parameters. So what does that mean? With a D-optimal design you could, in theory, weight some parameters more than others, but in so doing, you don't change the design that gets created. Whatever weighting you use is still going to give you the same design, and that means that weighting a D-optimal design is not useful. In fact, for most of the variance optimality criteria, including I-optimality, weighting doesn't help you put more emphasis on particular parameters. But with A-optimal designs, when you weight the parameters, you can achieve a different design, and that might be useful in some cases. |
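A brief sketch of why that is (standard results, stated in my own notation): with a diagonal weight matrix W, a weighted A-optimal design minimizes trace(W (X'X)^-1), and different choices of W can produce genuinely different optimal designs. For D-optimality, the analogous quantity is det(W) det((X'X)^-1), and since det(W) is the same constant for every candidate design, the ranking of designs never changes.

```python
import numpy as np

def weighted_a_criterion(X, w):
    """trace(W (X'X)^-1) with W = diag(w); a larger weight asks for a
    smaller variance on the corresponding parameter estimate."""
    return float(np.sum(w * np.diag(np.linalg.inv(X.T @ X))))

# Example weights for the case that follows: intercept and the ten
# interactions weighted 1, the five main effects weighted 10.
w = np.array([1] + [10] * 5 + [1] * 10)
```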
So here's my weighted A-optimal design example. Suppose I have five continuous factors now, and I want to fit a model with all the main effects and all the two-factor interactions. So I have five main effects and ten two-factor interactions, and I can only afford 20 runs. I care more about being able to estimate the main effects precisely, because I figure that the main effects are going to be bigger than the two-factor interactions, and I want to get really good estimates of them. |
So again I'm going to make a D-optimal design for this case and a weighted A-optimal design, and then compare the designs. Here's a picture of the demo. Here's the D-optimal design, and again we see that it's all plus and minus ones. And here's the A-optimal design, in which I have weighted the main effects so that they're 10 times more important than the two-factor interactions. Now I want to compare these two designs. |
Now, in this case, I'm looking at the efficiency of the D-optimal design relative to the A-optimal design, so the A-optimal design is better when the numbers are less than one. We see that, for all of the main effects, the A-optimal design is doing a better job of estimating them than the D-optimal design. Of course, there's a price that you have to pay for that: by downweighting the two-factor interactions, you get slightly worse precision for estimating those. So this is a trade-off, but I said that I was more interested in main effects than in two-factor interactions, and that's what I got. |
But this comparison of correlations is another reason why you might prefer the A-optimal design over the D-optimal design. In the D-optimal design, there are a lot more pairwise correlations. Notice that in the A-optimal design, all of the main effects are completely orthogonal to all the two-factor interactions; you can see this whole region of the correlation cell plot is white, which means zero correlation. The main effects are not all orthogonal to each other: X1 and X2 are both orthogonal to X3, X4, and X5, that is to say their pairwise correlations are zero, but there are some small correlations, of 0.2, between X1 and X2 and among X3, X4, and X5. Still, there's a lot of orthogonality in this plot, way more than in the D-optimal design. |
So if you were sure that you wanted to estimate the main effects better than the two-factor interactions, again the A-optimal design would be preferred to the D-optimal design. |
I don't want to close everything, because that would close my journal too. | |
Okay, going back to the slides. | |
So when would I want to use an A-optimal screening design? | |
Well, I'm going to tell you about DSDs and GO SSDs after this, but whenever they're not appropriate, you would use an A-optimal screening design, and that may often be the case in real-world situations. When you have many categorical factors, and some of the categorical factors may have more than two levels, you can't use either a DSD or a GO SSD. If certain factor-level combinations are infeasible, for instance if some corner of the design space doesn't make sense to run, then you would have to use the A-optimal design, because it is supported by the custom designer in JMP, which can handle infeasible combinations either through disallowed combinations or inequality constraints. Or if there's a non-standard model that you want to fit, for instance a three-way interaction of some factors, then the DSD and the GO SSD are not appropriate. And finally, when you want to put more weight on some effects than others, your only choice is to use an A-optimal design. |
So that concludes the section on A-optimality, and I'm going to proceed now to definitive screening designs. These designs were first published in the literature in 2011, so they're now 10 years old. And this is what they look like. Here's the six-factor definitive screening design, but definitive screening designs exist for any number of factors, so they're very flexible. |
If we look at the first and second runs, we notice that every place Run 1 is plus one, Run 2 is minus one, and every place Run 1 is minus one, Run 2 is plus one. So Runs 1 and 2 are like mirror images of each other, and if we look at all the pairs of runs, they're all like that: every odd-numbered run is the mirror image of the following even-numbered run. And finally, we end with a row of zeros. Another thing we might notice is that there are a couple of zeros for each factor in the design. This turns out to be useful, because it allows us to estimate quadratic effects of all the factors, and the overall center run lets you estimate all the quadratic effects as well as the intercept term. |
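As a sketch of that structure in code (DSDs can be built from conference matrices, per Jones and Nachtsheim's 2011 paper; the particular 6x6 conference matrix below is a standard Paley construction that I supply for illustration, and the fold-over runs are stacked as a block here rather than interleaved as on the slide):

```python
import numpy as np

# A 6x6 symmetric conference matrix from the Paley construction:
# chi is the quadratic-residue character mod 5 (1 and 4 are squares).
chi = {0: 0, 1: 1, 2: -1, 3: -1, 4: 1}
Q = np.array([[chi[(j - i) % 5] for j in range(5)] for i in range(5)])
C = np.block([[np.zeros((1, 1)), np.ones((1, 5))],
              [np.ones((5, 1)), Q]]).astype(int)
assert np.array_equal(C.T @ C, 5 * np.eye(6, dtype=int))  # conference property

# 13-run, 6-factor DSD: mirror-image run pairs plus one center run.
DSD = np.vstack([C, -C, np.zeros((1, 6), dtype=int)])
```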
So, what are the positive consequences of running a definitive screening design? Well, if there is an active two-factor interaction, it will not bias the estimate of any main effect; the two-factor interactions, and the quadratic effects for that matter, are all uncorrelated with the main effects. Also, any single active two-factor interaction can be identified by this design, as well as any single quadratic effect. And the final, very useful property of this design is that if only three factors are active, you can fit a full quadratic model in those three factors, no matter which three factors they are. That would also apply, of course, to two factors or one factor being active. So you might be able to avoid having to do a response-surface experiment as a follow-up to the screening experiment. |
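Those orthogonality claims are easy to check numerically; continuing from the illustrative DSD array built above:

```python
import numpy as np
from itertools import combinations

main = DSD                              # 13 x 6 main-effect columns
quad = DSD ** 2                         # quadratic-effect columns
twofi = np.column_stack([DSD[:, i] * DSD[:, j]
                         for i, j in combinations(range(6), 2)])

# Every main effect is exactly orthogonal to every quadratic and
# two-factor-interaction column, thanks to the mirror-image run pairs.
assert np.all(main.T @ quad == 0)
assert np.all(main.T @ twofi == 0)
```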
Now, in the interest of full disclosure, there is a trade-off that definitive screening designs make compared with a D-optimal screening design: the standard errors of the main-effect estimates are about 10% larger than for the D-optimal screening design. That's a small price to pay, though, for the ability to estimate all the quadratic effects and to have protection against two-factor interactions. |
So let's look at a small demo of a definitive screening design in action. I recently wrote a paper with Chris Nachtsheim on how to analyze definitive screening designs, and when we submitted the original paper, the referees asked us to provide a real example. It's kind of difficult to provide a real example for a new kind of design, because nobody will have ever done one before; they haven't heard of it. But we have a way of instrumenting the custom designer so that we know how long a design is going to take to create, and we had a set of factors that can make a design take longer to produce. One of them, for instance, is the number of random starts that you use. So here are the times the custom designer took for all of these examples. |
If we look at DOE > Definitive Screening, there's an automatic analysis tool called Fit Definitive Screening, and if you create a definitive screening design using JMP, this script is always available, so we can see what the analysis is. The analysis goes in two stages: first the main effects are fit, then the second-order effects are fit, and then they're combined. It turns out that this Factor E is a Type I error, but all the rest of these effects are actually active. We know that because we also ran a full factorial design on all of these factors, so we know what the true model is in this case. Factor E didn't show up in the analysis of the full factorial design, but all these other effects, including the squared effect of Factor A and these three two-factor interactions, were real effects identified by the definitive screening design. So that got inserted in the paper, which was eventually published by Technometrics. |
So when would I use a DSD? | |
DSDs work best when most of the factors are continuous. We do have a paper showing how to add up to three two-level categorical factors, but as you continue adding more two-level factors, the DSD doesn't perform well; most of the factors need to be continuous. Also, if you expect there to be curvature in your factor effects, for instance if you think there might be an active quadratic effect, then you would want to use a DSD in preference to a two-level screening design. And you can also handle a few two-factor interactions when using a DSD. So those are the times when you would use a DSD. |
And that's all I have to say about them right now, but I want to move along to group orthogonal supersaturated designs or GO SSDs. | |
So this is a picture of a correlation cell plot for a group orthogonal supersaturated design, and the cool thing about them is that the factors come in groups. Here, one group is W1 through W4. Those factors are correlated with each other, but they're uncorrelated with any other group of factors. So the factors fall in groups that are mutually uncorrelated between groups but a little correlated within groups. And of course, since the design is supersaturated, there has to be some correlation; we're just putting the correlation in very convenient places. |
So here's the paper on how we construct these group orthogonal supersaturated designs, with a cast of many coauthors, including Ryan Lekivetz, who is another developer in the JMP DOE group. Here are pictures of my coauthors. Here's Ryan. Chris is a long-term colleague of mine, going back probably 30 years. Dibyen Majumdar is an associate dean at the University of Illinois at Chicago, and Jon Stallrich is a professor at NC State. |
So I'm going to talk a little bit about the motivation for doing supersaturated designs, tell you how to construct and analyze GO SSDs, and then compare our automatic analysis approach to other analytical approaches. A supersaturated design is a design that has more factors than runs. For example, you might have 20 factors that you're interested in, but runs are expensive, and you can only afford a budget of 12 runs. So the question is: is this a good idea? |
One of my colleagues at SAS, when I asked him what he thought about supersaturated designs, said, "I think supersaturated designs are evil." I felt my ears pin back a bit, but I went ahead and implemented supersaturated designs anyway. I understand, though, why Randy felt the way he did. The problem with a supersaturated design is that the design matrix is singular, so you can't even do multiple regression, which is the standard tool in Fit Model. Also, there is factor aliasing, because the factors are generally correlated; in most supersaturated designs, all the factors are correlated with all the other factors. So there's this feeling that you shouldn't expect to get something for nothing. |
In the early literature, supersaturated designs were introduced by Satterthwaite in 1959, and then Booth and Cox in 1962 introduced a criterion for optimally choosing a supersaturated design. Their criterion basically minimizes the sum of the squared correlations in the design. John Tukey was the first to use the term supersaturated, we think, in his discussion of the Satterthwaite paper, which was published with discussion. |
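For reference, a reconstruction of the Booth and Cox criterion from its standard textbook definition (my notation; the slide itself is not reproduced here): with $s_{ij}$ the off-diagonal elements of $X^\top X$ for the $m$ factor columns,

$$
E(s^2) \;=\; \frac{\sum_{i<j} s_{ij}^2}{\binom{m}{2}},
$$

and an $E(s^2)$-optimal supersaturated design minimizes this average.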
And a lot of the discussants were very nasty. But in Tukey's discussion, he says, "Of course, Satterthwaite's method, which is called constant balance, can only take us up to saturation (one of George Box's well-chosen terms)." A saturated design is a design where all of the runs are taken up by parameter estimates. But Tukey says, "I think it's perfectly natural and wise to do some supersaturated experiments." The other discussants largely panned this idea, though, and nothing happened in this area for 30 years. |
Then in 1993, two papers were published almost simultaneously, one in Biometrika by Jeff Wu and the other in Technometrics by Dennis Lin. Jeff is a professor who holds the Coca-Cola Chair at Georgia Tech, and Dennis Lin is now the chair of the department at Purdue. |
So when would you use a supersaturated design? Well, you certainly want to use one when runs are expensive, because then you can have fewer runs than factors and therefore a much less expensive design than you would have with a standard method, which always requires at least as many runs as there are factors. If you've done brainstorming for experiments, you know that when everybody gets their stickies out and writes down a factor they think is important, you end up with maybe dozens of stickies on the wall, each representing a factor that at least one person thought was important. What happens subsequently is that only a few of those factors are chosen to be experimented with, and the other ones are kind of ignored. That seems unprincipled to me; eliminating a bunch of factors in the absence of data seems like a bad idea. Another thing you might want to do is use them in computer experiments, because if you've got a very complicated computer experiment with dozens or even hundreds of tuning parameters, you can do a supersaturated design. That's especially good if the computer experiment takes a long time to run, so you wouldn't want to sit and wait for weeks to run a very large experiment. |
So how do we construct these GO SSDs? This is a little math, but we start with a Hadamard matrix H, and we have another Hadamard matrix T from which we've dropped a couple of rows. Then we take the Kronecker product of those two matrices (this interesting symbol here is the Kronecker product), and that creates a GO SSD. When you form the X-transpose-X matrix of the result, you get something block diagonal: each group of columns has its own square block, and all the other elements of X'X, the information matrix, are zero. That's what creates the group orthogonality. |
Here's a small example. This H is an orthogonal Hadamard matrix, and this T is just H with the last row removed. The Kronecker product of H and T replaces every element equal to 1 in H by the matrix T, and every minus one in H by the negative of T. Since H is four by four and T is three by four, we're putting a three-by-four matrix in place of each single element of H, and what we come up with is a design with 16 columns and 12 rows. |
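A minimal numpy sketch of that construction (using the Sylvester 4x4 Hadamard matrix; the paper allows other choices of H and T):

```python
import numpy as np

# Sylvester construction of a 4x4 Hadamard matrix.
H2 = np.array([[1, 1], [1, -1]])
H = np.kron(H2, H2)                     # 4 x 4

T = H[:-1, :]                           # drop the last row: 3 x 4

# Kronecker product: every +1 in H becomes T, every -1 becomes -T,
# giving a 12-run, 16-column GO SSD.
D = np.kron(H, T)                       # (4*3) x (4*4) = 12 x 16

# Group orthogonality: the information matrix is block diagonal,
# one 4x4 block per group of four columns.
XtX = D.T @ D
for g in range(4):
    for h in range(g + 1, 4):
        assert np.all(XtX[4*g:4*g+4, 4*h:4*h+4] == 0)
```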
This expands that example. Here's what happens when you do what I just talked about: we have 12 runs and 16 columns. The first column is the column of ones, which you would use to estimate the intercept, so we don't consider that a factor. And here are the correlations for this design. The first three factors have higher correlations among themselves, and they're also unbalanced; those columns don't have the same number of pluses and minuses. All the other factors have the same correlation structure as each other, and they all have balanced columns. |
One other interesting thing is that the groups of columns are fold-overs, so the main effects of each set of factors are uncorrelated with the two-factor interactions of those factors. Now, what my coauthors and I have recommended is that we not use those first three columns as factors; instead, because they're orthogonal to all the other factors, you can use them to estimate the variance in a way that is unbiased by the main effect of any of the factors. That way, we get a relatively unbiased estimate of the variance, which is very helpful in doing modeling. |
So the design properties of this particular GO SSD are that we have three independent groups of four factors, and each factor group has rank three, so I can estimate the effects of three factors out of each group of four. The fake-factor columns in this case have rank two, because you're using one of those columns to estimate the intercept, so you can estimate sigma squared with two degrees of freedom. And that estimate is unbiased, assuming that second-order effects are negligible. So I can test each group of factors using an estimate of sigma squared that has two degrees of freedom. I've already pointed out that these factor groups are fold-overs, and that provides substantial protection from two-factor interactions. |
So this table appears in the Technometrics paper. I'm happy to send you a copy of that paper if you just write me an email at Bradley.Jones@jmp.com; just my name @jmp.com. There are lots of choices that you can use for creating these designs, so the construction is reasonably flexible. |
So in the paper, we talked about how to create the designs, but we also talked about how to analyze them. The first group of factors we want to use to estimate the variance instead of using them as actual factors. We use the orthogonal group structure to identify active groups, and then within the active groups, we do either stepwise or all-possible-subsets regression to identify the active factors within a group. As we go, we can pool the sums of squares of inactive groups into the estimate of sigma squared, which gives us more power for finding active factors in the active groups. |
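A hedged sketch of that two-stage screening idea (a simplification of the paper's procedure, in my notation; `D` is the 12 x 16 GO SSD from the construction sketch above, and `y` is a hypothetical length-12 response vector):

```python
import numpy as np
from scipy import stats

def proj_ss(X, y):
    """Sum of squares of the projection of y onto the column space of X
    (least squares handles rank-deficient column sets correctly)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    yhat = X @ beta
    return float(yhat @ yhat)

def screen_groups(D, y, alpha=0.05):
    ones = D[:, :1]                     # intercept column
    fake = D[:, 1:4]                    # fake-factor columns
    groups = [D[:, 4:8], D[:, 8:12], D[:, 12:16]]

    # Error sum of squares from the fake factors, adjusted for the
    # intercept: rank 2 beyond the intercept -> 2 degrees of freedom.
    ss_err = proj_ss(np.hstack([ones, fake]), y) - proj_ss(ones, y)
    s2, df_err = ss_err / 2, 2

    active = []
    for g, Xg in enumerate(groups):
        ss_g = proj_ss(Xg, y)           # each group has rank 3
        F = (ss_g / 3) / s2
        if stats.f.sf(F, 3, df_err) < alpha:
            active.append(g)
    return active
```

Within each group flagged as active, stepwise or all-subsets regression on that group's columns would then pick out the individual factors.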
And if you can guess the direction of the signs of your effects (often you may not know how big an effect is going to be, but you do know which direction it goes), you can guess whether each effect will be positive or negative. If you have an effect that you think is negative, you can relabel the pluses to minuses and the minuses to pluses, so that the new effect will be positive. If you do that for all of the negative effects, so that all of the effects in each group are positive, you maximize your power for detecting an active group. So we recommend that. |
So we did a simulation study to compare our analysis procedure for GO SSDs to standard modern regression analyses like the Lasso and the Dantzig selector. So we have three different analytical approaches (the Dantzig selector, the Lasso, and our two-stage approach) and three different GO SSDs, and we varied the number of active groups; as you add more active groups, of course, you make it harder for the design to find everything. The power is high for all of the methods except when all the groups are active. In this design, there are only three groups, and when you make all of them active, the Dantzig selector and the Lasso have very poor power, but the two-stage analysis that we're proposing has basically a power of one for estimating everything that's estimable. |
In these two designs, there are seven groups, and when all of them are active, the Dantzig selector and the Lasso don't perform nearly as well as the two-stage method that we are recommending. What our two-stage method is doing is using the structure of the design to inform the analytical approach. The Dantzig selector and the Lasso don't care what the structure of the design is; in fact, they're used for arbitrary observational analyses as well as for analyzing designed experiments. So you would expect that a generic procedure might not do as well as a procedure that's constructed for the express purpose of analyzing this kind of design. |