Re-Thinking the Design and Analysis of Experiments? (2021-EU-30MP-776)

Level: Intermediate

Phil Kay, JMP Senior Systems Engineer, SAS
Peter Hersh, JMP Technical Enablement Engineer, SAS

Scientists and engineers design experiments to understand their processes and systems: to find the critical factors and to optimize and control with useful models. Statistical design of experiments ensures that you can do more with your experimental effort and solve problems faster. But practical implementation of these methods generally requires acceptance of certain assumptions (including effect sparsity, effect heredity, and active or inactive effects) that don’t always sit comfortably with knowledge and experience of the domain. Can machine learning approaches overcome some of these challenges and provide richer understanding from experiments without extra cost? Could this be a revolution in the design and analysis of experiments?

This talk will explore the potential benefits of using the autovalidation technique developed by Chris Gotwalt and Phil Ramsey to analyze industrial experiments. Case study examples will be presented to compare traditional tools with autovalidation techniques in screening for important effects and in building models for prediction and optimization.

We were not able to answer all questions in the Q&A so here goes...

I see the weights are not between 0 and 1. How are they generated?

They are "fractionally weighted". This is the key trick that enables the method. In Generalized Regression (GenReg) in JMP Pro, the weights go in the Freq role. You can use this add-in for JMP Pro 15 to generate the duplicated table with weighting. For the deep statistical details of the weighting, we recommend a talk by Chris Gotwalt and Phil Ramsey.

When you duplicate rows, are they duplicated with the same output result? What is the purpose of this duplication?

Yes, they are duplicated with the same response. It would be a bad idea just to do that and proceed with the analysis. The key innovation here is that the rows are fractionally weighted, such that a run has a low weighting in one set and a higher weighting in the other. For example, a run with a low weighting in "training" will have a duplicate with a high weighting in "validation". Some duplicate pairs will have more equal weighting in training and validation.
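
As a rough illustration of the paired weighting, here is a minimal Python sketch (not the JMP add-in's code). It assumes one commonly described scheme: draw u uniformly on (0, 1) for each run, then use -ln(u) as the training weight and -ln(1 - u) as the validation weight. That formula, the function name `make_autovalidation_table` and the placeholder responses are assumptions for illustration only; the Gotwalt and Ramsey talk is the place to check the actual details.

```python
import math
import random

def make_autovalidation_table(runs, rng):
    """Duplicate every run and attach paired fractional weights.

    ASSUMPTION: training weight = -ln(u), validation weight = -ln(1 - u),
    with u ~ Uniform(0, 1) drawn once per run. A small u gives a large
    training weight and a small validation weight, and vice versa.
    """
    training, validation = [], []
    for run in runs:
        # Clamp u away from 0 and 1 so that log() is always defined.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        training.append({**run, "set": "training", "weight": -math.log(u)})
        validation.append({**run, "set": "validation", "weight": -math.log(1.0 - u)})
    # Original rows first, then the duplicates, as in the duplicated table.
    return training + validation

# Placeholder 13-run design: the responses are made up, not the talk's data.
runs = [{"run": i, "y": float(i)} for i in range(1, 14)]
table = make_autovalidation_table(runs, random.Random(1))

for row in table[:2] + table[13:15]:
    print(row["run"], row["set"], round(row["weight"], 3))
```

Under this assumed scheme the weights are positive and unbounded above, which would be consistent with seeing weights outside the 0 to 1 range.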

We fit a model using these weightings to determine how much of each run goes into training and how much goes into validation. Then we redraw the weights and fit another model. We repeat this process hundreds of times. We then use the results across all these models, either to screen for the important effects using the proportion of models in which each effect is non-zero, or to build a useful model by averaging all the individual models.
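
The screening summary can be sketched in a few lines of Python. The parameter estimates below are invented purely for illustration (they are not results from the talk), and the function name `proportion_nonzero` and the effect labels are hypothetical. In JMP Pro the estimates would come from the table of refitted models, and the comparison is against the dummy (null) factor described in the talk.

```python
def proportion_nonzero(estimates):
    """Fraction of refitted models in which each effect has a non-zero estimate."""
    return {
        effect: sum(1 for b in values if b != 0.0) / len(values)
        for effect, values in estimates.items()
    }

# INVENTED estimates from four hypothetical refits (one list entry per refit).
# A value of 0.0 means the effect was not selected into that model.
estimates = {
    "Citric acid": [3.1, 0.0, 2.9, 3.0],
    "Citric acid*Citric acid": [1.8, 2.1, 0.0, 0.0],
    "Null factor": [0.0, 0.4, 0.0, 0.0],
    "Factor A*Factor B": [0.0, 0.0, 0.0, 0.0],
}

props = proportion_nonzero(estimates)

# Effects that enter the model more often than the dummy (null) factor.
flagged = sorted(e for e, p in props.items() if p > props["Null factor"])

print(props)
print(flagged)
```

With real results there would be hundreds of refits per effect rather than four.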

Is it possible to run the add-in in the standard (non-Pro) version of JMP 15?

Quite possibly. You can run the add-in but you will need JMP Pro to utilize the validation column in the analysis - this is critical.

How does it work with categorical variables?

We presented examples with only continuous variables. A nice feature is that it would work for any model and with any variable types.

Is this a recognized method now?

In a word, no. Not yet. Our motivation is to make people aware of this method, so that those with an interest will explore it and provide critical feedback. As we made clear in the presentation, we do not recommend that you use this as your only analysis of an experiment. If you like the idea, try it and see how it compares with other methods.

Having said that, the ideas of model averaging and holdback validation are recognised in larger data situations. It seems that this should be beneficial in the world of smaller data that is designed experiments.

Do you do the duplicate step manually?

No, this is done as part of the autovalidation setup by the add-in or by the platform in JMP Pro 16. You could easily create the duplicate rows yourself. The harder part is setting up the fractional weighting, without which the analysis will not work.

Bayesian methods also use a lot of simulation (MCMC). Would this also be applicable there, integrating prior and posterior distributions?

Quite possibly. The general idea of fractionally-weighted validation with bootstrap model averaging might well be applicable in lots of other areas.

Auto-generated transcript...

Speaker	Transcript
Phil Kay	Okay, so we are going to talk about rethinking the design and analysis of experiments. I'm Phil Kay. I'm the learning manager in the global technical enablement team and I'm joined by Pete Hersh.
Peter Hersh	Yeah I'm a member of the global technical enablement team as well, and excited to talk about DOE and some of the new techniques we're...we're exploring today.
Phil Kay	So the first thing to say, is we're big fans of DOE. It's awesome. It's had huge value in science and engineering for a very long time.
	And having said that, there are some assumptions that we have to be okay with in order to use DOE a lot of the time.
	And they don't always feel hugely comfortable, as a scientist or an engineer. So things like effect sparsity, so the idea that not everything you're looking at turned out to be important
	and actually only a few of the things that you're actually experimenting on turn out to be important. Or effect heredity is another assumption.
	So that means that we only expect to see complex behaviors or higher order effects for factors that are active in their simpler forms.
	And, and just this idea of active and inactive effects, so commonly, the sequential process of design of experiments is to screen out the active effects from the inactive effects.
	And that just feels sometimes like too much of a binary decision. It seems a bit crazy, I think, a lot of the time, this idea that some of the things we are experimenting on are completely inactive,
	when really, I think we, we know that really everything is going to have some effect. It might be less important but
	it's still going to have some effect. And these are largely assumptions that we use in order to
	get around some of the challenges of designing experiments when you can't afford to do huge numbers of runs and yeah. Pete, I don't know if you want to comment on that from your perspective as well.
Peter Hersh	Yeah I completely agree. I think that thinking of something like temperature being inactive is is...I think hard to imagine that temperature has no effect on an experiment.
Phil Kay	Yeah it's kind of absurd, isn't it? So,
	yeah, if that's in your experiment, then it's it's always going to be active in some way, but maybe not as important as other things.
	So I'm just going to skip right to the results of
	of what we've looked at. So we've been looking at this auto-validation technique that Chris Gotwalt and Phil Ramsey essentially invented
	and using that in the analysis of designed experiments, and it's really provided results that are just crazy. We just didn't think they were possible.
	So, first of all I looked at a system with 13 active effects and analyzing a 13 run definitive screening design and from that I was able to identify that 13 effects were active, which is... that's what we call a saturated situation. We wouldn't...
	commonly, we've talked about definitive screening designs as being effective in identifying the active effects when we have effect sparsity, when there's only a small number of effects that are actually important or active. But in this case I managed to identify all of the 13 active effects.
	And not only that, I was actually able to build a model with all those 13 active effects from this 13 run definitive screening design.
	So, again that's kind of incredible; we don't expect to be able to have a model with as many active effects as we have rows of data that we're building it from.
	And Pete, you looked at some other things and got some other really interesting results.
Peter Hersh	Yeah, absolutely, and and Phil's results are very, very impressive. I think
	what the next step that we tried was making a super saturated design, which is more active effects than runs and we tried this with
	very small DOEs. So a six run DOE with seven active effects, which if we did in standard DOE techniques, there'd be no way to analyze
	that properly. And we looked at comparing that to eight and 10 run DOEs and how much that bought us. So we got fairly useful models, even from a six run DOE, which was
	better than I expected.
Phil Kay	Yeah, it's better than you've got any right to expect really, isn't it?
	And so we've got these really impressive results and the ability to identify a huge number of active effects from a small
	definitive screening design and actually build that model with all those active effects. And in Pete's case, he has been able to build a model with seven active effects from really small
	designed experiments.
	So, how did we we do this? How does the auto-validation methodology work? Well, it's taking ideas from machine learning, and one of the really useful tools from machine learning is validation. So holdout validation is a really nice way of ensuring that you, you build
	the most useful model. So it's a model that's robust. So we hold out a part of the data, we use that to test different models that we build, and
	basically, the model that makes the best prediction of this data that we've held out is the model that we go with, and that's just really tried and tested.
	It's actually pretty much the gold standard for building the most useful models, but with DOE that's a bigger challenge, isn't it, Pete? It doesn't really obviously lend itself to that.
Peter Hersh	Yeah, yeah, the the whole idea behind DOE is exploring the design space as efficiently as possible. And if we start holding out runs or holding out
	analysis of runs, then we're going to miss part of that design space and
	we really can't do that with a lot of these DOE techniques like definitive screening designs.
Phil Kay	Right, right, so it'd be nice if there was some trick that we could get the benefits of this holdout validation and not suffer from holding out critical data. So that brings us to this auto-validation
	idea, and, Pete, do you want to describe a bit about how this works?
Peter Hersh	Absolutely, so this was a real clever idea developed by Chris and Phil Ramsey, and they they essentially take our original data from a DOE and they resample it, so you
	repeat the results. So if you notice that the table here at the bottom of the slide, the the runs in gray are the same results of runs in white. They're just repeated.
	And the way they get away with this is by making this weighting column that is paired together. So basically, if one has a high weight, the
	the repeated run of that has a low rate...weight and so on and so forth. And this is...this is...enables us to use the data with this validation and the weighting and we'll go into a little bit more detail about how that's done.
Phil Kay	Yeah, you'll kind of see how it happens when we go through the demos. So we've basically got two case studies of simulated examples that we use to illustrate this methodology. So this first case study I'm going to talk through,
	I emphasize it's a simulated example. And in some ways, it's kind of an unrealistic example, but I think it does a really nice job of demonstrating the power of this methodology.
	We've got six factors and to make it seem a bit more real, we've chosen some some real factors from a case study here where they were trying to grow some biological organism and optimize the nutrients that they feed into the system to optimize the growth rate.
	So we've got these six different nutrients and those are our factors. We can add in different amounts of those, so I designed a 13 run definitive screening design to explore those factors with this growth rate response.
	And the response data was completely simulated by me, and it was simulated such that there were 13 strongly active effects. So
	I simulated it so the all of the main effects, all all six of the the main effects, are active.
	And then, for each of those factors, the quadratic effects are active as well, so we've got six quadratic effects.
	Plus we've got an intercept that we always need to estimate, so there are 13 effects in total that are active, that are important in understanding the system.
	1 signal to noise, but that's still a real challenge with standard methodology in order to model that, and we'll come to that in the demo.
	So, really, the question is, can we identify all those important effects and, and if we can, then can we build a model with all those important effects as well? Because as I said, that would be really quite remarkable
	versus what we can do with standard methodology.
	And then case study #2, Pete?
Peter Hersh	Yeah, absolutely. Very, very similar to Phil's case study. Same idea, where we're feeding different nutrients at different levels to an organism and checking its growth rate. In this case I simplified what Phil had done and broke it down to just three nutrient factors. And this is
	building a different type of design, so an I-optimal supersaturated design where we're looking at a full response surface
	in a supersaturated manner and we looked at six, eight and 10 run
	designs. And so same same idea.
	The the
	effects were very, very
	high signal to noise ratio, so really wanted to be able to pick out those effects if they were active. And just like Phil's, I kept the main effects and the quadratics active, as well as the intercept and we're trying to pick those out.
	And same idea, so how many runs would we need to see these active effects and how accurate of a model can we make from these very small designs?
Phil Kay	Yeah because you know, like I said, you've really got no right to expect to be able to build a good model from such a small design.
Peter Hersh	Yeah, exactly. Okay.
Phil Kay	So I'll go into a demo now of case study #1.
	And I'm presenting this through a JMP project, so that's a really nice way to present your results. I'd recommend trying this out.
	And that's our
	design, so this is our 13 run definitive screening design, where we vary these nutrient factors, and we have the simulated growth rate response. As I said, that's been simulated such that
	the main effects, the quadratics of all of these factors are strongly active, plus we've got to estimate this intercept.
	Now, with a definitive screening design I generally recommend you use Fit Definitive Screening as a way of looking at the results, as one of the analyses that you can do.
	It works really well when we have this effect sparsity principle being true. So as long as only a few of the effects are strongly active...are active and the rest of them are unimportant,
	then it will find those...the few important effects and separate them from the unimportant ones.
	But in this case I wasn't expecting it to work well and it doesn't work well. It does not identify that all six factors are active. In fact it only identifies one of the factors as being active here.
	So that's not a big surprise, this is too difficult, too challenging a situation for this type of analysis.
	If somehow we knew that all of these active effects are active and we try and fit a model with all six main effects, all six quadratic and the intercept,
	then that's a saturated model. We've got as many parameters to estimate as we have rows of data, so we can just about fit that model, but we don't get any statistics.
	And in any case, you know, aside from the fact of I've simulated this data, in a real life situation, we wouldn't know which ones are active, so we wouldn't even know which model to fit.
	Now, using the auto-validation method, I was able to actually very convincingly identify the active effects, and I'll talk through how we did this.
	And this is just a visualization of my results here. You don't necessarily need to visualize it in this way. This is for presentation purposes.
	I was able to identify that first of all, the intercept was active. I've got all my six main effects,
	and my quadratic effects, and then my two factor interactions, which I simulated to have zero effect. You can see they are well down
	versus the other ones. And there's actually a null factor here that we use that...so so a dummy factor. So anything less than the null factor we can declare as being unimportant or or inactive, if you like.
	And what we're actually...the metric we're looking at here is something called proportion nonzero and I'll explain what that means, as we go through this. That's kind of the metric we're using here to identify the strength of an effect, of the importance of an effect.
	So a bit about how I went through this. So I took my original 13 run definitive screening design and then I set it up for auto-validation so we've now got 26 rows we've duplicated.
	And there's an add-in for doing this. One of our colleagues, Mike Anderson, created an add-in that you can use to do this in JMP 15.
	In JMP 16 they're actually adding the capability in the predictive modeling tools in the validation column platform.
	And what that does, we get this duplicate set of our data, and then we get this weighting and as Pete said, we have...each row is in the training set and in the validation set.
	In the training set, if it has a low weighting, it'll have a high weighting in the validation. So if it has a high weighting in the training set, it'll have a low weighting in the validation set.
	And what we do actually is, we read...these have basically been randomly assigned.
	We reassign those and we were able to kind of iterate over this hundreds of times, fitting the models each time and then looking at the aggregate...aggregated results over many simulation runs. So what you would do
	is to fit the model and I'm using GenReg here in JMP Pro.
	And you'll need JMP Pro anyway, because you need to be able to specify this validation role, so we put...the train validation column goes into validation.
	And the weighting goes into frequency and then we set up everything else as we normally would with our response. And then I've got a model, which is the response surface model here with all these effects in, and then I would click run.
	And it will fit a model, and we can use forward selection or the Lasso. Here, I've used the Lasso.
	It's not hugely important in this case.
	And what's actually happened is we've identified only the intercept as being important in this case, so we've only actually got the intercept in the model.
	But if we change the weighting, if we go back to our data table resimulate these weightings, we will likely get a different result from the model.
	We weight different rows of data, different runs in the experiment, that changes the model that's fit. So we're going to do that hundreds of times over, and what I'm going to do is actually to use the simulate function in JMP Pro.
	And what we do is we switch out the weighting column and switch in a recalculated version of the weighting column. And you can do that a few hundred times. I actually did it 250 times in this case. I'm not going to actually let that run, because that will take a minute or two.
	Once you've done that, what you'll get is a table that looks like this.
	So now I've got the parameter estimate for every one of those 250 models for each of these effects. So
	in my first model in this run that I did, this was the parameter estimate for this citric acid main effect. In the next model when we resampled the weighting,
	citric acid main effect did not enter the model, so it was zero in that case.
	And you can actually run distributions on all of these parameter estimates. And one of the things you can do is to
	customize the summary statistics to look at the proportion nonzero. So you can see the intercept here,
	the estimates that we've had of the intercept. You can see with citric acid, a lot of the time it's been estimated as being zero, so those are the models where the
	citric acid main effect was not in the model, and then a lot of the time it's been estimated as around about 3, which is what I'd simulated it to be.
	So what we look at is the proportion of times that it is non zero and we can make a combined data table out of those. And I've already done that, and just done a little bit of...
	a little bit of additional augmentation here. I've just added a column for whether it's a main effect or whatnot, and then that was how I created
	this visualization here. So what you're looking at is the proportion of times each of those effects is non zero, so the proportion of times that each of the effects is in our model over all those
	250 simulation runs we've done, where we've resimulated the fractional weighting. And that's what we use to identify
	the active effects, and that's...and it's done a remarkable job. It's been able to do what our standard methods would not be able to do. It's identified 13 active effects from a 13 run
	definitive screening design.
	Now, what would you want to do next? We maybe want to actually fit that model with all those effects and I've been able to do that. And I'm comparing the model that I've fit here
	versus the true simulated response, and you can see how closely they match up. So I've been able to build a model with all these main effects, all these quadratics and the intercept.
	So I've got a 13 parameter model here that I've been able to fit to this 13 run definitive screening design, which again is just remarkable.
	And I'm not going to talk through exactly how I got to that part. I'll hand over to Pete now. He's going to talk a bit more about this idea of self validated ensemble models.
Peter Hersh	Absolutely. Thank you, Phil. Let's see. I'm going to share my screen here and we'll just take a look at this project. So
	you can see here
	in the same
	flow as Phil, we're looking at a project here and I have
	started with that six run supersaturated DOE, and here you can see, I have three factors,
	what my my actual underlying model growth rate is and then what the growth rate...the simulated growth rate was and then like like Phil mentioned,
	I create this auto-validation column, which can be done with an add-in in JMP 15 that Mike Anderson developed. Or in JMP 16, it's built right into the software and you can access that under Analyze > Predictive Modeling > Make Validation Column.
	So just like Phil showed, he showed an excellent example of how we can find which factors are active, so a factor screening. And that is oftentimes our main goal with DOE, but if we want to take it a step further and build a model out of that,
	we'd go through this the same process, right. So we build our DOE, we get an auto-validation added to that DOE, we build our model, just like Phil showed, using generalized regression and one of the
	variable selection techniques. So Phil Ramsey and Chris Gotwalt have looked at many of these different techniques and they all seem to work fairly well. So whether you're using a Lasso or even a two stage forward selection, they all seem to have similar results and
	work fairly well. So once you set this up and launch it, you get a model, like like Phil had shown, and you know some of these models will have

Peter_Hersh · ‎03-09-2021

If you want to learn more about the technique, this is a great follow-up talk: SVEM: A Paradigm Shift in Design and Analysis of Experiments (2021-EU-45MP-779)

russ_wolfinger · ‎03-09-2021

Thank you Phil and Pete for a nice presentation!

My current interpretation of SVEM / fractional weighting / autovalidation is as a smoothed version of repeated k-fold cross validation. Flipping this around one could envision a discretization of the fractional weights into 0s or 1s as a form of repeated k-fold. The power in both kinds of validation is in the ensembling. Here is a conjecture: a repeated 5-fold ensembling approach would also work well on the small 13-run design using something like a lasso or shallow boosted tree as a base model. Would you be willing to post your JMP project here so that I and others could try a few things?
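
To make the discretization idea concrete, here is a minimal Python sketch of how 0/1 training and validation weights for repeated k-fold could be laid out in the same paired form as the fractional weights. The function name `repeated_kfold_weights` and the choice of 3 repeats are hypothetical. The sketch only builds the weights; it does not test the conjecture.

```python
import random

def repeated_kfold_weights(n_runs, k, n_repeats, rng):
    """Return (train_w, valid_w) pairs, one pair per fold per repeat.

    Each weight vector holds only 0s and 1s: a run is either fully in
    training or fully in validation, unlike the continuous fractional
    weights used in autovalidation.
    """
    splits = []
    for _ in range(n_repeats):
        order = list(range(n_runs))
        rng.shuffle(order)  # new random fold assignment for each repeat
        for fold in range(k):
            held_out = set(order[fold::k])  # every k-th run of the shuffled order
            valid_w = [1 if i in held_out else 0 for i in range(n_runs)]
            train_w = [1 - v for v in valid_w]
            splits.append((train_w, valid_w))
    return splits

# 13 runs and 5 folds as in the conjecture; 3 repeats is an arbitrary choice.
splits = repeated_kfold_weights(n_runs=13, k=5, n_repeats=3, rng=random.Random(0))
print(len(splits), "weight pairs")
print(splits[0])
```

A base model would then be fitted once per weight pair and the fits ensembled, in the same way as with the fractional weights.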

A possible point of contention is I view this method as mainly only dealing with Analysis and not so much directly DOE, although certainly knowing how one is planning to analyze an experiment can influence design choices, e.g. selecting Bayesian I-optimal over other criteria.

In any case, promising results and investigations along these lines are a lot of fun and show off the great and steadily improving features of JMP Pro.

Phil_Kay · ‎03-10-2021

Thanks, @russ_wolfinger.

It would certainly be interesting to try averaging models from repeated k-folds. As you say, the power is in the ensemble.

I will post the example 13-run data set soon so that you can explore this.

While the method is clearly addressing the analysis, we think that this could (and I would stress "could") have important implications for the design of experiments. We didn't get time to properly expand on this line of thought in the presentation so here goes...

We design experiments with the analysis in mind. As we said at the start of our presentation, there are assumptions like effect sparsity and effect heredity that we have to accept in order to build a useful model using standard analysis methods from a reasonably sized experiment. There is empirical evidence that these assumptions hold in many cases. But they don't always hold and I think scientists and engineers are uncomfortable with these ideas. When I was designing and running experiments in the lab I know that I would have preferred not to have to worry about these things.

It seems that autovalidation/SVEM might offer an alternative. Therefore we could design experiments without the same constraints of the standard assumptions. That would be a huge change when it comes to design of experiments.

I would again stress that this is all still speculative. It seems promising from my limited explorations but I would wait for published results of rigorous simulation studies.

I absolutely agree that it is a lot of fun exploring these ideas. And JMP Pro means that you don't need to be a statistician or an expert coder to try it.

ckronig · ‎03-11-2021

Hi @Phil_Kay and @Peter_Hersh

Yes, very nice presentation. I would like to see the data set too.

I am particularly interested in how you generated all the models for averaging as that wasn't covered in your presentation.

I have used this autovalidation before to identify the active parameters using simulation, and then used these parameters to build a model. But what you presented goes a step further generating lots of models and averaging them. Was that part of it done using the SVEM add-in?

Peter_Hersh · ‎03-11-2021

@ckronig I must have gone over that part quickly. The key is to make a model, then save that prediction formula. Then have JMP recalculate the weight formula (you can do this from the red triangle in the data table menu with Rerun Formulas). Rerun the model with the new weighting column, save that prediction formula, and repeat. I scripted a loop to repeat this 100 times, then took the average of those models. You don't need an add-in to do this. I am far from an expert scripter and it was fairly simple to set up. I will include my data table and script in the post.
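
For readers without the JSL script to hand, the loop can be sketched in Python. This is a toy stand-in, not the attached script. It uses one invented factor and only two candidate models (intercept-only and straight line) in place of a Lasso or forward selection in Generalized Regression, and it assumes the weights are drawn as -ln(u) and -ln(1 - u). All function names and data values are hypothetical.

```python
import math
import random

def weighted_fit(xs, ys, ws, use_slope):
    """Weighted least squares for y = b0 + b1*x (b1 fixed at 0 if not use_slope)."""
    sw = sum(ws)
    xbar = sum(w * x for w, x in zip(ws, xs)) / sw
    ybar = sum(w * y for w, y in zip(ws, ys)) / sw
    if not use_slope:
        return (ybar, 0.0)
    sxx = sum(w * (x - xbar) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - xbar) * (y - ybar) for w, x, y in zip(ws, xs, ys))
    b1 = sxy / sxx
    return (ybar - b1 * xbar, b1)

def weighted_sse(xs, ys, ws, coef):
    """Weighted sum of squared prediction errors for coefficients (b0, b1)."""
    b0, b1 = coef
    return sum(w * (y - (b0 + b1 * x)) ** 2 for w, x, y in zip(ws, xs, ys))

def ensemble_fit(xs, ys, n_models, rng):
    """Redraw weights, refit, keep the best model on validation, then average."""
    models = []
    for _ in range(n_models):
        # ASSUMED weighting scheme: paired weights from one uniform draw per run.
        us = [min(max(rng.random(), 1e-12), 1.0 - 1e-12) for _ in xs]
        w_train = [-math.log(u) for u in us]
        w_valid = [-math.log(1.0 - u) for u in us]
        # Fit each candidate with the training weights ...
        candidates = [weighted_fit(xs, ys, w_train, s) for s in (False, True)]
        # ... and keep the one with the lowest validation-weighted error.
        models.append(min(candidates, key=lambda c: weighted_sse(xs, ys, w_valid, c)))
    # For linear models, averaging coefficients equals averaging predictions.
    b0 = sum(m[0] for m in models) / n_models
    b1 = sum(m[1] for m in models) / n_models
    return (b0, b1), models

# INVENTED data: one factor at three levels with a weak linear effect.
xs = [-1, -1, 0, 0, 1, 1]
ys = [2.6, 3.5, 2.9, 3.3, 3.1, 3.8]
ensemble, models = ensemble_fit(xs, ys, n_models=100, rng=random.Random(2021))
share_with_slope = sum(1 for m in models if m[1] != 0.0) / len(models)
print(ensemble, share_with_slope)
```

Here `share_with_slope` plays the role of the proportion non-zero for the slope, and `ensemble` is the averaged model.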

Phil_Kay · ‎03-12-2021

@ckronig I have also attached the 13-run DSD example as both the original table with the simulated "true" response and as a table set up for autovalidation. The second of these contains scripts for looping through redrawing the weights and then saving the prediction formula column from the recalculated model, just as Pete described. I think you should be able to make sense of these but let us know if you don't understand anything.

There is one final step. When you average all the models to give the ensemble model you should find that there is a significant bias in predicted vs actual. It might be difficult to explain why that is here but it is due to the parameters for each "true" effect being shrunk towards zero from all the model instances in which they were zeroed (not in the model). I think this is particularly extreme in this case because all "active" effects are orthogonal in the DSD.

You can fix this simply by fitting a model in Fit Model with the ensemble model as the effect and the response as Y. You can then save this "de-biased" model prediction formula as your final model.
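
That last step is an ordinary least-squares fit of the response on the ensemble prediction. A minimal Python sketch follows, using invented numbers chosen so that the ensemble predictions are shrunk halfway towards the mean. The function name `debias` is hypothetical and this is not the JMP Fit Model output.

```python
def debias(ensemble_pred, y):
    """Simple linear regression of y on the ensemble prediction.

    Returns (intercept, slope) so that intercept + slope * ensemble_pred
    is the "de-biased" prediction.
    """
    n = len(y)
    pbar = sum(ensemble_pred) / n
    ybar = sum(y) / n
    spp = sum((p - pbar) ** 2 for p in ensemble_pred)
    spy = sum((p - pbar) * (yi - ybar) for p, yi in zip(ensemble_pred, y))
    slope = spy / spp
    intercept = ybar - slope * pbar
    return intercept, slope

# INVENTED example: the ensemble predictions have the right ordering but
# only half the spread of the response, mimicking shrunken coefficients.
y = [10.0, 12.0, 14.0, 16.0]
ensemble_pred = [11.5, 12.5, 13.5, 14.5]

intercept, slope = debias(ensemble_pred, y)
final_pred = [intercept + slope * p for p in ensemble_pred]
print(intercept, slope)   # -13.0 2.0
print(final_pred)         # [10.0, 12.0, 14.0, 16.0]
```

A fitted slope well above 1, as in this toy case, is the signature of the shrinkage described above.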

ckronig · ‎03-16-2021

Thanks @Peter_Hersh and @Phil_Kay. I have been able to get the script to work. I will try it with my own data. Thanks again!!

Phil_Kay · ‎03-17-2021

That is great, @ckronig . Let us know if we can help.

Stefan_Ivan · ‎03-28-2022

Hello @Phil_Kay and @Peter_Hersh.

I participated in Discovery Summit Europe 2022 and attended the SVEM presentation. It really caught my attention and I quickly watched the other presentations that were referenced. This one is truly the most instructive in terms of creating the actual SVEM model, especially regarding the creation of the "de-biased" model.

However I have only one question: After you identify the active effects using the dummy variable "Null factor" shouldn't you repeat the SVEM procedure using only the true effects (and no Null factor, of course :))? I'm asking this as I saw that the script used for the creation of the 200 prediction formulas contained all the RSM terms for the 7-factor DoE.

Hopefully I stated the problem in a clear manner.

Thank you

Phil_Kay · ‎03-29-2022

Hi @Stefan_Ivan,

Glad that you enjoyed the presentation.

That is a good question. If you have time, I suggest you read the paper by Lemkus, Ramsey, Gotwalt, Weese. They did proper rigorous simulation studies to try variations on SVEM including (if I remember correctly) a version where the dummy factor is used for model selection. Basically, the dummy factor model selection approach was not the best.

The original idea behind SVEM required a dummy factor. But as it has evolved, with the addition of ensemble modelling, it seems that the dummy factor is probably not useful.

One of the reasons that @Peter_Hersh and myself like SVEM is because it does not require you to classify effects into the binary of active and nonactive (or "true"/"false"). As scientists/engineers we don't think that it makes sense to say that a factor effect is inactive. SVEM means that you can include all factor effects in the model. Some are more important than others. But all are active, which just makes more sense to us.

I hope that helps.

Stefan_Ivan · ‎03-29-2022

Thank you @Phil_Kay! Your answer is exactly what I needed.

Looking forward to the next Discovery Summit which I hope will be in person as I really want to meet you all!