Re-Thinking the Design and Analysis of Experiments? (2021-EU-30MP-776)

Phil Kay, JMP Senior Systems Engineer, SAS
Peter Hersh, JMP Technical Enablement Engineer, SAS

Scientists and engineers design experiments to understand their processes and systems: to find the critical factors and to optimize and control with useful models. Statistical design of experiments ensures that you can do more with your experimental effort and solve problems faster. But practical implementation of these methods generally requires acceptance of certain assumptions (including effect sparsity, effect heredity, and active or inactive effects) that don’t always sit comfortably with knowledge and experience of the domain. Can machine learning approaches overcome some of these challenges and provide richer understanding from experiments without extra cost? Could this be a revolution in the design and analysis of experiments?

This talk will explore the potential benefits of using the autovalidation technique developed by Chris Gotwalt and Phil Ramsey to analyze industrial experiments. Case study examples will be presented to compare traditional tools with autovalidation techniques in screening for important effects and in building models for prediction and optimization.

We were not able to answer all questions in the Q&A so here goes...

I see the weights are not between 0 and 1. How are they generated?

They are "fractionally weighted". This is the key trick that enables the method. In Generalized Regression (Gen Reg) in JMP Pro we use this weighting in the Freq role. You can use this add-in for JMP Pro 15 to generate the duplicate table with weighting. For the deep statistical details of the weighting, we recommend a talk by Chris Gotwalt and Phil Ramsey.

When you duplicate rows, they are duplicated with the same output result? What is the purpose of this duplication?

Yes, they are duplicated with the same response. It would be a bad idea just to do that and proceed with the analysis. The key innovation here is that the rows are fractionally weighted such that each run has a low weighting in one portion (training or validation) and a higher weighting in the other. For example, a run with low weighting in "training" will have a duplicate that has a high weighting in "validation". Some duplicate pairs will have more equal weighting in training and validation.
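
A minimal Python sketch of the duplicated, fractionally weighted table described in this answer. The weighting formula used here (training weight -ln(u), validation weight -ln(1 - u), with u drawn uniformly from (0, 1)) is an assumption for illustration only; this page does not state the formula, and the column names "Validation" and "Weight" are hypothetical. The sketch does show why the weights need not lie between 0 and 1, and why a low training weight is paired with a high validation weight.

```python
import math
import random

def weight_pair(u):
    """Return (training weight, validation weight) for one run.

    Assumed scheme: anti-correlated exponential weights from a single
    uniform draw u in (0, 1). Not confirmed by the source.
    """
    return -math.log(u), -math.log(1.0 - u)

def autovalidation_table(runs, rng):
    """Duplicate every run: one Training copy and one Validation copy,
    with identical factors and response but different fractional weights."""
    table = []
    for run in runs:
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep log() finite
        w_train, w_valid = weight_pair(u)
        table.append({**run, "Validation": "Training", "Weight": w_train})
        table.append({**run, "Validation": "Validation", "Weight": w_valid})
    return table

# Hypothetical three-run experiment with one factor x and response y.
runs = [{"x": -1, "y": 4.1}, {"x": 0, "y": 5.0}, {"x": 1, "y": 6.2}]
for row in autovalidation_table(runs, random.Random(1)):
    print(row)
```

Calling `autovalidation_table` again with a fresh random draw gives a new set of weights for the same duplicated rows, which is the "redraw" step in the procedure.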

We fit a model using these weightings to determine how much of each run goes into training and how much goes into validation. Then we redraw the weights and fit another model. We repeat this process hundreds of times. We then use the results across all these models, either to screen the important effects from their proportion non-zero, or to build a useful model by averaging all the individual models.
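
The model-averaging step can be sketched as follows. This is a minimal Python illustration, assuming each refit yields a dictionary of coefficients in which effects that did not enter the model are zero or missing. The effect names and numbers are made up for illustration and are not results from the talk.

```python
def average_models(fits):
    """Average coefficients over all refits; a missing effect counts as 0."""
    effects = sorted({name for fit in fits for name in fit})
    n = len(fits)
    return {e: sum(fit.get(e, 0.0) for fit in fits) / n for e in effects}

def predict(coefs, x):
    """Predict from a one-factor quadratic model at coded factor level x."""
    return (coefs.get("Intercept", 0.0)
            + coefs.get("x", 0.0) * x
            + coefs.get("x*x", 0.0) * x * x)

# Hypothetical coefficients from four refits with different weights.
fits = [
    {"Intercept": 10.0, "x": 3.0, "x*x": 0.0},
    {"Intercept": 10.4, "x": 0.0, "x*x": -2.0},
    {"Intercept": 9.6, "x": 3.2},
    {"Intercept": 10.0, "x": 2.8, "x*x": -2.0},
]
ensemble = average_models(fits)
print(ensemble)
print(predict(ensemble, 1.0))
```

Because zeros are included in the average, an effect that enters only a few of the individual models is shrunk toward zero in the ensemble rather than being dropped outright.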

Is it possible to run the add-in in the normal, non-pro, JMP 15 version?

Quite possibly. You can run the add-in but you will need JMP Pro to utilize the validation column in the analysis - this is critical.

How does it work with categorical variables?

We presented examples with only continuous variables. A nice feature is that it would work for any model and with any variable types.

Is this a recognized method now?

In a word, no. Not yet. Our motivation is to make people aware of this method, and to encourage those with an interest to explore it and provide critical feedback. As we made clear in the presentation, we do not recommend that you use this as your only analysis of an experiment. If you like the idea, try it and see how it compares with other methods.

Having said that, the ideas of model averaging and holdback validation are recognised in larger data situations. It seems that this should be beneficial in the world of smaller data that is designed experiments.

Do you do the duplicate step manually?

No, this is done as part of the autovalidation setup by the add-in or by the platform in JMP Pro 16. You could easily create the duplicate rows yourself. The harder part is setting up the fractional weighting, without which the analysis will not work.

Bayesian methods also use a lot of simulation (MCMC). Would it also be applicable there, integrating prior and posterior distributions?

Quite possibly. The general idea of fractionally weighted validation with bootstrap model averaging might well be applicable in lots of other areas.

Auto-generated transcript...

Speaker	Transcript
Phil Kay	Okay, so we are going to talk about rethinking the design and analysis of experiments. I'm Phil Kay. I'm the learning manager in the global technical enablement team and I'm joined by Pete Hersh.
Peter Hersh	Yeah I'm a member of the global technical enablement team as well, and excited to talk about DOE and some of the new techniques we're...we're exploring today.
Phil Kay	So the first thing to say, is we're big fans of DOE. It's awesome. It's had huge value in science and engineering for a very long time.
	And having said that, there are some assumptions that we have to be okay with in order to use DOE a lot of time.
	And they don't always feel hugely comfortable, as a scientist or an engineer. So things like effect sparsity, so the idea that not everything you're looking at turned out to be important
	and actually only a few of the things that you're actually experimenting on turn out to be important. Or effect heredity is another assumption.
	So that means that we only expect to see complex behaviors or higher order effects for factors that are active in their simpler forms.
	And, and just this idea of active and inactive effects, so commonly, the sequential process of design of experiments is to screen out the active effects from the inactive effects.
	And that just feels sometimes like too much of a binary decision. It seems a bit crazy, I think, a lot of the time, this idea that some of the things we are experimenting on are completely inactive,
	when really, I think we, we know that really everything is going to have some effect. It might be less important but
	it's still going to have some effect. And these are largely assumptions that we use in order to
	get around some of the challenges of designing experiments when you can't afford to do huge numbers of runs and yeah. Pete, I don't know if you want to comment on that from your perspective as well.
Peter Hersh	Yeah I completely agree. I think that thinking of something like temperature being inactive is is...I think hard to imagine that temperature has no effect on an experiment.
Phil Kay	Yeah it's kind of absurd, isn't it? So,
	yeah, if that's in your experiment, then it's it's always going to be active in some way, but maybe not as important as other things.
	So I'm just going to skip right to the results of
	of what we've looked at. So we've been looking at this auto-validation technique that Chris Gotwalt and Phil Ramsey essentially invented
	and using that in the analysis of designed experiments, and it's really provided results that are just crazy. We just didn't think they were possible.
	So, first of all I looked at a system with 13 active effects and analyzing a 13 run definitive screening design and from that I was able to identify that 13 effects were active, which is... that's what we call a saturated situation. We wouldn't...
	commonly, we've talked about definitive screening designs as being effective in identifying the active effects when we have effect sparsity, when there's only a small number of effects that are actually important or active. But in this case I managed to identify all of the 13 active effects.
	And not only that, I was actually able to build a model with all those 13 active effects from this 13 run definitive screening design.
	So, again that's kind of incredible; we don't expect to be able to have a model with as many active effects as we have rows of data that we're building it from.
	And Pete, you looked at some other things and got some other really interesting results.
Peter Hersh	Yeah, absolutely, and and Phil's results are very, very impressive. I think
	what the next step that we tried was making a super saturated design, which is more active effects than runs and we tried this with
	very small DOEs. So a six run DOE with seven active effects, which if we did in standard DOE techniques, there'd be no way to analyze
	that properly. And we looked at comparing that to eight and 10 run DOEs and how much that bought us. So we got fairly useful models, even from a six run DOE, which was
	better than I expected.
Phil Kay	Yeah, it's better than you've got any right to expect really, isn't it?
	And so we've got these really impressive results and the ability to identify a huge number of active effects from a small
	definitive screening design and actually build that model with all those active effects. And in Pete's case, he has been able to build a model with seven active effects from really small
	designed experiments.
	So, how did we we do this? How does the auto-validation methodology work? Well, it's taking ideas from machine learning, and one of the really useful tools from machine learning is validation. So holdout validation is a really nice way of ensuring that you, you build
	the most useful model. So it's a model that's robust. So we hold out a part of the data, we use that to test different models that we build, and
	basically, the model that makes the best prediction of this data that we've held out is the model that we go with, and that's just really tried and tested.
	It's actually pretty much the gold standard for building the most useful models, but with DOE that's a bigger challenge, isn't it, Pete? It doesn't really obviously lend itself to that.
Peter Hersh	Yeah, yeah, the the whole idea behind DOE is exploring the design space as efficiently as possible. And if we start holding out runs or holding out
	analysis of runs, then we're going to miss part of that design space and
	we really can't do that with a lot of these DOE techniques like definitive screening designs.
Phil Kay	Right, right, so it'd be nice if there was some trick that we could get the benefits of this holdout validation and not suffer from holding out critical data. So that brings us to this auto-validation
	idea, and, Pete, do you want to describe a bit about how this works?
Peter Hersh	Absolutely, so this was a real clever idea developed by Chris and Phil Ramsey, and they they essentially take our original data from a DOE and they resample it, so you
	repeat the results. So if you notice that the table here at the bottom of the slide, the the runs in gray are the same results of runs in white. They're just repeated.
	And the way they get away with this is by making this weighting column that is paired together. So basically, if one has a high weight, the
	the repeated run of that has a low rate...weight and so on and so forth. And this is...this is...enables us to use the data with this validation and the weighting and we'll go into a little bit more detail about how that's done.
Phil Kay	Yeah, you'll kind of see how it happens when we go through the demos. So we've basically got two case studies of simulated examples that we use to illustrate this methodology. So this first case study I'm going to talk through,
	I emphasize it's a simulated example. And in some ways, it's kind of an unrealistic example, but I think it does a really nice job of demonstrating the power of this methodology.
	We've got six factors and to make it seem a bit more real, we've chosen some some real factors from a case study here where they were trying to grow some biological organism and optimize the nutrients that they feed into the system to optimize the growth rate.
	So we've got these six different nutrients and those are our factors. We can add in different amounts of those, so I designed a 13 run definitive screening design to explore those factors with this growth rate response.
	And the response data was completely simulated by me, and it was simulated such that there were 13 strongly active effects. So
	I simulated it so that all of the main effects, all six of the main effects, are active.
	And then, for each of those factors, the quadratic effects are active as well, so we've got six quadratic effects.
	Plus we've got an intercept that we always need to estimate, so there are 13 effects in total that are active, that are important in understanding the system.
	1 signal to noise, but that's still a real challenge with standard methodology in order to model that, and we'll come to that in the demo.
	So, really, the question is, can we identify all those important effects and, and if we can, then can we build a model with all those important effects as well? Because as I said, that would be really quite remarkable
	versus what we can do with standard methodology.
	And then case study #2, Pete?
Peter Hersh	Yeah, absolutely. Very, very similar to Phil's case study. Same idea: we're feeding different nutrients at different levels to an organism and checking its growth rate. In this case I simplified what Phil had done and broke it down to just three nutrient factors. And this is
	building a different type of design, so an I-optimal supersaturated design where we're looking at a full response surface
	in a supersaturated manner and we looked at six, eight and 10 run
	designs. And so same same idea.
	The the
	effects were very, very
	high signal to noise ratio, so we really wanted to be able to pick out those effects if they were active. And just like Phil's, I kept the main effects and the quadratics active, as well as the intercept, and we're trying to pick those out.
	And same idea, so how many runs would we need to see these active effects and how accurate of a model can we make from these very small designs?
Phil Kay	Yeah because you know, like I said, you've really got no right to expect to be able to build a good model from such a small design.
Peter Hersh	Yeah, exactly. Okay.
Phil Kay	So I'll go into a demo now of case study #1.
	And I'm presenting this through a JMP project, so that's a really nice way to present your results. I'd recommend trying this out.
	And that's our
	design, so this is our 13 run definitive screening design, where we vary these nutrient factors, and we have the simulated growth rate response. As I said, that's been simulated such that
	the main effects, the quadratics of all of these factors are strongly active, plus we've got to estimate this intercept.
	Now, with a definitive screening design I generally recommend you use Fit Definitive Screening as a way of looking at the results, as one of the analyses that you can do.
	It works really well when we have this effect sparsity principle being true. So as long as only a few of the effects are strongly active...are active and the rest of them are unimportant,
	then it will find those...the few important effects and separate them from the unimportant ones.
	But in this case I wasn't expecting it to work well and it doesn't work well. It does not identify that all six factors are active. In fact it only identifies one of the factors as being active here.
	So that's not a big surprise, this is too difficult, too challenging a situation for this type of analysis.
	If somehow we knew that all of these active effects are active and we try and fit a model with all six main effects, all six quadratic and the intercept,
	then that's a saturated model. We've got as many parameters to estimate as we have rows of data, so we can just about fit that model, but we don't get any statistics.
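
The parameter counting behind "saturated" and "supersaturated" can be sketched in a few lines of Python. This is only an illustration of the arithmetic in the two case studies; the function names are hypothetical.

```python
def rsm_term_count(k):
    """Terms in a full quadratic response surface model for k factors:
    intercept + k main effects + k quadratics + k*(k-1)/2 interactions."""
    return 1 + k + k + k * (k - 1) // 2

def classify(n_runs, n_params):
    """Compare parameters to runs; residual df = n_runs - n_params."""
    if n_params < n_runs:
        return "estimable with residual degrees of freedom"
    if n_params == n_runs:
        return "saturated"
    return "supersaturated"

# Case study 1: 6 main effects + 6 quadratics + intercept, 13 runs.
print(classify(13, 6 + 6 + 1))   # saturated
# Case study 2: 3 main effects + 3 quadratics + intercept, 6 runs.
print(classify(6, 3 + 3 + 1))    # supersaturated
# Candidate terms in the full response surface model for six factors.
print(rsm_term_count(6))         # 28
```

With zero residual degrees of freedom there is nothing left to estimate error from, which is why the saturated fit produces estimates but no test statistics.
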
	And in any case, you know, aside from the fact of I've simulated this data, in a real life situation, we wouldn't know which ones are active, so we wouldn't even know which model to fit.
	Now, using the auto-validation method, I was able to actually very convincingly identify the active effects, and I'll talk through how we did this.
	And this is just a visualization of my results here. You don't necessarily need to visualize it in this way. This is for presentation purposes.
	I was able to identify that first of all, the intercept was active. I've got all my six main effects,
	and my quadratic effects, and then my two factor interactions, which I simulated to have zero effect. You can see they are well down
	versus the other ones. And there's actually a null factor here that we use that...so so a dummy factor. So anything less than the null factor we can declare as being unimportant or or inactive, if you like.
	And what we're actually...the metric we're looking at here is something called proportion nonzero and I'll explain what that means, as we go through this. That's kind of the metric we're using here to identify the strength of an effect, of the importance of an effect.
	So a bit about how I went through this. So I took my original 13 run definitive screening design and then I set it up for auto-validation so we've now got 26 rows we've duplicated.
	And there's an add-in for doing this. One of our colleagues, Mike Anderson, created an add-in that you can use to do this in JMP 15.
	In JMP 16 they're actually adding the capability in the predictive modeling tools in the validation column platform.
	And what that does, we get this duplicate set of our data, and then we get this weighting and as Pete said, we have...each row is in the training set and in the validation set.
	In the training set, if it has a low weighting, it'll have a high weighting in the validation. So if it has a high weighting in the training set, it'll have a low weighting in the validation set.
	And what we do actually is, we read...these have basically been randomly assigned.
	We reassign those and we were able to kind of iterate over this hundreds of times, fitting the models each time and then looking at the aggregate...aggregated results over many simulation runs. So what you would do
	is to fit the model and I'm using GenReg here in JMP Pro.
	And you'll need JMP Pro anyway, because you need to be able to specify this validation role, so we put...the train validation column goes into validation.
	And the weighting goes into frequency and then we set up everything else as we normally would with our response. And then I've got a model, which is the response surface model here with all these effects in, and then I would click run.
	And it will fit a model, and we can use forward selection or the Lasso. Here, I've used the Lasso.
	It's not hugely important in this case.
	And what's actually happened is we've identified only the intercept as being important in this case, so we've only actually got the intercept in the model.
	But if we change the weighting, if we go back to our data table resimulate these weightings, we will likely get a different result from the model.
	We weight different rows of data, different runs in the experiment, that changes the model that's fit. So we're going to do that hundreds of times over, and what I'm going to do is actually to use the simulate function in JMP Pro.
	And what we do is we switch out the weighting column and switch in a recalculated version of the weighting column. And you can do that a few hundred times. I actually did it 250 times in this case. I'm not going to actually let that run, because that will take a minute or two.
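
The refit loop can be sketched in Python as below. This is a deliberately tiny stand-in, not the Generalized Regression fit used in the demo: the only model choice is whether a single slope term enters (instead of the Lasso or forward selection over a response surface model), and the weighting formula (-ln(u) for training, -ln(1 - u) for validation) is an assumption. All function names are hypothetical.

```python
import math
import random

def weighted_fit(xs, ys, ws, use_slope):
    """Weighted least squares for y = b0 (+ b1*x if use_slope)."""
    sw = sum(ws)
    xbar = sum(w * x for w, x in zip(ws, xs)) / sw
    ybar = sum(w * y for w, y in zip(ws, ys)) / sw
    if not use_slope:
        return ybar, 0.0
    sxx = sum(w * (x - xbar) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - xbar) * (y - ybar) for w, x, y in zip(ws, xs, ys))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

def weighted_sse(xs, ys, ws, b0, b1):
    """Weighted sum of squared prediction errors."""
    return sum(w * (y - b0 - b1 * x) ** 2 for w, x, y in zip(ws, xs, ys))

def refit_once(xs, ys, rng):
    """Redraw the paired weights, fit both candidates with the training
    weights, and keep the one with the lower validation-weighted error."""
    us = [min(max(rng.random(), 1e-12), 1 - 1e-12) for _ in xs]
    w_train = [-math.log(u) for u in us]      # assumed weighting scheme
    w_valid = [-math.log(1 - u) for u in us]  # anti-correlated partner
    best = None
    for use_slope in (False, True):
        b0, b1 = weighted_fit(xs, ys, w_train, use_slope)
        err = weighted_sse(xs, ys, w_valid, b0, b1)
        # The slope enters only if it clearly improves validation error.
        if best is None or err < best[0] - 1e-9:
            best = (err, b0, b1)
    return {"Intercept": best[1], "x": best[2]}

def simulate(xs, ys, n_refits=250, seed=0):
    """Repeat the redraw-and-refit step, one coefficient row per refit."""
    rng = random.Random(seed)
    return [refit_once(xs, ys, rng) for _ in range(n_refits)]

xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
ys = [-0.8, 0.4, 2.1, 3.3, 5.2]  # hypothetical responses, roughly 2 + 3x
fits = simulate(xs, ys)
share = sum(f["x"] != 0.0 for f in fits) / len(fits)
print(f"slope entered the model in {share:.0%} of refits")
```

Each element of `fits` plays the role of one row in the table of parameter estimates: a coefficient of exactly zero means the term did not enter the model on that refit.
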
	Once you've done that, what you'll get is a table that looks like this.
	So now I've got the parameter estimate for every one of those 250 models for each of these effects. So
	in my first model in this run that I did, this was the parameter estimate for this citric acid main effect. In the next model when we resampled the weighting,
	citric acid main effect did not enter the model, so it was zero in that case.
	And you can actually run distributions on all of these parameter estimates. And one of the things you can do is to
	customize the statistics, the summary statistics, to look at the proportion nonzero. So you can see the intercept here,
	the estimates that we've had of the intercept. You can see with citric acid, a lot of the time it's been estimated as being zero, so those are the models where the
	citric acid main effect was not in the model, and then a lot of the time it's been estimated as around about 3, which is what I'd simulated it to be.
	So what we look at is the proportion of times that it is non zero and we can make a combined data table out of those. And I've already done that, and just done a little bit of...
	a little bit of additional augmentation here. I've just added a column for whether it's a main effect or whatnot, and then that was how I created
	this visualization here. So what you're looking at is the proportion of times each of those effects is non zero, so the proportion of times that each of the effects is in our model over all those
	250 simulation runs we've done, where we've resimulated the fractional weighting. And that's what we use to identify
	the active effects, and that's...and it's done a remarkable job. It's been able to do what our standard methods would not be able to do. It's identified 13 active effects from a 13 run
	definitive screening design.
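
The proportion nonzero metric and the comparison against the null factor can be sketched in Python as follows. The coefficient values and the interaction name are made up for illustration; treating an effect that only ties with the null factor as inactive is an assumption of this sketch.

```python
def proportion_nonzero(fits, effects):
    """Fraction of refits in which each effect has a nonzero estimate."""
    n = len(fits)
    return {e: sum(1 for fit in fits if fit.get(e, 0.0) != 0.0) / n
            for e in effects}

def screen(props, null_name="Null factor"):
    """Keep effects whose proportion nonzero exceeds the null factor's."""
    threshold = props[null_name]
    return [e for e, p in props.items() if e != null_name and p > threshold]

# Hypothetical parameter estimates from four refits.
fits = [
    {"Citric acid": 3.1, "Interaction": 0.0, "Null factor": 0.0},
    {"Citric acid": 0.0, "Interaction": 0.0, "Null factor": 0.4},
    {"Citric acid": 2.9, "Interaction": 0.2, "Null factor": 0.0},
    {"Citric acid": 3.0, "Interaction": 0.0, "Null factor": 0.0},
]
props = proportion_nonzero(fits, ["Citric acid", "Interaction", "Null factor"])
print(props)          # {'Citric acid': 0.75, 'Interaction': 0.25, 'Null factor': 0.25}
print(screen(props))  # ['Citric acid']
```
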
	Now, what would you want to do next? We maybe want to actually fit that model with all those effects and I've been able to do that. And I'm comparing the model that I've fit here
	versus the true simulated response, and you can see how closely they match up. So I've been able to build a model with all these main effects, all these quadratics and the intercept.
	So I've got a 13 parameter model here that I've been able to fit to this 13 run definitive screening design, which again is just remarkable.
	And I'm not going to talk through exactly how I got to that part. I'll hand over to Pete now. He's going to talk a bit more about this idea of self validated ensemble models.
Peter Hersh	Absolutely. Thank you, Phil. Let's see. I'm going to share my screen here and we'll just take a look at this project. So
	you can see here
	in the same
	flow as Phil, we're looking at a project here and I have
	started with that six-run supersaturated DOE, and here you can see, I have three factors,
	what my my actual underlying model growth rate is and then what the growth rate...the simulated growth rate was and then like like Phil mentioned,
	I create this auto-validation column, which can be done with an add-in in JMP 15 that that Mike Anderson developed. Or in JMP 16, it's it's built right into the software and you can access that under the analyze predictive modeling platform make validation column.
	So just like Phil showed, he showed an excellent example of how we can find which factors are active, so a factor screening. And that is oftentimes our main goal with DOE, but if we want to take it a step further and build a model out of that,
	we'd go through this the same process, right. So we build our DOE, we get an auto-validation added to that DOE, we build our model, just like Phil showed, using generalized regression and one of the
	variable selection techniques. So Phil Ramsey and Chris Gotwalt have looked at many of these different techniques and they all seem to work fairly well. So whether you're using a Lasso or even a two stage forward selection, they all seem to have similar results and
	work fairly well. So once you set this up and launch it, you get a model, like like Phil had shown, and you know some of these models will have

Presented At Discovery Summit Europe 2021

Presenter

Phil Kay
