Level: Intermediate
Christopher Gotwalt, JMP Director of Statistical R&D, SAS
There are often constraints among the factors in experiments that are important not to violate but are difficult to describe in mathematical form. These constraints can be important for many reasons. If you are baking bread, there are combinations of time and temperature that you know will lead to inedible chunks of carbon. Another situation is when there are factor combinations that are physically impossible, like attaining high pressure at low temperature. In this presentation, we illustrate a simple workflow of creating a simulated dataset of candidate factor values. From there, we use the interactive tools in JMP's data visualization platforms in combination with AutoRecalc to identify a physically realizable set of potential factor combinations that is supplied to the new Candidate Set Design capability in JMP 16. This then identifies the optimal subset of these filtered factor settings to run in the experiment. We also illustrate the Candidate Set Designer's use on historical process data, achieving designs that maximize information content while respecting the internal correlation structure of the process variables. Our approach is simple and easy to teach, and it makes setting up experiments with constraints much more accessible to practitioners with any amount of DOE experience.
Transcript
Hello, Chris Gotwalt here. Today we're going to be constructing the history of graphic paradoxes and... oh wait, wrong topic. Actually, we're going to be talking about candidate set designs: tailoring DOE constraints to the problem.

Industrial experimentation for product and process improvement has a long history with many threads that I admit I only know a tiny sliver of. The idea of using observation for product and process innovation is as old as humanity itself. It received renewed focus during the Renaissance and Scientific Revolution. During the subsequent Industrial Revolution, science and industry began to operate more and more in lockstep. In the early 20th century, Edison's lab was industrial innovation on a factory scale, but it was done, to my knowledge, outside of modern experimental traditions. Not long after R.A. Fisher introduced concepts like blocking and randomization, his associate and later son-in-law, George Box, developed what is now probably the dominant paradigm in design of experiments, with the most popular book being Statistics for Experimenters by Box, Hunter, and Hunter.
The methods described in Box, Hunter, and Hunter are what I call the taxonomical approach to design. Suppose you have a product or process you want to improve. You think through the things you can change, the knobs you can turn, like temperature, pressure, time, the ingredients you can use, or the processing methods you can use. These things become your factors. Then you think about whether they are continuous or nominal; if they are nominal, how many levels they take, and if they are continuous, the range over which you're willing to vary them. Then you figure out the named design that most easily matches up to the problem and fits your budget. That design will have a name like a Box-Behnken design, a fractional factorial, a central composite design, or possibly something like a Taguchi array. There will be restrictions on the numbers of runs, the numbers of levels of categorical factors, and so on, so there will be some shoehorning of the problem at hand into the design that you can find. For example, factors in the Box, Hunter, and Hunter approach often need to be whittled down to two or three unique values or levels. Despite its limitations, the taxonomical approach has been fantastically successful.

Over time, of course, some people have asked if we could still do better. And by better, we mean to ask ourselves: how do we design our study to obtain the highest quality information pertinent to the goals of the improvement project? This line of questioning led ultimately to optimal design. Optimal design is an academic research area that started in parallel with the Box school in the '50s and '60s, but for various reasons it remained out of the mainstream of industrial experimentation until the custom designer in JMP.
The philosophy of the custom designer is that you describe the problem to the software, and it then returns the best design for your budgeted number of runs. You start by declaring your responses along with their goals, like minimize, maximize, or match target, and then you describe the kinds of factors you have: continuous, categorical, mixture, etc. Categorical factors can have any number of levels. You give it a model that you want to fit to the resulting data. The model assumes a least squares analysis and consists of main effects, interactions, and polynomial terms. The custom designer makes some default assumptions about the nature of your goal, such as whether you're interested in screening or prediction, which is reflected in the optimality criterion that is used. The defaults can be overridden with a red triangle menu option if you want to do something different from what the software intends.

The workflow in most applications is to set up the model, choose your budget, and click Make Design. Once that happens, JMP uses a mixed continuous and categorical optimization algorithm, solving for (number of factors) times (number of rows) terms. Then you get your design data table with everything you need except the response data. This is a great workflow when the factors can be varied independently of one another. But what if you can't? What if there are constraints? What if the values of some factors determine the possible ranges of other factors?
Well, then you can define some factor constraints or use the disallowed combinations filter. Unfortunately, while these are powerful tools for constraining experimental regions, it can still be very difficult to characterize constraints using them.

Brad Jones' DOE team, Ryan Lekivetz, Joseph Morgan, and Caleb King, have added an extraordinarily useful new feature that makes handling constraints vastly easier in JMP 16. These are called candidate or covariate runs. What you can do is, off on your own, create a table of all possible combinations of factor settings that you want the custom designer to consider. Then load them up here, and those will be the only combinations of factor settings that the designer will look at. The original table, which I call a candidate table, is like a menu of factor settings for the custom designer. This gives JMP users an incredible level of control over their designs. What I'm going to do today is go over several examples to show how you can use this to make the custom designer fulfill its potential as a tool that tailors the design to the problem at hand.
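To make the "menu of factor settings" idea concrete outside of JMP, here is a minimal sketch of selecting rows from a candidate table by greedily maximizing a D-optimality criterion. This is only an illustration of the concept, not JMP's actual algorithm (which uses a more sophisticated exchange method), and the toy candidate set and function names are my own:

```python
import numpy as np

def greedy_d_optimal(X, n_runs, allow_repeats=True, ridge=1e-6):
    """Greedily pick n_runs rows from the candidate model matrix X so that
    det(X_sel' X_sel) grows as fast as possible at each step. A sketch only;
    real optimal-design software uses row/coordinate exchange algorithms."""
    n, p = X.shape
    chosen = []
    for _ in range(n_runs):
        best_i, best_det = None, -np.inf
        for i in range(n):
            if not allow_repeats and i in chosen:
                continue
            trial = X[chosen + [i]]
            # the small ridge keeps the determinant defined before we have p rows
            d = np.linalg.slogdet(trial.T @ trial + ridge * np.eye(p))[1]
            if d > best_det:
                best_i, best_det = i, d
        chosen.append(best_i)
    return chosen

# Toy candidate set: intercept plus one continuous factor at 5 levels
levels = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
X = np.column_stack([np.ones_like(levels), levels])
picks = greedy_d_optimal(X, n_runs=4, allow_repeats=True)
# For a main-effects model, the extreme levels dominate the selection,
# and allowing repeats yields optimally chosen replicates
print(sorted(levels[picks].tolist()))  # → [-1.0, -1.0, 1.0, 1.0]
```

Note that `allow_repeats` mirrors the designer's option to choose a candidate row more than once, which is discussed below.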
Before I do that, I'm going to get off topic for a moment and point out that in the JMP Pro version of the custom designer, there's now a capability that allows you to declare limits of detection at design time. If you enter non-missing values for the limits here, the custom designer will add a column property that informs the generalized regression platform of the detection limits, and it will then automatically get the analysis correct. This leads to dramatically higher power to detect effects and much lower bias in predictions, but that's a topic for another talk.

Here are a bunch of applications that I can think of for the candidate set designer. The simplest is when the range of a continuous factor depends on the level of one or more categorical factors. Another is when we can't control the ranges of the factors completely independently, but the constraints are hard to write down. There are two methods we can use for this. One is using historical process data as a candidate set; the other is what I call filter designs, where you create a giant initial data set using random numbers or a space-filling design and then use row selections in scatterplots to pick off the points that don't satisfy the constraints. There's also the ability to really highly customize mixture problems, especially situations where you've got multilayer mixturing. That isn't something I'm going to be able to talk about today, but in the future it is something you should be able to do with the candidate set designer. You can also handle nonlinear constraints with the filtering method, the same way you can handle other kinds of constraints. It's very simple, and I'll have a quick example at the very end illustrating this.
So let's consider our first example. Suppose you want to match a target response in an investigation of two factors. One is an equipment supplier, of which there are two levels, and the other is the temperature of the device. The two suppliers have different ranges of operating temperatures. Supplier A's is the narrower of the two, going from 150 to 170 degrees Celsius, but it's controllable to a finer level of resolution, about 5 degrees. Supplier B has a wider operating range, going from 140 to 180 degrees Celsius, but is only controllable to 10 degrees Celsius. Suppose we want a 12-run design to find the optimal combination of these two factors.

We enumerate all possible combinations of the two factors, 10 rows in the table here, just creating this manually ourselves. Here are the five possible values of machine type A's temperature settings, and down here are the five possible values of type B's temperature settings. We want the best design in 12 runs, which exceeds the number of rows in the candidate table. This isn't a problem in theory, but I recommend appending a copy of the candidate set, just in case, so that the number of rows in your candidate table exceeds the number of runs you're looking for in the design.
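This candidate table is small enough to type by hand in JMP, but as a guard against typos the same enumeration can be scripted. A sketch in Python, using the ranges and step sizes from the example:

```python
# Enumerate every feasible (supplier, temperature) combination.
# Supplier A: 150-170 °C in steps of 5; Supplier B: 140-180 °C in steps of 10.
candidates = [("A", t) for t in range(150, 171, 5)] + \
             [("B", t) for t in range(140, 181, 10)]

for supplier, temp in candidates:
    print(supplier, temp)

print(len(candidates))  # → 10 rows in total, five per supplier
```

The resulting table would then be loaded into the custom designer as covariate factors, exactly as done below.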
Then we go to the custom designer, push the Select Covariate Factors button, and select the columns that we want loaded as candidate design factors. Now the candidate design is loaded and shown. Let's add the interaction effect, as well as the quadratic effect of temperature. Now we're at the final step before creating the design, and I want to explain the two options you see in the Design Generation outline node.

The first one forces in all the rows that are selected in the original table or in the listing of the candidates in the custom designer. So if you have checkpoints that are unlikely to be favored by the optimality criterion and want to force them into the design, you can use this option. It's a little like taking those same rows and creating an augmented design based on just them, except that you are controlling the possible combinations of the factors in the additional rows.

The second option, which I'm checking here on purpose, allows the candidate rows to be chosen more than once. This will give you optimally chosen replications and is probably a good idea if you're about to run a physical experiment. If, on the other hand, you are using an optimal subset of rows to try in a fancy new machine learning algorithm like SVEM, a topic of one of my other talks at the March Discovery Conference, you would not want to check this option. Basically, if you don't have all of your response values already, I would check this box; if you already have the response values, then don't.
Reset the sample size to 12 and click Make Design. The candidate design, in all its glory, will appear just like any other design made by the custom designer. As we see in the middle JMP window, JMP also selects the rows in the original table chosen by the candidate design algorithm. Note that 10, not 12, rows were selected. On the right we see the new design table; the rightmost column in the table indicates the row of origin for each run. Notice that original rows 11 and 15 were chosen twice and are replicates.

Here is a histogram view of the design. You can see that different values of temperature were chosen by the candidate set algorithm for the different machine types. Overall, this design is nicely balanced, but we don't have three levels of temperature in machine type A. Fortunately, we can select the rows we want forced into the design to ensure that we have three levels of temperature for both machine types. Just select the rows you want forced into the design in the covariate table and check the Include All Selected Covariate Rows in the Design option. If you go through all of that, you will see that now both machine types have at least three levels of temperature in the design. The first design we created is on the left, and the new design, forcing there to be three levels of machine type A's temperature settings, is over here on the right.
My second example is based on a real data set from a metallurgical manufacturing process. The company wants to control the amount of shrinkage during the sintering step. They have a lot of historical data and have applied machine learning models to predict shrinkage, so they have some idea what the key factors are. However, to actually optimize the process, you should really do a designed experiment. As Laura Castro-Schilo once told me, causality is a property not of the data but of the data-generating mechanism, and as George Box says on the inside cover of Statistics for Experimenters, to find out what happens when you change something, it is necessary to change it. Now, although we can't use the historical data to prove causality, it contains essential information about what combinations of factors are possible, and we can use that information in the design.

We first have to separate the columns in the table that represent controllable factors from the ones that are more passive sensor measurements or derived quantities that cannot be controlled directly. A glance at the scatterplot of the potential continuous factors indicates that there are implicit constraints that could be difficult to characterize as linear constraints or disallowed combinations. However, these rows represent a sample of the possible combinations, and that sample can be used with the candidate designer quite easily.
To do this, we bring up the custom designer, set up the response, and load up some covariate factors. Select the columns that we can control as DOE factors and click OK. Now we've got them loaded. Let's set up a quadratic response surface model as our base model. Then select all of the model terms except the intercept, do a Control + right-click, and convert all those terms into "if possible" effects. This, in combination with the response surface model chosen, means that we will be creating a Bayesian I-optimal candidate set design. Check the box that allows for optimally chosen replicates, enter the sample size, and it creates the design for us. If we look at the distribution of the factors, we see that it has tried hard to pursue greater balance.

On the left we have a scatterplot matrix of the continuous factors from the original data, and on the right is the 100-row design. We can see that in the sintering temperature we have some potential outliers at 1220. One would want to make sure that those are real values. In general, you're going to need to make sure that the input candidate set is clear of outliers and missing values before using it for a candidate set design. In my talk with Ron Kennett at the March 2021 Discovery conference, I briefly demo how you can use the outlier and missing value screening platforms to remove the outliers and replace the missing values so that you can use the data at a subsequent stage like this.
Now suppose we have a problem similar to the first example, where there are two machine types, but now we have temperature and pressure as factors. We know that temperature and pressure cannot vary independently and that the nature of that dependence changes between machine types. We can create an initial space-filling design and use the data filter to remove the infeasible combinations of factor settings separately for each machine type. Then we can use the candidate set designer to find the most efficient design for this situation.

So now I've created my space-filling design. It's got 1,000 runs, and I can bring up the global data filter on it and use it to shave off different combinations of temperature and pressure so that we can have separate constraints by machine type.
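When the lasso regions can be written down as inequalities, this filtering step can also be scripted rather than done interactively. Here is a sketch in Python; the feasible regions below are made up for illustration (in practice they would come from your process knowledge), and each one is simply a boolean mask applied per machine type:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
# Stand-in for a space-filling design: random temperature/pressure points
# with a machine type attached to each row (ranges invented for this sketch)
temp = rng.uniform(140, 180, n)
press = rng.uniform(1, 10, n)
machine = rng.choice(["A", "B"], n)

# Machine A has the narrower feasible region; machine B only loses the
# corner where high temperature and high pressure coincide.
feasible_a = (temp > 150) & (temp < 170) & (press < 8)
feasible_b = ~((temp > 170) & (press > 7))
keep = np.where(machine == "A", feasible_a, feasible_b)

candidate_temp = temp[keep]
candidate_press = press[keep]
candidate_machine = machine[keep]
print(f"{keep.sum()} of {n} points survive the filters")
```

The surviving rows play the same role as the lassoed table in the demo: a candidate set whose rows all satisfy machine-specific constraints.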
So I use the lasso tool to cut off a corner in machine B, and then I go back and cut off another corner. Machine B is the machine that has the wider operating region in temperature and pressure. Then we switch over to machine A, where we just use the lasso tool to shave off the points that are outside its operating region. We see that its operating region is a lot narrower than machine B's. And here's our combined design.

From there, we can load that back up into the custom designer, put an RSM model there, and set our number of runs to 32, allowing covariate rows to be repeated. It'll crank through, and once it's done, it selects all the points that were chosen by the candidate set designer. Here we can see the points that were chosen: they've been highlighted, and the original candidate points that were not selected are gray. We can bring up the new design in Fit Y by X, where we see a scatterplot in which the machine A design points are in red, in the interior of the space, and the type B runs are in blue. Machine B had the wider operating region, and that's why we see its points further out. So we have quickly achieved a design with linear constraints that change with a categorical factor, without going through the annoying process of deriving the linear constraint coefficients. We've simply used basic JMP 101 visualization and filtering tools. This idea generalizes to other nonlinear constraints and other complex situations fairly easily.
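For instance, a nonlinear feasible region is just another boolean mask over the candidate rows. This sketch keeps only the points inside a hypothetical ring-shaped region (the shape is my own invention for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
x1, x2 = rng.uniform(-1, 1, (2, 2000))

# Nonlinear constraint: keep candidates inside the annulus 0.5 < r < 0.9.
# However odd the shape, feasibility reduces to one boolean test per row.
r2 = x1**2 + x2**2
keep = (r2 > 0.25) & (r2 < 0.81)
candidates = np.column_stack([x1[keep], x2[keep]])
print(len(candidates), "candidate rows survive the nonlinear filter")
```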
So now we're going to use filtering and the Multivariate platform to set up a very unusual new type of design that I assure you you have never seen before. Go to the lasso tool and cut out a very unusual constraint. Invert the selection and delete those rows. Then, speeding this up a little bit, we go through and do the same thing for the other combinations of X1 and the other variables, carving out a very unusually shaped candidate set. We can load this up into the custom designer, same as before: bring our columns in as covariates and set up a design with all high-order interactions made "if possible", with 100 runs. And now we see our design for this very unusual constrained region, a design that is optimal given these constraints.

So I'll leave you with this image. I'm very excited to hear what you are able to do with the new candidate set designer. Hats off to the DOE team for adding this surprisingly useful and flexible new feature. Thank you.
Great tool, and definitely answers a need.
I have a question about how to use the Candidate run feature in Custom Design vs. the Augment Design option.
I understand that when we use Augment Design:
1) we need a response (I don't know how JMP uses that response, or if it is OK to put in a phony response if I don't have a real one)
2) we can easily add new runs (that is the point)
3) we can't easily add new factors (we can add phony factors to the original design, but that is not "easy").
On the other hand, with the Candidate run feature, we can easily add factors, but not runs.
Say I am building a new design. Problem 1: I want to include certain treatments because these are "reference runs" for me, either from an existing process or from previous experimentation. I might have a response for these runs, but I might not. Should I use Candidate runs, Augment, or a combination?
Problem 2: I want to specify which treatments to replicate (again, for a new design). How do I do this?
thanks!
Problem 1: I want to include certain treatments because these are "reference runs" for me, either from an existing process or from previous experimentation. I might have a response for these runs, but I might not. Should I use Candidate runs, Augment, or a combination?
It depends on what you want to do - Augment design is for when you have already collected some response values and want the most informative *new* combinations of factor settings to run. So, you only use Augment when you are performing a sequence of designs and are including all the existing rows in the next stage of the analysis.
Candidate Designs are a more general tool that (in principle) contains Augmentation. If you check the "Include all selected covariate rows in the design" option, then those rows are forced into the design, and this operates the same as Augmentation. The difference is that the set of potential combinations of factor settings is constrained to be those in the candidate table, whereas Augment Design will apply a coordinate exchange-type optimization to determine the new combinations of factor settings.
The Candidate Set designs have a wider set of application, and I see it as initially being a great way to deal with difficult-to-describe constraints, but there are other uses, as well. For example, you could use it like we do in SVEM to identify a good training set for a machine learning exercise.
Problem 2: I want to specify which treatments to replicate (again, for a new design). How do I do this?
You can force the replications in using the "Include all selected" mechanism, or you can check the "Allow covariate rows to be repeated" option. When you do that, the Candidate Set designer algorithm will sample from the candidate table with replacement, so it will find the optimal set of treatments to replicate, in whichever optimal design sense you prefer. There are certainly good reasons to use each of these options.