Candidate Set Designs: Tailoring DOE Constraints to the Problem (2021-EU-30MP-784)

Level: Intermediate

 

Christopher Gotwalt, JMP Director of Statistical R&D, SAS

 

There are often constraints among the factors in experiments that are important not to violate but are difficult to describe in mathematical form. These constraints can be important for many reasons. If you are baking bread, there are combinations of time and temperature that you know will lead to inedible chunks of carbon. Another situation is when there are factor combinations that are physically impossible, like attaining high pressure at low temperature. In this presentation, we illustrate a simple workflow of creating a simulated dataset of candidate factor values. From there, we use the interactive tools in JMP's data visualization platforms in combination with AutoRecalc to identify a physically realizable set of potential factor combinations that is supplied to the new Candidate Set Design capability in JMP 16. This then identifies the optimal subset of these filtered factor settings to run in the experiment. We also illustrate the Candidate Set Designer's use on historical process data, achieving designs that maximize information content while respecting the internal correlation structure of the process variables. Our approach is simple and easy to teach. It makes setting up experiments with constraints much more accessible to practitioners with any amount of DOE experience.

 

 

 

Auto-generated transcript...

 

Transcript

Hello, Chris Gotwalt here.
Today, we're going to be
constructing the history of
graphic paradoxes and oh wait,
wrong topic. Actually we're going
to be talking about candidate
set designs, tailoring DOE
constraints to the problem.
So industrial experimentation
for product and process
improvement has a long history
with many threads that I admit I
only know a tiny sliver
of. The idea of using observation
for product and process
innovation is as old as humanity
itself. It received renewed
focus during the Renaissance and
Scientific Revolution. During the
subsequent Industrial
Revolution, science and industry
began to operate more and more
in lockstep. In the early 20th
century, Edison's lab was an
industrial innovation on a
factory scale, but it was done
to my knowledge, outside of
modern experimental traditions.
Not long after R.A. Fisher
introduced concepts like
blocking and randomization, his
associate and later son-in-law,
George Box, developed what is
now probably the dominant
paradigm in design of
experiments, with the most
popular book being Statistics
for Experimenters by Box,
Hunter and Hunter.
The methods described in Box, Hunter
and Hunter are what I call the
taxonomical approach to design.
So suppose you have a product
or process you want to improve.
You think through the things you
can change, the knobs you can turn,
like temperature, pressure,
time, ingredients you can use, or
processing methods that you can
use. These things become
your factors. Then you think
about whether they are
continuous or nominal, and if
they are nominal, how many
levels they take, or, if a
factor is continuous, the range
you're willing to vary it over.
Then you figure out the name of the
design that most easily matches
up to the problem and
that fits your budget.
That design
will have a name like a
Box-Behnken design, a fractional
factorial, or a central
composite design, or possibly
something like a Taguchi array. There will
be restrictions on the numbers
of runs, the numbers
of levels of categorical
factors, and so on, so there
will be some shoehorning the
problem at hand into the design
that you can find. For example,
factors in the BHH
approach, Box Hunter and Hunter
approach, often need to be
whittled down to two or three
unique values or levels.
Despite its limitations, the
taxonomical approach has been
fantastically successful.
Over time, of course, some
people have asked if we could
still do better.
And by better we mean to ask
ourselves, how do we design our
study to obtain the highest
quality information pertinent to
the goals of the improvement
project? This line of
questioning led ultimately to
optimal design. Optimal design is
an academic research area. It was
started in parallel with the Box
school in the '50s and '60s, but
for various reasons remained out
of the mainstream of industrial
experimentation until the
custom designer in JMP.
The philosophy of the custom
designer is that you describe
the problem to the software. It
then returns you the best design
for your budgeted number of
runs. You start out by declaring
your responses along with their
goals, like minimize, maximize,
or match target, and then you
describe the kinds of factors
you have: continuous, categorical,
mixture, etc. Categorical
factors can have any number of
levels. You give it a model that
you want to fit to the resulting
data. The model assumes a least
squares analysis and consists of
main effects, interactions, and
polynomial terms. The custom
designer makes some default
assumptions about the nature
of your goal, such as whether
you're interested in screening
or prediction, which is
reflected in the optimality
criterion that is used. The
defaults can be overridden
with a red triangle menu
option if you are wanting to
do something different from
what the software intends.
The workflow in most
applications is to set up
the model.
Then you choose your budget,
click make design. Once that
happens, JMP uses a mixed,
continuous and categorical
optimization algorithm, solving
for the number of factors times
the number of rows terms.
Then you get your design data
table with everything you need
except the response data. This
is a great workflow as long as the
factors are able to be varied
independently of one another.
What if you can't? What if
there are constraints? What
if the value of some factors
determine the possible ranges
of other factors?
Well, then
you can define some factor
constraints or use the
disallowed combinations
filter.
Unfortunately, while these
are powerful tools for
constraining experimental
regions, it can still be very
difficult to characterize
constraints using these.
Brad Jones' DOE team, Ryan Lekivetz,
Joseph Morgan and Caleb
King have added an
extraordinarily useful new
feature that makes handling
constraints vastly easier in
JMP 16. These are called
candidate or covariate runs.
What you can do is, off on your
own, create a table of all
possible combinations of
factor settings that you want
the custom designer to
consider. Then load them up
here and those will be the
only combinations of factor
settings that the designer
will look at. The original
table, which I call a
candidate table, is like a
menu of factor settings for
the custom designer.
This gives JMP users an
incredible level of control over
their designs. What I'm going to
do today is go over several
examples to show how you can use
this to make the custom
designer fulfill its potential
as a tool that tailors the
design to the problem at hand.
Before I do that, I'm going to
get off topic for a moment and
point out that in the JMP Pro
version of the custom designer,
there's now a capability that
allows you to declare limits of
detection at design time.
If you enter non-missing values
for the limits here, the custom
designer will add a column
property that informs the
generalized regression platform
of the detection limits and it
will then automatically get the
analysis correct. This leads to
dramatically higher power to
detect effects and much lower
bias in predictions, but that's
a topic for another talk.
Here are a bunch of applications
that I can think of for the
candidate set designer. The
simplest is when ranges of a
continuous factor depend on the
level of one or more categorical
factors. Another example is when
we can't control the range of
factors completely
independently, but the
constraints are hard to write
down. There are two methods we
can use for this. One is using
historical process data as a
candidate set, and then the
other one is what I call filter
designs, where you design
a giant initial data set using
random numbers or a space
filling design and then use row
selections in scatter plots
to pick off the points that
don't satisfy the constraints.
There's also the ability to
really highly customize mixture
problems, especially situations
where you've got multilayer
mixturing. This isn't
something that I'm going to be
able to talk about today, but in
the future this is something
that you should be looking to be
able to do with this candidate
set designer. You can also do
nonlinear constraints with the
filtering method, the same
ways you can do other kinds of
constraints. It's very
simple and I'll have a quick
example at the very end
illustrating this.
So let's consider our first
example. Suppose you want to
match a target response in an
investigation of two factors.
One is an equipment
supplier, of which there are two
levels, and the other one is the
temperature of the device. The
two different suppliers have
different ranges of operating
temperatures. Supplier A's range is
the narrower of the two, going from
150 to 170 degrees Celsius.
But it's controllable to a
finer level of resolution of
about 5 degrees. Supplier B
has a wider operating range
going from 140 to 180 degrees
Celsius, but is only
controllable to 10 degrees
Celsius. Suppose we want to do
a 12 run design to find the
optimal combination of these
two factors.
We enumerate all possible
combinations of the two
factors in 10 runs in the
table here, just creating
this manually ourselves.
So here's the five possible
values of machine type A's
temperature settings. And then
down here are the five possible
values of Type B's temperature
settings. We want the best
design in 12 runs, which exceeds
the number of rows in the
candidate table. This isn't a
problem in theory, but I
recommend creating a copy of the
candidate set just in case so
that the number of runs that
your candidate table has exceeds
the number that you're looking
for in the design.
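
As a rough illustration, the candidate table for this example could be enumerated in a few lines of Python instead of being typed in by hand. This is only a sketch: the names `temperature_grid`, `candidates` and `candidate_table` are made up for this example, and stacking a second copy of the rows is one reading of the "create a copy" recommendation above.

```python
# Candidate table for the two-supplier example.
# Supplier A: 150-170 C in 5-degree steps; supplier B: 140-180 C in 10-degree steps.

def temperature_grid(low, high, step):
    """Return every settable temperature from low to high, inclusive."""
    return list(range(low, high + 1, step))

# The 10 allowed (supplier, temperature) combinations.
candidates = (
    [("A", t) for t in temperature_grid(150, 170, 5)]
    + [("B", t) for t in temperature_grid(140, 180, 10)]
)

# Stack a second copy (an assumption about what "a copy" means here) so the
# table has more rows than the 12-run design that will be requested.
candidate_table = candidates + candidates

for row_number, (supplier, temp) in enumerate(candidate_table, start=1):
    print(row_number, supplier, temp)
```

The printed table could then be saved and opened in JMP as the covariate table.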
Then we go to the custom
designer.
Push the Select Covariate
Factors button.
Select the columns that we want
loaded as candidate design
factors. Now the candidate
design is loaded and shown.
Let's add the interaction effect,
as well as the quadratic effect
of temperature. Now we're at the
final step before creating the
design. I want to explain the
two options you see in the
design generation outline node.
The first one will force
in all the rows that are
selected in the original table
or in the listing of the
candidates in the custom
designer. So if you have
checkpoints that are unlikely to
be favored by the optimality
criterion and want to force them
into the design, you can
use this option. It's a little
like taking those same rows and
creating an augmented design
based on just them, except that
you are controlling the possible
combinations of the factors in
the additional rows.
The second option, which I'm
checking here on purpose, allows
the candidate rows to be chosen
more than once. This will give
you optimally chosen
replications and is probably a
good idea if you're about to run
a physical experiment. If, on
the other hand, you are using an
optimal subset of rows
to try in a fancy new machine
learning algorithm like SVEM, a
topic of one of my other talks
at the March Discovery
Conference, you would not want
to check this option.
Basically, if you
don't have all of your response
values already, I would check
this box and if you already have
the response values, then don't.
Reset the sample size to 12 and
click make design. The candidate
design in all its glory will
appear just like any other
design made by the custom
designer. As we see in the
middle JMP window, JMP also
selects the rows in the original
table chosen by the candidate
design algorithm. Note that 10,
not 12, rows were selected.
On the right we see the new
design table. The rightmost
column in the table indicates
the row of origin for that
run. Notice that original rows
11 and 15 were chosen twice
and are replicates.
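
To make the selection step more concrete, here is a minimal Python sketch of a generic row-exchange search that picks a D-optimal 12-run design from the 10 candidate combinations, with repeats allowed. It is not JMP's algorithm, whose details are not described in this talk. The function names, the coding of supplier as +1/-1, the scaling of temperature to [-1, 1], the choice of the D criterion, and the deterministic starting design are all assumptions made for illustration.

```python
def model_row(supplier, temp):
    """Model terms: intercept, supplier, temp, supplier*temp, temp^2."""
    s = 1.0 if supplier == "A" else -1.0
    t = (temp - 160.0) / 20.0  # 140..180 C mapped to -1..1
    return [1.0, s, t, s * t, t * t]

def information_det(design):
    """Determinant of X'X for a list of (supplier, temp) runs."""
    rows = [model_row(s, t) for s, t in design]
    p = len(rows[0])
    m = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    det = 1.0
    for k in range(p):  # Gaussian elimination with partial pivoting
        pivot = max(range(k, p), key=lambda i: abs(m[i][k]))
        if abs(m[pivot][k]) < 1e-12:
            return 0.0
        if pivot != k:
            m[k], m[pivot] = m[pivot], m[k]
            det = -det
        det *= m[k][k]
        for i in range(k + 1, p):
            f = m[i][k] / m[k][k]
            for j in range(k, p):
                m[i][j] -= f * m[k][j]
    return det

def exchange_design(candidates, n_runs, allow_repeats=True, max_passes=50):
    """Swap one run at a time for any candidate row that raises det(X'X)."""
    n = len(candidates)
    if allow_repeats:
        design = [i % n for i in range(n_runs)]  # simple deterministic start
    else:
        design = list(range(n_runs))             # needs n_runs <= n
    best = information_det([candidates[i] for i in design])
    for _ in range(max_passes):
        improved = False
        for pos in range(n_runs):
            for cand in range(n):
                if not allow_repeats and cand in design:
                    continue
                trial = design[:pos] + [cand] + design[pos + 1:]
                d = information_det([candidates[i] for i in trial])
                if d > best * (1 + 1e-9):
                    design, best, improved = trial, d, True
        if not improved:
            break
    return sorted(design), best

candidates = ([("A", t) for t in range(150, 171, 5)]
              + [("B", t) for t in range(140, 181, 10)])
design_rows, det_value = exchange_design(candidates, n_runs=12)
for i in design_rows:
    print(i + 1, *candidates[i])  # row of origin, supplier, temperature
```

A practical implementation would normally use many random starts and keep the best result. Setting `allow_repeats=False` restricts the search to distinct candidate rows, which mirrors leaving the replication option unchecked.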
Here is a histogram view of the
design. You can see that the
different values of temperature
were chosen by the candidate set
algorithm for different machine
types. Overall, this design is
nicely balanced, but we don't
have 3 levels of temperature in
machine type A. Fortunately,
we can select the rows we
want forced into the design
to ensure that we have 3
levels of temperature for
both machine types.
Just select the rows you want
forced into the design in the
covariate table. Check the "Include
all selected covariate rows in
the design" option.
And then if you go through
all of that, you will see
that now both levels of
machine have at least three
levels of temperature in the
design. So the first design
we created is on the left
and the new design forcing
there to be 3 levels of
machine type A's
temperature settings is over
here to the right.
My second example is based on a
real data set from a
metallurgical manufacturing
process. The company wants to
control the amount of shrinkage
during the sintering step. They
have a lot of historical data
and have applied machine
learning models to predict
shrinkage and so have some idea
what the key factors are.
However, to actually optimize
the process, you should really
do a designed experiment.
As Laura Castro-Schilo once
told me, causality is a
property not of the data, but
of the data generating
mechanism, and as George Box
says on the inside cover of
Statistics for Experimenters,
to find out what happens when
you change something, it is
necessary to change it.
Now, although we can't use the
historical data to prove
causality, there is
essential information about
what combinations of factors
are possible that we can use
in the design.
We first have to separate the
columns in the table that
represent controllable factors
from the ones that are more
passive sensor measurements
or derived quantities that
cannot be controlled directly.
A glance at the scatter plot of
the potential continuous factors
indicates that there are
implicit constraints that could
be difficult to characterize as
linear constraints or disallowed
combinations. However, these
represent a sample of the
possible combinations that
can be used with the
candidate designer quite
easily.
To do this, we bring up the
custom designer. Set up the
response. Then load up some
covariate factors. Select the
columns that we can control as
DOE factors and
click OK. Now we've got them
loaded. Let's set up a quadratic
response surface model as our
base model. Then select all of
the model terms except the
intercept. Then do a control
plus right click and convert
all those terms into "If
Possible" effects. This, in
combination with the response
surface model chosen, means
that we will be creating a
Bayesian I-optimal candidate
set design.
Check the box that allows for
optimally chosen replicates.
Enter the sample size.
It then creates the design
for us. If we look at the
distribution of the factors,
we see that it has tried hard
to pursue greater balance.
On the left, we have a
scatterplot matrix of the
continuous factors from the
original data and on the right
is the hundred row design. We can see
that in the sintering
temperature, we have some
potential outliers at 1220.
One would want to make sure that
those are real values. In
general, you're going to need to
make sure that the input
candidate set is clear of
outliers and of missing values
before using it as a candidate
set design. In my talk with Ron
Kenett in the March
2021 Discovery conference,
I briefly demo how you can use
the outlier and missing value
screening platforms to remove
the outliers and replace the
missing values so that you could
use them at a subsequent
stage like this.
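
For readers who prepare the candidate table outside JMP, the clean-up step might look something like the Python sketch below. It is not the JMP outlier or missing value screening platforms; it simply drops incomplete rows and sets aside rows that sit far from a column median (a median/MAD rule, chosen here as an assumption) so that someone can check whether they are real. The column names and all data values are invented for illustration.

```python
from statistics import median

def screen_candidates(rows, columns, k=5.0):
    """Drop rows with missing values, then flag rows more than
    k scaled MADs from the median of any column for manual review."""
    complete = [r for r in rows if all(r.get(c) is not None for c in columns)]
    flagged_ids = set()
    for c in columns:
        values = [r[c] for r in complete]
        center = median(values)
        mad = median(abs(v - center) for v in values)
        if mad == 0:
            continue  # no spread in this column, nothing to flag
        limit = k * 1.4826 * mad  # 1.4826 scales MAD to a normal sigma
        for i, r in enumerate(complete):
            if abs(r[c] - center) > limit:
                flagged_ids.add(i)
    kept = [r for i, r in enumerate(complete) if i not in flagged_ids]
    flagged = [r for i, r in enumerate(complete) if i in flagged_ids]
    return kept, flagged

# Made-up historical rows; "sinter_temp" and "hold_time" are hypothetical names.
history = [
    {"sinter_temp": 1120, "hold_time": 30},
    {"sinter_temp": 1125, "hold_time": 32},
    {"sinter_temp": 1130, "hold_time": 31},
    {"sinter_temp": 1118, "hold_time": None},  # missing value
    {"sinter_temp": 1122, "hold_time": 29},
    {"sinter_temp": 1220, "hold_time": 30},    # suspicious temperature
    {"sinter_temp": 1127, "hold_time": 33},
]

kept, flagged = screen_candidates(history, ["sinter_temp", "hold_time"])
print(len(kept), "rows kept;", len(flagged), "rows flagged for review")
```

Flagged rows are returned rather than silently deleted, since the point above is to confirm whether such values are real before deciding what to do with them.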
Now suppose we have a problem
similar to the first example,
where there are two machine
types, but now we have
temperature and pressure as
factors, and we know that
temperature and pressure cannot
vary independently and that the
nature of that dependence
changes between machine types.
We can create an initial space
filling design and use the
data filter to remove the
infeasible combinations of
factor settings separately for
each machine type. Then we can
use the candidate set designer
to find the most efficient
design for this situation.
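
The same filter idea can be sketched in Python for cases where the feasible region can be written as a rule. In the sketch below, uniform random points stand in for the space filling design, and the `feasible` function stands in for the interactive lasso selections. The specific inequalities, the 0 to 1 coded units, and the names are all hypothetical; they are not the constraints used in the demo.

```python
import random

def feasible(machine, temp, pressure):
    """Hypothetical operating regions in coded 0-1 units."""
    if machine == "A":
        # Narrow region: pressure must track temperature closely,
        # and the extremes of both factors are off limits.
        return (abs(pressure - temp) <= 0.25
                and 0.2 <= temp <= 0.8
                and 0.2 <= pressure <= 0.8)
    # Machine B: wider region with two opposite corners cut off.
    return abs(pressure - temp) <= 0.6

rng = random.Random(2021)  # fixed seed so the sketch is repeatable

# 1,000 initial points: machine type plus coded temperature and pressure.
initial = [(rng.choice("AB"), rng.random(), rng.random()) for _ in range(1000)]

# Keep only the rows that satisfy the rule for their own machine type.
candidate_set = [row for row in initial if feasible(*row)]

print(len(initial), "initial rows,", len(candidate_set), "feasible candidates")
```

The surviving rows play the role of the candidate table that is loaded into the custom designer in the next step. A nonlinear rule can be handled the same way by changing the body of `feasible`.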
So now I've created my space
filling design. It's got 1,000
runs and I can bring up the
global data filter on it and
use it to shave off different
combinations of temperature
and pressure so that we can
have separate constraints by
machine type.
So I use the Lasso
tool to cut off
a corner in machine B.
And I go back and I cut off
another corner in machine B so
machine B is the machine that has
kind of a wider operating region
in temperature and pressure.
Then we switch over to machine
A. And we're just going to use
the Lasso tool
to shave off
the points that are outside its
operating region. And we see
that its operating region is a
lot narrower than Machine B's.
And here's our combined design.
From there we can load that back
up into the custom designer.
Put an RSM model there, then
set our number of runs to 32,
allowing covariate rows to be
repeated. And it'll crank
through. Once it's done that, it
selects all the points that were
chosen by the candidate set
designer. And here we can see
the points that were chosen.
They've been highlighted, and the
original set of candidate points
that were not selected
are gray.
We can bring up the new design
in Fit Y by X and we can
see a scatterplot where we see
that
the machine A design points
are in red. They're in the interior
of the space, and then the Type
B runs are in blue. It had the
wider operating region and
that's how we see these points
out here, further out for it. So
we have quickly achieved a
design with linear constraints
that change with a categorical
factor without going through the
annoying process of deriving the
linear combination coefficients.
We've simply used basic JMP 101
visualization and filtering
tools. This idea generalizes
to other nonlinear
constraints and other complex
situations fairly easily.
So now we're going to use
filtering and multivariate to
set up a very unique new type of
design that I assure you you
have never seen before.
Go to the
Lasso tool. We're going
to cut out a very unusual
constraint.
And we're going to invert
selection.
We're going to delete those rows.
Then we can speed this up a
little bit. We can go through
and do the same thing
for other combinations of X1
and the other variables.
Carving out a very
unusual shaped candidate set.
We can load this up into
the custom designer. Same
thing as before. Bring our
columns in as covariates,
set up a design with
all high order interactions made
"If Possible", with a hundred
runs. And now we see
our design for this very
unusual constrained region
that is optimal given these
constraints.
So I'll leave you with this
image. I'm very excited to
hear what you are able to do
with the new candidate set
designer. Hats off to the DOE
team for adding this
surprisingly useful and
flexible new feature. Thank
you.

 

Comments
J_Asscher

Great tool, and definitely answers a need.

 

I have a question about how to use the Candidate run feature in Custom Design vs. the Augment Design option.

I understand that when we use Augment Design:

1) we need a response (I don't know how JMP uses that response, or if it is OK to put in a phony response if I don't have a real one)

2) we can easily add new runs (that is the point)

3) we can't easily add new factors (we can add phony factors to the original design, but that is not "easily"). 

On the other hand, with the Candidate run feature, we can easily add factors, but not runs. 

 

Say I am building a new design. Problem 1: I want to include certain treatments because these are "reference runs" for me, either from an existing process or from previous experimentation. I might have a response for these runs, but I might not. Should I use Candidate runs, Augment, or a combination? 

Problem 2: I want to specify which treatments to replicate (again, for a new design). How do I do this?

 

thanks!

@J_Asscher 

 

 Problem 1: I want to include certain treatments because these are "reference runs" for me, either from an existing process or from previous experimentation. I might have a response for these runs, but I might not. Should I use Candidate runs, Augment, or a combination? 

 

It depends on what you want to do - Augment design is for when you have already collected some response values and want the most informative *new* combinations of factor settings to run. So, you only use Augment when you are performing a sequence of designs and are including all the existing rows in the next stage of the analysis.

 

Candidate Designs are a more general tool that (in principle) contains Augmentation. If you check the "Include all selected covariate rows in the design" option, then those rows are forced into the design, and this operates the same as Augmentation. The difference is that the set of potential combinations of factor settings is constrained to be those in the candidate table, whereas Augment Design will apply a coordinate exchange-type optimization to determine the new combinations of factor settings.

 

The Candidate Set designs have a wider set of applications, and I see it as initially being a great way to deal with difficult-to-describe constraints, but there are other uses, as well. For example, you could use it like we do in SVEM to identify a good training set for a machine learning exercise.

 

Problem 2: I want to specify which treatments to replicate (again, for a new design). How do I do this?

 

You can force the replications in using the "Include all selected" mechanism, or you can check the "Allow covariate rows to be repeated" option. When you do that, the Candidate Set designer algorithm will sample from the candidate table with replacement. So it will find the optimal sets of treatments to replicate, in whichever optimal design sense you prefer. There are certainly good reasons to use each of these options.