Transcript
Hello, Chris Gotwalt here. Today we're going to be constructing the history of graphic paradoxes and... oh wait, wrong topic. Actually, we're going to be talking about candidate set designs: tailoring DOE constraints to the problem.
Industrial experimentation for product and process improvement has a long history with many threads, of which I admit I only know a tiny sliver. The idea of using observation for product and process innovation is as old as humanity itself. It received renewed focus during the Renaissance and the Scientific Revolution. During the subsequent Industrial Revolution, science and industry began to operate more and more in lockstep. In the early 20th century, Edison's lab pursued industrial innovation on a factory scale, but it operated, to my knowledge, outside of modern experimental traditions.
Not long after R. A. Fisher introduced concepts like blocking and randomization, his associate and later son-in-law, George Box, developed what is now probably the dominant paradigm in design of experiments, the most popular book being Statistics for Experimenters by Box, Hunter, and Hunter. The methods described in Box, Hunter, and Hunter are what I call the taxonomical approach to design.
So suppose you have a product or process you want to improve. You think through the things you can change, the knobs you can turn, like temperature, pressure, time, and the ingredients or processing methods you can use. These things become your factors. Then you think about whether they are continuous or nominal: if they are nominal, how many levels they take, and if they are continuous, the range over which you're willing to vary them. Next you figure out the name of the design that most easily matches up to the problem and fits your budget. That design will have a name like a Box-Behnken design, a fractional factorial, a central composite design, or possibly something like a Taguchi array. There will be restrictions on the numbers of runs, the numbers of levels of categorical factors, and so on, so there will be some shoehorning of the problem at hand into the design that you can find. For example, factors in the BHH (Box, Hunter, and Hunter) approach often need to be whittled down to two or three unique values or levels.
Despite its limitations, the taxonomical approach has been fantastically successful. Over time, of course, some people have asked if we could still do better. And by better, we mean to ask ourselves: how do we design our study to obtain the highest quality information pertinent to the goals of the improvement project? This line of questioning led ultimately to optimal design. Optimal design is an academic research area. It started in parallel with the Box school in the '50s and '60s but, for various reasons, remained out of the mainstream of industrial experimentation until the custom designer in JMP.
The philosophy of the custom designer is that you describe the problem to the software, and it returns the best design for your budgeted number of runs. You start out by declaring your responses along with their goals, like minimize, maximize, or match target, and then you describe the kinds of factors you have: continuous, categorical, mixture, etc. Categorical factors can have any number of levels. You give it a model that you want to fit to the resulting data. The model assumes a least squares analysis and consists of main effects, interactions, and polynomial terms. The custom designer makes some default assumptions about the nature of your goal, such as whether you're interested in screening or prediction, which is reflected in the optimality criterion that is used. The defaults can be overridden with a red triangle menu option if you want to do something different from what the software intends. The workflow in most applications is to set up the model.
Then you choose your budget and click Make Design. Once that happens, JMP uses a mixed continuous and categorical optimization algorithm, solving for the number of factors times the number of rows terms. Then you get your design data table with everything you need except the response data. This is a great workflow when the factors can be varied independently of one another.
What if you can't? What if there are constraints? What if the values of some factors determine the possible ranges of other factors? Well, then you can define some factor constraints or use the disallowed combinations filter. Unfortunately, while these are powerful tools for constraining experimental regions, it can still be very difficult to characterize constraints using them.
Brad Jones' DOE team of Ryan Lekivetz, Joseph Morgan, and Caleb King has added an extraordinarily useful new feature that makes handling constraints vastly easier in JMP 16. These are called candidate or covariate runs. What you can do is, off on your own, create a table of all possible combinations of factor settings that you want the custom designer to consider. Then load them up here, and those will be the only combinations of factor settings that the designer will look at. The original table, which I call a candidate table, is like a menu of factor settings for the custom designer.
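To build intuition for what the designer does with that menu, here is a minimal sketch, in Python rather than JMP, of greedy D-optimal subset selection from a candidate list. The ridge term, the greedy strategy, and the one-factor quadratic example are my own simplifications for illustration; they are not JMP's actual algorithm.

```python
def det(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in m]
    n, d = len(m), 1.0
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(m[r][i]))
        if abs(m[p][i]) < 1e-12:
            return 0.0
        if p != i:
            m[i], m[p] = m[p], m[i]
            d = -d
        d *= m[i][i]
        for r in range(i + 1, n):
            f = m[r][i] / m[i][i]
            for c in range(i, n):
                m[r][c] -= f * m[i][c]
    return d

def xtx(rows, ridge=1e-6):
    """X'X for a list of model rows, plus a tiny ridge so the greedy
    search has something to compare before the design is full rank."""
    k = len(rows[0])
    m = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    for i in range(k):
        m[i][i] += ridge
    return m

def greedy_d_optimal(candidates, n_runs):
    """Grow the design one row at a time, always adding the candidate
    row that most increases det(X'X); rows may repeat (replicates)."""
    design = []
    for _ in range(n_runs):
        best = max(candidates, key=lambda r: det(xtx(design + [r])))
        design.append(best)
    return design

# candidate "menu": model rows [1, x, x^2] for a quadratic in one factor
candidates = [[1.0, x, x * x] for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]
design = greedy_d_optimal(candidates, 6)
```

With six runs and a three-term model, the greedy search concentrates runs at the extremes and the center and replicates some of them, which mirrors how the custom designer behaves when candidate rows may be chosen more than once.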
This gives JMP users an incredible level of control over their designs. What I'm going to do today is go over several examples to show how you can use this to make the custom designer fulfill its potential as a tool that tailors the design to the problem at hand.
Before I do that, I'm going to get off topic for a moment and point out that in the JMP Pro version of the custom designer, there's now a capability that allows you to declare limits of detection at design time. If you enter non-missing values for the limits here, the custom designer will add a column property that informs the generalized regression platform of the detection limits, and it will then automatically get the analysis correct. This leads to dramatically higher power to detect effects and much lower bias in predictions, but that's a topic for another talk.
Here are a bunch of applications that I can think of for the candidate set designer. The simplest is when the range of a continuous factor depends on the level of one or more categorical factors. Another example is when we can't control the ranges of factors completely independently, but the constraints are hard to write down. There are two methods we can use for this. One is using historical process data as a candidate set, and the other is what I call filter designs, where you create a giant initial data set using random numbers or a space filling design and then use row selections in scatterplots to pick off the points that don't satisfy the constraints.
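The filter-design idea is easy to prototype outside of JMP as well. Here is a hedged Python sketch, with a made-up feasibility rule standing in for whatever constraint is hard to write down as linear inequalities:

```python
import random

random.seed(1)

# Hypothetical feasibility rule: any condition you can code (or select
# visually with scatterplots in JMP) works, even one that would be
# awkward to express as linear constraints or disallowed combinations.
def feasible(temp, pressure):
    # the pressure ceiling tightens quadratically as temperature rises
    return pressure <= 120 - 0.02 * (temp - 140) ** 2

# giant initial data set of random candidate points over the factor ranges
pool = [(random.uniform(140, 180), random.uniform(80, 120))
        for _ in range(5000)]

# keep only the feasible points; this filtered table is the candidate set
candidates = [p for p in pool if feasible(*p)]
```

The surviving rows play the role of the filtered table you would then load into the custom designer as covariate factors.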
There's also the ability to highly customize mixture problems, especially situations where you've got multilayer mixtures. That isn't something I'm going to be able to talk about today, but in the future it is something you should be able to do with this candidate set designer. You can also handle nonlinear constraints with the filtering method, the same way you can handle other kinds of constraints. It's very simple, and I'll have a quick example at the very end illustrating this.
So let's consider our first example. Suppose you want to match a target response in an investigation of two factors. One is an equipment supplier, of which there are two levels, and the other is the temperature of the device. The two suppliers have different ranges of operating temperatures. Supplier A's is the narrower of the two, going from 150 to 170 degrees Celsius, but it's controllable to a finer level of resolution, about 5 degrees. Supplier B has a wider operating range, going from 140 to 180 degrees Celsius, but is only controllable to within 10 degrees Celsius. Suppose we want to do a 12-run design to find the optimal combination of these two factors. We enumerate all possible combinations of the two factors in 10 runs in the table here, just creating this manually ourselves.
So here are the five possible values of machine type A's temperature settings, and down here are the five possible values of type B's temperature settings. We want the best design in 12 runs, which exceeds the number of rows in the candidate table. This isn't a problem in theory, but I recommend appending a copy of the candidate set, just in case, so that the number of rows in your candidate table exceeds the number of runs you're looking for in the design.
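For a table this small you would just type it in, but the enumeration is also trivial to script. A sketch in Python, using the supplier names and step sizes from the example:

```python
# supplier A: 150-170 °C in 5 °C steps; supplier B: 140-180 °C in 10 °C steps
candidates = ([("A", t) for t in range(150, 171, 5)] +
              [("B", t) for t in range(140, 181, 10)])
```

That gives the ten (supplier, temperature) rows of the candidate table, five per supplier.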
Then we go to the custom designer, push the Select Covariate Factors button, and select the columns that we want loaded as candidate design factors. Now the candidate design is loaded and shown. Let's add the interaction effect, as well as the quadratic effect of temperature. Now we're at the final step before creating the design. I want to explain the two options you see in the Design Generation outline node.
The first one forces in all the rows that are selected in the original table or in the listing of the candidates in the custom designer. So if you have checkpoints that are unlikely to be favored by the optimality criterion and want to force them into the design, you can use this option. It's a little like taking those same rows and creating an augmented design based on just them, except that you are controlling the possible combinations of the factors in the additional rows.
The second option, which I'm checking here on purpose, allows the candidate rows to be chosen more than once. This will give you optimally chosen replications and is probably a good idea if you're about to run a physical experiment. If, on the other hand, you are using an optimal subset of rows to try in a fancy new machine learning algorithm like SVEM, a topic of one of my other talks at the March Discovery Conference, you would not want to check this option. Basically, if you don't have all of your response values already, I would check this box; if you already have the response values, then don't.
Reset the sample size to 12 and click Make Design. The candidate design, in all its glory, will appear just like any other design made by the custom designer. As we see in the middle JMP window, JMP also selects the rows in the original table chosen by the candidate design algorithm. Note that 10, not 12, rows were selected. On the right we see the new design table; the rightmost column in the table indicates the row of origin for each run. Notice that original rows 11 and 15 were chosen twice and are replicates.
Here is a histogram view of the design. You can see that different values of temperature were chosen by the candidate set algorithm for the different machine types. Overall, this design is nicely balanced, but we don't have three levels of temperature for machine type A. Fortunately, we can select the rows we want forced into the design to ensure that we have three levels of temperature for both machine types. Just select the rows you want forced into the design in the covariate table and check the option to include all selected covariate rows in the design. If you go through all of that, you will see that now both machine types have at least three levels of temperature in the design. The first design we created is on the left, and the new design, forcing there to be three levels of machine type A's temperature settings, is over here on the right.
My second example is based on a real data set from a metallurgical manufacturing process. The company wants to control the amount of shrinkage during the sintering step. They have a lot of historical data and have applied machine learning models to predict shrinkage, and so have some idea what the key factors are. However, to actually optimize the process, you should really do a designed experiment.
As Laura Castro-Schilo once told me, causality is a property not of the data, but of the data-generating mechanism. And as George Box says on the inside cover of Statistics for Experimenters, to find out what happens when you change something, it is necessary to change it.
Now, although we can't use the historical data to prove causality, it contains essential information about what combinations of factors are possible, and we can use that in the design. We first have to separate the columns in the table that represent controllable factors from the ones that are passive sensor measurements or derived quantities that cannot be controlled directly.
A glance at the scatterplot of the potential continuous factors indicates that there are implicit constraints that could be difficult to characterize as linear constraints or disallowed combinations. However, these rows represent a sample of the possible combinations, and that sample can be used with the candidate designer quite easily.
To do this, we bring up the custom designer and set up the response. We'd like to load up some covariate factors, so select the columns that we can control as DOE factors and click OK. Now we've got them loaded. Let's set up a quadratic response surface model as our base model. Then select all of the model terms except the intercept, Ctrl + right-click, and convert all those terms into If Possible effects. This, in combination with the response surface model chosen, means that we will be creating a Bayesian I-optimal candidate set design.
Check the box that allows for optimally chosen replicates and enter the sample size. It then creates the design for us. If we look at the distribution of the factors, we see that it has tried hard to pursue balance.
On the left, we have a scatterplot matrix of the continuous factors from the original data, and on the right is the 100-row design. We can see that in the sintering temperature we have some potential outliers at 1220. One would want to make sure that those are real values. In general, you're going to need to make sure that the input candidate set is clear of outliers and missing values before using it as a candidate set design. In my talk with Ron Kenett at the March 2021 Discovery conference, I briefly demo how you can use the outlier and missing value screening platforms to remove the outliers and replace the missing values so that you can use the data at a subsequent stage like this.
Now suppose we have a problem similar to the first example, where there are two machine types, but now we have temperature and pressure as factors, and we know that temperature and pressure cannot vary independently and that the nature of that dependence changes between machine types. We can create an initial space filling design and use the data filter to remove the infeasible combinations of factor settings separately for each machine type. Then we can use the candidate set designer to find the most efficient design for this situation.
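Before the demo, here is a rough Python sketch of that generate-then-filter step, with invented inequalities standing in for the regions I will draw with the lasso tool:

```python
import random

random.seed(7)

# Hypothetical machine-dependent feasible regions; in the demo these are
# carved out interactively with the lasso tool rather than written down.
def feasible(machine, temp, pressure):
    if machine == "B":                  # machine B: wider operating window
        return temp + pressure <= 280
    return 155 <= temp <= 175 and pressure <= 100   # machine A: narrower

# 1,000-point space-filling stand-in: uniform random points per machine
pool = [(m, random.uniform(140, 180), random.uniform(60, 120))
        for m in ("A", "B") for _ in range(500)]

# filtering per machine type leaves a candidate set whose
# constraints differ with the categorical factor
candidates = [p for p in pool if feasible(*p)]
```

The filtered table is exactly what the candidate set designer needs: a constraint that changes with a categorical factor, captured without ever writing the inequalities into the designer itself.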
So now I've created my space filling design. It's got 1,000 runs, and I can bring up the global data filter on it and use it to shave off different combinations of temperature and pressure so that we can have separate constraints by machine type. I use the lasso tool to cut off a corner in machine B, and then I go back and cut off another corner; machine B is the machine with the wider operating region in temperature and pressure. Then we switch over to machine A and use the lasso tool to shave off the points that are outside its operating region. We see that its operating region is a lot narrower than machine B's.
And here's our combined design. From there, we can load it back up into the custom designer, put an RSM model there, set our number of runs to 32, and allow covariate rows to be repeated. It'll crank through, and once it's done, it selects all the points that were chosen by the candidate set designer. Here we can see the points that were chosen: they've been highlighted, and the original candidate points that were not selected are gray.
We can bring up the new design in Fit Y by X and see a scatterplot where the machine A design points are in red; they're in the interior of the space. The type B runs are in blue; it had the wider operating region, and that's why we see its points further out. So we have quickly achieved a design with linear constraints that change with a categorical factor, without going through the annoying process of deriving the linear combination coefficients. We've simply used basic JMP 101 visualization and filtering tools. This idea generalizes to nonlinear constraints and other complex situations fairly easily.
So now we're going to use filtering and the Multivariate platform to set up a very unusual new type of design that I assure you you have never seen before. Go to the lasso tool; we're going to cut out a very unusual constraint. Then we invert the selection and delete those rows. We can speed this up a little bit, going through and doing the same thing for other combinations of X1 and the other variables, carving out a very unusually shaped candidate set.
We can load this up into the custom designer, same as before. Bring our columns in as covariates and set up a design with all high-order interactions made If Possible, with a hundred runs. And now we see our design for this very unusual constrained region, a design that is optimal given these constraints.
So I'll leave you with this image. I'm very excited to hear what you are able to do with the new candidate set designer. Hats off to the DOE team for adding this surprisingly useful and flexible new feature. Thank you.