Hello everyone.
My name is Micol Tresoldi.
Today my talk will be about coding with continuous and mixture variables
to explore more of the input space.
Before I jump into the topic though,
I'd like to give you a brief outline of what my presentation will look like.
I'll start by sharing and presenting to you a little bit of a general idea
of what the object of the project was and the objective that was driving it,
and then I'll move on to present the initial approach
that we took to pursue this objective.
I'll then show you, though,
that following this initial approach, we do encounter some problems.
At that point at the problem stage,
we'll need to go back to the beginning of the problem setting,
and try to look at that from a slightly different perspective,
in a way that we can figure out an alternative way
of looking at our input variables.
In doing this, we'll be altering our data structure.
But I'll show you how we can actually build an equivalent statistical model
in a way that we will not need to go and collect any additional data,
but actually we'll be able to re-analyse the exact same data,
and still be able to hopefully overcome our initial problem
and find some useful directions to go.
This is the overview of the presentation.
Let me start by giving you the general idea of the project.
When the clients first reached out to us, they had something in mind
in terms of having some ingredients they needed to mix together,
in a way that the final formulation exhibited some optimal properties.
More specifically,
any formulation was going to be judged upon two properties,
and each of these properties had to meet some certain optimality criteria.
As I just stated it, the problem itself is pretty general in its nature.
We have some ingredients. Using the common analogy,
we can think about ourselves in the kitchen having some ingredients
and having to figure out a way to mix them,
in a way that, at the end, our cake will look nice
and also taste good.
This is the general framework.
Now, let me give you some more details about this specific cake.
The recipe calls for four ingredients,
Factor A, Factor B, Factor C, and Factor D.
For Factor A, B, and C,
actually the amount to put in the recipe is being predetermined.
We don't have freedom there.
On the other hand, what we need to decide though
is how we're going to make those ingredients, if you like.
There are multiple ways of making those ingredients
because we have multiple raw materials that we can employ
to arrive to those ingredients.
Then only after having these ingredients ready for using,
we can actually employ them in the final recipe.
This is for Factor A, B, and C.
For Factor D on the other hand, there is only one raw material we can use.
Only one way of making it.
What we need to decide is how much
of Factor D we're going to put in the final recipe.
Just to recap, in terms of the decision-making problem,
we'll need to decide four things,
how we make Factor A, how we make Factor B,
how we make Factor C,
and how much of Factor D we're going to put in the recipe.
Okay, now I need to be a little more specific
and give you some more details
about what these ways of making Factor A, B, and C were.
The client, when they came to us,
they had relatively few options in mind for this.
For Factor A, they wanted to consider two raw materials,
either only using raw material A1, or only using raw material A2.
For Factor B, once again, only two raw materials, B1 and B2.
The possible ways of making Factor B
were either the two pure blends of B1 and B2,
or a 50-50 blend of B1 and B2.
For Factor C, three raw materials are available for making it.
Again, either the three pure blends, C1, C2, C3,
or, as a fourth option, a 50-50 blend of C1 and C2.
With respect to the Factor D quantity,
which I'm going to denote from now on by X1,
they wanted to test four possible levels.
Four possible amounts, five, 10, 15, and 20.
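To make the size of this candidate space concrete, here is a minimal sketch that enumerates every combination of the options just listed. The level labels are illustrative names, not identifiers taken from the actual data table.

```python
from itertools import product

# Illustrative labels for the options described in the talk.
factor_a = ["A1", "A2"]
factor_b = ["B1", "B2", "B1/B2 50-50"]
factor_c = ["C1", "C2", "C3", "C1/C2 50-50"]
x1_levels = [5, 10, 15, 20]

# Every distinct recipe available under the categorical view of the problem.
candidates = list(product(factor_a, factor_b, factor_c, x1_levels))
print(len(candidates))  # 2 * 3 * 4 * 4 = 96
```

Under this view there are only 96 distinct recipes to choose from.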
Regarding the response variables, those are slightly more straightforward
in the sense that we only have two of them,
both are continuous variables.
Each of them, as I mentioned in the beginning,
had to meet a certain optimality threshold, optimality criterion.
Y1 had to be above 17. Y2 had to be above 2.6.
Now we have on our left,
our input variables that we need to decide how to maneuver and vary
in making the recipe.
On the right side,
we have the properties that we're interested in.
What we decided to do was
to propose to our clients to do a designed experiment
in a way that we would go out and make some of these recipes,
make some of these formulations and be able, from the collected data,
after recording the properties for [inaudible 00:06:32] ,
actual observed formulation,
to understand and infer the underlying relationships
that were linking the input variables,
how we were making our recipe, and the response variables.
How the properties actually were executing themselves
for different combinations of the inputs.
Ultimately, the objective of the project was, in fact,
to figure out whether there was an optimal recipe,
meaning a recipe whose properties
both met their respective optimality criteria.
Given this framework, given this setting,
now it's pretty clear that X1 is going to be a quantitative variable.
But how about Factor A, B, and C?
Given the fact that we can mix these raw materials,
are we going to treat them as categorical or are we going to treat them as numeric?
At this stage, because the client was particularly interested in
observing the performance of these specific compositions of the raw materials
for making the various Factors A, B, and C,
we decided to accommodate their requests and coded them as categorical variables
in a way that we were sure that those specific compositions
were going to show up in the design of experiment.
Again, categorical variables means that, in this case,
each level of the categorical variable corresponds to a possible way
of making the ingredient or factor.
We end up with three categorical variables
with two, three, and four levels respectively.
Now, turns out that actually
this categorical coding approach was also pretty helpful in the discussion
of how we wanted to specify the statistical model
that, in principle, was supposed to,
or at least assumed to be comprehensive enough
to describe and capture the relationship undergoing
between the factors and the responses, the properties.
For the client, having this categorical coding made it particularly easy
to identify and specify
what kind of interaction terms they were expecting to see
as being relevant in explaining the relationships.
The final statistical model
that we ended up specifying for the design of experiment
comprised main effects, all of the two-way interactions,
quadratic and cubic terms for the continuous variable,
with the addition of the interaction of the quadratic term with one of the factors.
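As a rough check on how large this model is, the sketch below counts its parameters under the categorical coding. It assumes each categorical factor contributes (levels - 1) coefficients, and because the talk does not say which factor is interacted with the quadratic term, it reports the count for each possibility.

```python
from itertools import combinations

# Coefficients per main effect: (levels - 1) for categorical factors, 1 for X1.
main_df = {"A": 1, "B": 2, "C": 3, "X1": 1}

n_intercept = 1
n_main = sum(main_df.values())
# A two-way interaction needs the product of the two main-effect counts.
n_two_way = sum(main_df[p] * main_df[q] for p, q in combinations(main_df, 2))
n_poly = 2  # X1 squared and X1 cubed


def n_parameters(quadratic_partner):
    """Total parameter count when X1 squared is interacted with one categorical factor."""
    return n_intercept + n_main + n_two_way + n_poly + main_df[quadratic_partner]


for partner in ("A", "B", "C"):
    print(partner, n_parameters(partner))  # 28, 29, 30
```

Whichever factor it was, the model has about 30 parameters, which fits comfortably within a 51-run budget.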
Now, of course,
we also had some constraint on the number of experiments available,
because we obviously don't have an infinite amount of resources.
So we put a constraint of 51 runs,
and this is the DOE that JMP gave us,
able to estimate the statistical model we just specified
and also to stay within the constraints
on our resources.
Now with this, the only thing that was left to do
was go and make these 51 formulations.
Imagine that we're super quick, and everything is magic,
and we have already gone
and made all of our formulations and collected the data.
Now we are in good shape for
estimating the Gaussian model that we specified.
These are the results for the first property, Y 1.
We can see that there is a pretty good fit between predicted and actual values.
Also, if we look at the metrics reported in the model summary,
those look pretty satisfactory.
The same is true if we now look at the second property, Y2,
again, pretty good fit.
We are happy with our models, and we think we did a good job
in capturing the relationship.
Now remember that what we really want to discover is whether, in fact,
there is any optimal recipe that can meet both criteria for our properties.
How are we going to do this?
How are we going to establish if such an optimal recipe exists or not?
Well, in JMP Pro 16, this is a super easy task,
because we can simulate thousands of potential alternative recipes
by using the Profiler feature options.
For each of these hypothetical recipes,
we can automatically have in the same table
the predicted mean value for the two properties,
so that it comes super natural and super easy
to see if there is any optimal recipe.
Just to give you an idea how quick that is,
I want to show you live, how we can do this.
This is my DOE categorical table, where I have my Factor A, B, and C.
X1 is my only quantitative input variable.
I have my recorded values for the two properties, Y1 and Y2.
Imagine now that we have already run the model, estimated the model,
and saved the prediction formulas for the two variables here.
We can go here and highlight these two columns.
Go to Graph, select Profiler,
and then put those two prediction formulas in the Y prediction formula box
and click OK.
This is the usual way we get a Profiler dialogue box.
In fact, we can easily play around, changing the various
levels of the inputs in a way that we can actually see
how this impacts our predictions for the two properties.
However, what I want to show you today
is how we can actually ask, going to the red triangle,
ask JMP to output a random table, and we can make it as big as we like.
I'm going to start with 30,000 rows, just to start.
As you see,
it took no time for JMP to give us these 30,000 rows,
where each row corresponds to a hypothetical recipe
that we haven't necessarily seen in the DOE.
This is the power of having this feature in JMP,
that we can explore the input space in literally no time.
Now, if we are interested in seeing
whether there is one recipe that is optimal,
then we can go here, Graph Builder,
and put in the predicted values for Y1 and the predicted values for Y2.
And then just to aid our visualization,
I'm going to put a vertical line
at the optimal threshold for Y2,
and likewise a horizontal line marking the optimal threshold for Y1.
This upper quadrant denotes the optimal region,
because both properties are satisfying the optimality criteria.
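The same check can be expressed outside the graph. The sketch below flags which rows of a table of predictions fall in the optimal quadrant, using the two thresholds stated earlier (Y1 above 17, Y2 above 2.6). The three demo predictions are made-up numbers used only to exercise the function; they are not results from the study.

```python
import numpy as np

Y1_MIN = 17.0  # threshold for Y1 stated in the talk
Y2_MIN = 2.6   # threshold for Y2 stated in the talk


def in_optimal_region(pred_y1, pred_y2):
    """Boolean mask: True where both predicted properties exceed their thresholds."""
    pred_y1 = np.asarray(pred_y1, dtype=float)
    pred_y2 = np.asarray(pred_y2, dtype=float)
    return (pred_y1 > Y1_MIN) & (pred_y2 > Y2_MIN)


# Made-up predictions for three hypothetical recipes.
demo_y1 = [18.0, 16.5, 17.2]
demo_y2 = [2.4, 2.9, 2.7]
mask = in_optimal_region(demo_y1, demo_y2)
print(mask, int(mask.sum()))
```

Applied to the predicted columns of a simulated table, the sum of the mask is the number of recipes that meet both criteria.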
Unfortunately, we can see from here
that we don't find any recipe that is, in fact,
able to satisfy both the criteria.
This is like, okay, not very good news.
Now let me go back to my presentation very quick.
We can see, in fact,
that we don't have any points lying in this quadrant
with the happy green smiley.
What do we do at this point?
Do we give up? Of course not.
What we can do is, in fact, go back to the beginning of the problem
and try to see if we can change any of our initial choices
that we first made in approaching the problem.
In particular, you might be remembering that we were undecided
whether we would treat the Factor A, B, and C
as categorical or as numeric.
So far we have treated them as categorical.
So far, Factor A has been a categorical variable with two levels,
either only using A1 or only using A2.
However, in fact, the client was open
to mixing the raw materials to make Factor A.
So that was an option.
Then what we can think of
is substituting this Factor A with a variable
that I now call A1 Content, which is a quantitative variable,
which represents how much of A1 I'm going to put
into the mixture of A1 and A2 for making Factor A.
The translation,
the conversion between categorical levels and numerical values,
is almost immediate.
If I'm only using A1, I'm going to use 100% of A1 in my mixture,
so I can code A1 Content to be equal to one.
On the opposite side, if I'm only using A2,
this means that I have zero A1 Content in my mixture,
and therefore A1 Content is going to be equal to zero.
You might have guessed that implicitly,
we are also defining A2 Content to be equal to 1 - A1 Content.
But we don't really need that
because we are only looking at two mixture variables.
Why are we doing this?
Well, the advantage is clear.
With Factor A,
we were constrained to looking at A1 Content being equal to either zero or one.
Now that we're considering continuous coding,
the A1 Content can take any value between zero and one.
This, of course,
represents an enormous jump in the flexibility of our model,
an infinite one, in the sense that now we are open
to literally infinitely many more mixtures and ways of making Factor A.
Likewise, Factor B is categorical with three levels.
So far it's been this way,
coded only B1, only B2 or 50-50 blend.
But following the similar logic,
we can now introduce B1 Content, a continuous variable.
Again, the conversion is going to be exactly the same.
A 50-50 blend of B1 and B2 will be converted to 0.5
because I'm using 50% of B1 and 50% of B2.
Again, B2 Content is 1 - B1 Content.
Again, the advantage is that we're not bound to jump from zero to 0.5,
or from zero to one necessarily,
but we can explore the whole spectrum of values
from zero to one.
Factor C is slightly more tricky,
because we do have three possible raw materials to mix up.
At this stage,
we need to introduce not just one, but actually three continuous variables
that, besides being continuous, also have the mixture constraint,
meaning that at all times they need to sum to one.
But the conversion
between the levels of Factor C and the three new mixture variables
follows exactly the same logic.
That's super easy.
This is just a visualization of how we do the conversion of the levels.
This is how the DOE points that we already have the data on
(we don't need anything else)
sit within the continuous coding space.
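The conversion of levels described here can be written as a small lookup. This is a minimal sketch: the level labels and the `recode` helper are illustrative names, while the numeric codes follow the rules stated in the talk (pure material is 1, absent material is 0, a 50-50 blend is 0.5 each).

```python
# Continuous codes for each categorical level; keys are illustrative labels.
A_CODES = {"A1": 1.0, "A2": 0.0}                      # A1 Content
B_CODES = {"B1": 1.0, "B2": 0.0, "B1/B2 50-50": 0.5}  # B1 Content
C_CODES = {                                           # (C1, C2, C3) contents
    "C1": (1.0, 0.0, 0.0),
    "C2": (0.0, 1.0, 0.0),
    "C3": (0.0, 0.0, 1.0),
    "C1/C2 50-50": (0.5, 0.5, 0.0),
}


def recode(run):
    """Convert one categorical DOE run (a dict) into the continuous/mixture coding."""
    c1, c2, c3 = C_CODES[run["C"]]
    return {
        "A1 Content": A_CODES[run["A"]],
        "B1 Content": B_CODES[run["B"]],
        "C1": c1, "C2": c2, "C3": c3,
        "X1": run["X1"],  # X1 was already quantitative, so it is unchanged
    }


example = recode({"A": "A2", "B": "B1/B2 50-50", "C": "C1/C2 50-50", "X1": 15})
print(example)
```

Every existing run maps to exactly one point in the new coding, which is why no additional data are needed.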
Now, the only more
involved step in passing from the categorical coding
to a continuous coding is how, in fact,
we convert the statistical model
that we used to design the experiment and then to analyze the data.
How are we going to do this?
Well, the easiest way is to just do it in many small steps.
What we're going to do is start with our main effects model,
and little by little add the different factors.
We start with Factor A, which had only two levels.
Now in the continuous coding, what we're going to put is A1 Content.
We're only going to put the linear term of this A1 Content.
In fact, we only had one coefficient for Factor A in the categorical coding model.
Likewise, now we're going to have one single coefficient for A1 Content.
Now if you don't believe me that this is an equivalent model,
I'm going to show you a couple of examples.
Imagine that we want to figure out
the impact of using only A2 for making Factor A,
then that means that A1 content is zero. Fine.
From the categorical coding model,
we're going to just look at the intercept term,
because this extra term refers to when we use A1.
On this other side,
for continuous coding model,
we're going to put the intercept, of course,
and then the A1 Content coefficient, but now we will multiply it by zero
because A1 Content is zero.
Not even doing any math,
you can really see that these two numbers are exactly the same.
Similarly, if we want to see what the impact of using only A1 is,
now, this time, A1 Content is going to be equal to one.
Now for categorical,
I'm going to sum up the intercept term plus the Factor A coefficient
accounting for the difference in the levels of the factor.
On this other side though,
we are going to always include the intercept.
At this point, we'll multiply the A1 Content coefficient by one
because the content is one.
Again, not even any math,
the two numbers here are the same as the two numbers here.
Exactly equivalent.
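Here is a minimal numerical sketch of that equivalence for Factor A. It assumes the categorical model is written as an intercept plus an indicator for level A1, with A2 as the reference, which matches how the two examples were described; software defaults for coding categorical factors may differ. The runs and coefficient values are made up for illustration.

```python
import numpy as np

levels = np.array(["A2", "A1", "A1", "A2"])  # illustrative runs

# Categorical coding: intercept plus an indicator for level A1 (A2 is the reference).
X_cat = np.column_stack([np.ones(len(levels)), (levels == "A1").astype(float)])

# Continuous coding: intercept plus A1 Content (1 for pure A1, 0 for pure A2).
a1_content = np.where(levels == "A1", 1.0, 0.0)
X_cont = np.column_stack([np.ones(len(levels)), a1_content])

beta = np.array([10.0, 2.5])  # made-up intercept and Factor A coefficient

print(np.array_equal(X_cat, X_cont))  # the two model matrices coincide
print(X_cat @ beta, X_cont @ beta)    # so the predictions coincide as well
```

Under this assumption the two model matrices are literally the same, so any fit gives identical predictions at the observed levels.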
Now with Factor B, we had three levels.
How are we going to do that?
Well, because it has three levels, now we can't just add the linear term,
but we also need to add the quadratic term.
We had two coefficients before, and we're going to have two coefficients
also now with the continuous coding.
Again, if you don't believe me that this is an equivalent model,
we can work out at least one example, which works exactly as before.
If I only have B2, B1 Content is zero,
which means that the two coefficients are going to have zero weight
in computing the impact.
Therefore the two numbers are just the two intercepts, which are, in fact, the same.
I'm not going to go into this again,
only B1 is equivalent to B1 Content equal to one.
The most interesting case is
this one, which at least requires you to do some summation,
where B1 Content is going to be 0.5, because we are considering the 50-50 blend.
You can verify easily
that these two numbers here, summed up,
are equivalent to this other side of the equation,
where we put in 0.5 and 0.5 squared,
because now our B1 Content is equal to 0.5.
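The Factor B case can be checked numerically. The sketch below takes made-up categorical coefficients (again assuming pure B2 is the reference level) and solves for the linear and quadratic coefficients of B1 Content that reproduce the same three level means. The variable names are illustrative.

```python
import numpy as np

# Made-up categorical-coding coefficients (reference level: pure B2).
b0, g_b1, g_blend = 10.0, 3.0, 2.0

# B1 Content for the three levels: pure B2, 50-50 blend, pure B1.
x = np.array([0.0, 0.5, 1.0])
cat_means = np.array([b0, b0 + g_blend, b0 + g_b1])

# Solve c0 + c1*x + c2*x^2 = categorical mean at the three points.
V = np.column_stack([np.ones_like(x), x, x**2])
c0, c1, c2 = np.linalg.solve(V, cat_means)

blend_continuous = c0 + c1 * 0.5 + c2 * 0.5**2
print(c0, c1, c2)                           # intercept, linear, quadratic
print(blend_continuous, b0 + g_blend)       # both give the 50-50 blend mean
```

Three levels fix a quadratic exactly, so the continuous model reproduces the categorical predictions at 0, 0.5, and 1, even though the individual coefficients take different values.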
Now for Factor C, we had four levels.
If you remember, we had three possible raw materials.
We had to introduce three mixture variables.
Every time we have to deal with mixture variables, things
get slightly complicated
because they become perfectly collinear with any constant term.
In putting in C1, C2, and C3,
the fact that they sum to one requires us to delete the constant term.
But other than that, everything follows pretty much the same.
We had three coefficients here,
and we're still going to have three coefficients here
because we have four, but we are getting rid of the intercept.
So still the same balance.
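A rank check makes the collinearity visible. This sketch assumes the four Factor C terms are C1, C2, C3, and the C1*C2 blending term, which is one natural choice given that the fourth level is a C1/C2 blend; the slides may define them differently. The three extra blends at the end are illustrative points, not part of the design.

```python
import numpy as np

# Rows are (C1, C2, C3). The first four are the Factor C levels from the design.
points = np.array([
    [1.0, 0.0, 0.0],      # pure C1
    [0.0, 1.0, 0.0],      # pure C2
    [0.0, 0.0, 1.0],      # pure C3
    [0.5, 0.5, 0.0],      # 50-50 blend of C1 and C2
    [0.5, 0.0, 0.5],      # illustrative extra blend
    [0.0, 0.5, 0.5],      # illustrative extra blend
    [1 / 3, 1 / 3, 1 / 3],  # illustrative extra blend
])
c1, c2, c3 = points.T

# Assumed mixture terms without an intercept: C1, C2, C3, C1*C2.
X_mix = np.column_stack([c1, c2, c3, c1 * c2])
# Same terms plus a constant column, which equals C1 + C2 + C3 on every row.
X_with_const = np.column_stack([np.ones(len(points)), X_mix])

rank_mix = np.linalg.matrix_rank(X_mix)
rank_with_const = np.linalg.matrix_rank(X_with_const)
print(rank_mix, X_mix.shape[1])                # 4 of 4 columns: estimable
print(rank_with_const, X_with_const.shape[1])  # 4 of 5 columns: intercept is redundant
```

Adding the constant column adds no new information, which is why the intercept has to go.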
Again, I'm not going to go through all of the examples,
but you're more than welcome to look at the slides offline
and check that those do, in fact, always give you the same answers.
These are all the examples.
Now with so much work, we have found the conversion of the main effects,
how we actually convert each separate factor
into the new continuous variables.
Now our original model, though, included more than just main effects.
In fact, we had the two-way interactions.
Now the idea here is that every time Factor A appears,
I'm going to substitute it with A1 Content.
Every time Factor B appears,
I'm going to substitute it
with the two terms, B1 Content and B1 Content squared.
Likewise for Factor C,
I'm going to substitute it with the four terms that I've put here.
The same holds when I'm interacting with X1,
and everything is very much in the same flavor,
logically follows the same scheme.
The only caution that you want to be aware of
and be particularly attentive about
is that every time you interact a term with the three mixture variables,
because those three mixture variables sum to one,
the main effect of that term that you originally had now needs to be excluded from the model;
otherwise, the model won't be feasible.
That's the only caution that you need to be careful about.
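The reason for this caution can be seen numerically with X1 as the interacting term: the three interaction columns X1*C1, X1*C2, and X1*C3 add up to X1 itself. The sketch below uses randomly generated illustrative points, not the actual design.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
mix = rng.dirichlet([1.0, 1.0, 1.0], size=n)  # random (C1, C2, C3) rows summing to one
x1 = rng.uniform(5.0, 20.0, size=n)           # X1 within the range tested in the talk

interactions = mix * x1[:, None]              # columns X1*C1, X1*C2, X1*C3

# The three interaction columns add up to X1 itself.
print(np.allclose(interactions.sum(axis=1), x1))

with_main = np.column_stack([mix, x1, interactions])  # 7 columns, X1 main effect kept
without_main = np.column_stack([mix, interactions])   # 6 columns, X1 main effect dropped
rank_with = np.linalg.matrix_rank(with_main)
rank_without = np.linalg.matrix_rank(without_main)
print(rank_with, rank_without)  # both 6: keeping X1 adds a redundant column
```

Keeping the X1 main effect would leave seven columns carrying only six independent directions, so the coefficients could not be uniquely estimated.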
Other than that, we're ready to go.
We've got our equivalent continuous model.
Now what we can do is, in fact, again verify that everything is still the same.
I get exactly the same predictions,
either using the categorical coding or going and using the continuous coding.
Now you might ask me, why are you going to so much trouble
and making so much of a mess if things are exactly the same?
Well, the advantage is immediate to see,
and you can really appreciate it if you start looking at the profilers.
This is how the profiler looks when you use the categorical coding.
You have to jump between the different levels.
You don't have the faintest idea what can happen in between.
With the continuous coding on the other hand,
that's exactly what you can do.
You can explore way more of the different possible ways
of making the various ingredients Factor A, B, and C
in a way that before it was just out of bounds.
In technical terms, this means that we have way more power of interpolation.
This doesn't come free, of course.
The price you pay is, in fact,
that you are implicitly making some assumptions.
The assumptions regard the way that the various new continuous variables
that we have introduced are related to the responses.
In a way, we are implicitly assuming that
the relationship between A1 Content and our properties is linear,
the relationship between B1 Content and our properties is quadratic, and so forth.
If you think that those assumptions don't really hold in your case,
then of course, the whole procedure is questionable.
You don't want to pursue this.
But if you don't have any reason why you wouldn't believe this,
or at least why you wouldn't at least explore this possibility,
then, now we can go back and do the same exercise
and explore again the input space, but with way more flexibility.
Again, let's see if we can find an optimal recipe
with this new continuous mixture coding.
How are we going to do it? Well, in exactly the same way.
I'm going to use the JMP profiler feature
and use the simulation and see if we can find anything.
Now let me go here.
This is my DOE continuous table.
Continuous, because now you can see that
these are all coded as continuous variables.
They have the blue triangle next to them.
C1, C2, and C3 also have these stars,
indicating that they're coded as mixture variables in JMP.
Now imagine that again,
we have already fitted our model with the Fit Model platform.
We saved our prediction formulas, now with the continuous coding.
What we're going to do is the same thing:
Graph, Profiler, select those, and here we go.
Here is our prediction profiler.
Now we can play way more with the profiler and see all different combinations
without having to jump between different options.
Now, once again, red triangle, output random table.
Just to make things fair, I'm going to ask for 30,000 rows.
Again, no time, literally the blink of an eye.
JMP gives you a 30,000-row table
where now every recipe is again, sorry,
every row is, again, a potential hypothetical recipe
that we haven't necessarily seen in our DOE,
but it is still feasible,
because it still respects the constraint that we had at the beginning.
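For readers without JMP, here is a rough sketch of what such a table of feasible hypothetical recipes could look like. It assumes uniform sampling over the allowed ranges and a uniform draw on the simplex for the three mixture variables; this is not a description of how JMP generates its random table, and the column names simply follow the talk.

```python
import numpy as np

rng = np.random.default_rng(2024)  # seed chosen arbitrarily for reproducibility
n_rows = 30_000

table = {
    "A1 Content": rng.uniform(0.0, 1.0, n_rows),
    "B1 Content": rng.uniform(0.0, 1.0, n_rows),
    "X1": rng.uniform(5.0, 20.0, n_rows),
}
# Uniform draws on the simplex so that C1 + C2 + C3 = 1 in every row.
c = rng.dirichlet([1.0, 1.0, 1.0], size=n_rows)
table["C1"], table["C2"], table["C3"] = c[:, 0], c[:, 1], c[:, 2]

# Every row respects the mixture constraint.
mixture_sum = table["C1"] + table["C2"] + table["C3"]
print(np.allclose(mixture_sum, 1.0))
```

Each row could then be passed through the saved prediction formulas to obtain predicted Y1 and Y2 values.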
Once again,
to figure out whether something good is happening,
or at least whether within these 30,000 formulations,
we do find something that is optimal.
I'm going to construct the same graph.
Now you can see that our points are all dispersed and are not aligned anymore.
Again, fitting the axis just to aid our visualization.
This is the nice thing.
With this way of coding and looking at more of the input space,
we do find a few formulations that seem to be promising.
Of course, we need to keep in mind that these are our predicted values.
Everything is still relying on our data, on our statistical model analysis,
but is still more promising than before.
We do find something in the optimal region defined by these two axes.
Quickly, going back to my presentation,
I want to draw a final conclusion here,
which is, in fact, that, using the categorical coding,
we couldn't find any recipe that at least on the predictive side,
could, in fact, meet both the optimality criteria.
But once we turn to
figuring out how to code these different categorical variables
into continuous and mixture variables
and exploit the JMP power of giving us thousands and thousands of formulations,
we do find a few that in fact meet the specs.
We were happy that at least we could go back to our clients and say,
look, instead of giving up on your project,
try to make these formulations and see
whether the actual properties do meet your criteria or not;
but at least it gives us some directions of improvement where to go.
With this, I'd like to end my presentation.
I'd like to thank my colleague, Xinjie Tong,
and all of my collaborators at Dow Chemical.
Thank you, all of you for watching my presentation.
I'll be more than happy
to answer any questions you might have at this point.
Thank you.