
Controlling Extrapolation in the Prediction Profiler in JMP® Pro 16 (2021-EU-45MP-751)

Laura Lancaster, JMP Principal Research Statistician Developer, SAS
Jeremy Ash, JMP Analytics Software Tester, SAS
Chris Gotwalt, JMP Director of Statistical Research and Development, SAS

 

Uncontrolled model extrapolation leads to two serious kinds of errors: (1) the model may be completely invalid far from the data, and (2) the combinations of variable values may not be physically realizable. Using the Profiler to optimize models that are fit to observational data can lead to extrapolated solutions that are of no practical use without any warning. JMP Pro 16 introduces extrapolation control into many predictive modeling platforms and the Profiler platform itself. This new feature in the Prediction Profiler alerts the user to possible extrapolation or completely avoids drawing extrapolated points where the model may not be valid. Additionally, the user can perform optimization over a constrained region that avoids extrapolation. In this presentation we discuss the motivation and usefulness of extrapolation control, demonstrate how it can be easily used in JMP, and describe details of our methods.

 

 

Auto-generated transcript...

 


Hi, I'm Chris Gotwalt. My co-presenters, Laura Lancaster and Jeremy Ash, and I are presenting a useful new JMP Pro capability called Extrapolation
Control. Almost any model that
you would ever want to predict
with has a range of
applicability, a region of the
input space, where the
predictions are considered to be
reliable enough. Outside that
region, we begin to extrapolate
the model to points far from the
data used to fit the model. Using
the predictions from that model
at those points could lead to
completely unreliable
predictions. There are two primary sources of extrapolation: statistical extrapolation and domain-based extrapolation. Both types are
covered by the new feature.
Statistical extrapolation occurs
when one is attempting to
predict using a model at an x
that isn't close to the values
used to train that model.
Domain based extrapolation
happens when you try to evaluate
at an x that is impossible due
to scientific or engineering
based constraints. The example
here illustrates both kinds of extrapolation at once.
Here we see a profiler from a
model of a metallurgy production
process. The prediction readout says -2.96 with no
indication that we're evaluating
at a combination of temperature
and pressure that is impossible
in a domain sense to attain for
this machine. We also have statistical extrapolation, as this point is far from the data used to fit the model, as seen in the scatterplot of the training data on the right. In JMP Pro 16, Jeremy,
Laura and I have collaborated to
add a new capability that can
give a warning when the profiler
thinks you might be
extrapolating. Or if you turn
extrapolation control on, it
will restrict the set of points
that you see to only those that
it doesn't think are
extrapolating. We have two types
of extrapolation control. One is
based on the concept of leverage
and uses a least squares model.
This first type is only
available in the Pro version of
Fit Model least squares. The
other type we call general
machine learning extrapolation
control and is available in the
Profiler platform and several of
the most common machine learning
platforms in JMP Pro. Upon
request, we could even add it to
more. Least squares
extrapolation control uses the
concept of leverage, which is
like a scaled version of the
prediction variance. It is model-based, so it uses information about the main effects, interactions, and higher-order terms to determine extrapolation. For the general
machine learning extrapolation
control case, we had to come up
with our own approach. We
wanted a method that would be robust to missing values and linear dependencies, fast to compute, and able to handle mixtures of
continuous and categorical input
variables, and we also
explicitly wanted to separate
the extrapolation model from the
model used to fit the data. So
when we have general
extrapolation control turned on,
there's still only one supervised model that fits the input variables to the responses, which is what we see in the profiler traces. The profiler also comes up with a quick-and-dirty unsupervised model to describe the training-set X's, and this
unsupervised model is used
behind the scenes by the
profiler to determine the
extrapolation control
constraint. So I'm having to
switch because PowerPoint and my
camera aren't getting along
right now for some reason. We
know that risky extrapolations
are being made every day by
people working in data science
and are confident that the use
of extrapolations leads to poor
predictions and ultimately to
poor business outcomes.
Extrapolation control places
guardrails on model predictions
and will lead to quantifiably
better decisions by JMP Pro
users. When users see an extrapolation occurring, they must decide whether the prediction should be used, based on their domain knowledge and familiarity with the problem at hand. If you
start seeing extrapolation
control warnings happen quite
often, it is likely the end of the life cycle for that model and time to refit it to new data,
because the distribution of the
inputs has shifted away from
that of the training data. We
are honestly quite surprised and
alarmed that the need for
identifying extrapolation isn't
better appreciated by the data
science community and have made
controlling extrapolation as
easy and automatic as possible.
Laura, who developed it in JMP
Pro, will be demonstrating the
option up next. Then Jeremy, who
did a lot of research on our
team, will go into the math
details and statistical
motivation for the approach.
Hello, my name is Laura
Lancaster and I'm here to do a
demo of the extrapolation
control that was added to JMP
Pro 16. I wanted to start off
with a fairly simple example
using the fit model least
squares platform. I'm gonna
use some data that may be
familiar; it's the Fitness data
that's in sample data and I'm
going to use Oxygen Uptake as
my response and Run Time, Run
Pulse and Max Pulse as my
predictors. And I wanted to
reiterate that in Fit Model's Fit Least Squares, the extrapolation metric that's used is leverage. So let's go ahead and switch over to JMP.
So now I have the fitness data
open in JMP and I have a script
saved to the data table to
automatically launch my fit
least squares model. So I'm
going to go ahead and run that
script, it launches the least
squares platform. And I have the
profiler automatically open. And
we can see that the profiler
looks like it always has in the
past, where the factor boundaries
are defined by the range of each
factor individually, giving us
rectangular bound constraints.
And when I change the factor
settings, because of these bound
constraints, it can be really
hard to tell if you're moving
far outside the correlation
structure of the data.
And this is why we wanted to add
the extrapolation control. So
this has been added to several
of the platforms in JMP Pro
16, including fit least squares.
And to get to the extrapolation
control, you go to the options under the Profiler menu. So if I look
here, I see there's a new option
called Extrapolation Control.
It's set to off by default,
but I can turn it to either
on or warning on to turn on
extrapolation control. If I
turn it to on, notice that
it restricts my profile
traces to only go to values
where I'm not extrapolating.
If I were to turn it to warning
on, I would see the full profile
traces, but I would get a
warning when I go to a region
where it would be considered
to be extrapolation.
I can also turn on extrapolation
details, which I find really
helpful, and that gives me a
lot more information. First of
all, it tells me that my
metric that I'm using to
define extrapolation is
leverage, which is true in the
fit least squares platform.
And the threshold that's being
used by default initially is
going to be maximum leverage,
but this is something I can
change and I will show you that
in a minute. Also, I can see
what my extrapolation metric
is for my current settings.
It's this number right here,
which will change as I change
my factor settings.
Anytime this number is greater
than the threshold, I'm going to
get this warning that I might be
extrapolating. If it goes below,
I will no longer get that
warning. This threshold is not
going to change unless I change
something in the menu to adjust
my threshold. So let me go ahead
and do that right now. So I'm going
to go to the menu
and I'm going to go to set
threshold criterion. So
in Fit Least Squares, you have two options for the threshold. Initially, it's set to maximum leverage, which is going to keep you within the convex hull of the data, or you can switch to a multiplier times the average leverage, which is the number of model terms divided by the number of observations. And I want to
switch to that threshold. So it's
set to 3 as the multiplier
by default. So this is going to
be 3 times the average leverage
and I click OK, and notice that
my threshold is going to change.
It actually got smaller, so this
is a more conservative
definition of extrapolation.
And I'm going to turn it back to
on to restrict my profile traces.
And now I can only go to
regions where I'm within 3
times the average leverage.
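As a rough illustration of what this leverage check involves (a minimal NumPy sketch, not JMP code; the data here are hypothetical stand-ins for the Fitness example):

    import numpy as np

    # Hypothetical stand-in for the training design matrix:
    # an intercept column plus Run Time, Run Pulse, and Max Pulse.
    rng = np.random.default_rng(1)
    X = np.column_stack([np.ones(31), rng.normal(size=(31, 3))])
    n, p = X.shape

    XtX_inv = np.linalg.inv(X.T @ X)
    train_leverage = np.einsum('ij,jk,ik->i', X, XtX_inv, X)  # diagonal of the hat matrix

    max_leverage_threshold = train_leverage.max()  # stays within the convex hull of the data
    avg_leverage_threshold = 3 * p / n             # 3 times the average leverage

    def is_extrapolating(x_new, threshold):
        """Flag a prediction point (coded like a row of X) as possible extrapolation."""
        return float(x_new @ XtX_inv @ x_new) > threshold

    print(is_extrapolating(np.array([1.0, 0.5, -0.2, 2.5]), avg_leverage_threshold))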
Now we have also
implemented optimization
that obeys the
extrapolation
constraints. So now if I
turn on set desirability
and I do the optimization,
I will get an optimal value that
satisfies the extrapolation
constraint. Notice that this
metric is less than or equal to
the threshold. So now let's go to my next slide, which uses a scatterplot matrix to compare the optimal value with extrapolation control turned on and with it turned off.
So this is the scatterplot
matrix that I created with JMP,
and it shows the original
predictor variable data, as well
as the predictor variable values
for the optimal solution using
no extrapolation control, in
blue, and the optimal solution using
extrapolation control in red.
And notice how the unconstrained
solution here in blue,
right here, violates the
correlation structure for the
original data for run pulse and
Max pulse, thus increasing the
uncertainty of this prediction.
Whereas the optimal solution
that did use extrapolation
control is much more in line
with the original data.
Now let's look at an example
using the more generalized
extrapolation control method,
which we refer to as a
regularized T squared method. As
Chris mentioned earlier, we
developed this method for models
other than least squares models.
So we're going to look at a
neural model for the Diabetes
data that is also in the sample
data. The response is a measure
of disease progression, and the
predictors are the baseline
variables. Once again, the
extrapolation metric used for
this example is the
regularized T squared that Jeremy will be describing in
more detail in a few minutes.
So I have the Diabetes data open in
JMP and I have a script saved
of my neural model fits. I'm
going to go ahead and run that
script. It launches the neural
platform, and notice that I am
using a validation method, random holdback. I just wanted to note
that anytime you use a
validation method, the
extrapolation control is based
only on the training data
and not your validation
or test data.
So I have the profiler open and
you can see that it's using the
full traces. Extrapolation
control is not turned on. Let's
go ahead and turn it on.
And I'm also going to
turn on the details.
You can see that the traces have
been restricted and the metric
is the regularized T squared. The
threshold is 3 times the
standard deviation of the sample
regularized T squared. Jeremy is
going to talk more about what
all that means exactly in a few
minutes. And I just wanted to
mention that when we're using
the regularized T squared
method, there's only one choice
for threshold, but you can
adjust the multiplier. So if you
go to extrapolation control, set
threshold, you can adjust this
multiplier, but I'm going to
leave it at 3. And now I
want to run optimization using
extrapolation control. So I'm
just going to do Maximize and Remember. Now I have an
optimal solution with
extrapolation control turned
on. And so now I want to look
at our scatterplot matrix, just
like we looked at before, with
the original data, as well as
with the optimal values with
and without extrapolation
control.
So this is a scatterplot matrix
of the Diabetes data that I
created in JMP. It's got the
original predictor values, as
well as the optimal solution
using extrapolation control in
red, and optimal solution without
extrapolation control in blue.
And you can see that the red
dots appear to be much more
within the correlation structure
of the original data than the
blue, and that's particularly
true when you look at this LDL
versus total cholesterol.
So now let's look at an example
using the profiler that's under
the graph menu, which I'll call
the graph profiler. It also uses
the regularized T squared method
and it allows us to use
extrapolation control on any
type of model that can be
created and saved as a JSL
formula. It also allows us to
have extrapolation control on
more than one model at a time.
So let's look at an example
for a company that uses powder
metallurgy technology to
produce steel drive shafts for
the automotive industry.
They want to be able to find
optimal settings for their
production that will minimize
shrinkage and also minimize failures due to bad surface conditions. So we have two responses: shrinkage (which is
continuous and we're going to
fit a least squares model for
that) and surface condition (which
is pass/fail and we're going to
fit a nominal logistic model for
that one). And our predictor
variables are just some key
process variables in production.
And once again, the extrapolation metric is the regularized T squared.
So I have the powder
metallurgy data open in JMP
and I've already fit a least
squares model for my shrinkage
response, and I've already fit a
nominal logistic model for the
surface condition pass/fail
response, and I've saved the
prediction formulas to the data
table so that they are ready to
be used in the graph profiler.
So if I go to the graph menu
profiler, I can load up the
prediction formula for shrinkage
and my prediction formula for
the surface condition.
Click OK. And now I have
both of my models launched into
the graph profiler.
And before I turn on
extrapolation control, you
can see that I have the full
profile traces. Once I turn on
extrapolation control
you can see that the traces
shrink a bit, and I'm also going
to turn on the details,
just to show that indeed I am
using the regularized T squared
here in this method.
So what I really want to do is I
want to find the optimal
conditions where I minimize
shrinkage and I minimize
failures with extrapolation
control and I want to make sure
I'm not extrapolating. I want to
find a useful solution. And
before I can do the optimization,
I actually need to set my
desirabilities. So I'm going to
set desirabilities. It's already
correct for shrinkage, but I
need to set them for the surface
condition. I'm going to try to maximize
passes and minimize failures.
OK.
And now I should be able to do
the optimization with
extrapolation controls on.
Do Maximize and Remember.
And now I have my optimal
solution with extrapolation
control on. So now let's look
once again at the
scatterplot matrix of the
original data, along with the
solution with extrapolation control on and the solution with extrapolation control off.
So this is a scatterplot matrix
of the powder metallurgy data
that I created in JMP. And it
also has the optimal solution
with extrapolation control as a
red dot, and the optimal
solution with no extrapolation
control as a blue dot. And once
again you can see that when we
don't enact the extrapolation
control, the optimal solution
is pretty far outside of the
correlation structure of the
data. We can especially see
that here with ratio versus
compaction pressure.
So now I want to hand over
the presentation to Jeremy
to go into a lot more
detail about our methods.
Hi, so here are a number of
goals for extrapolation control
that we laid out at the
beginning of the project. We
needed an extrapolation metric
that could be computed quickly
with a large number of
observations and variables, and
we needed a quick way to assess
whether the metric indicated
extrapolation or not. This was
to maintain the interactivity of
the profiler traces and
we needed this to
perform optimization.
We wanted to be able to
support the various variable
types available in the
profiler. These are
essentially continuous,
categorical and ordinal.
We wanted to utilize
observations with missing cells,
because some modeling methods
will include these observations
in model training.
We wanted a method that was
robust to linear dependencies in
the data. These occur when the
number of variables is larger
than the number of observations,
for example. And we wanted
something that was easy to
automate without the need for a
lot of user input.
For least squares models, we
landed on leverage, which is
often used to identify outliers
in linear models. The leverage
for a new prediction point is
computed according to this
formula. There are many
interpretations for leverage.
One interpretation is that it's
the multivariate distance of a
prediction point from the center
of the training data. Another
interpretation is that it is a
scaled prediction variance. So
as a prediction point moves
further away from the center
of the data, the uncertainty
of prediction increases. And we
use two common thresholds in
the statistical literature for
determining if this distance
is too large. The first is
maximum leverage; prediction points beyond this threshold are outside the convex hull of
the training data.
And the second is 3 times the
average of the leverages. It
can be shown that this is
equivalent to three times the
number of model terms divided
by the number of observations.
And as Laura described
earlier, you can change the
multiplier of these
thresholds.
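The formula on the slide is not reproduced in the transcript; the standard definition being described, for a new prediction point x_0 and an n-by-p training model matrix X, is

\[
h(x_0) = x_0^{\top} \left( X^{\top} X \right)^{-1} x_0 ,
\]

with the two thresholds

\[
\max_i \, h(x_i)
\qquad \text{and} \qquad
3 \cdot \frac{1}{n} \sum_{i=1}^{n} h(x_i) \;=\; \frac{3p}{n},
\]

the second equality holding because the training leverages sum to p, the number of model terms.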
Finally, when desirabilities
are being optimized, the
extrapolation constraint is a
nonlinear constraint, and
previously the profiler allowed
constrained optimization with
linear constraints. This type of
optimization is more
challenging, so Laura implemented
a genetic algorithm. And if you
aren't familiar with these,
genetic algorithms use the
principles of molecular
evolution to optimize
complicated cost functions.
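JMP's genetic algorithm itself isn't shown here; as a loose sketch of the idea, with hypothetical objective and constraint functions and the nonlinear extrapolation constraint handled through a penalty:

    import numpy as np

    rng = np.random.default_rng(0)

    def desirability(x):
        # Hypothetical stand-in for the profiler's desirability function.
        return -np.sum((x - 0.3) ** 2, axis=-1)

    def extrapolation_metric(x):
        # Hypothetical stand-in for leverage or the regularized T-squared.
        return np.sum(x ** 2, axis=-1)

    THRESHOLD = 1.0   # extrapolation-control threshold
    PENALTY = 1e3     # weight applied when the constraint is violated

    def fitness(pop):
        violation = np.maximum(extrapolation_metric(pop) - THRESHOLD, 0.0)
        return desirability(pop) - PENALTY * violation

    def genetic_optimize(dim=4, pop_size=60, generations=200, bounds=(-2.0, 2.0)):
        lo, hi = bounds
        pop = rng.uniform(lo, hi, size=(pop_size, dim))
        for _ in range(generations):
            fit = fitness(pop)
            best = pop[np.argmax(fit)].copy()                  # elitism
            i, j = rng.integers(pop_size, size=(2, pop_size))  # tournament selection
            parents = np.where((fit[i] > fit[j])[:, None], pop[i], pop[j])
            mates = parents[rng.permutation(pop_size)]
            mask = rng.random(parents.shape) < 0.5             # uniform crossover
            children = np.where(mask, parents, mates)
            children += rng.normal(scale=0.05, size=children.shape)  # Gaussian mutation
            pop = np.clip(children, lo, hi)
            pop[0] = best
        return pop[np.argmax(fitness(pop))]

    best_settings = genetic_optimize()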
Next, I'll talk about the
approach we used to generalize
extrapolation control to models
other than linear models. When
you're constructing a predictive
model in JMP, you start with a
set of predictor variables and a
set of response variables. Some
supervised model is trained, and
then a profiler can be used to
visualize the model surface.
There are numerous variations of
the profiler in JMP. You can
use the profiler internally in
modeling platforms. You can
output prediction formulas and
build a profiler for multiple
models, as Laura demonstrated. You can construct profilers for ensemble models. We wanted an
extrapolation control method
that would generalize to all these
scenarios, so instead of
tying our method to a
specific model, we're going
to use an unsupervised
approach.
And we're only going to flag a
prediction point as
extrapolation if it's far
outside where the data are
concentrated in the predictor
space. And this allows us to
be consistent across
profilers so that our
extrapolation control method
will plug into any profiler.
The multivariate distance
interpretation of leverage
suggested Hotelling's T squared as
a distance for general
extrapolation control. In fact,
some algebraic manipulation will
show that Hotelling's T squared is
just leverage shifted and
scaled. This figure shows how
Hotelling's T squared measures
which ellipse an observation
lies on, where the ellipses are
centered at the mean of the
data, and the shape is defined
by the covariance matrix.
Since we're no longer in
linear models, this metric
doesn't have the same
connection to prediction
variance. So instead of
relying on thresholds used
back in linear models, we're
going to make some
distributional assumptions
to determine if T squared
for prediction point should
be considered extrapolation.
Here I'm showing the formula for
Hotelling's T squared. The mean and covariance matrix are estimated using the training data for the model. If P is less than N, where P is the number of predictors and N is the number of observations, and if the predictors are multivariate normal, then T squared for a prediction point has an F distribution. However, we wanted
a method that generalizes to data sets with complicated data types: a mix of continuous and categorical variables, data sets where P is larger than N, and data sets with missing values. So instead of
working out the distributions
analytically in each case, we
used a simple conservative
control limit that we found
works well in practice. This is
a three Sigma control limit
using the empirical distribution
of T squared from the training
data and, as Laura mentioned, you
can also tune this multiplier.
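The slide formula is likewise not reproduced in the transcript; written out, the distance being described for a prediction point x, with mean vector \(\bar{x}\) and covariance matrix S estimated from the training predictors, is

\[
T^2(x) = (x - \bar{x})^{\top} S^{-1} (x - \bar{x}),
\]

and one reading of the empirical three-sigma control limit described here is

\[
\text{threshold} \;=\; \overline{T^2_{\mathrm{train}}} \;+\; 3 \, \hat{\sigma}\!\left( T^2_{\mathrm{train}} \right),
\]

computed from the training-set T-squared values, with the multiplier of 3 adjustable as noted.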
One complication is that when P
is larger than N, Hotelling's T
squared is undefined. There are
too many parameters in the
covariance matrix to estimate
with the available data, and
this often occurs in typical use
cases for extrapolation control
like in partial least squares.
So we decided on a novel
approach to computing Hotelling's T
squared, which deals with these
cases, and we're calling it a
regularized T squared.
To compute the covariance
matrix we use a regularized
estimator originally
developed by Schafer and
Strimmer for high
dimensional genomics data.
It's just a weighted
combination of the full
sample covariance matrix,
which is U here, and a constraint target matrix, which is D.
For the Lambda weight
parameter, Schafer and Strimmer
derived an analytical
expression that minimizes the
MSE of the estimator
asymptotically.
Schafer and Strimmer proposed
several possible target
matrices. The target matrix we
chose was a diagonal matrix with
the sample variances of the
predictor variables on the
diagonal. This target matrix has
a number of advantages for
extrapolation control. First, we
don't assume any correlation
structure between the variables
before seeing the data, which
works well as a general prior.
Also, when there's little data
to estimate the covariance
matrix, either due to small N or
a large fraction missing, the
elliptical constraint is
expanded by a large weight on
the diagonal matrix, and this
results in a more conservative
test for extrapolation control.
We found this was necessary to
obtain reasonable control of the
false positive rate. To put this
more simply, when there's
limited training data, the
regularized T squared is less
likely to label predictions as
extrapolation, which is what you
want, because you're more
likely to observe covariances
by chance. We have some
simulation results
demonstrating these details,
but I don't have time to go
into all that. Instead, on the Community webpage we put a link to a paper on arXiv, and we plan to submit this to the Journal of Computational and Graphical Statistics.
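As a rough sketch of the computation just described (not JMP's implementation: the shrinkage weight lambda is fixed here for simplicity rather than taken from the Schafer and Strimmer analytical expression, and the three-sigma limit is read as mean plus three standard deviations of the training values):

    import numpy as np

    def regularized_t2_check(X_train, x_new, lam=0.3, multiplier=3.0):
        """Illustrative regularized T-squared extrapolation check."""
        mu = X_train.mean(axis=0)
        U = np.cov(X_train, rowvar=False)   # full sample covariance matrix (U)
        D = np.diag(np.diag(U))             # diagonal target: sample variances only (D)
        S = (1 - lam) * U + lam * D         # weighted combination, shrinking toward D
        S_inv = np.linalg.inv(S)

        def t2(x):
            d = x - mu
            return float(d @ S_inv @ d)

        t2_train = np.array([t2(row) for row in X_train])
        threshold = t2_train.mean() + multiplier * t2_train.std(ddof=1)
        return t2(x_new), threshold, t2(x_new) > threshold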
This next slide shows some other
important details we needed to
consider. We needed to figure
out how to deal with categorical
variables. We are just
converting them into indicator-
coded dummy variables. This is
comparable to a multiple
correspondence analysis. Another
complication is how to compute
Hotelling's T squared when
there's missing data. Several
JMP predictive modeling
platforms use observations with
missing data to train their
models. These include naive
Bayes and Bootstrap forest. And
these formulas are showing the
pairwise deletion method we
used to estimate the covariance
matrix. It's more common to use
row wise deletion. This means
all observations with missing
values are deleted before
computing the covariance matrix.
And this is simplest, but it can
result in throwing out useful
data if the sample size of the
training data is small. With
pairwise deletion, observations are deleted only if there are missing values in the pair of variables used to compute the corresponding entry, and that's what these formulas are showing.
Seems like a simple thing to do.
You're just using all the data
that's available, but it
actually can lead to a host of
problems because there are
different observations used to
compute each entry. This can
cause weird things to happen,
like covariance matrices with
negative eigenvalues, which is
something we had to deal with.
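The slide formulas are not reproduced in the transcript; as an illustration only (not JMP's code), a pairwise-deletion estimate of the covariance matrix can be sketched like this, and it also shows why the result need not be positive semidefinite:

    import numpy as np

    def pairwise_covariance(X):
        """Covariance via pairwise deletion: entry (j, k) uses only the rows
        where both column j and column k are observed (non-missing)."""
        X = np.asarray(X, dtype=float)
        n, p = X.shape
        obs = ~np.isnan(X)
        # One common convention: per-column means over all observed values.
        means = np.array([X[obs[:, j], j].mean() for j in range(p)])
        S = np.empty((p, p))
        for j in range(p):
            for k in range(p):
                both = obs[:, j] & obs[:, k]
                dj = X[both, j] - means[j]
                dk = X[both, k] - means[k]
                S[j, k] = (dj * dk).sum() / (both.sum() - 1)
        return means, S

    # Because different subsets of rows feed different entries, S can end up with
    # negative eigenvalues, which is the kind of problem mentioned above; shrinking
    # toward a diagonal target helps repair this.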
Here are a few advantages of
the regularized T squared we
found when comparing to other
methods in our evaluations. One
is that the regularization
works the way regularization
normally works. It strikes a
balance between overfitting the
training data and over biasing
the estimator. This makes the
estimator more robust to noise
and model misspecification.
Next, Schafer and Strimmer
showed in their paper that
regularization results in a
more accurate estimator in
high dimensional settings.
This helps with the curse of dimensionality, which plagues most distance-based methods
for extrapolation control.
Also, the fields that have developed methodology for extrapolation control often have both high-dimensional data and highly correlated predictors. For
example in cheminformatics and
chemometrics, the chemical
features are often highly
correlated. Extrapolation control
is often used in combination
with PCA and PLS models, where
T squared and DModX are used to detect violations of the correlation structure. This is similar to what we do in the Model Driven Multivariate Control Chart platform.
Since this is a common use case,
we wanted to have an option that
didn't deviate too far from
these methods. Our regularized T
squared provides the same type
of extrapolation control, but it
doesn't require a projection step, which has some advantages.
We found that this allows us to
better generalize to other types
of predictive models. Also, in
our evaluations we observed that
if a linear projection doesn't
work well for your data, like
you have nonlinear relationships
between predictors, the errors
can inflate the control limits
of projection based methods,
which will lead to poor
protection against
extrapolation, and our approach
is more robust to this.
And then another important point
is that we found a single extrapolation metric
was much simpler to use and
interpret.
And here is a quick summary of
the features of extrapolation
control. The method provides better
visualization of feasible
regions in high dimensional
models in the profiler.
A new genetic algorithm has
been implemented for flexible
constrained optimization.
Our regularized T squared
handles messy observational
data, cases like P larger
than N, and continuous and
categorical variables.
The method is available in most
of the predictive models in JMP Pro 16 and supports many of their idiosyncrasies. It's also
available in the profiler under the Graph menu, which really opens up its
utility because you can operate
on any prediction formula.
And then as a future direction,
we're considering implementing
a K-nearest neighbor based
constraint that would go beyond
the current correlation
structure constraint. Often
predictors are generated by
multiple distributions resulting
in clustering in the predictor
space. And a K-nearest neighbors
based approach would enable
us to control extrapolation
between clusters.
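Purely to illustrate the kind of nearest-neighbor constraint being considered (a hypothetical sketch, not an existing JMP feature):

    import numpy as np

    def knn_extrapolation_flag(X_train, x_new, k=5, multiplier=3.0):
        """Flag x_new if its mean distance to its k nearest training points is
        unusually large relative to leave-one-out distances in the training set."""
        def knn_dist(x, X):
            d = np.sort(np.linalg.norm(X - x, axis=1))
            return d[:k].mean()

        train_d = np.array([knn_dist(X_train[i], np.delete(X_train, i, axis=0))
                            for i in range(len(X_train))])
        threshold = train_d.mean() + multiplier * train_d.std(ddof=1)
        return knn_dist(x_new, X_train) > threshold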
So thanks to everyone who
tuned in to watch this and
here are our emails if you have
any further questions.

 
