Controlling Extrapolation in the Prediction Profiler in JMP® Pro 16 (2021-EU-45MP-751)

Level: Intermediate

 

Laura Lancaster, JMP Principal Research Statistician Developer, SAS
Jeremy Ash, JMP Analytics Software Tester, SAS
Chris Gotwalt, JMP Director of Statistical Research and Development, SAS

 

Uncontrolled model extrapolation leads to two serious kinds of errors: (1) the model may be completely invalid far from the data, and (2) the combinations of variable values may not be physically realizable. Using the Profiler to optimize models that are fit to observational data can, without any warning, lead to extrapolated solutions that are of no practical use. JMP Pro 16 introduces extrapolation control into many predictive modeling platforms and the Profiler platform itself. This new feature in the Prediction Profiler alerts the user to possible extrapolation or completely avoids drawing extrapolated points where the model may not be valid. Additionally, the user can perform optimization over a constrained region that avoids extrapolation. In this presentation we discuss the motivation and usefulness of extrapolation control, demonstrate how it can be easily used in JMP, and describe details of our methods.

 

 

Auto-generated transcript...

 


Hi, I'm Chris Gotwalt. My co-presenters, Laura Lancaster and Jeremy Ash, and I are presenting a useful new JMP Pro
capability called Extrapolation
Control. Almost any model that
you would ever want to predict
with has a range of
applicability, a region of the
input space, where the
predictions are considered to be
reliable enough. Outside that
region, we begin to extrapolate
the model to points far from the
data used to fit the model. Using
the predictions from that model
at those points could lead to
completely unreliable
predictions. There are two primary sources of extrapolation: statistical extrapolation and domain-based extrapolation. Both types are
covered by the new feature.
Statistical extrapolation occurs
when one is attempting to
predict using a model at an x
that isn't close to the values
used to train that model.
Domain-based extrapolation happens when you try to evaluate at an x that is impossible due to scientific or engineering-based constraints. The example
here illustrates both kinds of
extrapolation in one example.
Here we see a profiler from a
model of a metallurgy production
process. The prediction readout says -2.96 with no
indication that we're evaluating
at a combination of temperature
and pressure that is impossible
in a domain sense to attain for
this machine. We also have statistical extrapolation, as this point is far from the data used to fit the model, as seen in the scatter
plot of the training data on the
right. In JMP Pro 16, Jeremy,
Laura and I have collaborated to
add a new capability that can
give a warning when the profiler
thinks you might be
extrapolating. Or if you turn
extrapolation control on, it
will restrict the set of points
that you see to only those that
it doesn't think are
extrapolating. We have two types
of extrapolation control. One is
based on the concept of leverage
and uses a least squares model.
This first type is only
available in the Pro version of
Fit Model least squares. The
other type we call general
machine learning extrapolation
control and is available in the
Profiler platform and several of
the most common machine learning
platforms in JMP Pro. Upon
request, we could even add it to
more. Least squares
extrapolation control uses the
concept of leverage, which is
like a scaled version of the
prediction variance. It is model-based, and so it uses information about the main effects, interactions, and higher-order terms to determine the
extrapolation. For the general
machine learning extrapolation
control case, we had to come up
with our own approach. We
wanted a method that would be robust to missing values and linear dependencies, fast to compute, and able to handle mixtures of
continuous and categorical input
variables, and we also
explicitly wanted to separate
the extrapolation model from the
model used to fit the data. So when we have general extrapolation control turned on, there is still only one supervised model, the one that fits the input variables to the responses and that we see in the profiler traces. The profiler comes up with a quick-and-dirty unsupervised model to describe the training-set x's, and this
unsupervised model is used
behind the scenes by the
profiler to determine the
extrapolation control
constraint. So I'm having to
switch because PowerPoint and my
camera aren't getting along
right now for some reason. We
know that risky extrapolations
are being made every day by
people working in data science
and are confident that the use
of extrapolations leads to poor
predictions and ultimately to
poor business outcomes.
Extrapolation control places
guardrails on model predictions
and will lead to quantifiably
better decisions by JMP Pro
users. When users see an extrapolation occurring, they must decide whether the prediction should be used or not, based on their domain knowledge and familiarity with the problem at hand. If you
start seeing extrapolation
control warnings happen quite
often, it is likely the end of the life cycle for that model and time to refit it to new data
because the distribution of the
inputs has shifted away from
that of the training data. We
are honestly quite surprised and
alarmed that the need for
identifying extrapolation isn't
better appreciated by the data
science community and have made
controlling extrapolation as
easy and automatic as possible.
Laura, who developed it in JMP
Pro, will be demonstrating the
option up next. Then Jeremy, who
did a lot of research on our
team, will go into the math
details and statistical
motivation for the approach.
Hello, my name is Laura
Lancaster and I'm here to do a
demo of the extrapolation
control that was added to JMP
Pro 16. I wanted to start off
with a fairly simple example
using the fit model least
squares platform. I'm going to
use some data that may be
familiar; it's the Fitness data
that's in sample data and I'm
going to use Oxygen Uptake as
my response and Run Time, Run
Pulse and Max Pulse as my
predictors. And I wanted to
reiterate that in Fit Model, Fit Least Squares, the extrapolation metric that's used is leverage. So let's go ahead and switch over to JMP.
So now I have the fitness data
open in JMP and I have a script
saved to the data table to
automatically launch my fit
least squares model. So I'm
going to go ahead and run that
script, it launches the least
squares platform. And I have the
profiler automatically open. And
we can see that the profiler
looks like it always has in the
past, where the factor boundaries
are defined by the range of each
factor individually, giving us
rectangular bound constraints.
And when I change the factor
settings, because of these bound
constraints, it can be really
hard to tell if you're moving
far outside the correlation
structure of the data.
And this is why we wanted to add
the extrapolation control. So
this has been added to several
of the platforms in JMP Pro
16, including fit least squares.
And to get to the extrapolation
control, you go to the menu under
the profiler menu. So if I look
here, I see there's a new option
called Extrapolation Control.
It's set to off by default,
but I can turn it to either
on or warning on to turn on
extrapolation control. If I
turn it to on, notice that
it restricts my profile
traces to only go to values
where I'm not extrapolating.
If I were to turn it to warning
on, I would see the full profile
traces, but I would get a
warning when I go to a region
where it would be considered
to be extrapolation.
I can also turn on extrapolation
details, which I find really
helpful, and that gives me a
lot more information. First of
all, it tells me that my
metric that I'm using to
define extrapolation is
leverage, which is true in the
fit least squares platform.
And the threshold that's being
used by default initially is
going to be maximum leverage,
but this is something I can
change and I will show you that
in a minute. Also, I can see
what my extrapolation metric
is for my current settings.
It's this number right here,
which will change as I change
my factor settings.
Anytime this number is greater
than the threshold, I'm going to
get this warning that I might be
extrapolating. If it goes below,
I will no longer get that
warning. This threshold is not
going to change unless I change
something in the menu to adjust
my threshold. So let me go ahead
and do that right now. So I'm going
to go to the menu
and I'm going to go to set
threshold criterion. So
in fit least squares, you have two options for the threshold. Initially, it's set to maximum leverage, which is going to keep you within the convex hull of the data, or you can switch to a multiplier times the average leverage, or model terms over observations. And I want to
switch to that threshold. So it's
set to 3 as the multiplier
by default. So this is going to
be 3 times the average leverage
and I click OK, and notice that
my threshold is going to change.
It actually got smaller, so this
is a more conservative
definition of extrapolation.
And I'm going to turn it back to
on to restrict my profile traces.
And now I can only go to
regions where I'm within 3
times the average leverage.
Now we have also
implemented optimization
that obeys the
extrapolation
constraints. So now if I
turn on set desirability
and I do the optimization,
I will get an optimal value that
satisfies the extrapolation
constraint. Notice that this
metric is less than or equal to
the threshold. So now let me go to my next slide, which is going to compare, in a scatterplot matrix, the difference between the optimal value with extrapolation control turned on and with it turned off.
So this is the scatterplot
matrix that I created with JMP,
and it shows the original
predictor variable data, as well
as the predictor variable values
for the optimal solution using
no extrapolation control, in
blue, and the optimal solution using
extrapolation control in red.
And notice how the unconstrained
solution here in blue,
right here, violates the
correlation structure for the
original data for run pulse and
Max pulse, thus increasing the
uncertainty of this prediction.
Whereas the optimal solution
that did use extrapolation
control is much more in line
with the original data.
Now let's look at an example
using the more generalized
extrapolation control method,
which we refer to as a
regularized T squared method. As
Chris mentioned earlier, we
developed this method for models
other than least squares models.
So we're going to look at a
neural model for the Diabetes
data that is also in the sample
data. The response is a measure
of disease progression, and the
predictors are the baseline
variables. Once again, the
extrapolation metric used for
this example is the
regularized T squared that
Jeremy will be describing in
more detail in a few minutes.
So I have the Diabetes data open in
JMP and I have a script saved
of my neural model fits. I'm
going to go ahead and run that
script. It launches the neural
platform, and notice that I am
using the validation method random holdback. I just wanted to note
that anytime you use a
validation method, the
extrapolation control is based
only on the training data
and not your validation
or test data.
So I have the profiler open and
you can see that it's using the
full traces. Extrapolation
control is not turned on. Let's
go ahead and turn it on.
And I'm also going to
turn on the details.
You can see that the traces have
been restricted and the metric
is the regularized T squared. The
threshold is 3 times the
standard deviation of the sample
regularized T squared. Jeremy is
going to talk more about what
all that means exactly in a few
minutes. And I just wanted to
mention that when we're using
the regularized T squared
method, there's only one choice
for threshold, but you can
adjust the multiplier. So if you
go to extrapolation control, set
threshold, you can adjust this
multiplier, but I'm going to
leave it at 3. And now I
want to run optimization using
extrapolation control. So I'm
just going to do Maximize and Remember. Now I have an
optimal solution with
extrapolation control turned
on. And so now I want to look
at our scatterplot matrix, just
like we looked at before, with
the original data, as well as
with the optimal values with
and without extrapolation
control.
So this is a scatterplot matrix
of the Diabetes data that I
created in JMP. It's got the
original predictor values, as
well as the optimal solution
using extrapolation control in
red, and optimal solution without
extrapolation control in blue.
And you can see that the red
dots appear to be much more
within the correlation structure
of the original data than the
blue, and that's particularly
true when you look at this LDL
versus total cholesterol.
So now let's look at an example
using the profiler that's under
the graph menu, which I'll call
the graph profiler. It also uses
the regularized T squared method
and it allows us to use
extrapolation control on any
type of model that can be
created and saved as a JSL
formula. It also allows us to
have extrapolation control on
more than one model at a time.
So let's look at an example
for a company that uses powder
metallurgy technology to
produce steel drive shafts for
the automotive industry.
They want to be able to find
optimal settings for their
production that will minimize
shrinkage and also minimize failures due to bad surface conditions. So we have two responses: shrinkage (which is
continuous and we're going to
fit a least squares model for
that) and surface condition (which
is pass/fail and we're going to
fit a nominal logistic model for
that one). And our predictor
variables are just some key
process variables in production.
And once again, the extrapolation metric is the regularized T squared.
So I have the powder
metallurgy data open in JMP
and I've already fit a least
squares model for my shrinkage
response, and I've already fit a
nominal logistic model for the
surface condition pass/fail
response, and I've saved the
prediction formulas to the data
table so that they are ready to
be used in the graph profiler.
So if I go to the graph menu
profiler, I can load up the
prediction formula for shrinkage
and my prediction formula is for
the surface condition.
Click OK. And now I have
both of my models launched into
the graph profiler.
And before I turn on
extrapolation control, you
can see that I have the full
profile traces. Once I turn on
extrapolation control
you can see that the traces
shrink a bit, and I'm also going
to turn on the details,
just to show that indeed I am
using the regularized T squared
here in this method.
So what I really want to do is I
want to find the optimal
conditions where I minimize
shrinkage and I minimize
failures with extrapolation
control and I want to make sure
I'm not extrapolating. I want to
find a useful solution. And
before I can do the optimization,
I actually need to set my
desirabilities. So I'm going to
set desirabilities. It's already
correct for shrinkage, but I
need to set them for the surface condition. I'm going to try to maximize passes and minimize failures. OK.
And now I should be able to do
the optimization with
extrapolation controls on.
Do Maximize and Remember.
And now I have my optimal
solution with extrapolation
control on. So now let's look
once again at the
scatterplot matrix of the
original data, along with the solution with extrapolation control on and the solution with extrapolation control off.
So this is a scatterplot matrix
of the powder metallurgy data
that I created in JMP. And it
also has the optimal solution
with extrapolation control as a
red dot, and the optimal
solution with no extrapolation
control as a blue dot. And once
again you can see that when we
don't enact the extrapolation
control, the optimal solution
is pretty far outside of the
correlation structure of the
data. We can especially see
that here with ratio versus
compaction pressure.
So now I want to hand over
the presentation to Jeremy
to go into a lot more
detail about our methods.
Hi, so here are a number of
goals for extrapolation control
that we laid out at the
beginning of the project. We
needed an extrapolation metric
that could be computed quickly
with a large number of
observations and variables, and
we needed a quick way to assess
whether the metric indicated
extrapolation or not. This was
to maintain the interactivity of
the profiler traces and
we needed this to
perform optimization.
We wanted to be able to
support the various variable
types available in the
profiler. These are
essentially continuous,
categorical and ordinal.
We wanted to utilize
observations with missing cells,
because some modeling methods
will include these observations in model training.
We wanted a method that was
robust to linear dependencies in
the data. These occur when the
number of variables is larger
than the number of observations,
for example. And we wanted
something that was easy to
automate without the need for a
lot of user input.
For least squares models, we
landed on leverage, which is
often used to identify outliers
in linear models. The leverage
for a new prediction point is computed according to the formula shown below. There are many interpretations of leverage.
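The slide formula isn't reproduced in the transcript, but the standard leverage expression for a new prediction point, with $x_0$ the model-term row vector for the new point and $X$ the training model matrix, is

$$ h(x_0) = x_0^{\top} \left( X^{\top} X \right)^{-1} x_0 . $$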
One interpretation is that it's
the multivariate distance of a
prediction point from the center
of the training data. Another
interpretation is that it is a
scaled prediction variance. So
as a prediction point moves
further away from the center
of the data, the uncertainty
of prediction increases. And we
use two common thresholds in
the statistical literature for
determining if this distance
is too large. The first is maximum leverage; prediction points beyond this threshold are outside the convex hull of the training data.
And the second is 3 times the
average of the leverages. It
can be shown that this is
equivalent to three times the
number of model terms divided
by the number of observations.
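As a minimal NumPy sketch of these two thresholds (an illustration of the quantities described in the talk, not JMP's implementation; it assumes a full-rank training model matrix X whose columns already include the intercept and any interaction or higher-order terms):

```python
import numpy as np

def leverage(X, x0):
    """Leverage of a new model-term row x0 given the training model matrix X."""
    XtX_inv = np.linalg.inv(X.T @ X)
    return float(x0 @ XtX_inv @ x0)

def leverage_thresholds(X, multiplier=3.0):
    """The two thresholds discussed in the talk.

    'max_leverage': the largest training leverage; points above it lie
    outside the convex hull of the training points.
    'avg_leverage_rule': multiplier * average leverage, which equals
    multiplier * p / n because the hat matrix has trace p (model terms).
    """
    n, p = X.shape
    h_train = np.einsum("ij,jk,ik->i", X, np.linalg.inv(X.T @ X), X)
    return {"max_leverage": h_train.max(),
            "avg_leverage_rule": multiplier * p / n}
```

A new point x0 would then be flagged as possible extrapolation when leverage(X, x0) exceeds whichever threshold is selected.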
And as Laura described
earlier, you can change the
multiplier of these
thresholds.
Finally, when desirabilities
are being optimized, the
extrapolation constraint is a
nonlinear constraint, and
previously the profiler allowed
constrained optimization with
linear constraints. This type of
optimization is more
challenging, so Laura implemented
a genetic algorithm. And if you
aren't familiar with these,
genetic algorithms use the
principles of molecular
evolution to optimize
complicated cost functions.
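The transcript does not show the optimizer itself, so the following is only a rough stand-in: a sketch of desirability maximization under a nonlinear extrapolation constraint using SciPy's differential evolution (an evolutionary optimizer used here as a proxy, not the genetic algorithm Laura implemented in JMP). The desirability, extrapolation_metric, threshold, and bounds arguments are placeholders for the quantities discussed in the talk.

```python
from scipy.optimize import differential_evolution

def constrained_optimum(desirability, extrapolation_metric, threshold,
                        bounds, penalty=1e6):
    """Maximize desirability(x) subject to extrapolation_metric(x) <= threshold.

    desirability, extrapolation_metric: callables on a 1-D factor vector.
    bounds: list of (low, high) factor bounds.
    """
    def objective(x):
        # Penalize factor settings that violate the extrapolation constraint.
        violation = max(0.0, extrapolation_metric(x) - threshold)
        return -desirability(x) + penalty * violation

    result = differential_evolution(objective, bounds, seed=1)
    return result.x, desirability(result.x)
```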
Next, I'll talk about the
approach we used to generalize
extrapolation control to models
other than linear models. When
you're constructing a predictive
model in JMP, you start with a
set of predictor variables and a
set of response variables. Some
supervised model is trained, and
then a profiler can be used to
visualize the model surface.
There are numerous variations of the profiler in JMP. You can
use the profiler internally in
modeling platforms. You can
output prediction formulas and
build a profiler for multiple
models. As Laura demonstrated,
you can construct profilers for
ensemble models. We wanted an
extrapolation control method
that would generalize to all these
scenarios, so instead of
tying our method to a
specific model, we're going
to use an unsupervised
approach.
And we're only going to flag a
prediction point as
extrapolation if it's far
outside where the data are
concentrated in the predictor
space. And this allows us to
be consistent across
profilers so that our
extrapolation control method
will plug into any profiler.
The multivariate distance
interpretation of leverage
suggested Hotelling's T squared as
a distance for general
extrapolation control. In fact,
some algebraic manipulation will
show that Hotelling's T squared is
just leverage shifted and
scaled. This figure shows how
Hotelling's T squared measures
which ellipse an observation
lies on, where the ellipses are
centered at the mean of the
data, and the shape is defined
by the covariance matrix.
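For reference, the standard form of Hotelling's T squared for a point $x$, with training mean $\bar{x}$ and estimated covariance matrix $S$, is

$$ T^2(x) = (x - \bar{x})^{\top} S^{-1} (x - \bar{x}) . $$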
Since we're no longer in
linear models, this metric
doesn't have the same
connection to prediction
variance. So instead of
relying on the thresholds used in linear models, we're going to make some distributional assumptions to determine if T squared for a prediction point should be considered extrapolation.
Here I'm showing the formula for
Hotelling's T squared. The mean and covariance matrix are estimated using the training data for the model. If P is less than N, where P is the number of predictors and N is the number of observations, and if the predictors are multivariate normal, then T squared for a prediction point has an F distribution. However, we wanted a method that generalizes to data sets with complicated data types, like a mix of continuous and categorical data, data sets where P is larger than N, and data sets with missing values. So instead of
working out the distributions
analytically in each case, we
used a simple conservative
control limit that we found
works well in practice. This is
a three Sigma control limit
using the empirical distribution
of T squared from the training
data and, as Laura mentioned, you
can also tune this multiplier.
One complication is that when P
is larger than N, Hotelling's T
squared is undefined. There are
too many parameters in the
covariance matrix to estimate
with the available data, and
this often occurs in typical use
cases for extrapolation control
like in partial least squares.
So we decided on a novel
approach to computing Hotelling's T
squared, which deals with these
cases, and we're calling it a
regularized T squared.
To compute the covariance
matrix we use a regularized
estimator originally
developed by Schafer and
Strimmer for high
dimensional genomics data.
It's just a weighted combination of the full sample covariance matrix, which is U here, and a constrained target matrix, which is D.
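Written out, with $U$ the full sample covariance matrix, $D$ the target matrix, and $\lambda$ the shrinkage weight, the estimator described here takes the familiar shrinkage form

$$ \hat{\Sigma}_{\lambda} = (1 - \lambda)\, U + \lambda\, D . $$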
For the Lambda weight
parameter, Schafer and Strimmer
derived an analytical expression that minimizes the MSE of the estimator asymptotically.
Schafer and Strimmer proposed
several possible target
matrices. The target matrix we
chose was a diagonal matrix with
the sample variances of the
predictor variables on the
diagonal. This target matrix has
a number of advantages for
extrapolation control. First, we
don't assume any correlation
structure between the variables
before seeing the data, which
works well as a general prior.
Also, when there's little data
to estimate the covariance
matrix, either due to small N or
a large fraction missing, the
elliptical constraint is
expanded by a large weight on
the diagonal matrix, and this
results in a more conservative
test for extrapolation control.
We found this was necessary to
obtain reasonable control of the
false positive rate. To put this
more simply, when there's
limited training data, the
regularized T squared is less
likely to label predictions as
extrapolation, which is what you
want, because you're more
likely to observe covariances
by chance. We have some
simulation results
demonstrating these details,
but I don't have time to go
into all that. Instead, on the Community webpage, we put a link to a paper on arXiv, and we plan to submit this to the Journal of Computational and Graphical Statistics.
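For concreteness, here is a rough NumPy sketch of a regularized T squared of the kind described, under simplifying assumptions: complete continuous data (no missing cells or categorical columns), a fixed shrinkage weight lam supplied by the user instead of the Schafer and Strimmer analytical weight, and a control limit read as the training mean plus three standard deviations of the training T squared values (one plausible reading of the three-sigma limit mentioned above). None of this is JMP's implementation.

```python
import numpy as np

def regularized_t2(X_train, lam=0.2, k=3.0):
    """Build a T-squared function and a control limit from training data.

    X_train: (n, p) array of continuous predictors with no missing values.
    lam:     shrinkage weight toward the diagonal target (fixed here; the
             talk uses an analytically derived weight instead).
    k:       control-limit multiplier (3 by default, as in the talk).
    """
    mean = X_train.mean(axis=0)
    U = np.cov(X_train, rowvar=False)      # full sample covariance (U)
    D = np.diag(np.diag(U))                # diagonal target: sample variances (D)
    sigma_inv = np.linalg.inv((1.0 - lam) * U + lam * D)

    def t2(x):
        d = np.asarray(x, dtype=float) - mean
        return float(d @ sigma_inv @ d)

    t2_train = np.array([t2(row) for row in X_train])
    # Assumed form of the empirical "three sigma" control limit.
    limit = t2_train.mean() + k * t2_train.std()
    return t2, limit
```

A prediction point x would then be flagged as possible extrapolation when t2(x) exceeds the limit.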
This next slide shows some other
important details we needed to
consider. We needed to figure
out how to deal with categorical
variables. We are just
converting them into indicator-
coded dummy variables. This is
comparable to a multiple
correspondence analysis. Another
complication is how to compute
Hotelling's T squared when
there's missing data. Several
JMP predictive modeling
platforms use observations with
missing data to train their
models. These include naive
Bayes and Bootstrap forest. And
these formulas are showing the
pairwise deletion method we
used to estimate the covariance
matrix. It's more common to use row-wise deletion. This means
all observations with missing
values are deleted before
computing the covariance matrix.
And this is simplest, but it can
result in throwing out useful
data if the sample size of the
training data is small. With pairwise deletion, observations are deleted only if there are missing values in the pair of variables used to compute the corresponding entry, and that's what these formulas are showing.
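The slide formulas are not reproduced in the transcript, but a standard way to write pairwise-deletion estimates consistent with this description, with $O_{jk}$ the set of rows where both variable $j$ and variable $k$ are observed and $n_{jk} = |O_{jk}|$, is

$$ \bar{x}_j^{(jk)} = \frac{1}{n_{jk}} \sum_{i \in O_{jk}} x_{ij}, \qquad \hat{u}_{jk} = \frac{1}{n_{jk} - 1} \sum_{i \in O_{jk}} \bigl( x_{ij} - \bar{x}_j^{(jk)} \bigr)\bigl( x_{ik} - \bar{x}_k^{(jk)} \bigr) . $$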
Seems like a simple thing to do.
You're just using all the data
that's available, but it
actually can lead to a host of
problems because there are
different observations used to
compute each entry. This can
cause weird things to happen,
like covariance matrices with
negative eigenvalues, which is
something we had to deal with.
Here are a few advantages of
the regularized T squared we
found when comparing to other
methods in our evaluations. One
is that the regularization
works the way regularization
normally works. It strikes a
balance between overfitting the
training data and over biasing
the estimator. This makes the
estimator more robust to noise
and model misspecification.
Next, Schafer and Strimmer
showed in their paper that
regularization results in a
more accurate estimator in
high dimensional settings.
This helps with the curse of dimensionality, which plagues most distance-based methods for extrapolation control.
Then in the fields that have
developed the methodology for
extrapolation control,
often they have both high
dimensional data and highly
correlated predictors. For
example in cheminformatics and
chemometrics, the chemical
features are often highly
correlated. Extrapolation control
is often used in combination
with PCA and PLS models, where T squared and DModX are used to detect violations of correlation structure. This is similar to what we do in the Model Driven Multivariate Control Chart platform.
Since this is a common use case,
we wanted to have an option that
didn't deviate too far from
these methods. Our regularized T
squared provides the same type
of extrapolation control, but it
doesn't require a projection step, which has some advantages.
We found that this allows us to
better generalize to other types of predictive models. Also, in
our evaluations we observed that
if a linear projection doesn't
work well for your data, like
you have nonlinear relationships
between predictors, the errors
can inflate the control limits
of projection based methods,
which will lead to poor
protection against
extrapolation, and our approach
is more robust to this.
And then another important point
is that we found that a single extrapolation metric
was much simpler to use and
interpret.
And here is a quick summary of
the features of extrapolation
control. The method provides better
visualization of feasible
regions in high dimensional
models in the profiler.
A new genetic algorithm has
been implemented for flexible
constrained optimization.
Our regularized T squared
handles messy observational
data, cases like P larger
than N, and continuous and
categorical variables.
The method is available in most
of the predictive models in JMP Pro 16 and supports many of their idiosyncrasies. It's also
available in the profiler in
graph, which really opens up its
utility because you can operate
on any prediction formula.
And then as a future direction,
we're considering implementing
a K-nearest neighbor based
constraint that would go beyond
the current correlation
structure constraint. Often
predictors are generated by
multiple distributions resulting
in clustering in the predictor
space. And a K-nearest neighbors
based approach would enable
us to control extrapolation
between clusters.
So thanks to everyone who
tuned in to watch this and
here are our emails if you have
any further questions.