
Controlling Extrapolation in the Prediction Profiler in JMP Pro 16 (2021-US-45MP-838)

Level: Intermediate

 

Laura Lancaster, JMP Principal Research Statistician Developer, SAS
Jeremy Ash, JMP Analytics Software Tester, SAS
Christopher Gotwalt, JMP Director of Statistical Research and Development, SAS

 

Uncontrolled model extrapolation leads to two serious kinds of errors: (1) the model may be completely invalid far from the data, and (2) the combinations of variable values may not be physically realizable. Using the Profiler to optimize models that are fit to observational data can lead to extrapolated solutions that are of no practical use without any warning. JMP Pro 16 introduces extrapolation control into many predictive modeling platforms and the Profiler platform itself. This new feature in the Prediction Profiler alerts the user to possible extrapolation or completely avoids drawing extrapolated points where the model may not be valid. Additionally, the user can perform optimization over a constrained region that avoids extrapolation. In this presentation, we discuss the motivation and usefulness of extrapolation control, demonstrate how it can be easily used in JMP, and describe details of our methods.

 

 

Auto-generated transcript...

 


Speaker

Transcript

  Hi, I'm Chris Gotwalt. My co-presenters, Laura Lancaster and Jeremy Ash, and I are presenting a useful new JMP Pro capability called extrapolation control.
  Almost any model that you would ever want to predict with has a range of applicability,
  a region of the input space where its predictions are considered to be
  reliable enough. Outside that region, we begin to extrapolate the model to points far from the data used to fit it, and the model's predictions at those points could be completely unreliable.
There are two primary sources of extrapolation: statistical extrapolation and domain-based extrapolation. Both types are covered by the new feature.
  Statistical extrapolation occurs when one is attempting to predict using a model at an X that isn't close to the values used to train that model.
  Domain-based extrapolation happens when you try to evaluate at an X that is impossible due to scientific-
  or engineering-based constraints. This example illustrates both kinds of extrapolation at once. Here, we see a profiler from a model of a metallurgy production process.
  The prediction reads out at -2.96 with no indication that we are evaluating at a combination of temperature and pressure that is impossible
  in a domain sense to attain for this machine. We also have statistical extrapolation, as it is far from the data used
  to fit the model, as seen in the scatter plot of the training data on the right. In JMP Pro 16, Jeremy, Laura, and I have collaborated to add a new capability that can give a warning when the profiler thinks you might be extrapolating.
  Or if you turn extrapolation control on, it will restrict the set of points that you see to only those that it doesn't think are extrapolating.
  We have two types of extrapolation control.
  One is based on the concept of leverage and uses a least squares model. This first type is only available in the Pro version of fit model least squares.
  The other type we call general machine learning extrapolation control, and it is available in the profiler platform and several of the most common machine learning platforms in JMP Pro.
  Upon request, we could even add it to more platforms. Least squares extrapolation control uses the concept of leverage, which is like a scaled version of the prediction variance. It is model based, and so
  it uses information about the main effects, interactions, and higher-order terms to determine the extrapolation.
  For the general machine learning extrapolation control case, we had to come up with our own approach. We wanted a method that would be robust to missing values
  and linear dependencies, fast to compute, and able to handle mixtures of continuous and categorical input variables, and we also explicitly wanted to separate the extrapolation model from the model used to fit the data.
  So when we have general extrapolation control turned on, there's still only one supervised model, the one that fits the input variables to the responses and that we see in the profiler traces.
  The profiler comes up with a quick-and-dirty unsupervised model to describe the training set Xs, and this unsupervised model is used behind the scenes by the profiler to determine the extrapolation control constraint.
  So I'm having to switch because PowerPoint and my camera aren't getting along right now, for some reason.
  We know that risky extrapolations are being made every day by people working in data science and are confident that the use of extrapolations leads to poor predictions and ultimately to poor business outcomes.
  Extrapolation control places guardrails on model predictions and will lead to quantifiably better decisions by JMP Pro users.
  When users see an extrapolation occurring, they must decide whether the prediction should be used, based on their domain knowledge and familiarity with the problem at hand.
  If you start seeing extrapolation control warnings happen quite often, it is likely the end of the life cycle for that model and time to refit it to new data, because the distribution of the inputs has shifted away from that of the training data.
  We are honestly quite surprised and alarmed that the need for identifying extrapolation isn't better appreciated by the data science community
  and have made controlling extrapolation as easy and automatic as possible. Laura, who developed it in JMP Pro, will be demonstrating the option up next. Then Jeremy, who did a lot of the research on our team, will go into the math details and statistical motivation for the approach.
  Hello. My name is Laura Lancaster and I'm here to do a demo of the extrapolation control that was added to JMP Pro 16. I wanted to start off with a fairly simple example using the fit model least squares platform.
  I'm going to use some data that may be familiar. It's the fitness data that's in sample data.
  And I'm going to use oxygen uptake as my response and run time, run pulse, and max pulse as my predictors. And I wanted to reiterate that in fit least squares, the extrapolation metric that's used is leverage, so let's go ahead and start JMP.
  So now I have the fitness data open in JMP, and I have a script saved to the data table to automatically launch my fit least squares model.
  So I'm going to go ahead and run that script. It launches the least squares platform.
  And I have the profiler automatically open, and we can see that the profiler looks like it always has in the past,
  where the factor boundaries are defined by the range of each factor individually, giving us rectangular bound constraints.
  And when I changed the factor settings, because of these bound constraints, it can be really hard to tell if you're moving far outside the correlation structure of the data.
  And this is why we wanted to add the extrapolation control. It has been added to several of the platforms in JMP Pro 16,
  including fit least squares. And to get to the extrapolation control, you go to the profiler menu. So if I look here, I see there's a new option called extrapolation control.
  It is set to off by default, but I can turn it to either on or warning on to turn on extrapolation control. If I turn it to on, notice that it restricts my profile traces to only go to values where I'm not extrapolating.
  If I were to turn it to warning on, I would see the full profile traces, but I would get a warning when I go to a region where it would be considered to be extrapolation.
  I can also turn on extrapolation details, which I find really helpful,
  and that gives me a lot more information. First of all, it tells me that the metric I'm using to define extrapolation is leverage, which is true in the fit least squares platform,
  and the threshold that's being used by default initially is going to be maximum leverage, but this is something I can change and I'll show you that in a minute.
  Also, I can see what my extrapolation metric is for my current settings. That's this number right here, which will change as I change my factor settings.
  Anytime this number is greater than the threshold, I'm going to get this warning that I might be extrapolating. If it goes below, I will no longer get that warning.
  This threshold is not going to change unless I change something in the menu to adjust my threshold. So let me go ahead and do that right now. I'm going to go to the menu,
  and I'm going to go to set thresholds criterion. So in fit least squares, you have two options for the threshold. Initially, it's set to maximum leverage, which is going to keep you within the convex hull of the data,
  or you can switch to a multiplier times the average leverage, which is the number of model terms over the number of observations. And I want to switch to that threshold, so it's set to three
  as the multiplier by default, so this is going to be three times the average leverage. Click OK and notice that my threshold is going to change. It actually got smaller, so this is a more conservative definition of extrapolation.
  And I'm going to turn it back to on to restrict my profile traces, and now I can only go to regions
  where I'm within three times the average leverage.
  Now we have also implemented optimization that obeys the extrapolation constraints. So now, if I turn on set desirability
  and I do the optimization,
  I will get an optimal value that satisfies the extrapolation constraint. Notice that this metric is less than or equal to the threshold.
  So now I'll go to my next slide, which compares, in a scatter plot matrix, the optimal value with extrapolation control turned on and with it turned off.
  So this is the scatter plot matrix that I created with JMP, and it shows the original predictor variable data, as well as the predictor variable values
  for the optimal solution using no extrapolation control in blue and optimal solution using extrapolation control in red. And notice how the unconstrained solution here in blue, right
  here, violates the correlation structure for the original data for run pulse and max pulse, thus increasing the uncertainty of this prediction, whereas the optimal solution that did use extrapolation control is much more in line with the original data.
  Now let's look at an example using the more generalized extrapolation control method, which we refer to as a regularized T squared method.
  As Chris mentioned earlier, we developed this method for models other than least squares models. We're going to look at a neural model for the diabetes data that is also in the sample data.
  The response is a measure of disease progression and the predictors are the baseline variables. Once again, the extrapolation metric used for this example is the regularized T squared that Jeremy will be describing in more detail in a few minutes.
  So I have the diabetes data open in JMP, and I have a script saved for my neural model fit. I'm going to go ahead and run that script.
  It launches the neural platform. And notice that I am using validation method random hold back. I just wanted to note that anytime you use a validation method, the extrapolation control is based only on the training data
  and not your validation or test data.
  So I have the profiler open and you can see that it's using the full traces. Extrapolation control is not turned on.
  Let's go ahead and turn it on.
  And I'm also going to turn on the details.
  You can see that the traces have been restricted and the metric is the regularized T square.
  The threshold is three times the standard deviation of the sample regularized T square. Jeremy's going to talk more about what all that means exactly in a few minutes.
  And I just wanted to mention that, when we're using the regularized T squared method, there's only one choice for threshold, but you can adjust the multiplier. So if you go to
  extrapolation controls, set thresholds, you can adjust this multiplier, but I'm going to leave it at three. And now I want to run optimization using
  extrapolation control, so I'm just going to do maximize and remember. Now I have an optimal solution with extrapolation control turned on.
  And so now, I want to look at our scatter plot matrix, just like we looked at before with the original data, as well as with the optimal values with and without extrapolation control.
  So this is a scatter plot matrix of the diabetes data that I created in JMP. It's got the original predictor values,
  as well as the optimal solution using extrapolation control in red and optimal solution without extrapolation control in blue. And you can see that
  the red dots appear to be much more within the correlation structure of the original data than the blue. And that's particularly true when you look at this LDL versus total cholesterol.
  So now let's look at an example using the profiler that's under the graph menu, which I'll call the graph profiler.
  It also uses the regularized T squared method and it allows us to use extrapolation control on any type of model that can be created and saved as a JSL formula. It also allows us to have extrapolation control on more than one model at a time.
  So let's look at an example for a company that uses powder metallurgy technology to produce steel drive shafts for the automotive industry.
  They want to be able to find optimal settings for their production that will minimize shrinkage and also minimize failures due to bad surface conditions. So we have two responses:
  shrinkage, which is continuous, and we're going to fit a least squares model for that, and surface condition, which is pass/fail, and we're going to fit a nominal logistic model for that one.
  And our predictor variables are just some key process variables in production. And once again, the extrapolation metric is the regularized T square.
  So I have the powder metallurgy data open in JMP, and I've already fit a least squares model for my shrinkage response,
  and I've already fit a nominal logistic model for the surface condition pass/fail response. And I've saved the prediction formulas to the data table so that they are ready to be used in the graph profiler.
  So if I go to the graph menu profiler, I can load up the prediction formula for shrinkage and my prediction formula for the surface condition.
  Click OK.
  And now I have both of my models launched into the graph profiler.
  And, before I turn on extrapolation control, you can see that I have the full profile traces. Once I turn on extrapolation control,
  you can see that the traces shrink a bit. And I'm also going to turn on the details
  just to show that, indeed, I am using the regularized T square here in this method.
  So what I really want to do is, I want to find the optimal conditions where I minimize shrinkage and I minimize failures
  with extrapolation control on. I want to make sure I'm not extrapolating. I want to find a useful solution.
  And before I can do the optimization, I actually need to set my desirabilities. I'm going to do set desirabilities; it's already correct for shrinkage, but I need to set them for the surface condition.
  I'm going to try to maximize passes and minimize failures.
  OK.
  And now I should be able to do the optimization with extrapolation controls on, do maximize and remember.
  Now I have my optimal solution
  with extrapolation control on. So now let's look at once again at the scatter plot matrix of the original data, along with the solution with extrapolation control on and the solution with the extrapolation control off.
  So this is a scatter plot matrix of the powder metallurgy data that I created in JMP and it also has the optimal solution
  with extrapolation control as a red dot and the optimal solution with no extrapolation control as a blue dot. And, once again, you can see that when we don't enact the extrapolation control,
  the optimal solution is pretty far outside of the correlation structure of the data. We can especially see that here with ratio versus compaction pressure.
  So now, I want to hand over the presentation to Jeremy to go into a lot more detail about our methods.
  Hi. So here are a number of goals for extrapolation control that we laid out at the beginning of the project.
  We needed an extrapolation metric that could be computed quickly with a large number of observations and variables, and we needed a quick way to assess whether the metric indicated extrapolation or not. This was needed to maintain the interactivity of the profiler traces and
  to perform optimization.
  We wanted it to support the various variable types available in the profiler; these are essentially continuous, categorical, and ordinal.
  We wanted to utilize observations with missing cells, because some modeling methods will include these observations when training.
  We wanted a method that was robust to linear dependencies in the data. These occur when the number of variables is larger than the number of observations, for example.
  And we wanted something that was easy to automate without the need for a lot of user input.
  For least squares models, we landed on leverage, which is often used to identify outliers in linear models.
  The leverage for a new prediction point is computed according to this formula. There are many interpretations of leverage. One interpretation is that
  it's the multivariate distance of a prediction point from the center of the training data. Another interpretation is that it is a scaled prediction variance, so as a prediction point moves further away from the center of the data, the uncertainty of the prediction increases.
  And we use two common thresholds in the statistical literature for determining if this distance is too large.
  The first is maximum leverage. Prediction points beyond this threshold are outside the convex hull of the training data.
  And the second is three times the average of the leverages. It can be shown that this is equivalent to three times the number of model terms divided by the number of observations.
  And as Laura described earlier, you can change the multiplier of these thresholds.
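  To make the leverage check concrete, here is a minimal numpy sketch of the statistic and the two thresholds just described. It is illustrative only, not JMP's internal code; X is assumed to be the full model matrix (intercept, main effects, interactions, and any higher-order terms), and the function names are placeholders.

```python
# Minimal sketch (not JMP's implementation) of leverage-based extrapolation
# control for a least squares model. X is the n x p training model matrix.
import numpy as np

def leverage(x_new, X):
    """Leverage of a new prediction point given the training model matrix."""
    XtX_inv = np.linalg.inv(X.T @ X)
    return float(x_new @ XtX_inv @ x_new)

def leverage_flags_extrapolation(x_new, X, rule="max", multiplier=3.0):
    """Compare the new point's leverage to one of the two thresholds:
    the maximum training leverage (convex hull of the training data), or
    multiplier * average leverage = multiplier * p / n."""
    n, p = X.shape
    H_diag = np.sum((X @ np.linalg.inv(X.T @ X)) * X, axis=1)  # training leverages
    threshold = H_diag.max() if rule == "max" else multiplier * p / n
    return leverage(x_new, X) > threshold, threshold
```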
  Finally, when desirabilities are being optimized, the extrapolation constraint is a nonlinear constraint. Previously, the profiler allowed constrained optimization with linear constraints,
  but this type of optimization is more challenging, so Laura implemented a genetic algorithm. If you aren't familiar with these, genetic algorithms use principles of natural selection to optimize complicated cost functions.
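  As a rough illustration of optimizing a desirability function subject to a nonlinear extrapolation constraint, here is a generic penalty-based genetic algorithm sketch. It is not the algorithm implemented in JMP Pro; the desirability and metric callables, the penalty weight, and the operator choices are all placeholders.

```python
# Hypothetical sketch: maximize desirability(x) subject to metric(x) <= threshold
# by penalizing infeasible points inside a simple genetic algorithm.
import numpy as np

def genetic_maximize(desirability, metric, threshold, lower, upper,
                     pop_size=100, generations=200, mutation_scale=0.1, seed=0):
    rng = np.random.default_rng(seed)
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    pop = rng.uniform(lower, upper, size=(pop_size, lower.size))

    def fitness(x):
        # Large penalty whenever the extrapolation metric exceeds its threshold.
        return desirability(x) - 1e6 * max(0.0, metric(x) - threshold)

    for _ in range(generations):
        scores = np.array([fitness(x) for x in pop])
        parents = pop[np.argsort(scores)[-pop_size // 2:]]       # keep the better half
        idx = rng.integers(0, len(parents), size=(pop_size, 2))  # random parent pairs
        w = rng.random((pop_size, 1))
        children = w * parents[idx[:, 0]] + (1 - w) * parents[idx[:, 1]]  # blend crossover
        children += rng.normal(0.0, mutation_scale, children.shape) * (upper - lower)
        pop = np.clip(children, lower, upper)                    # respect factor bounds

    scores = np.array([fitness(x) for x in pop])
    return pop[np.argmax(scores)]
```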
  Next I'll talk about the approach we use to generalize extrapolation control to models other than linear models.
  When you're constructing a predictive model in JMP, you start with a set of predictor variables and a set of response variables.
  Some supervised model is trained and then a profiler can be used to visualize the model surface.
  There are numerous variations on the profiler in JMP. You can use the profiler internally in modeling platforms, you can output prediction formulas and build a
  profiler for multiple models, as Laura demonstrated, and you can construct profilers for ensemble models. We wanted an extrapolation control method that would generalize to all these scenarios, so instead of tying our method to a specific model, we're going to use an unsupervised approach.
  And we're only going to flag a prediction point as extrapolation if it's far outside where the data is concentrated in the predictor space. And this allows us to be consistent across profilers so that our extrapolation control method will plug into any profiler.
  The multivariate distance interpretation of leverage suggested Hotelling's T squared as a distance for general extrapolation control. In fact, some algebraic manipulation will show that Hotelling's T squared is just leverage shifted and scaled.
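  To make the "shifted and scaled" remark explicit: for a linear model with an intercept, when T squared is computed on the same model terms, the standard identity (my paraphrase, not a formula from the slides) is

\[
T^2(x) = (x - \bar{x})^{\top} S^{-1} (x - \bar{x}), \qquad
h(x) = \frac{1}{n} + \frac{T^2(x)}{n - 1},
\]

  where S is the sample covariance matrix of the training predictors and h(x) is the leverage.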
  This figure shows how Hotelling's T squared measures which ellipse an observation lies on, where the ellipses are centered at the mean of the data and the shape is defined by the covariance matrix.
  Since we're no longer in linear models, this metric doesn't have the same connection to prediction variance. So instead of relying on the thresholds used for linear models, we're going to make some distributional assumptions to determine whether
  T squared for a prediction point should be considered extrapolation.
  Here I'm showing the formula for Hotelling's T squared. The mean and covariance matrix are estimated using the training data for the model.
  If P is less than N, where P is the number of predictors and N the number of observations,
  and if the predictors are multivariate normal, then T squared for a prediction point has an F distribution. However, we wanted a method that
  generalizes to data sets with complicated data types, like a mix of continuous and categorical variables; data sets where P is larger than N; and
  data sets with missing values. So instead of working out the distribution analytically in each case, we use a simple, conservative control limit that we found works well in practice.
  This is a three Sigma control limit using the empirical distribution of T squared from the training data and, as Laura mentioned, you can also tune this multiplier.
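  A minimal sketch of this general check is below. The mean-plus-three-standard-deviations form of the limit is an assumption about what "three sigma control limit" means here; the exact definition is in the authors' white paper. The covariance argument would be the regularized estimate described next.

```python
# Illustrative sketch of the general T-squared extrapolation check with an
# empirical control limit built from the training data's T-squared values.
import numpy as np

def t_squared(x, mean, cov_inv):
    diff = np.asarray(x) - mean
    return float(diff @ cov_inv @ diff)

def t2_threshold(X_train, cov, multiplier=3.0):
    """Assumed form of the control limit: mean + multiplier * sd of training T2."""
    mean = X_train.mean(axis=0)
    cov_inv = np.linalg.inv(cov)
    t2_train = np.array([t_squared(x, mean, cov_inv) for x in X_train])
    return t2_train.mean() + multiplier * t2_train.std(ddof=1)

def t2_flags_extrapolation(x_new, X_train, cov, multiplier=3.0):
    cov_inv = np.linalg.inv(cov)
    return t_squared(x_new, X_train.mean(axis=0), cov_inv) > t2_threshold(X_train, cov, multiplier)
```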
  One complication is that when P is larger than N, Hotelling's T squared is undefined. There are too many parameters in the covariance matrix to estimate with the available data.
  And this often occurs in typical use cases for extrapolation control, like in partial least squares. So we decided on a novel approach to computing Hotelling's T squared, which deals with these cases and we're calling it a regularized T squared.
  To compute the covariance matrix, we use a regularized estimator originally developed by Schafer and Strimmer for high-dimensional genomics data. This is just a weighted combination of the full sample covariance matrix, which is U here, and a constrained target matrix, which is D.
  For the lambda weight parameter, Schafer and Strimmer derived an analytical expression that minimizes the MSE of the estimator asymptotically.
  Schafer and Strimmer proposed several possible target matrices. The target matrix we chose was a diagonal matrix with the sample variances of the predictor variables on the diagonal. This target matrix has a number of advantages for extrapolation control.
  First, we don't assume any correlation structure between the variables before seeing the data, which works well as a general prior.
  Also, when there's little data to estimate the covariance matrix, either due to small N or a large fraction missing,
  the elliptical constraint is expanded by a large weight on the diagonal matrix, and this results in a more conservative test for extrapolation control. We found this is necessary to obtain reasonable control of the false positive rate.
  To put this more simply, when there's limited training data, the regularized T squared is less likely to label predictions as extrapolation, which is what you want because you're more likely to observe covariances by chance.
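  The weighted combination itself is straightforward to sketch. The shrinkage weight below follows the commonly cited Schafer-Strimmer style plug-in for a diagonal target, so treat it as an approximation of the estimator the authors actually use; the white paper has the exact details.

```python
# Sketch of a regularized covariance estimate: a weighted combination of the
# full sample covariance U and a diagonal target D (sample variances on the
# diagonal). The lambda plug-in here is an approximation, not JMP's exact code.
import numpy as np

def regularized_covariance(X):
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    U = (Xc.T @ Xc) / (n - 1)        # full sample covariance
    D = np.diag(np.diag(U))          # target: diagonal matrix of sample variances

    # Estimated sampling variance of each covariance entry (Schafer-Strimmer style).
    W = np.einsum('ti,tj->tij', Xc, Xc)               # per-observation cross products
    var_U = W.var(axis=0, ddof=1) * n / (n - 1) ** 2

    off = ~np.eye(p, dtype=bool)                      # shrink only the off-diagonals
    lam = var_U[off].sum() / (U[off] ** 2).sum()
    lam = min(1.0, max(0.0, lam))                     # keep the weight in [0, 1]
    return lam * D + (1 - lam) * U, lam
```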
  We have some simulation results demonstrating these details, but I don't have time to go into all that. Instead, on the Community web page, we put a link to a paper on arXiv, and we plan to submit this to the Journal of Computational and Graphical Statistics.
  This next slide shows some other important details we needed to consider. We needed to figure out how to deal with categorical variables.
  We are just converting them into indicator coded dummy variables. This is comparable to a multiple correspondence analysis.
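  For illustration, here is a tiny sketch of that indicator coding (not JMP's internal coding):

```python
# Convert a categorical column into indicator (one-hot) dummy variables so it
# can enter the T-squared calculation alongside continuous predictors.
import numpy as np

def indicator_code(values):
    """Return an n x k indicator matrix and the ordered list of levels."""
    levels = sorted(set(values))
    return np.array([[1.0 if v == lvl else 0.0 for lvl in levels] for v in values]), levels

# indicator_code(["low", "high", "low"]) -> a 3 x 2 matrix with levels ["high", "low"]
```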
  Another complication is how to compute Hotelling's T squared when there's missing data. Several JMP predictive modeling platforms use observations with missing data to train the models.
  These include Naive Bayes and bootstrap forest, and these formulas are showing the pairwise deletion method we use to estimate the covariance matrix.
  It's more common to use row-wise deletion, which means all observations with missing values are deleted before computing the covariance matrix. This is simplest, but it can result in throwing out useful data if the sample size of the training data is small.
  With pairwise deletion, observations are deleted only if there are missing values in the pair of variables used to compute the corresponding entry.
  And that's what these formulas are showing. It seems like a simple thing to do, since you're just using all the data that's available, but it can actually lead to a host of problems, because different observations are used to compute each entry.
  This can cause weird things to happen, like covariance matrices with negative eigenvalues, which is something we had to deal with.
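  Here is an illustrative sketch of a pairwise-deletion covariance estimate of the kind the formulas describe; the exact divisor and the handling of edge cases are details of JMP's estimator, not shown here.

```python
# Sketch: each covariance entry (j, k) uses only the rows where both variable j
# and variable k are observed. Missing cells are marked with np.nan.
import numpy as np

def pairwise_covariance(X):
    n, p = X.shape
    S = np.full((p, p), np.nan)
    for j in range(p):
        for k in range(j, p):
            ok = ~np.isnan(X[:, j]) & ~np.isnan(X[:, k])
            if ok.sum() > 1:
                xj, xk = X[ok, j], X[ok, k]
                S[j, k] = S[k, j] = np.mean((xj - xj.mean()) * (xk - xk.mean()))
    # Because each entry can use a different set of rows, the result need not be
    # positive semidefinite, which is why negative eigenvalues must be handled.
    return S
```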
  Here are a few advantages of the regularized T squared we found when comparing to other methods in our evaluations.
  One is that the regularization works the way regularization normally works. It strikes a balance between overfitting the training data and over-biasing the estimator. This makes the estimator more robust to noise and model misspecification.
  Next, Schafer and Strimmer showed in their paper that regularization results in a more accurate estimator in high-dimensional settings. This helps with the curse of dimensionality, which plagues most distance-based methods for extrapolation control.
  And then, in the fields that have developed the methodology for extrapolation control,
  often they have both high-dimensional data and highly correlated predictors. For example, in cheminformatics and chemometrics, the chemical features are often highly correlated.
  Extrapolation control is often used in combination with PCA and PLS models, where T squared and DModX are used to detect violations of correlation structure. This is similar to what we do in the Model Driven Multivariate Control Chart platform.
  Since this is a common use case, we wanted to have an option that didn't deviate too far from these methods.
  Our regularized T squared provides the same type of extrapolation control, but it doesn't require a projection step, which has some advantages.
  We found that this allows us to better generalize to other types of predictive models.
  Also in our evaluations, we observed that if a linear projection doesn't work well for your data, like you have nonlinear relationships between predictors,
  the errors can inflate the control limits of projection-based methods, which will lead to poor protection against extrapolation, and our approach is more robust than this.
  And then another important point is that we found a single extrapolation metric was much simpler to use and interpret.
  And here is a quick summary of the features of extrapolation control. The method provides better visualization of feasible regions in
  high dimensional models in the profiler. A new genetic algorithm has been implemented for flexible constrained optimization.
  And our regularized T squared handles messy observational data in cases like P larger than N and continuous and categorical variables.
  The method is available in most of the predictive models in JMP 16 Pro and supports many of their idiosyncrasies. It's also available in the profiler in graph, which really opens up its utility because it can operate on any prediction formula.
  And then as a future direction, we're considering implementing a K-nearest neighbor based constraint that would go beyond the current correlation structure constraint.
  Often predictors are generated by multiple distributions, resulting in clustering in the predictor space, and a K-nearest neighbors based approach would enable us to control extrapolation between clusters.
  So, thanks to everyone who tuned in to watch this, and here are emails if you have any further questions.
Comments
MC17

I have a question as to why strictly a Hotelling T2 value would ensure prediction integrity. We have to agree that Hotelling T2 really defines the design space of the data that we used in the model. It does not mean that the data structure itself has a correct correlation structure. I often see instances where I will have prediction values within the Hotelling T2 design region, however the data structure used to predict the values does not match the data structure of the original model. Therefore, the prediction is highly questionable. A simple example of this would be a simple model predicting orange juice based on the incoming oranges. However, if someone accidentally places lemons in the line, the model will still predict orange juice. The amount of juice would likely still be within the range of the Hotelling T2 region. However, we all know it is not orange juice we are getting. The data structure changed on us. That is why in a PLS type of model we must keep the DModX = 0 to ensure any future optimization value stays within the original model data structure used.

 

If you can explain to me how the regularized T2 respects the model data structure used, and not just the model data region, then I would be more confident in using this new tool.

@MC17 I am glad you raised this important question.  I touched on this some in the talk, but I probably did not stress this enough.

 

So what I believe you are asking is why we are only controlling T2 and not DModX, as is common in PCA/PLS models.  This is what we do in the MDMCC, for example (https://www.jmp.com/support/help/en/16.1/index.shtml#page/jmp/model-driven-multivariate-control-char...). If we were to take a similar approach, we would probably place a control limit on DModX, because the training data will all have DModX > 0, so the trick is to allow predictions to deviate the way the training data do, without getting too crazy.  If you want DModX = 0, then you could take Isaac Himanga’s approach (https://community.jmp.com/t5/Discovery-Summit-Americas-2021/Optimizing-Multivariate-Models-From-Data...).

 

The short answer is that our Regularized T2 is computed on the original predictor variables, and does not require you to split your data into a model component and error component.  For PCA/PLS models, the regularization results in a constraint that is similar to controlling T2 and DModX, but you can do it with one metric, which is simpler and easier to interpret. 

 

In PCA/PLS models, T2 flags predictions with large deviations from your training data within the model plane, while DModX flags predictions with large deviations from the model plane, as your scenario describes. This makes sense when your modeling method assumes a low dimensional linear projection of your data. However, most of our modeling methods don't perform such a projection. We want an approach that is consistent across platforms for a variety of reasons (ensembling, etc.). Performing a linear projection may not be reasonable if you have nonlinear relationships between variables, categorical variables, etc. Also, how do you choose the number of components for these models? We found in our evaluations that if a linear projection isn't a good fit for your data, the errors will inflate your DModX control limit and lead to poor protection against extrapolation. We found a better general approach was to place extrapolation constraints using T2 on the original predictor space, which protects against violation of correlation structure in all directions of the data.

 

The regularization of T2 puts large importance on highly correlated variables and low importance on weakly correlated variables when controlling extrapolation, and it works when p > n. This results in a constraint similar to T2 and DModX in PCA and PLS models. I have some simulation results showing how the method works in the attached white paper. I hope to update this with a comparison to a T2 and DModX approach soon, and add some more discussion. I started with just showing that our method works; that paper is still a work in progress.

 

Does this make sense? I am happy to discuss more, and to hear your opinions/concerns about this.