Model Screening: Streamlining the Predictive Modeling Workflow (2021-EU-30MP-781)
Mia Stephens, JMP Principal Product Manager, SAS
Predictive modeling is all about finding the model, or combination of models, that most accurately predicts the outcome of interest. But, not all problems (and data) are created equal. For any given scenario, there are several possible predictive models you can fit, and no one type of model works best for all problems. In some cases a regression model might be the top performer; in others it might be a tree-based model or a neural network.
In the search for the best-performing model, you might fit all of the available models, one at a time, using cross-validation. Then, you might save the individual models to the data table, or to the Formula Depot, and then use Model Comparison to compare the performance of the models on the validation set to select the best one. Now, with the new Model Screening platform in JMP Pro 16, this workflow has been streamlined. In this talk, you’ll learn how to use Model Screening to simultaneously fit, validate, compare, explore, select and then deploy the best-performing predictive model.
Speaker: Mia Stephens

Transcript:

Mia Stephens: In this talk, I'm going to introduce model screening. If you do any work with predictive modeling, you'll find that model screening helps you streamline your predictive modeling workflow.
So in this talk I'm going to provide an overview of predictive modeling and talk about the different types of predictive models we can use in JMP. We'll talk about the predictive modeling workflow within the broader analytics workflow, and we'll see how model screening can help us to streamline this workflow. I'll talk about some metrics for comparing competing models using validation data, and we'll see a couple of examples in JMP Pro.
First, let's talk a little bit about predictive modeling: what is predictive modeling? You've probably been exposed to regression analysis, and regression is an example of explanatory modeling. In regression we're typically interested in building a model for a response, or Y, as a function of one or more Xs, and we might have different modeling goals. We might be interested in identifying important variables: what are the key Xs, or key input variables, that we might focus on to address the issue in a problem-solving setting? We might be interested in understanding how the response changes, on average, as a function of the input variables; for example, a one-unit change in X is associated with a five-unit change in Y. So this is classical explanatory modeling, and if you've taken statistics in school, this is probably how you learned about regression.
In contrast, predictive modeling has a slightly different goal. In predictive modeling our goal is to accurately predict or classify future outcomes. If our response is continuous, we want to be able to predict the next observation, the next outcome, as precisely and accurately as possible. If our response is categorical, then we're interested in classification, and again we're interested in using current data to predict what's going to happen at the individual level in the future. We might fit and compare many different models, and in predictive modeling we might also use some more advanced models, such as machine learning techniques like neural networks. Some of these models might not be as easy to interpret, and many of them have a lot of different tuning parameters that we can set. As a result, with predictive modeling we can have a problem with overfitting. Overfitting means that we fit a model that's more complex than it needs to be.
So with predictive modeling we generally use validation, and there are several different forms of validation we can use. We use validation for model comparison and selection, and fundamentally, it protects against overfitting but also underfitting. Underfitting is when we fit a model that's not as complex as it needs to be; it's not really capturing the structure in our data. In the appendix at the end of the slides, I've pulled some slides that illustrate why validation is important, but for the focus of this talk I'm simply going to use validation when I fit predictive models.
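To make the idea concrete, here's a minimal sketch of holdout validation outside of JMP, written in Python with scikit-learn. The data, split fraction, and model are made-up placeholders for illustration, not the diabetes example and not JMP's implementation.

```python
# A minimal sketch of holdout validation: fit on training data,
# then judge the model on rows it never saw.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(442, 10))                      # stand-in for 10 baseline predictors
y = X @ rng.normal(size=10) + rng.normal(scale=2.0, size=442)

# Hold out 25% of the rows as a validation set (the fraction is arbitrary here).
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=1)

model = LinearRegression().fit(X_train, y_train)
print("Training R^2:  ", r2_score(y_train, model.predict(X_train)))
print("Validation R^2:", r2_score(y_val, model.predict(X_val)))
```

If the validation R-square is much worse than the training R-square, that's the overfitting signal that holdout validation is designed to catch.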
There are many different types of models we can fit in JMP Pro, and this is not by any means an exhaustive list. We can fit several different types of regression models. If we have a continuous response, we can fit a linear regression model; if our response is categorical, a logistic regression model. We can also fit generalized linear models and penalized regression methods, and these are all available from the Fit Model platform. There are many options under the Predictive Modeling platform: neural nets, neural nets with boosting and different numbers of layers and nodes, classification and regression trees, more advanced tree-based methods, and several other techniques. There are also a couple of predictive modeling options under Multivariate Methods: discriminant analysis and partial least squares are two additional types of models we can use for predictive modeling. And by the way, partial least squares is also available from Fit Model.

And why do we have so many models? In predictive modeling, you'll often find that no one model or modeling type always works best. In certain situations, a neural network might be best, and neural networks are generally pretty good performers. But you might find in other cases that a simpler model actually performs best, so the type of model that fits your data best and predicts most accurately depends largely on your response, but also on the structure of your data. So you might fit several different types of models and compare them before you find the model that predicts most accurately.
Within the broader analytic workflow, where you start off with some sort of problem that you're trying to solve, and you compile, prepare, and explore the data, predictive modeling falls under analyze and build models. The typical predictive modeling workflow might look something like this: you fit a model with validation, then you save that formula to the data table or publish it to the Formula Depot. Then you fit another model and repeat this, so you may fit several different models, and then you compare them. In JMP Pro, you can use the Model Comparison platform to do this: you compare the performance of the models on the validation data, you choose the best model or the best combination of models, and then you deploy the model. What's different with model screening is that all of the model fitting, comparison, and selection is done within one platform, the Model Screening platform.
We're going to use an example that you might be familiar with (there is a blog on model screening using these data posted in the Community), and these are the diabetes data. The scenario is that researchers want to predict the rate of disease progression one year after baseline. There are several baseline measurements and two different types of responses. The response Y is the quantitative measure, a continuous measure of the rate of disease progression, and then there's a second response, Y Binary, which is high or low. So Y Binary represents a high rate of progression or a low rate of progression. The goal of predictive modeling here is to predict the patients who are most likely to have a high rate of disease progression, so that corrective actions can be taken to prevent this. We're going to take a look at fitting models in JMP Pro for both of these responses, and we'll see how to fit the same models using model screening.
Before I go to JMP, I just want to talk a little bit about how we compare predictive models. We compare predictive models on the validation set or test set. Basically, we fit a model to a subset of our data called the training data, and then we apply that model to data that were held out (typically we call these the validation data) to see how well the model performs. If we have a continuous response, we can use measures of error, such as root mean square error (RMSE) or RASE, the root average squared error; these are measures of prediction error. AAE, MAD, and MAE are measures of average absolute error, and there are different R-square measures we might use.
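As a rough illustration of what these continuous-response measures compute, here is a generic Python sketch (the function name is just for this illustration, not JMP's report code):

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Holdout-set error measures for a continuous response."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    resid = y_true - y_pred
    rase = np.sqrt(np.mean(resid ** 2))             # root average squared error
    aae = np.mean(np.abs(resid))                    # average absolute error
    r2 = 1.0 - np.sum(resid ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return {"RASE": rase, "AAE": aae, "RSquare": r2}
```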
For categorical responses, we're most often interested in measures of error or misclassification. We might also be interested in looking at an ROC curve, at AUC (the area under the curve), or at sensitivity, specificity, the false positive and false negative rates, the F1 score, and MCC, which is the Matthews correlation coefficient.
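Most of these classification measures come straight from the confusion matrix. Here's a small generic sketch of the threshold-based ones, with a made-up positive level; AUC needs the predicted probabilities and is easiest to get from a library routine such as scikit-learn's roc_auc_score:

```python
import numpy as np

def classification_metrics(y_true, y_pred, positive="High"):
    """Confusion-matrix based measures for a two-level response."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == positive) & (y_pred == positive))
    tn = np.sum((y_true != positive) & (y_pred != positive))
    fp = np.sum((y_true != positive) & (y_pred == positive))
    fn = np.sum((y_true == positive) & (y_pred != positive))
    return {
        "misclassification rate": (fp + fn) / (tp + tn + fp + fn),
        "sensitivity (true positive rate)": tp / (tp + fn),
        "specificity (true negative rate)": tn / (tn + fp),
        "false positive rate": fp / (fp + tn),
        "false negative rate": fn / (fn + tp),
    }
```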
So let me switch over to JMP. I'll tuck this away and open up the diabetes data, and let me make this big so you can see it. So these are the data. There's information on 442 patients, and again, we've got a response Y, which is continuous; this is the amount of disease progression after one year. We'll start off looking at Y, but we also have the second variable, Y Binary. And we've got the baseline information.
And there's a column called Validation. Again, when we fit models, we're going to fit them using only the training data, and we're going to use the validation data to tell us when to stop growing the model and to give us measures of how well the model actually fits. To build this column, there is a utility under Predictive Modeling called Make Validation Column. This gives us a lot of options for building a validation column, so we can partition the data into training, validation, and test sets. In most cases, if we're using this technique of partitioning our data into subsets, having a test set is recommended. A test set gives you an assessment of model performance on data that wasn't used for building the models or for stopping model growth, so I'd recommend that, even though we don't have a test set in this case.
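Conceptually, that utility assigns each row to a subset, often at random. A minimal Python analogue of the idea, with arbitrary rates and seed, might look like this:

```python
import numpy as np

def make_validation_column(n_rows, rates=(0.6, 0.2, 0.2), seed=12345):
    """Randomly assign rows to Training / Validation / Test.
    A rough analogue of the idea behind JMP's utility; the rates and seed are
    arbitrary, and JMP's utility can also do stratified or grouped assignment."""
    rng = np.random.default_rng(seed)
    labels = np.array(["Training", "Validation", "Test"])
    return rng.choice(labels, size=n_rows, p=rates)

validation = make_validation_column(442)   # one label per patient in the diabetes table
```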
Let's say that I want to find a model that most accurately predicts the response. As you saw, there are a lot of different models to choose from. I'll start with Fit Model, which is usually a good starting point. I'm going to build a regression model with Y as the response and the Xs as model effects, and I'll only put in main effects at this point. And I'm going to add the validation column. From Fit Model, the default personality is going to be standard least squares, but there are a lot of different types of models we can fit. I'm simply going to run the least squares model.

A couple of things to point out here. Notice the marker here, V. Remember that we fit our model to the training data, but we also want to see how well the model performs on the validation data, so all of the points marked with a V are observations in the validation set. Because we have validation, there is a Crossvalidation section here, so we can look at R square on the training set and also on the validation set, and then RASE. Often you'll see that the validation statistics are somewhat worse than the training statistics, and the farther apart they are, the stronger the indication that your model is overfit or underfit.
I want to point out one other thing here that's really beyond the scope of this talk, and that's the prediction profiler. The prediction profiler is actually one of the reasons I first started using JMP; it's a really powerful way of understanding your model. I can change the value of any X and see what happens to the predicted response, and this is the average; with these models, we're predicting the average. But notice how these bands fan out for total cholesterol, LDL, and HDL. This is because we don't have any data out in those regions. So the new feature I want to point out really quickly, and again this is beyond the scope of this talk, is something called extrapolation control. If I turn on a warning and drag total cholesterol, notice that it's telling me there's a possible extrapolation. This is telling me I'm trying to predict in a region where I really don't have any data, and if I turn extrapolation control on, notice that it truncates this line; it's basically saying you can't make predictions out in that region. So it's something you might want to check out if you're fitting models. It's a really powerful tool.
Let's say that I've done all the diagnostic work, I've reduced this model, and I want to save my results. There are a couple of ways to do this. I can go to the red triangle, Save Columns, and save the prediction formula. This saves the linear model I've just built out to the data table; you can see it here in the background. Then, if I add new values to this data table, it'll automatically predict the response. But I might want to save the model in another form, so to do this, I might publish this formula to the Formula Depot. The Formula Depot is actually independent of my data table, but it allows me to copy the script and paste it into another data table with new data, to score new data. Or I might want to generate code in a different language so I can deploy the model within some sort of production system.
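However the model is deployed, scoring just means applying the saved prediction formula to new rows. As a purely illustrative sketch, with hypothetical coefficients and column names that are not the fitted diabetes model and not the code the Formula Depot generates:

```python
# Hypothetical coefficients standing in for a published prediction formula.
# (These are NOT the fitted diabetes-model values, just placeholders.)
intercept = 152.0
coefs = {"BMI": 5.6, "BP": 1.1, "LTG": 44.0}

def score(row):
    """Apply the exported linear prediction formula to one new observation."""
    return intercept + sum(coefs[name] * row[name] for name in coefs)

new_patient = {"BMI": 27.5, "BP": 94.0, "LTG": 4.8}
print("Predicted disease progression:", score(new_patient))
```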
I'm going to go ahead and close this. This is just one model. Now, is it possible that a more complicated or sophisticated model might perform better? So I might fit another model. For example, I'll just hit Recall and change the personality from standard least squares to generalized regression. This allows me to specify different response distributions; I'll stick with normal and click Run. This lets me fit different penalized methods and also use different variable selection techniques. If you haven't checked out generalized regression, it's a super powerful and flexible modern modeling platform. I'm just going to click Go. Let's say that I fit this model and I want to compare it to the model I've already saved. I might save the prediction formula out to the data table, so now I have another column in the background in the data table. Or I might again publish this to the Formula Depot, so now I've got two different models there. And I can keep going; this is just one model from generalized regression. I can also fit several different types of predictive models from the Predictive Modeling menu, for example neural networks, partition, bootstrap forest, or boosted trees.

Typically, what I would have to do is fit all of these models and save the results out either to the data table or to the Formula Depot. If I save them to the data table, I can use the Model Comparison platform to compare the competing models; I might have many models, but here I only have two. I don't actually have to specify what the models are, I only need to specify validation, and I actually like to put validation down here in the By field. This gives me my fit statistics for both the training set and the validation set, and I'm only going to look at the statistics for the validation set. I would use this to help me pick the best-performing model. What I'm interested in is a higher value of R square, a lower value of RASE (the root average squared error), and a lower average absolute error. Between these two models, it looks like the least squares regression model is the best.
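The underlying comparison is straightforward: fit each candidate, then line up its validation metrics side by side. A generic sketch of that idea, reusing the holdout split from the earlier snippet; the candidate models here are scikit-learn stand-ins, not the JMP platforms:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LassoCV
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error

# X_train, y_train, X_val, y_val: the holdout split from the earlier sketch.
candidates = {
    "Least Squares":     LinearRegression(),
    "Lasso (penalized)": LassoCV(cv=5),
    "Boosted Tree":      GradientBoostingRegressor(random_state=1),
}

for name, model in candidates.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_val)
    rase = np.sqrt(np.mean((y_val - pred) ** 2))   # root average squared error
    aae = mean_absolute_error(y_val, pred)         # average absolute error
    print(f"{name:18s}  RASE={rase:7.2f}  AAE={aae:7.2f}")
```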
Now, if I were to fit all the possible models this way, it could be quite time-consuming. So instead of doing this, what's new in JMP Pro 16 is a platform called Model Screening. When I launch Model Screening, it has a dialog at the top, just like we've seen, so I'll go ahead and populate it, and I'll add the validation column. But over on the side, you can see that I can select several different methods and fit all of these models at one time: decision tree, bootstrap forest, boosted tree, K nearest neighbors, and so on. It won't run models that don't make sense; I have a continuous response, and although logistic regression is one of the options, it won't run a logistic regression model with a continuous response. Notice that I've also got this option down here, XGBoost. The reason that appears is that there's an add-in that uses open-source libraries. It's called XGBoost, it's available in the JMP User Community, and it only works in JMP Pro; if you install this add-in, it will automatically appear in the Model Screening dialog.

So I'm just going to click OK, and when I click OK, it's going to go out and launch each of these platforms and then pull all the results into one window. I clicked OK, and I don't have a lot of data here, so it's very fast. Under the details, these are all of the individual models that were fit, and if I open up any one of these, I can see the results and I have access to the additional options that would be available from that platform's menu.
I'm going to tuck away the details. By default, the models are fit to the training data, but the report shows me the results for the validation data, so I can see R square and RASE. By default it's sorted by RASE, where lowest is best. But I've got a few shortcut options down here. If I want to find the best models, and it could be that R square is best for some models while RASE is better for others, I can click Select Dominant.
In this case, it selected Neural Boosted; across all of these models, the best model is the boosted neural network. If I want to take a closer look at this model, I can either look at the details up here under Neural, or I can simply say Run Selected. Now, I didn't mention this, but in the dialog window there's an option to set a random seed, and if I set that random seed, then the results that launch here will be identical to what I see here. So this is a neural model with three nodes using the TanH function, but it's also using boosting. In designing this platform, the developers ran a lot of simulations to determine the best starting points and the best default settings for the different models. So Neural Boosted is the best.
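One reasonable way to think about "dominant" here is in the Pareto sense: a model is kept if no other model is at least as good on every reported metric and strictly better on at least one. That's my reading of the idea, not necessarily JMP's exact rule; here's a small sketch of it with made-up validation numbers:

```python
def dominates(a, b, higher_is_better):
    """True if model a is at least as good as b on every metric and strictly
    better on at least one. higher_is_better maps metric name -> bool."""
    at_least_as_good = all(
        (a[m] >= b[m]) if higher_is_better[m] else (a[m] <= b[m])
        for m in higher_is_better
    )
    strictly_better = any(
        (a[m] > b[m]) if higher_is_better[m] else (a[m] < b[m])
        for m in higher_is_better
    )
    return at_least_as_good and strictly_better

def select_dominant(results, higher_is_better):
    """Keep models not dominated by any other model (a Pareto-style screen)."""
    return [
        name for name, metrics in results.items()
        if not any(dominates(other, metrics, higher_is_better)
                   for other_name, other in results.items() if other_name != name)
    ]

results = {   # illustrative validation metrics, not real output
    "Neural Boosted": {"RSquare": 0.56, "RASE": 53.1},
    "Stepwise":       {"RSquare": 0.52, "RASE": 55.4},
    "Boosted Tree":   {"RSquare": 0.48, "RASE": 57.9},
}
print(select_dominant(results, {"RSquare": True, "RASE": False}))
```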
And if I want to deploy this model, I can now save the script, run it, or save it out to the Formula Depot.

So that's with a continuous response, and there are some other options under the red triangle. What if I have a categorical response? For a categorical response, I can use the same platform. Again, I'll go to Model Screening and click Recall, but instead of using Y, I'll use Y Binary. I'm not going to change any of the default methods, but I will put in a random seed, for example 12345; I'm just grabbing a random number. What this gives me is repeatability, so if I save any model out to the data table or to the Formula Depot, the statistics and the model fit will be the same. There are a few other options here. We might want to turn off some things, like the reports. We might want to use a different crossvalidation method: this platform includes K fold crossvalidation, and it also offers nested K fold crossvalidation, which we can repeat. That's really nice; sometimes partitioning our data into training, validation, and test isn't the best approach, and K fold can actually be a little bit better. There are also some additional options at the bottom. We might want to add two-way interactions or quadratic effects, and under Additional Methods, it will fit additional generalized regression methods. So I'm just going to go ahead and click OK.
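While that runs, a quick aside on what plain K fold crossvalidation is doing: the data are split into K folds, the model is fit K times with a different fold held out each time, and the holdout results are averaged. A generic sketch of that idea in Python (this is the ordinary version, not the nested variant, and the model here is just a stand-in):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

def kfold_misclassification(X, y, k=5, seed=12345):
    """Average holdout misclassification rate over k folds (plain K fold)."""
    kf = KFold(n_splits=k, shuffle=True, random_state=seed)
    rates = []
    for train_idx, test_idx in kf.split(X):
        model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        rates.append(np.mean(pred != y[test_idx]))
    return float(np.mean(rates))
```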
OK. It runs all the models, and again, this is a small data set, so it's very quick. The look and feel are the same, but now the statistics are different: I've got the misclassification rate, the area under the curve, some R square measures, and the root average squared error. I'm going to click Select Dominant, and again, the dominant method is Neural Boosted.

Now, what if I want to explore some of these different models? The misclassification rate here is a fair amount lower than it is for stepwise. The AUC is fairly similar; Neural Boosted is best overall, but maybe not by much. Let me grab a few of these; maybe I'll select these five. I can look at ROC curves, if I'm interested in comparing the models that way, and there are some nice controls here that allow me to turn off models and focus on certain ones.
And a new feature that I'm really excited about is the Decision Threshold report. The decision threshold report, which starts by looking at the training data, gives us a snapshot of our data. The misclassification rate is based on a cutoff of .5 for classification. For each of the models, it's showing me the points that were actually high and low, and if we focus on the highs, the green dots were correctly classified as high and the ones in the red area were misclassified. So it's showing us correct and incorrect classifications, and it reports all the statistics over here on the side. Down below we see several different metrics, plus a lot of graphs that let us look at false classifications and also true classifications. I'm going to close this and look at the validation data.

So why is this useful? You might have a scenario where you're interested in maximizing sensitivity while maintaining a certain specificity. There are some definitions over here: sensitivity is the true positive rate, and specificity is the true negative rate. This is a scenario where we're looking at disease progression, so we want to make sure we maintain a high sensitivity rate while also making sure that our specificity is high. What we can do here is grab this slider and see how the classifications change as we change the cutoff for classification. I think this is a really powerful tool when you're looking at competing models, because some models might have the best misclassification rate at the default cutoff of .5, but if you change the cutoff for classification, different models might perform differently. So, for example, if I'm in a certain region here, I might find that the stepwise model is actually better.
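The slider is essentially recomputing the confusion matrix at each cutoff. A generic sketch of that computation (plain Python, with the response coded 0/1 and an arbitrary grid of cutoffs):

```python
import numpy as np

def threshold_profile(y_true, prob_high, cutoffs=np.linspace(0.05, 0.95, 19)):
    """Sensitivity, specificity, and misclassification rate at each cutoff.
    y_true is coded 0/1; prob_high is the predicted probability of the
    positive ("High") level."""
    y_true = np.asarray(y_true)
    prob_high = np.asarray(prob_high)
    rows = []
    for c in cutoffs:
        pred = prob_high >= c
        tp = np.sum(pred & (y_true == 1))
        fn = np.sum(~pred & (y_true == 1))
        tn = np.sum(~pred & (y_true == 0))
        fp = np.sum(pred & (y_true == 0))
        rows.append({
            "cutoff": round(float(c), 2),
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
            "misclassification": (fp + fn) / y_true.size,
        })
    return rows
```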
Now, to further illustrate this, I want to open a different example called Credit Card Marketing. Let me go back to my slides just to introduce the scenario. This is a scenario where we've got a lot of data, based on market research, on the acceptance of credit card offers. The response is whether the offer was accepted, and only 5.5% of the offers in the study were actually accepted. There are factors that we're interested in: there are different types of rewards and different mailer types, so this is actually a designed experiment, but we're going to ignore that aspect of the study here. There's also financial information. We're going to stick to one goal in this example, and that's identifying customers who are most likely to accept the offer. If we can identify the customers who are most likely to accept, we might send offers only to that subset and ignore the rest. So that's the scenario.

I'm going to open these data. I've got 10,000 observations, and my response is Offer Accepted. I've already saved the script to the data table, and I've got air miles, mailer type, income level, credit rating, and a few other pieces of financial information. I ran the saved script, and it's going through and running all the models. Neural, in this case, will take the most time because it's running a boosted neural network, so it will take a few more seconds. It's running support vector machines; support vector machines will time out and actually won't run if I have more than 10,000 observations. I'll give it another second. I'm using standard validation for this, with a validation column: in this case, I've got a column of zeros and ones, and JMP will recognize the zeros as the training data and the ones as the validation data.
Okay, there we go. It ran, and if you're dealing with a large data table, there is a report you can run to look at elapsed times. For this run, support vector machines actually took the longest, and this is why it won't run if you have more than 10,000 observations.

So let's look at these. Our best models, if I click Select Dominant, are Neural Boosted and Decision Tree, but I want to point something out here. Notice the misclassification rate: it's identical for all of the models except support vector machines. Why is this the case? Well, if I run a distribution of Offer Accepted (let me make this a little bigger so we can see it) and focus on the validation data, notice that only .063 of our observations were Yes, which is exactly the misclassification rate these models reported.
And why is it doing this? I'm going to again ask for the decision threshold. Focusing on this graph here (and this graph has a lot of uses), what it shows us in this case is that our cutoff for classification is .5, but none of our fitted probabilities were even close to that. As a result, the model either classified the no's correctly as no's or classified the yeses as no's; it never classified anything as a yes, because none of the probabilities were greater than .5. If I stretch this out, I can see the difference between these two models: the top probability was around .25 for Neural Boosted, and for the decision tree it was about .15. Notice that the decision tree is basically doing a series of binary splits, so I've got just a handful of distinct predicted values, whereas the boosted neural model shows a nice random scatter of points. So let me change the cutoff to something like .12. With the cutoff at .12, and in fact as I slide this around, notice that the lower I go (I'm going to turn on the metrics here), I start getting some true positives and some false positives. As I drag this, you can see it in the bar, though the bar is kind of small: for Neural Boosted, I'm starting to see some true positives and some false positives, and as soon as I get past this first set of points, I start seeing them for the decision tree as well. So using a cutoff of .5 doesn't make sense for these data, and again, I might try to find a cutoff that gives me the most sensitivity while maintaining a decent level of specificity.
In this case, I want to point out these two other statistics. F1 is the F1 score, and this is really a measure of how well we're able to capture the true positives. MCC is the Matthews correlation coefficient, and this is a good measure of how well we classify within each of the four possibilities: a false positive, a false negative, a true positive, and a true negative. I didn't actually say that these correspond to the boxes here, but I've got four different outcomes, and MCC is a correlation coefficient that falls between minus one and plus one and measures how well I'm predicting in each one of those four boxes.
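Both statistics come from those same four counts. For reference, here's a small sketch of the formulas (the example counts are made up):

```python
import math

def f1_and_mcc(tp, fp, tn, fn):
    """F1 score and Matthews correlation coefficient from confusion counts."""
    f1 = 2 * tp / (2 * tp + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return f1, mcc

# Made-up counts: 40 true positives, 10 false positives, 930 true negatives, 20 false negatives.
print(f1_and_mcc(tp=40, fp=10, tn=930, fn=20))   # roughly F1 = 0.73, MCC = 0.71
```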
So I might want to explore the cutoff that gives me the maximum F1 value or the maximum MCC value. Let's say I drag this way down. Notice that the sensitivity is growing quickly and the specificity is starting to drop, so at some point I've dropped off too far in specificity. I might settle on a cutoff, and at the bottom there's an option to set a profit matrix. If I set this as my profit matrix, basically what happens is it allows me to score new data using this cutoff. So if I set it here and hit OK, any future predictions that I make, if I save the model out to the data table or to the Formula Depot, will use that cutoff.

And this is a scenario where I might actually have some financial information that I could build into the profit matrix. So, for example, instead of using the slider to pick the cutoff, maybe I have some knowledge of the dollar value associated with my classifications. Maybe if the actual response is a no, but I think they're going to say yes and I send them an offer, that costs me $5, so I have a negative value there. And maybe I have some idea of the potential profit, say $100 over this time period, and maybe I've got some lost opportunity, say -$100, if the person would actually have responded but I didn't send them the offer. Sometimes we leave that cell blank. Now, if I use this instead, some additional information shows up, so it recognizes that I have a profit matrix, and if I look at the metrics, I can make decisions about my best model based on this profit. I'm bringing this additional information into the decision making; sometimes we have a profit matrix we can use directly and sometimes we don't. This is one of those cases where I can see that the Neural Boosted model is going to give me the best overall profit.
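Conceptually, the profit matrix turns the threshold question into an expected-value question: send the offer when the expected profit of sending exceeds the expected profit of not sending. A small sketch using the dollar values mentioned above, which are illustrative figures from the talk rather than real costs:

```python
def expected_profits(p_accept, profit_matrix):
    """Expected profit of sending vs. not sending an offer, given the predicted
    probability of acceptance and a profit matrix keyed by (actual, decision)."""
    send = (p_accept * profit_matrix[("Yes", "send")]
            + (1 - p_accept) * profit_matrix[("No", "send")])
    hold = (p_accept * profit_matrix[("Yes", "no send")]
            + (1 - p_accept) * profit_matrix[("No", "no send")])
    return send, hold

# Dollar values discussed in the talk (illustrative, not real costs):
profit_matrix = {
    ("Yes", "send"): 100.0,      # offer sent and accepted
    ("No", "send"): -5.0,        # cost of mailing an offer that is declined
    ("Yes", "no send"): -100.0,  # lost opportunity
    ("No", "no send"): 0.0,
}

for p in (0.01, 0.03, 0.10):
    send, hold = expected_profits(p, profit_matrix)
    print(f"p = {p:.2f}: send = {send:7.2f}, don't send = {hold:7.2f}")
```

With these values, sending becomes the better decision once the predicted acceptance probability exceeds roughly 0.024, which is why a cutoff far below .5 makes sense for these data.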
So that's a sneak peek at model screening; let me go back to my slides. What have we seen here? We talked about predictive modeling and how it has a different goal than explanatory modeling: our goal here is to accurately predict or classify future outcomes, so we want to score future observations. We typically fit and compare many models using validation and pick the model, or combination of models, that predicts most accurately. Model screening really streamlines this workflow, so you can fit many different models at the same time with one platform. I really only went with the defaults, so you can fit much more sophisticated models than I fit here. It makes it really easy to select the dominant models, explore the model details, and fit new models from the details. The decision threshold, if you're dealing with categorical data, lets you explore cutoffs for classification and also integrates the ability to include a profit matrix. And for any selected model, we can deploy the model to the Formula Depot or save it to the data table. So it's a really powerful new tool.
For more information: on the classification metrics, the F1 score and the Matthews correlation coefficient were relatively new to me, and to make sense of sensitivity and specificity, the Wikipedia article has some really nice examples and discussion. There are also some really nice resources for predictive modeling and model screening. In the JMP User Community, there's a new path, Learn JMP, with access to videos, including the Mastering JMP series. There was a really nice talk last year at JMP Discovery Summit in Tucson by Ruth Hummel and Mary Loveless on which model to use when, and it does a nice job of talking about different modeling goals and when you might want to use each of the models. If you're brand new to predictive modeling, our free online course STIPS, Statistical Thinking for Industrial Problem Solving, has an introduction to predictive modeling in Module 7, so I'd recommend that. There is a model screening blog that uses the diabetes data, and I also want to point out that there's a second edition of the book Building Better Models with JMP Pro coming out within the next couple of months. They don't have a new cover yet, but model screening will be included in that edition.

So that's all I have. Please feel free to post comments or ask questions, and I hope you enjoy the rest of the conference. Thank you.