Americas 2020

The Imbalanced Classification Add-In: Compare Sampling Techniques and Models (2020-US-30MP-625)

Michael Crotty, JMP Senior Statistical Writer, SAS
Marie Gaudard, Statistical Consultant, Statistical Consultant
Colleen McKendry, JMP Technical Writer, JMP

The need to model data sets involving imbalanced binary response data arises in many situations. These data sets often require different handling than those where the binary response is more balanced. Approaches include comparing different models and sampling methods using various evaluation techniques. In this talk, we introduce the Imbalanced Classification add-in for JMP Pro that facilitates comparing a set of modeling techniques and sampling approaches.

The Imbalanced Classification add-in for JMP Pro enables you to quickly generate multiple sampling schemes and to fit a variety of models in JMP Pro. It also enables you to compare the various combinations of sampling methods and model fits on a test set using Precision-Recall, ROC, and Gains curves, as well as other measures of model fit. The sampling methods range from relatively simple to complex methods, such as the synthetic minority oversampling technique (SMOTE), Tomek links, and a combination of the two. We discuss the sampling methods and demonstrate the use of the add-in during the talk.

The add-in is available here: Imbalanced Classification Add-In - JMP User Community.

Auto-generated transcript...

Speaker	Transcript
Michael Crotty	Hello. Thank you for tuning in to our talk about the imbalanced classification add-in, which allows you to compare sampling techniques and models in JMP Pro.
	I'm Michael Crotty. I'm one of the statistical writers on the documentation team at JMP, and my co-presenter today is Colleen McKendry, also on the stat doc team. And this is work that we've collaborated on with Marie Gaudard.
	So here's a quick outline of our talk today. We will look at the purpose of the add-in that we created, some background on the imbalanced classification problem, and how you obtain a classification model in that situation.
	We'll look at some sampling methods that we've included in the add-in and that are popular for the imbalanced classification problem.
	We'll look at options that are available in the add-in and talk about how to obtain the add-in, and then Colleen will show an example and a demo of the add-in.
	In the slides that are available on the Community, there's also references and an appendix that has additional background information.
	So the purpose of our add-in, the imbalanced classification add-in, is that it lets you apply a variety of sampling techniques that are designed for imbalanced data.
	You can compare the results of applying these techniques, along with various predictive models that are available in JMP Pro.
	And you can compare those models and sampling technique fits using precision recall curves, ROC curves, and Gains curves, as well as other measures.
	This allows you to choose a threshold for classification using the curves.
	And you can also apply the Tomek, SMOTE, and SMOTE plus Tomek sampling techniques directly to your data, which enables you to then use existing JMP platforms on that newly sampled data and fine-tune the modeling options, if you don't like the mostly default method options that we've chosen.
	And just one note, the Tomek, SMOTE and SMOTE plus Tomek sampling techniques can be used with nominal and ordinal, as well as continuous predictor variables.
	So some background on the imbalanced data problem.
	So in general, you could have a multinomial response, but we will focus on the response variable being binary, and the key point is that the number of observations at one response level is much greater than the number of observations at the other response level.
	And we'll call these response levels the majority and minority class levels, respectively. So the minority level, most of the time, is the level of interest that you're interested in predicting and detecting. This could be like detecting fraud or the presence of a disease or credit risk.
	And we want to predict class membership based on regression variables.
	So to do that, we develop a predictive model that assigns probabilities of membership in the minority class, and then we choose a threshold value that optimizes various criteria. This could be misclassification rate, true positive rate, false positive rate, you name it. And then we classify an observation into the minority class if the predicted probability of membership in the minority class exceeds the chosen threshold value.
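The thresholding rule described here can be sketched in a few lines of Python. This is only an illustration, not the add-in's code: the function name `classify` and the probabilities are made up, and the rule uses a strict "exceeds" comparison as in the talk.

```python
def classify(probabilities, threshold):
    """Label a row 1 (minority) when its predicted minority-class
    probability exceeds the threshold, otherwise 0 (majority)."""
    return [1 if p > threshold else 0 for p in probabilities]

# Hypothetical predicted probabilities of minority-class membership.
predicted = [0.02, 0.40, 0.51, 0.97]

labels_default = classify(predicted, 0.5)
labels_lower = classify(predicted, 0.3)

print(labels_default)  # [0, 0, 1, 1]
print(labels_lower)    # [0, 1, 1, 1]
```

Lowering the threshold flags more rows as minority cases, which is the trade-off the curves discussed later make visible.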
	So how do we obtain a classification model?
	We have lots of different platforms in JMP that can make a prediction for a binary outcome in the presence of regression variables, and we need a way to compare those models. Well, some traditional measures, like classification accuracy, are not all that appropriate for imbalanced data. And just as an extreme example, you could consider the case of a 2% minority class.
	I could give you 98% accuracy just by classifying all the observations as majority cases. Now this would not be a useful model and you wouldn't want to use it, because you're not correctly predicting any of your target cases, the minority cases. But just on overall accuracy, you'd be at 98%, which sounds pretty good.
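The 2% example can be checked with a short Python sketch. The data here are made up (20 minority rows out of 1,000), and `accuracy` and `recall` are illustrative helper names.

```python
def accuracy(actual, predicted):
    """Fraction of rows whose predicted class matches the actual class."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def recall(actual, predicted):
    """Fraction of actual minority rows (label 1) that are predicted as minority."""
    true_pos = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    return true_pos / sum(actual)

actual = [1] * 20 + [0] * 980   # 2% minority class
all_majority = [0] * 1000       # "model" that always predicts the majority class

print(accuracy(actual, all_majority))  # 0.98
print(recall(actual, all_majority))    # 0.0
```

The accuracy looks good, but no minority case is ever detected, which is why accuracy alone is a poor yardstick here.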
	So this led people to explore other ways to measure classification accuracy in an imbalanced classification model. One of those is the precision recall curve.
	They're often used with imbalanced data and they plot the positive predictive value, or precision, against the true positive rate, or recall.
	And because the precision takes majority instances into account, the PR curve is more sensitive to class imbalance than an ROC curve.
	As such, a PR curve is better able to highlight differences in models for the imbalanced data. So the PR curve is what shows up first in our report for our add-in.
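A small Python sketch can show why precision reacts to class imbalance while the false positive rate (the x-axis of an ROC curve) does not. The confusion counts are hypothetical; the second scenario keeps the classifier's behavior fixed and only multiplies the number of majority cases by 50.

```python
def precision(tp, fp):
    """Positive predictive value: share of flagged rows that are truly minority."""
    return tp / (tp + fp)

def recall(tp, fn):
    """True positive rate: share of minority rows that are flagged."""
    return tp / (tp + fn)

def false_positive_rate(fp, tn):
    """Share of majority rows that are wrongly flagged."""
    return fp / (fp + tn)

# Hypothetical confusion counts at one fixed threshold.
tp, fn, fp, tn = 80, 20, 10, 90

# Same error rates, but 50 times as many majority cases.
fp_big, tn_big = fp * 50, tn * 50

print(recall(tp, fn))                                                  # 0.8 in both scenarios
print(false_positive_rate(fp, tn), false_positive_rate(fp_big, tn_big))  # unchanged
print(precision(tp, fp), precision(tp, fp_big))                        # about 0.89, then about 0.14
```

The ROC coordinates are identical in both scenarios, while precision drops sharply once the majority class dominates.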
	Another way to handle imbalanced classification data is to use sampling methods that help to model the minority class.
	And in general, these are just different ways to impose more balance on the distribution of the response, and in turn, that helps to better delineate the boundaries between the majority and minority class observations. So in our add-in we have seven different sampling techniques.
	We won't talk too much about the first four and we'll focus on the last three, but very quickly, no weighting means what it sounds like. We won't do any...won't make any changes and that's
	essentially in there to provide a baseline to what you would do if you didn't do any type of sampling method to account for the imbalance.
	Weighting will overweight the minority cases so that the sum of the weights of the majority class and the minority class are the same.
	Random undersampling will randomly exclude majority cases to get to a balanced case and random oversampling will randomly replicate
	minority cases again to get to a balanced state.
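The three simple balancing techniques can be sketched in Python as follows. This is a minimal illustration under assumptions, not the add-in's implementation: the function names, the 5-in-100 data, and the seed are made up, and the sampling functions return row indices rather than modifying a table.

```python
import random

def balance_weights(labels):
    """Weight minority rows so both classes have the same total weight."""
    n_min = sum(labels)
    n_maj = len(labels) - n_min
    return [n_maj / n_min if y == 1 else 1.0 for y in labels]

def random_undersample(labels, rng):
    """Row indices: all minority rows plus an equal-size random subset of majority rows."""
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    return sorted(minority + rng.sample(majority, len(minority)))

def random_oversample(labels, rng):
    """Row indices: all rows plus random minority replicates until the classes are equal in size."""
    minority = [i for i, y in enumerate(labels) if y == 1]
    n_maj = len(labels) - len(minority)
    extra = rng.choices(minority, k=n_maj - len(minority))
    return sorted(list(range(len(labels))) + extra)

labels = [1] * 5 + [0] * 95      # hypothetical data: 5% minority
rng = random.Random(2020)        # arbitrary seed for reproducibility

weights = balance_weights(labels)
under = random_undersample(labels, rng)
over = random_oversample(labels, rng)

print(len(under), len(over))     # 10 190
```

Undersampling discards majority information and oversampling repeats minority rows exactly, which motivates the more advanced methods that follow.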
	And then we'll talk more about the next three more advanced methods in the following slides.
	So first of the advanced methods is SMOTE, which stands for synthetic minority oversampling technique.
	And this is basically a more sophisticated form of oversampling, because we are adding more minority cases to our data.
	We do that by generating new observations that are similar to the existing minority class observations, but we're not simply replicating them like in oversampling.
	So we use the Gower distance function and perform k nearest neighbors on each minority class observation, and then observations are generated to fill in the space that is defined by those neighbors.
	And in this graphic, you can see if we've got this minority case here in red. We've chosen the three nearest neighbors.
	And we'll randomly choose one of those. It happens to be this one down here, and then we generate a case, another minority case that is somewhere in this little shaded box. And that's in two dimensions. If you had
	n dimensions of your predictors, then that shaded area would be an n dimensional space.
	But one key thing to point out is that you can choose the number of nearest neighbors that you randomly choose between, and you can also choose how many times you'll perform this algorithm per minority case.
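The SMOTE steps described above can be sketched in Python. This is a simplified illustration, not the add-in's code, and it rests on several assumptions: predictors are continuous only, the Gower distance is taken as the mean range-scaled absolute difference, ranges are computed from the minority rows alone, and each predictor gets its own random draw so the new row lands in the box spanned by a minority row and its chosen neighbor.

```python
import random

def gower_distance(a, b, ranges):
    """Gower distance for continuous predictors: mean range-scaled absolute difference."""
    return sum(abs(x - y) / r for x, y, r in zip(a, b, ranges)) / len(ranges)

def smote(minority, k, replications, rng):
    """Generate `replications` synthetic rows per minority row,
    each based on one randomly chosen neighbor among its k nearest."""
    dims = len(minority[0])
    # Range of each predictor; fall back to 1.0 if a predictor is constant.
    ranges = [max(row[d] for row in minority) - min(row[d] for row in minority) or 1.0
              for d in range(dims)]
    synthetic = []
    for i, row in enumerate(minority):
        others = [r for j, r in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda r: gower_distance(row, r, ranges))[:k]
        for _ in range(replications):
            neighbor = rng.choice(neighbors)
            # Independent draw per predictor keeps the new row inside the
            # box defined by the row and its neighbor.
            synthetic.append([x + rng.random() * (n - x) for x, n in zip(row, neighbor)])
    return synthetic

# Hypothetical minority rows with two continuous predictors.
minority = [[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.0]]
synthetic = smote(minority, k=3, replications=2, rng=random.Random(1))

print(len(synthetic))  # 8
```

The two tuning knobs mentioned in the talk map to `k` (number of nearest neighbors) and `replications` (how many times the step runs per minority case).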
	The next sampling method is Tomek links. And what this method does is it tries to better define the boundary between the minority and majority classes. To do that, it removes observations from the majority class that are close to minority class observations.
	Again, we use the Gower distance to find Tomek links, and a Tomek link is a pair of nearest neighbors that fall into different classes. So one majority and one minority that are nearest neighbors to each other.
	And to reduce the overlapping of these instances, one or both members of the pair can be removed. In the main option of our add-in, the evaluate models option, we remove only the majority instance. However, in the Tomek option, you can use either form of removal.
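A Tomek link search can be sketched in Python as a mutual nearest neighbor check. This is an illustration under assumptions rather than the add-in's code: predictors are continuous, distance is the range-scaled Gower form, and removal is expressed as 0/1 weights. The one-predictor data are made up.

```python
def gower_distance(a, b, ranges):
    """Gower distance for continuous predictors: mean range-scaled absolute difference."""
    return sum(abs(x - y) / r for x, y, r in zip(a, b, ranges)) / len(ranges)

def tomek_links(rows, labels):
    """Index pairs (i, j) that are each other's nearest neighbor and differ in class."""
    dims = len(rows[0])
    ranges = [max(r[d] for r in rows) - min(r[d] for r in rows) or 1.0 for d in range(dims)]

    def nearest(i):
        others = [j for j in range(len(rows)) if j != i]
        return min(others, key=lambda j: gower_distance(rows[i], rows[j], ranges))

    nn = [nearest(i) for i in range(len(rows))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and labels[i] != labels[j]]

def tomek_weights(labels, links, remove_both):
    """0/1 weights: drop the majority member of each link, or both members."""
    weights = [1] * len(labels)
    for pair in links:
        for idx in pair:
            if remove_both or labels[idx] == 0:
                weights[idx] = 0
    return weights

# Hypothetical data: one predictor, label 1 is the minority class.
rows = [[0.0], [0.2], [1.0], [1.05], [2.0], [2.3]]
labels = [0, 0, 0, 1, 1, 1]

links = tomek_links(rows, labels)
majority_only = tomek_weights(labels, links, remove_both=False)
both = tomek_weights(labels, links, remove_both=True)

print(links)          # [(2, 3)]
print(majority_only)  # [1, 1, 0, 1, 1, 1]
print(both)           # [1, 1, 0, 0, 1, 1]
```

Rows 2 and 3 sit right at the class boundary, so they form the only link; the other mutual neighbors share a class and are left alone.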
	And finally, the last sampling method is SMOTE plus Tomek. This combines the previous two sampling methods.
	And the way it combines them is it applies the SMOTE algorithm to generate new minority observations, and then once you've got your original data, plus a bunch of generated new minority cases, it applies the Tomek algorithm to find pairs of nearest neighbors that fall into different classes. And in this method both observations in the Tomek pair are removed.
	So the imbalanced classification add-in has four options when you install it that all show up as submenu items under the Add-Ins menu.
	The first one is the evaluate models option, which allows you to fit a variety of models using a variety of sampling techniques. The next three are just standalone dialogues to do those three sampling techniques that we just talked about.
	So the evaluate models option of the add-in provides an imbalanced classification report that facilitates comparison of the model and sampling technique combinations.
	It shows the PR curve and ROC curves, as well as the Gains curves, and for the PR and ROC curves, it shows the area under the curve, which generally, the more area under each of those curves, the better a model is fitting.
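As a point of reference for the area under the ROC curve, here is one common way to compute it in Python: the fraction of (minority, majority) pairs in which the minority row gets the higher score, with ties counted as one half. This is a generic sketch with made-up data; the talk does not say how the add-in computes its AUC values.

```python
def roc_auc(actual, scores):
    """ROC AUC as the share of (minority, majority) pairs ranked correctly;
    tied scores count as half a correct pair."""
    pos = [s for a, s in zip(actual, scores) if a == 1]
    neg = [s for a, s in zip(actual, scores) if a == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical test-set labels and predicted minority-class probabilities.
actual = [1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]

print(roc_auc(actual, scores))  # 0.875
```

A value of 1.0 means every minority row outranks every majority row, and 0.5 corresponds to random ordering.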
	It provides the plot of predicted probabilities by class that helps you get a picture of how each model is fitting.
	And it also provides a techniques and thresholds data table, and that table contains a script that allows you to reproduce the report that is produced the first time you run the add-in. And we want to emphasize that if you run this and you want to save your results without rerunning the entire modeling and sampling methods algorithm, you can save this techniques and thresholds table, and that will allow you to save your results and reproduce the report.
	So now we'll look at the dialogue for the evaluating models option. It allows you to choose from a number of models and sampling techniques.
	You can put in what your binary class variable is and all your X predictors, and then, in order to fit all the models and evaluate them on a test set, we randomly divide the data into training, validation, and test sets. You can set the proportions that will go into each of those sets.
	There's a random seed option if you'd like to reproduce the results. And then there are SMOTE options
	that I alluded to before, where you can choose the number of nearest neighbors, from which you select one to be the nearest neighbor used to generate a new case, and replication of each minority case is how many times you repeat the algorithm for each minority observation.
	Again, there are three other sampling options in the add-in, and those correspond to Tomek, SMOTE, and SMOTE plus Tomek. In the Tomek sampling option, it's going to add two columns to your data table that can be used as weights for any predictive model that you want to do.
	The first column removes only the majority nearest neighbor in the link and the other removes both members of the Tomek link, so you have that option.
	SMOTE will add synthetic observations to your data table.
	And it will also provide a source column so that you can identify which observations were added. And SMOTE plus Tomek adds synthetic observations and a weighting column that removes both members of the Tomek link.
	And the weighting column from the Tomek sampling and SMOTE plus Tomek is just an indicator column that you can use as a weight in a JMP modeling platform. It's just a 1 if the observation is included, and a 0 if it should be excluded.
	Most of the three other sampling option dialogues look basically the same.
	One option that's on them and not on the evaluate models option dialogue is show intermediate tables. This option appears for SMOTE and SMOTE plus Tomek.
	And basically, it allows you to see data tables that were used in the construction of the SMOTE observations. In general, you don't need to see it, but if you want to better understand how those observations are being generated, you can take a look at those intermediate tables.
	And they're all explained in the documentation.
	Again, you can obtain the add-in through the Community, through the page for this talk in the Discovery Summit Americas 2020 part of the Community. And as I mentioned just a second ago, there's documentation available within the add-in. Just click the Help button.
	And now it is time for Colleen to show an example and a demo of the add-in.
Colleen McKendry	Thanks Michael. I'm going to demo the add-in now, and to do the demo, I'm going to use this mammography demo data.
	And so the mammography data set is based on a set of digitized film mammograms used in a study of microcalcifications in mammographic images.
	And in this data, each record is classified as either a 1 or a 0. 1 represents that there's calcification seen in the image, and a 0 represents that there is not.
	In this data set, the images where you see calcification, those are the ones you're interested in predicting and so the class level one is the class level that you're interested in.
	In the full data set, there are six continuous predictors and about 11,000 observations.
	But in order to reduce the runtime in this demo, we're only going to use a subset of the full data set. And so it's going to be about half the observations. So about 5500 observations.
	And the observations that are classified as 1, the things that you're interested in, they represent about 2.31% of the total observations, both in the full data set and in the demo data set that we're using. And so we have about a 2% minority proportion.
	And now I'm going to switch over to JMP.
	So I have the mammography demo data.
	And I've already installed the add-in, so I have the imbalanced classification add-in in my drop-down menu, and I'm going to use the evaluate models option.
	And so here we have the launch window, and we're going to specify the binary class variable and the predictor variables, we're going to select all the models and all the techniques, and we're going to specify a random seed. And click OK.
	And so while this is running, I'm going to explain what's actually happening in the background. So the first thing that the add-in does is that it splits the data table into a training data set and a test data set.
	And so you have two separate data tables and then within the training data table those observations are further split into training and validation observations and the validation is used in the model fitting.
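A seeded random split into training, validation, and test rows can be sketched in Python. This is an assumption-laden illustration, not the add-in's procedure: it assigns all three roles in one pass instead of two separate tables, it splits each class separately so the minority proportion is preserved in every set, and the 60/20/20 proportions and the seed are made up.

```python
import random

def stratified_split(labels, proportions, seed):
    """Assign each row to 'train', 'validation', or 'test', splitting each class
    separately. The first two proportions set the train and validation sizes;
    the remaining rows of each class go to the test set."""
    rng = random.Random(seed)
    assignment = [None] * len(labels)
    for level in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == level]
        rng.shuffle(idx)
        n_train = round(proportions[0] * len(idx))
        n_valid = round(proportions[1] * len(idx))
        for pos, i in enumerate(idx):
            if pos < n_train:
                assignment[i] = "train"
            elif pos < n_train + n_valid:
                assignment[i] = "validation"
            else:
                assignment[i] = "test"
    return assignment

labels = [1] * 20 + [0] * 980    # hypothetical data: 2% minority
split = stratified_split(labels, (0.6, 0.2, 0.2), seed=2020)

for part in ("train", "validation", "test"):
    rows = [y for y, s in zip(labels, split) if s == part]
    print(part, len(rows), sum(rows) / len(rows))
```

Reusing the same seed reproduces the same assignment, which is the role of a random seed option in a launch dialog.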
	And so once you have those two data sets, there are indicator columns that are added to the training data table for each of the sampling techniques that you specify, except for those that involve SMOTE.
	And so those columns are added and are used as weighting columns and they just specify whether the observation is to be included in the analysis or not.
	If you specify a sampling technique with SMOTE, then there are additional rows that are added to the data table. Those are the generated observations.
	So once your columns and your rows are generated, each model is fit to each sampling technique. And so if you select all of them like we just did here, there are a total of 42 different models that are being fit. And so that's what's happening right now. In the current demo, we have 42 models being fit, and once the models are fit, the relevant information is gathered and put together in a results report. And that report, which will hopefully pop up soon, here it is, that report is shown here. And you also get a techniques and thresholds table and a summary table.
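The count of 42 fits follows from crossing every model with every sampling technique. The Python sketch below enumerates that grid. The seven technique names come from the talk; the talk names only boosted tree and SVM as models, so the assumption of six models (42 divided by 7) and the four placeholder model names are hypothetical.

```python
from itertools import product

# Sampling techniques listed in the talk.
techniques = [
    "No Weighting", "Weighting", "Random Undersampling", "Random Oversampling",
    "SMOTE", "Tomek", "SMOTE plus Tomek",
]

# Only the first two model names appear in the talk; the rest are placeholders.
models = ["Boosted Tree", "SVM", "Model 3", "Model 4", "Model 5", "Model 6"]

fits = list(product(models, techniques))

print(len(fits))   # 42
print(fits[0])     # ('Boosted Tree', 'No Weighting')
```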
	And so we're going to take a look at what you get when you run the add-in. So first we have the training set. And you can see that here are the weighting columns, the weight columns that are added. And these are the columns that are added for the predicted probabilities for those observations.
	Then we have the test set. This doesn't contain any of those weighting columns, but it does have the predicted probabilities for the test set observations.
	We have the results report and the techniques and thresholds data table. And so Michael mentioned this in the talk earlier, but this is important because this is the thing that you would like to save if you want to save your results and view your results again without having to rerun the whole script. And so this data table is what you would save, and it contains scripts that will reproduce the results report and the summary table, which is the last thing that I have to show. And so this just contains summaries for each sampling technique and model combination and their AUC values.
	So now to look at the actual results window, at the top we have dialogue specifications. And so this contains the information that you specified in the launch window.
	So if you forget anything that you specified, like your random seed or what proportions you assign, you can just open that and take a look.
	And we also have the binary class distribution. So, the distribution of the class variable across the training and the test set. And this is designed so that the proportion should be the same, which they are in this case at 2.3%.
	And then we also have missing threshold. So this isn't super important, but if a value of the class variable has a missing prediction value, then that's shown here.
	For the actual results, we have these tabbed graphs. And so we have the precision recall curves, the ROC curves, and the cumulative Gains curves. And for the PR curves and the ROC curves, we have the corresponding AUC values as well.
	We also have these graphs of the predicted probabilities by class. And those are more useful when you're only viewing a few of them at a time, which we will later on.
	And then we have a data filter that connects all these graphs and results together.
	So for our actual results for the analysis, we can take a look now. So first I'm going to sort these.
	So you can already see that the ROC curve and the PR curve, there's a lot more differentiation between the curves in the PR curve than there is in the ROC curve.
	And if we select the top, say, five, these all actually have an AUC of .97.
	And you can see that they're all really close to each other. They're basically on top of each other. It would be really hard to determine which model is actually better, which one you should pick.
	And so that's where, particularly with imbalanced data, the precision recall curves are really important. So if we switch back over, we can see that these models that had the highest AUC values for the ROC curves,
	they're really spread out in the precision recall curve. And they're actually not...they don't have the highest AUC values for the PR curve.
	So maybe that there...maybe there's a better model that we can pick.
	So now I'm going to look and focus on the top two, which are boosted tree Tomek and SVM Tomek, and I'm going to do that using the data filter.
	And then, since we just want to look at those, I'm going to select Show and Include.
	So now we have the curves for just these two models and the blue curve is the boosted tree and the red curve is SVM.
	And so you can see in these curves that they kind of overlap each other across different values of the true positive rate. And so you could use these curves
	to choose which model you want to use in your analysis, based on maybe what an acceptable true positive rate would be. So we can see this if I add some reference lines. Excuse my hands that you will see as I type this.
	Okay, so say that these are some different true positive rates that you might be interested in. So if, for example, for whatever data set you have, you wanted a true positive rate of .55.
	You could pick your threshold to make that the true positive rate. And then in this case,
	for that true positive rate, the boosted tree Tomek model has a higher precision. And so you could pick that model.
	However, if you wanted your true positive rate to be something like .85, then the SVM model might be a better pick because it has a higher precision for that specific true positive rate.
	And then if you had a higher true positive rate of .95, you would flip again and maybe you would want to pick the boosted tree model.
	So that's how you can use these curves to pick which model is best for your data.
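The idea of comparing models by their precision at an acceptable true positive rate can be sketched in Python. This is an illustration with made-up scores for two hypothetical models, not output from the demo. The helper returns the best precision among thresholds whose true positive rate reaches at least the target, which is one reasonable reading of "precision at that true positive rate."

```python
def precision_at_recall(actual, scores, target_recall):
    """Highest precision among thresholds whose recall (true positive rate)
    is at least target_recall. A row is flagged when its score >= threshold."""
    total_pos = sum(actual)
    best = 0.0
    for t in sorted(set(scores), reverse=True):
        flagged = [a for a, s in zip(actual, scores) if s >= t]
        tp = sum(flagged)
        if tp / total_pos >= target_recall:
            best = max(best, tp / len(flagged))
    return best

# Hypothetical test set: four minority rows (1) and six majority rows (0).
actual  = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
model_a = [0.9, 0.8, 0.3, 0.2, 0.7, 0.5, 0.4, 0.1, 0.05, 0.01]
model_b = [0.9, 0.7, 0.6, 0.5, 0.95, 0.4, 0.3, 0.2, 0.1, 0.05]

for target in (0.5, 1.0):
    print(target,
          precision_at_recall(actual, model_a, target),
          precision_at_recall(actual, model_b, target))
```

In this made-up example, model A has the higher precision when a true positive rate of 0.5 is acceptable, while model B is better when every minority case must be caught, so the preferred model depends on the target.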
	And now we're going to look at these graphs again, now that there are only a few of them. And this just shows the distribution of predicted probabilities for each class for the models that we selected. So in this particular case, you can see that in SVM there are majority
	probabilities throughout the kind of the whole range of predicted probabilities, where boosted tree does kind of a better job of keeping them at the lower end.
	And so that's it for this particular demo, but before we're done, I just wanted to show one more thing. And so that was an example of how you would use the evaluate models option. But say you just wanted to use a particular sampling technique. And you can do that here. So the launch window looks much the same. And you can assign your binary class, your predictors, and click OK.
	And this generates a new data table, and you have your indicator column, which just shows whether the observation should be included in the analysis or not.
	And then because it was SMOTE plus Tomek you also have all these SMOTE generated observations.
	So now you have this new data table and you can use any type of model with any type of options that you may want and just use this column as your weight or frequency column and go from there. And that is the end of our demo for the imbalanced classification add-in. Thanks for watching.
