Imbalanced Classification Add-In


The Imbalanced Classification add-in features sampling techniques that attempt to impose a more balanced distribution between the two classes. The sampling techniques include the synthetic minority oversampling technique (SMOTE), Tomek links, and a combination of the two, as well as some basic sampling approaches. The Tomek Sampling, SMOTE Observations, and SMOTE plus Tomek options enable you to apply these sampling techniques on their own to support your specific modeling efforts.
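As a rough illustration of the SMOTE idea mentioned above (this is not the add-in's JSL implementation), the Python sketch below creates synthetic minority rows by interpolating between a minority row and one of its nearest minority neighbors. The function name `smote_samples`, the use of squared Euclidean distance on continuous predictors only, and the toy data are all assumptions made for this example.

```python
import random

def smote_samples(minority, k, n_new, seed=0):
    """Return n_new synthetic points interpolated between minority neighbors.

    minority: list of tuples of continuous predictor values (hypothetical data).
    k: number of nearest minority neighbors to choose from.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority rows by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((u - v) ** 2 for u, v in zip(base, p)))
        neighbor = rng.choice(others[:k])
        # Place the new point a random fraction of the way toward the neighbor.
        gap = rng.random()
        synthetic.append(tuple(u + gap * (v - u) for u, v in zip(base, neighbor)))
    return synthetic

minority_rows = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.0), (3.0, 3.0)]
new_rows = smote_samples(minority_rows, k=2, n_new=5, seed=1)
```

Each synthetic row lies on the line segment between two existing minority rows, so the new rows stay inside the region the minority class already occupies.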

The comprehensive Evaluate Models option, which requires JMP Pro, fits models using various sampling methods and compares them on a test set. You can then select classification thresholds using Precision-Recall, ROC, and Cumulative Gains curves, as well as other measures of classification accuracy. The other three options do not fit models; they only apply the Tomek, SMOTE, and SMOTE plus Tomek sampling schemes to your own data.

The SMOTE, Tomek, and combined SMOTE and Tomek sampling techniques use the concept of nearest neighbors. The add-in uses Gower distance as its distance metric, which allows for continuous, nominal, and ordinal predictors. The three sampling-only options (Tomek Sampling, SMOTE Observations, and SMOTE plus Tomek) do not require JMP Pro.
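To make the Gower distance concrete, here is a minimal Python sketch under stated assumptions: continuous predictors contribute the absolute difference divided by the predictor's range, nominal predictors contribute 0 when equal and 1 otherwise, and an ordinal predictor is rank-coded and treated like a continuous one. The ordinal treatment, the function name `gower_distance`, and the example rows are assumptions; the page does not say how the add-in handles these details.

```python
def gower_distance(a, b, kinds, ranges):
    """Average per-predictor dissimilarity between rows a and b.

    kinds[i] is "num" (continuous or rank-coded ordinal) or "cat" (nominal).
    ranges[i] is max minus min of predictor i over the data; unused for "cat".
    """
    total = 0.0
    for x, y, kind, rng in zip(a, b, kinds, ranges):
        if kind == "cat":
            total += 0.0 if x == y else 1.0
        else:
            # Scale by the range so every predictor contributes a value in [0, 1].
            total += abs(x - y) / rng if rng > 0 else 0.0
    return total / len(kinds)

# Hypothetical rows: (age, color, satisfaction rank on a 1-3 scale)
row_a = (30.0, "red", 2)
row_b = (40.0, "blue", 3)
kinds = ("num", "cat", "num")
ranges = (50.0, None, 2.0)

d = gower_distance(row_a, row_b, kinds, ranges)  # (0.2 + 1.0 + 0.5) / 3
```

Because every predictor is scaled to the same 0 to 1 interval, mixed continuous and categorical predictors can share a single nearest-neighbor search.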

Note: All options require JMP version 15.2 or higher. Excluded rows and rows with missing response values are ignored by the add-in.

Version 2, released 3/25/2021, supports JMP 16 and improves the handling of rows with missing values for all predictors.

Version 2.1, released 10/3/2023, fixes a bug in the interpolation of P-R values that occurred for specific data tables.

Version 2.2, released 8/15/2024, fixes a bug in the Help button on each of the launch windows.

goutam · ‎09-11-2020

Glad to see this add-in!

Raaed · ‎01-05-2021

Hi

please, the source of sampling techniques (books or articles)

michael_jmp · ‎01-07-2021

@Raaed This paper from He and Garcia is a good overview of the sampling techniques:

He, H. and Garcia, E. A. (2009). “Learning from Imbalanced Data.” IEEE Transactions on Knowledge and Data Engineering. 21, 9:1263-1284. https://www.academia.edu/29164932/Learning_from_Imbalanced_Data

Lu · ‎01-31-2021

Hi,

When running the add-in I get the following JMP Alert: "Invalid subscript (must be number or list of numbers) in access or evaluation of 'classvec[/*###*/near_row]', classvec[/*###*/near_row]"

A pop-up window "Creating weighting columns" shows up and the analysis gets stuck.

I already reinstalled the add-in, without success. Has anybody else had this problem? Any suggestions?

Thx in advance

Lu

michael_jmp · ‎02-01-2021

Hi @Lu, sorry to hear you are seeing an error message with the add-in. Would you be able to send me your data in a private message so I can investigate what might be causing the problem?

Thanks,
Michael

bfrank · ‎02-11-2021

Michael,

Thank you for posting this. Can you clarify what limitations exist for this add-in? It appears that it can't handle certain volumes of data.

michael_jmp · ‎02-12-2021

Hi @bfrank, it's hard to state a specific limit on the volume of data that the add-in can handle, because it is mostly limited by the amount of memory on the current machine (similar to the rest of JMP). We do know that especially as the number of categorical predictors increases, the runtime increases. Sorry we don't have more definitive specifications.

Thanks,
Michael

Zuska-Ariri · ‎05-13-2021

Hi, thanks for the numerous clarifications in the video. But I have been having the same issue as @Lu. It happens no matter what data I use; I also tried the sample used in the video, but nothing changed. Has the source of this problem been identified so far?

Thanks btw
- Zuska

michael_jmp · ‎05-14-2021

Hi @Zuska-Ariri, thanks for trying the add-in. Sorry that you're seeing some errors. We did identify the problem that @Lu reported to us, and it should be fixed in the "Version 2" of the add-in, linked on this page. Can you confirm that you are using version 2 of the add-in? Also, what version of JMP are you using?

If you're still seeing errors with version 2 of the add-in, we can investigate further.

Thanks,

Michael

Zuska-Ariri · ‎05-15-2021

I'm afraid it is version 2 of the add-in. I also tried installing it again, but nothing changed at all. I am using JMP Pro 16; I'm not sure whether there are some issues with this version as well. Thanks!!

OneSidedLemur23 · ‎07-20-2022

Hi, @michael_jmp, thanks for your add-in. There's a problem when I use it. For example, I have a dependent variable y and some independent variables. When I use "SMOTE Observations", I drag variable y to "binary class variable" and the independent variables to "X, predictors", and a new data set is generated. But in this new data set, the values of the independent variables are empty. How can I add the newly generated observations to the original observations? Thanks!!!

cozmck · ‎07-21-2022

Hi @OneSidedLemur23 - I also worked on this add-in. If you are able to share your data (or a data set that is similar to your data), that would be helpful in trouble-shooting the issue you are having. If you are able to, please email me at Colleen.McKendry@jmp.com and hopefully we can get some answers for you!

Lu · ‎09-01-2022

Hi, @michael_jmp, XGBoost is an ML model that is used more and more frequently because of its high performance. Do you think you would be able to implement this model type in the add-in in the near future? Regards, Lu

michael_jmp · ‎09-02-2022

Hi @Lu - thanks for the suggestion. However, we don't have any plans currently to add models to the add-in.

Sandra1 · ‎06-08-2023

Hi,

When creating a weight column using a total of 21,188 rows of data, an Alert appears, as shown in the attached screenshot.

I later found that it may be related to a problem when generating SMOTE plus Tomek. However, after unchecking SMOTE plus Tomek and trying the other model constructions, the Alert still appears. The training set data was generated successfully, but it contains many missing values.


Also, the Precision-Recall Curve window is not generated. How can we determine the source of this problem and what to do about it?

Thanks,

Sarah

michael_jmp · ‎06-20-2023

Hi @Sandra1,

Thanks for trying out the add-in. If you reduce the size of your data table, are you able to get the add-in to work without running into the errors? My guess is that one of the operations is running out of memory on your machine for the number of observations that you have.

Sandra1 · ‎06-21-2023

Dear @michael_jmp ,

Thank you for your kind response. Yes, it works if I reduce the size to about a tenth of the original data table. Does the operation running out of memory have something to do with my computer hardware?

michael_jmp · ‎06-21-2023

Hi @Sandra1 ,

Yes, all operations in JMP are performed in-memory, so you are limited by the amount of memory available on your machine.

WeihuaLi · ‎06-23-2023

Hi @michael_jmp ,

One of my customers got the same error message as Sandra1 ("Concatenate for strings or matrices only at row 21189 in access or evaluation of 'V Concat' …") when analyzing a large data set. After looking into the source code, I found that the error arises from the part of the code below:

The variable "nIncr" equals 0, so J(nIncr-1,1, .) is a scalar rather than a matrix, which makes the V Concat operation invalid.

To work around this issue, I hard-coded nIncr as "nIncr = nRows(PrecisionTemp) - nRows(SensitivityTemp) + 1;" and can now run the add-in on large data sets successfully.

My question is whether this code modification is valid, since I didn't have enough time to read through the code. Could you please confirm?

michael_jmp · ‎10-03-2023

Thanks for the feedback, and sorry for the slow reply. I don't think that the suggested change to nIncr is the right way to fix this bug. I am attaching an updated version of the add-in (ver. 2.1) that has a fix included. (The fix changes the condition in the If() statement. Basically, if the Recall values are close enough together, we don't need to do any interpolation and therefore don't need to go into that If() statement at all.)
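The kind of guard described above can be sketched in Python. This is only an illustration, not the add-in's JSL code: it assumes one common scheme in which intermediate Precision-Recall points are interpolated linearly in the true-positive and false-positive counts, and it returns no points at all when two neighboring points are too close to need interpolation. The function name `interpolate_pr`, its parameters, and the counts in the example are hypothetical.

```python
def interpolate_pr(tp_a, fp_a, tp_b, fp_b, total_pos):
    """Return (recall, precision) points strictly between points A and B.

    tp_*/fp_* are true-positive and false-positive counts at each point;
    total_pos is the number of positive cases in the test set.
    """
    tp_gap = tp_b - tp_a
    # Guard: adjacent or identical recall values leave nothing to interpolate,
    # so return an empty list instead of building a zero- or negative-size pad.
    if tp_gap <= 1:
        return []
    slope = (fp_b - fp_a) / tp_gap  # false positives added per true positive
    points = []
    for step in range(1, tp_gap):
        tp = tp_a + step
        fp = fp_a + slope * step
        points.append((tp / total_pos, tp / (tp + fp)))
    return points

between = interpolate_pr(tp_a=5, fp_a=5, tp_b=10, fp_b=30, total_pos=20)
nothing = interpolate_pr(tp_a=5, fp_a=5, tp_b=5, fp_b=8, total_pos=20)
```

Handling the "nothing to interpolate" case up front means no empty padding ever has to be concatenated to the result.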

SophieCuvillier · ‎12-12-2023

Hello @michael_jmp

Thank you very much for this add-in. For the moment, though, it seems to handle only a binary categorical/ordinal unbalanced response? Will it handle multi-class responses later on?

Best regards,

michael_jmp · ‎12-12-2023

Hello @SophieCuvillier

Thanks for your interest in the add-in. We do not have any plans at this time to add support for multi-class responses.

Regards,
Michael

NishaKumar · ‎02-23-2024

Dear @michael_jmp ,

I am wondering how I can get confidence intervals (lower and upper limits) and the alpha for the performance measures when I use the add-in's Evaluate Models option (I picked all the models [except for generalized regression] and all the sampling techniques). I am able to get the PR Curve AUC and the ROC Curve AUC, but I cannot find how to get the confidence interval (or which confidence level it uses) or its upper and lower limits. I noticed there is an alpha that I can select, but how do I get output for the upper and lower limits? Some guidance would be helpful.

Thank you,

Nisha

michael_jmp · ‎02-23-2024

Hi @NishaKumar,

Thanks for the feedback and question. We don't support any confidence intervals for the AUC graphs at this time. The option for changing alpha in that menu comes from the Graph Builder menu options (we're using Graph Builder to generate the graphs in our report). Since there are no confidence intervals shown in the graphs, the alpha option has no effect on the graphs.

Thanks,
Michael

NishaKumar · ‎02-23-2024

@michael_jmp Thank you for getting back to me so quickly. Your comment was very helpful. Maybe in the future it's something to build into the analysis.

rohanshukla77 · ‎07-11-2024

Hi @michael_jmp,

I just had some quick questions regarding the output of each of these models.

1.) Is the value for the ROC curve AUC constructed for a specific set (training, validation, or test), or is it a composite result from each of them?

2.) I am currently working on a project that involves highly imbalanced data (~97/3 split), and each of the models within this add-in, as well as XGBoost, is giving me unrealistically perfect results (0.999-1 range). Do you have any tips for other models within JMP that may be more effective at dealing with this sort of data, or for how to refine the models within the current add-in?

Thank you so much!

-Rohan

NishaKumar2023 · ‎07-12-2024

Hi Everyone,

When evaluating imbalanced classification, what is the percentage for oversampling? What is the percentage for undersampling? What is the percentage used for the weighted sampling method? It would be helpful if someone could point me to literature or a guide that states this information.

@michael_jmp

Thank you,

Nisha

NishaKumar · ‎07-14-2024

A second question:

@michael_jmp

When calculating Tomek in imbalanced classification, what formula is used or how is it calculated?

I am running XGBoost with Tomek, but in the XGBoost add-in there are two different Tomek sampling methods:

Tomek Remove Majority NN Only
Tomek Remove NN Pair

Does the Imbalanced Classification add-in use either of the Tomek sampling methods mentioned above? If not, how is it calculated?

Thank you,

Nisha
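For readers unfamiliar with the two removal styles named in this question, the Python sketch below shows the general idea of a Tomek link: two rows from opposite classes that are each other's nearest neighbor. It then shows the two ways of acting on a link, dropping only the majority-class row or dropping both rows. This is an illustration only; it uses squared Euclidean distance and hypothetical data, and it makes no claim about which variant the Imbalanced Classification add-in applies.

```python
def nearest(i, points):
    """Index of the row closest to row i (squared Euclidean distance)."""
    return min(
        (j for j in range(len(points)) if j != i),
        key=lambda j: sum((u - v) ** 2 for u, v in zip(points[i], points[j])),
    )

def tomek_links(points, labels):
    """Pairs (i, j) of mutual nearest neighbors that belong to different classes."""
    links = []
    for i in range(len(points)):
        j = nearest(i, points)
        if i < j and nearest(j, points) == i and labels[i] != labels[j]:
            links.append((i, j))
    return links

def rows_to_drop(links, labels, majority, both=False):
    """Rows to remove: the majority member of each link, or both members."""
    drop = set()
    for pair in links:
        for row in pair:
            if both or labels[row] == majority:
                drop.add(row)
    return sorted(drop)

points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0), (5.0, 5.0), (5.2, 5.0)]
labels = ["maj", "min", "maj", "maj", "maj", "maj"]

links = tomek_links(points, labels)                        # rows 0 and 1 form a link
majority_only = rows_to_drop(links, labels, "maj")         # drops row 0
whole_pair = rows_to_drop(links, labels, "maj", both=True) # drops rows 0 and 1
```

Removing only the majority row keeps every minority observation, while removing the pair also discards the minority row that sits right on the class boundary.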

michael_jmp · ‎07-30-2024

Hi @NishaKumar ,

There is a bug in the current version (2.1) of the add-in such that the Help button in the launch window for the add-in doesn't work. However, you should be able to use this line of JSL to open the help document.

open("$ADDIN_HOME(com.jmp.imbalancedclass)/Support Files/Imbalanced Classification Add-In Documentation.pdf")

This document has information and references for the various sampling methods.

We hope to have an updated version of the add-in ready soon that fixes the broken Help button.

Thanks,
Michael

michael_jmp · ‎07-30-2024

Hi @rohanshukla77 ,

Thanks for using the add-in.

1) The AUC values for the curves are calculated for the Test set.

2) Are the models accurately predicting the response over 99% of the time? I'm not sure what you are defining as a more effective model.

Thanks,
Michael
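As general background on why a single accuracy-style number can look excellent on a roughly 97/3 split, the Python sketch below scores a trivial classifier that always predicts the majority class. It is not a diagnosis of the data discussed above; the function name `summarize` and the labels are made up for the example.

```python
def summarize(actual, predicted, positive):
    """Return (accuracy, precision, recall) for the given positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    accuracy = sum(a == p for a, p in pairs) / len(pairs)
    # Precision and recall are undefined when their denominators are zero.
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    return accuracy, precision, recall

actual = ["yes"] * 3 + ["no"] * 97  # 3% minority class
always_no = ["no"] * 100            # never predicts the minority class

accuracy, precision, recall = summarize(actual, always_no, "yes")
```

Here the accuracy is 0.97 even though recall for the minority class is 0, which is why measures such as the Precision-Recall curve are often more informative for imbalanced responses.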
