Imbalanced Classification Add-In

The Imbalanced Classification add-in features sampling techniques that attempt to impose a more balanced distribution between the two classes. The sampling techniques include the synthetic minority oversampling technique (SMOTE), Tomek links, and a combination of the two, as well as some basic sampling approaches. The Tomek Sampling, SMOTE Observations, and SMOTE plus Tomek options enable you to apply these sampling techniques on their own to support your specific modeling efforts.
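As a rough illustration of the oversampling idea behind SMOTE, the sketch below creates synthetic minority-class rows by interpolating between a minority row and one of its nearest minority neighbors. It is not the add-in's code: the function names, the toy data, and the use of Euclidean distance on continuous predictors only are assumptions made for brevity.

```python
import random


def euclidean(a, b):
    """Plain Euclidean distance (an assumption; chosen only to keep the sketch short)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic rows by interpolating toward near minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority rows to the chosen base row, excluding the base itself.
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: euclidean(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Hypothetical minority-class rows with two continuous predictors.
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote(minority, n_new=3)
```

Each synthetic row lies on the line segment between two existing minority rows, so it never falls outside the range of the observed minority data.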


The comprehensive Evaluate Models option, which requires JMP Pro, enables you to fit models using various sampling methods and compare them on a test set to select thresholds using Precision-Recall, ROC, and Cumulative Gains curves, as well as other measures of classification accuracy. The other three options do not fit models, but rather enable you to apply the Tomek, SMOTE, and SMOTE plus Tomek sampling schemes to your own data.
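To make the idea of choosing a threshold from precision and recall concrete, here is a minimal sketch in Python. It is not how the add-in works internally: the function names and the toy test set are hypothetical, and picking the threshold with the highest F1 score is just one possible selection rule, assumed here for illustration.

```python
def precision_recall(y_true, scores, threshold):
    """Precision and recall when rows with score >= threshold are called positive."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for y, p in zip(y_true, predicted) if y == 1 and p)
    fp = sum(1 for y, p in zip(y_true, predicted) if y == 0 and p)
    fn = sum(1 for y, p in zip(y_true, predicted) if y == 1 and not p)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def best_f1_threshold(y_true, scores):
    """Scan every observed score as a candidate threshold; keep the best F1."""
    best = None
    for t in sorted(set(scores)):
        p, r = precision_recall(y_true, scores, t)
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        if best is None or f1 > best[1]:
            best = (t, f1)
    return best


# Hypothetical test set: 3 positives among 10 rows, with model scores.
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.1, 0.15, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.9]
threshold, f1 = best_f1_threshold(y_true, scores)
```

With imbalanced data, the default cutoff of 0.5 is often a poor choice, which is why comparing models across a range of thresholds on a held-out test set is useful.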


The SMOTE, Tomek, and combined SMOTE and Tomek sampling techniques use the concept of nearest neighbors. The add-in uses Gower distance as its distance metric, which allows for continuous, nominal, and ordinal predictors. These options do not require JMP Pro.
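The following sketch shows, under simplifying assumptions, how a Gower-style distance can combine continuous and nominal predictors, and how Tomek links (pairs of rows from opposite classes that are each other's nearest neighbor) can be found with it. The function names and data are hypothetical, ordinal predictors are assumed to be coded as numeric ranks, and none of this is taken from the add-in's source.

```python
def gower_distance(a, b, kinds, spans):
    """Average per-predictor dissimilarity, each scaled to the range 0 to 1."""
    total = 0.0
    for x, y, kind, span in zip(a, b, kinds, spans):
        if kind == "nominal":
            total += 0.0 if x == y else 1.0
        else:
            # Continuous (or ordinal coded as ranks): absolute difference over the range.
            total += abs(x - y) / span if span else 0.0
    return total / len(a)


def tomek_links(rows, labels, kinds, spans):
    """Return index pairs (i, j) that are mutual nearest neighbors with different labels."""
    def nearest(i):
        return min(
            (j for j in range(len(rows)) if j != i),
            key=lambda j: gower_distance(rows[i], rows[j], kinds, spans),
        )

    nn = [nearest(i) for i in range(len(rows))]
    return [
        (i, j)
        for i, j in enumerate(nn)
        if i < j and nn[j] == i and labels[i] != labels[j]
    ]


# Hypothetical data: one continuous and one nominal predictor.
rows = [[1.0, "a"], [1.2, "a"], [5.0, "b"], [5.5, "b"], [9.0, "a"]]
labels = [0, 1, 0, 0, 1]
kinds = ["continuous", "nominal"]
spans = [8.0, None]  # range of the continuous predictor (9.0 - 1.0); unused for nominal

links = tomek_links(rows, labels, kinds, spans)
```

Here rows 0 and 1 form the only Tomek link: they are mutual nearest neighbors and belong to different classes. Tomek-based undersampling typically removes the majority-class member of each such pair.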

[Image: add-in screenshot.png]
Note: All options require JMP version 15.2 or higher. Excluded rows and rows with missing response values are ignored by the add-in.

 

Version 2, released 3/25/2021, supports JMP 16 and improves the handling of rows with missing values for all predictors.

 

Version 2.1, released 10/3/2023, fixes a bug in the interpolation of P-R values that occurred for specific data tables.

Comments
goutam

Glad to see this add-in!

Raaed

Hi

Could you please share the sources for the sampling techniques (books or articles)?

@Raaed This paper from He and Garcia is a good overview of the sampling techniques:

He, H. and Garcia, E. A. (2009). “Learning from Imbalanced Data.” IEEE Transactions on Knowledge and Data Engineering. 21, 9:1263-1284. https://www.academia.edu/29164932/Learning_from_Imbalanced_Data

Lu

Hi,

 

When running the add-in, I get the following JMP Alert: "Invalid subscript (must be number of list of numbers) in access or evaluation of 'classvec[/*###*/near_row]', classvec[/*###*/near_row]"

 

A pop-up window "Creating weighting columns" shows up and the analysis gets stuck.

I already reinstalled the add-in, but it did not help. Has anybody else had this problem? Any suggestions?

Thx in advance

 

Lu

 

Hi @Lu, sorry to hear you are seeing an error message with the add-in. Would you be able to send me your data in a private message so I can investigate what might be causing the problem?

 

Thanks,
Michael

bfrank

Michael, 

Thank you for posting this. Can you clarify what limitations exist for this add-in? It appears that it can't handle certain volumes of data.

Hi @bfrank, it's hard to state a specific limit on the volume of data that the add-in can handle, because it is mostly limited by the amount of memory on the current machine (similar to the rest of JMP). We do know that especially as the number of categorical predictors increases, the runtime increases. Sorry we don't have more definitive specifications.

 

Thanks,
Michael

Hi, thanks for the numerous clarifications in the video. But I have been having the same issue as @Lu , it actually happens no matter what data I use, I also tried using the sample used in the video but nothing changed. Has the source of this problem been identified so far?

Thanks btw
- Zuska

Hi @Zuska-Ariri, thanks for trying the add-in. Sorry that you're seeing some errors. We did identify the problem that @Lu reported to us, and it should be fixed in the "Version 2" of the add-in, linked on this page. Can you confirm that you are using version 2 of the add-in? Also, what version of JMP are you using?

If you're still seeing errors with version 2 of the add-in, we can investigate further.

Thanks,

Michael

I'm afraid I am already using version 2 of the add-in. I also tried installing it again, but nothing changed at all. I am using JMP Pro 16, and I'm not sure whether there are issues with this version as well. Thanks!!

Hi @michael_jmp, thanks for your add-in. I have a problem when I use it. For example, I have a dependent variable y and some independent variables. When I use "SMOTE Observations", I drag variable y to "binary class variable" and the independent variables to "X, predictors", and a new data set is generated. But in this new data set, the values of the independent variables are empty. How can I add the newly generated observations to the original observations? Thanks!!!

cozmck

Hi @OneSidedLemur23 - I also worked on this add-in. If you are able to share your data (or a data set that is similar to your data), that would be helpful in trouble-shooting the issue you are having. If you are able to, please email me at Colleen.McKendry@jmp.com and hopefully we can get some answers for you!

Lu
Hi @michael_jmp, XGBoost is an ML model that is used more and more frequently because of its high performance. Do you think you would be able to add this model type to the add-in in the near future? Regards, Lu

Hi @Lu - thanks for the suggestion. However, we don't have any plans currently to add models to the add-in.

Sandra1

Hi,

 

When creating a weight column using a total of 21188 rows of data, an alert appears, as shown in the screenshot.

[Image: Screenshot 2023-06-05 234119.png]

 

I later suspected that the problem might be related to generating SMOTE plus Tomek. However, after unchecking SMOTE plus Tomek and trying the other model constructions, the alert shown in the screenshot still appears. The training set data was generated successfully, but it contains many missing values.

[Image: Screenshot 2023-06-06 063557.png]

[Image: Screenshot 2023-06-06 063818.png]

 

Also, the Precision-Recall Curve window is not generated. How can we determine the source of this problem and what to do about it?

 

Thanks,

Sarah

 

 

Hi @Sandra1,

Thanks for trying out the add-in. If you reduce the size of your data table, are you able to get the add-in to work without running into the errors? My guess is that one of the operations is running out of memory on your machine for the number of observations that you have.

Sandra1

Dear @michael_jmp ,

 

Thank you for your kind response. Yes, it works if I reduce the size to about a tenth of the original data table. Does the operation running out of memory have something to do with my computer hardware?

Hi @Sandra1 ,

Yes, all operations in JMP are performed in-memory, so you are limited by the amount of memory available on your machine.

WeihuaLi

Hi @michael_jmp ,

 

One of my customers got the same error message as Sandra1 ("Concatenate for strings or matrices only at row 21189 in access or evaluation of 'V Concat' …") when analyzing a large data set. After looking into the source code, I found that this error arises from the following part of the code:

[Image: source code screenshot, lixx2535_0-1687508789622.png]

The variable "nIncr" equals 0, so J(nIncr-1,1, .) is a scalar rather than a matrix, which makes the V Concat operation invalid.

 

To work around this issue, I hard-coded nIncr as "nIncr = nRows(PrecisionTemp) - nRows(SensitivityTemp) + 1;" and can now run the add-in on large data sets successfully.

 

My question now is whether this code modification is valid, since I have not had enough time to read through the code. Could you please confirm?

Thanks for the feedback, and sorry for the slow reply. I don't think that the suggested change to nIncr is the right way to fix this bug. I am attaching an updated version of the add-in (ver. 2.1) that has a fix included. (The fix changes the condition in the If() statement. Basically, if the Recall values are close enough together, we don't need to do any interpolation and therefore don't need to go into that If() statement at all.)
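For readers following this bug, the general pattern of the fix (skip interpolation when two recall values are already close together) can be sketched in Python as below. This is only an illustration and not the add-in's JSL: the function name, the `step` parameter, and the use of linear interpolation between precision-recall points are all assumptions, and linear interpolation is itself a simplification for P-R curves.

```python
def interpolate_pr(recall_a, prec_a, recall_b, prec_b, step=0.125):
    """Return intermediate (recall, precision) points strictly between two curve points.

    Hypothetical helper: `step` is the largest recall gap left without an
    intermediate point.
    """
    gap = recall_b - recall_a
    n_incr = int(gap / step)
    if n_incr < 2:
        # The recall values are close enough together: nothing to insert,
        # so no zero-length or negative-length block is ever built.
        return []
    return [
        (recall_a + gap * i / n_incr, prec_a + (prec_b - prec_a) * i / n_incr)
        for i in range(1, n_incr)
    ]


# A wide gap gets filled in; a narrow gap is left alone.
wide = interpolate_pr(0.5, 0.8, 1.0, 0.4)
narrow = interpolate_pr(0.5, 0.8, 0.5625, 0.78)
```

The guard runs before any intermediate rows are allocated, so the degenerate case never reaches the concatenation step.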

Hello @michael_jmp 

 

Thank you very much for this add-in. At the moment, does it handle only binary categorical/ordinal imbalanced responses? Will it support multi-class responses later on?

 

Best regards,

Hello @SophieCuvillier 

Thanks for your interest in the add-in. We do not have any plans at this time to add support for multi-class responses.

 

Regards,
Michael

NishaKumar

Dear @michael_jmp ,

 

Could you let me know how to get confidence intervals (lower and upper limits) and the alpha for the performance measures when I use the add-in's Evaluate Models option? (I picked all the models, except for generalized regression, and all the sampling techniques.) I am able to get the PR Curve AUC and ROC Curve AUC, but I cannot find how to get the confidence interval (or identify which confidence level is used) or its upper and lower limits. I noticed there is an alpha that I can select, but how do I get output for the upper and lower limits? Some guidance would be helpful.

 

Thank you,

Nisha

Hi @NishaKumar,

 

Thanks for the feedback and question. We don't support any confidence intervals for the AUC graphs at this time. The option for changing alpha in that menu comes from the Graph Builder menu options (we're using Graph Builder to generate the graphs in our report). Since there are no confidence intervals shown in the graphs, the alpha option has no effect on the graphs.

 

Thanks,
Michael

NishaKumar

@michael_jmp Thank you for getting back to me so quickly. Your comment was very helpful. Maybe in the future, it's something to build into the analysis.