Lu
Level I

unbalanced data and weight classification

Hello,

 

I have a database with a highly unbalanced ordinal response. The class of greatest interest to me is a minority class (patients with a rare but severe infection, around 5% of the observations).

The other ordinal classes are classified well by a boosted forest tree classifier, but those classes are not of interest to me (less severe infection or no infection at all). I wonder whether I can use the "Weight" option to give more weight to the minority class. Moreover, I want to determine the weight at which the classifier performs best. With unbalanced data, the ROC AUC does not seem to be an appropriate performance measure. An F1 score (the harmonic mean of precision and recall, so it penalizes both false positives and false negatives) seems more suitable.

Is anyone aware of a way to calculate F1 scores automatically in JMP?

Is there a way to fine-tune the "Weight" option in JMP classification models on highly unbalanced data in order to maximize the F1 score?

Does anybody have other suggestions for solving this problem?

 

Thanks,

 

Lu

mzwald
Staff

Re: unbalanced data and weight classification

If you turn on the Confusion Matrix option for a classification model, you can get the false positive/negative rates.
You can also set different cut-offs for positive/negative classification (default is 0.5).
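
Since JMP's confusion matrix gives the counts directly, the F1 score can be worked out by hand from them. The following is only a minimal Python sketch of that arithmetic, not a JMP feature; the function name `f1_from_counts` and the counts are hypothetical, chosen for illustration.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when the score is undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical confusion-matrix counts for the class of interest
# (these numbers are made up, not taken from the original data).
tp, fp, fn = 30, 20, 10

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found
f1 = f1_from_counts(tp, fp, fn)

print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

Note that true negatives do not enter the F1 score at all, which is why it is less dominated by the large majority classes than the overall misclassification rate.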
dale_lehman
Level VI

Re: unbalanced data and weight classification

I have also been interested in F1 scores but don't believe they are calculated automatically in JMP.  Unless someone has a better idea, I'd refrain from using weights to address your issue - I think that will make it hard to see what is going on.  My own inclination would be to recode your response variable into 2 categories - the disease you are interested in and everything else.  Then you can manually fine tune the cutoff probabilities (the add-in for this is quite good) to see what cutoff probability does the best job of identifying the disease you are interested in.  Then, you can go back to the full categorization of your response variable and see if that cutoff probability seems robust.  For me, trial and error may be more illuminating than looking for an automatic solution to your problem (but if there is an automatic methodology, I'd like to know it).
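
The trial-and-error search over cutoff probabilities described above could, outside JMP, be sketched roughly as follows. This assumes the response has already been recoded to two categories (1 for the disease of interest, 0 for everything else) and that the predicted probabilities have been exported; the function `best_f1_cutoff` and the data are hypothetical.

```python
def best_f1_cutoff(probs, labels, cutoffs=None):
    """Return (cutoff, f1) for the cutoff with the highest F1 score.

    probs  : predicted probabilities of the class of interest
    labels : 1 for the class of interest, 0 for everything else
    """
    if cutoffs is None:
        cutoffs = [i / 100 for i in range(1, 100)]  # 0.01, 0.02, ..., 0.99

    best_cutoff, best_f1 = None, -1.0
    for c in cutoffs:
        # An observation is predicted positive when its probability >= cutoff.
        tp = sum(1 for p, y in zip(probs, labels) if p >= c and y == 1)
        fp = sum(1 for p, y in zip(probs, labels) if p >= c and y == 0)
        fn = sum(1 for p, y in zip(probs, labels) if p < c and y == 1)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:  # on ties, the lowest cutoff is kept
            best_cutoff, best_f1 = c, f1
    return best_cutoff, best_f1


# Hypothetical predicted probabilities and recoded labels (made-up values).
probs = [0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.80, 0.15, 0.02, 0.40]
labels = [0, 0, 0, 1, 0, 1, 1, 0, 0, 0]

cutoff, f1 = best_f1_cutoff(probs, labels)
print(f"best cutoff={cutoff:.2f}, F1={f1:.3f}")
```

A cutoff chosen this way should be checked on validation or test data rather than on the training rows, since with only about 5% positives the best cutoff can vary considerably from sample to sample.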

dale_lehman
Level VI

Re: unbalanced data and weight classification

I think there are two somewhat different issues here. One is the unbalanced nature of the data - some of your categories are much more common than others, so simple misclassification rates may "ignore" categories with small numbers - I believe the F1 is primarily a way to address this. The other problem is asymmetry of errors - false positives and false negatives may have very different consequences. I don't think the F1 will help with that - only better models and changing the probability cutoffs can deal with that issue. I'm not aware of a systematic relationship between the two issues (there may be one, and that would be very interesting to learn about).
Lu
Level I

Re: unbalanced data and weight classification

Hello, I found a nice add-in that can help define the optimal threshold for classification (attached below). Alternatively, search for "confusion matrix" in the File Exchange - Add-ins section of the JMP Community website.

 

Lu
Level I

Re: unbalanced data and weight classification

Recent publications clearly describe the advantages of the Matthias Correlation Coefficient (MCC) as an optimal performance measure for classification models, even with unbalanced data.

https://www.ncbi.nlm.nih.gov/pubmed/31898477

https://www.ncbi.nlm.nih.gov/pubmed/28574989

 

Lu
Level I

Re: unbalanced data and weight classification

Sorry, I meant the Matthews correlation coefficient (MCC).
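
For reference, the Matthews correlation coefficient for a two-class problem can be computed from the four confusion-matrix cells. The snippet below is a minimal Python sketch under that binary assumption, not something built into JMP; the function name `mcc_from_counts` and the counts are hypothetical.

```python
import math


def mcc_from_counts(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary confusion matrix.

    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).
    Returns 0.0 when the denominator is zero (a common convention).
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


# Hypothetical counts for a rare positive class (made-up values:
# 40 positives among 1000 observations).
tp, tn, fp, fn = 30, 940, 20, 10

mcc = mcc_from_counts(tp, tn, fp, fn)
print(f"MCC={mcc:.3f}")
```

Unlike F1, the MCC uses all four cells, including the true negatives, and ranges from -1 (total disagreement) through 0 (no better than chance) to +1 (perfect prediction).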
