Re: unbalanced data and weight classification

Lu · Dec 28, 2019 12:10 PM

Hello,

I have a database with a highly unbalanced ordinal response. The class of greatest interest to me is a minority class (patients with a rare but severe infection, around 5% of the observations).

The classification of the other ordinal classes works well with a boosted forest tree classifier, but these classes are not of interest to me (less severe infection or no infection at all). I wonder whether I can use the "Weight" option to give more weight to the minority class. Moreover, I want to determine at which threshold of "Weight" the classifier performs best. With unbalanced data, the ROC AUC does not seem to be an appropriate performance measure. An F1 score (the harmonic mean of precision and recall, so it reflects both false positives and false negatives) seems more adequate for this.

Is anyone aware of a way to calculate F1 scores automatically in JMP?

Is there a way to fine-tune the "Weight" option in JMP classification models for highly unbalanced data in order to maximize the F1 score?

Does anybody have other suggestions for solving this problem?

Thanks,

Lu

mzwald · Dec 28, 2019 05:51 PM

If you turn on the Confusion Matrix option for a classification model, you can get the false positive/negative rates.
You can also set different cut-offs for positive/negative classification (default is 0.5).
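
For readers who want to turn the confusion matrix counts into an F1 score by hand, here is a minimal sketch in Python (not JSL, and not a JMP feature). The function name and the counts are made up for illustration; the counts refer to the class treated as "positive".

```python
def f1_from_counts(tp, fp, fn):
    """Return (precision, recall, F1) from confusion-matrix counts."""
    # Precision: share of predicted positives that are truly positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of true positives that the model found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for the rare class (not taken from any real data set).
precision, recall, f1 = f1_from_counts(tp=30, fp=20, fn=20)
print(precision, recall, f1)
```

Note that true negatives do not enter the formula, which is why F1 is not inflated by a large majority class.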

dale_lehman · Dec 30, 2019 10:06 AM

I have also been interested in F1 scores but don't believe they are calculated automatically in JMP. Unless someone has a better idea, I'd refrain from using weights to address your issue - I think that will make it hard to see what is going on. My own inclination would be to recode your response variable into 2 categories - the disease you are interested in and everything else. Then you can manually fine tune the cutoff probabilities (the add-in for this is quite good) to see what cutoff probability does the best job of identifying the disease you are interested in. Then, you can go back to the full categorization of your response variable and see if that cutoff probability seems robust. For me, trial and error may be more illuminating than looking for an automatic solution to your problem (but if there is an automatic methodology, I'd like to know it).
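
The trial-and-error search over cutoff probabilities described above could be sketched roughly as follows, assuming the response has been recoded to 1 (disease of interest) and 0 (everything else) and the predicted probabilities have been exported from the model. This is plain Python rather than JSL; the function names and the data are hypothetical.

```python
def f1_at_cutoff(y_true, prob, cutoff):
    """F1 for the positive class when prob >= cutoff is classified as positive."""
    tp = sum(1 for y, p in zip(y_true, prob) if y == 1 and p >= cutoff)
    fp = sum(1 for y, p in zip(y_true, prob) if y == 0 and p >= cutoff)
    fn = sum(1 for y, p in zip(y_true, prob) if y == 1 and p < cutoff)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_cutoff(y_true, prob, cutoffs):
    """Return the candidate cutoff with the highest F1 (first one wins on ties)."""
    return max(cutoffs, key=lambda c: f1_at_cutoff(y_true, prob, c))


# Made-up example: 1 = disease of interest, 0 = everything else.
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
prob = [0.05, 0.10, 0.12, 0.20, 0.22, 0.30, 0.35, 0.40, 0.45, 0.70]

# Candidate cutoffs 0.05, 0.10, ..., 0.95.
cutoffs = [i / 20 for i in range(1, 20)]

best = best_cutoff(y_true, prob, cutoffs)
print(best, f1_at_cutoff(y_true, prob, best))
```

A cutoff chosen this way is tuned to the data it was searched on, so it is worth checking on validation or holdout rows before trusting it.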

dale_lehman · Dec 30, 2019 10:52 AM

I think there are two somewhat different issues here. One is the unbalanced nature of the data - some of your categories are much more common than others, so simple misclassification rates may "ignore" categories with small numbers - I believe the F1 is primarily a way to address this. The other problem is asymmetry of errors - false positives and false negatives may have very different consequences. I don't think the F1 will help with that - only better models and changing the probability cutoffs can deal with that issue. I'm not aware of a systematic relationship between the two issues (there may be one, and that would be very interesting to learn about).
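
To make the first issue concrete, here is a small worked example in Python with made-up numbers: 1,000 observations, 5% of them in the rare class (the share mentioned in the original question), and a "model" that always predicts the majority class.

```python
# Hypothetical data set: 1,000 rows, 5% in the rare (positive) class.
n_total = 1000
n_pos = 50

# A classifier that always predicts "negative" finds no positives at all.
tp, fp = 0, 0
fn = n_pos
tn = n_total - n_pos

# Accuracy (1 - misclassification rate) looks excellent...
accuracy = (tp + tn) / n_total

# ...while F1 for the rare class is zero, because nothing was detected.
denom = 2 * tp + fp + fn
f1 = 2 * tp / denom if denom else 0.0

print(accuracy, f1)
```

The misclassification rate is only 5%, yet the class of interest is missed completely, which is the sense in which simple misclassification rates can "ignore" small categories.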

Lu · Feb 16, 2020 07:01 AM

Hello, I found a nice add-in that can help define the optimal threshold for classification (attached below). Alternatively, search for "confusion matrix" in the File Exchange - Add-ins section of the JMP Community website.

Lu · Feb 16, 2020 07:10 AM

Recent publications clearly describe the advantages of the Matthias Correlation Coefficient (MCC) as an optimal performance measure for classification models, even with unbalanced data.

https://www.ncbi.nlm.nih.gov/pubmed/31898477

https://www.ncbi.nlm.nih.gov/pubmed/28574989

Lu · Feb 16, 2020 07:13 AM

Sorry, I meant the Matthews correlation coefficient.
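
For a binary (two-class) confusion matrix, the Matthews correlation coefficient can be computed from the four cell counts. Below is a minimal Python sketch of the standard formula; the function name and counts are hypothetical, and returning 0 when the denominator is zero is a common convention, assumed here.

```python
import math


def mcc_from_counts(tp, fp, fn, tn):
    """Matthews correlation coefficient for a 2x2 confusion matrix."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: if any row or column of the matrix is empty,
    # the denominator is zero and the MCC is reported as 0.
    return numerator / denominator if denominator else 0.0


# Hypothetical counts: 50 rare-class cases among 1,000 observations.
mcc = mcc_from_counts(tp=30, fp=20, fn=20, tn=930)
print(mcc)

# A classifier that always predicts the majority class scores 0.
print(mcc_from_counts(tp=0, fp=0, fn=50, tn=950))
```

Unlike F1, the MCC uses all four cells, including the true negatives, and ranges from -1 to 1, with 0 corresponding to no better than chance.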