unbalanced data and weight classification
Hello,
I have a database with a highly unbalanced ordinal response. The class I am most interested in is a minority class (patients with a rare but severe infection, around 5% of the observations).
A boosted forest tree classifier handles the other ordinal classes well, but those classes are not of interest to me (less severe infection or no infection at all). I wonder whether I can use the "Weight" option to give more weight to the minority class. Moreover, I want to find the value of "Weight" at which the classifier performs best. With unbalanced data, the ROC AUC does not seem to be an appropriate performance measure. An F1 score (the harmonic mean of precision and recall, so it penalizes both false positives and false negatives) seems more adequate for this.
Is anyone aware of a way to calculate F1 scores automatically in JMP?
Is there a way to fine-tune the "Weight" option in JMP classification models on highly unbalanced data in order to maximize the F1 score?
Does anybody have other suggestions for solving this problem?
Thanks,
Lu
Re: unbalanced data and weight classification
You can also set different cut-offs for positive/negative classification (default is 0.5).
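As a rough illustration of the cutoff idea outside JMP, here is a minimal Python sketch that scans a few candidate cutoffs and reports the one with the highest F1 score for the positive class. The predicted probabilities and labels are made-up values, and the function names are hypothetical; in practice you would export the predicted probability column and the observed response from your own model.

```python
def confusion_counts(probs, labels, cutoff):
    """Return (tp, fp, fn, tn) when rows with prob >= cutoff are called positive."""
    tp = fp = fn = tn = 0
    for p, y in zip(probs, labels):
        predicted_positive = p >= cutoff
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive and y == 0:
            fp += 1
        elif not predicted_positive and y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when there are no positives at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_cutoff(probs, labels, cutoffs):
    """Return (best_f1, cutoff). Ties in F1 go to the higher cutoff."""
    scored = []
    for c in cutoffs:
        tp, fp, fn, _tn = confusion_counts(probs, labels, c)
        scored.append((f1_score(tp, fp, fn), c))
    return max(scored)


# Illustrative data only: predicted probability of the rare class and true label.
probs = [0.02, 0.05, 0.08, 0.10, 0.12, 0.20, 0.30, 0.35, 0.40, 0.70]
labels = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
cutoffs = [0.1, 0.2, 0.3, 0.4, 0.5]

best_f1, cutoff = best_cutoff(probs, labels, cutoffs)
print(f"best cutoff = {cutoff}, F1 = {best_f1:.3f}")
```

With this toy data the default cutoff of 0.5 misses two of the three positive cases, while a lower cutoff recovers them at the cost of one false positive. The cutoff should be chosen on validation data, not on the rows used to fit the model.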
Re: unbalanced data and weight classification
I have also been interested in F1 scores but don't believe they are calculated automatically in JMP. Unless someone has a better idea, I'd refrain from using weights to address your issue; I think that will make it hard to see what is going on. My own inclination would be to recode your response variable into 2 categories: the disease you are interested in and everything else. Then you can manually fine-tune the cutoff probabilities (the add-in for this is quite good) to see which cutoff does the best job of identifying the disease you are interested in. After that, you can go back to the full categorization of your response variable and see whether that cutoff seems robust. For me, trial and error may be more illuminating than looking for an automatic solution to your problem (but if there is an automatic methodology, I'd like to know about it).
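To make the recode-then-check idea concrete, here is a small Python sketch, not a JMP procedure. It collapses an ordinal response to "target class vs. everything else", then tabulates how many rows of each original class a given cutoff would flag. The class labels ("none", "mild", "severe"), the probabilities, and the function names are all hypothetical.

```python
from collections import Counter


def recode_to_binary(responses, target):
    """1 for the class of interest, 0 for every other class."""
    return [1 if r == target else 0 for r in responses]


def flagged_by_class(responses, probs, cutoff):
    """Map each original class to (rows flagged at this cutoff, total rows)."""
    flagged = Counter()
    totals = Counter()
    for r, p in zip(responses, probs):
        totals[r] += 1
        if p >= cutoff:
            flagged[r] += 1
    return {cls: (flagged[cls], totals[cls]) for cls in totals}


# Illustrative data only: original ordinal response and the predicted
# probability of the "severe" class from a binary model.
responses = ["none", "none", "mild", "mild", "severe", "none", "severe", "mild"]
probs = [0.05, 0.10, 0.20, 0.35, 0.60, 0.08, 0.30, 0.15]

binary = recode_to_binary(responses, target="severe")
summary = flagged_by_class(responses, probs, cutoff=0.3)
for cls, (n_flagged, n_total) in summary.items():
    print(f"{cls}: {n_flagged} of {n_total} flagged")
```

Looking at the flagged counts per original class shows where the false positives come from. If they are mostly the neighbouring, less severe class, that may be more acceptable than flagging patients with no infection at all.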
Re: unbalanced data and weight classification
Hello, I found a nice add-in that can help define the optimal threshold for classification (attached below). Alternatively, search for "confusion matrix" in the File Exchange - Add-ins section of the JMP community website.
Re: unbalanced data and weight classification
Recent publications clearly describe the advantages of the Matthias Correlation Coefficient (MCC) as an optimal performance measure for classification models, even with unbalanced data.
https://www.ncbi.nlm.nih.gov/pubmed/31898477
https://www.ncbi.nlm.nih.gov/pubmed/28574989
Re: unbalanced data and weight classification
Sorry, I meant the Matthews correlation coefficient.
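For reference, the Matthews correlation coefficient can be computed directly from the four cells of a binary confusion matrix. The Python sketch below uses the standard formula; the function name and the counts are illustrative, and returning 0.0 when the denominator is zero is a common convention that is assumed here, not something taken from the papers linked above.

```python
import math


def matthews_cc(tp, fp, fn, tn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).

    Returns 0.0 when any marginal total is zero (assumed convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


def accuracy(tp, fp, fn, tn):
    """Fraction of rows classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


# A classifier that calls every row negative on data with 5% positives:
# high accuracy, but MCC shows it carries no information.
print(accuracy(0, 0, 5, 95), matthews_cc(0, 0, 5, 95))

# A classifier that finds all 3 positives with 1 false positive.
print(round(matthews_cc(3, 1, 0, 6), 3))
```

The first example is the situation described in the original question: with a 5% minority class, always predicting "no severe infection" gives 95% accuracy, yet the MCC is 0.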