Level: Advanced
Michael Crotty, @michael_jmp, JMP Senior Statistical Writer, SAS
Colleen McKendry, @cozmck, JMP Technical Writer, SAS
Marie Gaudard, @marie_gaudard, Statistical Consultant
When the goal of a predictive model is anomaly detection, the underlying data often exhibit an imbalanced class distribution: the anomalous class is much smaller than the non-anomalous class. The modeling goal is usually to identify members of the minority class, but a straightforward application of predictive modeling techniques can yield a biased and inaccurate model. Many techniques have been proposed to address these issues. We seek to guide JMP Pro users in developing predictive models for imbalanced data, focusing on JMP Pro approaches to classification into an underrepresented class. We first describe general aspects of the imbalanced class problem: bias, performance measures, and approaches to addressing the modeling issues. We then discuss the sampling methods used in our study: weighting, under-sampling, over-sampling, and the synthetic minority oversampling technique (SMOTE). For several real data sets that exhibit varying class proportions, we compare the fits obtained using these sampling methods in combination with predictive models available in JMP Pro classification platforms. We perform a similar exploration of sampling techniques and predictive models for a limited range of simulated data sets. For the simulated data sets, we attempt to identify the degree of underrepresentation at which standard models begin to be affected by class imbalance. We also present conclusions about the relative performance of the sampling methods and predictive models.
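For readers unfamiliar with the sampling methods named above, the following is a minimal Python sketch of random under-sampling, random over-sampling, and SMOTE-style interpolation. It is an illustration only and not the authors' JMP Pro scripts: the function names (`random_undersample`, `random_oversample`, `smote`), the default `k=5`, and the use of Euclidean distance on purely continuous predictors are all assumptions made for this example.

```python
import math
import random

def random_undersample(majority, n_keep, seed=0):
    """Keep a random subset of n_keep majority-class rows (without replacement)."""
    return random.Random(seed).sample(majority, n_keep)

def random_oversample(minority, n_total, seed=0):
    """Duplicate randomly chosen minority rows until there are n_total rows."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(n_total - len(minority))]
    return list(minority) + extra

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority rows (illustrative sketch).

    Each synthetic row lies on the line segment between a randomly chosen
    minority row and one of its k nearest minority neighbors.
    Assumes all predictors are continuous and uses Euclidean distance.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority rows to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base row, excluding the row itself
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbor = rng.choice(neighbors)
        t = rng.random()  # interpolation fraction in [0, 1)
        synthetic.append(tuple(b + t * (n - b) for b, n in zip(base, neighbor)))
    return synthetic

# Hypothetical toy data: four minority rows at the corners of the unit square.
minority_rows = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_rows = smote(minority_rows, n_new=10, k=2)
```

Because each synthetic row is a convex combination of two existing minority rows, the new rows stay inside the region already spanned by the minority class, whereas random over-sampling only repeats existing rows.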
An add-in that performs the analyses discussed in this presentation is now available: Imbalanced Classification Add-In. The add-in expands the capabilities of the scripts found in Scripts_and_Results.zip. Among other features, the add-in does not depend on R; it performs SMOTE, Tomek, and SMOTE plus Tomek sampling on data sets that include nominal and ordinal predictors; and it provides extensive documentation through the dialog's Help button.
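As background on the Tomek sampling mentioned above: a Tomek link is a pair of rows from different classes that are each other's nearest neighbor, and Tomek-based cleaning typically removes the majority-class member of each such pair. The sketch below illustrates only the link-finding step in Python. It is not the add-in's implementation; the function name `tomek_links` is hypothetical, and the sketch assumes continuous predictors with Euclidean distance, whereas the add-in also handles nominal and ordinal predictors.

```python
import math

def tomek_links(points, labels):
    """Return index pairs (i, j), i < j, that form Tomek links.

    A pair is a Tomek link when the two rows are mutual nearest
    neighbors and carry different class labels.
    Assumes at least two rows and continuous predictors.
    """
    def nearest(i):
        # Index of the row closest to row i, excluding row i itself
        return min(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )

    nn = [nearest(i) for i in range(len(points))]
    return [
        (i, j)
        for i, j in enumerate(nn)
        if i < j and nn[j] == i and labels[i] != labels[j]
    ]

# Hypothetical one-dimensional toy data: rows 1 and 2 are close together
# but belong to different classes.
toy_points = [(0.0,), (1.0,), (1.2,), (5.0,)]
toy_labels = [0, 0, 1, 1]
links = tomek_links(toy_points, toy_labels)
```

In a SMOTE plus Tomek workflow, a step like this would usually run after the synthetic rows have been generated, to clean up class overlap near the boundary.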