Level: Intermediate
Job Function: Analyst / Scientist / Engineer
Michael Crotty, JMP Senior Statistical Writer, SAS
Marie Gaudard, Statistical Consultant
Colleen McKendry, JMP Technical Writer, SAS
Best Invited Paper Finalist
When the goal of a predictive model is anomaly detection, the underlying data often exhibit an imbalanced class distribution: the anomalous class is much smaller than the non-anomalous class. The modeling goal is usually to identify members of the minority class, but a straightforward application of predictive modeling techniques can produce a biased and inaccurate model. Many techniques have been proposed to address these issues. We seek to guide JMP Pro users in developing predictive models for imbalanced data, focusing on JMP Pro approaches to classification into an underrepresented class. We first describe general aspects of the imbalanced class problem: bias, performance measures, and approaches to addressing the modeling issues. We then discuss the sampling methods used in our study: weighting, under-sampling, over-sampling, and the synthetic minority oversampling technique (SMOTE). For several real data sets that exhibit varying class proportions, we compare the fits obtained using these sampling methods in combination with predictive models available in JMP Pro classification platforms. We perform a similar exploration of sampling techniques and predictive models for a limited range of simulated data sets. For the simulated data sets, we attempt to identify the degree of underrepresentation at which standard models begin to be affected by class imbalance. Finally, we present conclusions about the relative performance of the sampling methods and predictive models.
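For readers unfamiliar with the four sampling methods named in the abstract, the following is a minimal Python sketch of the general idea behind each one. It is not the JMP Pro implementation and does not reproduce the study. The function names (`class_weights`, `random_undersample`, `random_oversample`, `smote`), the toy 95/5 data set, and the choice of `k=3` neighbors are all assumptions made for illustration, and the sketch assumes a binary response with numeric predictors.

```python
import random
from collections import Counter

def class_weights(labels):
    """Inverse-frequency weights: each class gets the same total weight."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * m) for c, m in counts.items()}

def random_undersample(X, y, minority, rng):
    """Keep all minority rows plus an equally sized random subset of the rest."""
    minority_idx = [i for i, c in enumerate(y) if c == minority]
    majority_idx = [i for i, c in enumerate(y) if c != minority]
    keep = minority_idx + rng.sample(majority_idx, len(minority_idx))
    return [X[i] for i in keep], [y[i] for i in keep]

def random_oversample(X, y, minority, rng):
    """Duplicate random minority rows until the two classes are balanced."""
    minority_idx = [i for i, c in enumerate(y) if c == minority]
    n_extra = len(y) - 2 * len(minority_idx)  # majority count minus minority count
    extra = [rng.choice(minority_idx) for _ in range(n_extra)]
    return X + [X[i] for i in extra], y + [y[i] for i in extra]

def smote(X, y, minority, rng, k=3):
    """SMOTE-style balancing: synthesize minority rows by interpolating
    between a minority row and one of its k nearest minority neighbors."""
    pts = [X[i] for i, c in enumerate(y) if c == minority]
    n_extra = len(y) - 2 * len(pts)
    synth = []
    for _ in range(n_extra):
        p = rng.choice(pts)
        others = [q for q in pts if q is not p]
        neighbors = sorted(
            others,
            key=lambda q: sum((a - b) ** 2 for a, b in zip(p, q)),
        )[:k]
        q = rng.choice(neighbors)
        t = rng.random()  # position along the segment from p toward q
        synth.append([a + t * (b - a) for a, b in zip(p, q)])
    return X + synth, y + [minority] * n_extra

# Toy data (hypothetical): 95 majority rows (class 0), 5 minority rows (class 1).
rng = random.Random(0)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(95)] + \
    [[rng.gauss(3, 1), rng.gauss(3, 1)] for _ in range(5)]
y = [0] * 95 + [1] * 5

weights = class_weights(y)
Xu, yu = random_undersample(X, y, 1, rng)
Xo, yo = random_oversample(X, y, 1, rng)
Xs, ys = smote(X, y, 1, rng)

for name, labels in [("original", y), ("under", yu), ("over", yo), ("SMOTE", ys)]:
    print(name, dict(Counter(labels)))
```

Weighting leaves the rows unchanged and instead adjusts how much each row contributes to the fit, under-sampling discards majority rows, over-sampling repeats minority rows, and the SMOTE-style step creates new minority rows that lie between existing ones. Which of these works best in combination with a given predictive model is the question the paper investigates.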