Anomaly Detection and JMP® Pro (2019-US-45MP-218)

Level: Advanced

Michael Crotty, @michael_jmp, JMP Senior Statistical Writer, SAS
Colleen McKendry, @cozmck, JMP Technical Writer, SAS
Marie Gaudard, @marie_gaudard, Statistical Consultant

In situations where anomaly detection is the goal of a predictive model, the underlying data often exhibit an imbalanced class distribution. Namely, the anomalous class is significantly smaller than the non-anomalous class. The modeling goal is usually to identify members of the minority class. However, a straightforward application of predictive modeling techniques can result in a biased and inaccurate model. Many techniques have been proposed to address these issues. We seek to guide JMP Pro users in developing predictive models for imbalanced data. We address JMP Pro approaches to classification into an underrepresented class. We first describe general aspects of the imbalanced class problem: bias, performance measures and approaches to addressing the modeling issues. We then discuss the sampling methods we use in our study; these include weighting, under-sampling, over-sampling and the synthetic minority oversampling technique (SMOTE). For several real data sets that exhibit varying class proportions, we compare the fits obtained using these sampling methods in combination with predictive models available in JMP Pro classification platforms. We perform a similar exploration of sampling techniques and predictive models for a limited range of simulated data sets. For the simulated data sets, we attempt to identify the degree of under-representation for which standard models begin to be affected by class imbalance. We also present conclusions about the relative performance of the sampling methods and predictive models.
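The four sampling approaches named above (weighting, under-sampling, over-sampling, and SMOTE) can be sketched in a few lines. This is a minimal illustration in Python with NumPy, not the authors' JMP Pro scripts or the add-in's implementation. All function names here are hypothetical, the data are simulated, and the SMOTE sketch assumes continuous predictors and at least two minority rows.

```python
import numpy as np

def class_weights(y):
    """Inverse-frequency weights: each class gets the same total weight."""
    classes, counts = np.unique(y, return_counts=True)
    w = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), w.tolist()))

def random_undersample(X, y, minority=1, seed=0):
    """Drop majority rows at random until both classes have equal size."""
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y == minority)
    maj_idx = np.flatnonzero(y != minority)
    keep = rng.choice(maj_idx, size=len(min_idx), replace=False)
    idx = np.concatenate([min_idx, keep])
    return X[idx], y[idx]

def random_oversample(X, y, minority=1, seed=0):
    """Duplicate minority rows at random until both classes have equal size."""
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y == minority)
    maj_idx = np.flatnonzero(y != minority)
    extra = rng.choice(min_idx, size=len(maj_idx) - len(min_idx), replace=True)
    idx = np.concatenate([maj_idx, min_idx, extra])
    return X[idx], y[idx]

def smote(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority rows, each placed at a random point on
    the segment between a minority row and one of its k nearest minority
    neighbors (Euclidean distance, continuous predictors only)."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    k = min(k, n - 1)
    # Pairwise distances among minority rows; a row is not its own neighbor.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    nbrs = np.argsort(d, axis=1)[:, :k]
    base = rng.integers(0, n, size=n_new)
    pick = nbrs[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[pick] - X_min[base])

# Simulated example: 1000 rows, 5% in the minority class (label 1).
demo_rng = np.random.default_rng(1)
X = demo_rng.normal(size=(1000, 2))
y = (np.arange(1000) < 50).astype(int)

weights = class_weights(y)
X_under, y_under = random_undersample(X, y)
X_over, y_over = random_oversample(X, y)
X_syn = smote(X[y == 1], n_new=900)
```

Under-sampling discards information from the majority class, over-sampling repeats minority rows exactly, and SMOTE adds new rows that lie between existing minority rows. Which of these works best for a given model and imbalance ratio is the question the study addresses.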

An add-in that performs the analyses discussed in this presentation is now available: Imbalanced Classification Add-In. The add-in expands the capabilities of the scripts found in Scripts_and_Results.zip. Among other features, the add-in does not depend on R, it performs SMOTE, Tomek, and SMOTE plus Tomek sampling on data sets that include nominal and ordinal predictors, and it provides extensive documentation through the dialog's Help button.

Natalia · 06-26-2020

Thank you so much for providing this resource! I have a question about obtaining the AUC for the Precision curve. I have R installed on the desktop, and when I run the Imbalanced Data script, it gets through creating the data sets and then says "Getting AUC values for Precision Curve". It sits there for about 10-145 sec and then pops up a JMP error. It does not create a report of any kind. Before I installed R, it would create the report (just like your PowerPoint shows), but the AUC values for the precision curve were not calculated. What am I doing wrong? I would like to have those calculated automatically.

Thank you

marie_gaudard · 06-27-2020

Hi Natalia,

Thank you for using our script! It sounds as if the problem is connected with the R installation. In the version of the script that is currently posted, we obtain the Precision Recall curve AUC values from R. However, we have a new version of the script that does not use R at all, and has additional features, but it may require JMP 15 to run.

If you would like, I could send you that version, packaged as an add-in. However, I don't see a way to attach a file to this email through the JMP Community. If you could send me your email address, then I can send you a draft of the new add-in. My email address is mgaudard@tampabay.rr.com.

--Marie

marie_gaudard · 09-10-2020

Hi Natalia,

I just wanted to let you know that the Imbalanced Data scripts from the Tucson conference have now been packaged into an add-in, which you can find here: Imbalanced Classification Add-In.

The add-in has expanded capabilities: among other features, it does not depend on R, it performs SMOTE, Tomek, and SMOTE plus Tomek sampling on data sets that include nominal and ordinal predictors, and it provides extensive documentation through the dialog's Help button.

We hope you find it useful!

Marie