Isolation forest and isolation trees

FN · ‎03-04-2023

Isolation Forest has emerged in recent years as arguably one of the most popular anomaly detectors, owing to its computational efficiency, strong scalability, and general effectiveness across different benchmarks.

Isolation forest compared to others (src)

The algorithm isolates observations by randomly selecting a feature and then randomly selecting a split value between the minimum and maximum values of that feature. The implementation is therefore essentially the same as for partition trees and bootstrap trees, with the difference that the target should be uniform (that is, the tree splits on the variables while Y holds a constant value).

Yet, if you try this in JMP, it doesn’t work.
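To make the random-split procedure described above concrete, here is a minimal pure-Python sketch of an isolation forest. It is not JMP's or scikit-learn's implementation. The function names, the subsample size of 64, the tree count, and the toy data are all assumptions chosen for illustration. The normalising term follows the average-path-length formula from the original Isolation Forest paper.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def average_path_length(n):
    """Average path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def build_tree(points, rng, depth, max_depth):
    """Grow one isolation tree with random axis-parallel splits; no Y is involved."""
    if len(points) <= 1 or depth >= max_depth:
        return {"size": len(points)}
    # Sketch-level choice: only consider features that still vary in this node.
    candidates = [d for d in range(len(points[0]))
                  if min(p[d] for p in points) < max(p[d] for p in points)]
    if not candidates:
        return {"size": len(points)}
    dim = rng.choice(candidates)                      # random feature
    low = min(p[dim] for p in points)
    high = max(p[dim] for p in points)
    split = rng.uniform(low, high)                    # random split value
    return {
        "dim": dim,
        "split": split,
        "left": build_tree([p for p in points if p[dim] < split], rng, depth + 1, max_depth),
        "right": build_tree([p for p in points if p[dim] >= split], rng, depth + 1, max_depth),
    }


def path_length(tree, x, depth=0):
    """Depth at which x lands in a leaf, plus a correction for unsplit leaf points."""
    if "size" in tree:
        return depth + average_path_length(tree["size"])
    branch = "left" if x[tree["dim"]] < tree["split"] else "right"
    return path_length(tree[branch], x, depth + 1)


def anomaly_scores(data, queries, n_trees=100, sample_size=64, seed=0):
    """Score in (0, 1]; values closer to 1 mean the point is isolated sooner."""
    rng = random.Random(seed)
    sample_size = min(sample_size, len(data))
    max_depth = math.ceil(math.log2(sample_size))
    trees = [build_tree(rng.sample(data, sample_size), rng, 0, max_depth)
             for _ in range(n_trees)]
    norm = average_path_length(sample_size)
    scores = []
    for q in queries:
        mean_depth = sum(path_length(t, q) for t in trees) / n_trees
        scores.append(2.0 ** (-mean_depth / norm))
    return scores


# Toy data (an assumption): one Gaussian cluster plus a single distant point.
data_rng = random.Random(1)
cluster = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(200)]
outlier = (8.0, 8.0)
center = (0.0, 0.0)
outlier_score, center_score = anomaly_scores(cluster + [outlier], [outlier, center])
print(f"outlier: {outlier_score:.2f}, center: {center_score:.2f}")
```

The distant point is separated after only a few random cuts, so its average path is short and its score is higher than that of a point in the middle of the cluster.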

If Isolation Forest were included in JMP, in a way similar to predictor screening, users would have access to a powerful tool for detecting anomalies in their data. This could help them identify unusual patterns or behaviors that may warrant further investigation.

This paper studies how IForest works and addresses a few of its limitations (i.e., extended isolation forest):

https://hal.science/hal-03537102/document

Scikit-learn documentation

https://scikit-learn.org/stable/modules/outlier_detection.html

Victor_G · ‎03-09-2023

Great suggestion @FN.

In addition to Isolation Forest, it might also be interesting to consider Extended Isolation Forest, which is more flexible than the "regular" Isolation Forest algorithm in the directions of its decision boundaries (not only horizontal or vertical).

Here is an article explaining how Extended Isolation Forest works and how to use it: https://link.medium.com/TL1Lxirf9pb
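As a rough illustration of the "more flexible directions" point, the sketch below shows a single Extended-Isolation-Forest-style split: a random normal vector and a random intercept inside the bounding box of the node's data, with points branching on the sign of (x - p)·n. The function names and the toy data are assumptions, not code from the article or from any package.

```python
import random


def goes_left(x, normal, intercept):
    """Branching rule: a point goes left when (x - intercept) . normal <= 0."""
    return sum((xi - pi) * ni for xi, pi, ni in zip(x, intercept, normal)) <= 0.0


def random_hyperplane(points, rng):
    """Draw a random slope and a random intercept within the data's bounding box."""
    n_dims = len(points[0])
    normal = [rng.gauss(0.0, 1.0) for _ in range(n_dims)]
    intercept = [rng.uniform(min(p[d] for p in points), max(p[d] for p in points))
                 for d in range(n_dims)]
    return normal, intercept


rng = random.Random(0)
points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100)]
normal, intercept = random_hyperplane(points, rng)
left = [p for p in points if goes_left(p, normal, intercept)]
right = [p for p in points if not goes_left(p, normal, intercept)]
print(len(left), len(right))

# The "regular" axis-parallel split is the special case where the normal
# vector has a single non-zero coordinate, e.g. normal = (1, 0) means "x0 <= 1".
print(goes_left((0.0, 0.0), (1.0, 0.0), (1.0, 5.0)))
```

Growing a full extended tree would reuse the same recursion as a regular isolation tree, only with this branching rule in place of the single-feature threshold.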

FN · ‎03-10-2023

Extended trees or forests will be even better, indeed.
Following your link, the original Extended Isolation Forest paper is here: https://arxiv.org/pdf/1811.02141.pdf

mia_stephens · ‎04-28-2023

Thank you for this request, @FN. We are currently investigating.

FN · ‎05-17-2023

Thanks, Mia. Note that the implementation of an isolation tree is the same as that of a partition tree.

The "only" difference is that the partition happens randomly.

In JMP, if you introduce a Y without any variability (all 0s, for example), JMP doesn't perform any split (which is expected; otherwise the results would become stochastic at some point).

In the literature, this is known as ExtraTrees.

https://en.wikipedia.org/wiki/Random_forest#ExtraTrees
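A small sketch of why a criterion-driven tree has nothing to do when Y is constant, while a random split still works. The squared-error gain below is a generic stand-in and an assumption. It is not JMP's actual split criterion, and the variable names are illustrative.

```python
import random


def sum_squared_error(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def split_gain(x, y, threshold):
    """Reduction in squared error of y when the rows are split at x < threshold."""
    left = [yi for xi, yi in zip(x, y) if xi < threshold]
    right = [yi for xi, yi in zip(x, y) if xi >= threshold]
    return sum_squared_error(y) - sum_squared_error(left) - sum_squared_error(right)


rng = random.Random(0)
x = [rng.uniform(0, 10) for _ in range(50)]
constant_y = [0.0] * len(x)
varying_y = [xi * 2.0 for xi in x]

# With a constant Y, every candidate threshold has zero gain,
# so a gain-maximising tree has no reason to prefer (or make) any split.
gains_constant = [split_gain(x, constant_y, t) for t in sorted(x)]

# With a Y that varies with x, the same thresholds give a positive best gain.
best_gain_varying = max(split_gain(x, varying_y, t) for t in sorted(x))

# A random split ignores Y entirely: it only needs the range of x.
random_threshold = rng.uniform(min(x), max(x))
left_count = sum(xi < random_threshold for xi in x)
print(max(gains_constant), best_gain_varying > 0, left_count)
```

Under this reading, the change needed for an isolation tree is to pick the threshold at random instead of maximising a gain that is identically zero.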

FN · ‎05-21-2024

Coming back to this request to see if there are any updates.

This paper summarizes why JMP should have this implemented.

“33 unsupervised anomaly detection algorithms on 52 real-world multivariate tabular data sets, performing the largest comparison of unsupervised anomaly detection algorithms to date. On this collection of data sets, the EIF (Extended Isolation Forest) algorithm significantly outperforms the most other algorithms.”

FN · ‎06-21-2024

Hello @mia_stephens, any update on this? Thanks

mia_stephens · ‎06-21-2024

Hi @FN, we've done some research on using the Python integration in JMP 18 with the Python ‘isotree’ package, using JSL to create dialog windows in JMP. It looks promising, but we don't have anything to share at this point. Curious if you've taken a look at this approach?

FN · ‎07-12-2024

Thanks, Mia.

As shared in my previous comments, this approach would be very useful in JMP, as it outperforms other anomaly detection algorithms on global outliers.

KNN distances (within the outlier analysis menu of JMP) work well for local outliers.

Adding Extended Isolation Forest there would make it quick to identify and remove all types of outliers in massive datasets.

Reference: Unsupervised Anomaly Detection Algorithms on Real-world Data: How Many Do We Need?

The partition tree can also be modified to split the data randomly when Y is not given.
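For comparison with the KNN distances mentioned above, here is a minimal sketch of a k-th-nearest-neighbor distance score. It is a brute-force version for illustration only. The function name, the choice of k = 3, and the toy grid are assumptions, and this is not how JMP computes its KNN outlier distances.

```python
import math


def kth_neighbor_distance(points, k=3):
    """For each point, the Euclidean distance to its k-th nearest other point."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


# Toy data (an assumption): a 5 x 5 unit grid plus one distant point.
grid = [(float(a), float(b)) for a in range(5) for b in range(5)]
points = grid + [(10.0, 10.0)]
scores = kth_neighbor_distance(points, k=3)
most_isolated = scores.index(max(scores))
print(points[most_isolated])
```

The brute-force loop is quadratic in the number of rows, which is one reason a tree-based detector is attractive for massive datasets.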

abmayfield · ‎08-06-2024

I just saw a JMP-Python integration webinar by Ross M (which was great) in which he highlights this approach in Python, but it seems like it could relatively easily be done natively in JMP. It appears to combine techniques that JMP already offers, like partition trees and multivariate outlier analysis.

mia_stephens · ‎10-09-2024

Under development as a Pythonic add-in.