
Isolation forest and isolation trees

Isolation Forest has emerged as arguably one of the most popular anomaly detectors in recent years, thanks to its general effectiveness across different benchmarks, strong scalability, and computational efficiency.

 

[Figure: Isolation Forest compared to other anomaly detectors (src)]


The algorithm isolates observations by randomly selecting a feature and then randomly selecting a split value between the minimum and maximum of that feature. The implementation is therefore essentially the same as partition trees and bootstrap forests, except that the splits are random rather than driven by a target function (equivalent to trying to split with a constant Y value).
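The random-split idea above can be sketched in a few lines of Python. This is a toy illustration only, not JMP's or any library's implementation; the data, depth limit, and tree count are made-up values for the demo:

```python
import random

def grow_tree(data, depth=0, max_depth=8):
    """Recursively isolate points: pick a random feature, then a
    uniform random split between that feature's min and max."""
    if depth >= max_depth or len(data) <= 1:
        return {"size": len(data)}  # leaf: points are (nearly) isolated
    f = random.randrange(len(data[0]))
    lo = min(row[f] for row in data)
    hi = max(row[f] for row in data)
    if lo == hi:  # feature is constant here, cannot split on it
        return {"size": len(data)}
    split = random.uniform(lo, hi)
    return {"feature": f, "split": split,
            "left": grow_tree([r for r in data if r[f] < split], depth + 1, max_depth),
            "right": grow_tree([r for r in data if r[f] >= split], depth + 1, max_depth)}

def path_length(tree, point, depth=0):
    """Depth of the leaf a point falls into; anomalies isolate quickly,
    so they tend to end up in shallow leaves."""
    if "size" in tree:
        return depth
    child = "left" if point[tree["feature"]] < tree["split"] else "right"
    return path_length(tree[child], point, depth + 1)

# Demo: a tight 2-D Gaussian cluster plus one obvious outlier.
random.seed(0)
cluster = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(200)]
outlier = [10.0, 10.0]
trees = [grow_tree(cluster + [outlier]) for _ in range(50)]
avg = lambda pt: sum(path_length(t, pt) for t in trees) / len(trees)
print("outlier avg depth:", avg(outlier))   # typically much shallower
print("inlier avg depth:", avg(cluster[0]))
```

Averaging the path length over many such random trees is what turns a single isolation tree into an Isolation Forest.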

Yet, if you try this in JMP, it doesn’t work.

By including Isolation Forest in JMP, similar to Predictor Screening, users would have access to a powerful tool for detecting anomalies in their data. This could help them identify unusual patterns or behaviors that warrant further investigation.

 

This paper studies how Isolation Forest works and how its few limitations are addressed (e.g., by the Extended Isolation Forest):

https://hal.science/hal-03537102/document

 

Scikit-learn documentation

https://scikit-learn.org/stable/modules/outlier_detection.html
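For reference, the original Isolation Forest paper (Liu et al., 2008) converts the average path length into an anomaly score in [0, 1] by normalizing with the expected path length of an unsuccessful binary search on n points. A small self-contained sketch of that scoring formula (the sample values below are illustrative, not from any real dataset):

```python
import math

EULER_GAMMA = 0.5772156649  # Euler–Mascheroni constant

def c(n):
    """Expected path length of an unsuccessful BST search on n points,
    used to normalize isolation depths (Liu et al., 2008)."""
    if n <= 1:
        return 0.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # harmonic-number approximation
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(avg_path_length, n):
    """s = 2^(-E[h(x)] / c(n)); near 1 means anomalous, near 0.5 unremarkable."""
    return 2.0 ** (-avg_path_length / c(n))

# A point isolated after only 3 splits out of 256 samples looks anomalous:
print(round(anomaly_score(3.0, 256), 2))
# A point needing exactly the average c(n) splits scores 0.5:
print(anomaly_score(c(256), 256))  # → 0.5
```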

 

 

11 Comments
Victor_G
Super User

Great suggestion @FN.

 

In addition to Isolation Forest, it might also be interesting to consider the Extended Isolation Forest instead of the "regular" algorithm, as it provides more flexibility in the directions of decision boundaries (not only horizontal or vertical).

Here is an article explaining how the Extended Isolation Forest works and how to use it: https://link.medium.com/TL1Lxirf9pb
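The key difference is in how a single cut is made: a standard isolation tree thresholds one feature, while an extended isolation tree cuts along a random hyperplane. A minimal sketch of one such cut, assuming plain lists of points (toy data, not any library's API):

```python
import random

def random_hyperplane_split(data):
    """One EIF-style cut: draw a random direction (normal vector) and a
    random intercept within the data's projected range, then split points
    by which side of the hyperplane they fall on. Sketch only."""
    dim = len(data[0])
    normal = [random.gauss(0, 1) for _ in range(dim)]  # random slope, any direction
    projections = [sum(n * x for n, x in zip(normal, row)) for row in data]
    pivot = random.uniform(min(projections), max(projections))
    left = [row for row, p in zip(data, projections) if p < pivot]
    right = [row for row, p in zip(data, projections) if p >= pivot]
    return normal, pivot, left, right

random.seed(1)
pts = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(100)]
normal, pivot, left, right = random_hyperplane_split(pts)
print(len(left), len(right))  # every point lands on exactly one side
```

Setting all but one coordinate of the normal vector to zero recovers the ordinary axis-aligned split, which is why EIF is a strict generalization.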

FN
Level VI

Extended trees or forests would be even better, indeed.
Following your link, the original paper on the extended version is here: https://arxiv.org/pdf/1811.02141.pdf




mia_stephens
Staff
Status changed to: Investigating

Thank you for this request, @FN. We are currently investigating.

FN
Level VI

Thanks, Mia. Notice that the implementation of an isolation tree is the same as a partition tree. 


The "only" difference is that the partition happens randomly.

 

In JMP, if you introduce a Y without any variability (all 0s, for example), JMP doesn't perform any split (which is expected; otherwise solutions would become stochastic).

 

In the literature, this is known as ExtraTrees (extremely randomized trees).

https://en.wikipedia.org/wiki/Random_forest#ExtraTrees

FN
Level VI

Coming back to this request to see if there are any updates.


This paper summarizes why JMP should have this implemented.


“33 unsupervised anomaly detection algorithms on 52 real-world multivariate tabular data sets, performing the largest comparison of unsupervised anomaly detection algorithms to date. On this collection of data sets, the EIF (Extended Isolation Forest) algorithm significantly outperforms most other algorithms.”

 

FN
Level VI

Hello @mia_stephens, any update on this? Thanks

mia_stephens
Staff

Hi @FN , we've done some research on using the Python integration in JMP 18 with the Python ‘isotree’ package, using JSL to create dialog windows in JMP. It looks promising, but we don't have anything to share at this point. Curious if you've taken a look at this approach?

FN
Level VI

Thanks, Mia.

 

As shared in my previous comments, this approach would be very useful in JMP, as it outperforms other anomaly detection algorithms on global outliers.

 

KNN distances (within the outlier analysis menu of JMP) work well for local outliers.

 

Adding the Extended Isolation Forest there would make it easy to quickly identify and remove all types of outliers in massive datasets.

Unsupervised Anomaly Detection Algorithms on Real-world Data: How Many Do We Need?

The partition tree can also be modified to split the data randomly when Y is not given.


 

 

 

abmayfield
Level VI

I just saw a JMP-Python integration webinar by Ross M (which was great) in which he highlights this approach in Python, but it seems like it could relatively easily be done natively in JMP. It appears to mix techniques JMP already offers, like partition trees and multivariate outlier analysis.

mia_stephens
Staff
Status changed to: Yes, Stay Tuned!

Under development as a Pythonic add-in.