sanch1
Level I

Outlier Analysis

I have a dataset where I'm trying to identify outliers or other points of interest. My model's Actual by Predicted plot shows some points with a star marker. However, the Mahalanobis distance and Jackknife distance do not necessarily identify these points in the same manner. Is there a recommendation on outlier analysis?

sanch1_0-1714499535650.png

sanch1_1-1714499605424.png

 

 

Accepted Solution
Victor_G
Super User

Re: Outlier Analysis

Hi @sanch1,

 

You're comparing two very different ways of assessing outliers, with different goals:

  • Model-agnostic outlier detection methods, like Mahalanobis distances, don't rely on a specified model; they just compare distances between points based on the variables/factors/features. An outlier identified by this type of method is a point that looks "strange" and doesn't seem to belong to the distribution of factor values of the other points.
  • Model-based outlier detection methods rely on residuals (Studentized residuals, or other metrics like PRESS RMSE/R2...) to identify points that are not well fitted/predicted by the model. Such points may indicate that the model is missing some important terms (like interactions or non-linear effects), or is not appropriate for the data. You can use the other diagnostic panels/tools to check whether the model seems to fit your data well, based on the statistical significance of the model and on metrics chosen according to your goal (information criteria, RMSE, R2...). You can also check whether a detected model-based outlier has a strong impact on your model by calculating Cook's distances: https://www.jmp.com/en_us/statistics-knowledge-portal/what-is-multiple-regression/mlr-residual-analy... If the values for these points are high and/or unusual, the points could be influential and may bias your model. You should then investigate what these points are and whether the measurements are valid, before deciding whether to diminish their influence on the model or remove them.
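
To make the difference between the two families concrete, here is a minimal Python/NumPy sketch (not JMP output, and not necessarily how JMP computes these quantities internally). The data are synthetic and the function names are my own: row 0 is planted as unusual in the factor space but consistent with the model, and row 1 has ordinary factor values but a shifted response. With this construction, the Mahalanobis distance and the externally studentized residual are expected to single out different rows.

```python
import numpy as np

def mahalanobis_distances(X):
    """Distance of each row from the column means, scaled by the sample covariance."""
    centered = X - X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    return np.sqrt(np.einsum("ij,jk,ik->i", centered, cov_inv, centered))

def studentized_residuals(X, y):
    """Externally studentized residuals of an ordinary least squares fit with intercept."""
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])                     # design matrix
    p = A.shape[1]
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    leverage = np.einsum("ij,ji->i", A, np.linalg.pinv(A))   # diagonal of the hat matrix
    sse = resid @ resid
    # Error variance estimated with observation i left out
    s2_loo = (sse - resid**2 / (1 - leverage)) / (n - p - 1)
    return resid / np.sqrt(s2_loo * (1 - leverage))

# Synthetic data (an assumption for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1 + 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.3, size=50)

X[0] = [6.0, -6.0]                 # row 0: far from the other points in factor space...
y[0] = 1 + 2 * 6.0 - (-6.0)        # ...but its response follows the same relationship
y[1] += 5.0                        # row 1: ordinary factors, shifted response

dist = mahalanobis_distances(X)
t = studentized_residuals(X, y)
print("largest Mahalanobis distance: row", int(np.argmax(dist)))
print("largest |studentized residual|: row", int(np.argmax(np.abs(t))))
```

Neither diagnostic is wrong here; they simply answer different questions, which is why the two lists of flagged points need not agree.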


More info on outliers in my previous answer on a similar topic: https://community.jmp.com/t5/Discussions/Supress-the-effect-of-outliers-when-fitting-the-model-and-i...

I hope this answer will help you,

Victor GUILLER

"It is not unusual for a well-designed experiment to analyze itself" (Box, Hunter and Hunter)

statman
Super User

Re: Outlier Analysis

How was the data collected?  This has a huge effect on what analysis is appropriate.  As Victor indicates, Mahalanobis is a multivariate outlier detector.  If your response is univariate, you may want to use good old control charts, but again, it depends on how the data was gathered.
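
As a rough illustration of the control chart idea for a univariate response, here is a small Python sketch of an individuals (I-MR style) chart. It assumes the values are listed in the order they were collected, which is exactly why the data collection question matters. The data and the function name are made up for the example; the constant 1.128 is the usual d2 value for moving ranges of size 2.

```python
import numpy as np

def individuals_chart_limits(y):
    """Center line and 3-sigma limits for an individuals chart.

    Sigma is estimated from the average moving range divided by d2 = 1.128.
    """
    y = np.asarray(y, dtype=float)
    moving_range = np.abs(np.diff(y))      # |y[i] - y[i-1]| for consecutive points
    center = y.mean()
    sigma_hat = moving_range.mean() / 1.128
    return center - 3 * sigma_hat, center, center + 3 * sigma_hat

# Hypothetical measurements in time order
y = np.array([10.1, 9.8, 10.0, 10.3, 9.9, 10.2, 13.5, 10.0, 9.7, 10.1])

lcl, center, ucl = individuals_chart_limits(y)
flagged = np.flatnonzero((y < lcl) | (y > ucl))
print(f"LCL = {lcl:.2f}, center = {center:.2f}, UCL = {ucl:.2f}")
print("points outside the limits:", flagged.tolist())
```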

"All models are wrong, some are useful" G.E.P. Box
hogi
Level XII

Re: Outlier Analysis

Nice blog post series by @JerryFish about Mahalanobis and jackknife outlier detection:
Outliers Episode 3: Detecting outliers using the Mahalanobis distance (and T2) 

Outliers Episode 4: Detecting outliers using jackknife distance 
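
For readers who want to see the difference between the two distances outside JMP, here is a minimal Python/NumPy sketch. It assumes the jackknife distance of a row is a Mahalanobis-type distance computed with the mean and covariance estimated from all the other rows; that is my reading of the definition, and JMP's implementation may differ in details. The data and function names are invented for the example.

```python
import numpy as np

def mahalanobis_distances(X):
    """Distance of each row from the mean, using estimates from ALL rows."""
    centered = X - X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    return np.sqrt(np.einsum("ij,jk,ik->i", centered, cov_inv, centered))

def jackknife_distances(X):
    """Distance of each row from the mean, using estimates that EXCLUDE that row."""
    n = X.shape[0]
    out = np.empty(n)
    for i in range(n):
        others = np.delete(X, i, axis=0)           # leave row i out
        diff = X[i] - others.mean(axis=0)
        cov = np.cov(others, rowvar=False)
        out[i] = np.sqrt(diff @ np.linalg.solve(cov, diff))
    return out

# Synthetic data with one planted multivariate outlier in row 0
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))
X[0] = [5.0, 5.0, -5.0]

d_m = mahalanobis_distances(X)
d_j = jackknife_distances(X)
for i in np.argsort(d_j)[::-1][:3]:
    print(f"row {i}: Mahalanobis = {d_m[i]:.2f}, jackknife = {d_j[i]:.2f}")
```

Under this plain leave-one-out definition, an extreme point cannot inflate the covariance estimate used to judge itself, so its jackknife distance comes out larger than its ordinary Mahalanobis distance.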

 

How did you calculate the distances?

hogi_1-1714585665158.png

hogi_3-1714585916119.png

 

 

 

Re: Outlier Analysis

@sanch1,

There is another option called Cook's D Influence. You can save Cook's D Influence through the Save Columns option under the Response red triangle. Select the saved column and then run a Distribution. If any of your data points have a Cook's D >= 1, they can be considered influential and are worth investigating as potential outliers. See the example distribution image below.

 

Bill_Worley_0-1714590910783.png
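
For anyone who wants to check the Cook's D >= 1 rule of thumb outside JMP, here is a minimal Python/NumPy sketch using the standard formula based on residuals and leverage. The data are synthetic, with one planted point that has both an extreme x value and a response far from the trend of the rest; the function name and the data are assumptions for illustration, not JMP's implementation.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance for each row of an ordinary least squares fit with intercept."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])                     # design matrix
    p = A.shape[1]                                           # number of fitted parameters
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    leverage = np.einsum("ij,ji->i", A, np.linalg.pinv(A))   # diagonal of the hat matrix
    s2 = resid @ resid / (n - p)                             # mean squared error
    return resid**2 / (p * s2) * leverage / (1 - leverage) ** 2

# Synthetic data: ten points on a line plus one planted influential point
rng = np.random.default_rng(2)
x = np.append(np.arange(10.0), 30.0)
y = 2 * x + rng.normal(scale=0.5, size=x.size)
y[-1] = 20.0                        # far below the trend of the other points

d = cooks_distance(x, y)
flagged = np.flatnonzero(d >= 1)
print("rows with Cook's D >= 1:", flagged.tolist())
```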