Solved: Re: Outlier Analysis

sanch1 · Apr 30, 2024 01:53 PM

I have a dataset where I'm trying to identify outliers or other points of interest. My model's actual-by-predicted plot shows some points with a star marker. However, the Mahalanobis distance and the jackknife distance do not necessarily identify these points in the same way. Is there a recommended approach to outlier analysis?

Victor_G · Apr 30, 2024 2:07 PM

Hi @sanch1,

You're comparing two very different ways of assessing outliers, with different goals:

Model-agnostic outlier detection methods, like Mahalanobis distances, don't rely on a specified model; they only compare distances between points based on the variables/factors/features. An outlier identified by this type of method is a point that looks "strange" and doesn't seem to belong to the factor distributions of the other points.
Model-based outlier detection methods, based on residuals (Studentized residuals, or other metrics like PRESS RMSE/R2...), identify points that are not well fitted/predicted by the model. This may indicate that the model is missing some important terms (like interactions or non-linear effects), or is not appropriate for the data. You can use the other diagnostic panels/tools to check whether the model seems to fit your data well, based on the statistical significance of the model and on metrics chosen according to your goal, like information criteria/RMSE/R2... You can also check whether a detected model-based outlier has a strong impact on your model by calculating Cook's distances: https://www.jmp.com/en_us/statistics-knowledge-portal/what-is-multiple-regression/mlr-residual-analy... If the values for these points are high and/or unusual, these points could be influential and may bias your model. You should then investigate what these points are and whether the measurements are valid, before deciding whether to reduce their influence on the model or remove them.
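To make the difference between the two families concrete, here is a minimal sketch in Python with NumPy (not JMP). The function names, the simulated data, and the two planted rows are assumptions made up for illustration; the residuals are externally studentized residuals from an ordinary least squares fit with an intercept.

```python
import numpy as np

def mahalanobis_distances(X):
    """Distance of each row from the centroid, scaled by the sample covariance."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    cov_inv = np.linalg.inv(np.atleast_2d(np.cov(X, rowvar=False)))
    return np.sqrt(np.einsum("ij,jk,ik->i", centered, cov_inv, centered))

def studentized_residuals(X, y):
    """Externally studentized residuals of an OLS fit with an intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(y)), X])   # design matrix
    n, p = A.shape
    H = A @ np.linalg.pinv(A)                   # hat matrix
    h = np.diag(H)                              # leverages
    resid = y - H @ y
    sse = resid @ resid
    # Error variance estimated with observation i left out
    s2_loo = (sse - resid**2 / (1 - h)) / (n - p - 1)
    return resid / np.sqrt(s2_loo * (1 - h))

# Simulated data (hypothetical): two factors, linear response, small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1 + 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=50)

# Row 0: ordinary factor settings, but a response far from the model
X[0] = [0.1, -0.1]
y[0] = 1 + 2 * 0.1 - (-0.1) + 3.0
# Row 1: extreme factor settings, but a response that follows the model exactly
X[1] = [6.0, 6.0]
y[1] = 1 + 2 * 6.0 - 6.0

md = mahalanobis_distances(X)
t = studentized_residuals(X, y)
print("largest Mahalanobis distance at row", int(np.argmax(md)))
print("largest |studentized residual| at row", int(np.argmax(np.abs(t))))
```

In this constructed data, row 1 stands out on the Mahalanobis distance while row 0 stands out on the studentized residual, which is the kind of disagreement described in the question.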

More info on outliers in my previous answer on a similar topic: https://community.jmp.com/t5/Discussions/Supress-the-effect-of-outliers-when-fitting-the-model-and-i...

I hope this answer will help you,

Victor GUILLER
L'Oréal Data & Analytics

"It is not unusual for a well-designed experiment to analyze itself" (Box, Hunter and Hunter)

statman · May 1, 2024 09:03 AM

How was the data collected? This has a huge effect on what analysis is appropriate. As Victor indicates, Mahalanobis is a multivariate outlier detector. If your response is univariate, you may want to use good old control charts, but again, it depends on how the data was gathered.
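As a rough illustration of the control chart idea, here is a minimal Python sketch of an individuals chart, assuming a univariate response recorded in time order. The function name and the data are made up for illustration. The 3-sigma limits use the average moving range divided by 1.128, the usual constant for ranges of two consecutive points.

```python
import numpy as np

def individuals_chart_limits(values):
    """Center line and 3-sigma limits of an individuals (I) chart."""
    values = np.asarray(values, dtype=float)
    moving_range = np.abs(np.diff(values))   # ranges of consecutive points
    sigma = moving_range.mean() / 1.128      # d2 constant for subgroups of 2
    center = values.mean()
    return center - 3 * sigma, center, center + 3 * sigma

# Hypothetical measurements in time order
data = [10.1, 9.9, 10.0, 10.2, 9.8, 10.1, 13.0, 10.0, 9.9, 10.1]
lcl, center, ucl = individuals_chart_limits(data)

# Indices of points outside the control limits
flagged = [i for i, v in enumerate(data) if v < lcl or v > ucl]
print("limits:", round(lcl, 3), round(ucl, 3), "flagged:", flagged)
```

Whether such a chart is appropriate still depends on how the data was gathered, as noted above.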

"All models are wrong, some are useful" G.E.P. Box

hogi · May 1, 2024 01:54 PM

Nice Blog post series by @JerryFish about Mahalanobis and Jackknife outlier detection:
Outliers Episode 3: Detecting outliers using the Mahalanobis distance (and T2)

Outliers Episode 4: Detecting outliers using jackknife distance

How did you calculate the distances?
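For readers without access to the posts, here is a minimal Python sketch of the jackknife idea. It assumes the jackknife distance of an observation is its Mahalanobis distance computed with the mean and covariance estimated from all the other observations. The function names and data are hypothetical, and JMP's own implementation may differ in details.

```python
import numpy as np

def mahalanobis_distances(X):
    """Mahalanobis distance using the mean and covariance of all rows."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    return np.sqrt(np.einsum("ij,ij->i", centered, np.linalg.solve(cov, centered.T).T))

def jackknife_distances(X):
    """Distance of each row using mean and covariance estimated without that row."""
    X = np.asarray(X, dtype=float)
    out = np.empty(len(X))
    for i in range(len(X)):
        rest = np.delete(X, i, axis=0)                  # leave observation i out
        diff = X[i] - rest.mean(axis=0)
        cov = np.atleast_2d(np.cov(rest, rowvar=False))
        out[i] = np.sqrt(diff @ np.linalg.solve(cov, diff))
    return out

# Hypothetical data: 30 ordinary points, with row 0 replaced by an extreme one
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))
X[0] = [5.0, 5.0]

md = mahalanobis_distances(X)
jd = jackknife_distances(X)
print("row 0: Mahalanobis", round(md[0], 2), "jackknife", round(jd[0], 2))
```

Because the extreme point no longer inflates the covariance estimate it is judged against, its jackknife distance comes out larger than its Mahalanobis distance.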

Bill_Worley · May 1, 2024 03:15 PM

@sanch1,

There is another option called Cook's D Influence. You can save Cook's D Influence through the Save Columns option under the Response red triangle. Select the saved column and then run a Distribution. Any data points with a Cook's D >= 1 can be considered potential outliers. See the example distribution image below.
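Outside of JMP, the same check can be sketched in Python with NumPy. This is a minimal illustration that assumes an ordinary least squares fit with an intercept; the function name and the small data set are made up, and the >= 1 cutoff is the rule of thumb given above.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's D for each observation of an OLS fit with an intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    A = np.column_stack([np.ones(len(y)), X])   # design matrix
    n, p = A.shape
    H = A @ np.linalg.pinv(A)                   # hat matrix
    h = np.diag(H)                              # leverages
    resid = y - H @ y
    mse = resid @ resid / (n - p)               # residual mean square
    return resid**2 * h / (p * mse * (1 - h) ** 2)

# Hypothetical data: the last point has high leverage and is far off the trend
x = [1, 2, 3, 4, 5, 10]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 2.0]

d = cooks_distance(x, y)
flagged = [i for i, value in enumerate(d) if value >= 1]
print("Cook's D:", np.round(d, 2), "flagged rows:", flagged)
```

Here only the last row exceeds the cutoff, so it is the one to investigate before deciding whether to keep it.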