Hi @Mathej01,
Studentized residuals can be a good way to identify outliers relative to an assumed model. For more information on how studentized residuals are calculated, see: Row Diagnostics (jmp.com)
"Points that fall outside the red limits should be treated as probable outliers. Points that fall outside the green limits but within the red limits should be treated as possible outliers, but with less certainty." As the definition shows, there is no absolute certainty about whether a point is an outlier, and the answer may depend on the model you assume.
I fully agree with the comment from @P_Bartell: there may be several explanations for the presence of your possible outlier (lurking variables, noise, different operators, ...).
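To make the idea of a studentized residual more concrete, here is a minimal sketch in plain Python. It assumes a simple straight-line model and the externally studentized form (each residual is scaled by an error estimate computed without that row). The function name and the data are made up for illustration, and this is not meant to reproduce JMP's output or its red/green limits.

```python
import math

def studentized_residuals(x, y):
    """Externally studentized residuals for a straight-line fit y = a + b*x."""
    n = len(x)
    p = 2  # number of estimated parameters: intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Leverage (diagonal of the hat matrix) for simple linear regression
    leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    sse = sum(e ** 2 for e in resid)
    result = []
    for e, h in zip(resid, leverage):
        # Error variance estimated with the current row left out
        s2_loo = (sse - e ** 2 / (1 - h)) / (n - p - 1)
        result.append(e / math.sqrt(s2_loo * (1 - h)))
    return result

# Made-up data: a straight line with small noise and a shifted response at index 6
x = list(range(12))
noise = [0.1, -0.2, 0.05, 0.15, -0.1, 0.0, 3.0, -0.05, 0.1, -0.15, 0.2, -0.1]
y = [1.0 + 0.5 * xi + d for xi, d in zip(x, noise)]

t = studentized_residuals(x, y)
suspect = max(range(len(t)), key=lambda i: abs(t[i]))
print("most extreme row:", suspect, "studentized residual:", round(t[suspect], 2))
```

Note that the result depends entirely on the straight-line model assumed here: with a different model, the same row could look unremarkable.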
The most important advice about outliers, before doing anything about them, could be: know your outliers. Depending on where they come from and what type they are (global, local/contextual or collective), they can be detected with different techniques and may influence the fitted models differently:
- Global outliers: In univariate data, box plots can easily detect them. In multivariate data, density-based clustering (DBSCAN, OPTICS, ...) or anomaly detection algorithms (Isolation Forest) might be helpful. They may result from recording/measurement errors or from anomalous samples.
- Contextual (conditional) outliers: Since a point might not be an outlier in every dimension of the data, K-Nearest Neighbors, Mahalanobis distances, Principal Component Analysis and Jackknife distances might be worth looking at. These outliers are worth considering in the model, since their abnormal values result from a specific context.
- Collective outliers: Clustering techniques might be helpful here: K-Means, Gaussian Mixtures, hierarchical clustering and other algorithms could be considered. These points can be treated as a group.
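As a small illustration of why the type of outlier matters, the sketch below (plain Python, made-up data, hypothetical function names) applies a box-plot rule to each variable separately and then a squared Mahalanobis distance to the pair. The last row sits inside the box-plot fences of both variables, yet it is far from the joint pattern of the other rows. The 1.5 x IQR fences are the usual box-plot convention, assumed here.

```python
import statistics

def boxplot_outliers(values, k=1.5):
    """Indices of values outside the box-plot fences (quartiles -/+ k * IQR)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]

def mahalanobis_sq_2d(xs, ys):
    """Squared Mahalanobis distance of each (x, y) row to the sample mean."""
    n = len(xs)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    det = sxx * syy - sxy ** 2  # determinant of the 2x2 covariance matrix
    distances = []
    for x, y in zip(xs, ys):
        dx, dy = x - mx, y - my
        # Quadratic form with the inverse covariance matrix written out by hand
        distances.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return distances

# Made-up data: two strongly correlated variables, plus a last row (3, 8.0)
# that breaks the correlation without being extreme in either variable
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 3]
ys = [1.1, 1.9, 3.2, 3.8, 5.1, 6.0, 6.9, 8.2, 8.8, 10.1, 8.0]

flag_x = boxplot_outliers(xs)
flag_y = boxplot_outliers(ys)
d2 = mahalanobis_sq_2d(xs, ys)
farthest = max(range(len(d2)), key=lambda i: d2[i])
print("box-plot flags:", flag_x, flag_y)
print("largest Mahalanobis distance at row:", farthest)
```

Because the mean and covariance are estimated with the suspect row included, the distance is somewhat damped by the outlier itself; a leave-one-out (jackknife) version would remove that effect.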
It is also good advice to check whether the possible outlier has an impact on the model, since not all outliers are high-leverage or influential points. You can fit the same model in parallel with and without this point to see how much it changes the fit. Look at the Actual vs. Predicted plot, the residual plot, and metrics like RMSE PRESS and R² PRESS to better assess whether the outlier has any influence on the model.
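The with/without comparison can be sketched in a few lines of plain Python. This assumes a straight-line model, made-up data and a hypothetical helper name. PRESS is computed from the leave-one-out residuals e / (1 - h), with RMSE PRESS taken as sqrt(PRESS / n) and R² PRESS as 1 - PRESS / SST; check these definitions against your own software before comparing numbers.

```python
import math

def fit_summary(x, y):
    """Least-squares line y = a + b*x with PRESS-based summary statistics."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    press = 0.0
    for xi, yi in zip(x, y):
        h = 1 / n + (xi - x_bar) ** 2 / sxx  # leverage of this row
        e = yi - (intercept + slope * xi)    # ordinary residual
        press += (e / (1 - h)) ** 2          # squared leave-one-out residual
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return {
        "slope": slope,
        "intercept": intercept,
        "rmse_press": math.sqrt(press / n),
        "r2_press": 1 - press / sst,
    }

# Made-up data: ten rows close to y = 1 + 0.5*x, plus one row far out in x
# whose response is well below the line
x = list(range(10)) + [20]
noise = [0.1, -0.2, 0.05, 0.15, -0.1, 0.0, 0.1, -0.05, 0.1, -0.15]
y = [1.0 + 0.5 * xi + d for xi, d in zip(x, noise)] + [5.0]

suspect = 10  # index of the row under investigation
with_point = fit_summary(x, y)
without_point = fit_summary(x[:suspect] + x[suspect + 1:],
                            y[:suspect] + y[suspect + 1:])

for label, fit in (("with", with_point), ("without", without_point)):
    print(label, {name: round(value, 3) for name, value in fit.items()})
```

If the slope and the PRESS statistics barely move when the row is removed, the point is an outlier with little influence; a large shift, as in this made-up example, indicates that the point is driving the fit.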
Try to analyze what caused the higher variability in the last 4-5 points (specific factor levels causing high noise, a change of operator or measurement device, ...?).
I hope this complementary answer helps,
Victor GUILLER
L'Oréal Data & Analytics
"It is not unusual for a well-designed experiment to analyze itself" (Box, Hunter and Hunter)