JMP_user2
Level I

Diagnostics with Studentized Residuals

Hi all,

 

I am trying to identify the outliers of a REML mixed model that includes two nominal fixed factors and one random factor. As shown in the screenshots below, the parity plot, the residual by row plot and the residual by predicted plot all clearly flag six observations as strong outliers (black points). However, these outliers are not captured in the studentized residual plot, where their residuals are comparable to the others (grey points) and fall within the +/- 3 limits. In addition, after removing three of the outliers and refitting the model, everything looks completely fine, and the other three observations initially identified as outliers are now well predicted by the model. Any input on what might be the root cause of this behavior? Thank you.

[Screenshots: parity plot, residual by row plot, residual by predicted plot, and studentized residual plot]

 

Refit the model after removing three of the outliers

 

[Screenshot: diagnostic plots after the refit]

 

 

1 ACCEPTED SOLUTION

statman
Super User

Re: Diagnostics with Studentized Residuals

First, welcome to the community.

My thoughts:

There are a number of statistics used to evaluate a given model (and its adequacy). Plots of residuals (including studentized residuals) may be very helpful in identifying outliers, but no one plot is the best in all circumstances. When plots identify potentially unusual points, remember that it is not the actual data that is unusual, but that the model did a poor job of predicting that data point. This is an indicator that the model may need to be re-evaluated (and, perhaps more importantly, you may get a better understanding of the true mechanisms/causal relationships at work).

Also remember that the model and all statistics associated with its evaluation (RMSE, p-values, the difference between R-square and adjusted R-square, etc.) are ALL CONDITIONAL. Change what is in the model, what estimates the MSE, the inference space, etc., and the model adequacy can/will change. That is why, when you removed data, a new model was created and the residual plots changed.

If outliers are identified, I always consider practical significance first. Beyond that, it is possible the terms in the model do not adequately predict the actual values. This is often the result of noise in the system, and possibly inconsistent noise. When plotting the residuals by row, always make sure the data is first sorted in run order. This may offer clues as to when the model has issues.
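To make the "conditional" point concrete, here is a minimal sketch in Python. It is not the REML mixed model from the question and not JMP's calculation: it assumes ordinary least squares on a single nominal factor, made-up data, and a hypothetical helper `fit_cell_means`. Three shifted points in one cell pull that cell's mean toward them, so all six points in the cell get large raw residuals, while the inflated RMSE keeps every studentized residual inside +/- 3. Dropping the three shifted points and refitting changes both the cell mean and the RMSE.

```python
import numpy as np

def fit_cell_means(y, group):
    """Least-squares cell-means fit (hypothetical helper, OLS only).

    Returns raw residuals, internally studentized residuals, and the RMSE.
    """
    levels = np.unique(group)
    X = (group[:, None] == levels[None, :]).astype(float)  # one indicator column per level
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    n, p = X.shape
    s = np.sqrt(resid @ resid / (n - p))                    # RMSE of this particular fit
    # Leverage h_ii: diagonal of the hat matrix X (X'X)^-1 X'
    h = np.einsum("ij,ij->i", X @ np.linalg.inv(X.T @ X), X)
    student = resid / (s * np.sqrt(1.0 - h))                # depends on s, so it is conditional on the fit
    return resid, student, s

# Made-up data: 4 levels x 6 runs, with a deterministic +/-0.5 "noise" pattern
group = np.repeat(np.arange(4), 6)
y = 10.0 * (group + 1) + np.tile([-0.5, 0.5], 12)

# Shift three runs in one cell; the cell mean moves halfway toward them
shifted = np.where(group == 2)[0][:3]
y[shifted] += 8.0

resid, student, s = fit_cell_means(y, group)
flagged_raw = np.abs(resid) > 3.0
print("points with |raw residual| > 3:", int(flagged_raw.sum()))
print("largest |studentized residual|:", round(float(np.abs(student).max()), 2))
print("RMSE:", round(float(s), 2))

# Refit without the three shifted runs: a new model, new RMSE, new residuals
keep = np.ones(len(y), dtype=bool)
keep[shifted] = False
resid2, student2, s2 = fit_cell_means(y[keep], group[keep])
print("after refit, largest |raw residual|:", round(float(np.abs(resid2).max()), 2))
print("after refit, RMSE:", round(float(s2), 2))
```

Whether the same mechanism explains the plots in the question depends on how the residuals in each plot are defined for the mixed model, so treat this only as an illustration of how the scale used for studentizing is tied to the current fit.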

"All models are wrong, some are useful" G.E.P. Box
