maryam_nourmand
Level III

outlier detection

Hello,
If I want to calculate the residuals of a neural network's predictions in a classification problem in order to detect outliers, what formula should I use to calculate the residuals?
Does the software do that?

9 REPLIES
Victor_G
Super User

Re: outlier detection

Hi @maryam_nourmand,

 

If you use a model to predict continuous numerical values, some models enable you to directly save the residuals in your data table by clicking on the red triangle, choosing "Save Columns", and then "Residuals". For Neural Networks, you'll need some manual work: save the prediction formula in your table (red triangle, "Save Formulas" or "Save Fast Formulas"), and then create a new column "Residuals" with the formula: Actual value - Predicted value.
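If you prefer to compute this step outside JMP, here is a minimal sketch in Python (the actual and predicted arrays are hypothetical stand-ins for your exported response column and the saved prediction column):

```python
import numpy as np

# Hypothetical exported columns: actual responses and saved NN predictions
actual = np.array([10.2, 8.7, 9.5, 12.1])
predicted = np.array([9.8, 9.1, 9.4, 11.0])

# Residual = actual value - predicted value, one per observation
residuals = actual - predicted
print(residuals)
```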

Note that you already have an option to plot a "Residual by Predicted" graph in the Neural Network platform when clicking on the red triangle.

 

If you use a model to predict classes, you won't have residuals, but you do have information about misclassification and the calculated probabilities. You can then save these probability columns (same options as before) and try to understand where the errors are, for example by plotting the probabilities of the main predicted class and checking whether the classification probability threshold needs to be adjusted.

If you are interested in evaluating classification models, there are also various plots that help you assess the performance of your classifier: ROC curve, lift curve, confusion matrix, etc.
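If you want to run this kind of evaluation outside JMP as well, here is a minimal sketch using scikit-learn, assuming you have exported the actual classes and the saved probability column (the y_true and y_prob arrays below are hypothetical stand-ins):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, confusion_matrix

# Hypothetical binary labels and saved predicted probabilities for class 1
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.4, 0.8, 0.65, 0.3, 0.2, 0.9, 0.55])

# Turn probabilities into predicted classes with the default 0.5 threshold
y_pred = (y_prob > 0.5).astype(int)

print("AUC:", roc_auc_score(y_true, y_prob))
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
```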

 

Hope this answer helps you,

Victor GUILLER
Scientific Expertise Engineer
L'Oréal - Data & Analytics
maryam_nourmand
Level III

Re: outlier detection

But we can calculate residuals in logistic regression, and it's a kind of classification model too. I've attached a picture of the formula for calculating the deviance residual for that model. Now I want to know: when I use a neural network for a classification problem, how can I calculate residuals for that too?

Victor_G
Super User

Re: outlier detection

Indeed, you can calculate deviance in generalized linear models: Statistical Details for Model Selection and Deviance
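For reference, and assuming a binary response $y_i \in \{0, 1\}$ with fitted probability $\hat{p}_i$ (this is the standard textbook form, which may differ in notation from the formula in your attachment), the deviance residual is usually written as:

$$d_i = \operatorname{sign}(y_i - \hat{p}_i)\,\sqrt{-2\left[\,y_i \ln \hat{p}_i + (1 - y_i)\ln(1 - \hat{p}_i)\,\right]}$$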

For example, on the JMP dataset "Detergent", using "brand" as the categorical response Y, I have the option to directly calculate the deviance residuals and save them in the data table:

[Screenshot: Victor_G_0-1718621984984.png]

 

Deviance is an indication of goodness-of-fit for statistical models where model fitting is achieved by maximum likelihood, which may not apply to Machine Learning models that use different loss functions: Deviance (statistics) - Wikipedia

So it may not be possible to calculate it for some models that use different fitting functions.
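That said, one closely related quantity you can always compute for a neural network classifier is the per-observation log loss (cross-entropy), which for a binary response is just half the squared deviance residual shown above. A minimal sketch, with hypothetical y_true and y_prob arrays standing in for the actual classes and the saved probabilities:

```python
import numpy as np

# Hypothetical binary labels and predicted probabilities for class 1
y_true = np.array([0, 1, 1, 0])
y_prob = np.array([0.2, 0.9, 0.4, 0.7])

# Per-observation cross-entropy: large values flag points the model
# predicts badly, which is one way to screen for potential outliers
eps = 1e-15  # guard against log(0)
p = np.clip(y_prob, eps, 1 - eps)
log_loss_per_obs = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
print(log_loss_per_obs)
```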

Victor GUILLER
Scientific Expertise Engineer
L'Oréal - Data & Analytics
maryam_nourmand
Level III

Re: outlier detection

So what is the equivalent concept of a residual from statistical models in Machine Learning models? I think we have a probability for each class in Machine Learning models, so we should be able to calculate a residual from that too.

Victor_G
Super User

Re: outlier detection

Residual analysis is used in statistical models not only to evaluate prediction performance, but also to validate the assumptions behind their correct use: Regression Model Assumptions | Introduction to Statistics | JMP

 

In Machine Learning models, there are usually far fewer "strict" assumptions, since we don't assume that the true relationship is linear, that errors are normally distributed, etc.
The big common assumption behind the use of Machine Learning models remains that the data is iid: independent and identically distributed. The other assumptions are more related to the use of a specific algorithm and the way it learns from data. You can read more about this here:

  1. https://www.kdnuggets.com/2021/02/machine-learning-assumptions.html
  2. Inductive bias in Machine Learning series: https://mindfulmodeler.substack.com/p/from-theory-to-practice-inductive

 

Any classification model will calculate the probability that a data point belongs to each class. Based on a specified threshold (for example > 0.5 for binary classification), it will then output the predicted class corresponding to the highest class probability. Working with and visualizing the predicted probabilities may help you better define a threshold for the predicted class, and help you assess whether your model correctly differentiates the classes.
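As a small illustration of that last point, outside JMP you could scan candidate thresholds and watch how the misclassification rate changes (y_true and y_prob are again hypothetical stand-ins for the actual classes and the saved probability column):

```python
import numpy as np

# Hypothetical binary labels and predicted probabilities for class 1
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.4, 0.8, 0.65, 0.3, 0.2, 0.9, 0.55])

# Misclassification rate at a range of candidate thresholds
for threshold in np.arange(0.3, 0.8, 0.1):
    y_pred = (y_prob > threshold).astype(int)
    error_rate = np.mean(y_pred != y_true)
    print(f"threshold {threshold:.1f}: misclassification rate {error_rate:.2f}")
```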

I don't know whether there is a metric or a way to directly use these predicted probabilities to calculate some sort of residuals in JMP.

 

Victor GUILLER
Scientific Expertise Engineer
L'Oréal - Data & Analytics
maryam_nourmand
Level III

Re: outlier detection


If you find a way to directly use these predicted probabilities to calculate some sort of residuals in JMP, or a way outside JMP to use these predicted probabilities to calculate residuals, please share it in this post.
Thanks

dlehman1
Level V

Re: outlier detection

You could calculate the Brier score.  But perhaps you should say a bit more about what exactly you are trying to do.  There are many ways to measure and assess the accuracy of classification models.  I find the residuals to be among the least intuitive since classification models yield probabilities and each observation will either be classified correctly or not.  The more extreme the probability for incorrect classifications, the worse the model (and I suppose the residual measures that - but I think there are more intuitive ways to look at that).  So, what is it you want to do with the residuals?
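For what it's worth, if you do want a single number from the predicted probabilities, the Brier score is easy to compute by hand; here is a minimal sketch (y_true and y_prob are hypothetical stand-ins for the actual classes and the predicted probabilities; scikit-learn also provides this as sklearn.metrics.brier_score_loss):

```python
import numpy as np

# Hypothetical binary labels and predicted probabilities for class 1
y_true = np.array([0, 1, 1, 0, 1])
y_prob = np.array([0.2, 0.9, 0.6, 0.4, 0.3])

# Brier score = mean squared difference between predicted probability
# and the observed outcome (lower is better, 0 is perfect)
brier = np.mean((y_prob - y_true) ** 2)
print(brier)
```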

maryam_nourmand
Level III

Re: outlier detection

I want to check how well my model can predict the response.

dlehman1
Level V

Re: outlier detection

You've described your problem as a classification model.  JMP will calculate the probabilities of the discrete outcomes for each observation (preferably using validation and focusing on the validation data rather than the training data).  You can use these probabilities to calculate measures such as a Brier score or you can use various methods described above to measure the accuracy of your model.  I really don't understand what you mean by residuals or why you want to focus on these.  Suppose your classification model is binary:  predict response 0 or 1.  Your model will produce a probability of 1 - let's say that probability is 0.6.  Now, suppose that particular observation has an actual classification of 0 (or, alternatively, 1).  What would you call the "residual" for this observation?  If it is a correct prediction, say a 1, then the closer the probability is to 1, the "better" your model.  If it is actually a 0, then the closer the probability from the model is to 1, the "worse" your model.  The Brier score is an often used method to measure how good your model is (I don't believe JMP will calculate the Brier score - at least I haven't seen it - but it should be easy to calculate manually).  Alternatively, the AUC, misclassification rate, AIC, etc. are all measures JMP readily provides to measure how good your model is.
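To make that concrete with the 0.6 example above: if you treat the difference between the observed outcome and the predicted probability as a kind of residual, an actual 1 contributes (1 - 0.6)^2 = 0.16, while an actual 0 contributes (0 - 0.6)^2 = 0.36, so the confident-but-wrong prediction is penalized more heavily. The Brier score is simply the average of these squared contributions over all observations.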

 

But your focus on "residuals" is far from the most intuitive way to achieve your goal of determining "how well my model can predict the response." Residuals are a natural way to determine that when you have a continuous response variable, not a classification model, and JMP makes it easy to calculate the residuals for those types of models (either built in at the red triangle's Save Columns, or by just creating a column using ABS(actual - predicted)).