Detect outliers in a dataset with categorical variables
Sep 1, 2020 4:51 AM
I have a large data set of categorical survey data from more than 800 participants. There are 10 categorical variables with 4 nominal levels each. Is there a JMP method for detecting or flagging outlier participants?
I am not aware of any such procedures. I will leave it to other users with more experience and knowledge to identify any.
Ignorance won't stop me from brainstorming some approaches. An outlier is an unusual or unexpected result. (Note that it is not necessarily a wrong or contaminated result.) So you could try a variety of unsupervised and supervised learning methods that might isolate such cases.
Multiple correspondence analysis is designed to handle many categorical variables with many levels. The optimal scaling is based on the chi-square distance from the centroid. Outliers might separate from the other points in the plot.
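To make the chi-square distance idea concrete, here is a minimal sketch in Python rather than JMP. It assumes the usual MCA indicator-matrix coding (one 0/1 column per level of each variable) and computes each participant's squared chi-square distance from the centroid, i.e. the average profile. The function name `chi_square_distances` and the toy data are made up for illustration, and how large a distance counts as an outlier remains a judgment call.

```python
from collections import Counter


def chi_square_distances(rows):
    """Squared chi-square distance of each respondent's profile from the centroid.

    rows: list of equal-length tuples of category labels, one tuple per respondent.
    """
    n = len(rows)
    q = len(rows[0])  # number of categorical variables
    # Count how often each (variable index, level) pair was chosen.
    counts = Counter((j, level) for row in rows for j, level in enumerate(row))
    # Column masses of the indicator matrix: share of all n*q responses.
    mass = {key: c / (n * q) for key, c in counts.items()}
    distances = []
    for row in rows:
        chosen = set(enumerate(row))
        d2 = 0.0
        for key, m in mass.items():
            # Row profile: 1/q for each chosen level, 0 for every other level.
            profile = (1.0 / q) if key in chosen else 0.0
            d2 += (profile - m) ** 2 / m
        distances.append(d2)
    return distances


# Toy data (hypothetical): nine identical respondents and one unusual one.
rows = [("a", "x")] * 9 + [("b", "y")]
distances = chi_square_distances(rows)
most_unusual = max(range(len(rows)), key=lambda i: distances[i])
```

In this toy data the single participant who answers ("b", "y") gets by far the largest distance, because both of those levels are rare.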
Residual analysis with a multinomial generalized linear model might identify outliers.
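A full multinomial GLM is more than a short sketch can show, so the following swaps in the simplest related case: a log-linear independence model for two of the variables, whose fitted cell counts are (row total × column total) / n. The Pearson residual, (observed - expected) / sqrt(expected), then shows which answer combinations occur more or less often than that model expects. Note that this flags unusual combinations rather than individual participants. It is written in Python rather than JMP, and the function name and toy data are hypothetical.

```python
from collections import Counter
from math import sqrt


def pearson_residuals(pairs):
    """Pearson residuals per cell under an independence model for two variables.

    pairs: list of (level_of_variable_1, level_of_variable_2), one per respondent.
    """
    n = len(pairs)
    observed = Counter(pairs)
    row_totals = Counter(a for a, _ in pairs)
    col_totals = Counter(b for _, b in pairs)
    residuals = {}
    for a, row_total in row_totals.items():
        for b, col_total in col_totals.items():
            # Fitted count if the two variables were independent.
            expected = row_total * col_total / n
            # Counter returns 0 for combinations that were never observed.
            residuals[(a, b)] = (observed[(a, b)] - expected) / sqrt(expected)
    return residuals


# Toy data (hypothetical): "a" mostly goes with "x", and "b" mostly with "y".
pairs = ([("a", "x")] * 40 + [("b", "y")] * 40
         + [("a", "y")] * 10 + [("b", "x")] * 10)
residuals = pearson_residuals(pairs)
```

Cells with large negative residuals hold combinations that are rarer than independence predicts, so the participants in those cells are candidates for a closer look.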
Recursive partitioning can isolate such cases in a small node. You might have to lower the minimum node size so that small groups are allowed to split off.
What kind of analysis or modeling were you planning? The method might include the identification of outliers.
I will perform MCA on the data set. I noticed that a 2-dimensional MCA plot is quite sensitive to including or removing participants, so outlier detection and removal should be done carefully. After that, clusters will be identified with K-means, which is also quite sensitive to outliers. So before starting the analysis, outliers must be correctly detected and removed using the right statistical method and JMP tool.