Outliers Elimination
Hi JMP team,
I have a production data set in which I need to analyse 5 columns, each containing almost 5000 rows.
Almost 80% of the values were flagged as outliers, yet they are inside the specification limits.
Now I want to make a box plot comparison of these 5 columns to see the trend over the days.
In this case, do I need to eliminate all the outliers, or should I keep them for the comparison?
Re: Outliers Elimination
Why not do both and test out your options? You can create a new column in which you mark rows as outliers and use that column in the visualization. In some cases you might want to stack your data first (wide vs. tall format), so that you can exclude an outlier from a single measurement instead of excluding the whole row across all measurements.
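To illustrate the wide vs. tall idea outside of JMP, here is a minimal Python sketch. The column names `A` and `B`, the sample values, and the helper `iqr_fences` are all hypothetical, and the 1.5 × IQR rule is used only as an example flagging rule, not as a recommendation for this data set.

```python
from statistics import quantiles

def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical wide data: one row per part, one column per measurement.
wide = [
    {"row": 1, "A": 10.0, "B": 5.1},
    {"row": 2, "A": 10.2, "B": 5.0},
    {"row": 3, "A": 9.9, "B": 5.2},
    {"row": 4, "A": 10.1, "B": 9.0},  # only B looks unusual in this row
    {"row": 5, "A": 10.0, "B": 5.1},
]
columns = ["A", "B"]

# Stack to tall format: one record per (row, column) measurement.
tall = [
    {"row": r["row"], "column": c, "value": r[c]}
    for r in wide
    for c in columns
]

# Flag each measurement against the fences of its own column.
fences = {c: iqr_fences([r[c] for r in wide]) for c in columns}
for rec in tall:
    lower, upper = fences[rec["column"]]
    rec["outlier"] = not (lower <= rec["value"] <= upper)

flagged = [(rec["row"], rec["column"]) for rec in tall if rec["outlier"]]
print(flagged)
```

In the tall layout the flag belongs to a single measurement, so the `A` value of row 4 stays available even though the `B` value of the same row is marked.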
Re: Outliers Elimination
Hi @Statexplorer : Gotta admit, my first thought when reading your question was "How can 80% of your data be outliers?"
Can you shed some light on this? If you are using some sort of "outlier" test based on normality, and 80% are in the tails, then your data aren't normally distributed and your outlier test is not appropriate.
Sorry, I just can't get my head around the concept of "80% of my data are outliers". Outliers are "unusual" observations; how can 80% of the observations be "unusual"?
Re: Outliers Elimination
Hi @Statexplorer : The IQR method is based on normality of the data. If 80% of the data are at the extremes, then the data aren't normally distributed and those 80% are not "outliers". Conceptually, the majority of your data cannot be "unusual" (by the definition of unusual).
How, then, do you identify outliers in non-normal, heavy-tailed, or skewed distributions? This is a very complex question with no easy answer.
If you share some details about your data, perhaps a fit-for-purpose solution can be found.
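As a rough illustration of why a skewed distribution trips up the IQR rule, the Python sketch below applies the usual 1.5 × IQR fences to simulated data. The sample size, the seed, the lognormal shape, and the helper `fraction_flagged` are assumptions chosen for illustration; they are not taken from the production data discussed here. Note also that the fences always sit outside the quartiles, so this rule cannot flag more than about half of a sample.

```python
import random
from statistics import quantiles

def fraction_flagged(values, k=1.5):
    """Fraction of values outside the fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return sum(1 for v in values if v < lower or v > upper) / len(values)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Simulated samples (assumed shapes, for illustration only).
normal = [rng.gauss(0, 1) for _ in range(5000)]
skewed = [rng.lognormvariate(0, 1) for _ in range(5000)]

normal_frac = fraction_flagged(normal)
skewed_frac = fraction_flagged(skewed)

# For normal data the rule flags on the order of 1% of points;
# for the right-skewed sample it flags noticeably more, all in the long tail.
print(f"normal: {normal_frac:.3%}  skewed: {skewed_frac:.3%}")
```

The flagged points in the skewed sample are ordinary members of its long tail, which is the sense in which the test is not appropriate for such data.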
Re: Outliers Elimination
Hi @Statexplorer : And, back to your question: I would not remove any data, but don't use the boxplot to identify "outliers". That said, boxplots can be a good visual tool for your data: the median is shown, the IQR is shown, and so are the min and max. So in terms of trending, you can still visualize how the median and spread change over time.
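The quantities a boxplot displays can also be tabulated per day to follow the trend numerically. This is a small Python sketch with made-up values; the day labels, the numbers, and the helper `box_summary` are hypothetical, and no points are removed.

```python
from statistics import median, quantiles

def box_summary(values):
    """Return the numbers a boxplot shows: min, Q1, median, Q3, max, IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median(values),
        "q3": q3,
        "max": max(values),
        "iqr": q3 - q1,
    }

# Hypothetical measurements grouped by day (all data kept).
by_day = {
    "Day 1": [1, 2, 3, 4, 5],
    "Day 2": [2, 3, 4, 5, 16],
    "Day 3": [3, 5, 6, 8, 9],
}

summaries = {day: box_summary(values) for day, values in by_day.items()}

# One line per day: compare medians and spreads across days.
for day, s in summaries.items():
    print(day, "median:", s["median"], "IQR:", s["iqr"], "range:", (s["min"], s["max"]))
```

Because the median and IQR are driven by the middle of the data, the extreme value on the hypothetical Day 2 shows up in the max but does not distort the median or IQR trend.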
Re: Outliers Elimination
Hi @MRB3855, what method should I use to eliminate outliers when the data doesn't follow a normal distribution and is highly skewed? (In this case, most of the outliers are within the USL and LSL range.)
Re: Outliers Elimination
Hi @Statexplorer : It may be helpful for us to understand the context of your question; why are you trying to identify outliers?