Outliers Episode 2: Detecting outliers using quantile ranges

JerryFish · Dec 15, 2020 08:34 AM

Welcome to my second blog installment on Outliers. In this episode, I’ll look at using quantile ranges to detect outliers.

In Episode 1, we looked at describing and visually identifying outliers. You may recall that there are simple one-dimensional (or univariate) cases where we need to identify outliers. These are the easiest to identify, and quantile ranges are an excellent way to look for them.

For multidimensional outlier detection, you’ll have to wait for my next posts!

What Are Quantiles?

Quantiles are very easy to understand. Let’s say we have a series of 20 numbers. We can sort the numbers from lowest to highest. We can then divide the sorted points using quantiles, which are cut points in the sorted data below which a given fraction (X%) of the data falls.

Note that quantiles are generally expressed as a fraction (from 0 to 1). They correspond exactly to percentiles, which range from 0 to 100. I will use these interchangeably throughout this post.

Figure 1 shows the 20 numbers before and after sorting, and the 0.20 and 0.50 quantiles (or 20^th and 50^th percentiles).

Figure 1: Data from 20 point sample, random order and sorted

The cut point for the 50^th percentile is the median of the sample. (In this case, the median is numerically 32.5, the midpoint between 25 and 40.)

A few other interesting definitions include:

Quartiles, which divide the data into 25% groups. The first quartile represents the data points that fall in the lowest 25%, the second quartile points fall between 25% and 50%, and so forth.
Interquartile range, or IQR, which defines the range covered by the 2^nd and 3^rd quartiles (that is, the 75^th percentile minus the 25^th percentile).

Note that there are no assumptions made on the shape of the distribution when talking about quantiles.
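To make these definitions concrete, here is a minimal Python sketch. The 20 numbers are made up for illustration (they are not the data in Figure 1), and the `quantile` helper is a hypothetical function that uses simple linear interpolation between sorted values. Different tools use slightly different interpolation rules, so the results may not match JMP's to the last digit.

```python
# Hypothetical 20-point sample (NOT the data shown in Figure 1).
data = [12, 47, 3, 88, 25, 40, 61, 9, 33, 70,
        18, 55, 29, 76, 5, 42, 66, 21, 93, 37]


def quantile(values, p):
    """Return the p-quantile (0 <= p <= 1) of values.

    Sorts the data, then interpolates linearly between the two
    sorted values nearest to the requested cut point.
    """
    ordered = sorted(values)
    position = p * (len(ordered) - 1)      # fractional index into sorted data
    lower = int(position)                  # index just below the cut point
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


p20 = quantile(data, 0.20)     # 0.20 quantile = 20th percentile
median = quantile(data, 0.50)  # 0.50 quantile = 50th percentile
q1 = quantile(data, 0.25)      # upper edge of the first quartile
q3 = quantile(data, 0.75)      # upper edge of the third quartile
iqr = q3 - q1                  # interquartile range

print(f"20th percentile: {p20}")
print(f"median: {median}")
print(f"25th percentile: {q1}, 75th percentile: {q3}, IQR: {iqr}")
```

Nothing in the calculation assumes a particular distribution shape; it only sorts and counts.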

Where Are Quantile Calculations Found in JMP?

You can find quantiles listed in three places in JMP. The Distribution platform and Graph Builder both use the same calculations. Explore Outliers/Quantile Range Outliers offers more flexibility to the user in terms of outlier sensitivity. Let’s look at each.

Distribution platform

In Distribution, all continuous/numeric data histograms include a tabular quantiles summary. By default, JMP displays several quantiles of interest. Using the Distribution platform on the column labeled Original Data in Figure 1 gives the following output:

Figure 2: Distribution platform output for the 20 point sample

Several common quantiles are shown in the center table of Figure 2. You can change the quantiles that are displayed in this table by selecting the red hotspot and choosing Display Options, and then either Set Quantile Increment or Custom Quantiles.

Also displayed in Figure 2 is the Box & Whisker plot. Symbols shown in the plot are annotated in Figure 2.

The ends of the whiskers are of most interest in detecting outliers. Note that each whisker is measured from an end of the interquartile range (the 25^th or 75^th percentile) and ends at the last data point that lies within 1.5 times the IQR of that end. This is a common rule of thumb for outlier detection. Points falling outside of these whiskers may be worth further investigation.
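The whisker rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample values are invented, `quantile` and `box_plot_summary` are hypothetical helper names, and the percentile interpolation is a simple linear rule that may differ slightly from the one JMP uses.

```python
def quantile(values, p):
    """p-quantile by linear interpolation between sorted values."""
    ordered = sorted(values)
    position = p * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (position - lower) * (ordered[upper] - ordered[lower])


def box_plot_summary(values, k=1.5):
    """Return whisker ends and flagged points for the k*IQR rule."""
    q1 = quantile(values, 0.25)
    q3 = quantile(values, 0.75)
    reach = k * (q3 - q1)                  # maximum whisker length
    low_limit = q1 - reach
    high_limit = q3 + reach
    inside = [v for v in values if low_limit <= v <= high_limit]
    return {
        "q1": q1,
        "q3": q3,
        # Whiskers "shrink back" to the most extreme data points
        # that are still inside the limits.
        "low_whisker": min(inside),
        "high_whisker": max(inside),
        "outliers": sorted(v for v in values if v < low_limit or v > high_limit),
    }


# Hypothetical sample with one suspicious value.
sample = [2, 4, 4, 5, 6, 7, 8, 9, 10, 30]
summary = box_plot_summary(sample)
print(summary)
```

Because each whisker stops at a real data point, the two whiskers usually have different lengths, even though both are limited to the same maximum of 1.5 times the IQR.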

But in the data set shown in Figures 1 and 2, this technique detects no outliers. This is not uncommon, particularly with small sample sizes.

Figure 3 shows the Distribution platform output for the 1,000 point sample from my previous blog post. These samples are drawn from a random normal distribution, with mean of 0.0 and standard deviation of 1.0. I also include the point at X1=4, which was visually identified as an outlier in my previous post:

Figure 3: Distribution platform output for 1,000 point sample from normal distribution

With the larger sample size, some of the points now appear to be outliers (at least as defined by the ends of the whiskers). In this particular case, we know that the data come from a normally distributed population, so a handful of points just beyond the whiskers is expected: for normal data, roughly 0.7% of points (about 7 in 1,000) fall outside the 1.5*IQR limits by chance alone. A value as far out as 4 is rarer, even in a sample of 1,000 points. In a real-world case, the points beyond the whiskers might warrant additional investigation.

Making Box and Whisker Plots in Graph Builder

You can also generate box and whisker plots in Graph Builder. The entire Graph Builder setup is shown in Figure 4:

Figure 4: Graph Builder setup for Box & Whisker plot (1,000 point sample from normal distribution)

Note the icon for graph type above the graph (circled in red), which enables the Box & Whisker plot type.

As with the Distribution platform, the whiskers extend down from the 25^th percentile and up from the 75^th percentile. Their maximum length is 1.5*interquartile range, and each whisker ends at the last data point within that distance.

Explore Outliers

What if we want to choose other whisker limits?

As stated above, a whisker length of 1.5*IQR is a common practice for identifying outliers. I believe this probably comes from looking at large sample normal distributions. 1.5*IQR beyond the interquartile range can be shown to encompass 99.30% of the normal distribution (leaving 0.3488% of the data in each tail). This would identify a relatively rare event – if you have a normal distribution.
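The coverage figures quoted above can be checked with a short Python sketch using the standard library's `statistics.NormalDist`. The variable names are my own, and the calculation applies to an idealized standard normal distribution rather than any particular sample.

```python
from statistics import NormalDist

z = NormalDist(mu=0.0, sigma=1.0)   # standard normal distribution

q1 = z.inv_cdf(0.25)                # 25th percentile
q3 = z.inv_cdf(0.75)                # 75th percentile
iqr = q3 - q1

lower_limit = q1 - 1.5 * iqr        # farthest a lower whisker can reach
upper_limit = q3 + 1.5 * iqr        # farthest an upper whisker can reach

lower_tail = z.cdf(lower_limit)         # fraction below the lower limit
upper_tail = 1.0 - z.cdf(upper_limit)   # fraction above the upper limit
inside = 1.0 - lower_tail - upper_tail  # fraction between the limits

print(f"limits: {lower_limit:.4f} to {upper_limit:.4f}")
print(f"inside the limits: {100 * inside:.2f}%")
print(f"in each tail: {100 * upper_tail:.4f}%")
```

The limits land at roughly 2.7 standard deviations from the mean, which is why so little of a normal distribution falls beyond them.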

But what if you don’t have a normal distribution? What if the distribution looks something like this?

Figure 5: Skewed distribution

Figure 5 shows a skewed distribution of N=100 points, with five outliers beyond the upper whisker, per the 1.5*IQR method. But is the point at X1=3 really an outlier? Or perhaps there are other points concealed by the whisker that would warrant outlier investigation. How do we adjust this 1.5*IQR rule?

JMP allows you to do this under Analyze/Screening/Explore Outliers, and then choosing the Quantile Range Outliers option. Bringing up this option presents the input screen shown below for the 100 point sample summarized in Figure 5.

Figure 6: Quantile Range Outliers user interface and sample output

There are two main inputs for the Quantile Range Outliers panel:

Tail quantile (TQ). This fraction describes the lower cut point in the distribution that will be used in whisker calculations. (1-Tail Quantile) describes the upper cut point in the distribution. The difference between these cut points is the range used in subsequent calculations.
1. For the IQR calculations described previously, tail quantile would be 0.25, indicating that the whiskers would start at the 25^th and 75^th percentiles, and the range would be the interquartile range (or 75^th-25^th percentiles).
2. The default value for the tail quantile in the Quantile Range Outliers is 0.1, indicating that the whiskers would start at the 10^th and 90^th percentiles, and the length depends on the (90^th-10^th percentile) range.
Q. This is the multiplier used to determine the length of the whiskers.
1. In the previous IQR calculations this was 1.5, so the interquartile range was multiplied by 1.5 to get the whisker length.
2. The default value for Q in the Quantile Range Outliers is 3, indicating that the whisker length will be defined by the (90^th-10^th percentile) range multiplied by 3.

The output of the Quantile Range Outliers is shown at the bottom of the panel in Figure 6. Here we see that Column X1 had a 10^th percentile at -1.283, and a 90^th percentile at 1.65556. Ends of the whiskers (noted as the low and high thresholds [LT and HT, respectively] in the output) are based on the following calculations:

LT = 10^th percentile – (90^th percentile – 10^th percentile) * Q

= -1.283 – (1.65556 – (-1.283)) * 3

= -10.099

HT = 90^th percentile + (90^th percentile – 10^th percentile) * Q

= 1.65556 + (1.65556 – (-1.283)) * 3

= 10.4713

In this case, there is only one value in the column that exists beyond the high threshold, at a value of 12.0, which is displayed at the bottom of the report shown in Figure 6.
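The threshold arithmetic above can be reproduced with a small Python sketch. The function name `quantile_range_thresholds` is hypothetical (it is not a JMP function), and the list of values to screen is invented apart from the 12.0 mentioned above. The percentile inputs are the rounded values shown in the report, so the last digit of the result can differ slightly from what JMP displays.

```python
def quantile_range_thresholds(low_cut, high_cut, q):
    """Return (LT, HT) given the lower and upper cut points and Q.

    low_cut  : value at the tail quantile (e.g. the 10th percentile)
    high_cut : value at 1 - tail quantile (e.g. the 90th percentile)
    q        : multiplier applied to the range between the cut points
    """
    spread = high_cut - low_cut
    return low_cut - q * spread, high_cut + q * spread


# Rounded 10th and 90th percentiles quoted from the report in Figure 6.
lt, ht = quantile_range_thresholds(-1.283, 1.65556, q=3)
print(f"LT = {lt:.3f}, HT = {ht:.3f}")

# Hypothetical values to screen; only the 12.0 comes from the post.
values = [-1.5, 0.2, 3.0, 12.0]
flagged = [v for v in values if v < lt or v > ht]
print("flagged:", flagged)
```

Setting the cut points to the 25^th and 75^th percentiles with Q=1.5 gives the limits of the familiar box plot rule instead.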

Sensitivity to tail quantile and Q

The outlier detection sensitivity is governed by the values of tail quantile and Q. The traditional 1.5*IQR and the 3*(90^th-10^th percentile) methods are both acceptable, with the former being much more sensitive (it flags more points as outliers). You can use the Quantile Range Outliers platform to adjust these values as needed for your particular case.

While the effectiveness of the choice of tail quantile and Q will depend on your particular distribution, we can gain some insight into their behavior by looking at normal distributions. Below is a table showing several combinations of tail quantile and Q, along with the percent of values in the tails of a normal distribution that would fall outside of the whiskers:

Table 1: Percent of outliers detected in large sample normal distribution for various values of tail quantile and Q

Expanding on this idea, we can create a contour map showing a more complete picture of the quantities in Table 1:

Figure 7: Percent of outliers detected in large sample normal distribution for various values of Tail Quantile and Q

Figure 7 plainly shows that the combination of high values of tail quantile and low values of Q makes the calculations more sensitive to outliers, while low values of tail quantile and high values of Q make them less sensitive.
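The same trend can be explored with a short Python sketch that computes, for an idealized standard normal distribution, the percent of values falling beyond the thresholds for a few combinations of tail quantile and Q. The function name and the particular combinations are my own choices for illustration, and are not necessarily the ones used in Table 1 or Figure 7.

```python
from statistics import NormalDist


def normal_percent_outside(tail_quantile, q):
    """Percent of a standard normal distribution beyond LT and HT."""
    z = NormalDist()
    low_cut = z.inv_cdf(tail_quantile)
    high_cut = z.inv_cdf(1.0 - tail_quantile)
    spread = high_cut - low_cut
    lt = low_cut - q * spread
    ht = high_cut + q * spread
    # Far-tail probabilities can be too small for floating point
    # and may print as zero.
    return 100.0 * (z.cdf(lt) + (1.0 - z.cdf(ht)))


for tq in (0.05, 0.10, 0.25):
    for q in (1.0, 1.5, 3.0):
        pct = normal_percent_outside(tq, q)
        print(f"tail quantile = {tq:.2f}, Q = {q:.1f}: {pct:.6f}% outside")
```

For tail quantile 0.25 and Q=1.5, this reproduces the roughly 0.7% quoted earlier for the traditional box plot rule.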

Next Episode

Next time we will discuss Mahalanobis distance, which is used to detect outliers in multiple dimensions.

See all posts in this series on understanding outliers.

JerryFish · ‎12-22-2020

My friend and colleague @XanGregg pointed out that the whiskers always end on a data point. So the maximum length is 1.5 times the IQR, but they "shrink back" to the next real data point. I have corrected this in the post. Thanks Xan!

JK100 · ‎01-10-2021

Nice post. The 1.5 x IQR is sometimes referred to as the "step", and outliers can be identified as the upper quartile plus one step or the lower quartile minus one step; a more stringent method may be to use "2 steps" +/- the respective upper/lower quartiles to identify "far outliers".

However when the distribution is skewed the second method you state extends the far outlier definition (using 2 steps) by dealing with the 90 and 10 points rather than the 75 and 25 percentile points.

yves_gueniffey · ‎01-28-2021

Thanks: if you miss Xan Gregg's point, you can’t explain to students why the two whiskers are not always equal.

Jhaggapong · ‎02-17-2021

thanks for sharing

Madwolf · ‎03-01-2021

Hi there,

Thanks for this article. It is extremely insightful for me.

You mentioned: "The traditional 1.5*IQR and the 3*(90th-10th quantile) methods are both acceptable"

How do I know whether I should choose the traditional 1.5*IQR or 3*(90th-10th quantile), which is the default for JMP?

In addition, do I continue scanning and exclude the outliers, until the final scan shows that I have no more outliers existing?

JerryFish · ‎03-02-2021

Hi @Madwolf , thanks for the post.

You will find that the Distribution platform uses the 1.5*IQR method, while Analyze/Screening/Explore Outliers defaults to the 3*(90th-10th quantile) method. Both are legitimate ways of helping to identify outliers.

Remember that tools like these are meant to identify "potential" outliers. As such, they are just assistants. You can try both methods and see if the results are different. If outliers are identified (in either method), I would caution that they should still be investigated, and not simply immediately discarded. The data might be trying to tell you something important! So be judicious when discarding outliers.

In my opinion, either of these methods will catch "really bad" outliers. If you read my 3rd blog post on the Mahalanobis Distance, and my (upcoming) 4th post on jackknife distances, you'll find more ways to help identify outliers in trickier situations than those reviewed in this post.

Good luck!

Madwolf · ‎03-04-2021

Hi @JerryFish , thank you for the reply.

You are right in saying that I should not just discard them entirely, and should really investigate what these outliers are telling me.

I guess I am too concerned with making sure that my K-Means Clustering does not have negative CCC values, and thus eager to remove as many outliers as possible.

I understand your points above, and they are really useful. Thank you!