JerryFish
Staff
Outliers Episode 1: The elusive outlier described, visually identified, and judged

A quick quiz: What are outliers?

  1. Odd data points that could influence modeling results and need to be addressed.
  2. Odd data points that warrant further investigation, possibly providing new insights into your process.
  3. A really good book by Malcolm Gladwell. (Outliers, 2008)

The answer, of course, is “All of the above.”

If you are in the business of looking at data, you have run across “outliers” from time to time: numeric readings that don’t fall within the normal pattern of the data. Such points can distort, for example, our least squares fits of the data. Hence, we need to identify them quickly and deal with them appropriately.

This is the first in a blog series that looks at outliers. Today, we will look at visually identifying outliers. In the next episode, we will examine whether outliers are generally good or bad; the last episodes will examine different means and algorithms used to detect outliers.

Examples of Outliers - Visual Identification

One of the easiest ways to identify outliers is visually. Hopefully, you are already plotting your data to look for trends, etc. Depending on how you plot your data, outliers will often be obvious. The examples below help to describe outliers using some simple plots.

Example 1: One-dimensional data

Figure 1 shows a dot plot of 1,000 samples pulled from a normally distributed population (mean=0, standard deviation=1).

Figure 1: An example with one-dimensional data.

The point at X1=4 is highlighted in red. In this case, it is 4 standard deviations from the mean. For a normal distribution (like the one shown above), this is a very unlikely occurrence. In fact, we would expect to draw a value of 4 or more from a normal distribution with mean 0 and standard deviation 1 only about 0.003% of the time. This is definitely an unusual observation.
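How rare a 4-standard-deviation draw is can be checked with a few lines of Python. The sketch below is not part of the original analysis (which was done in JMP); it uses only the standard library, and the helper name `upper_tail` is made up for illustration.

```python
import math

def upper_tail(z: float) -> float:
    """P(Z >= z) for a standard normal variable Z (mean 0, SD 1)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

one_sided = upper_tail(4.0)           # chance of a draw at or above +4
two_sided = 2.0 * one_sided           # chance of a draw at least 4 SDs from the mean, either side
expected_in_1000 = 1000 * one_sided   # expected number of such draws among 1,000 samples

print(f"one-sided: {one_sided:.2e}")
print(f"two-sided: {two_sided:.2e}")
print(f"expected in 1,000 samples: {expected_in_1000:.3f}")
```

Both probabilities fall below 0.01%, and the expected count in a sample of 1,000 is far less than one point, which is why the red point stands out.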

Example 2: Two-dimensional data, independent and normally distributed variables

In the example shown below, we have two input variables, X1 and X2:

Figure 2: A 2D example, with independent and normally distributed variables.

In Figure 2, we have two normally distributed and independent variables, X1 and X2. The histogram next to each variable suggests that it is approximately normal. Note the point marked in red at (4,4). Again, we have an outlier that is visually obvious, since it is located far from the means of both X1 and X2.

Example 3: Two-dimensional data, normal and independent variables, different means and standard deviations

Let’s go a little further with the two-variable example. This time, we have different means and standard deviations.

Figure 3: A 2D example, with different means and standard deviations.

The point at approximately (1,27) might be considered an outlier. It is inside the X1 distribution, but clearly outside the X3 distribution.

But what about the red point at (4,5)? Is it an outlier? Our eyes tell us that it might be, but how can we tell? How do we assess it algorithmically? (For the answer to this, you'll have to wait for subsequent blog posts!)

Example 4: Two-dimensional data with correlation between the variables

Below is yet another two-dimensional example, this time with correlation between the two variables:

Figure 4: A 2D example, with correlation between the two variables.

The red point at (-3,6) in Figure 4 is obvious to the eye. But looking at either variable on its own would suggest that this point is well behaved in both the X1 dimension and the X4 dimension. (This is an excellent example of why it is important to plot your data!)
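A small simulation can illustrate why one-variable-at-a-time checks miss a point like this. The sketch below is only an illustration: the means, standard deviations, slope, and seed are assumptions, not the values behind Figure 4, and the residual from a least squares line is a stand-in measure chosen here, not necessarily the method the later posts in this series use.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical correlated data (parameters are assumptions, not the values behind Figure 4).
n = 1000
x = [random.gauss(0.0, 2.0) for _ in range(n)]
y = [1.5 * xi + random.gauss(0.0, 1.0) for xi in x]

# Append the suspicious point, which goes against the trend.
x.append(-3.0)
y.append(6.0)

def z_score(value, data):
    """Number of standard deviations between value and the mean of data."""
    return (value - statistics.mean(data)) / statistics.stdev(data)

# One variable at a time, the point looks ordinary.
zx = z_score(-3.0, x)
zy = z_score(6.0, y)

# Least squares line of y on x, then the point's residual in residual-SD units.
mx, my = statistics.mean(x), statistics.mean(y)
slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
intercept = my - slope * mx
residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
z_resid = residuals[-1] / statistics.stdev(residuals)

print(f"z-score in X1: {zx:.2f}")
print(f"z-score in X4: {zy:.2f}")
print(f"standardized residual from the trend line: {z_resid:.2f}")
```

Under these assumptions, both single-variable z-scores stay well inside 3 standard deviations, while the distance from the trend line is many residual standard deviations. That is the numerical counterpart of what the scatterplot shows at a glance.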

Example 5: A trickier two-dimensional outlier

Another odd outlier in two dimensions is shown in the figure below:

Figure 5: A trickier two-dimensional outlier.

Again, it is obvious to the eye, but more difficult to detect algorithmically. Still, there is a way! (See future blog posts.)

Example 6: More than two dimensions

What if we have more than two dimensions? Things become more difficult to visualize. For three dimensions, you can try a 3D scatterplot and rotate it around to identify outliers. For more than three dimensions, you can try to make a series of scatterplots to cover all pairs of variables (such as in the Analyze/Multivariate platform), but remember, the outliers may be difficult to spot.
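To get a feel for how quickly the pairwise approach grows, the short Python sketch below counts the scatterplots needed to cover all pairs of variables. It is independent of JMP; the variable names and the helper `variable_pairs` are made up for illustration.

```python
from itertools import combinations

def variable_pairs(names):
    """All unordered pairs of variables, one per scatterplot panel."""
    return list(combinations(names, 2))

for k in (3, 5, 10, 20):
    names = [f"X{i}" for i in range(1, k + 1)]
    print(k, "variables ->", len(variable_pairs(names)), "pairwise scatterplots")
```

The count is k(k-1)/2, so 10 variables already require 45 panels. Scanning each one by eye becomes tedious, and an outlier that only shows up in three or more dimensions at once may not be visible in any single panel.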

As dimensionality increases, we start to turn to algorithms specifically developed to help identify outliers, which will be covered in future posts.

Future Blog Episodes

In upcoming posts, we will discuss the various outlier algorithms available in JMP, including how they work on the above examples (and more) and when to use them.

See all posts in this series on understanding outliers.

Last Modified: Jan 26, 2021 4:29 PM
Comments
P_Bartell
Level VIII

Great blog series Jerry!

 

In my experience the most frequent cause of 'outliers' was data entry error. Somebody slipped a decimal point, or inverted two numbers, missed hitting 'return' for the next data point entry in the column, or just plain missed hitting the key on the keyboard during hand entry of the data. Always check the source data first if suspicious looking data appears graphically!

 

This blog series reinforces the best practice of ALWAYS plotting your data every which way you can think of in JMP BEFORE numerical analytics work.

Sal
Level II

Very interesting topic, thanks for sharing!

 

To add to the previous comment, what we always try to do is apply "PGA":

  • Practical: where does the data come from, how was it collected, what period, etc.
  • Graphical: always plot your data first and try to see if it makes sense, if you see relations, "outliers", etc.
  • Analytical: only then start doing numerical/statistical analysis ...

Looking forward to the coming blogs!