A quick quiz: What are outliers?
- Odd data points that could influence modeling results and need to be addressed.
- Odd data points that warrant further investigation, possibly providing new insights into your process.
- A really good book by Malcolm Gladwell. (Outliers, 2008)
The answer, of course, is “All of the above.”
If you are in the business of looking at data, you have run across “outliers” from time to time: numeric readings that don’t fall within the normal pattern of the data. Left unaddressed, they can distort analyses such as least squares fits. Hence, we need to identify them quickly and deal with them appropriately.
This is the first in a blog series that looks at outliers. Today, we will look at visually identifying outliers. In the next episode, we will consider whether outliers are generally good or bad; the remaining episodes will examine the different methods and algorithms used to detect them.
Examples of Outliers - Visual Identification
One of the easiest ways to identify outliers is visually. Hopefully, you are already plotting your data to look for trends and other patterns. Depending on how you plot your data, outliers will often be obvious. The examples below illustrate outliers using some simple plots.
Example 1: One-dimensional data
Figure 1 shows a dot plot of 1,000 samples pulled from a normally distributed population (mean=0, standard deviation=1).
Figure 1: An example with one-dimensional data.
The point at X1=4 is highlighted in red. In this case, it is 4 standard deviations from the mean. For a normal distribution (like the one shown above), this is a very unlikely occurrence. In fact, we would expect a draw from a normal distribution with mean 0 and standard deviation 1 to land 4 or more standard deviations from the mean only about 0.01% of the time. This is definitely an unusual observation.
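If you would like to reproduce something like Figure 1 outside of JMP, here is a minimal Python sketch (my own illustration, not the post's code; it assumes NumPy and SciPy, and the random seed is arbitrary) that draws 1,000 standard normal samples and checks just how rare a point 4 standard deviations from the mean really is:

```python
# Illustrative sketch only (assumes NumPy and SciPy; seed chosen arbitrarily).
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
x1 = rng.normal(loc=0, scale=1, size=1000)   # 1,000 draws with mean 0, std dev 1

# Two-sided tail probability of landing 4 or more standard deviations from the mean
p_tail = 2 * stats.norm.sf(4)
print(f"P(|Z| >= 4) = {p_tail:.4%}")         # about 0.006%, on the order of the 0.01% quoted above

# Flag any simulated values that far from the mean
print("Points beyond 4 sigma:", x1[np.abs(x1) >= 4])
```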
Example 2: Two-dimensional data, independent and normally distributed variables
In the example shown below, we have two input variables, X1 and X2:
Figure 2: A 2D example, with independent and normally distributed variables.
In Figure 2, we have two normally distributed and independent variables, X1 and X2. The histograms along each axis can be used to assess normality. Note the point marked in red at (4,4). Again, we have an outlier that is visually obvious, since it lies far from the means of both X1 and X2.
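For readers who want to play along outside of JMP, the sketch below shows one way to build this kind of view with Matplotlib: a scatterplot of two independent standard normal variables with marginal histograms, plus a red point at (4, 4). The data, seed, and the added point are my own illustration, not the post's actual dataset.

```python
# Illustrative sketch only (assumes NumPy and Matplotlib; simulated data).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=2)
x1 = rng.normal(0, 1, 1000)
x2 = rng.normal(0, 1, 1000)

fig = plt.figure(figsize=(6, 6))
grid = fig.add_gridspec(2, 2, width_ratios=(4, 1), height_ratios=(1, 4),
                        wspace=0.05, hspace=0.05)
ax_scatter = fig.add_subplot(grid[1, 0])
ax_histx = fig.add_subplot(grid[0, 0], sharex=ax_scatter)
ax_histy = fig.add_subplot(grid[1, 1], sharey=ax_scatter)

ax_scatter.scatter(x1, x2, s=10, alpha=0.5)
ax_scatter.scatter([4], [4], color="red")              # a point far from both means
ax_histx.hist(x1, bins=30)                             # marginal histogram for X1
ax_histy.hist(x2, bins=30, orientation="horizontal")   # marginal histogram for X2
ax_scatter.set_xlabel("X1")
ax_scatter.set_ylabel("X2")
plt.show()
```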
Example 3: Two-dimensional data, normal and independent variables, different means and standard deviations
Let’s go a little further with the two-variable example. This time, the two variables, X1 and X3, have different means and standard deviations.
Figure 3: A 2D example, with different means and standard deviations.
The point at approximately (1,27) might be considered an outlier. It is inside the X1 distribution, but clearly outside the X3 distribution.
But what about the red point at (4,5)? Is it an outlier? Our eyes tell us that it might be, but how can we tell? How do we assess it algorithmically? (For the answer to this, you'll have to wait for subsequent blog posts!)
Example 4: Two-dimensional data with correlation between the variables
Below is yet another two-dimensional example, this time with correlation between the two variables:
Figure 4: A 2D example, with correlation between the two variables.
The red point at (-3,6) in Figure 4 is quite obvious to the eye. But looking at either variable on its own would suggest that this point is well behaved in both the X1 dimension and the X4 dimension. (This is an excellent example of why it is important to plot your data!)
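To see this effect concretely, here is a small simulation of my own (the correlation, seed, and flagged point are made up and are not Figure 4's actual parameters): the red point sits comfortably within the range of each variable on its own, yet falls far off the joint trend, so only the scatterplot reveals it.

```python
# Illustrative sketch only (assumes NumPy and Matplotlib; made-up parameters).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=4)
cov = [[1.0, 0.9],        # strong positive correlation between the two variables
       [0.9, 1.0]]
xy = rng.multivariate_normal([0, 0], cov, size=1000)

outlier = (-2.0, 2.0)     # unremarkable in each marginal, but against the joint trend

fig, ax = plt.subplots()
ax.scatter(xy[:, 0], xy[:, 1], s=10, alpha=0.5)
ax.scatter(*outlier, color="red")
ax.set_xlabel("X1")
ax.set_ylabel("X4")
plt.show()
```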
Example 5: A trickier two-dimensional outlier
Another odd outlier in two dimensions is shown in the figure below:
Figure 5: A trickier two-dimensional outlier.
Again, it is obvious to the eye, but more difficult to detect algorithmically. Still, there is a way! (See future blog posts.)
Example 6: More than two dimensions
What if we have more than two dimensions? Things become more difficult to visualize. For three dimensions, you can try a 3D scatterplot and rotate it around to identify outliers. For more than three dimensions, you can try to make a series of scatterplots to cover all pairs of variables (such as in the Analyze/Multivariate platform), but remember, the outliers may be difficult to spot.
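Outside of JMP's Analyze/Multivariate platform, one quick way to get that all-pairs view is a scatterplot matrix. Here is a brief pandas sketch (the data and column names are placeholders I made up for illustration):

```python
# Illustrative sketch only (placeholder data; assumes NumPy, pandas and Matplotlib).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=5)
data = pd.DataFrame(rng.normal(size=(1000, 4)), columns=["X1", "X2", "X3", "X4"])

# One scatterplot for every pair of variables, with histograms on the diagonal
pd.plotting.scatter_matrix(data, figsize=(8, 8), alpha=0.5, diagonal="hist")
plt.show()
```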
As dimensionality increases, we start to turn to algorithms specifically developed to help identify outliers, which will be covered in future posts.
Future Blog Episodes
In upcoming posts, we will discuss the various outlier algorithms available in JMP, including how they work on the above examples (and more) and when to use them.
See all posts in this series on understanding outliers.