In an earlier blog post, we looked at cleaning up dirty data in categories. This time, we look at cleaning dirty data in the form of outliers for continuous columns.
In industry, it’s not unusual to have most of your values in a narrow range (for example between .1 and .7) with a few values well outside that range (such as 99, which is the missing value code). Some of the most common examples of outliers include:
• Missing values codes, often a string of 9s.
• Obviously miscoded, manually entered values, especially when the decimal point is misplaced.
• Instrument failures, which result in an instrument failure code inserted where the data should be.
• Unusual events, such as an electrical short causing a current surge or a voltage dropout.
• Long-tailed distributions, such as time-to-failure distributions, where large values may be rare, but real.
• Contamination outliers, in which a few items have crept into a collection of items that was presumed to have been clean.
If we had only a few columns to explore and clean up, it would be reasonable to do it by hand. However, with dozens, or thousands of columns, you need some help. The “Explore Outliers” facility in Cols->Modeling Utilities is the new tool to use.
The most straightforward approach is to get some tail quantiles and then measure some scale of the interquantile range beyond the tail quantiles. The ANSUR Anthopometry data provides a good illustration; it measures 131 body measurements across 3,982 individuals. Before analyzing this data, you need to first check for and be aware of outliers.
For each selected variable, Explore Outliers first calculates the 10% and 90% quantiles to determine a good estimate for the range of most of the data. If there is a huge number of ties, meaning that the 10% and 90% quantiles are the same, the utility reaches further into the tail to obtain a value. It then multiplies the interquantile range by 3 (the default ‘Q’ scaling factor) and looks for any values that are further out from the 10% and 90% quantiles than three times the interquantile range. This is a pretty extreme default outlier selection criterion, but you are probably most interested in the most remote outliers, at least at first. You can adjust the quantiles and multiplier later to be more sensitive.
The default scan of the ANSUR data revealed two columns with a value of -999, which is more than three interquantile ranges past the 10% and 90% tails. In this case, it is obvious that they are missing-value codes, not real data.
Since the other 129 variables did not show outliers, we can more easily focus on the columns that did by checking “Show only columns with outliers”. Selecting these columns in the display will select the columns in the data table.
Next, we need to determine which action to take to address these outliers. There are six action buttons. You can select the rows, and use the Rows command “Next Selected” to advance the data table view to each selected row. You can color the rows to emphasize that the row had an outlier. You can color the cells to make them stand out as you browse the data table, shown below. You can also exclude the rows to hold them back from an analysis.
In this case, since the outliers are really representing missing data, the two remaining actions present the obvious choices: Either click “Add to missing value codes”, which installs a “missing value code” column property for that column, or click “Change to missing”, which changes it to a hardware (NAN) missing value that JMP always recognizes as missing. You might want to use the former button if there are multiple missing value codes that have distinct meanings.
We also check for high-nines, since these are often used as missing value codes, but all the high-nines are not far from the upper tail of the data, and are thus not likely to represent missing values.
Sometimes outliers are outlying in a multivariate sense, meaning that they may be nicely in the quantile range of each column, but the data is in clusters, and some points are far away from other points. Using Fisher’s Iris data as an example, we see that there are really three clusters, representing the three species. Because the species are widely separated, there is plenty of near-empty space where an outlier can be distant from all the other points, even though not a univariate outlier. We have the k-nearest-neighbor facility to find these outliers. For each point, it finds the distance of that point to the nearest point, the second-nearest point, the third-nearest point, and so on. Here, we identify a point that is far away from the nearest point.
If we look at this point in a Scatterplot Matrix, we see that it is not unusual in any of the single coordinates, but it is far away from other points. It is far away in Sepal width when considering its location with respect to Petal length and Petal width. We happen to know the identities of these clusters, and the unusual feature is that it has a very low Sepal width compared to others in its species.
To continue, we might have multiple outliers, i.e., those points that are not far from their nearest neighbors but are still far from the data. Here is the distance of each point from its third near-neighbor.
If you look at the Scatterplot Matrix of the two blue k=3 outliers, you see that they are near each other, but separate from other points, focusing on the Sepal width by Sepal length scatterplot.
Thus, multivariate outliers can be explored and considered to discern what is happening. Outliers in the k-nearest sense can happen in contamination situations, for example, if an iris from a totally separate species was mixed in and misidentified in the Fisher Iris data.
What to do about outliers depends upon the situation, though some actions are clear. If you have missing value codes, you should add a missing value code property so that the analysis platforms will treat it as missing. If you have a misplaced decimal point, you should correct it if it is a clear case, or change it to a missing value if it is not clear.
Sometimes finding outliers can lead to important discoveries. They shouldn’t automatically be treated like dirt that doesn’t fit.