Subscribe Bookmark



May 27, 2014

Cleaning categories at scale with Recode

Data entered manually is usually not clean and consistent. Even when data is entered by multiple-choice fields rather than by text-entry fields, it might need additional work when it is combined with data that may not use the same categories across sources. Sometimes the same categories are spelled differently, abbreviated differently, or capitalized differently, or just miskeyed, resulting in many more categories and an invalid analysis when they are used in a model.

For example, consider the customer field in the “Cylinder Bands” data that can be downloaded from the UCI Machine Learning laboratory (originally from Bob Evans, RR Donnelly & Sons).

When you import this data, you will find that there are 83 unique values of "Customer". There are a lot fewer actual customers than that — it is just that there are multiple codings of the same customer. For example, the customer Abbey Press is coded in three ways: “ABBEY”, “ABBEYPRESS”, and “ABBYPRESS”.


In the early versions of JMP, we used tricks to recode the values. For example, we might copy “ABBEYPRESS” into the clipboard, then use a histogram to select rows corresponding to “ABBEY”, select the column “Customer” and then paste — which would change each “ABBEY” into “ABBEYPRESS”.

Once the Recode command became available, and we could just copy “ABBEYPRESS” and paste it into the other recode Values for “ABBEY” and “ABBYPRESS” in the Recode dialog.

All this works if you don’t have too many categories, but what if you have hundreds of categories? It can be very laborious to perform all those copies and pastes.

There are two very important ways that recoding is much better in the new Recode facility in JMP 12.

First, you can do a whole group just by selecting all the recode values and right-clicking to pick the chosen category. Now they appear as a group, all together, even if the original values were separated. No more copy and paste; it’s simply select and choose, with the results forming a visible group.



But the really amazing feature in the new Recode is the one that allows you to automatically find groups of similar values.


In this example, the feature automatically formed 16 groups. You can then check to see if the resulting groups have all the categories you want, but not too many.

How does it determine which values to combine? It looks at each pair of category labels and determines the edit distance between the two character strings. Also called the Levenshtein distance, it is the minimum number of edit operations to convert from one string to the other. When you run the “Group Similar Values” command, it brings up a dialog to choose which kind of changes to consider when calculating this distance and the criterion to use to call it a match. The default choices for this dialog usually work well.

Some of the groups were done perfectly, such as for “ECKERD”:


Because “ABBEY” is so different from “ABBEYPRESS”, it doesn’t satisfy the criterion to combine them, so you must select and right-click "Group" to do that.


Similarly, it didn’t catch the abbreviation of “CAS” for “CASUAL” in these two categories:


And it only got five of the six “HANOVERHOUSES”:




While you will need to check the results, the automated feature gets most of them.

After recoding, we have reduced the number of categories from 83 to 56.

Closer inspection reveals that other recodings are needed; in fact, the entries after row 503 seem to have switched from uppercase to lowercase. By control-clicking to the menu item “Convert to lowercase”, we can fix them all.



As you move into larger data tables that need cleaning up, it can be very helpful to have some automated features like “Group Similar Values”, and an improved user interface flow, such as selecting and grouping instead of copying and doing multiple pastes.

(The new Recode features, introduced in an earlier blog post, were implemented by James Preiss.)

Note: This is part of a Big Statistics series of blog posts by John Sall. Read all of his Big Statistics posts.

1 Comment
Community Member

Emil Friedman wrote:

This will make life a lot easier. Thanks.

A similar issue occurs when concatenating several files (Tables->Concatenate) wherein column names are almost the same. I wind up writing column formulae that look for alternate versions of the same metric and return whichever one is not missing. It's a lot safer than copy/paste and gives me a documentation trail. I wonder if that sort of thing could be built into a future version.

Article Tags