Coming in JMP 12: Overhauled Recode for easier data cleaning
Jan 21, 2015 10:14 AM
Text data cleaning is an unglamorous but important step in statistics and analytics. Manually entered data is full of misspellings, typographical errors and inconsistencies. Even machine-generated data can cause problems if two data sources disagree on formatting. These errors must be fixed before analysis, because even a one-character difference makes two pieces of text look like distinct values to the computer. A 2014 New York Times article estimated that data scientists spend between 50 and 80 percent of their time cleaning data.
We recognize that data cleaning is a critical step, so we are adding several new features in JMP 12 to make it easier. The overhauled Recode command in JMP is one of them. We expanded Recode with automatic data cleaning algorithms and an improved user interface for manual cleaning tasks.
Recode's new "Group Similar Values" command automatically corrects small data entry errors. Typos and inconsistent spellings show up as a small number of missing, extra or incorrect characters in otherwise identical text. For example, the misspelled "rhythim" can be corrected to "rhythm" by removing the 'i' character. Group Similar Values identifies such errors in the data table and groups them, outputting its best guess at the correct spelling for the group. It offers manual control over the difference threshold, as well as whitespace, punctuation and case sensitivity.
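To make the idea concrete, here is a minimal Python sketch of grouping values by edit distance. It is not JMP's actual algorithm, which is not described here. The function names (`edit_distance`, `normalize`, `group_similar`), the `max_distance` threshold and the rule of using the most frequent spelling as the group's best guess are all assumptions made for illustration.

```python
from collections import Counter
import string


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the number of single-character insertions,
    deletions or substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalize(text, ignore_case=True, ignore_whitespace=True,
              ignore_punctuation=True):
    """Strip the differences the caller has chosen to ignore (hypothetical options)."""
    if ignore_case:
        text = text.lower()
    if ignore_punctuation:
        text = text.translate(str.maketrans("", "", string.punctuation))
    if ignore_whitespace:
        text = " ".join(text.split())  # trim and collapse runs of whitespace
    return text


def group_similar(values, max_distance=1, **options):
    """Return a mapping from each distinct value to its group's representative.

    Assumption: the most frequent spelling in a group is the best guess
    at the correct one, so values are visited from most to least frequent.
    """
    recode = {}
    representatives = []
    for value, _count in Counter(values).most_common():
        key = normalize(value, **options)
        for rep in representatives:
            if edit_distance(key, normalize(rep, **options)) <= max_distance:
                recode[value] = rep
                break
        else:
            # No existing group is close enough, so this value starts a new one.
            representatives.append(value)
            recode[value] = value
    return recode


column = ["rhythm", "rhythm", "rhythim", "Rhythm ", "melody"]
mapping = group_similar(column, max_distance=1)
cleaned = [mapping[v] for v in column]
```

In this sketch, "rhythim" and "Rhythm " both fall into the "rhythm" group, while "melody" stays in its own group. Raising `max_distance` merges more aggressively, which is why a manual threshold control matters.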
Automatic tools save significant effort, but they can't get the right answer all the time. For those cases, we redesigned the user interface for manual editing, making it easier to work with large data sets and verify that recoded values are correct.
Screenshot of Recode showing the results of Group Similar Values with some groups collapsed.
Recode's new "Grouped Display" makes it clear when two different strings are recoded to the same string. Groups can be collapsed or filtered by a search query, making it easy to focus on areas of interest within a large data set. The right-click menu offers manual control over creating and splitting groups. Every operation is now undoable.
Finally, we added many new ways to save, load, modify and merge Recode scripts for repetitive data cleaning tasks. Scripts can now be saved to a standalone file in addition to the data table. Loading a script in the Recode dialog displays the script's effect on the current table data and allows further editing. Scripts can be merged intelligently when loading or saving.
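As a rough illustration of what merging can mean, the Python sketch below treats a recode script as a plain mapping from old values to new values. This representation, and the helper names `apply_recode` and `merge_recodes`, are assumptions for the example rather than JMP's script format or merge rules. Here the merge is defined so that applying the merged mapping gives the same result as applying the first mapping and then the second.

```python
def apply_recode(values, recode):
    """Replace each value that has an entry in the mapping; leave the rest alone."""
    return [recode.get(v, v) for v in values]


def merge_recodes(first, second):
    """Combine two mappings so the result equals applying first, then second.

    If first sends "NY" to "New York" and second sends "New York" to
    "New York City", the merged mapping sends "NY" to "New York City".
    """
    merged = {old: second.get(new, new) for old, new in first.items()}
    for old, new in second.items():
        # Values that first leaves untouched still need second's rule.
        merged.setdefault(old, new)
    return merged


first = {"NY": "New York", "N.Y.": "New York"}
second = {"New York": "New York City", "LA": "Los Angeles"}
merged = merge_recodes(first, second)

column = ["NY", "N.Y.", "New York", "LA", "Boston"]
recoded = apply_recode(column, merged)
```

Previewing `recoded` against `column` before committing the change mirrors, in a simplified way, how loading a script shows its effect on the current table data.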