james_preiss


Coming in JMP 12: Overhauled Recode for easier data cleaning

Text data cleaning is an unglamorous but important step in statistics and analytics. Manually entered data is full of misspellings, typographical errors and inconsistencies. Even machine-generated data can cause problems if two data sources disagree on formatting. These errors must be fixed before analysis, because even a single differing character makes two pieces of text look like distinct values to the computer. A 2014 New York Times article estimated that data scientists spend between 50 and 80 percent of their time cleaning data.

We recognize that data cleaning is a critical step, so we are adding several new features in JMP 12 to make it easier. The overhauled Recode command in JMP is one of them. We expanded Recode with automatic data cleaning algorithms and an improved user interface for manual cleaning tasks.

Recode's new "Group Similar Values" command automatically corrects small data entry errors. Typos and inconsistent spellings show up as a small number of missing, extra or incorrect characters in otherwise identical text. For example, the misspelled "rhythim" can be corrected to "rhythm" by removing the 'i' character. Group Similar Values identifies such errors in the data table and groups them, outputting its best guess at the correct spelling for the group. It offers manual control over the difference threshold, as well as whitespace, punctuation and case sensitivity.
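To make the idea concrete, here is a minimal Python sketch of grouping values by edit distance. It is not JMP's algorithm, which this post does not describe in detail. The function names (`edit_distance`, `group_similar`), the greedy "most frequent value wins" rule, and the `max_distance` and `ignore_case` parameters are all assumptions made for illustration.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character inserts, deletes, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete ca
                            curr[j - 1] + 1,       # insert cb
                            prev[j - 1] + cost))   # substitute, or match at no cost
        prev = curr
    return prev[-1]


def group_similar(values, max_distance=1, ignore_case=True):
    """Greedily group values; the most frequent spelling represents each group.

    Assumption for this sketch: the most common spelling is the correct one.
    """
    # Normalize before comparing: always trim whitespace, optionally fold case.
    key = (lambda s: s.strip().lower()) if ignore_case else (lambda s: s.strip())
    groups = {}  # representative -> list of member values
    for value, _count in Counter(values).most_common():
        for rep in groups:
            if edit_distance(key(value), key(rep)) <= max_distance:
                groups[rep].append(value)
                break
        else:
            # No existing group is close enough, so start a new one.
            groups[value] = [value]
    return groups


column = ["rhythm", "rhythm", "rhythim", "Rhythm ", "melody", "melody", "melodyy"]
for representative, members in group_similar(column).items():
    print(representative, "<-", members)
```

Raising `max_distance` merges more aggressively, which mirrors the trade-off behind a manual difference threshold: a looser threshold catches more typos but risks grouping values that really are different.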

Automatic tools save significant effort, but they can't get the right answer all the time. For those cases, we redesigned the user interface for manual editing, making it easier to work with large data sets and verify that recoded values are correct.

Screenshot of JMP 12 Recode showing the results of Group Similar Values with some groups collapsed.


Recode's new "Grouped Display" makes it clear when two different strings are recoded to the same string. Groups can be collapsed or filtered by a search query, making it easy to focus on areas of interest within a large data set. The right-click menu offers manual control over creating and splitting groups. Every operation is now undoable.

Finally, we added many new ways to save, load, modify and merge Recode scripts for repetitive data cleaning tasks. Scripts can now be saved to a standalone file in addition to the data table. Loading a script in the Recode dialog displays the script's effect on the current table data and allows further editing. Scripts can be merged intelligently when loading or saving.
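As a rough illustration of what saving, loading and merging a recode can involve, the Python sketch below treats a recode as a plain mapping from old values to new values. This is only an analogy: JMP stores Recode scripts in its own scripting format, and the post does not say how its merge resolves disagreements. The JSON file layout, the function names, and the "keep the base rule and report the conflict" policy are all assumptions.

```python
import json
from pathlib import Path


def save_recode(path, mapping):
    """Write an old-value -> new-value mapping to a standalone JSON file."""
    Path(path).write_text(json.dumps(mapping, indent=2, sort_keys=True), encoding="utf-8")


def load_recode(path):
    """Read a mapping previously written by save_recode."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def apply_recode(values, mapping):
    """Recode each value; values without a rule pass through unchanged."""
    return [mapping.get(v, v) for v in values]


def merge_recodes(base, incoming):
    """Combine two mappings, keeping the base rule when they disagree.

    Returns (merged, conflicts), where conflicts maps each disputed old value
    to a (base_choice, incoming_choice) pair so a person can review it.
    """
    merged = dict(base)
    conflicts = {}
    for old, new in incoming.items():
        if old in merged and merged[old] != new:
            conflicts[old] = (merged[old], new)
        else:
            merged[old] = new
    return merged, conflicts


base = {"rhythim": "rhythm", "melodyy": "melody"}
incoming = {"rythm": "rhythm", "melodyy": "Melody"}
merged, conflicts = merge_recodes(base, incoming)
print(apply_recode(["rythm", "melodyy", "harmony"], merged))
print("needs review:", conflicts)
```

Reporting conflicts instead of silently overwriting them keeps the person doing the cleaning in control, which matters when the same recode is reused across many tables.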

Editor's note: This post is part of a series of previews of JMP 12 written by the people who develop the software.

3 Comments
Community Member

Michael Clayton wrote:

Thanks for the effort to make messy data easier to analyze.

Hope the Help tools are also updated, perhaps with some new tutorials, case studies, etc. on any way to move messy data from a database or (usually) Excel and reorganize it as needed.

James Preiss wrote:

Hi Michael, thanks for your feedback. We don't have any examples of cleaning up data imported from databases or Excel right now. The topic is on our suggestion list for the next release. In the meantime, there will be a more in-depth post on Recode coming after JMP 12. If you (or any readers) have sample data or a text cleaning problem you'd like to share, please post them in the JMP User Community so we can learn more about our customers' data cleaning needs.

Community Member

Melvin Alexander wrote:

I recently gave a presentation with a data table (containing embedded scripts) and slide deck that describe ways to clean and transform data (e.g., removing stop words, using Recode to change data values, applying log transforms to prepare data for singular value decomposition with the SVD() function, etc.).

The presentation is located at:

https://community.jmp.com/docs/DOC-7473

Hope this helps