cancel
Showing results for 
Show  only  | Search instead for 
Did you mean: 
Browse apps to extend the software in the new JMP Marketplace
Choose Language Hide Translation Bar
Data cleaning with regular expressions

Picture this: you're a fearless data explorer, venturing deep into the untamed wilderness of raw data. Armed with your trusty spreadsheet and a cup of coffee, you embark on a quest to extract meaningful insights. But alas, what you encounter is a jungle of inconsistent formatting, cryptic symbols, and typos that seem to have a life of their own. Fear not, intrepid data wrangler, for in this chaotic realm, a secret weapon awaits: regular expressions, the superhero cape of data cleaning.

 

What is a regular expression?

At the core, a regular expression, commonly known as regex, is a powerful sequence of characters that defines a search pattern. It is a versatile tool used in various programming languages and text editors to manipulate and match strings of text based on specific patterns or rules. Regular expressions allow you to express complex search patterns concisely, making it easier to find, extract, or modify specific parts of text data. By combining characters, metacharacters, and quantifiers, you can create regex patterns that match specific sequences, characters, or even more complex structures within a larger body of text. With their wide-ranging applications, regular expressions have become an invaluable resource for data cleaning, text processing, and pattern recognition tasks.

 

Regex in JMP

Within JMP, regular expressions are accessed and customized through the red triangle located in Text Explorer > Parsing Options > Customize Regex. Several regular expressions are readily available for use, such as the HTML Link Grabber and the Numbers highlighter. 

 

Brad_Norris_0-1686321747657.png

 

You can customize regular expressions within JMP.

 

While the built-in expressions are useful, data cleaning can require more customized expressions that better suit the dataset. Let's look at a simple example, like removing all of the numbers in a text string. The regular expression "\d+" will accomplish this task. This expression utilizes the "\d" metacharacter to match all digits (0-9), and the "+" quantifier to indicate that one or more digits should be matched. If we applied this regex to the string, "I have 5 apples and 10 oranges," the numbers "5" and "10" will be identified. By replacing these digits with an empty string, you have effectively cleaned your data of all present numbers. 

 

Let's analyze the HTML Link Grabber to explain a more complex example. This regex built into JMP is as follows:

 

 

 

 

<a(\s*(href\s*=\s*('|")([^>]*?)\3[^>]*)|(\s*[a-z]+\s*=\s*('|")([^>]*?)\6))*\s*>

 

 

 

 

Each significant part of the above expression is summarized in the table below:

<a This metacharacter searches for the opening anchor tag in HTML that signifies a hyperlink.
href\s*=\s*('|") This section matches the attribute "href" followed by an equal sign and optional spaces, then either a single quote ('), or a double quote ("). It captures the opening quote in a backreference.
([^>]*?) This matches any characters that are not the closing angle bracket (>). Most importantly, it captures the URL which is the main purpose of this expression. Note that this section is similar to format specifiers in C programming.
(\s*[a-z]+\s*=\s*('|")([^>]*?)\6) This pattern matches any additional attributes within the anchor tag. It captures the attribute name, an equal sign, optional spaces, the attribute value (inside quotes), and ensures the closing quote matches the opening quote.

(See Customize Regex in the Regular Expression Editor (jmp.com) for more information)

 

While these attributes and patterns may seem complex, there are plenty of documentation and resources available to learn and understand what each flag represents.

 

Application of regex

So now that we know exactly what regex is and can do, where is it most useful?

 

Well, it's hard to narrow it down to just one application, but I'll walk you through an example of some data filtering I did for a text analysis project covering data from the discussions section of the JMP User Community. The ultimate goal of this project was to find the most commonly used terms and phrases posted in discussions in the User Community.

 

Shown below is an example from just one out of more than 48,000 discussion comments. Note that the HTML formatting and JSL scripting make it very difficult to determine the vocabulary being used in the original comment.

 

Brad_Norris_2-1686335169808.png

Data sample prior to regular expression cleaning.

 

Cleaning out these terms and formatting manually could take months or even years to complete, but with the use of just a few regular expressions, cleaning can be done instantaneously.

 

To start, I wanted to remove any JSL scripting language content within the discussion. To accomplish this, I used the following regex:

 

 

 

 

<CODE class=" language-jsl">[\s\S]*?<\/CODE>| 

|((([€¥₩£\$]|r\$)\s{0,2}[\.,]?[0-9][\.,0-9]*)|([\.,]?[0-9][\.,0-9]*\s{0,2}([€¥₩£\$]|r\$))) 

 

 

 

 

Next, I wanted to remove any HTML and CSS formatting tags from the data:

 

 

 

 

<[^>]*>

 

 

 

 

Finally, I wanted to remove any numbers, decimals, and words containing numbers:

 

 

 

 

\d+(\.\d+)?|\.\d+

 

 

 

 

Looking at the same original sample message, you can see below how each expression filters the specified parameters:

Brad_Norris_4-1686336445092.png

Filtered data. Words that are not filtered are highlighted yellow.

 

Exploring the results

Once we've finished filtering out our custom expressions, how do our results change? Let's look at the side-by-side comparisons of the word clouds for the raw and unfiltered data set vs. the regex cleaned set:

 

Old (no regex)

New (with regex)

Brad_Norris_9-1686336924671.png

 


 

Brad_Norris_11-1686338948089.png

 

 

Note that prior to using regular expressions, the data is filled with HTML formatting, JSL formatting, and completely unrelated content. For the purpose of my analysis, this data is completely irrelevant. After employing regex, the data is much more useful for analysis and can provide very interesting results with some more work.

 

Conducting a topic analysis on the cleaned data provided the following results as to what topics are most commonly answered within the User Community, shown below:

 

Brad_Norris_10-1686337169083.png

Five-topic analysis on discussions data

 

The five topics provided can be generalized into different categories of questions that users are asking about: 

  • Topic #1: Design of experiment (DOE)
  • Topic #2: Predictive modeling & machine learning
  • Topic #3: Instructional answers
  • Topic #4: Data formatting and analysis
  • Topic #5: Programming and scripting

 

Regex column text matching

Regular expressions are not limited to application from within a Text Exploration window; you can also access them when creating a new data column with a string formula. To access this, view Cols -> Utilities -> New Column by Text Matching.

 

Brad_Norris_0-1686667500159.png

Regex is available in any text column of a JMP file.

 

Selecting "New Column by Text Matching" will open the same regex editor window as the one in Text Explorer. The entire data curation process is the exact same, however, your filtered text will now be automatically formatted into a new column, as shown in the image below:

 

Brad_Norris_1-1686668709847.png

The same process is shown outside of a Text Explorer window.

 

Final thoughts

Regular expressions are an indispensable tool in the realm of data exploration. They offer a versatile solution for manipulating and matching text strings to a desired output. With regex, we can unlock the full potential of our data.

 

For more on regex, see:

Customize Regex in the Regular Expression Editor (jmp.com)

Quantifiers in Regular Expressions | Microsoft Learn

Regular Expressions Reference (regular-expressions.info)

Regex Function (jmp.com)

JMP Help

Last Modified: Mar 18, 2024 12:13 AM
Comments