
Processing Unstructured Text Data

Started ‎06-10-2020 by
Modified ‎12-03-2021 by

Learn more in our free online course:
Statistical Thinking for Industrial Problem Solving

In this video, you learn how to process unstructured text data using the file Pet Owner.jmp. This example is based on the file Pet Survey.jmp, in the JMP sample data library.

 

In this scenario, 150 dog and cat owners were asked, "Think about your cat or dog. What's the first thought that comes to mind?"

 

The responses are in the column Survey Response. Because these data are unstructured text, the modeling type Unstructured Text has been applied.

 

To analyze these data, we select Text Explorer from the Analyze menu in JMP.

 

We select Survey Response as the text column.

 

The default language is English, but other languages are also available.

 

There are several options for tokenizing, terming, and phrasing.

 

You can use the phrase options to specify the maximum number of words per phrase and the maximum number of phrases to display.

 

You can use the word options to specify the minimum word size and the maximum number of characters per word.

 

You can use the stemming option to automatically combine terms with the same stem.

 

There are two options under Tokenizing: Regex and Basic Words. The default is Regex (regular expression). When Regex is used, the documents are parsed using built-in expressions and symbols, including common punctuation, spaces, and tabs. You can use the Customize Regex option to add your own custom expressions.
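To make the idea concrete, here is a minimal regex-tokenizer sketch in Python. The pattern below is an assumption chosen for illustration; it is not JMP's built-in expression.

```python
import re

def tokenize(document):
    """Split a document into lowercase word tokens.

    A simplified illustration of regex-based tokenization. The pattern
    is an assumption for this sketch, not JMP's actual expression.
    """
    return re.findall(r"[a-z0-9']+", document.lower())

tokens = tokenize("My dog loves duck hunting -- and barking!")
# tokens: ['my', 'dog', 'loves', 'duck', 'hunting', 'and', 'barking']
```

Punctuation and spacing fall away automatically because the pattern only matches runs of word characters.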

 

We’ll use the default options and click OK.

 

Behind the scenes, JMP has tokenized the data. Punctuation, spacing, and symbols have been removed, along with common stop words like "a", "an", and "the".

 

The summary at the top reports the number of unique terms, the number of documents (or cases), the total number of tokens, and the average number of tokens per case.

 

The term list and phrase list are also provided.

 

There are several interesting phrases. For example, the phrase "duck hunting" has more meaning than the individual words "duck" and "hunting".

 

To add the top phrases to the term list, we’ll select the phrases, right-click, and select Add Phrase.
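Conceptually, a two-word phrase is just a pair of adjacent tokens. The sketch below counts such pairs over made-up token lists; it is a simple illustration of the idea, not JMP's phrase-detection algorithm.

```python
from collections import Counter

def bigrams(tokens):
    """Return adjacent word pairs -- a simple stand-in for phrase detection."""
    return list(zip(tokens, tokens[1:]))

# Assumed example documents, already tokenized
docs = [
    ["duck", "hunting", "every", "fall"],
    ["loves", "duck", "hunting"],
]
phrase_counts = Counter(pair for doc in docs for pair in bigrams(doc))
# ('duck', 'hunting') appears twice, so it surfaces as a candidate phrase
```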

 

Let’s take a closer look at the term list. We’ll click the word "Term" in the term list to sort in alphabetical order.

 

As we scroll through the list, we can see some typographical errors. For example, the term "doggs" should be spelled "dogs".

 

If you want to fix typographical errors, or combine terms that have the same meaning into one term, you can select the terms, right-click, and select Recode. For example, suppose that we want to group these two terms into one term, "dogs". We’ll select the two values, click Group, rename the group "dogs", and then click Recode.
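Recoding amounts to mapping old values onto a single corrected value. A minimal Python sketch with a hypothetical recode map:

```python
# Hypothetical recode map collapsing misspellings into one term
recode = {"doggs": "dogs", "dogz": "dogs"}

terms = ["dogs", "doggs", "cats", "dogz"]
cleaned = [recode.get(term, term) for term in terms]
# cleaned: ['dogs', 'dogs', 'cats', 'dogs']
```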

 

What about the terms "dog" and "dogs", or "duck" and "ducks"? These words share the same base, or stem. If you want to combine terms with the same stem, you can use stemming.

 

To do this, we click the top red triangle, then select Term Options, Stemming, and then Stem for Combining.

 

The dot at the end of the terms indicates that stemmed terms have been combined. When we position the cursor on these stemmed terms, we see the list of terms that have been combined.
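Stemming can be pictured as grouping terms under a common base form. The one-rule stemmer below is a deliberately naive stand-in for JMP's actual stemming algorithm, which handles far more than plural endings.

```python
def crude_stem(word):
    """Strip a plural 's' -- a deliberately naive stand-in for a real stemmer."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

# Group terms by their (crude) stem, mirroring Stem for Combining
groups = {}
for term in ["dog", "dogs", "duck", "ducks", "cat"]:
    groups.setdefault(crude_stem(term), []).append(term)
# groups: {'dog': ['dog', 'dogs'], 'duck': ['duck', 'ducks'], 'cat': ['cat']}
```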

 

We’ll click the Count column twice to sort in descending order. Notice that the top terms are "cat" and "dog". Because we know these data are from a survey about cats and dogs, these terms aren’t particularly useful.

 

To remove these terms, we can add them as stop words. To do this, we select the terms,
right-click, and select Add Stop Word.
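Adding stop words is equivalent to filtering those terms out of the analysis. A minimal Python sketch with an assumed domain-specific stop-word set:

```python
# Domain-specific stop words (assumed for this example)
stop_words = {"cat", "cats", "dog", "dogs"}

terms = ["dog", "barking", "cat", "videos", "walks"]
filtered = [term for term in terms if term not in stop_words]
# filtered: ['barking', 'videos', 'walks']
```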

 

Now you can see that the top terms relate to barking, walking, jumping, and videos.

 

We’ll save the script for this analysis to the data table so that we can repeat this same analysis later.