Solved: Re: Text Analysis -- Manually Grouping Similar Phrases

Victor60 · May 11, 2018 12:13 PM

I have spent a bit of time, probably two hours, trying to use the JMP training info, but I must be missing something.

I have unstructured survey questions and have run the basic text analysis. Looking at the phrases, I can see about five that have the same meaning. I would like to group these into one and only one phrase for analysis. In JMP, can I select phrases and easily create a "super phrase"? I have tried several times and can't figure out what to do.

"request tests"

"request a test"

"request testing"

"test requests"

"test request"

"testing requests"

all would be grouped as "request tests" since they all have that meaning.

gzmorgan0 · May 15, 2018 06:26 AM

You did not mention which version of JMP you are using. In JMP 12 or later, there is a column utility called Recode.

The script below creates a dummy table and calls Recode for column User1.

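// Six variants of the same request, used to fill a dummy survey column
ex = {"request tests", "request a test", "request testing", "test requests", "test request",
"testing requests"};

// 50-row table; each row of User1 gets one of the variants at random
dt = New Table( "Survey", Add Rows( 50 ), New Column( "User1", Character,
   << Set Each Value( ex[Random Integer( 1, 6 )] ) ) );

// Select User1, then open the Recode dialog for it
dt << Go To( :User1 );
dt << Recode;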

The attached screenshot shows the table and the interactive interface. The red box around Formula marks a menu with the options to create a New Column (of values), a Formula (column), or In Place (replace values). I chose Formula and named the column Recode User1. Next, select all of the listed values and right-click; a pop-up menu of options appears. The options are to group the selection under one of the existing phrases or under a new value. Make your selection and press the Recode button.

Note: If there are several groups of answers, you can select each group in turn and specify its common term. When done, press Recode.

The formula created by this action is

Match( :User1,
	"request a test", "request tests",
	"request testing", "request tests",
	"test request", "request tests",
	"test requests", "request tests",
	"testing requests", "request tests",
	:User1
)

I would have used the following formula:

t0 = Trim( Lowercase( :User1 ) );
If( Contains( t0, "request" ) & Contains( t0, "test" ),
	"request test",
	:User1
);
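As a rough sketch of how a Contains rule like this could be attached to a table as a formula column: the table name "Survey Demo", the sample values, and the column name "Grouped User1" are made up for illustration, and Local() is used only so that t0 does not become a global variable.

```jsl
Names Default To Here( 1 );

// Hypothetical demo table; substitute your own table and column.
dt = New Table( "Survey Demo",
	New Column( "User1", Character,
		Set Values( {"request tests", " Test Request", "testing requests", "order parts"} )
	)
);

// Formula column: any response that mentions both words collapses to one label.
dt << New Column( "Grouped User1", Character, Nominal,
	Formula(
		Local( {t0 = Trim( Lowercase( :User1 ) )},
			If( Contains( t0, "request" ) & Contains( t0, "test" ),
				"request test",
				:User1
			)
		)
	)
);
```

The rule is deliberately broad: a response such as "did not request a test" would also be grouped, so it is worth scanning the result before relying on it.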

Neither of these formulas handles typos or misspellings. JMP has a function called shortest edit distance, I have a script for computing the Levenshtein distance, and there are other algorithms to "score" how well words and phrases match (or fail to match).

However, if you are working with your data interactively, Recode is very nice to use.
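For readers who have not met edit distance before, here is a minimal sketch of the standard Levenshtein calculation in JSL. It is not the script referred to above, and the function name levenshtein is made up for this example. The comparison is case-sensitive, so lowercase both strings first if case should not count.

```jsl
Names Default To Here( 1 );

// Levenshtein distance: the minimum number of single-character insertions,
// deletions, and substitutions needed to turn string a into string b.
levenshtein = Function( {a, b}, {Default Local},
	n = Length( a );
	m = Length( b );
	If( n == 0, Return( m ) );
	If( m == 0, Return( n ) );
	prev = (0 :: m); // row for the empty prefix of a: [0 1 ... m]
	For( i = 1, i <= n, i++,
		curr = J( 1, m + 1, 0 );
		curr[1] = i; // reducing the first i characters of a to "" takes i deletions
		For( j = 1, j <= m, j++,
			cost = If( Substr( a, i, 1 ) == Substr( b, j, 1 ), 0, 1 );
			curr[j + 1] = Min(
				prev[j + 1] + 1, // delete a character from a
				curr[j] + 1,     // insert a character from b
				prev[j] + cost   // keep or substitute
			);
		);
		prev = curr;
	);
	prev[m + 1];
);

// A misspelled response (one missing letter) compared with the target phrase.
Show( levenshtein( "reqest tests", "request tests" ) );
```

A small distance relative to the phrase length suggests a typo of the same phrase; where to set the cutoff is a judgment call that depends on the data.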

Look up Recode Data in the online book Using JMP. (Main Menu > Help > Books > Using JMP).

dale_lehman · May 11, 2018 12:39 PM

There may be a more elegant way to do this (and I'd be interested if someone knows of it), but you can accomplish this by creating a new column with an If formula that has several Or clauses: if the text field Contains any of the phrases you listed, then 1, otherwise 0. This is even easier if you use the Rows, Select Where option and add multiple conditions, each of which is the text field "contains" one of the phrases on your list (make sure to check "if any condition is met"). Once those rows are selected, Rows, Name Selection in Column will create the same column the formula would give you.
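A minimal sketch of that indicator formula in JSL, assuming the responses sit in a character column called User1; the table name, sample values, and the new column name are invented for the example.

```jsl
Names Default To Here( 1 );

// Hypothetical demo table; substitute your own table and column.
dt = New Table( "Survey Demo",
	New Column( "User1", Character,
		Set Values( {"please request tests early", "a test request was sent", "order parts"} )
	)
);

// 1 if the response contains any of the listed phrases, otherwise 0.
dt << New Column( "Mentions Request Test", Numeric, Nominal,
	Formula(
		If(
			Contains( :User1, "request tests" ) | Contains( :User1, "request a test" ) |
			Contains( :User1, "request testing" ) | Contains( :User1, "test requests" ) |
			Contains( :User1, "test request" ) | Contains( :User1, "testing requests" ),
			1,
			0
		)
	)
);
```

Contains is case-sensitive, so wrapping :User1 in Lowercase() may be worthwhile if the responses mix cases.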

Mark_Bailey · May 11, 2018 01:24 PM

I do not understand why "a" is in your term list and therefore in your phrase list. Are one-character tokens really informative? Also, the stop words include "a", so it should have been removed automatically.

You could first create stems for request/requests and for test/tests so you are down to just two phrases, request test and test request. You could add them to the term list and then recode them to the one desired level.
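To illustrate the idea outside Text Explorer, here is a rough JSL sketch that drops the stop word "a", applies a very crude suffix-stripping rule in place of real stemming, and then recodes the one remaining variant. The helper names stem and normalize are hypothetical, and Text Explorer's own stemmer is considerably more careful than this.

```jsl
Names Default To Here( 1 );

// Crude stand-in for stemming: strip a trailing "ing" or "s".
stem = Function( {w}, {Default Local},
	n = Length( w );
	If(
		n > 4 & Ends With( w, "ing" ), Left( w, n - 3 ),
		n > 2 & Ends With( w, "s" ), Left( w, n - 1 ),
		w
	);
);

// Lowercase, drop the stop word "a", stem each token, and rejoin.
normalize = Function( {phrase}, {Default Local},
	toks = Words( Lowercase( phrase ), " " );
	kept = {};
	For( i = 1, i <= N Items( toks ), i++,
		If( toks[i] != "a",
			Insert Into( kept, stem( toks[i] ) )
		)
	);
	Concat Items( kept, " " );
);

ex = {"request tests", "request a test", "request testing", "test requests", "test request",
"testing requests"};

For( k = 1, k <= N Items( ex ), k++,
	p = normalize( ex[k] );
	// After stemming only two forms should remain; recode one to the other.
	grouped = Match( p, "test request", "request test", p );
	Show( ex[k], grouped );
);
```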

Mark_Bailey · May 11, 2018 01:30 PM

Oh, we don't have training for Text Explorer yet. We will premiere a new course at the JMP Discovery Summit in October!

Mark_Bailey · May 15, 2018 06:43 AM

I could be mistaken but it appears to me that the various forms of the phrase to be dealt with are found in the phrase list of Text Explorer. These phrases are not the original character string values in the text data column. I think that this case is unstructured text, not structured character values. So the recode must be done within Text Explorer after parsing and terming.

Victor60 · May 21, 2018 08:27 AM

I have learned that this is a two-step process in Text Explorer in JMP 14 Pro. First, create a new phrase by selecting multiple stemmed phrases in the right-hand box, titled Phrase. Once that is done, go to the left-hand box, titled Term and Phrase Lists, then find the new phrase and right-click on it. Select "Recode" and give each of the phrases you grouped the same single descriptive name. Each one disappears as it is given the new name. Then, in that left-hand box, you will see the new "superphrase", and you can view it in the Pareto.