Text mining: find a given word, then get the words on either side of it

markschahl

Community Trekker

Joined:

Jun 18, 2012

I am trying to mine text in manually entered comments. Thankfully there is somewhat of a pattern. Within the first 10 words, if the word "to" appears, the words on either side of "to" are the words that I want.

I've taken the baby step of creating an Expression column with the formula Words( :Comments ). It returns a list of the words from the Comments column - pretty cool.

So, what do I do next to find whether "to" occurs in the first ten words and, if so, grab the words on either side of "to"?

4 REPLIES
ian_jmp

Staff

Joined:

Jun 23, 2011

Without getting too fancy, maybe something like this? The second one gives '{}' because the last 'word' is 'to.' not 'to'. But you could code round that if need be.

NamesDefaultToHere(1);

wordsThatBracketTo =
Function({str}, {Default Local},
    strWords = Words(str);
    toPos = Loc(strWords, "to");
    requiredWords = {};
    if(toPos <= 10,
        if(toPos > 1, InsertInto(requiredWords, strWords[toPos-1]));
        if(toPos < NItems(strWords), InsertInto(requiredWords, strWords[toPos+1]));
    );
    EvalList(requiredWords);
);

sampleText = "Here's a phrase that does not contain to too many times.";
Print(wordsThatBracketTo(sampleText));

sampleText = "Here's a phrase that does contain to.";
Print(wordsThatBracketTo(sampleText));
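
One way to code round the trailing-punctuation case mentioned above is to give Words() a delimiter string that includes punctuation, so that 'to.' is split into 'to'. This is only a minimal sketch: the function name wordsThatBracketTo2 is hypothetical, the delimiter set " .,;:!?" is an assumption about what shows up in the comments, and Contains() is used in place of Loc() so that only the first "to" is considered.

```jsl
Names Default To Here( 1 );

// Hypothetical variant of the function above (not from the original post)
wordsThatBracketTo2 = Function( {str},
	{Default Local},
	// Assumed delimiter set: space plus common punctuation
	strWords = Words( str, " .,;:!?" );
	// Position of the first "to" in the list, 0 if it is absent
	toPos = Contains( strWords, "to" );
	requiredWords = {};
	If( toPos > 0 & toPos <= 10,
		If( toPos > 1, Insert Into( requiredWords, strWords[toPos - 1] ) );
		If( toPos < N Items( strWords ), Insert Into( requiredWords, strWords[toPos + 1] ) );
	);
	requiredWords;
);

Print( wordsThatBracketTo2( "Here's a phrase that does contain to." ) );
```

With this sample the "to" is the last word, so only the word before it can be returned. If the comments contain other punctuation (parentheses, quotes, slashes), the delimiter string would need to be extended.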

Jeff_Perkinson

Community Manager

Joined:

Jun 23, 2011

Ian@JMP has a nice scripted example.

If you want to do it in a column formula, it will look like this.

[Screenshot of the column formula: 11077_JMPScreenSnapz003.png]

This formula uses a Local Variable to hold the position of the "to".

I've attached a data table with a column with this formula to show how it works.

NB: If "to" appears in position 10 it will give you words 9 and 11. If that's not what you want, you'll need to add an additional condition inside the If().

-Jeff
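
Since the screenshot and the attached table are not reproduced here, the following is a rough sketch of what a column formula along these lines could look like. It is not necessarily the formula Jeff posted: the variable names w and p are hypothetical, and it assumes the text is in a column named Comments. A local variable holds the position of "to", as described above.

```jsl
// Column formula sketch (an assumption, not the formula from the screenshot)
Local( {w = Words( :Comments ), p},
	// Position of the first "to" in the word list, 0 if it is absent
	p = Contains( w, "to" );
	If( p > 1 & p <= 10 & p < N Items( w ),
		// Word before and word after "to", separated by a space
		w[p - 1] || " " || w[p + 1],
		// No usable "to" in the first ten words
		""
	);
)
```

As in the note above, a "to" in position 10 yields words 9 and 11. This sketch also returns an empty string when "to" is the first or last word; whether that is the right choice depends on the data.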
msharp

Super User

Joined:

Jul 28, 2015

If I've said it once, I've said it a million times. If you're going to do any amount of text mining, it'd be worth your time to spend an hour learning Regular Expressions. There are tons of great YouTube videos, and there is even a very good section on them in the Scripting Guide.

text = "Example text to the question.";

answer1 = regex(text, "(\w+) to (\w+)", "\1");
answer2 = regex(text, "(\w+) to (\w+)", "\2");
answer = regex(text, "(\w+) to (\w+)", "\1" || ", " || "\2");

print(answer1);
print(answer2);
print(answer);

output:

"text"
"the"
"text, the"
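
As a possible variation on the regex approach above, Regex Match() can return the whole match and both captured words from a single call, and it makes the no-match case easy to test. This is a sketch under the assumption that the same pattern is wanted; the variable name m and the "no match" message are placeholders, and it does not enforce the first-ten-words condition from the original question.

```jsl
Names Default To Here( 1 );

text = "Example text to the question.";

// Regex Match returns a list: the whole match followed by each captured group.
// An empty list comes back when the pattern does not match.
m = Regex Match( text, "(\w+) to (\w+)" );

If( N Items( m ) == 3,
	Print( m[2] || ", " || m[3] ), // word before "to", word after "to"
	Print( "no match" )
);
```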

markschahl

Community Trekker

Joined:

Jun 18, 2012

Thanks to all! Sorry for the late reply - was busy doing the text mining for an annual meeting that just wrapped up.

So, I used Jeff's formula approach because I didn't have a lot of time to do anything really elegant. Now, I have time to learn regex. I will develop a scripted solution for this analysis so that I can do it quickly next year.

The data cleanup / extraction of information was the messy part of this work. The formula helped to find what I was looking for. But, I still had to do a lot of work in Columns > Recode. Hats off to those who made it so powerful in JMP 12!

BTW, text mining would make for an excellent Mastering JMP webinar.