Solved: Re: Word Cloud in Graph Builder?

pmroz · Oct 1, 2014 7:21 AM

Hello fellow JMPers,

Has anyone used JMP for text mining? Specifically I'm interested in using Graph Builder to create an interactive word cloud. I've seen solutions that connect to R, but those lack the interactivity of JMP.

Thanks!

Craige_Hales · Oct 18, 2016 2:52 PM

The NY Times look is nice because it is easy to implement.

Here's some JSL

/* the loadtextfile, below, fetches a file from http://www.gutenberg.org/ which says...

Project Gutenberg's Alice's Adventures in Wonderland, by Lewis Carroll

This eBook is for the use of anyone anywhere at no cost and with

almost no restrictions whatsoever. You may copy it, give it away or

re-use it under the terms of the Project Gutenberg License included

with this eBook or online at www.gutenberg.org

Title: Alice's Adventures in Wonderland

Author: Lewis Carroll

Posting Date: June 25, 2008 [EBook #11]

Release Date: March, 1994

[Last updated: December 20, 2011]

Language: English

*/

txt = Load Text File( "c:\path\Alice.txt" ); Length( txt );

aa = [=> 0]; // associativearray with default value of zero

// count the words. JMP has a words() function that returns a list of words, but it isn't

// able to distinguish apostrophes inside words from on edges of words.

// 'we're is an example this regex works around. as is 'Oh.

// 'Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad.'

rc = Pat Match( txt, Pat Repeat( (Pat Regex( "[\w]+('[\w]+)?" ) >> word +

Expr( word = Uppercase( word ); aa[word]++; ""; )) |

(Pat Regex( "[^\w]+" )), 1, 9999999, GREEDY ));

keys = aa << getkeys; N Items( keys );

vals = aa << getvalues; N Items( vals );

dt = New Table( "Untitled",

New Column( "word", character, Set Values( keys ) ),

New Column( "count", Numeric, "Continuous", Format( "Best", 12 ), Set Values( vals ) )

);

dt2 = dt << sort( by( count ), order( descending ) );

Close( dt, nosave );

dt = dt2;

New Window( "words", Border Box( Left( 10 ), Right( 10 ), top( 10 ), bottom( 10 ), v = V List Box() ) );

h = H List Box(); v << append( h );

row = 12; // by inspection

While( dt:count[row] > 8,

x = Text Box( dt:word[row] || " " );

x << setfontsize( Log( dt:count[row] ) * 4 );

h << append( x );

If( h << getwidth > 700, h = H List Box(); v << append( h ); );

row++;

Wait( 0 );

);

The PatMatch statement uses regular expressions to identify the words in the text, uppercases them, and uses them as the keys to an associative array. The associative array elements default to zero if they don't already exist (see the =>0), so no special effort is required for the first instance of a word. The PatRepeat within the PatMatch processes the entire string. Within the repeat, the two regexes alternate between matching words and matching the stuff between words. Right after the first regex, expr is used to execute some JSL on each matched word. The last thing the expr does is evaluate to an empty string, which will always match at the current position. This PatMatch processes about 150K characters in about two seconds, including counting the words into the associative array.
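
To see the counting step on its own, here is a minimal sketch that runs the same pattern on the sentence quoted in the script's comments. The pattern and the associative array are copied from the script; the short test string and the Show() at the end are only there for illustration.

```jsl
// Sample sentence from the script's comments (illustration only)
txt = "'Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad.'";

aa = [=> 0]; // missing keys read as zero, so ++ works on the first sighting

// Same pattern as the full script: a word (optionally with an inner
// apostrophe part) is captured, uppercased and counted; anything else
// is skipped by the second regex.
rc = Pat Match( txt, Pat Repeat( (Pat Regex( "[\w]+('[\w]+)?" ) >> word +
	Expr( word = Uppercase( word ); aa[word]++; ""; )) |
	(Pat Regex( "[^\w]+" )), 1, 9999999, GREEDY ) );

keys = aa << Get Keys;
Show( keys, aa["MAD"], aa["WE'RE"] );
```

MAD appears three times in the sentence, so its count should come out as 3, and WE'RE and CAN'T should each show up as a single key with the apostrophe kept inside the word.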

A data table is constructed from the key-value pairs and sorted, then a window is created with a vertical list box containing a bunch of horizontal list boxes. The horizontal list boxes are filled with the words, from left to right. After adding each word, the wait(0) allows the window to update so the width of the horizontal list box will be correct (and you get to watch it). When the box exceeds 700 pixels, a new horizontal list box is started, which begins a new line.
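
The wrapping logic can be tried without any text file. This is a rough sketch using a hard-coded word list and a much smaller width limit; the list, the window title, and the 150-pixel limit are made up for the demo, while the display boxes and messages are the ones used in the script.

```jsl
Names Default To Here( 1 );

// Hypothetical word list, just for the demo
words = {"ALICE", "QUEEN", "KING", "TURTLE", "HATTER", "GRYPHON"};

New Window( "wrap demo", v = V List Box() );
h = H List Box();
v << Append( h );

For( i = 1, i <= N Items( words ), i++,
	h << Append( Text Box( words[i] || " " ) );
	Wait( 0 ); // let the window lay out so the reported width is current
	// once the current line is too wide, start a new line under it
	If( (h << Get Width) > 150,
		h = H List Box();
		v << Append( h );
	);
);
```

Note that the width is checked after the word is appended, so each line can run a little past the limit before the next one starts. The full script behaves the same way with its 700-pixel limit.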

Start at row 12 to skip THE AND TO A OF IT SHE SAID YOU IN I. Stop when the sorted count drops to 8 or fewer instances. Use a log transform for the font size; Alice has 386 instances.
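
To get a feel for what the log transform does to the sizes, this small sketch prints the font size the script would compute for a few counts. The 386 and 9 come from the numbers above (9 being the smallest count that still passes the > 8 test); 100 is just an arbitrary value in between.

```jsl
Names Default To Here( 1 );

counts = {386, 100, 9}; // 100 is an arbitrary middle value

For( i = 1, i <= N Items( counts ), i++,
	// same formula as the script: natural log of the count, times 4
	pts = Round( Log( counts[i] ) * 4, 1 );
	Show( counts[i], pts );
);
```

The most frequent word ends up at roughly 24 points and the least frequent at roughly 9, so a 40x difference in count becomes less than a 3x difference in font size.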

As Xan points out, there are issues with words that differ only in suffix or prefix, and with synonyms. Probably some other issues too.

Jabberwocky is in Through the Looking Glass. Slithy Toves are not here.

update: probably shouldn't fetch text from the link every time. Point the loadtextfile to some file on your computer, perhaps downloaded from http://www.gutenberg.org/

Craige

stan_koprowski · Oct 2, 2014 11:37 AM

Hello Peter,

Here is some JSL to count the words in a text file and then create a tree map.

Not exactly a word cloud but could get you started in the direction you want.

Best,

Stan

clear log();

NamesDefaultToHere(1);

// Prompt for text file

Getfile = Pick File();

// Store text string in variable

text = Load Text File( Getfile );

//Create list of words in sorted order removing white spaces and other delimiters

wordlists = Sort List(Words(text, ",./\;!?'()\!" "));

// Count the items in the list

nwds = N Items(wordlists);

// create two lists; one for deletions and one for insertions

deletelist = {};

keeplist = {};

// remove common words, i.e., those that are 3 characters or fewer

For ( i= 1, i <= N Items ( wordlists ), i++,

wrdlength = length( wordlists[i] ) ;

If( wrdlength <= 3,

Insert Into( deletelist, wordlists[i], 1 ),

Insert Into( keeplist, wordlists[i] )

)

);

// Create data table

dt = New Table("Word Cloud Data Table",

New Column("Words", Character, Set Values(keeplist)),

);

Tabulate(

Show Control Panel( 0 ),

Include missing for grouping columns( 1 ),

Order by count of grouping columns( 1 ),

Add Table( Row Table( Grouping Columns( :Words ) ) )

)<< Make Into Data Table;

// Create tree map of all words

Graph Builder(

Show Control Panel( 0 ),

Variables( X( :Words ) ),

Elements( Treemap( X, Summary Statistic( "Sum" ) ) ),

);

pmroz · Oct 2, 2014 01:26 PM

Thanks Stan for that suggestion. I like treemaps but a wordcloud is better suited to displaying words, especially long ones.

XanGregg · Oct 9, 2014 02:24 PM

Hi Peter, I'm not a fan of wordclouds in general. Maybe it's because so many of them are poorly done with no attention to synonyms or stop words and with random word positions. However, there are some inherent drawbacks like the way an irrelevant detail like word length influences a word's prominence.

The best I've seen is the NY Times look at inauguration speeches, which has meaningful positioning and highlighting.

pmroz · Oct 17, 2014 7:03 AM

Thanks Craige, very cool solution.

pmroz · Oct 18, 2016 2:54 PM

I added stop words and color to the output. Also made dt an invisible table.

cloud_title = "Alice in Wonderland";

// Put all the text into one variable

txt = load text file("c:\temp\alice.txt");

aa = [=> 0]; // associativearray with default value of zero

// count the words. JMP has a words() function that returns a list of words, but it isn't

// able to distinguish apostrophes inside words from on edges of words.

// 'we're is an example this regex works around. as is 'Oh.

// 'Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad.'

rc = Pat Match(

txt,

Pat Repeat(

(Pat Regex( "[\w]+('[\w]+)?" ) >> word + Expr(

word = Uppercase( word );

aa[word]++;

"";

)) | (Pat Regex( "[^\w]+" )),

1,

9999999,

GREEDY

)

);

keys = aa << getkeys;

//show(N Items( keys ));

vals = aa << getvalues;

//show(N Items( vals ));

dt = New Table( "",

New Column( "word", character, Set Values( keys ) ),

New Column( "count", Numeric, "Continuous", Format( "Best", 12 ), Set Values( vals ) ),

invisible

);

dt << sort( by( count ), order( descending ), replace table );

// Words to ignore

stop_words = {"THE", "WAS", "NOT", "AND", "ON", "A", "WITH", "FROM", "THIS", "OF", "WERE", "FOR", "TO",

"AN", "BY", "IT", "OR", "AS", "HAD", "0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "AT", "BE", "IN",

"DID", "THAT", "NO", "ALSO", "IS", "MAY", "BUT", "HAS", "HER", "SHE", "HE", "HIS", "BEEN", "00",

"01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12", "13", "14", "15", "16", "17",

"18", "19", "20", "21", "22", "23", "24", "25", "26", "27", "28", "29", "30", "31", "WHERE",

"IE", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R",

"S", "T", "U", "V", "W", "X", "Y", "Z", "HAVE", "LIKE", "DE", "THERE", "ALL", "GOT",

"WHO", "AFTER", "ANY", "ABOUT", "COMPANY", "NEXT", "THEY", "IF", "THESE", "THEN", "HOWEVER", "MY"};

del_rows = dt << get rows where( Contains( stop_words, :word ) );

dt << delete rows( del_rows );

New Window( "Text Word Cloud",

Outline Box( cloud_title ,

Border Box( Left( 10 ), Right( 10 ), top( 10 ), bottom( 10 ), v = V List Box() )

)

);

h = H List Box();

v << append( h );

// List of "good" colors from Scripting Guide, page 341

color_list =

[0, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30,

31, 35, 36, 37, 38, 39, 40, 43, 44, 45, 46, 47, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62];

ncolors = nrows(color_list);

row = 1;

k = 1;

nr = nrows(dt);

min_words = 8;

size_multiplier = 1;

if (nr < 100,

min_words = 2;

size_multiplier = 2;

);

While( dt:count[row] > min_words,

x = Text Box( dt:word[row] || " " );

x << setfontsize( size_multiplier * Log( dt:count[row] ) * 4 );

x << font color(color_list[k]);

k++;

if (k > ncolors,

k = 1;

);

h << append( x );

If( h << getwidth > 700,

h = H List Box();

v << append( h );

);

row++;

Wait( 0 );

);

wait(0);

close(dt, nosave);