Discussions

caseylott · Apr 7, 2021 02:40 PM

Hello community! Has anyone figured out how to create a keyword co-occurrence network graph in JMP that can be saved as a script to the data table and stays interactive with other graphs? I can call out to R to make the graph, but it’s static when I bring it back into JMP. I’m a standard JMP user and I’d love to find a solution that doesn’t involve the words “available in JMP Pro only”, since a JMP Pro license is way outside my budget. Thanks in advance for any advice.

Craige_Hales · Apr 9, 2021 3:59 AM

Very nice!

Stemming might help combine some of the near-duplicates.

The problem appears to be JMP 15 vs 16; you can make your own label column instead:

// load some documents that might separate into some categories
dt1 = Open( "z:/Fully Screened Subset With Abstracts and LatLongs.jmp" ); // 5000 is too many, subset 36...

originalnames = dt1 << getColumnNames();

te = dt1 << Text Explorer(
	Text Columns( :AbstractProofed ), // the entire document is in one cell of the row
	Add Stop Words(
		{"agreement", "almost", "another", "company", "copyright holder", "electronic", "foundation", "gutenberg", "literary archive", "little",
		"person or entity", "project", "public domain", "research", "without", "ebooks", "enough", "though", "rather", "better", "common", "possible",
		"weight", "present", "series", "necessary", "placed", "therefore", "towards", "footnote", "something","slowly","around","behind","looking","seemed","nothing",
			"probably","called","easily","distributing","paragraph"
		}
	),
	Minimum Characters per Word( 6 ),
	Stemming( "no stemming" ), //"Stem for Combining"
	Language( "English" )
);

//te << savedocumenttermmatrix( Maximum Number of Terms( 300 ), Minimum Term Frequency( 25 ), Weighting( "TF IDF" ) );//BINARY would use ALL rather than !any, below
te << savedocumenttermmatrix( Maximum Number of Terms( 100 ), Minimum Term Frequency( 25 ), Weighting( "BINARY" ) ); // 0 or 1 if it occurs
te << closewindow;

// remove all-connected columns
allnames = dt1 << getColumnNames();
For( iname = N Items( allnames ), iname > N Items( originalnames ), iname -= 1, 
//	If( !Any( dt1[0, iname] ),
	If( All( dt1[0, iname] ),
		dt1 << deletecolumns( iname )
	)
);
allnames = dt1 << getColumnNames();
cols = (N Items( originalnames ) + 1) :: N Items( allnames );
//
//dt1 << Hierarchical Cluster(	Y( allnames[cols] ),
//	Label( Transform Column( "Label", Character, Formula( Left( left(LoC Class,4)||:Subject||:Title, 20 ) ) ) ), // build your own identifier here
//	Method( "Ward" ),	Standardize Data( 1 ),	Dendrogram Scale( "Distance Scale" ),
//	Number of Clusters( 4 ),	Constellation Plot( 1 ),	Show Dendrogram( 0 ),
//	SendToReport(Dispatch({"Constellation Plot"},"Clust Hier",FrameBox,{Frame Size( 1056, 716 )}))
//);

// transpose so each term becomes a row and each document a column
dt2 = dt1 << Transpose(
	columns( allnames[cols] ),
	Label( :Title),
	Label column name( "Title" ),
	Output Table( "Transpose of Fully Screened Subset With Abstracts and LatLongs" )
);

dt2<<newcolumn("label",character,formula(Left( :Title, Length( :Title ) - 7 )));

dt2 << Hierarchical Cluster(
	Y( (dt2 << getColumnNames)[2/*title at start*/ :: (N Cols( dt2 )-1/*label at end*/)]),
	Label( dt2:label ),
	Method( "Ward" ),
	Standardize Data( 1 ),
	Show Dendrogram( 0 ),
	Dendrogram Scale( "Distance Scale" ),
	Number of Clusters( 13 ),
	Constellation Plot( 1 ),
	SendToReport( Dispatch( {"Constellation Plot"}, "Clust Hier", FrameBox, {Frame Size( 948, 827 )} ) )
);

Since I added a label column, I had to account for it in the column list.
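
To illustrate the column range used in the Hierarchical Cluster call above, here is a small sketch on a made-up table. The table dtDemo and its column names are hypothetical; only the indexing pattern comes from the script.

```jsl
// Sketch only: dtDemo and its columns are invented for illustration.
dtDemo = New Table( "range demo",
	New Column( "Title", Character, Set Values( {"a", "b"} ) ),
	New Column( "doc 1", Numeric, Set Values( [1, 0] ) ),
	New Column( "doc 2", Numeric, Set Values( [0, 1] ) ),
	New Column( "label", Character, Set Values( {"a", "b"} ) )
);
names = dtDemo << Get Column Names();
first = 2;                    // skip Title, which is column 1
last = N Cols( dtDemo ) - 1;  // skip label, which is the last column
yCols = names[first :: last]; // should leave only the two numeric columns
Show( yCols );
```

If you add more helper columns at the end of the transposed table, the upper bound needs to shrink by the same count.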

There is no real difference between this display and the normal cluster dendrogram display; the circled node in this diagram is the root of the dendrogram, and everything else is fanned out in a circular pattern, which is easier to read. You might need to tell your audience to use the length of the connecting path between words (not just proximity) as a measure of connectedness.

You'll also want to experiment with the Text Explorer settings: the stop words I used were specific to the Project Gutenberg boilerplate that appears in every book, and the minimum word length of 6 and the stemming choice might not be optimal for your documents.
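
As a starting point for that experimentation, a variant of the Text Explorer launch might look like the sketch below. It assumes dt1 and :AbstractProofed from the script above; the stop words and the minimum length of 4 are placeholders, not recommendations.

```jsl
// Sketch only: assumes dt1 and :AbstractProofed from the script above.
// The stop words are placeholders; use terms from your own corpus.
te = dt1 << Text Explorer(
	Text Columns( :AbstractProofed ),
	Add Stop Words( {"abstract", "copyright"} ),
	Minimum Characters per Word( 4 ),  // shorter than the 6 used above
	Stemming( "Stem for Combining" ),  // the alternative noted in the script's comment
	Language( "English" )
);
```

Changing one setting at a time and rerunning the rest of the script makes it easier to see which setting moved the clusters.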

Edit: you might also combine some words; northern, southern, western, and eastern might all mean the same thing for your purpose.
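
One possible way to combine such words, if you would rather not set up recodes in Text Explorer, is to normalize the text in a formula column before launching the platform. This is only a sketch: the column name "AbstractCombined" and the replacement term "regional" are made up, and Substitute does plain substring replacement, so a word like "northeastern" would also be altered.

```jsl
// Sketch only: "AbstractCombined" and "regional" are hypothetical names.
// Substitute replaces substrings, not whole words, so check the results.
dt1 << New Column( "AbstractCombined", Character,
	Formula(
		Substitute( Lowercase( :AbstractProofed ),
			"northern", "regional",
			"southern", "regional",
			"western", "regional",
			"eastern", "regional"
		)
	)
);
```

To use it, create the column before originalnames is captured in the script and point Text Columns at :AbstractCombined instead of :AbstractProofed.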

Craige

caseylott · Apr 9, 2021 12:40 PM

Hi Craige, this worked like a charm! Thank you. I've attached my graphic below. Since I've saved all of my corpus curation steps as column properties (recodes, phrases, stop words, stem exceptions), they get applied automatically when the script is run. I'll start experimenting with some of the parameters to see how they affect the resulting graph. I have a lot of applications where this kind of graphic is helpful, at least during the data exploration phase, so I will use this script over and over. Thank you again for all of your help!

Casey

Keyword cooccurence networks in JMP?