Very nice!
Stemming might help combine some of the near-duplicates.
The problem appears to be JMP 15 vs 16; you can make your own label column instead:
// load some documents that might separate into some categories
dt1 = Open( "z:/Fully Screened Subset With Abstracts and LatLongs.jmp" ); // 5000 is too many, subset 36...
originalnames = dt1 << getColumnNames();
te = dt1 << Text Explorer(
    Text Columns( :AbstractProofed ), // the entire document is in one cell of the row
    Add Stop Words(
        {"agreement", "almost", "another", "company", "copyright holder", "electronic", "foundation", "gutenberg", "literary archive", "little",
        "person or entity", "project", "public domain", "research", "without", "ebooks", "enough", "though", "rather", "better", "common", "possible",
        "weight", "present", "series", "necessary", "placed", "therefore", "towards", "footnote", "something", "slowly", "around", "behind", "looking", "seemed", "nothing",
        "probably", "called", "easily", "distributing", "paragraph"
        }
    ),
    Minimum Characters per Word( 6 ),
    Stemming( "no stemming" ), //"Stem for Combining"
    Language( "English" )
);
//te << savedocumenttermmatrix( Maximum Number of Terms( 300 ), Minimum Term Frequency( 25 ), Weighting( "TF IDF" ) );//BINARY would use ALL rather than !any, below
te << savedocumenttermmatrix( Maximum Number of Terms( 100 ), Minimum Term Frequency( 25 ), Weighting( "BINARY" ) ); // 0 or 1 if it occurs
te << closewindow;
// remove all-connected columns
allnames = dt1 << getColumnNames();
For( iname = N Items( allnames ), iname > N Items( originalnames ), iname -= 1,
    // If( !Any( dt1[0, iname] ),
    If( All( dt1[0, iname] ), // with BINARY weighting, all 1s means the term occurs in every document
        dt1 << deletecolumns( iname )
    )
);
allnames = dt1 << getColumnNames();
cols = (N Items( originalnames ) + 1) :: N Items( allnames );
//
//dt1 << Hierarchical Cluster( Y( allnames[cols] ),
// Label( Transform Column( "Label", Character, Formula( Left( left(LoC Class,4)||:Subject||:Title, 20 ) ) ) ), // build your own identifier here
// Method( "Ward" ), Standardize Data( 1 ), Dendrogram Scale( "Distance Scale" ),
// Number of Clusters( 4 ), Constellation Plot( 1 ), Show Dendrogram( 0 ),
// SendToReport(Dispatch({"Constellation Plot"},"Clust Hier",FrameBox,{Frame Size( 1056, 716 )}))
//);
dt2 = dt1 << Transpose(
    columns( allnames[cols] ),
    Label( :Title ),
    Label column name( "Title" ),
    Output Table( "Transpose of Fully Screened Subset With Abstracts and LatLongs" )
);
// trim the last 7 characters of each term's column name (presumably the " Binary" weighting suffix)
dt2 << New Column( "label", Character, Formula( Left( :Title, Length( :Title ) - 7 ) ) );
dt2 << Hierarchical Cluster(
    Y( (dt2 << getColumnNames)[2/*title at start*/ :: (N Cols( dt2 ) - 1/*label at end*/)] ),
    Label( dt2:label ),
    Method( "Ward" ),
    Standardize Data( 1 ),
    Show Dendrogram( 0 ),
    Dendrogram Scale( "Distance Scale" ),
    Number of Clusters( 13 ),
    Constellation Plot( 1 ),
    SendToReport( Dispatch( {"Constellation Plot"}, "Clust Hier", FrameBox, {Frame Size( 948, 827 )} ) )
);
Since I added a label column at the end of the transposed table, I had to account for it in the Y column list (and skip the Title column at the start).
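Here is a small sketch of that index range on its own, using a made-up list of names in place of the real table (the "Doc 1" style names, iFirst, and iLast are hypothetical, just for illustration):

```jsl
// Hypothetical column names standing in for dt2 << Get Column Names
names = {"Title", "Doc 1", "Doc 2", "Doc 3", "label"};
iFirst = 2;                      // skip "Title" at the start
iLast = N Items( names ) - 1;    // skip "label" at the end
ycols = names[iFirst :: iLast];  // the index matrix iFirst::iLast picks the entries in between
Show( ycols );
```

If you add more helper columns to the transposed table, the two offsets have to change to match.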
There is no real difference between this display and the normal cluster dendrogram display; the circled node in this diagram is the root of the dendrogram, and everything else is just fanned out around it in a circular pattern, which is easier to read. You might need to tell your audience to judge connectedness by the length of the connecting path between words, not just by how close they sit on the plot.
You'll also want to play with the Text Explorer part. The stop words I used were specific to the Project Gutenberg boilerplate that appears in every book, and the minimum word length of 6 and the stemming choice might not be optimal for your docs.
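For example, a variation on the launch in the script might look like this. It is only a sketch: the word length of 4 is an arbitrary starting point, not a recommendation, and I left the stop word list out so you can build your own.

```jsl
// Variation on the Text Explorer launch: shorter words allowed, stemming turned on.
// Assumes dt1 is the table opened earlier in the script.
te = dt1 << Text Explorer(
    Text Columns( :AbstractProofed ),
    Minimum Characters per Word( 4 ),  // arbitrary value to experiment with
    Stemming( "Stem for Combining" ),  // the alternative noted in the comment in the script
    Language( "English" )
);
```

Rerun the document term matrix and the cluster steps after each change and see whether the groups make more sense.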
Edit: and maybe combine some words: northern, southern, western, and eastern might all mean the same thing for your purpose?
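One way to try that, as a rough sketch, is to rewrite the text before Text Explorer sees it. The sample sentence, the "compassdir" token, and the "AbstractCombined" column name are all made up for illustration. Text Explorer also lets you recode terms interactively, which may be simpler.

```jsl
// Hypothetical example text; "compassdir" is a made-up replacement token
txt = "Northern sites differ from southern and eastern ones.";
combined = Substitute( Lowercase( txt ),
    "northern", "compassdir",
    "southern", "compassdir",
    "eastern", "compassdir",
    "western", "compassdir"
);
Show( combined );

// The same idea on the abstracts, in a new column so the original text is untouched.
// Assumes dt1 is the table opened earlier in the script.
dt1 << New Column( "AbstractCombined", Character,
    Formula(
        Substitute( Lowercase( :AbstractProofed ),
            "northern", "compassdir",
            "southern", "compassdir",
            "eastern", "compassdir",
            "western", "compassdir"
        )
    )
);
```

Then point Text Columns at :AbstractCombined instead of :AbstractProofed. This is plain substring replacement, so a word like "northeastern" gets partly rewritten too; check the term list afterward.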
Craige