Solved: Re: Text Explorer - Need help with Topic Analysis

ar2 · Jan 23, 2018 03:52 AM

Dear all, I am using Text Explorer to analyse some interesting "incident" data from a transport environment. I am using Topic Analysis and have identified about 15 sensible "topics". Is it possible to find out how many documents in my sample set "include" each topic? I haven't found a way to do that.

Any guidance welcome

Thanks

ih · Jan 23, 2018 12:15 PM

You should be able to use the document topic vectors. Maybe someone knows of a quantifiable way to choose the decision points for each vector; I have done that visually and by checking documents:

Names default to here( 1 );

dt = Open( "$Sample_data/Aircraft Incidents.jmp" );

te = dt << Text Explorer(
	Text Columns( :Final Narrative ),
	Latent Semantic Analysis(
		1,
		Maximum Number of Terms( 2128 ),
		Minimum Term Frequency( 10 ),
		Weighting( "TF IDF" ),
		Number of Singular Vectors( 100 ),
		Centering and Scaling( "Centered and Scaled" )
	),
	Topic Analysis( 1, Number of Topics( 10 ) ),
	Tokenizing( "Basic Words" ),
	Language( "English" ),
	SendToReport(
		Dispatch( {}, "Term and Phrase Lists", OutlineBox, {Close( 1 )} ),
		Dispatch( {}, "SVD Plots", OutlineBox, {Close( 1 )} ),
		Dispatch( {}, "Topic Terms", OutlineBox, {Close( 1 )} ),
		Dispatch( {}, "Topic Scores Plots", OutlineBox, {Close( 0 )} )
	)
);

//Save the topic vectors
te << Save Document Topic Vectors;

//Decide what values relate to documents that contain the topic:
dt << Distribution(
	Continuous Distribution( Column( :Topic 1 ) ),
	Continuous Distribution( Column( :Topic 2 ) ),
	Continuous Distribution( Column( :Topic 3 ) ),
	Continuous Distribution( Column( :Topic 4 ) ),
	Continuous Distribution( Column( :Topic 5 ) ),
	Continuous Distribution( Column( :Topic 6 ) ),
	Continuous Distribution( Column( :Topic 7 ) ),
	Continuous Distribution( Column( :Topic 8 ) ),
	Continuous Distribution( Column( :Topic 9 ) ),
	Continuous Distribution( Column( :Topic 10 ) )
);

//Select rows with topic 1
dt << Select where( :Topic 1 > 5 );

//Or, count rows with topic 1:
Sum( (Column( dt, "Topic 1" ) << Get values) > 5 );
//returns 169
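
The same count can be repeated for every topic in a loop. This is only a sketch: it assumes the saved columns are named "Topic 1" through "Topic 10" as in the script above, and it reuses the single cut-off of 5 for all topics purely as a placeholder. In practice each topic would likely need its own value, chosen from the distributions.

```jsl
// Assumes the table already holds the saved topic vector columns
dt = Current Data Table();

nTopics = 10; // matches Number of Topics( 10 ) above
cutoff = 5;   // placeholder; the same value is unlikely to suit every topic

// One count per topic, stored in a column vector
counts = J( nTopics, 1, 0 );
For( i = 1, i <= nTopics, i++,
	vals = Column( dt, "Topic " || Char( i ) ) << Get Values;
	counts[i] = Sum( vals > cutoff );
);

Show( counts );
```

A document can exceed the cut-off for more than one topic, so these counts need not add up to the number of rows.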

ar2 · Jan 23, 2018 01:20 PM

Looks like a good approach. If anyone out there knows a quantifiable way to choose "cut-off" points for the relevance of each topic vector, that would be great.
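
One possible starting point, offered as an assumption rather than an established rule, is to derive each cut-off from that topic's own score distribution, for example the mean plus k standard deviations. The multiplier k = 2 below is hypothetical and nothing in this thread validates it; the resulting cut-offs should still be checked against the documents themselves. The sketch assumes the topic vectors have been saved to the table as columns "Topic 1" through "Topic 10".

```jsl
// Assumes the table already holds the saved topic vector columns
dt = Current Data Table();

nTopics = 10; // number of saved topic columns
k = 2;        // hypothetical multiplier, not a validated choice

For( i = 1, i <= nTopics, i++,
	vals = Column( dt, "Topic " || Char( i ) ) << Get Values;
	// Per-topic cut-off taken from that topic's own scores
	cut = Mean( vals ) + k * Std Dev( vals );
	n = Sum( vals > cut );
	Show( i, cut, n );
);
```

Raising k gives fewer, stronger matches per topic; lowering it includes more borderline documents.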