I used LCA in the Text Explorer platform to cluster my text data, but every time I run it, it gives me different results. Does anyone know why this happens?
Also, what is the difference between these two clustering methods in the Text Explorer platform: latent class analysis (LCA) and clustering documents in latent semantic analysis (LSA)?
LCA uses random seeds to begin the clustering process. I think you can set the random seed before each LCA run and reproduce a previous fit.
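As a quick illustration of what resetting the seed does, here is a minimal JSL sketch that is independent of Text Explorer. The seed value 123 and the variable names are arbitrary choices for the example. It draws one random number, resets the seed to the same value, draws again, and checks that the two draws match.

```jsl
Random Reset( 123 );        // fix the seed
first = Random Uniform();   // first draw after seeding

Random Reset( 123 );        // reset to the same seed
second = Random Uniform();  // should repeat the first draw

Show( first == second );    // 1 means the draws are identical
```

The same idea applies to LCA: if the seed is reset to the same value immediately before each launch, the random starting clusters should be the same each time.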
Thank you! It works!
Also, I'm curious about the difference between LCA in the Text Explorer platform and Cluster Documents in Latent Semantic Analysis. Any idea?
The LCA platform and the LCA available within the Text Explorer (TE) platform accomplish the same task. The LCA platform is a general tool for any multivariate data set. The LCA embedded within TE, however, has been customized for text analysis. First, the clustering results are presented in the context of finding similar documents in the corpus. Second, the sparse document-term matrix (DTM) requires a new solution to the singular value decomposition.
I did not answer one of your original questions about the difference between latent class analysis and latent semantic analysis. Both of these methods produce clusters. Both methods are based on the expression of latent variables. LCA clusters documents based on the weighted document-term matrix, so the question is about similar documents. LSA clusters terms, also based on the weighted DTM, so the question is about terms. The clusters from LSA can identify latent topics.
I also noticed that LSA can cluster documents. Does it give different results than the clusters in LCA?
Well, both methods use random seeds for the initial clusters so there is the run-to-run difference that you observed.
The dedicated LCA method in TE can handle much bigger matrices. The numerics might result in a difference, aside from the random seed aspect.
Have you tried it? You can save the DTM with weighting from TE and then analyze it with the LCA platform separate from TE.
Please note that the identity of the clusters is random, but the composition of each cluster should be stable, though not necessarily identical. That is, cluster 1 in one run might become cluster 10 in another run or on another platform, but the constituents should be essentially the same. If there is not much similarity among documents, then there might be large changes in the clusters from run to run or platform to platform. The choice for the number of clusters can also affect the stability of the cluster composition.
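One way to check whether two runs found essentially the same clusters, regardless of how the clusters are numbered, is to look at pairs of documents and count how often both runs agree on whether the pair belongs together (this is the Rand index). Here is a minimal JSL sketch of that idea. The vectors run1 and run2 are made-up cluster assignments for six documents, not output from any real fit; in practice you would fill them from cluster assignments you saved from each run.

```jsl
// Hypothetical cluster labels for the same six documents from two runs.
// The numbering differs, but the groupings are the same.
run1 = [1, 1, 2, 2, 3, 3];
run2 = [3, 3, 1, 1, 2, 2];

n = N Rows( run1 );
agree = 0;   // pairs on which both runs agree (together or apart)
pairs = 0;   // total number of document pairs

For( i = 1, i <= n - 1, i++,
	For( j = i + 1, j <= n, j++,
		same1 = (run1[i] == run1[j]);  // together in run 1?
		same2 = (run2[i] == run2[j]);  // together in run 2?
		If( same1 == same2, agree++ );
		pairs++;
	)
);

randIndex = agree / pairs;  // 1 means identical groupings
Show( randIndex );
```

A value near 1 suggests that only the cluster labels changed between runs. A much lower value suggests that the composition itself is unstable, for example because the documents are not very similar or the number of clusters is a poor fit.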
You can also find a lot of answers in the JMP documentation.
See Help > Books > Basic Analysis > Text Explorer.
See Help > Books > Multivariate Methods > Latent Class Analysis.
I thought I would post a response I got from JMP Technical Support on setting the random seed.
"To generate reproducible results from Latent Class Analysis in Text Explorer, you must set the random seed before each run using the Random Reset() JSL function.
Here is an example using the Pet Survey sample data that fits the LCA five times, resetting the random seed before each run. All five LCA results should be identical."
dt = Open( "$SAMPLE_DATA/Pet Survey.jmp" );
// Launch Text Explorer on the survey text column
te = Text Explorer(
	Text Columns( :Survey Response ),
	Set Regex( Library( "Words" ) ),
	Language( "English" )
);
// Run LCA five times with the same seed
For( i = 1, i <= 5, i++,
	Random Reset( 123 ); // set the random seed before each run
	lca = te << Latent Class Analysis(
		Number of Clusters( 5 ),
		Maximum Number of Terms( 143 ),
		Minimum Term Frequency( 2 )
	);
);