markschahl
Level V

Text Explorer: find similar documents starting from a given document

So, I have a corpus of thousands of documents that contain the details of problems that have been solved in the past. It's a vast pile of knowledge. Each document also has metadata (of dubious quality, with omissions). The documents have 100+ tokens that tell the story of what was done.

Imagine this use case: through filters and metadata, I've found a document (ID #5678) that is pretty close to the problem that I am trying to solve now. How can I easily find all the documents that are similar to #5678?

With my current knowledge of the Text Explorer platform, I could do a Topic Analysis, then look up which topic #5678 belongs to (say, #10), then look at all the documents where Topic=10.

Is there a better way to do this?

 

Thanks in advance from stormy Kuala Lumpur, Malaysia!

Accepted Solutions

Re: Text Explorer: find similar documents starting from a given document

Hey, @markschahl!

Topic analysis is one way. I'm also a fan of the SVD plots themselves (see the example in the image below, as well as JMP Docs 1 and JMP Docs 2). The tendrils going away from the center of the plot tend to have similar themes. The document SVD and term SVD plots provide slightly different views of the data, so you'll want to have a look at both. With any luck, your document of interest will sit far out on a tendril, which should give you some good candidates.

MikeD_Anderson_1-1718368454088.png

Since you have metadata and tokens, you might also give the Torch add-in a try. It has some language models that might do the trick.
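
For readers who want to see the underlying idea outside of JMP, here is a minimal Python sketch of ranking documents by similarity to one chosen document. It is not what Text Explorer does internally: it uses plain TF-IDF weights and cosine similarity rather than SVD, and the function names (`tokenize`, `tfidf_vectors`, `most_similar`), the sample texts, and every ID except #5678 are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text, stop_terms=frozenset()):
    """Lowercase the text, keep alphabetic tokens, and drop any stop terms."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in stop_terms]


def tfidf_vectors(docs, stop_terms=frozenset()):
    """Map each document ID to a sparse {term: weight} TF-IDF vector."""
    tokens = {doc_id: tokenize(text, stop_terms) for doc_id, text in docs.items()}
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for toks in tokens.values() for term in set(toks))
    vectors = {}
    for doc_id, toks in tokens.items():
        counts = Counter(toks)
        # Raw term count times a smoothed inverse document frequency.
        vectors[doc_id] = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        }
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(docs, target_id, top_n=5, stop_terms=frozenset()):
    """Return up to top_n (doc_id, score) pairs, most similar first."""
    vectors = tfidf_vectors(docs, stop_terms)
    target = vectors[target_id]
    scores = [
        (other_id, cosine(target, vec))
        for other_id, vec in vectors.items()
        if other_id != target_id
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


# Hypothetical mini-corpus; only the ID 5678 comes from the question.
docs = {
    5678: "pump seal leaking on the feed pump after startup",
    1001: "feed pump seal failure and leak during startup",
    1002: "control valve sticking on the reflux line",
    1003: "invoice approval delayed by missing signature",
}
ranking = most_similar(docs, target_id=5678, top_n=3)
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

The `stop_terms` argument is there so that uninformative or unwanted tokens can be left out before the vectors are built. On a real corpus the same ranking could be done on the SVD document vectors instead of raw TF-IDF, which is closer in spirit to reading neighbors off the SVD plot.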


Re: Text Explorer: find similar documents starting from a given document

In addition to @MikeD_Anderson's suggestion, you may also get good results using Discriminant and Correspondence Analysis. Here's a blog entry that details the process.


6 REPLIES 6

markschahl
Level V

Re: Text Explorer: find similar documents starting from a given document

Mike:

What's interesting is comparing the Topic Analysis to SVD. SVD shows that ~56% of the documents are talking about the same thing(s).
I'm rethinking how many topics/clusters I should ask for. Any guidance? Is there a scree-plot equivalent for this?

 

Distribution of Topic.Cluster, SVD.Cluster.png
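
One way to picture a scree-style check, sketched in Python with NumPy rather than in JMP: take the singular values of the document-term matrix and look at how quickly their squared values accumulate. This is only an illustration of the idea and makes no claim about how Text Explorer scales or centers the matrix; the helper names (`scree_table`, `components_for_share`), the 0.80 cutoff, and the toy matrix are all assumptions.

```python
import numpy as np


def scree_table(doc_term_matrix):
    """Return the singular values and the cumulative share of their squares."""
    matrix = np.asarray(doc_term_matrix, dtype=float)
    singular_values = np.linalg.svd(matrix, compute_uv=False)  # sorted descending
    energy = singular_values ** 2
    cumulative = np.cumsum(energy) / energy.sum()
    return singular_values, cumulative


def components_for_share(cumulative, share=0.90):
    """Smallest number of leading components whose cumulative share reaches `share`."""
    return int(np.searchsorted(cumulative, share) + 1)


# Toy document-term counts with two obvious themes (rows = documents, columns = terms).
counts = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 3, 1],
    [0, 0, 1, 3],
]
singular_values, cumulative = scree_table(counts)
n_keep = components_for_share(cumulative, share=0.80)
print(singular_values, cumulative, n_keep)
```

Plotting `singular_values` against their index gives the familiar scree shape; the elbow, or the point where `cumulative` flattens, is a rough guide to how many dimensions or clusters are worth asking for.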

markschahl
Level V

Re: Text Explorer: find similar documents starting from a given document

Jed:

Thanks. I tried that. Instead of Authors, I had manufacturing platform type (think processing unit type in a refinery or chemical plant). Can we predict the manufacturing platform from the words in the description text? The overall misclassification rate was 60%. Looking at the classification summary, the highest predicted rate was ~0.6. So, it's safe to conclude that one platform's problems are not unique to that platform, which makes sense since a lot of equipment and processes are common across platforms. My hypothesis that we should look across platforms for solutions seems reasonable...

Craige_Hales
Super User

Re: Text Explorer: find similar documents starting from a given document

I think you want to exclude tokens describing the solution because you are looking for similar problems that had different solutions.

Craige
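
A small Python sketch of why this matters, using made-up texts and a hypothetical list of solution terms (`SOLUTION_TERMS`): two documents that describe the same problem but different fixes look more alike once the solution vocabulary is removed. Jaccard overlap on token sets is used here only because it is easy to check by hand; it is not the measure Text Explorer uses.

```python
import re

# Hypothetical terms that describe the fix rather than the problem.
SOLUTION_TERMS = frozenset(
    {"fixed", "by", "rebalancing", "rotor", "replacing", "bearing"}
)


def token_set(text, exclude=frozenset()):
    """Lowercase alphabetic tokens as a set, minus any excluded terms."""
    return {t for t in re.findall(r"[a-z]+", text.lower()) if t not in exclude}


def jaccard(a, b):
    """Shared tokens divided by all distinct tokens in either set."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


doc_a = "compressor vibration high, fixed by rebalancing rotor"
doc_b = "compressor vibration high, fixed by replacing bearing"

with_solutions = jaccard(token_set(doc_a), token_set(doc_b))
problems_only = jaccard(
    token_set(doc_a, SOLUTION_TERMS), token_set(doc_b, SOLUTION_TERMS)
)
print(round(with_solutions, 3), problems_only)
```

In Text Explorer itself, the equivalent step would presumably be adding such terms to the stop word list before running the analysis; which terms count as "solution" vocabulary has to come from knowledge of the corpus.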
markschahl
Level V

Re: Text Explorer: find similar documents starting from a given document

Craige:

I have plenty of metadata on what tools/techniques were used to solve the problem. I have been cross-checking that against the free-form text story, i.e., can I trust the metadata, or is the story the real truth? I will give your suggestion a try: JMP is a high-frequency token in the corpus...

Hope retirement is treating you well!