Text Explorer: find similar documents starting from a given document
I have a corpus of thousands of documents that contain the details of problems solved in the past. It's a vast pile of knowledge. Each document also has metadata (of dubious quality, with omissions). Each document has 100+ tokens that tell the story of what was done.
Imagine this use case: through filters and metadata, I've found a document (ID #5678) that is pretty close to the problem I am trying to solve now. How can I easily find all the documents that are similar to #5678?
With my current knowledge of the Text Explorer platform, I could run a Topic Analysis, look up which topic #5678 belongs to (say, #10), and then look at all the documents where Topic = 10.
Is there a better way to do this?
Thanks in advance from stormy Kuala Lumpur, Malaysia!
Accepted Solutions
Re: Text Explorer: find similar documents starting from a given document
Hey, @markschahl!
Topic analysis is one way. I'm also a fan of the SVD plots themselves (example in the image below, and also in JMP Docs 1 and JMP Docs 2). The tendrils going away from the center of the plot tend to have similar themes. The document SVD and term SVD plots provide slightly different views of the data, so you'll want to have a look at both. With any luck, your document of interest will sit far out on a tendril, and its neighbors there should give you some good candidates.
Since you have metadata and tokens, you might also give the Torch add-in a try. It has some language models that might do the trick.
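For anyone who wants to sanity-check the "neighbors of a document" idea outside of Text Explorer, here is a minimal Python sketch. It is not the SVD itself and not anything JMP does internally: it swaps in plain TF-IDF weights with cosine similarity, computed directly on the document-term space that an SVD would otherwise reduce. The function names (`tfidf_vectors`, `cosine`, `most_similar`) and the toy corpus are made up for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse {term: weight} vector per document (smoothed TF-IDF)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(docs, query_index, top_n=3):
    """Rank every other document by similarity to docs[query_index]."""
    vectors = tfidf_vectors(docs)
    query = vectors[query_index]
    scores = [
        (i, cosine(query, vec))
        for i, vec in enumerate(vectors)
        if i != query_index
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_n]


# Hypothetical toy corpus standing in for the real problem descriptions.
docs = [
    "pump seal leak on cooling water pump",
    "compressor vibration high after restart",
    "cooling water pump seal failure and leak",
    "catalyst deactivation in reactor bed",
]
for index, score in most_similar(docs, 0):
    print(index, round(score, 3))
```

With this toy corpus, document 2 ranks first for document 0 because it is the only one that shares any terms with it. On a real corpus you would want proper tokenization and stemming before trusting the ranking.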
Re: Text Explorer: find similar documents starting from a given document
In addition to @MikeD_Anderson's suggestion, you may also get good results using Discriminant and Correspondence Analysis. Here's a blog entry that details the process.
Re: Text Explorer: find similar documents starting from a given document
Mike:
What's interesting is comparing the Topic Analysis to SVD. SVD shows that ~56% are talking about the same thing(s).
I'm rethinking how many topics/clusters I should ask for. Any guidance? Is there a scree-plot equivalent for this?
Re: Text Explorer: find similar documents starting from a given document
Jed:
Thanks. I tried that. Instead of Authors, I used manufacturing platform type (think processing unit type in a refinery or chemical plant). Can we predict the manufacturing platform from the words in the description text? The overall misclassification rate was 60%. Looking at the classification summary, the highest predicted rate was ~0.6. So, it's safe to conclude that one platform's problems are not unique to that platform, which makes sense since a lot of equipment and processes are common across platforms. My hypothesis that we should look across platforms for solutions seems reasonable...
Re: Text Explorer: find similar documents starting from a given document
I think you want to exclude tokens describing the solution because you are looking for similar problems that had different solutions.
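As a rough illustration of that idea, here is a small Python sketch that drops solution-related tokens before any similarity comparison. The term list and the function name `strip_solution_tokens` are hypothetical; in Text Explorer the equivalent step would presumably be handled by treating those terms as stop words.

```python
def strip_solution_tokens(text, solution_terms):
    """Drop tokens that describe the fix so only the problem description remains."""
    blocked = {term.lower() for term in solution_terms}
    return " ".join(tok for tok in text.lower().split() if tok not in blocked)


# Hypothetical list of terms that describe how a problem was solved.
solution_terms = {"solved", "using", "jmp", "doe"}
print(strip_solution_tokens("Seal leak on pump solved using JMP DOE", solution_terms))
```

Two documents about the same problem then look alike even if one was solved with a designed experiment and the other with something else.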
Re: Text Explorer: find similar documents starting from a given document
Craige:
I have plenty of metadata on what tools/techniques were used to solve the problem. I have been cross-checking that against the free-form text story, i.e., can I trust the metadata, or is the story the real truth? I will give your suggestion a try: JMP is a high-frequency token in the corpus...
Hope retirement is treating you well!