Discussions

markschahl · Jun 12, 2024 03:40 AM

So, I have a corpus of thousands of documents that contain the details of problems that have been solved in the past. It's a vast pile of knowledge. Each document also has metadata (of dubious quality, with omissions). Each document has 100+ tokens that tell the story of what was done.

Imagine this use case: through filters and metadata, I've found a document (ID #5678) that is pretty close to the problem I am trying to solve now. How can I easily find all the documents that are similar to #5678?

With my current knowledge of the Text Explorer platform, I could do a Topic Analysis, look up which topic #5678 belongs to (say, #10), and then look at all the documents where Topic = 10.

Is there a better way to do this?

Thanks in advance from stormy Kuala Lumpur, Malaysia!

MikeD_Anderson · Jun 14, 2024 08:39 AM

Hey, @markschahl!

Topic analysis is one way. I'm also a fan of the SVD plots themselves (example in the image below; see also JMP Docs 1 and JMP Docs 2). The tendrils extending away from the center of the plot tend to share similar themes. The document SVD and term SVD plots provide slightly different views of the data, so you'll want to look at both. With any luck, your document of interest will sit far out on a tendril, which should give you some good candidates.

Since you have metadata and tokens, you might also give the Torch add-in a try. It has some language models that might do the trick.
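To make the "nearby documents" idea concrete outside JMP, here is a minimal Python sketch that ranks every document by cosine similarity to a seed document. It is not the Text Explorer workflow: it swaps in plain TF-IDF vectors instead of SVD document vectors, and the helper names (`tfidf_vectors`, `most_similar`) and the sample texts are made up for illustration. Only the ID 5678 comes from the question.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return {doc_id: {term: weight}} using term counts times a smoothed IDF."""
    counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    n_docs = len(docs)
    doc_freq = Counter(term for c in counts.values() for term in c)
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return {d: {t: n * idf[t] for t, n in c.items()} for d, c in counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def most_similar(docs, seed_id, top_n=5):
    """Rank all other documents by similarity to the seed document."""
    vectors = tfidf_vectors(docs)
    seed = vectors[seed_id]
    scores = [(d, cosine(seed, v)) for d, v in vectors.items() if d != seed_id]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


# Made-up documents for illustration only.
docs = {
    5678: "pump seal leak caused vibration on the compressor",
    1001: "compressor vibration traced to a worn pump seal",
    1002: "catalyst deactivation reduced reactor yield",
    1003: "heat exchanger fouling raised the pressure drop",
}
ranked = most_similar(docs, seed_id=5678, top_n=3)
for doc_id, score in ranked:
    print(doc_id, round(score, 3))
```

The same ranking idea should carry over to SVD document vectors if they are saved to a table: replace the TF-IDF dicts with each document's row of SVD coordinates and keep the cosine step.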

Jed_Campbell · Jun 14, 2024 10:55 AM

In addition to @MikeD_Anderson's suggestion, you may also get good results using Discriminant and Correspondence Analysis. Here's a blog entry that details the process.

markschahl · Jun 18, 2024 12:56 AM

Mike:

What's interesting is comparing the Topic Analysis to the SVD. The SVD shows that ~56% of the documents are talking about the same thing(s).
I'm rethinking how many topics/clusters I should ask for. Any guidance? Is there a scree-plot equivalent for this?

markschahl · Jun 18, 2024 12:47 AM

Jed:

Thanks. I tried that. Instead of Authors, I used manufacturing platform type (think processing unit type in a refinery or chemical plant). Can we predict the manufacturing platform from the words in the description text? The overall misclassification rate was 60%. Looking at the classification summary, the highest predicted rate was ~0.6. So it's safe to conclude that one platform's problems are not unique to that platform, which makes sense since a lot of equipment and processes are common across platforms. My hypothesis that we should look across platforms for solutions seems reasonable...
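As a reminder of how those two numbers relate, here is a small Python sketch that computes an overall misclassification rate and the per-class share of correct predictions from actual and predicted labels. The labels and predictions are made up for illustration (they are not the results above), and the function names are hypothetical.

```python
from collections import Counter


def misclassification_rate(actual, predicted):
    """Fraction of rows where the predicted label differs from the actual one."""
    wrong = sum(a != p for a, p in zip(actual, predicted))
    return wrong / len(actual)


def per_class_correct(actual, predicted):
    """For each actual label, the fraction of its rows predicted correctly."""
    totals = Counter(actual)
    correct = Counter(a for a, p in zip(actual, predicted) if a == p)
    return {label: correct[label] / totals[label] for label in totals}


# Made-up platform labels and predictions.
actual = ["unit_a"] * 4 + ["unit_b"] * 3 + ["unit_c"] * 3
predicted = [
    "unit_a", "unit_a", "unit_b", "unit_c",  # actual unit_a
    "unit_b", "unit_c", "unit_a",            # actual unit_b
    "unit_c", "unit_a", "unit_b",            # actual unit_c
]
overall = misclassification_rate(actual, predicted)
by_class = per_class_correct(actual, predicted)
print(overall, by_class)
```

A high overall rate combined with no class standing out is consistent with the reading that the vocabulary is shared across platforms, though it does not prove it on its own.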

Craige_Hales · Jun 15, 2024 10:16 PM

I think you want to exclude tokens describing the solution because you are looking for similar problems that had different solutions.

Craige
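A minimal Python sketch of that idea, under stated assumptions: tokens that describe the solution are dropped before comparing documents, so two write-ups of the same problem with different fixes score as more alike. The exclusion list, the sample texts, and the use of Jaccard overlap are all illustrative choices, not something from the thread or from Text Explorer.

```python
import re

# Hypothetical list of solution-related terms to exclude.
SOLUTION_TERMS = {"replaced", "gasket", "retuned", "controller"}


def token_set(text, exclude=frozenset()):
    """Set of lowercase tokens in the text, minus any excluded terms."""
    return {t for t in re.findall(r"[a-z]+", text.lower()) if t not in exclude}


def jaccard(a, b):
    """Jaccard overlap between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Made-up documents: same problem, different solutions.
doc_a = "pump seal leak replaced gasket"
doc_b = "pump seal leak retuned controller"

before = jaccard(token_set(doc_a), token_set(doc_b))
after = jaccard(token_set(doc_a, SOLUTION_TERMS), token_set(doc_b, SOLUTION_TERMS))
print(round(before, 3), round(after, 3))
```

In practice the exclusion list would have to be built from the corpus itself, for example from the tool and technique names already recorded in the metadata.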

markschahl · Jun 18, 2024 06:47 PM

Craige:

I have plenty of metadata on what tools/techniques were used to solve the problem. I have been cross-checking that against the free-form text story, i.e., can I trust the metadata, or is the story the real truth? I will give your suggestion a try: JMP is a high-frequency token in the corpus...

Hope retirement is treating you well!

Text Explorer: find similar documents starting from a given document
