Cosine Similarity Measure in Text Explorer

Jan 31, 2017 11:44 AM
(889 views)

Is it possible to compute a cosine similarity measure in Text Explorer to identify documents that are "close" to each other? I see that we can cluster documents and do Latent Semantic Analysis, but I don't see any way to compute cosine similarity. Any help will be appreciated.

Rahul

1 ACCEPTED SOLUTION

Feb 2, 2017 1:25 AM
(1675 views)

Solution

Because JMP allows you to save the document term matrix (DTM), you can always calculate this directly. The table attached was made from the 'Aircraft Incidents' sample data. The code below (which took a few seconds to run on my laptop) produced the similarity matrix in the second attached table.

    // https://en.wikipedia.org/wiki/Cosine_similarity
    NamesDefaultToHere(1);
    dt = CurrentDataTable();
    m = dt << getAsMatrix;
    n = NRow(dt);
    // Make some column headings for the final table
    cols = {};
    for(i=1, i<=n, i++,
        InsertInto(cols, "Document in Row "||Char(i));
    );
    // Get the modulus of each feature vector
    modulus = J(n, 1, .);
    for(i=1, i<=n, i++,
        modulus[i] = sqrt(ssq(m[i,0]));
    );
    // Get the cosine of the angle between each pair of feature vectors
    cosTheta = J(n, n, .);
    for(i=1, i<=n, i++,
        for(j=1, j<=i, j++,
            cosTheta[i,j] = Sum(m[i, 0] :* m[j, 0])/(modulus[i] * modulus[j]);
        );
    );
    dt2 = AsTable(cosTheta, << ColumnNames(cols));
    dt2 << setName("Cosine between feature vectors in "||(dt << getName));
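For anyone who wants to check the arithmetic outside JMP, here is a minimal Python sketch of the same calculation (Python, not JSL, and not part of the original script). The toy matrix and the function name `cosine_similarity_matrix` are made up for illustration. It makes two choices the JSL above does not: it mirrors each value into both triangles of the result, and it leaves the entry as NaN when a document has no terms, to avoid dividing by zero.

```python
import math

def cosine_similarity_matrix(dtm):
    """Return the full n x n matrix of cosines between the rows of dtm."""
    # Length (modulus) of each document's term-count vector.
    norms = [math.sqrt(sum(x * x for x in row)) for row in dtm]
    n = len(dtm)
    cos = [[float("nan")] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            denom = norms[i] * norms[j]
            if denom > 0:  # skip empty documents instead of dividing by zero
                dot = sum(a * b for a, b in zip(dtm[i], dtm[j]))
                cos[i][j] = cos[j][i] = dot / denom  # fill both triangles
    return cos

# Hypothetical toy DTM: 3 documents (rows) by 4 terms (columns).
toy_dtm = [
    [1, 1, 0, 0],
    [2, 2, 0, 0],
    [0, 0, 3, 0],
]
sim = cosine_similarity_matrix(toy_dtm)
```

In this toy example, documents 1 and 2 use the same terms in the same proportions, so their cosine is 1 up to floating-point rounding. Document 3 shares no terms with either, so its cosine with them is 0.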

3 REPLIES

Feb 2, 2017 8:41 AM
(827 views)

Thanks for the help. That is what I was thinking of doing: get the matrix and do it myself. I was wondering whether it is built in?

Rahul

2 weeks ago
(123 views)

This is an excellent question! A cosine similarity coefficient is identical to a correlation coefficient when the vectors are centered (i.e., have a mean of zero). Traditionally, in the information retrieval field, a document term matrix (DTM) is created and a singular value decomposition (SVD) is done directly on it, without centering or standardizing (i.e., centering and scaling) the vectors. Analogously, the cosine similarity is a measure of association between non-centered vectors.
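The link between the two coefficients can be checked numerically. The sketch below is plain Python rather than JSL, and the vectors `u` and `v` are made-up values for illustration. It computes the cosine of the raw vectors, the cosine after centering, and the Pearson correlation.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def center(u):
    """Subtract the mean so the vector has mean zero."""
    mean = sum(u) / len(u)
    return [a - mean for a in u]

def pearson(u, v):
    """Pearson correlation coefficient, written out from its definition."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)

# Hypothetical term counts for two documents.
u = [1, 0, 2, 3]
v = [0, 1, 2, 4]

raw_cosine = cosine(u, v)                        # non-centered vectors
centered_cosine = cosine(center(u), center(v))   # centered vectors
correlation = pearson(u, v)
```

Here `centered_cosine` and `correlation` agree up to rounding, while `raw_cosine` is a different number. So the two measures coincide only once the vectors are centered.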

However, there are important advantages to centering the DTM. Specifically, the first singular vector from the SVD (your first dimension or topic) won't necessarily be the most important singular vector in the multidimensional space if centering hasn't been done. This is one reason why JMP allows users to center and/or standardize the DTM for use in Latent Semantic Analysis or Topic Analysis. Similarly, correlation coefficients are obtained from the centered and scaled (i.e., standardized) DTM.

In sum, you can easily obtain a measure of vector similarity by saving the DTM to your data table, going to Analyze > Multivariate Methods > Multivariate, adding all your DTM columns to the Y role, and clicking OK. The resulting correlation matrix shows the degree of similarity between the vectors: a value of 1 indicates two vectors that, once centered, point in the same direction; a value of -1 indicates vectors pointing in opposite directions; and values near 0 indicate little linear association. Finally, because the DTM columns are terms, you'll have to transpose your data before following the steps above if you want the similarity between documents instead of terms.
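To show why the transpose matters, here is a small Python sketch (again not JSL, and with a made-up toy DTM). It assumes the correlation matrix is computed between the columns of whatever table it is given, as described above. It is run once on the DTM as saved and once on its transpose.

```python
import math

def pearson(u, v):
    """Pearson correlation between two equal-length vectors."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    return cov / math.sqrt(sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v))

def columns_of(table):
    """Return the columns of a row-major table as lists."""
    return [list(col) for col in zip(*table)]

def column_correlations(table):
    """Correlation matrix between the columns of a table."""
    cols = columns_of(table)
    return [[pearson(a, b) for b in cols] for a in cols]

# Hypothetical toy DTM: 3 documents (rows) by 4 terms (columns).
toy_dtm = [
    [2, 0, 1, 0],
    [4, 0, 2, 1],
    [0, 3, 0, 1],
]

term_corr = column_correlations(toy_dtm)     # 4 x 4: similarity between terms
transposed = columns_of(toy_dtm)             # rows are now terms, columns are documents
doc_corr = column_correlations(transposed)   # 3 x 3: similarity between documents
```

Without the transpose the result is a 4 x 4 matrix comparing terms. After it, the result is a 3 x 3 matrix comparing documents. Note that a column with no variation, such as a term that appears equally often in every document, would make the correlation undefined, and this sketch does not guard against that.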

HTH,

~Laura

P.S. If you haven't upgraded to JMP 13.2, I strongly suggest you do so. Please take a look at my post here to learn about the improvements to Text Explorer in 13.2.

Laura C-S