Solved: Re: Cosine Similarity Measure in Text Explorer

Rahul · Feb 1, 2017 7:23 AM

Is it possible to compute a cosine similarity measure in Text Explorer to identify documents that are "close" to each other? I see that we can cluster documents and do latent semantic analysis, but I don't see any way to compute cosine similarity. Any help will be appreciated.

Rahul

ian_jmp · Feb 2, 2017 6:08 AM

Because JMP allows you to save the document term matrix (DTM), you can always calculate this directly. The table attached was made from the 'Aircraft Incidents' sample data. The code below (which took a few seconds to run on my laptop) produced the similarity matrix in the second attached table.

// https://en.wikipedia.org/wiki/Cosine_similarity
NamesDefaultToHere(1);

dt = CurrentDataTable();
m = dt << getAsMatrix;
n = NRow(dt);

// Make some column headings for the final table
cols = {};
for(i=1, i<=n, i++,
	InsertInto(cols, "Document in Row "||Char(i));
);

// Get the modulus of each feature vector
modulus = J(n, 1, .);
for(i=1, i<=n, i++,
	modulus[i] = sqrt(ssq(m[i,0]));
);

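// Get the cosine of the angle between each pair of feature vectors.
// Only the lower triangle (j <= i) is filled; the matrix is symmetric, so the upper triangle is left missing.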
cosTheta = J(n, n, .);
for(i=1, i<=n, i++,
	for(j=1, j<=i, j++,
		cosTheta[i,j] = Sum(m[i, 0] :* m[j, 0])/(modulus[i] * modulus[j]);
	);
);
dt2 = AsTable(cosTheta, << ColumnNames(cols));
dt2 << setName("Cosine between feature vectors in "||(dt << getName));
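
For larger tables, the same calculation can be written without explicit loops. The following is a minimal sketch, not part of the original reply: it uses a small made-up matrix in place of the saved DTM and assumes every row has at least one nonzero entry (an all-zero row has length zero, so its cosines are undefined).

```jsl
Names Default To Here( 1 );

// Made-up document term matrix: 3 documents (rows) x 3 terms (columns)
m = [1 0 2, 0 3 0, 2 0 4];

// Euclidean length of each row, as an n x 1 column vector
norms = Sqrt( (m :* m) * J( N Col( m ), 1, 1 ) );

// All pairwise dot products, divided elementwise by the products of the lengths
cosTheta = (m * Transpose( m )) :/ (norms * Transpose( norms ));

Show( cosTheta );
```

Rows 1 and 3 of the made-up matrix are parallel, so their cosine comes out as 1 up to rounding, and row 2 shares no terms with either, so its cosines with them are 0. To apply this to a saved DTM, replace the literal with m = dt << getAsMatrix as in the script above.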

Rahul · Feb 2, 2017 11:41 AM

Thanks for the help. That is what I was thinking of doing: get the matrix and do it myself. I was wondering, is it built in?

Rahul

LauraCS · Sep 6, 2017 01:20 PM

This is an excellent question! A cosine similarity coefficient will be identical to a correlation coefficient when the vectors considered are centered (i.e., have a mean of zero). Traditionally, in the information retrieval field, a document term matrix (DTM) is created and a singular value decomposition (SVD) is done directly on it, without centering or standardizing (i.e., centering and scaling) the vectors. Analogous to this, the cosine similarity is a measure of association with non-centered vectors.
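
To make the relationship concrete, here is a small sketch with two made-up vectors. The helper name cosine is hypothetical, and the script assumes the Correlation() matrix function is available in your JMP version.

```jsl
Names Default To Here( 1 );

// Two made-up column vectors
x = [1, 2, 4, 7];
y = [2, 1, 5, 6];

// Hypothetical helper: cosine of the angle between two vectors
cosine = Function( {a, b}, {Default Local},
	Sum( a :* b ) / Sqrt( Sum( a :* a ) * Sum( b :* b ) )
);

// Center each vector by subtracting its mean
xc = x - Sum( x ) / N Row( x );
yc = y - Sum( y ) / N Row( y );

// Correlation matrix of the two vectors treated as columns
r = Correlation( x || y );

// The second and third values should agree; the first generally differs
Show( cosine( x, y ), cosine( xc, yc ), r[1, 2] );
```

The cosine of the raw vectors differs from the correlation, while the cosine of the centered vectors matches it.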

However, there are important advantages to centering the DTM. Specifically, the first singular vector from the SVD (your first dimension or topic) won't necessarily be the most important singular vector in the multidimensional space if centering hasn't been done. This is one reason why JMP allows users to center and/or standardize the DTM for use in Latent Semantic Analysis or Topic Analysis. Similarly, correlation coefficients are obtained from the centered and scaled (i.e., standardized) DTM.

In sum, you can easily obtain a measure of vector similarity by saving the DTM to your data table, going to Analyze > Multivariate Methods > Multivariate, adding all your DTM columns to the Y role, and clicking OK. The resulting correlation matrix shows the degree of similarity between the vectors: a value of 1 indicates two vectors that point in the same direction once centered, a value of -1 indicates two centered vectors that point in opposite directions, and values near 0 indicate little linear association. Finally, because the platform correlates columns (terms), you'll have to transpose your data before following the steps above if you want the similarity between documents instead of terms.
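
If you would rather script this than use the menus, the same correlations can be sketched on the DTM as a matrix. This is an illustration with a made-up matrix, assuming the Correlation() matrix function is available; it is not the output of the Multivariate platform itself.

```jsl
Names Default To Here( 1 );

// Made-up DTM: 3 documents (rows) x 4 terms (columns)
m = [1 0 2 1, 0 3 0 2, 2 0 4 0];

// Correlation() works on columns, so this compares terms (4 x 4)
termCorr = Correlation( m );

// Transposing first makes the documents the columns (3 x 3)
docCorr = Correlation( Transpose( m ) );

Show( termCorr, docCorr );
```

A row or column with no variation (for example, a term that appears equally often in every document) has no defined correlation, so such entries will come back as missing.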

HTH,

~Laura

P.S. If you haven't upgraded to JMP 13.2 I strongly suggest you do so. Please take a look at my post here so you can learn about the improvements to Text Explorer in 13.2.

Laura C-S