topic Hierarchical clustering: JMP & R disagree in Discussions
https://community.jmp.com/t5/Discussions/Hierarchical-clustering-JMP-amp-R-disagree/m-p/276107#M53580
<P>When I cluster the same exact dataset in JMP and in R, using Ward's method, I get results that are non-identical. R provides two options (ward.D and ward.D2), wherein the latter squares the values in the distance matrix before clustering them. According to <A href="http://adn.biol.umontreal.ca/~numericalecology/Reprints/Murtagh_Legendre_J_Class_2014.pdf" target="_self">Murtagh & Legendre (2014)</A>, that's what JMP does too. But even when I specify method="ward.D2" in R, the output I get is non-trivially different: roughly 15% of the observations are mismatched with respect to what I get from JMP.</P><P> </P><P>So, Question 1: Am I wrong to expect that the same method should produce the same results in HCA? I wouldn't expect that in k-means clustering (without setting a seed, at least), but for HCA the results are totally stable *within* a given package, so I assumed they should also be the same *between* packages. </P><P> </P><P>Question 2: If they *should* be the same... why aren't they?</P><UL><LI>Possibility A: there's a difference in how the distance matrix is calculated</LI><LI>Possibility B: there's a difference in how Ward's method is implemented</LI><LI>Possibility C: the cutree() function in R does something different than the "Select number of clusters" function in JMP</LI><LI>Possibility X... ???</LI></UL><P>I've uploaded the dendrogram from R (apologies if that's poor netiquette!); you can see the problem I'm talking about by taking a close look at the labels along the x-axis. Those labels all end in a digit from 1-5, which represents the number of the cluster they were assigned to in JMP. As you can see, the green group ("3") was the only one where there was a perfect 1:1 match between JMP and R. 
</P><P> </P><P>Grateful for any guidance, -Matt</P><P><span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Screen Shot 2020-06-29 at 10.29.25 PM.png" style="width: 999px;"><img src="https://community.jmp.com/t5/image/serverpage/image-id/24896i0B91B2036F2A7CD1/image-size/large?v=1.0&px=999" role="button" title="Screen Shot 2020-06-29 at 10.29.25 PM.png" alt="Screen Shot 2020-06-29 at 10.29.25 PM.png" /></span></P>Tue, 30 Jun 2020 02:41:57 GMTmatthall_resear2020-06-30T02:41:57ZHierarchical clustering: JMP & R disagree
https://community.jmp.com/t5/Discussions/Hierarchical-clustering-JMP-amp-R-disagree/m-p/276790#M53744
<P>Hi matthall_resear,</P>
<P>Comparing different packages is always something of a hot topic. Because I cannot look at JMP's source code, I cannot check whether any of the calculations you listed under Question 2 differ between JMP and R. Both refer to Murtagh & Legendre (2014); however, that does not mean the algorithm is implemented in exactly the same fashion. There may be differences in the steps before and after the core of the algorithm, and in any subroutines it calls. So again, without access to the complete code, nobody here will be able to answer this question definitively. You could send your question to JMP's Technical Support, giving detailed information about the R package you are using as well as the specific routine, since sometimes more than one R function is available. Then they might have a chance to check for differences, although there is no guarantee they will find something.</P>
<P> </P>
<P>Just another thought: are the results within each package stable for HCA, or are they slightly different from run to run? That is also important information. Lastly, I would take a look at the constellation plot to identify differences there as well. </P>
<P> </P>
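<P>When checking whether two runs agree (within one package or between JMP and R), it may help to remember that cluster numbers are arbitrary labels, so cluster "1" in one output need not be cluster "1" in the other. The following is a minimal Python sketch, not anything taken from JMP or R, that scores agreement under the best possible relabeling. The function name <code>best_agreement</code> and the example vectors are hypothetical, and the brute-force search assumes a small number of clusters (such as the 5 used in this thread).</P>

```python
from itertools import permutations


def best_agreement(labels_a, labels_b):
    """Return (fraction of matching observations, mapping from B's labels to A's)
    for the relabeling of B that agrees best with A.

    Brute force over relabelings, so only suitable for a handful of clusters.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("label vectors must have the same length")
    if not labels_a:
        raise ValueError("label vectors must not be empty")

    ids_a = sorted(set(labels_a))
    ids_b = sorted(set(labels_b))
    # If B has more clusters than A, pad with None so the extra clusters
    # in B are simply mapped to "no counterpart".
    targets = ids_a + [None] * max(0, len(ids_b) - len(ids_a))

    best_hits, best_map = -1, {}
    for perm in permutations(targets, len(ids_b)):
        mapping = dict(zip(ids_b, perm))
        hits = sum(1 for a, b in zip(labels_a, labels_b) if mapping[b] == a)
        if hits > best_hits:
            best_hits, best_map = hits, mapping
    return best_hits / len(labels_a), best_map


# Hypothetical example: identical partitions, different cluster numbers.
jmp_labels = [1, 1, 2, 2, 3, 3]
r_labels = [2, 2, 3, 3, 1, 1]
score, mapping = best_agreement(jmp_labels, r_labels)
print(score, mapping)  # 1.0 {1: 3, 2: 1, 3: 2}
```

<P>A score below 1.0 after relabeling indicates a genuine difference in the partitions rather than a difference in numbering.</P>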
<P>Just my 2 cents</P>
<P> </P>
<P> </P>Fri, 03 Jul 2020 13:47:11 GMThttps://community.jmp.com/t5/Discussions/Hierarchical-clustering-JMP-amp-R-disagree/m-p/276790#M53744martindemel2020-07-03T13:47:11ZRe: Hierarchical clustering: JMP & R disagree
https://community.jmp.com/t5/Discussions/Hierarchical-clustering-JMP-amp-R-disagree/m-p/276797#M53749
<P>This thread might also help with the distance matrix: </P>
<P><A href="https://community.jmp.com/t5/Discussions/which-distance-does-JMP-use-in-clustering/m-p/275384#M53438" target="_blank" rel="noopener">https://community.jmp.com/t5/Discussions/which-distance-does-JMP-use-in-clustering/m-p/275384#M53438</A> </P>Fri, 03 Jul 2020 14:34:52 GMThttps://community.jmp.com/t5/Discussions/Hierarchical-clustering-JMP-amp-R-disagree/m-p/276797#M53749martindemel2020-07-03T14:34:52ZRe: Hierarchical clustering: JMP & R disagree
https://community.jmp.com/t5/Discussions/Hierarchical-clustering-JMP-amp-R-disagree/m-p/276800#M53752
<P>Is it possible that the data are being standardized in one package and not the other? The default in JMP is "Standardize Data".</P>Fri, 03 Jul 2020 15:15:54 GMThttps://community.jmp.com/t5/Discussions/Hierarchical-clustering-JMP-amp-R-disagree/m-p/276800#M53752jerry_cooper2020-07-03T15:15:54ZRe: Hierarchical clustering: JMP & R disagree
https://community.jmp.com/t5/Discussions/Hierarchical-clustering-JMP-amp-R-disagree/m-p/276804#M53754
<P>Thanks for the suggestion; it's good to know that sending the code to the tech team is an option if I can't resolve this another way. </P><P> </P><P>To answer your other question, yes: the HCA results within each package are completely stable. </P>Fri, 03 Jul 2020 16:09:17 GMThttps://community.jmp.com/t5/Discussions/Hierarchical-clustering-JMP-amp-R-disagree/m-p/276804#M53754matthall_resear2020-07-03T16:09:17ZRe: Hierarchical clustering: JMP & R disagree
https://community.jmp.com/t5/Discussions/Hierarchical-clustering-JMP-amp-R-disagree/m-p/276805#M53755
<P>Thanks for passing that along; it made me realize that it will be worth actually exporting the distance matrix from JMP and comparing those values to the distance matrix I get in R! That will be one way to narrow down where the discrepancy is coming from. </P>Fri, 03 Jul 2020 16:10:39 GMThttps://community.jmp.com/t5/Discussions/Hierarchical-clustering-JMP-amp-R-disagree/m-p/276805#M53755matthall_resear2020-07-03T16:10:39ZRe: Hierarchical clustering: JMP & R disagree
https://community.jmp.com/t5/Discussions/Hierarchical-clustering-JMP-amp-R-disagree/m-p/276806#M53756
Good thought, but I don't think that's the issue: since the dataset I'm working with involves variables that are all on the same scale, I always de-select "Standardize Data" in JMP. And in order to standardize data in R, one has to use the scale() function, which I never do. So as far as I can determine both packages should be operating over unstandardized data.Fri, 03 Jul 2020 16:13:47 GMThttps://community.jmp.com/t5/Discussions/Hierarchical-clustering-JMP-amp-R-disagree/m-p/276806#M53756matthall_resear2020-07-03T16:13:47ZRe: Hierarchical clustering: JMP & R disagree
https://community.jmp.com/t5/Discussions/Hierarchical-clustering-JMP-amp-R-disagree/m-p/276877#M53778
<P>Hmmm... I just did a comparison on three different data sets in JMP and R (without using "standardize data" or "scale()") and found the classifications to match exactly in each when using the same number of clusters. The only methods I specified in R were to use 'euclidean' for the "dist" function and 'ward.D2' for the "hclust" function. Are you specifying anything else in your R and/or JMP workflows? How messy are your data (missing... outliers...)? Just offering some food for thought...</P>Sat, 04 Jul 2020 19:49:11 GMThttps://community.jmp.com/t5/Discussions/Hierarchical-clustering-JMP-amp-R-disagree/m-p/276877#M53778jerry_cooper2020-07-04T19:49:11Z
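<P>For anyone wanting a package-neutral reference to compare against, here is a small Python sketch of agglomerative clustering with the Ward (Lance-Williams) update. It is an illustration under stated assumptions, not JMP's or R's actual code: <code>euclidean_matrix</code>, <code>ward_labels</code> and the <code>square_first</code> flag are hypothetical names. With <code>square_first=True</code> the distances are squared before the update, which is how the thread describes 'ward.D2'; with <code>square_first=False</code> the update is applied to the distances as given, in the spirit of 'ward.D'. Ties are broken by the lowest pair of indices, which is an arbitrary choice that a real package may make differently.</P>

```python
import math


def euclidean_matrix(points):
    """Full Euclidean distance matrix for a list of coordinate tuples."""
    return [[math.dist(p, q) for q in points] for p in points]


def ward_labels(dist, k, square_first=True):
    """Merge clusters with the Ward Lance-Williams update until k remain.

    dist: square symmetric distance matrix (list of lists).
    square_first: square the distances before updating (assumption: this
        mirrors the 'ward.D2' behaviour described in the thread).
    Returns labels 1..k, numbered by first appearance in the data.
    """
    n = len(dist)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of observations")

    # Pairwise dissimilarities between current clusters, keyed by (low, high).
    d = {(i, j): dist[i][j] ** 2 if square_first else dist[i][j]
         for i in range(n) for j in range(i + 1, n)}
    members = {i: [i] for i in range(n)}

    while len(members) > k:
        # Closest pair; ties go to the lowest index pair (arbitrary choice).
        i, j = min(d, key=lambda pair: (d[pair], pair))
        dij = d.pop((i, j))
        ni, nj = len(members[i]), len(members[j])
        for c in members:
            if c in (i, j):
                continue
            nc = len(members[c])
            key_i = (min(c, i), max(c, i))
            key_j = (min(c, j), max(c, j))
            dci, dcj = d.pop(key_i), d.pop(key_j)
            # Lance-Williams update for Ward's criterion.
            d[key_i] = ((ni + nc) * dci + (nj + nc) * dcj - nc * dij) / (ni + nj + nc)
        # The merged cluster keeps the lower id i.
        members[i] = members[i] + members.pop(j)

    labels = [0] * n
    for new_id, cluster in enumerate(sorted(members.values(), key=min), start=1):
        for obs in cluster:
            labels[obs] = new_id
    return labels


# Hypothetical, well-separated toy data: two tight groups of three points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
dm = euclidean_matrix(points)
print(ward_labels(dm, 2, square_first=True))   # [1, 1, 1, 2, 2, 2]
print(ward_labels(dm, 2, square_first=False))  # [1, 1, 1, 2, 2, 2]
```

<P>On clean, well-separated data like this toy set both variants give the same partition. If a real dataset contains tied distances, missing values, or rows that differ between the two exports, those would be natural places to look for a mismatch.</P>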