When I cluster the exact same dataset in JMP and in R using Ward's method, the results differ. R's hclust() provides two Ward options (ward.D and ward.D2), and the latter squares the dissimilarities before the cluster updates. According to Murtagh & Legendre (2014), that's what JMP does too. But even when I specify method="ward.D2" in R, the output I get is non-trivially different: roughly 15% of the observations land in a different cluster from the one JMP assigns them to.
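For anyone who wants to see the ward.D / ward.D2 relationship concretely, here is a minimal sketch on made-up data (the matrix `x` is a hypothetical stand-in, not my real dataset). It checks whether running "ward.D" on squared distances reproduces the "ward.D2" tree, which is how I understand the difference between the two options:

```r
# Hypothetical stand-in data; not the dataset from the question.
set.seed(42)
x <- matrix(rnorm(60), ncol = 3)
d <- dist(x)  # Euclidean distances

# ward.D2 on the distances themselves
hc_d2 <- hclust(d, method = "ward.D2")

# ward.D on the squared distances
hc_sq <- hclust(d^2, method = "ward.D")

# Compare merge heights (after taking square roots) and a 5-cluster cut
heights_match <- isTRUE(all.equal(sqrt(hc_sq$height), hc_d2$height))
cuts_match <- identical(cutree(hc_sq, k = 5), cutree(hc_d2, k = 5))
print(c(heights = heights_match, cuts = cuts_match))
```

If both checks pass, then feeding the wrong kind of input (squared vs. unsquared distances) to either option would be one way to end up with a different tree.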
So, Question 1: Am I wrong to expect that the same method should produce the same results in HCA? I wouldn't expect that in k-means clustering (without setting a seed, at least), but for HCA the results are totally stable *within* a given package, so I assumed they should also be the same *between* packages.
Question 2: If they *should* be the same... why aren't they?
- Possibility A: there's a difference in how the distance matrix is calculated
- Possibility B: there's a difference in how Ward's method is implemented
- Possibility C: the cutree() function in R cuts the tree differently from the "Select number of clusters" function in JMP
- Possibility X... ???
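To make Possibilities A and C testable, here is a rough sketch of the comparison I have in mind, again on made-up data. The matrix `x`, the choice of standardizing with scale(), and the vector name `jmp_cluster` are all assumptions for illustration; I don't know whether standardization is actually what differs between the two packages. Because cluster numbers are arbitrary labels, the comparison uses a cross-tabulation rather than checking whether the numbers are equal:

```r
# Hypothetical stand-in data with one column on a much larger scale.
set.seed(1)
x <- matrix(rnorm(200), ncol = 4)
x[, 1] <- x[, 1] * 10

# Possibility A: distances from raw vs. standardized columns
d_raw <- dist(x)
d_std <- dist(scale(x))

hc_raw <- hclust(d_raw, method = "ward.D2")
hc_std <- hclust(d_std, method = "ward.D2")

# Possibility C: cut both trees into the same number of clusters
k <- 5
cl_raw <- cutree(hc_raw, k = k)
cl_std <- cutree(hc_std, k = k)

# Cross-tabulate the two assignments. A perfect match would put every
# observation in exactly one cell per row, whatever the label numbers.
tab <- table(raw = cl_raw, std = cl_std)
print(tab)

# Crude agreement measure: share of observations in each row's largest cell
agreement <- sum(apply(tab, 1, max)) / sum(tab)
print(agreement)

# With the real data, the same cross-tab against the JMP assignments
# (jmp_cluster is a hypothetical vector exported from JMP):
# table(R = cl_std, JMP = jmp_cluster)
```

If the raw-vs-standardized table looks like my R-vs-JMP mismatch, that would point toward Possibility A; if not, the remaining suspects are the linkage implementation and the cut.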
I've uploaded the dendrogram from R (apologies if that's poor netiquette!); you can see the problem I'm talking about by taking a close look at the labels along the x-axis. Those labels all end in a digit from 1-5, which represents the number of the cluster they were assigned to in JMP. As you can see, the green group ("3") was the only one where there was a perfect 1:1 match between JMP and R.
Grateful for any guidance, -Matt