Hi @jpol,
I came across this post and found it very interesting -- I don't even remember now what I was searching for, but this was a really cool post, and thanks for the original data set to play with! I tried another approach, which might be of interest to you. First, I added some extra columns to the data table and took out the ones I didn't need. I added more X and Y data, but with noise, so the spirals aren't so nice looking (just to make the problem more difficult!). I also transformed the X/Y coordinates into R/theta coordinates to test the clustering and modeling, but found that R and theta didn't improve either one. The data table is also attached.
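In case it's useful, here's a rough sketch of that kind of noise injection and X/Y-to-R/theta transform -- in Python/NumPy rather than JMP formula columns, and with a made-up spiral standing in for the attached table, so treat the names and values as placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical spiral-like X/Y data standing in for the attached table
t = np.linspace(0.5, 4 * np.pi, 300)
x = t * np.cos(t)
y = t * np.sin(t)

# Add Gaussian noise so the spiral arms aren't so clean
x_noisy = x + rng.normal(scale=0.3, size=x.shape)
y_noisy = y + rng.normal(scale=0.3, size=y.shape)

# Polar transform: R = distance from the origin, theta = angle
r = np.hypot(x_noisy, y_noisy)
theta = np.arctan2(y_noisy, x_noisy)
```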
You are right that those "outlier" points wreak havoc with the single linkage method in hierarchical clustering, even when standardizing the data. I think it's because those points share something in common with the rest of the data while also inflating the standard deviation of the column they sit in (i.e., X or Y). However, JMP 11 added the Standardize Robustly option, which makes the outliers stick out even more. I don't know the exact algorithm behind it, but I would guess JMP does some kind of T^2, jackknife, or KNN outlier test. In fact, if you use the Multivariate platform and perform a T^2 test, or use the Explore Outliers platform and run a KNN test, you can see that the 5 additional points show up as potential outliers. The only one in question might be the one in the center. You can see the reports below; they are also saved as scripts to the data table.
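For anyone without JMP, here's a minimal sketch of a KNN-style distance check in Python/scikit-learn. It's only an approximation of what the Explore Outliers platform does, and the cutoff and variable names are just placeholders:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

def knn_outlier_scores(xy: np.ndarray, k: int = 5) -> np.ndarray:
    """Mean distance to the k nearest neighbors; large values suggest outliers."""
    z = StandardScaler().fit_transform(xy)           # standardize X and Y
    nn = NearestNeighbors(n_neighbors=k + 1).fit(z)  # +1 because each point is its own nearest neighbor
    dist, _ = nn.kneighbors(z)
    return dist[:, 1:].mean(axis=1)                  # drop the self-distance column

# Usage (xy being the n-by-2 array of X/Y values from the data table):
# scores = knn_outlier_scores(xy)
# flagged = np.where(scores > scores.mean() + 3 * scores.std())[0]
```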
In short, check Standardize Robustly when you think you have outliers; this minimizes the impact of the outliers when JMP standardizes the data. When you get the dendrogram, you can select 8 clusters, and the Hierarchical Clustering platform correctly distinguishes the additional 5 data points as their own individual clusters. See the HC report from JMP below. You can see that all the outlier clusters line up along a single arm of the constellation plot. They can't be placed into a single cluster of their own because they don't share enough in common with each other, while still sharing too much in common with the rest of the data set. You do still need to specify 8 clusters; otherwise it will group the center point and the entire spiral into one cluster and put the 4 outer points each into their own cluster.
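If you want to mimic this outside JMP, here's a minimal sketch in Python/scikit-learn. RobustScaler (median/IQR) is only my stand-in for JMP's Standardize Robustly option -- I don't know the exact robust estimates JMP uses -- and `xy` is assumed to be the n-by-2 array of coordinates:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler
from sklearn.cluster import AgglomerativeClustering

def cluster_spiral(xy: np.ndarray, n_clusters: int = 8) -> np.ndarray:
    """Robustly standardize X/Y, then cut single-linkage hierarchical clustering at n_clusters."""
    # RobustScaler centers by the median and scales by the IQR, so the 5 outliers
    # don't inflate the scaling the way a plain standard-deviation standardization would.
    z = RobustScaler().fit_transform(xy)
    hc = AgglomerativeClustering(n_clusters=n_clusters, linkage="single")
    return hc.fit_predict(z)

# Usage:
# labels = cluster_spiral(xy, n_clusters=8)   # 3 spiral arms + 5 outlier points
```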
If you had a third dimension of data -- say star brightness (I only say that because this looks much like a galaxy image) or a Z coordinate -- then you could add it to the data to give the clustering another way to distinguish the points. If you know beforehand that they are outliers, you can always specify the total number of clusters (8 in this case), and the hierarchical clustering picks them out just fine (when standardizing robustly). If the points in question are things like fiducials marking specific locations on a photographic plate or a transistor layout on a Si wafer, then they can be removed from the data set at the get-go. But that is a priori knowledge that you somehow need to include in the data.
I decided to play with this a little further to see if a machine learning algorithm could pick out the clusters. Of the methods JMP Pro has available, I think Support Vector Machines might be the best fit, since it takes numerical inputs and derives classification outputs, and this data set is essentially a classification problem: you are trying to distinguish each arm of the spiral from the other arms.
Using the clusters from the Hierarchical Clustering platform as the response and the X & Y coordinates as the inputs to the Support Vector Machines platform in JMP Pro, I was able to model the different clusters perfectly. Not knowing what parameters to use in the model, I did a tuning design with 40 runs, and sure enough, the SVM platform was able to pick out the 8 clusters. The contingency table of Clusters vs. Predicted Clusters has 0 misclassifications. *Update*: I forgot to mention that I also used 50% of the data as a validation set when training the SVM model.
Here is the output of the SVM report. You can see that each arm, as well as each of the 5 additional outliers, is uniquely identified.
Comparing the data colored by the HC method and by the SVM, we see they're essentially identical. (I changed the marker type for the outliers to * so they're easier to see.)
And, the contingency table shows 0 misclassifications.
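As an aside, here's a rough non-JMP equivalent in Python/scikit-learn: a 50/50 train/validation split and a small grid search standing in for the tuning design. The parameter grid is just a guess, not what the JMP tuning design actually covered, and the singleton outlier clusters may need extra care when splitting:

```python
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV, KFold
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

def fit_svm(xy: np.ndarray, cluster_labels: np.ndarray):
    """Train an RBF-kernel SVM to predict the HC cluster labels from X/Y."""
    X_train, X_val, y_train, y_val = train_test_split(
        xy, cluster_labels, test_size=0.5, random_state=1
    )
    # Small grid over cost and kernel width, loosely analogous to a tuning design
    grid = GridSearchCV(
        SVC(kernel="rbf"),
        param_grid={"C": [1, 10, 100, 1000], "gamma": [0.01, 0.1, 1, 10]},
        cv=KFold(n_splits=5, shuffle=True, random_state=1),
    )
    grid.fit(X_train, y_train)
    y_pred = grid.predict(X_val)
    return grid.best_estimator_, confusion_matrix(y_val, y_pred)

# Usage:
# model, cm = fit_svm(xy, labels)   # an all-diagonal cm means 0 misclassifications
```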
This was a really cool data set to play with, and I thank you for posting this question. Again, I don't even remember what I was originally searching for, but I found this extra cool. I might try some of the other platforms and see how they fare at modeling/classifying this data. Thanks!
I hope this helps!
DS
**Update 2**:
So, I went back and tried fitting this data set with some of the other predictive modeling platforms in JMP, and the only two that came close were the Decision Tree and Bootstrap Forest platforms. Even then, neither of them performs as well as the SVM model: both only picked out 3 clusters during modeling, and those clusters are almost the same, but not quite. So SVM does indeed seem to be the better choice.
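If you want to run a similar comparison outside JMP, a random forest is roughly analogous to JMP's Bootstrap Forest. This is just a sketch with assumed variable names and default-ish settings, not the actual JMP runs:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

def fit_forest(xy: np.ndarray, cluster_labels: np.ndarray):
    """Random forest on the same X/Y -> cluster problem, for comparison with the SVM."""
    X_train, X_val, y_train, y_val = train_test_split(
        xy, cluster_labels, test_size=0.5, random_state=1
    )
    rf = RandomForestClassifier(n_estimators=200, random_state=1)
    rf.fit(X_train, y_train)
    # Off-diagonal counts in the confusion matrix show which clusters get confused
    return rf, confusion_matrix(y_val, rf.predict(X_val))
```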
I even tried the new XGBoost platform, which, surprisingly, also doesn't work out all that well. I tried several different fitting options, but I might not have optimized it as well as it could be, since there are many different tuning parameters. The best it did was to classify the 3 arms of the spiral and only one outlier; the rest it misclassified. So, in the end, hierarchical clustering to generate the cluster labels and SVM to model them for potential unknown spirals is probably the best bet. Lots of fun!
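And for completeness, here's the same kind of comparison with the xgboost Python package. Again, this is only a sketch; the parameter values below are arbitrary starting points, not the settings I tried in the JMP XGBoost platform:

```python
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import LabelEncoder

def fit_xgb(xy: np.ndarray, cluster_labels: np.ndarray):
    """Gradient-boosted trees on the X/Y -> cluster problem."""
    y = LabelEncoder().fit_transform(cluster_labels)  # XGBoost wants 0..k-1 integer labels
    X_train, X_val, y_train, y_val = train_test_split(xy, y, test_size=0.5, random_state=1)
    xgb = XGBClassifier(
        n_estimators=300,   # number of boosting rounds
        max_depth=4,        # tree depth
        learning_rate=0.1,  # shrinkage
    )
    xgb.fit(X_train, y_train)
    return xgb, confusion_matrix(y_val, xgb.predict(X_val))
```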