jpol
Level IV

How to best perform cluster analysis on spiral data?

Hi,

 

I would like to know which clustering platform, available in either JMP or JMP Pro, is best suited to identifying clusters in these spiral data.

 

jpol_0-1646294931702.png

 

Using HDBSCAN I can correctly identify the clusters, as illustrated below, but I would like to use JMP if possible:

 

jpol_1-1646295025060.png

 

Table script attached.

 

Thanks, jpol

 


4 REPLIES 4
Craige_Hales
Super User

Re: How to best perform cluster analysis on spiral data?

Single linkage. You'll need to know how many clusters you want, 3 for example, or it will break the spiral arms into more.

Hierarchical Cluster(
    Y( :X, :Y ),
    Method( "Single" ),          // single linkage joins nearest neighbors, so it follows each arm
    Standardize Data( 1 ),
    Color Clusters( 1 ),
    Dendrogram Scale( "Distance Scale" ),
    Number of Clusters( 3 )      // one cluster per spiral arm
)

Dendrogram with cluster colors; graph with cluster colors

I think this article might help; I read just far enough to see it mention "chain".

https://stats.stackexchange.com/questions/195446/choosing-the-right-linkage-method-for-hierarchical-...

 

Craige
jpol
Level IV

Re: How to best perform cluster analysis on spiral data?

Thanks Craige for indicating the correct method and for the article link.

 

The Single Link method works very well on this relatively clean data set. 

 

The table script I provided above includes a script, "Add Rows", that adds a few more rows of data.

These should be considered outliers, or at least be identified as not belonging to any of the three main clusters. Something like the "-1" cluster in HDBSCAN.

 

The addition of these extra points seems to play havoc with the single linkage method (when looking for 3 or 4 clusters).

Any idea how to avoid this?

 

jpol_1-1646371619674.png

jpol_2-1646371743053.png   

jpol_3-1646371766141.png

 

 

- Philip
Craige_Hales
Super User

Re: How to best perform cluster analysis on spiral data?

I'm not the expert here; I'll defer to others. My guess would be a separate pass to discover points that are farther than average from the other points, but that would fail if there were two outliers next to each other, so their cluster-of-two would need to be tested... perhaps HDBSCAN is doing something like that.
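
For what it's worth, here is a rough JSL sketch of that idea (my own illustration, not what HDBSCAN actually does): compute each point's distance to its nearest other point and flag the ones where that distance is unusually large. It assumes the table has numeric columns X and Y, and the cutoff rule is arbitrary.

// Sketch: flag points whose nearest-neighbor distance is unusually large
dt = Current Data Table();
x = dt:X << Get Values;
y = dt:Y << Get Values;
n = N Rows( x );
nnDist = J( n, 1, 0 );
For( i = 1, i <= n, i++,
    d = Sqrt( (x - x[i]) :* (x - x[i]) + (y - y[i]) :* (y - y[i]) );
    d[i] = 1e99;              // ignore the point's distance to itself
    nnDist[i] = Min( d );     // distance to the closest other point
);
// Flag anything more than 3 interquartile ranges above the upper quartile
q1 = Quantile( 0.25, nnDist );
q3 = Quantile( 0.75, nnDist );
dt << New Column( "NN Outlier", Numeric, "Nominal",
    Set Values( nnDist > q3 + 3 * (q3 - q1) )
);

As noted above, two outliers sitting next to each other would defeat a check like this, since each would have a small nearest-neighbor distance.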

 

Can you share what kind of process produces the spiral clusters?

Craige
SDF1
Super User

Re: How to best perform cluster analysis on spiral data?

Hi @jpol ,

 

  I came across this post and found it very interesting -- I don't even remember now what I was searching for, but this was a really cool post, and thanks for the original data set to play with! I tried another approach, which might be of interest to you. First off, I added some extra columns to the data table and took out others I didn't need. I added more X and Y data, but with noise, so the spirals weren't so nice looking (just to make the problem more difficult!). I even transformed the X/Y coordinates into R/theta coordinates, but found that the polar coordinates did not improve the clustering or modeling. The data table is also attached.
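
In case it's useful, the polar transform can be added with a couple of formula columns, something like this (column names X and Y are assumed; the ATan2 argument order is worth double-checking in the Scripting Index):

// Sketch: add polar coordinates as formula columns (assumes columns X and Y)
dt = Current Data Table();
dt << New Column( "R", Numeric, "Continuous",
    Formula( Sqrt( :X ^ 2 + :Y ^ 2 ) )
);
dt << New Column( "Theta", Numeric, "Continuous",
    Formula( ATan2( :Y, :X ) )   // angle in radians; verify the argument order
);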

 

  You are right that those "outlier" points wreak havoc with the single linkage method in hierarchical clustering, even when standardizing the data. I think it's because those points share something in common with the rest of the data and also end up inflating the standard deviation within each column (i.e., X or Y). However, in JMP 11 they added the Standardize Robustly option, which makes the outliers stick out even more. I don't know the whole algorithm for it, but I would guess that JMP probably does some kind of T^2, Jackknife, or KNN outlier test. In fact, if you use the Multivariate platform and perform a T^2 test, or use the Explore Outliers platform and run a KNN test, you can see that the 5 additional points show up as potential outliers. The only one in question might be the one in the center. You can see the reports below. They are also saved as scripts to the data table.

SDF1_2-1671115086898.png

SDF1_3-1671115106527.png
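
For anyone who wants to reproduce those checks, a minimal JSL launch of the Multivariate platform might look like the sketch below; rather than guess the script names for the outlier options, I'd pick Outlier Analysis > T Square from the red-triangle menu after launching. The KNN check lives in the Explore Outliers platform (Analyze > Screening > Explore Outliers), which can be launched interactively.

// Sketch: launch Multivariate on X and Y, then choose
// Outlier Analysis > T Square from the red-triangle menu
Multivariate(
    Y( :X, :Y ),
    Estimation Method( "Row-wise" )
);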

  In short, check Standardize Robustly when you think you have outliers; this minimizes their impact when JMP standardizes the data. When you get the dendrogram, you can select 8 clusters, and the Hierarchical Clustering platform will correctly distinguish the 5 additional data points as their own individual clusters. See the HC report from JMP below. You can see that all the outlier clusters are aligned on a single constellation arm. They cannot be placed into a single cluster of their own because they don't share enough in common with each other while also sharing too much in common with the rest of the data set. You still need to specify 8 clusters; otherwise it will group the center point and the entire spiral into one cluster and each of the 4 outer points into its own cluster.

SDF1_1-1671114355930.png
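
For reference, the launch that produces a report like this is essentially Craige's script with robust standardization turned on and 8 clusters selected. Standardize Robustly( 1 ) is my guess at the JSL name for the dialog checkbox; if it doesn't run, save the script from the red-triangle menu to get the exact option name.

// Sketch: single linkage with robust standardization and 8 clusters
Hierarchical Cluster(
    Y( :X, :Y ),
    Method( "Single" ),
    Standardize Data( 1 ),
    Standardize Robustly( 1 ),   // assumed option name for the checkbox
    Color Clusters( 1 ),
    Dendrogram Scale( "Distance Scale" ),
    Number of Clusters( 8 )      // 3 arms + 5 outlier "clusters"
)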

  If you had a third dimension of data -- say star brightness (I only say that because this is much like a galaxy image) or a Z coordinate -- then you could add that to the data to allow another means of distinction. If you know beforehand that they are outliers, you can always specify the total number of clusters - 8 in this case - and the hierarchical clustering picks them out just fine (when standardizing robustly). If the points in question are things like fiducials that mark specific locations on a photographic plate or a transistor layout on a Si wafer, then they can be removed from the data set at the get-go. But this is a priori knowledge that you need to somehow include in the data.

 

  I decided to play with this a little further and see if a machine learning algorithm could pick out the clusters. Of the methods JMP Pro has available, I think Support Vector Machines might be the best fit, since it takes numerical inputs and derives classification outputs. This data set is essentially a classification problem: you are trying to distinguish each arm of the spiral from the other arms.

 

  Using the clusters from the Hierarchical Clustering platform as the response, with X & Y as the inputs to the Support Vector Machines platform in JMP Pro, I was able to model the different clusters perfectly. Not knowing what exact parameters to use in the model, I did a tuning design with 40 runs, and sure enough, the SVM platform was able to pick out the 8 clusters. The contingency table of Clusters vs. Predicted Clusters has 0 misclassifications. *Update*: I forgot to mention that I also used 50% of the data as a validation set when training the SVM model.
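
For anyone who wants to recreate that setup, here is a rough JSL sketch. The random 50/50 validation column is straightforward; the Support Vector Machines launch is my best guess at the platform's script name and arguments (the kernel and cost settings from the tuning design were chosen interactively), so treat it as a starting point rather than a finished script. The Cluster column is assumed to hold the saved cluster labels from Hierarchical Clustering.

// Sketch: 50/50 validation column, then an SVM launch on the saved HC clusters
dt = Current Data Table();
dt << New Column( "Validation", Numeric, "Nominal",
    Formula( Random Integer( 0, 1 ) )   // 0 = training, 1 = validation
);
Support Vector Machines(              // assumed platform/argument names
    Y( :Cluster ),
    X( :X, :Y ),
    Validation( :Validation )
);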

 

  Here is the output of the SVM report. You can see that each arm is uniquely identified as well as the 5 additional outliers.

SDF1_4-1671115593771.png

 

  Comparing the data colored by the HC method and by SVM, we see they're essentially identical. (I changed the marker type for the outliers to * so they're easier to see.)

SDF1_5-1671115673673.png

SDF1_6-1671115703627.png

 

 And, the contingency table shows 0 misclassifications.

SDF1_7-1671115778741.png

 

  This was a really cool data set to play with, and I thank you for posting this question. Again, I don't even remember what I was originally searching for, but found this extra cool. I might try some of the other platforms and see how they fare at modeling/classifying this data. Thanks!

 

I hope this helps!

DS

 

**Update 2**:

  So, I went back and tested fitting this data set with some of the other predictive modeling platforms in JMP, and the only other two that came close were the Decision Tree and Bootstrap Forest platforms. Even then, neither of them performs as well as the SVM model. Both picked out only 3 clusters during modeling; the clusters are almost the same, but not quite. So, indeed, SVM seems to be the better choice.

SDF1_0-1671207067054.png
SDF1_1-1671207095451.png
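
For completeness, the comparable launches for those two platforms would look roughly like the sketch below (same caveat as above: argument names are my best guess, and both platforms have many more options than shown):

// Sketch: decision tree and bootstrap forest on the same response and inputs
Partition(
    Y( :Cluster ),
    X( :X, :Y ),
    Validation( :Validation )
);
Bootstrap Forest(
    Y( :Cluster ),
    X( :X, :Y ),
    Validation( :Validation )
);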

 

  I even tried the new XGBoost platform, which also doesn't end up working all that well, surprisingly. I tried several different fitting options, but might not have optimized it as well as it could be, as there are many different tuning parameters. The best it did was to classify the 3 arms of the spiral and only one outlier; the rest it misclassified. So, in the end, HC clustering to get the cluster labels and SVM modeling to model them for potential unknown spirals is probably the best bet. Lots of fun!