
Clustering: Metrics for model comparison and for different clustering scenarios

Hi dear JMP Users and JMP team,

 

What inspired this wish list request?

As I learn more and more about the different applications of clustering and the methods available (K-Means, Gaussian Mixtures, DBSCAN, Hierarchical, ...), several wishes have come up for discussion about the platforms and metrics available in JMP:

  • Looking at the main types of clustering (centroid-based, distribution-based, hierarchical and density-based, see Clustering Algorithms  |  Machine Learning  |  Google Developers), JMP covers 3 of the 4 main types (with K-Means, Normal Mixtures and Hierarchical clustering, respectively), but not density-based clustering. This topic has been on the JMP wish list since 2021: density based clustering - JMP User Community 
  • Looking at the "evaluation" metrics used to decide which number of clusters seems the most reasonable, there are differences between the platforms:
    • For Hierarchical clustering, the evaluation metric is distance-based,
    • For K-Means clustering (centroid-based clustering), evaluation is variance-based, with the Cubic Clustering Criterion (CCC),
    • For Normal Mixtures (distribution-based clustering), evaluation is information-theoretic, with AICc and BIC.

It seems quite normal that the evaluation differs depending on the specific clustering algorithm, but it would be great to be able to compare the clustering methods on the same basis/metric. In the literature as well ([2212.12189] Stop using the elbow criterion for k-means and how to choose the number of clusters ins...), variance-based or BIC criteria seem to be the preferred options for comparing different numbers of clusters for K-Means.
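As a quick illustration of the BIC-over-elbow idea above, here is a minimal sketch in Python with scikit-learn (not JMP, and using synthetic data chosen for this example): fit mixture models for a range of cluster counts and pick the count that minimizes BIC. A spherical covariance makes the mixture behave much like K-Means.

```python
# Sketch (Python/scikit-learn, not JMP): choosing the number of clusters
# with BIC rather than the elbow heuristic. Data and parameters here are
# illustrative assumptions, not from the original post.
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Four well-separated synthetic blobs in 2D.
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [6, 0], [0, 6], [6, 6]],
                  cluster_std=0.6, random_state=0)

# Fit a Gaussian mixture for each candidate k and record BIC (lower is
# better). covariance_type="spherical" keeps the model K-Means-like.
bics = {}
for k in range(2, 9):
    gm = GaussianMixture(n_components=k, covariance_type="spherical",
                         random_state=0).fit(X)
    bics[k] = gm.bic(X)

best_k = min(bics, key=bics.get)
print("best k by BIC:", best_k)
```

For these clearly separated blobs, the BIC minimum lands on the true number of clusters; on messier data the BIC curve itself is worth inspecting rather than just its minimum.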

 

What are the improvements you would like to see? 

  1. Implementation of a density-based clustering method, like DBSCAN and/or HDBSCAN.
  2. Implementation of a common evaluation metric for the different clustering methods (like AICc and BIC), in order to better assess which method(s) seem to be the most reasonable choice(s). The choice of a specific clustering method may be guided by domain expertise and/or preferences, but having a "neutral" metric to support or challenge this choice would be great. 
  3. For each clustering model, implementation of additional clustering metrics based on variance (CCC, VRC, ...), distance (Dunn index, silhouette score, ...) and information theory (AICc, BIC, ...).  
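To make wishes 1 and 2 concrete, here is a small sketch in Python with scikit-learn (not JMP; the data and parameter values are assumptions for illustration) that runs the three clustering families on the same data, including density-based DBSCAN, and scores them all with one method-agnostic metric, the silhouette score:

```python
# Sketch (Python/scikit-learn, not JMP): one "neutral" metric (silhouette)
# applied to labelings from three clustering families on the same data.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, DBSCAN
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score

# Three well-separated synthetic blobs (illustrative choice).
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [6, 6], [-6, 6]],
                  cluster_std=0.7, random_state=1)

labels = {
    "K-Means":         KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X),
    "Normal Mixtures": GaussianMixture(n_components=3, random_state=1).fit_predict(X),
    "DBSCAN":          DBSCAN(eps=1.0, min_samples=5).fit_predict(X),
}

scores = {}
for name, lab in labels.items():
    mask = lab != -1                      # drop DBSCAN noise points (label -1)
    if len(set(lab[mask])) > 1:           # silhouette needs >= 2 clusters
        scores[name] = silhouette_score(X[mask], lab[mask])

for name, s in scores.items():
    print(f"{name}: silhouette = {s:.3f}")
```

Note that silhouette (being distance-based) is only one possible common yardstick; an information-theoretic one such as BIC, as requested above, requires a likelihood and so needs more care for methods like DBSCAN that have no probabilistic model.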

 

Why is this idea important? 

The clustering algorithms provided in JMP do cover the majority of expected cases, but density-based clustering is genuinely missing and is needed in many different situations.
In the absence of a common metric for comparing the different clustering techniques, it can be hard to choose the "right" method, or to be sufficiently confident in the choice of a particular one.

Finally, because the various clustering methods rest on different assumptions about data structure, it is important to have several evaluation metrics for each method, to help assess the robustness of the decision about the number of clusters.