JMP Wish List



Clustering: Metrics for model comparison and for different clustering scenarios

Hi JMP users and JMP team,

 

What inspired this wish list request?

As I learn more about the different applications of clustering and the methods available (K-Means, Gaussian Mixtures, DBSCAN, Hierarchical, ...), several wishes have come up for discussion about the platforms and metrics available in JMP:

  • Looking at the four main types of clustering (centroid-based, distribution-based, hierarchical, and density-based; see Clustering Algorithms  |  Machine Learning  |  Google Developers), JMP covers three of the four (with K-Means, Normal Mixtures, and Hierarchical Clustering respectively), but not density-based clustering. This topic has been on the JMP Wish List since 2021: density based clustering - JMP User Community 
  • Looking at the evaluation metrics used to decide which number of clusters seems most reasonable, the platforms differ:
    • For Hierarchical clustering, the evaluation metric is distance-based,
    • For K-Means (centroid-based clustering), evaluation is variance-based, via the Cubic Clustering Criterion (CCC),
    • For Normal Mixtures (distribution-based clustering), evaluation is information-theoretic, via AICc and BIC.

It is natural that the evaluation differs with the specific clustering algorithm, but it would be great to be able to compare clustering methods on the same basis/metric. In the literature as well ([2212.12189] Stop using the elbow criterion for k-means and how to choose the number of clusters ins...), the variance criterion and the BIC criterion appear to be the preferred options for comparing different numbers of clusters for K-Means.
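To illustrate the variance/BIC route for choosing the number of K-Means clusters, here is a minimal sketch in Python (not JSL, and not how JMP computes CCC or BIC). The BIC formula used is the common regression-style approximation n·ln(RSS/n) + p·ln(n), with the simplifying assumption that p equals the number of cluster centres; both the toy 1-D K-Means and that choice of p are assumptions on my side:

```python
import math
import random

def kmeans_1d(points, k, iters=100, seed=1):
    """Plain Lloyd's algorithm on 1-D data (toy sketch, not JMP's K-Means)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assign each point to its nearest centre, then recompute the means.
        groups = [[] for _ in range(k)]
        for x in points:
            j = min(range(k), key=lambda c: (x - centers[c]) ** 2)
            groups[j].append(x)
        centers = [sum(g) / len(g) if g else centers[j]
                   for j, g in enumerate(groups)]
    return centers

def bic_1d(points, centers):
    """Regression-style BIC approximation: n*ln(RSS/n) + p*ln(n).
    Assumption: p = number of cluster centres (one free mean each)."""
    n = len(points)
    rss = sum(min((x - c) ** 2 for c in centers) for x in points)
    return n * math.log(rss / n) + len(centers) * math.log(n)

# Two well-separated 1-D clusters: BIC should drop sharply from k=1 to k=2.
rng = random.Random(0)
data = ([rng.gauss(0, 0.3) for _ in range(20)]
        + [rng.gauss(10, 0.3) for _ in range(20)])
for k in (1, 2, 3):
    print(k, round(bic_1d(data, kmeans_1d(data, k)), 1))
```

The point of the sketch is only that a single information-theoretic number per candidate k makes the comparison mechanical, which is what a built-in common metric would give us across platforms.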

 

What are the improvements you would like to see? 

  1. Implementation of a density-based clustering method, like DBSCAN and/or HDBSCAN.
  2. Implementation of a common evaluation metric across the different clustering methods (such as AICc and BIC), in order to better assess which method(s) seem the most reasonable choice(s). The choice of a specific clustering method may be guided by domain expertise and/or preferences, but having a "neutral" metric to confirm or challenge that choice would be great. 
  3. For each clustering model, implementation of additional clustering metrics based on variance (CCC, VRC, ...), distance (Dunn index, silhouette scores, ...), and information theory (AICc, BIC, ...).  
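To make the density-based request concrete, here is a minimal, unoptimised DBSCAN sketch in Python (O(n²) neighbour search; the parameter names eps and min_pts follow the original DBSCAN formulation, and nothing here reflects a JMP API):

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns one cluster id per point, or -1 for noise."""
    n = len(points)
    labels = [None] * n  # None = not yet visited

    def neighbors(i):
        # Brute-force eps-neighbourhood (includes the point itself).
        return [j for j in range(n)
                if math.dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1  # noise (may later be claimed as a border point)
            continue
        labels[i] = cluster  # i is a core point: grow a new cluster from it
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:  # j is also core: keep expanding
                queue.extend(j_nbrs)
        cluster += 1
    return labels

# Two dense groups plus one far-away outlier.
pts = [(0, 0), (0.1, 0), (0, 0.1), (0.1, 0.1),
       (5, 5), (5.1, 5), (5, 5.1), (5.1, 5.1),
       (100, 100)]
print(dbscan(pts, eps=0.5, min_pts=3))
```

The appeal for a JMP platform is exactly what the toy run shows: clusters of arbitrary shape emerge from density alone, and outliers are flagged as noise rather than forced into a cluster.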

 

Why is this idea important? 

The clustering algorithms provided in JMP cover the majority of expected use cases, but density-based clustering is missing, even though it is needed in many situations.
Without a common metric to compare the different clustering techniques, it can be hard to choose the "right" method, or to be sufficiently confident in the choice of a particular one.

Finally, because the clustering methods rest on different assumptions about data structure, it is important to have several evaluation metrics for each method, to help assess the robustness of the decision about the number of clusters.
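One candidate for such a common metric is the silhouette score mentioned above: it depends only on the data and the cluster labels, not on which algorithm produced them, so the same number can be compared across K-Means, Normal Mixtures, hierarchical, or density-based results. A minimal Python sketch (my own simplification: it assumes at least two clusters and skips points labelled -1 as noise):

```python
import math

def silhouette(points, labels):
    """Mean silhouette score; assumes >= 2 clusters, ignores -1 (noise)."""
    clusters = {}
    for i, lab in enumerate(labels):
        if lab != -1:
            clusters.setdefault(lab, []).append(i)
    scores = []
    for lab, members in clusters.items():
        for i in members:
            if len(members) == 1:
                scores.append(0.0)  # common convention for singleton clusters
                continue
            # a = mean distance to the point's own cluster,
            # b = mean distance to the nearest other cluster.
            a = sum(math.dist(points[i], points[j])
                    for j in members if j != i) / (len(members) - 1)
            b = min(sum(math.dist(points[i], points[j])
                        for j in other) / len(other)
                    for ol, other in clusters.items() if ol != lab)
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

pts = [(0, 0), (0, 0.1), (10, 10), (10, 10.1)]
print(round(silhouette(pts, [0, 0, 1, 1]), 3))  # good split: close to +1
print(round(silhouette(pts, [0, 1, 0, 1]), 3))  # bad split: negative
```

A score near +1 means tight, well-separated clusters and a negative score means many points sit closer to a foreign cluster, which is exactly the kind of algorithm-neutral readout this wish asks JMP to report alongside each platform's native criterion.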