Hi,
I use k-means clustering to group data together. Now I want to find out, for each cluster, which features have the highest impact. Does the K-Means platform in JMP Pro provide an estimate of the most important features for each cluster? If it is not available, can someone suggest how I could calculate it? I was thinking that perhaps the distance could be used for the estimation. Any thoughts?
Hi @dadawasozo,
If I understand your objective correctly, you would like to assess the differences between clusters based on the values of the features?
If so, you can look at the options "Parallel Coordinate Plot", "Scatterplot matrix" and the several "Biplot" graphs available in the red triangle next to "KMeans NCluster = (X)". Together with the "Cluster Means" summary statistics, they show how the clusters differ in the mean values of each feature.
Example Parallel Coordinate Plot on the Iris dataset (you can reproduce it with Graph Builder to customize it and add value legends):
Example Scatterplot matrix on the Iris dataset:
Example Biplot 2D (with Principal Components) on the Iris dataset:
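The comparison of cluster means can also be turned into a number per cluster and per feature. Below is a minimal sketch in Python (outside JMP), assuming you have exported the data together with the cluster labels. It scores each feature by how far the cluster mean sits from the overall mean, in units of the overall standard deviation. The function name `cluster_feature_scores` and the toy data are hypothetical and only serve as illustration; this is not something the K-Means platform computes for you.

```python
from statistics import mean, pstdev


def cluster_feature_scores(rows, labels, feature_names):
    """For each cluster, rank features by |cluster mean - overall mean| / overall SD."""
    columns = list(zip(*rows))
    overall_mean = [mean(col) for col in columns]
    overall_sd = [pstdev(col) for col in columns]

    result = {}
    for cluster in sorted(set(labels)):
        members = [row for row, lab in zip(rows, labels) if lab == cluster]
        scores = []
        for j, name in enumerate(feature_names):
            cluster_mean = mean(row[j] for row in members)
            # A constant feature cannot separate clusters, so score it as 0.
            sd = overall_sd[j]
            z = (cluster_mean - overall_mean[j]) / sd if sd > 0 else 0.0
            scores.append((name, z))
        # Largest absolute shift first.
        scores.sort(key=lambda item: abs(item[1]), reverse=True)
        result[cluster] = scores
    return result


# Hypothetical toy data: x1 separates the two clusters, x2 does not.
rows = [
    (1.0, 10.0), (1.2, 10.5), (0.8, 9.5),
    (5.0, 10.2), (5.2, 9.8), (4.8, 10.0),
]
labels = [0, 0, 0, 1, 1, 1]

scores = cluster_feature_scores(rows, labels, ["x1", "x2"])
for cluster, ranked in scores.items():
    print(cluster, ranked)
```

A large positive or negative score means the cluster sits far above or below the overall average on that feature. The score only looks at means, so it shares the limitations of any mean-based comparison when the distributions are skewed.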
I hope I have understood your objective correctly and that this answer helps,
Hi @dadawasozo,
I can think of two possible ways to explain the clustering done with K-means:
As an example:
On the well-known public "Iris" dataset, I ran K-means clustering on the factors sepal length, sepal width, petal length and petal width. I found three clusters that closely match the three species, and I saved the cluster formula.
The two platforms give different results, with the two factors that contribute most to the clustering swapping places, because petal length and petal width both have non-normal distributions. Hence Welch's test is not the most appropriate option here, and a Steel-Dwass All Pairs test would probably be better.
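To illustrate why a rank-based test copes better with non-normal features, here is a minimal sketch in Python of the Kruskal-Wallis H statistic computed per feature across clusters. Please note that this is the omnibus rank test, not the Steel-Dwass All Pairs procedure itself, so it does not tell you which pairs of clusters differ. The helper names and the toy values are hypothetical, and the tie correction is left out for brevity.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_wallis_h(values, labels):
    """Kruskal-Wallis H statistic for one feature (no tie correction)."""
    n = len(values)
    ranks = average_ranks(values)
    rank_sums, counts = {}, {}
    for rank, lab in zip(ranks, labels):
        rank_sums[lab] = rank_sums.get(lab, 0.0) + rank
        counts[lab] = counts.get(lab, 0) + 1
    spread = sum(rank_sums[g] ** 2 / counts[g] for g in counts)
    return 12.0 / (n * (n + 1)) * spread - 3 * (n + 1)


# Hypothetical toy data: feature "a" separates the clusters, feature "b" does not.
labels = [0, 0, 0, 1, 1, 1]
features = {
    "a": [1, 2, 3, 4, 5, 6],
    "b": [1, 4, 5, 2, 3, 6],
}

h_by_feature = {name: kruskal_wallis_h(vals, labels) for name, vals in features.items()}
for name, h in sorted(h_by_feature.items(), key=lambda item: item[1], reverse=True):
    print(name, round(h, 3))
```

Because only the ranks enter the statistic, a few extreme values or a skewed distribution do not dominate the result the way they can with a comparison of means.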
On this dataset, Predictor Screening would be a safer option, as it is robust against outliers and does not require assumptions about the data distributions (the platform uses a Random Forest model).
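The same idea can be tried outside JMP. The sketch below uses scikit-learn (an assumption on my side, it is not what the Predictor Screening platform runs internally): it clusters Iris with K-means, then fits one random forest per cluster to predict "in this cluster or not" and reads the impurity-based feature importances. The variable names are hypothetical, and the exact numbers will vary with the random seed and library version.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
X, feature_names = iris.data, iris.feature_names

# Cluster on the four measurements, as in the example above.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

per_cluster_importance = {}
for cluster in np.unique(labels):
    # One-vs-rest target: 1 if the row belongs to this cluster, else 0.
    is_member = (labels == cluster).astype(int)
    forest = RandomForestClassifier(n_estimators=200, random_state=0)
    forest.fit(X, is_member)
    ranked = sorted(
        zip(feature_names, forest.feature_importances_),
        key=lambda pair: pair[1],
        reverse=True,
    )
    per_cluster_importance[int(cluster)] = ranked

for cluster, ranked in per_cluster_importance.items():
    print(cluster, [(name, round(float(score), 3)) for name, score in ranked])
```

Fitting one forest per cluster gives a separate ranking for each cluster, rather than a single ranking for the clustering as a whole. Impurity-based importances can be split between strongly correlated features, so treat the ranking as a guide and not as a definitive answer.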
I have attached the Iris data table so that you can look at the analyses and reproduce the tests I have done.
One last aspect to take into consideration (even if it is outside the scope of your question) is to make sure that the K-means approach is suitable for your dataset, as it is based on two assumptions:
Other clustering approaches may be more flexible if these assumptions are not met (Gaussian/Normal Mixtures, for example), or more appropriate depending on your clustering context: clustering based on point density, on assumed underlying distributions, on hierarchical relations between points, ...
I hope this answer will help you,
Hi Victor,
Is it possible to get the most important features for each cluster separately? K-means clusters the data into a few groups. I want to know the most impactful features for each cluster, to understand what makes the clusters different from each other.