dlifke
Level II

How do I quantify the clustering of data?

Imagine a set of data points randomly distributed over a square area: both the x and y coordinates are uniform random variables, so the only grouping or clustering is due to chance. Now suppose the alternative hypothesis, Ha, is that the data become clustered over time. The same data a week later appear more clustered, at least visually. How can I measure the degree of clustering on a scale where 0 means no clustering beyond what randomness produces and 1 means every data point has moved to the same location? Values in between would correspond to data that wander and group into several smaller clusters.

 

I can run K Means Cluster over a range of cluster counts for each data set and choose the NCluster (number of clusters) with the best Cubic Clustering Criterion (CCC), where larger values of CCC indicate a better fit. I can then compare the number of clusters between the two data sets, noting that the random data has more, but smaller, clusters. It seems there should be a more statistically valid approach to comparing the degree of clustering, which is why I am reaching out to this community full of people much smarter than I am!
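
To make the 0-to-1 scale described above concrete, here is a minimal sketch in plain Python (not JMP or JSL) of one candidate measure. It is not part of the original workflow. It uses the Clark-Evans nearest-neighbor ratio R, which is the observed mean nearest-neighbor distance divided by the value expected under complete spatial randomness, and reports 1 - R. The function names, the simulated "week later" data, the cluster centers, and the unit-square study area are all illustrative assumptions, and no edge correction is applied.

```python
import math
import random


def mean_nearest_neighbor_distance(points):
    """Average distance from each point to its closest other point (brute force, O(n^2))."""
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        total += min(
            math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(points)
            if j != i
        )
    return total / len(points)


def clustering_score(points, area=1.0):
    """Return 1 - R, where R is the Clark-Evans nearest-neighbor ratio.

    Roughly 0 for spatially random data, 1 when all points coincide,
    and negative when points are more evenly spaced than random.
    """
    density = len(points) / area
    # Expected mean nearest-neighbor distance under complete spatial
    # randomness, ignoring edge effects.
    expected = 0.5 / math.sqrt(density)
    return 1.0 - mean_nearest_neighbor_distance(points) / expected


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Week 0 (assumed): 400 points uniform on the unit square.
uniform_pts = [(rng.random(), rng.random()) for _ in range(400)]

# Week 1 (hypothetical): the same number of points gathered tightly
# around four made-up centers.
centers = [(0.2, 0.2), (0.8, 0.3), (0.5, 0.7), (0.3, 0.9)]
clustered_pts = [
    (cx + rng.gauss(0, 0.02), cy + rng.gauss(0, 0.02))
    for cx, cy in centers
    for _ in range(100)
]

# Limiting case: every point at the same location.
collapsed_pts = [(0.5, 0.5)] * 400

score_uniform = clustering_score(uniform_pts)
score_clustered = clustering_score(clustered_pts)
score_collapsed = clustering_score(collapsed_pts)

print(f"uniform:   {score_uniform:.3f}")
print(f"clustered: {score_clustered:.3f}")
print(f"collapsed: {score_collapsed:.3f}")
```

Because the score depends only on nearest-neighbor distances, it does not require choosing a number of clusters. It also means that tight pairs of points would score as highly clustered even when no larger groups exist, so this is a starting point under the stated assumptions rather than a definitive answer.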
