Hello, is there a method to calculate k means for a dataset but have some centers pre-defined? That is, I have the following random data and want to find k=3 centers, but 2 of the centers are already fixed at a given location. Does the functionality exist in JMP 14.2.0? Thanks!
Points:
Label | X | Y |
Point1 | 3 | 3 |
Point2 | 4 | 4 |
Point3 | 6 | 6 |
Point4 | 50 | 30 |
Point5 | 60 | 60 |
Point6 | 70 | 90 |
Point7 | 90 | 90 |
Point8 | 99 | 99 |
Point9 | 101 | 101 |
Point10 | 102 | 102 |
Find 3rd cluster if these two centers are already defined:
Label | X | Y |
Centroid1 | 5 | 5 |
Centroid2 | 100 | 100 |
The regular output from analyzing the points works fine and optimizes the 3 clusters, but does not take into account the desired fixed centers.
Hi @AvgRegression52,
Your idea seems counter-intuitive for an unsupervised algorithm like K-Means, since it seems you already have a priori information about the different clusters/groups. Out of curiosity, why would you change the location of the centroids found by the K-Means algorithm?
There may be a workaround to get results like you expect:
This way, the two newly added rows will "artificially" carry a lot more frequency (or importance) than the other rows of your table, and two of the cluster centres will be heavily biased toward the coordinates of these two centroid rows (see capture "Biased_K-Means").
Use this workaround with extra caution, as the location of the last cluster centre could be affected by this process, depending on the distribution of points in each dimension. In your example, the three clusters are well defined and far apart, so adding a lot more frequency/weight to the two new "centroid" rows (which sit at the extremes of the X and Y distributions) won't affect the location of the third cluster centroid ("in the middle").
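For readers outside JMP, the effect of the boosted-frequency trick can be sketched in plain Python. This is only an illustration of the idea, not the attached JMP script: the function name `weighted_kmeans`, the boost value of 1000, and the starting guess for the third center are all assumptions made for this example.

```python
import math

def weighted_kmeans(points, weights, centers, max_iter=100):
    """Lloyd's algorithm where row i counts weights[i] times."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        # Assignment step: each row goes to its nearest current center.
        labels = [min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        # Update step: each center becomes the weighted mean of its rows.
        new_centers = []
        for j, old in enumerate(centers):
            total = sum(w for w, l in zip(weights, labels) if l == j)
            if total == 0:
                new_centers.append(old)  # keep an empty cluster where it was
                continue
            new_centers.append(tuple(
                sum(w * p[d] for p, w, l in zip(points, weights, labels) if l == j) / total
                for d in range(len(old))))
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels

# The ten sample points from the question.
points = [(3, 3), (4, 4), (6, 6), (50, 30), (60, 60),
          (70, 90), (90, 90), (99, 99), (101, 101), (102, 102)]
anchors = [(5, 5), (100, 100)]   # the two desired centers, added as extra rows
boost = 1000                     # assumed frequency for the anchor rows

rows = points + anchors
freq = [1] * len(points) + [boost] * len(anchors)
# Assumed starting guess: the two anchors plus the mean of the ten points.
start = anchors + [(58.5, 58.5)]

centers, labels = weighted_kmeans(rows, freq, start)
print(centers)
```

With these inputs the first two centers end up very close to, but not exactly at, (5, 5) and (100, 100), which is the "heavily biased" behaviour described above. A larger boost tightens the match but never makes it exact.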
The datatable sample is also attached with the "biased" K-Means script if you want to have a look.
I hope this workaround will help you,
Thanks @Victor_G, that boosted frequency seems to work OK for my use case! Point taken on the manipulated 3rd cluster location, as the real-world data is not as obviously clustered as the sample.
Sorry, I cannot share the actual data or specific details, but the main data focuses on customer support networks. We currently have multiple facilities in a given city like Los Angeles and want to be located close to all points of service. However, we already own real estate at 2 locations today and are investigating a 3rd, so there are a few different scenarios:
1) If we gave up all of our current properties, where would we relocate all 3 new properties (k-means does this fine)?
2) If we keep the existing 2 locations, where is the ideal location of a 3rd property?
3) If property is not available in the ideal location, where is land for sale and how much "less ideal" is this location compared to example 2?
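Scenarios 2 and 3 can also be sketched directly, without the frequency trick, by holding two centers fixed and letting only the third one move. The Python below is a rough illustration on the sample points from the question, not a JMP feature: the helper names, the candidate site at (60, 60), and the use of summed squared Euclidean distance as the "cost" are all assumptions, and real travel distances in a city would behave differently.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_with_fixed(points, fixed, free_start, max_iter=100):
    """Lloyd-style iteration that updates only the single free center."""
    free = tuple(free_start)
    for _ in range(max_iter):
        centers = list(fixed) + [free]
        # Points whose nearest center is the free one (last index).
        mine = [p for p in points
                if min(range(len(centers)),
                       key=lambda j: sq_dist(p, centers[j])) == len(fixed)]
        if not mine:
            break  # nothing assigned to the free center; leave it in place
        new_free = tuple(sum(p[d] for p in mine) / len(mine)
                         for d in range(len(free)))
        if new_free == free:
            break
        free = new_free
    return free

def total_cost(points, sites):
    """Sum over points of the squared distance to the nearest site."""
    return sum(min(sq_dist(p, s) for s in sites) for p in points)

points = [(3, 3), (4, 4), (6, 6), (50, 30), (60, 60),
          (70, 90), (90, 90), (99, 99), (101, 101), (102, 102)]
fixed = [(5, 5), (100, 100)]          # the two locations already owned

# Scenario 2: start the free center at the overall mean and let it settle.
start = tuple(sum(p[d] for p in points) / len(points) for d in range(2))
ideal = kmeans_with_fixed(points, fixed, start)

# Scenario 3: compare a hypothetical available parcel against the ideal spot.
candidate = (60, 60)
cost_ideal = total_cost(points, fixed + [ideal])
cost_candidate = total_cost(points, fixed + [candidate])
print(ideal, cost_ideal, cost_candidate, cost_candidate / cost_ideal)
```

The ratio of the two costs gives one possible measure of how much "less ideal" a candidate parcel is. Because the result can depend on the starting guess, it may be worth rerunning from several starting positions on real data.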