
K means with fixed centers

Hello, is there a method to calculate k means for a dataset but have some centers pre-defined?  That is, I have the following random data and want to find k=3 centers, but 2 of the centers are already fixed at a given location.  Does the functionality exist in JMP 14.2.0?  Thanks!

 

Points:

Label     X     Y
Point1    3     3
Point2    4     4
Point3    6     6
Point4    50    30
Point5    60    60
Point6    70    90
Point7    90    90
Point8    99    99
Point9    101   101
Point10   102   102

 

Find 3rd cluster if these two centers are already defined:

Label       X     Y
Centroid1   5     5
Centroid2   100   100

 

The regular output from analyzing the points works fine and optimizes the 3 clusters, but does not take into account the desired fixed clusters:

[Screenshot: standard K-Means output for the sample points, 3 clusters]

 

2 ACCEPTED SOLUTIONS

Victor_G
Super User

Re: K means with fixed centers

Hi @AvgRegression52,

 

Your idea seems counter-intuitive for an unsupervised algorithm like K-Means, since it seems you already have a priori information about the different clusters/groups. Out of curiosity, why would you want to change the location of the centroids found by the K-Means algorithm?

There may be a workaround to get the results you expect:

 

  1. In your data table, create two new rows ("Centroid1" and "Centroid2") with the X and Y values you would like these two centroids to have (in your example, [5,5] and [100,100]).
  2. Then create a numeric column "Weight", where all the original observations have a low value (for example, 1 or less) and the 2 newly added "centroid" rows have very large values (for example, 1000000 or more).
  3. Then, when launching the K-Means analysis platform, use X and Y as "Y, Columns" and use the new "Weight" column in the "Freq" or "Weight" role to bias the centroid locations. Launch K-Means with 3 clusters (or any number you prefer).

 

This way, the two newly added rows artificially carry far more frequency (or importance) than the other rows of your table, so two of the cluster centres are pulled as close as possible to the coordinates of these two "centroid" rows (see capture "Biased_K-Means").

Use this workaround with extra caution, as the location of the last cluster centre could be affected by this process, depending on how the points are distributed. In your example, the three clusters are well defined and far apart, so adding a lot more frequency/weight to the two new "centroid" rows (which sit at the extremes of the X and Y distributions) won't affect the location of the third cluster centroid ("in the middle").
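
For readers who want to see the mechanism outside JMP, here is a minimal Python sketch of the weighting idea. It is not JMP's implementation: it is a plain weighted Lloyd's iteration, and the function name `weighted_kmeans`, the starting guess for the centers, and the weight of 1000000 are assumptions chosen for illustration. The data are the ten sample points from the question plus the two heavily weighted "centroid" rows.

```python
import math

def weighted_kmeans(points, weights, start_centers, max_iter=100):
    """Lloyd's algorithm in 2-D where each row counts `weight` times."""
    centers = [tuple(c) for c in start_centers]
    for _ in range(max_iter):
        # Accumulate [sum of w*x, sum of w*y, sum of w] per cluster.
        acc = [[0.0, 0.0, 0.0] for _ in centers]
        for p, w in zip(points, weights):
            # Assignment step: nearest center by Euclidean distance.
            j = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            acc[j][0] += w * p[0]
            acc[j][1] += w * p[1]
            acc[j][2] += w
        # Update step: weighted mean (an empty cluster keeps its old center).
        updated = [
            (sx / sw, sy / sw) if sw > 0 else c
            for (sx, sy, sw), c in zip(acc, centers)
        ]
        if updated == centers:
            break
        centers = updated
    return centers

# The ten sample points from the question, weight 1 each.
data = [(3, 3), (4, 4), (6, 6), (50, 30), (60, 60),
        (70, 90), (90, 90), (99, 99), (101, 101), (102, 102)]
# Two extra "centroid" rows that carry a very large weight.
pins = [(5, 5), (100, 100)]
rows = data + pins
weights = [1.0] * len(data) + [1_000_000.0] * len(pins)

# Starting guess for the three centers (an assumption of this sketch).
biased_centers = weighted_kmeans(rows, weights, [(5, 5), (100, 100), (50, 50)])
for cx, cy in biased_centers:
    print(f"({cx:.4f}, {cy:.4f})")
```

Because each heavy row dominates the weighted mean of its cluster, two of the centers end up very close to, but not exactly at, the requested coordinates. The third center is simply the mean of whatever points are left closest to it, which is why its location depends on how the real data are distributed.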


The datatable sample is also attached with the "biased" K-Means script if you want to have a look.

 

I hope this workaround will help you,

Victor GUILLER
L'Oréal Data & Analytics

"It is not unusual for a well-designed experiment to analyze itself" (Box, Hunter and Hunter)


Re: K means with fixed centers

Thanks @Victor_G, that boosted frequency seems to work OK for my use case!  Point taken on the manipulated 3rd cluster location, as the real-world data does not cluster as obviously as the sample.

 

Sorry, I cannot share the actual data or specific details, but the main data focuses on customer support networks.  We currently have multiple facilities in a given city like Los Angeles and want to be located close to all points of service.  However, we already own real estate at 2 locations today and are investigating a 3rd, so there are a few different scenarios:

 

1) If we gave up all of our current properties, where would we relocate all 3 new properties (k-means does this fine)?

 

2) If we keep the existing 2 locations, where is the ideal location of a 3rd property?

 

3) If property is not available in the ideal location, where is land for sale and how much "less ideal" is this location compared to example 2?
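
Scenarios 2 and 3 can also be sketched directly, without the weighting trick. The Python below is a rough illustration, not a JMP feature: it keeps two centers truly fixed, moves only the third with a Lloyd-style iteration, and then compares a candidate site against that result. The helper names (`fit_free_center`, `total_cost`), the starting guess, the candidate site (60, 40), and the choice of summed squared distance as the measure of "less ideal" are all assumptions made for the example; the sample points stand in for the real service locations.

```python
import math

def nearest(point, centers):
    """Index of the center closest to `point` (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))

def fit_free_center(points, fixed, start, max_iter=100):
    """Lloyd-style iteration that moves only one center; `fixed` never moves."""
    free = tuple(start)
    for _ in range(max_iter):
        centers = list(fixed) + [free]
        # Points currently served by the free center (the last index).
        mine = [p for p in points if nearest(p, centers) == len(fixed)]
        if not mine:
            break  # no point is closest to the free center; keep the last guess
        updated = (sum(p[0] for p in mine) / len(mine),
                   sum(p[1] for p in mine) / len(mine))
        if updated == free:
            break
        free = updated
    return free

def total_cost(points, centers):
    """K-means objective: sum of squared distances to the nearest center."""
    return sum(min(math.dist(p, c) for c in centers) ** 2 for p in points)

# Sample points from the question, standing in for the service locations.
data = [(3, 3), (4, 4), (6, 6), (50, 30), (60, 60),
        (70, 90), (90, 90), (99, 99), (101, 101), (102, 102)]
owned = [(5, 5), (100, 100)]  # the two locations that are kept

# Scenario 2: best third location given the two fixed ones.
ideal = fit_free_center(data, owned, start=(50, 50))
ideal_cost = total_cost(data, owned + [ideal])

# Scenario 3: a hypothetical site that is actually for sale.
candidate = (60, 40)
candidate_cost = total_cost(data, owned + [candidate])

print("ideal third location:", ideal)
print("extra cost of candidate site:", candidate_cost - ideal_cost)
```

Like ordinary K-Means, this only finds a local optimum, so the result can depend on the starting guess. With real data it is worth trying several starts, and a cost based on travel distance or time may be a better measure than squared Euclidean distance.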


Victor_G
Super User

Re: K means with fixed centers

Hi @AvgRegression52,

Glad to hear this trick seems to work OK for you! Yes, since the frequency/weight is manipulated to control the locations of the first two centroids, the location of the third centroid may be affected by this change.

No problem, I didn't want to be intrusive. I now understand better why you wanted to keep 2 centroids. This sounds like iterative K-Means clustering (or K-Means clustering with constraints): being able to cluster new observations using already existing clusters (from a previous analysis, for example) and newly created clusters (if needed). I have already seen a similar question on StackOverflow, which is why I was a bit curious about the context.

Thanks for your explanations!

All the best,
Victor GUILLER