Solved: Re: K means with fixed centers

AvgRegression52 · Jun 8, 2023 5:58 PM

Hello, is there a method to calculate k means for a dataset but have some centers pre-defined? That is, I have the following random data and want to find k=3 centers, but 2 of the centers are already fixed at a given location. Does the functionality exist in JMP 14.2.0? Thanks!

Points:

Label	X	Y
Point1	3	3
Point2	4	4
Point3	6	6
Point4	50	30
Point5	60	60
Point6	70	90
Point7	90	90
Point8	99	99
Point9	101	101
Point10	102	102

Find 3rd cluster if these two centers are already defined:

Label	X	Y
Centroid1	5	5
Centroid2	100	100

The regular output from analyzing the points works fine and optimizes the 3 clusters, but does not take the desired fixed centers into account:

Victor_G · Jan 4, 2023 6:09 AM

Hi @AvgRegression52,

Your idea seems counter-intuitive for an unsupervised algorithm like K-Means, since it seems you already have a priori information about the different clusters/groups. Out of curiosity, why would you want to change the location of the centroids found by the K-Means algorithm?

There may be a workaround to get the results you expect:

In your data table, create two new rows ("Centroid1" and "Centroid2") with the X and Y values you would like these two centroids to have (in your example, [5,5] and [100,100]).
Then, create a numeric column "Weight" in which all the original observations have a low value (for example, 1 or less) and the 2 newly added "centroid" rows have very large values (for example, 1000000 or more).
Then, when launching the K-Means platform, use X and Y as "Y, Columns" and assign the new "Weight" column to the "Freq" or "Weight" role to bias the centroid locations. Launch K-Means with 3 clusters (or any number you prefer).

This way, the two newly added rows artificially carry far more frequency (or importance) than the other rows of your table, and two of the cluster centres will be pulled as close as possible to the coordinates of these two rows (see capture "Biased_K-Means").

Use this workaround with caution: the location of the last cluster centre could be affected by this process, depending on how the points are distributed. In your example, the three clusters are well defined and far apart, so adding a lot more frequency/weight to the two new "centroid" rows (which sit at the extremes of the X and Y distributions) won't affect the location of the third cluster centroid ("in the middle").
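
For readers who want to see the weighting idea outside of JMP, here is a minimal pure-Python sketch of a weighted K-Means (Lloyd's algorithm) on the sample points. It is not JMP's implementation: the function name `weighted_kmeans`, the weight `BIG`, and the starting centers in `init` are assumptions made for illustration, and JMP chooses its own starting centers.

```python
import math

# Sample points from the original question.
points = [(3, 3), (4, 4), (6, 6), (50, 30), (60, 60),
          (70, 90), (90, 90), (99, 99), (101, 101), (102, 102)]
# Desired fixed centers, appended as heavily weighted pseudo-rows.
anchors = [(5, 5), (100, 100)]
BIG = 1_000_000  # assumed weight, far larger than the number of real rows

rows = points + anchors
weights = [1.0] * len(points) + [float(BIG)] * len(anchors)


def weighted_kmeans(rows, weights, centers, iters=100):
    """Lloyd's algorithm in which each row counts `weight` times."""
    centers = list(centers)
    labels = []
    for _ in range(iters):
        # Assignment step: index of the nearest center for every row.
        labels = [min(range(len(centers)),
                      key=lambda j: math.dist(r, centers[j])) for r in rows]
        # Update step: weighted mean of the rows assigned to each center.
        new = []
        for j, c in enumerate(centers):
            members = [(r, w) for r, w, lab in zip(rows, weights, labels)
                       if lab == j]
            if not members:
                new.append(c)  # an empty cluster keeps its center
                continue
            total = sum(w for _, w in members)
            new.append(tuple(sum(r[d] * w for r, w in members) / total
                             for d in range(len(c))))
        if new == centers:
            break  # converged
        centers = new
    return centers, labels


# Assumed start: the two anchors plus the plain mean of the real points.
init = anchors + [(58.5, 58.5)]
centers, labels = weighted_kmeans(rows, weights, init)
print(centers)
```

With this starting point, the first two centers land within about 0.0001 of [5,5] and [100,100] rather than exactly on them, which is the "bias" the weights introduce, and the third center settles on the mean of Point4 and Point5.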

The sample data table is also attached, with the "biased" K-Means script, if you want to have a look.

I hope this workaround will help you,

Victor GUILLER
Scientific Expertise Engineer
L'Oréal - Data & Analytics

View solution in original post

AvgRegression52 · Jan 4, 2023 02:52 PM

Thanks @Victor_G, that boosted frequency seems to work OK for my use case! Point taken on the manipulated 3rd cluster location, as the real-world data does not form clusters as obvious as those in the sample.

Sorry, I cannot share the actual data or specific details, but the main data focuses on customer support networks. We currently have multiple facilities in a given city like Los Angeles and want to be located close to all points of service. However, we already own real estate at 2 locations today and are investigating a 3rd, so there are a few different scenarios:

1) If we gave up all of our current properties, where would we relocate all 3 new properties (k-means does this fine)?

2) If we keep the existing 2 locations, where is the ideal location of a 3rd property?

3) If property is not available in the ideal location, where is land for sale and how much "less ideal" is this location compared to scenario 2?
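
Scenarios 2 and 3 can also be sketched directly, without the weighting trick, by holding the two existing sites fixed and updating only the free center. The Python below is a rough illustration under stated assumptions, not a JMP feature: `fit_free_center`, `cost`, the starting point, and the `candidate` parcel at (70, 40) are all hypothetical, and "less ideal" is measured as the increase in the usual K-Means objective (sum of squared straight-line distances to the nearest site), which may not match real travel costs.

```python
import math

# Sample points from the original question.
points = [(3, 3), (4, 4), (6, 6), (50, 30), (60, 60),
          (70, 90), (90, 90), (99, 99), (101, 101), (102, 102)]
fixed = [(5, 5), (100, 100)]  # existing sites, never moved


def cost(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(min(math.dist(p, c) for c in centers) ** 2 for p in points)


def fit_free_center(points, fixed, start, iters=100):
    """Lloyd-style iteration that updates only the single free center."""
    free = tuple(start)
    for _ in range(iters):
        # Points strictly closer to the free center than to any fixed site.
        mine = [p for p in points
                if math.dist(p, free) < min(math.dist(p, f) for f in fixed)]
        if not mine:
            break  # the free center attracts no points; leave it in place
        new = tuple(sum(p[d] for p in mine) / len(mine)
                    for d in range(len(free)))
        if new == free:
            break  # converged
        free = new
    return free


# Scenario 2: best third site given the two fixed ones
# (a local optimum that depends on the assumed starting point).
ideal = fit_free_center(points, fixed, start=(58.5, 58.5))

# Scenario 3: how much worse is a parcel that is actually for sale?
candidate = (70, 40)  # hypothetical available location
penalty = cost(points, fixed + [candidate]) - cost(points, fixed + [ideal])
print(ideal, penalty)
```

On the sample data the free center converges to (55, 45), and the hypothetical parcel raises the objective by about 500. Running `cost` over a list of parcels that are for sale would rank them by how far each falls short of the ideal site.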

Victor_G · Jan 4, 2023 03:29 PM

Hi @AvgRegression52,

Glad to hear this trick seems to work OK for you! Yes, since the frequency/weight is manipulated to control the locations of the first two centroids, the location of the third centroid may be affected by this change.

No problem, I didn't want to be intrusive. I now understand better why you wanted to keep 2 centroids. This sounds like iterative K-Means clustering (or K-Means clustering with constraints): being able to cluster new observations using already existing clusters (from a previous analysis, for example) plus newly created clusters if needed. I have already seen a similar question on StackOverflow, which is why I was a bit curious about the context.

Thanks for your explanations!

All the best,

Victor GUILLER
Scientific Expertise Engineer
L'Oréal - Data & Analytics