Re: Which columns to use for clustering - Doubt

NullMongoose240 · Jun 10, 2023 1:55 PM

I am trying to cluster on the columns below, but I am not sure which columns to use for Hierarchical Clustering, K-Means Clustering, or Latent Class Analysis.

Attaching the screenshot for reference.

Craige_Hales · May 27, 2023 5:45 AM

(Question title updated)

I think you could use any or all. The AccountIdentifier (all unique values) won't help and should probably be left out. I've only used the hierarchical clustering; I think it works well with continuous numeric and ordinal and nominal character data.

The variables you choose will determine which rows are clustered together because they are similar for those variables.

You have to choose how many clusters you want.
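
A minimal Python sketch of the point about AccountIdentifier (this thread is about JMP, so the code is only an illustration, not a JMP feature): a column whose values are all distinct cannot make any two rows look similar, so it can be screened out before choosing clustering variables. The function name `identifier_like_columns` and every column except AccountIdentifier are hypothetical. Float columns are skipped on the assumption that continuous measurements are often all distinct without being identifiers.

```python
def identifier_like_columns(rows):
    """Return names of non-float columns whose values are all distinct."""
    flagged = []
    for name in rows[0]:
        values = [row[name] for row in rows]
        all_unique = len(set(values)) == len(values)
        is_float = any(isinstance(v, float) for v in values)
        if all_unique and not is_float:
            flagged.append(name)
    return flagged


# Hypothetical rows; only AccountIdentifier is a column name from the thread.
rows = [
    {"AccountIdentifier": "A001", "Region": "North", "Balance": 1250.0},
    {"AccountIdentifier": "A002", "Region": "South", "Balance": 310.5},
    {"AccountIdentifier": "A003", "Region": "North", "Balance": 980.0},
    {"AccountIdentifier": "A004", "Region": "East", "Balance": 45.25},
]

drop = identifier_like_columns(rows)
keep = [name for name in rows[0] if name not in drop]
print("leave out:", drop)
print("candidates:", keep)
```

Whether the remaining columns are all worth using still depends on the question being asked of the data.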

Craige

dale_lehman · May 27, 2023 10:56 AM

This is not an answer, but a related question. I'm hoping someone can comment on this. If one of those variables is something you want to predict, is it appropriate to use that in the clustering? My understanding is that potential explanatory factors might be clustered, but a response variable should not be part of the clustering - then we would subsequently build models to predict the response variable using the clusters as factors (this is assuming that this is a "supervised" analysis with a variable to predict rather than an "unsupervised" analysis to identify patterns without prediction).

So my question is whether/when a response variable would be appropriate to include in the clustering?

Victor_G · May 28, 2023 10:16 AM

Hi @NullMongoose240,

To be clear, clustering is an unsupervised technique, meaning that you don't have a response column Y that you would like to match or predict. There is no "ground truth" to compare the clustering results with. The goal of clustering is to find and create groups that are as homogeneous as possible in a large, diverse, and sometimes high-dimensional dataset, with the help of various techniques.
You may have to provide more information so that we can help you: dataset, objectives, variables, context... I may actually have more questions than answers, and one of them concerns the goal of your analysis.

Would you like to cluster individuals or variables/features? These are clearly two different use cases:

Clustering individuals may help you find groups of people with similar behaviour or patterns, and then pursue a different business objective or decision for each group. For example, in banking, depending on the buying behavior of customers, you can provide different services, predict churn rates, offer different kinds of insurance, ...
Several methods are available in JMP, depending on the dataset and the clustering objective:
- K Means Cluster (jmp.com) is a simple and quick centroid-based clustering method. It is very helpful for grouping similar observations/individuals, but it is sensitive to outliers and will only create "spherical" clusters, because it does not take into account the variances and distributions of the features used in the clustering. The number of clusters must be set beforehand (or several numbers of clusters can be tested to find which one seems "reasonable").
- Normal Mixtures (jmp.com) is a distribution-based clustering method that offers more flexibility than K-Means, as the variances and distributions of the variables/features are taken into account. It is a helpful technique for finding sub-populations in a global population. The number of clusters must be set beforehand (or several numbers of clusters can be tested to find which one seems "reasonable").
- Hierarchical Cluster (jmp.com) is easy to run, understand, and visualize (thanks to the tree structure of the dendrogram), and the number of clusters does not need to be set (or tested) before launching the analysis. Another advantage is that any number of clusters can be chosen by cutting the tree at the right level, so it offers some flexibility depending on the use case. The drawbacks of this method are that it doesn't work well with missing data, and it can be difficult or slow to run on very large datasets. This algorithm works best on data with a hierarchical structure.
- Latent Class Analysis (jmp.com) is a clustering method that groups observations/individuals based on categorical variables.
  
  More information on clustering algorithms can be found in JMP Help, or here: Developers Google: clustering algorithms
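
To make the "number of clusters must be set beforehand" point concrete, here is a minimal pure-Python sketch of K-Means (Lloyd's algorithm) run for several values of k. It is an illustration under stated assumptions, not JMP's implementation: the data points are made up, the helper names are hypothetical, and the deterministic farthest-point start is a choice made here so that the example is reproducible.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_point_init(points, k):
    """Assumed starting rule: first point, then repeatedly the farthest point."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    return centroids


def assign(points, centroids):
    """Assignment step: each point goes to its nearest centroid."""
    return [min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c])) for p in points]


def kmeans(points, k, max_iterations=100):
    """Return (labels, centroids, within-cluster sum of squares)."""
    centroids = farthest_point_init(points, k)
    for _ in range(max_iterations):
        labels = assign(points, centroids)
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Update step: move the centroid to the mean of its members.
            updated.append(tuple(sum(col) / len(members) for col in zip(*members))
                           if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    labels = assign(points, centroids)
    within_ss = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, within_ss


# Made-up 2-D data with three visibly separate groups.
points = [
    (1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (1.1, 1.3),
    (8.0, 8.0), (8.2, 7.9), (7.9, 8.3), (8.1, 8.1),
    (1.0, 8.0), (0.9, 8.2), (1.2, 7.8), (1.1, 8.1),
]

results = {k: kmeans(points, k) for k in range(1, 5)}
for k, (labels, _, within_ss) in results.items():
    print(k, round(within_ss, 3), labels)
```

On data like this, the within-cluster sum of squares drops sharply up to k = 3 and only slightly afterwards, which is one common way to judge which number of clusters seems "reasonable".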

Clustering variables may help you group similar variables into representative groups. This technique is particularly interesting in high-dimensional problems, as it can be used as a dimension reduction method: the variable clusters can be used in modeling to avoid a large set of features (with possible multicollinearity) and still get an interpretable model. More information here: Cluster Variables
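
The idea of grouping similar variables can be sketched in a few lines of Python. This is deliberately not the algorithm behind JMP's Cluster Variables platform; it is a simpler stand-in that greedily groups variables whose absolute Pearson correlation with a group's first member reaches a threshold. The column names, the values, and the 0.8 threshold are all assumptions made for the illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


def group_variables(columns, threshold=0.8):
    """Greedy grouping: join the first group whose first member correlates strongly."""
    groups = []
    for name, values in columns.items():
        for group in groups:
            representative = columns[group[0]]
            if abs(pearson(values, representative)) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])  # no similar group found, start a new one
    return groups


# Hypothetical columns: two pairs of strongly related variables.
columns = {
    "income": [20, 35, 50, 65, 80, 95],
    "spending": [11, 18, 24, 33, 41, 47],
    "age": [40, 25, 60, 30, 55, 35],
    "tenure": [15, 3, 30, 6, 27, 9],
}

groups = group_variables(columns)
print(groups)
```

One variable per group (or a summary of each group) could then stand in for the whole group in a model, which is the dimension reduction use described above.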

If you can provide more information about the context, the objectives you're trying to achieve, and the dataset you have, it may be easier to help you.

Victor GUILLER

"It is not unusual for a well-designed experiment to analyze itself" (Box, Hunter and Hunter)

statman · May 28, 2023 12:27 PM

Much of what Victor has posted is useful. I'm not sure clustering is "unsupervised" (actually, I'm not exactly sure what that means). All data analysis should be a function of what questions you want to answer (what hypotheses you want insight into) and how you got the data. So to the OP, what questions do you want to answer? How did you collect the data you want to use to possibly answer the questions?

Victor, I'm also confused by your statement: "There is no "ground truth" to compare the results of the clustering with." This is true of any analysis. Certainly we mere mortals have no absolute knowledge of the truth. I like Box's quote (in my signature).

"All models are wrong, some are useful" G.E.P. Box

Victor_G · May 28, 2023 01:19 PM

Hi @statman,

I completely agree with you: whatever the data, it should be analyzed with clear objectives, to answer questions with sufficient information and precision for the context. There are indeed a lot of questions to answer on this specific topic before we can help the OP.

To answer briefly, "supervised" and "unsupervised" learning are the two common Machine Learning terms I'm referring to.

In "supervised learning", we provide the machine learning model with labeled observations (containing one or several response values) to learn from, so it can predict the labels/responses of new, unseen examples.
"Unsupervised learning" involves training the model on unlabeled data, where the model tries to find patterns, groups, or structures on its own, without any predefined labels/responses (the clustering use case). This is what I mean by the absence of "ground truth": the absence of labels/response values to verify and improve the model.

I hope this response makes my previous answer easier to understand.
I'm not implying that any "global ground truth" can be revealed about anything by any model (even if we already know that 42 is the Answer to the Ultimate Question of Life, the Universe, and Everything).
I have known this quote for a long time, and I also use it a lot to explain statistical modeling and concepts.

@dale_lehman: Agreed. If a response is present in the dataset, it should not be included in the clustering (to prevent data/information leakage). Your description looks related to the clustering of variables I'm referring to: use the clusters as inputs in the predictive model.

Victor GUILLER

"It is not unusual for a well-designed experiment to analyze itself" (Box, Hunter and Hunter)

dale_lehman · May 28, 2023 01:18 PM

I do realize that clustering is an "unsupervised" technique, but I believe it is only unsupervised for an initial step. A famous Deming quote: "The only useful function of a statistician is to make predictions, and thus to provide a basis for action." Most of the use cases I can think of for clustering eventually use the clusters in some sort of predictive model (the only exception I can think of is when the clusters are used for operations, such as to help create marketing channels or teams). So, for analyses where clusters are to be used for prediction, the issue arises of whether it is meaningful to include the intended response variable when creating the clusters. That is the genesis of my question.
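
One way to follow the suggestion made earlier in the thread (cluster on the explanatory variables only, then use the cluster as a factor for the response) is sketched below in Python. It is only an illustration under assumptions: the column names and values are hypothetical, and the clustering step is a simple single-linkage cut (rows end up in the same cluster when they are connected by a chain of pairwise distances no larger than `cut_height`), not any specific JMP platform.

```python
from math import dist
from statistics import mean

# Hypothetical rows: two explanatory columns and a response ("churned").
rows = [
    {"visits": 1.0, "balance": 2.0, "churned": 1},
    {"visits": 1.5, "balance": 1.8, "churned": 1},
    {"visits": 1.2, "balance": 2.4, "churned": 0},
    {"visits": 8.0, "balance": 9.0, "churned": 0},
    {"visits": 8.4, "balance": 8.7, "churned": 0},
    {"visits": 7.9, "balance": 9.3, "churned": 0},
]
explanatory = ["visits", "balance"]  # the response is deliberately not listed
response = "churned"


def single_linkage_clusters(points, cut_height):
    """Label points by the groups a single-linkage tree gives when cut at cut_height."""
    labels = [None] * len(points)
    next_label = 0
    for start in range(len(points)):
        if labels[start] is not None:
            continue
        labels[start] = next_label
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(len(points)):
                if labels[j] is None and dist(points[i], points[j]) <= cut_height:
                    labels[j] = next_label
                    stack.append(j)
        next_label += 1
    return labels


# Cluster on the explanatory columns only.
points = [tuple(row[name] for name in explanatory) for row in rows]
labels = single_linkage_clusters(points, cut_height=2.0)

# Use the cluster as a factor: summarize the response within each cluster.
by_cluster = {}
for row, label in zip(rows, labels):
    by_cluster.setdefault(label, []).append(row[response])
churn_rate = {label: mean(values) for label, values in by_cluster.items()}
print(labels, churn_rate)
```

Because the response never enters the distance calculation, any difference in churn rate between clusters reflects the explanatory variables rather than the response itself. In practice the explanatory columns would usually be put on a common scale before computing distances.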