pnogau
Level I

Two questions that came up while using clustering in JMP

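I am using JMP Pro to run an unsupervised machine learning cluster analysis on roughly 10,000 samples with 20 variables. A few questions came up along the way, so I am asking other users or engineers for help. Thank you for taking the time to read and answer my questions!

Question 1: I have carefully read the documentation on the software's clustering features. Hierarchical clustering in JMP is suitable for small samples of any data type. If the 20 variables are of mixed types (continuous, discrete, ordinal, and nominal), do I need to standardize the continuous and discrete variables myself beforehand and then select "Unstandardized" under "Standardize By"? Or, after I select "Unstandardized", will the software automatically identify the continuous and discrete variables and standardize them while leaving the ordinal and nominal variables at their original values (so that no manual standardization is needed before hierarchical clustering)?

Question 2: Cluster analysis is a form of unsupervised learning. If I run hierarchical clustering on 19 variables and assign one binary nominal variable to "By", is the analysis still considered unsupervised learning, or has it become semi-supervised learning? If it counts as "semi-supervised learning" even though clustering itself is unsupervised, how should the relationship between the two be described accurately?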

1 ACCEPTED SOLUTION

Victor_G
Super User

Re: Two questions that came up while using clustering in JMP

Hi @pnogau,

 

Welcome to the Community!

Concerning your questions:

  1. If your data mix several data and modeling types (numeric continuous, ordinal, and nominal), then only the Hierarchical Cluster platform can handle that variety. You can find more information in Overview of Platforms for Clustering Observations, where this table is shown:
    Victor_G_0-1724048392887.png
    You don't need to pre-process the numeric continuous variables beforehand. Several options let you do the pre-processing directly in the platform by specifying the data format, the type of standardization, and the missing-data imputation: see Launch the Hierarchical Cluster Platform.
  2. I'm not sure I fully understand your second question.
    Clustering is used when you don't know beforehand how many groups (clusters) your data contain or which group each observation belongs to, so it is an unsupervised learning technique. Hierarchical clustering is an interesting technique and platform in JMP because it can perform Two-Way clustering, which groups not only the observations but also the variables, so you can see the similarities and correlations between the variables used. You can run this analysis alongside other multivariate platforms, such as Correlations and Multivariate Techniques, or alongside visualizations built in Graph Builder, to better assess the correlations between your variables.
    Also, if your binomial variable is some kind of target, you have two options. You could perform the clustering "blindly", see how many groups are recommended, and then analyze the link between the groups and the binomial variable (a combination of unsupervised learning for the clustering and supervised learning for the link between clusters and binomial target). Or you could directly specify that you want 2 clusters in the Hierarchical Clustering platform (which could then be considered semi-supervised learning, since you already know the number of clusters to find and you specify it) and see whether and how the clustering matches the binomial target variable.
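
    The last step (checking how the clusters line up with the binomial variable) can be sketched outside JMP in a few lines of plain Python. This is only an illustration under assumptions: the `clusters` and `target` lists are made-up data standing in for a saved cluster column and the binomial column, and `crosstab` and `purity` are hypothetical helper names, not JMP functions.

```python
from collections import Counter

def crosstab(clusters, target):
    """Count observations for each (cluster, target level) pair."""
    return Counter(zip(clusters, target))

def purity(clusters, target):
    """Share of observations that fall in their cluster's majority target level."""
    largest = {}
    for (cluster, _level), n in crosstab(clusters, target).items():
        largest[cluster] = max(largest.get(cluster, 0), n)
    return sum(largest.values()) / len(clusters)

# Made-up example: labels from a 2-cluster solution and a binary variable.
clusters = [1, 1, 1, 2, 2, 2, 2, 1]
target = ["A", "A", "B", "B", "B", "B", "A", "A"]

table = crosstab(clusters, target)
for key in sorted(table):
    print(key, table[key])
print("purity:", purity(clusters, target))
```

    A purity near 1 would suggest the clusters largely reproduce the binomial variable; a value near the share of the larger level would suggest they capture something else.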

 

I hope this answer helps you,

 

Victor GUILLER
Scientific Expertise Engineer
L'Oréal - Data & Analytics


2 REPLIES 2
pnogau
Level I

Re: Two questions that came up while using clustering in JMP

Hi @Victor_G,

Thank you very much for your assistance and patient response to my question on the forum. Your answers and suggestions have provided me with valuable insights, and I believe that the issue I was facing has been tentatively resolved.

Question 1: Since my data contain mixed types of variables, hierarchical clustering is indeed the right choice. Following your advice, I carefully reread "Launch the Hierarchical Cluster Platform" and found the solution: to address the different measurement scales of the continuous and ordinal columns, it seems I should standardize the continuous and discrete variables first and then select "Unstandardized" under "Standardize By".

pnogau_0-1724409595702.png

pnogau_4-1724409921842.png
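
To make the scale issue concrete, here is a minimal sketch in plain Python of why standardization matters for distance-based clustering. It is an illustration only, not JMP's implementation: `zscore` is a hypothetical helper, and the `age` and `income` values are made-up data.

```python
import math
from statistics import mean, stdev

def zscore(values):
    """Center a column on its mean and scale it by its sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Made-up columns measured on very different scales.
age = [20, 30, 40]
income = [1000, 3000, 2000]

raw = list(zip(age, income))
std = list(zip(zscore(age), zscore(income)))

# Distances from the first observation to the other two, before and after scaling.
print("raw:", math.dist(raw[0], raw[1]), math.dist(raw[0], raw[2]))
print("standardized:", math.dist(std[0], std[1]), math.dist(std[0], std[2]))
```

On the raw values the income column dominates, so the second observation looks about twice as far from the first as the third one does. After standardization both columns contribute equally and the two distances are the same.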


Question 2: Your understanding of my doubts was very accurate, and your response has given me important inspiration. My aim is to use cluster analysis to discover different clusters of individuals within a large dataset and to explore visually the potential relationships among more than twenty variables, which is an unsupervised machine learning task. However, choosing a certain binomial variable under "By" seems to yield much more satisfactory clustering results, and I am wondering whether the analysis has thereby become semi-supervised or supervised learning.

pnogau_1-1724409741560.png

pnogau_5-1724409958565.png
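
One way to think about the "By" question is sketched below in plain Python. It assumes that "By" works here as it generally does in JMP platforms, that is, a separate analysis is run for each level of the By variable; I have not verified this for my own report. Under that assumption the binomial variable only splits the rows and never enters the distance calculation. The function names `agglomerate` and `cluster_by` are hypothetical, the data are made up, and naive single linkage is used only to keep the sketch short (it is not the method JMP uses by default).

```python
import math
from collections import defaultdict

def agglomerate(points, k):
    """Naive single-linkage agglomerative clustering.

    Returns k clusters, each a list of indices into `points`.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters

def cluster_by(rows, by, k):
    """Split rows by the By variable, then cluster each subset independently."""
    groups = defaultdict(list)
    for features, level in zip(rows, by):
        groups[level].append(features)
    return {level: agglomerate(pts, k) for level, pts in groups.items()}

# Made-up data: two numeric features per row plus a binary By variable.
rows = [(0, 0), (1, 0), (10, 10), (11, 10), (0, 5), (0, 6), (10, 0), (10, 1)]
by = ["yes", "yes", "yes", "yes", "no", "no", "no", "no"]

result = cluster_by(rows, by, k=2)
for level, clusters in result.items():
    print(level, clusters)
```

In this sketch each subset is clustered without any label guiding the grouping, which would make each per-level analysis still unsupervised. Whether the same holds for my JMP report is the conjecture I want to verify.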


Actually, I was fortunate enough to get in touch with an engineer responsible for JMP's university business in China, and I am planning to further verify my conjecture with the engineer. If you are interested, I will share the answers I receive with you.

I was very excited to receive your reply! Wishing you a happy life and smooth work ~