Solved: Two questions about using clustering in JMP

pnogau · Aug 18, 2024 08:20 AM

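I am using JMP Pro to run an unsupervised machine learning cluster analysis on roughly 10,000 samples with 20 variables. A few questions came up along the way, and I would like to ask other users or engineers for help. Thank you for taking the time to read and answer my questions!

Question 1: I have carefully read the description of the software's clustering features: hierarchical clustering in JMP is suited to small samples of any data type. If the 20 variables in my data are of mixed types (continuous, discrete, ordinal, and nominal), do I need to standardize the continuous and discrete variables myself before running hierarchical clustering, and then choose "Unstandardized" under "Standardize By"? Or, once "Unstandardized" is selected, will the software automatically identify the continuous and discrete variables and standardize them, while leaving the ordinal and nominal variables at their original values (so that no manual standardization is needed before hierarchical clustering)?

Question 2: Cluster analysis belongs to unsupervised learning. If I run hierarchical clustering on 19 variables and set one binary nominal variable under "By", is the analysis still considered unsupervised learning, or has it become semi-supervised learning? If it counts as "semi-supervised learning" even though clustering itself is unsupervised, how should the relationship between the two be described accurately?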

Victor_G · Aug 19, 2024 02:43 AM

Hi @pnogau,

Welcome to the Community!

Concerning your questions:

If your data mixes data and modeling types (numeric continuous, ordinal, and nominal), then the Hierarchical Cluster platform is the only clustering platform that can handle all of them. You can find more information in "Overview of Platforms for Clustering Observations", where a summary table of the platforms is shown.

You don't need to process the numeric continuous variables beforehand. The platform offers several pre-processing options directly in its launch window, where you specify the data format, the type of standardization, and the missing data imputation: see "Launch the Hierarchical Cluster Platform".

I'm not sure I fully understand your second question.

Clustering is used when you don't know beforehand how many groups (clusters) your data contains or which group each observation belongs to, so it is an unsupervised learning technique. Hierarchical clustering is an interesting technique and platform in JMP because it also lets you perform two-way clustering, in which both the observations and the variables are clustered, so you can see the similarity and correlations among the variables. This analysis can be used in addition to other multivariate platforms such as Correlations and Multivariate Techniques, or alongside visualizations built with Graph Builder, to better assess the correlations between your variables.

Also, if your binomial variable is some kind of target, you have two options. You could perform the clustering "blindly", see how many groups are recommended, and then analyze the link between the groups and the binomial variable (a combination of unsupervised learning for the clustering and supervised learning for the link between clusters and binomial target). Or you could directly specify that you want 2 clusters in the Hierarchical Cluster platform (which could then be considered semi-supervised learning, since you already know the number of clusters to find and specify it) and see whether and how the clustering matches the binomial target variable.
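The first option above, clustering "blindly" and then checking how the clusters line up with the binomial variable, can be sketched outside JMP in a few lines of Python. This is only an illustration under assumptions: the cluster labels and target values are made-up toy data, and `crosstab` and `purity` are hypothetical helper names, not JMP functions.

```python
from collections import Counter

def crosstab(clusters, target):
    """Count how often each (cluster, target) pair occurs."""
    return Counter(zip(clusters, target))

def purity(clusters, target):
    """Fraction of rows that fall in the majority target class of their cluster."""
    table = crosstab(clusters, target)
    best = {}
    for (c, t), n in table.items():
        best[c] = max(best.get(c, 0), n)
    return sum(best.values()) / len(clusters)

# Toy data (assumed): cluster labels from an unsupervised run, plus a binary target
clusters = [1, 1, 1, 2, 2, 2, 2, 1]
target = ["yes", "yes", "no", "no", "no", "no", "yes", "yes"]

table = crosstab(clusters, target)
for c in sorted(set(clusters)):
    print(c, {t: table[(c, t)] for t in sorted(set(target))})
print("purity:", purity(clusters, target))
```

A purity close to 1 would suggest that the clusters largely reproduce the binomial variable, while a value near the share of the larger target class would suggest little relationship between the two.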

I hope this answer helps,

Victor GUILLER

"It is not unusual for a well-designed experiment to analyze itself" (Box, Hunter and Hunter)

pnogau · Aug 23, 2024 06:53 AM

Hi @Victor_G,

Thank you very much for your assistance and patient response to my question on the forum. Your answers and suggestions have provided me with valuable insights, and I believe that the issue I was facing has been tentatively resolved.

Question 1: Since my data contains mixed types of variables, hierarchical clustering is indeed the method to use. Following your advice, I carefully reread "Launch the Hierarchical Cluster Platform" and found the solution: to address the different measurement scales of continuous and ordinal columns, it seems I should standardize the continuous and discrete variables first, and then select "Unstandardized" under "Standardize By."
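Why measurement scale matters for distance-based clustering can be seen in a small Python sketch. It does not reproduce what JMP does internally; the column names (`income`, `score`), the toy values, and the helper names `zscore` and `euclidean` are assumptions made for illustration.

```python
import math
from statistics import mean, stdev

def zscore(column):
    """Standardize one column to mean 0 and sample standard deviation 1."""
    m, s = mean(column), stdev(column)
    return [(x - m) / s for x in column]

def euclidean(a, b):
    """Euclidean distance between two rows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy data (assumed): income in currency units, score on a 1-5 ordinal scale
income = [30000.0, 31000.0, 90000.0]
score = [1.0, 5.0, 1.0]

raw_rows = list(zip(income, score))
std_rows = list(zip(zscore(income), zscore(score)))

# Raw scale: income dominates, so row 1 looks far closer to row 0 than row 2 does
print(euclidean(raw_rows[0], raw_rows[1]), euclidean(raw_rows[0], raw_rows[2]))
# Standardized: the 4-point gap in score now weighs about as much as the income gap
print(euclidean(std_rows[0], std_rows[1]), euclidean(std_rows[0], std_rows[2]))
```

On the raw scale the difference in `score` is practically invisible next to `income`; after standardizing each column, both contribute on a comparable footing, which is the concern the standardization setting addresses.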

Question 2: You understood my doubts very accurately, and your response gave me important inspiration. My aim is to use cluster analysis to discover different clusters within a large dataset (of individuals) and to explore visually the potential relationships among more than twenty variables, which is an unsupervised machine learning task. However, choosing a certain binomial variable under "By" seems to yield much more satisfactory clustering results. I am wondering whether this has now become semi-supervised or supervised learning.
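In JMP platforms, a By variable generally produces a separate analysis for each level of that variable instead of entering the analysis itself. Assuming that is what happens in the setup described here, the effect can be sketched in Python as follows. The toy rows, the naive single-linkage routine, and the helper names `single_linkage` and `cluster_by` are hypothetical and are not JMP's implementation.

```python
from collections import defaultdict

def single_linkage(points, k):
    """Naive agglomerative clustering: merge the two closest clusters until k remain."""
    clusters = [[p] for p in points]

    def gap(a, b):
        # Single linkage: smallest pairwise distance between members (1-D here)
        return min(abs(x - y) for x in a for y in b)

    while len(clusters) > k:
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: gap(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] += clusters.pop(j)
    return [sorted(c) for c in clusters]

def cluster_by(rows, k):
    """Split (by_value, x) rows on by_value, then cluster each subset on its own."""
    groups = defaultdict(list)
    for by_value, x in rows:
        groups[by_value].append(x)
    return {level: single_linkage(xs, k) for level, xs in groups.items()}

# Toy data (assumed): a binary By column and one numeric measurement
rows = [("A", 1.0), ("A", 1.2), ("A", 5.0), ("B", 10.0), ("B", 10.5), ("B", 20.0)]
result = cluster_by(rows, k=2)
for level in sorted(result):
    print(level, result[level])
```

In this sketch the By column never enters the distance calculation; it only partitions the rows. Under that reading, each subset is still clustered without labels, so the analysis would be a stratified unsupervised analysis and not a semi-supervised one.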

Actually, I was fortunate enough to get in touch with an engineer responsible for JMP's university business in China, and I am planning to further verify my conjecture with the engineer. If you are interested, I will share the answers I receive with you.

I was very excited to receive your reply! Wishing you a happy life and smooth work ~

pnogau · Sep 10, 2024 07:45 AM

Hi @Victor_G, I've made progress on my earlier question and wanted to share the good news with you. Just as you said, when clustering mixed data, JMP automatically standardizes continuous and discrete data, so no manual standardization is needed beforehand. Although I got in touch with the JMP China regional business manager for universities some time ago, I still haven't received a definitive answer. I am very grateful for your help at the time. Thank you again!
