Ressel
Level III

Clustering of amino acid data - how to choose an appropriate method

By chance I found this link to a public JMP post on social media, and it intrigued me from a professional point of view. Luckily the data set, including scripts, was available for download as a JMP file (see also the attachment to this post). The data set contains concentrations of a set of amino acids in protein supplements but has only 16 rows. Out of curiosity I added amino acid data for a material I'm interested in and repeated the cluster analysis using all available methods (Ward, average, centroid, ...). Being relatively new to this topic, I was surprised to see that the results depended heavily on the clustering method used. In some cases the result was what I wanted to see; in most it wasn't.

 

I'm wondering how to determine the best clustering method for a given data set. Is it a matter of holding back part of the data for validation to check how well a specific method works?

 

Also, is it fair to say that 17 rows of data are too few for a meaningful analysis? Where would one set the lower limit on the amount of data needed for this type of analysis?

1 ACCEPTED SOLUTION


Re: Clustering of amino acid data - how to choose an appropriate method

First of all, clustering is an unsupervised learning method. The clusters are unknown. The clusters are discovered in the data or proposed through the method. So strictly speaking, clustering is not a classification tool. I hope that your expectations are realistic, that is all.

 

Second, that JMP offers so many choices for the clustering method reflects the fact that none of them is superior. A statistician would refer to a method as 'superior' if it always provided the best result. But what is the best result? The best result produces homogeneous clusters that are distinct from one another. Homogeneous clusters have members that are close to one another. Distinct clusters are not close to one another. Many of the methods differ by their measure of distance or similarity.
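
One common way to put a number on "homogeneous and distinct" is the mean silhouette width. The sketch below is a minimal pure-Python illustration, not a description of what JMP computes; the function name `silhouette` and the toy coordinates are made up for the example.

```python
from math import dist


def silhouette(points, labels):
    """Mean silhouette width of a clustering.

    Values near 1 mean tight, well-separated clusters;
    values near 0 or below mean overlapping or mixed-up clusters.
    """
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster (homogeneity)
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster (distinctness)
        b = min(
            sum(dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy data (hypothetical): two obvious groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
sensible = [0, 0, 0, 1, 1, 1]   # follows the visible groups
scrambled = [0, 1, 0, 1, 0, 1]  # mixes the groups

print("sensible :", round(silhouette(points, sensible), 3))
print("scrambled:", round(silhouette(points, scrambled), 3))
```

The sensible assignment scores much higher than the scrambled one. With real data, such a score is only one view of "best"; it depends on the distance used and says nothing about whether the clusters are meaningful for your question.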

 

The other difference between clustering methods has to do with the agglomeration of observations into clusters, that is, the rule that decides which two clusters are merged next.
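
To illustrate how the merging rule alone can change the result, here is a minimal sketch of naive agglomerative clustering in pure Python. It covers only the single, complete and average rules (Ward and centroid are left out for brevity), and the function names and the one-dimensional toy data are assumptions made up for the example.

```python
from itertools import combinations
from math import dist


def linkage_distance(a, b, rule):
    """Distance between clusters a and b (lists of points) under a linkage rule."""
    pairwise = [dist(p, q) for p in a for q in b]
    if rule == "single":      # closest pair of members
        return min(pairwise)
    if rule == "complete":    # farthest pair of members
        return max(pairwise)
    if rule == "average":     # mean over all pairs of members
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown rule: {rule}")


def agglomerate(points, n_clusters, rule):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        # combinations() yields i < j, so deleting index j after merging is safe
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage_distance(clusters[ij[0]], clusters[ij[1]], rule),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return sorted(sorted(c) for c in clusters)


# Toy data (hypothetical): points on a line with slowly growing gaps.
points = [(0.0,), (1.0,), (2.1,), (3.3,), (4.6,), (6.0,)]

for rule in ("single", "complete", "average"):
    print(rule, agglomerate(points, 2, rule))
```

On this toy data, single linkage chains the first five points together and leaves 6.0 on its own, while complete and average linkage split the points four and two. Same data, same distance, different agglomeration rule, different clusters.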

 

The features of the data can also determine the performance of the different methods.

 

So you might want to re-visit the descriptions of the methods to see if one of them speaks to your sense of similarity or the way you would agglomerate observations based on your knowledge of the data.

 

The other way is empirical: try them all and select the one that produces the best result in your opinion.

Learn it once, use it forever!


4 REPLIES
Thierry_S
Level VI

Re: Clustering of amino acid data - how to choose an appropriate method

Hi,
It looks like nobody wants to take a stab at your question, so let me share my humble opinion.
First, it would be useful to understand your goal for this analysis: do you want to visualize the data and/or identify inherent structure in the data, or something else?
Second, I think that you may be expecting too much of the clustering algorithms. In my mind, basic clustering is primarily a visualization method to group items based on their similarities; different methods provide different ways to evaluate and display similarities. Hence, the different "look" of the dendrograms according to the different methods.
Third, if you are interested in identifying a classifier for your supplements based on amino acid composition, you may want to explore the Discriminant or the Principal Component platforms.
Of note, I don't think that the size of the data set affects the quality of the clustering per se: the more data you add, the more complex the "branching" you will get. At a minimum, you need 3 items to run a clustering analysis: it may not provide deep insight, but technically that is all you need.
Best,
TS
Thierry R. Sornasse
Ressel
Level III

Re: Clustering of amino acid data - how to choose an appropriate method

Thank you, this is what I wanted to achieve: to visualize similarities between the different types of protein based on their amino acid composition. (And to be honest, I just followed the lead of the original post by a JMP statistician. He is probably better at describing what the goal is, i.e. see the link in my original post.)

I am just surprised that each distance measure or method clusters the proteins in substantially different ways. For my own reassurance I worked through a clustering tutorial using this data set for 3 types of wheat grains. Since it was a tutorial I expected results consistent with the grain labels in the data set, but that was just wishful thinking. The Ward method seemed to do OK, but who am I to say that this is the most suitable method? Maybe it is? I'm uncertain.
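
When reference labels do exist, as with the three grain types in a tutorial data set, agreement between a clustering and the labels can be quantified rather than judged by eye. Below is a minimal sketch using the adjusted Rand index in pure Python; the labels are hypothetical and are not taken from the wheat data.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(true_labels, cluster_labels):
    """Agreement between two partitions, corrected for chance.

    1.0 means identical groupings (cluster names do not matter);
    values around 0 or below mean no better than random assignment.
    """
    n = len(true_labels)
    if n != len(cluster_labels):
        raise ValueError("label lists must have the same length")
    if n < 2:
        raise ValueError("need at least two observations")

    # Pairs of observations that share a group in both, or in each, partition
    pairs_both = sum(comb(c, 2) for c in Counter(zip(true_labels, cluster_labels)).values())
    pairs_true = sum(comb(c, 2) for c in Counter(true_labels).values())
    pairs_clus = sum(comb(c, 2) for c in Counter(cluster_labels).values())

    expected = pairs_true * pairs_clus / comb(n, 2)
    max_index = (pairs_true + pairs_clus) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions are one single group
    return (pairs_both - expected) / (max_index - expected)


# Hypothetical example: nine grains of three known types.
grain_type = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]
method_1 = [2, 2, 2, 0, 0, 0, 1, 1, 1]  # same grouping, different cluster names
method_2 = [0, 1, 2, 0, 1, 2, 0, 1, 2]  # splits every grain type apart

print("method 1:", adjusted_rand_index(grain_type, method_1))
print("method 2:", adjusted_rand_index(grain_type, method_2))
```

This only works when trustworthy labels are available, which is the exception for clustering; without them the choice of method remains a judgment call.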

 

I have explored the data by PCA as well, and the results are more or less consistent with only one of the clustering methods.

Ressel
Level III

Re: Clustering of amino acid data - how to choose an appropriate method

Thanks, as always very clarifying, Mark! I can now calibrate my expectations and hope our marketing team can do the same.