Re: Two-way clustering vs. cluster variables

abmayfield · Aug 1, 2023 09:37 AM

I got a good tip from Scott Allen a few weeks ago for using variable clustering to reduce dataset complexity. I have a very wide data table: 16 rows and 35,000 or more columns. Strangely, if I use "Fast Ward" under hierarchical clustering and choose "two-way" under the red triangle, two-way clustering (by row and by column) finishes within seconds. However, I can't find the option to identify the "most representative column" within each cluster, which is available if you go directly to "cluster variables." If I add all 35,000 columns to cluster variables on my Mac with 64 GB of RAM, it won't even get close to completing (maybe 5-10% after three days!). My question is: is it possible to extract data from the Fast Ward output from hierarchical clustering that would give me the most representative column? I actually think there may be a way (e.g., regress the raw data against the mean of the respective cluster). If not, I may need to put in a Wish List request that a "fast" algorithm be added to the cluster variables platform!
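For anyone who wants to try the "regress against the cluster mean" idea outside JMP, here is a minimal sketch in Python with NumPy. It assumes the column cluster assignments have already been saved from the hierarchical clustering report and exported as one label per column; the function name and the demo data are hypothetical. For a simple regression of one column on the cluster mean, R² equals the squared Pearson correlation, so that is what the sketch ranks on.

```python
import numpy as np

def most_representative_by_mean(data, labels):
    """For each column cluster, return the index of the column whose values
    have the highest squared correlation (R^2) with the cluster mean profile.

    data   : rows x columns array (e.g., 16 samples x many analytes)
    labels : one cluster label per column (assumed exported from the clustering)
    """
    data = np.asarray(data, dtype=float)
    labels = np.asarray(labels)
    n_rows = data.shape[0]
    representatives = {}
    for cluster_id in np.unique(labels):
        cols = np.flatnonzero(labels == cluster_id)
        block = data[:, cols]
        # Standardize each column so high-variance columns do not dominate the mean.
        std = block.std(axis=0, ddof=1)
        std[std == 0] = 1.0  # constant columns become all zeros after centering
        z = (block - block.mean(axis=0)) / std
        cluster_mean = z.mean(axis=1)
        mean_sd = cluster_mean.std(ddof=1)
        if mean_sd == 0:
            # Degenerate cluster (no variation at all): fall back to the first column.
            representatives[cluster_id.item()] = int(cols[0])
            continue
        # Pearson correlation of every column with the cluster mean.
        r = z.T @ cluster_mean / ((n_rows - 1) * mean_sd)
        representatives[cluster_id.item()] = int(cols[np.argmax(r ** 2)])
    return representatives

# Hypothetical demo: three noisy copies of one signal plus one unrelated column.
rng = np.random.default_rng(0)
signal = rng.normal(size=16)
noisy = [signal + rng.normal(scale=s, size=16) for s in (0.1, 0.5, 1.0)]
demo = np.column_stack(noisy + [rng.normal(size=16)])
print(most_representative_by_mean(demo, [1, 1, 1, 2]))
```

The returned dictionary maps each cluster label to the position of its representative column in the original table.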

Anderson B. Mayfield

P_Bartell · Aug 1, 2023 11:12 AM

Wish Listing something like this is always a good idea. But until then... Not sure this would work, it's kind of kludgy, and I have no clue if the math behind what I'm suggesting is valid, but could you subset each cluster and then do a PCA on each one? Between the eigenvalue Pareto plot, the score plot, and the loading plot, you can kind of backdoor your way to the most representative column. Love to hear what others on here think.
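To make the per-cluster PCA idea concrete, here is a rough sketch in Python with NumPy (not JMP; the function name and demo data are hypothetical). It runs a PCA on correlations for the columns of a single cluster and returns the two things you would read off the eigenvalue Pareto and loading plots: the share of variance carried by each component, and each column's loading on the first component. The column with the largest absolute loading is one candidate for "most representative."

```python
import numpy as np

def pca_summary(block):
    """PCA on the correlation scale for the columns of ONE cluster.

    Returns (explained, loadings_pc1):
      explained    : proportion of variance per component (Pareto plot values)
      loadings_pc1 : correlation of each column with the first component
    """
    block = np.asarray(block, dtype=float)
    std = block.std(axis=0, ddof=1)
    if np.any(std == 0):
        raise ValueError("drop constant columns before running PCA")
    z = (block - block.mean(axis=0)) / std
    # Economy SVD is cheap here because there are only a few rows (e.g., 16).
    _, s, vt = np.linalg.svd(z, full_matrices=False)
    eigenvalues = s ** 2 / (z.shape[0] - 1)
    explained = eigenvalues / eigenvalues.sum()
    # Loading = eigenvector * sqrt(eigenvalue); the sign is arbitrary.
    loadings_pc1 = vt[0] * np.sqrt(eigenvalues[0])
    return explained, loadings_pc1

# Hypothetical demo: one cluster of four correlated columns, 16 rows.
rng = np.random.default_rng(1)
signal = rng.normal(size=16)
cluster = np.column_stack(
    [signal + rng.normal(scale=s, size=16) for s in (0.2, 0.4, 0.6, 0.8)]
)
explained, loadings = pca_summary(cluster)
print("variance explained by PC1:", explained[0])
print("candidate representative column:", int(np.argmax(np.abs(loadings))))
```

If the first component explains most of the variance in a cluster, a single representative column is easier to justify; if not, the cluster may need to be split further.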

abmayfield · Aug 1, 2023 01:01 PM

Thanks. I think you're on to something, and it's worth mentioning that there appears to be a computational "break" somewhere around 20-25,000 columns. By that I mean: with 25,000 analytes (give or take), it will run in 24-36 hours (which is acceptable to me), but much over this number and it hangs. So I could likely use the predictor or response screen (or even the column viewer) to weed out analytes with little or no variation, bring my ~35,000 columns down to 20-25,000, and then run it with those "variable" analytes. But I do think even just using PCA with ALL analytes could give me a sense of the effective redundancy, especially since so many of these proteins will be correlated with one another!
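The pre-filtering step could look something like the sketch below, again in Python with NumPy rather than JMP. The function name and the 25,000 cap are only illustrative (the cap is taken from the rough limit mentioned above). It ranks on raw variance, which assumes the analytes are on comparable scales, for example after a log transform; a coefficient of variation might be the better ranking otherwise.

```python
import numpy as np

def keep_most_variable_columns(data, max_columns=25000):
    """Return the indices of the most variable columns, in original order.

    Constant columns are always dropped; of the rest, at most `max_columns`
    with the largest sample variance are kept.
    """
    data = np.asarray(data, dtype=float)
    variances = data.var(axis=0, ddof=1)
    nonconstant = np.flatnonzero(variances > 0)
    # Sort the non-constant columns from most to least variable.
    ranked = nonconstant[np.argsort(variances[nonconstant])[::-1]]
    return np.sort(ranked[:max_columns])

# Hypothetical demo: 16 rows, 10 columns, two of them constant.
rng = np.random.default_rng(2)
demo = rng.normal(size=(16, 10))
demo[:, 3] = 1.0
demo[:, 7] = 0.0
kept = keep_most_variable_columns(demo, max_columns=5)
print(kept)
reduced = demo[:, kept]  # table to pass on to variable clustering
```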

Anderson B. Mayfield

Dan_Obermiller · Aug 1, 2023 02:18 PM

I think @P_Bartell's suggestion is good. The variable clustering approach uses PCA "behind the scenes" and uses the eigenvalues to help determine the most representative variable.

The two-way clustering does not have eigenvalues or any other criteria to help determine the most representative variable. So, using PCA after the two-way clustering would be a way to mimic what is happening with variable clustering.
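As a rough illustration of that mimicry, the Python/NumPy sketch below takes column cluster labels (assumed to be exported from the two-way clustering) and, for each cluster, picks the column with the highest squared correlation with the cluster's first principal component. This is an approximation in the spirit of variable clustering, not JMP's actual algorithm, and the function name and demo data are hypothetical.

```python
import numpy as np

def representatives_by_component(data, labels):
    """For each column cluster, return (column index, R^2) for the column most
    correlated with the cluster's first principal component."""
    data = np.asarray(data, dtype=float)
    labels = np.asarray(labels)
    n_rows = data.shape[0]
    result = {}
    for cluster_id in np.unique(labels):
        cols = np.flatnonzero(labels == cluster_id)
        block = data[:, cols]
        std = block.std(axis=0, ddof=1)
        varying = std > 0
        if not varying.any():
            continue  # cluster holds only constant columns; nothing to rank
        z = (block[:, varying] - block[:, varying].mean(axis=0)) / std[varying]
        u, s, _ = np.linalg.svd(z, full_matrices=False)
        score = u[:, 0] * s[0]  # first principal component scores
        # Correlation of each standardized column with the component scores.
        r = z.T @ score / ((n_rows - 1) * score.std(ddof=1))
        r2 = r ** 2
        best = cols[varying][np.argmax(r2)]
        result[cluster_id.item()] = (int(best), float(r2.max()))
    return result

# Hypothetical demo: two clusters of three columns each, 16 rows.
rng = np.random.default_rng(3)
s1, s2 = rng.normal(size=16), rng.normal(size=16)
demo = np.column_stack(
    [s1 + rng.normal(scale=k, size=16) for k in (0.2, 0.5, 0.9)]
    + [s2 + rng.normal(scale=k, size=16) for k in (0.9, 0.2, 0.5)]
)
print(representatives_by_component(demo, [1, 1, 1, 2, 2, 2]))
```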

Dan Obermiller

abmayfield · Aug 2, 2023 11:48 AM

Great. Thank you both. I am going to try this out here in a bit, but it sounds like it should give me exactly what I need.

Anderson B. Mayfield