pcarroll1
Level IV

Dimension Reduction on large data sets

I often deal with data that has thousands of terms and many thousands, sometimes hundreds of thousands, of observations. Many of the terms are usually correlated, so I would like to reduce the dimensionality. But if I try to use Principal Components to do this, the computation time is excessive or downright impractical. Is there another method that would work better to reduce the dimensions of the data in such a case?

peng_liu
Staff

Re: Dimension Reduction on large data sets

Some thoughts.

  1. Are the data from the same source, i.e., similarly structured from data set to data set? If so, finding a more efficient method is worthwhile. Otherwise each data set is a one-shot deal anyway, and letting the machine grind through it may still be faster than building something reusable.
  2. What is the next step after dimension reduction? What is the objective? You may not need dimension reduction (or a separate dimension-reduction step) at all if one methodology can address your objective without preprocessing to reduce dimension. Think of the advanced capabilities in Generalized Regression; a rough analogue is sketched after this list.
  3. PCA operates on the covariance matrix, so you have a large matrix to decompose and many rows to process just to compute it. Would a heuristic dimension-reduction method work, e.g., variable clustering? See the second sketch after this list.
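
To illustrate point 2: if the end goal is prediction, a penalized model can digest thousands of correlated terms directly, with no separate dimension-reduction pass. Below is a minimal sketch using scikit-learn's elastic net as a stand-in for JMP's Generalized Regression platform; the data, sizes, and settings are illustrative assumptions, not a recipe.

import numpy as np
from sklearn.linear_model import ElasticNetCV

# Stand-in data: 10,000 observations, 2,000 partly redundant terms;
# only the first 5 columns actually drive the response.
rng = np.random.default_rng(1)
X = rng.standard_normal((10_000, 2_000))
y = X[:, :5].sum(axis=1) + rng.standard_normal(10_000)

# Cross-validated elastic net: the L1 part of the penalty drops
# redundant terms, the L2 part keeps groups of correlated terms stable.
model = ElasticNetCV(l1_ratio=0.5, cv=5, n_jobs=-1).fit(X, y)
print("nonzero coefficients:", np.count_nonzero(model.coef_))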
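
And for point 3: when components really are needed, a randomized solver approximates the leading ones without a full decomposition of the covariance matrix, and a crude form of variable clustering can be built from the correlation matrix. Again a hedged sketch in Python rather than JSL; the shapes and the clustering threshold are assumptions to tune for your data.

import numpy as np
from sklearn.decomposition import PCA
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
X = rng.standard_normal((20_000, 500))   # stand-in for real data

# Randomized PCA: approximates the top components much faster than
# an exact eigendecomposition on wide data.
pca = PCA(n_components=50, svd_solver="randomized", random_state=0)
scores = pca.fit_transform(X)
print(scores.shape)                      # (20000, 50)

# Crude variable clustering: cluster the columns on correlation
# distance, then keep one representative column per cluster.
corr = np.corrcoef(X, rowvar=False)
dist = 1.0 - np.abs(corr)                # highly correlated -> close
np.fill_diagonal(dist, 0.0)
Z = linkage(squareform(dist, checks=False), method="average")
labels = fcluster(Z, t=0.3, criterion="distance")
reps = [np.flatnonzero(labels == k)[0] for k in np.unique(labels)]
print(len(reps), "representative columns kept of", X.shape[1])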