pcarroll1
Level IV

Dimension Reduction on large data sets

I often deal with data that has thousands of terms and many thousands, sometimes hundreds of thousands, of observations. Many of the terms are usually correlated, so I would like to reduce the dimensionality. But if I try to use Principal Components to do this, the computation time is excessive or downright impractical. Is there another method that would work better to reduce the dimensions of the data in such a case?

peng_liu
Staff

Re: Dimension Reduction on large data sets

Some thoughts.

  1. Are the data from the same source, i.e., similarly structured from data set to data set? If so, finding a more efficient method is worthwhile. Otherwise each data set is a one-shot deal anyway, and letting the machine grind through it may still be faster than building something reusable.
  2. What is the next step after dimension reduction? What is the objective? You may not need dimension reduction (or a separate dimension-reduction step) at all if one methodology can address your objective without preprocessing to reduce dimension. Think of the advanced capabilities in Generalized Regression; a rough analogue is sketched after this list.
  3. PCA operates on the covariance matrix, so you have a large matrix to decompose and many rows to process just to compute it. Would a heuristic dimension-reduction method work, e.g., variable clustering? See the second sketch after this list.
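
To illustrate point 2: if the end goal is prediction, a penalized model can digest thousands of correlated terms directly, with no separate dimension-reduction pass. Below is a minimal sketch using scikit-learn's elastic net as a stand-in for JMP's Generalized Regression platform; the data, sizes, and settings are illustrative assumptions, not a recipe.

import numpy as np
from sklearn.linear_model import ElasticNetCV

# Stand-in data: 10,000 observations, 2,000 partly redundant terms;
# only the first 5 columns actually drive the response.
rng = np.random.default_rng(1)
X = rng.standard_normal((10_000, 2_000))
y = X[:, :5].sum(axis=1) + rng.standard_normal(10_000)

# Cross-validated elastic net: the L1 part of the penalty drops
# redundant terms, the L2 part keeps groups of correlated terms stable.
model = ElasticNetCV(l1_ratio=0.5, cv=5, n_jobs=-1).fit(X, y)
print("nonzero coefficients:", np.count_nonzero(model.coef_))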
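
And for point 3: when components really are needed, a randomized solver approximates the leading ones without a full decomposition of the covariance matrix, and a crude form of variable clustering can be built from the correlation matrix. Again a hedged sketch in Python rather than JSL; the shapes and the clustering threshold are assumptions to tune for your data.

import numpy as np
from sklearn.decomposition import PCA
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
X = rng.standard_normal((20_000, 500))   # stand-in for real data

# Randomized PCA: approximates the top components much faster than
# an exact eigendecomposition on wide data.
pca = PCA(n_components=50, svd_solver="randomized", random_state=0)
scores = pca.fit_transform(X)
print(scores.shape)                      # (20000, 50)

# Crude variable clustering: cluster the columns on correlation
# distance, then keep one representative column per cluster.
corr = np.corrcoef(X, rowvar=False)
dist = 1.0 - np.abs(corr)                # highly correlated -> close
np.fill_diagonal(dist, 0.0)
Z = linkage(squareform(dist, checks=False), method="average")
labels = fcluster(Z, t=0.3, criterion="distance")
reps = [np.flatnonzero(labels == k)[0] for k in np.unique(labels)]
print(len(reps), "representative columns kept of", X.shape[1])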