by John Sall john.sall, Co-Founder and Executive Vice President, SAS
With multivariate methods, you have to do things differently when you have wide (many thousands of columns) data.
The Achilles heel for the traditional approaches for wide data is that they start with a covariance or correlation matrix, which grows with the square of the number of columns. Genomics research, for example, tends to have wide data:
In one genomics study, there are 10,787 gene expression columns, though there are only 230 rows. That covariance matrix has more than 58 million elements and takes 465MB of memory.
Another study we work with has 49,041 genes on 249 individuals, which results in a covariance matrix that uses 9.6GB of memory.
Yet another study has 1,824,184 gene columns, yielding a covariance matrix of 13.3TB. Computers just don’t have that much memory – in fact, they don’t have that much disk space.
Even if you have the memory for these wide problems, it would take a very long time to calculate the covariance matrix, on the order of n*p2, where n is the number of rows, and p is the number of columns. Furthermore, it gets even worse when you have to invert the covariance matrix, which would cost on the order of p3. In addition, the covariance matrix is very singular, since there are far fewer rows than columns in the data.
So you would think that multivariate methods are impractical or impossible for these situations. Fortunately, there is another way, one that makes wide problems not only practical, but reasonably fast while yielding the same answers that you would get with a huge covariance matrix.
The number of effective dimensions in the highly multivariate space is on the order of the number of rows, since that is far smaller than the number of columns. All the variation in that wide matrix can be represented by the principal components, those associated with positive variance, that is, non-zero eigenvalues. There is another way to get the principal components using the singular value decomposition of the (standardized) data. Now that you are in principal component space, the dimension of these principal components is small; in fact, all the calculations simplify because the covariance matrix of principal components is diagonal, which is trivial to invert.
Suppose that X is the n by p matrix of standardized data, with p much greater than n, so the covariance matrix is the p by p matrix calculation X’X/n.
The SVD expresses any X as a transformation that is a rotation, a scaling, and then another rotation:
X = U diag(M) V’
Where U is and n by m matrix of orthogonal columns, M is a vector of m positive singular values, and V is a p by m matrix of orthogonal columns. It turns out that if you had taken the eigenvalue decomposition of the covariance matrix, that the eigenvalues of it would be M2 and the eigenvectors would be V, so since only need M and V for principal components, then you can do the SVD of the data, instead of the eigenvalues of the covariance matrix, and if p is in the thousands and n is in the hundreds, this is much faster. The principal components are just X*V.
For example, on a problem with 10,787 columns and 230 rows, the principal components through the correlation matrix takes 81 minutes, but with the SVD takes only 7.5 seconds, more than 600 times faster. I tried it on a genomics data set with 1,824,184 columns across 249 rows. I was able to do the principal components of that data using JMP 12 on my 6-core Mac Pro in only 94 minutes. Doing it the other way would clearly have been impossible.
With discriminant analysis, the main task is to calculate the distance from each point to the (multivariate) mean of each group. However, that distance is not the simple Euclidean distance, but a multivariate form of the distance -- the Mahalanobis distance, which uses the inverse covariance matrix.
But think about the standardized data – the principal components are the same data, but rotated through various angles. The difference is that the original data may have 10,000 columns, but the principal components only have a few hundred, in the wide data case. So if we want to measure distances between points or between a point and a multivariate mean (centroid), then we can do it in far fewer dimensions. Not only that, but the principal coordinates are uncorrelated, which means that the matrix in the middle of the Mahalanobis distance calculations is diagonal, so the distance is just a sum of squares of individual coordinate distances.
We take the principal components of the mean-centered data by group. Then we add back the mean to this new reduced-dimension data. The transformed mean is just the principal components transformation of the original group means.
There is a set of breast cancer genome expression data that is a benchmark used by the Microarray Quality Consortium with 230 rows and 10,787 gene expressions. A discriminant analysis using the wide linear method takes only 5.6 seconds (on Mac Pro) and produces a cross-validation misclassification rate of only 13.7% on the hardest-to-classify category, ER_status (Figure 1). This misclassification rate is close to the best achieved for this problem, given that the other methods are expensive.
Figure 1 Wide Discriminant analysis in the Principal Components platform
So far we have implemented SVD-based highly multivariate options in Discriminant, Principal Components, Hierarchical Cluster, and a form of missing value imputation. We are experimenting with using it further and also in exploring other similar methods useful in highly-multivariate analyses. Much of the hype around “Big Data” is about large numbers of rows, but large numbers of columns are also increasingly important.