Subscribe Bookmark


May 27, 2014

Principal Variance Components Analysis

Apart from truly novel and innovative breakthroughs, a basic recipe for scientific research is:

  1. Combine two or more established methods that have evolved independently, leveraging particular strengths of each.
  2. Give the new combination a clever name.
  3. Publish the heck out of it.
  4. In this blog post, I’d like to highlight Step 1 for a technique we’ve been using for a few years now called Principal Variance Components Analysis (PVCA). PVCA integrates two methods with rich histories: principal components analysis (PCA) and variance components analysis (VCA). PCA finds low dimensional linear combinations of data with maximal variability, whereas VCA attributes and partitions variability into known sources via a classical random effects model.

    PCA is arguably the most popular technique for high-dimensional data. For example, consider the normal baseline gene expression data from Boedigheimer et al. (2008), measuring around 30,000 genes on 483 rat samples. If we arrange the data in tall form, with genes as rows and samples as columns, then we have a 30,000 x 483 matrix. A standard PCA is computed by first centering the columns at their mean value and then computing the first few eigenvectors of the matrix. We thus obtain eigenscores for both the rows and columns, and if we focus on the first two sets of scores for the samples (a.k.a. loadings), we can plot them against one another to see which samples exhibit similar expression profiles in the directions of maximal separation.

    A key difficulty with PCA is attaching an interpretation to each component. Factor analysis tries to address this via rotations, but these are somewhat subjective and can be difficult to manage. A simpler way to interpret principal components is to color and/or mark a PCA plot by known sample attributes to see if any of them align with the scores or explain clusters. In many cases this works, but as concomitant data become more abundant, the risk of missing something becomes much higher. In the example above, 35 different sample attributes (e.g., tissue type, gender, strain, diet) are available. To make matters more difficult, the data are observational (not a designed experiment), and many of the attributes are significantly confounded with each other. With some painstaking and tedious work, several key factors can be identified visually, but a comprehensive, quantitative screen is lacking.

    This is where the strength of VCA comes into play. Using the principal component scores as responses, we can set up a mixed model with just an intercept as the sole fixed effect and all known sources of variability as random effects. This is purely for purposes of partitioning variability, and so even classic fixed effects like gender and tissue are still considered to be random in order to put all effects on an equal footing. The model fit via REML lets the effects compete with each other to explain the total variability in the data. Partitioning total variability in this fashion has an advantage over classical Type 1 ANOVA partitioning because it is effectively independent of the ordering of effects (except in cases of severe confounding). The technique can be fully automated and provides a nice quantitative assignment to all known sources of variability for each principal component, and the residual variance component accommodates all unspecified effects. A summary of results across all principal components can be constructed as a weighted average of the individual estimates, using eigenvalues as weights, and displayed as a bar or pie chart.

    In another example, we used PVCA to compare Affymetrix microarray data with Illumina next-gen RNA-Seq data in a schizophrenia study (Mudge et al. 2008). PVCA provides a quantitative breakdown of several key sources of variability and reveals that the next-gen data does a better job of distinguishing schizophrenic status (Diagnosis in the plot below) for this study.

    In a third example, Li et al. (2009) apply PVCA to batch correction in expression data and provide a detailed description of the approach.

    PVCA itself is a clear example of Step 1 in the research recipe above. Regarding Step 2, I’m not sure if “PVCA” is very clever, but it does at least capture the essence of the approach and makes double use of “component.” (Remember also to avoid the longstanding mistake of using “principle” instead of “principal”!)

    As for Step 3, well, this blog post itself is a genuflection in that direction. PVCA is available in JMP Genomics under the Quality Control > Correlation and Principal Components process. You can also do it by hand in JMP by running Prinicipal Components, merging the resulting sample loadings with known sample attributes, then running Fit Model with the loadings as Y variables and the sample attributes as random effects. Give it a try!


    Boedigheimer, M.J., Wolfinger, R.D., Bass, M.B., Bushel, P.B., Chou, J.W, Cooper, M., Corton, J.C., Fostel, J., Hester, S., Lee, J.S., Liu, F., Qian, H.-R., Quackenbush, J., Petit, S., Thompson, K.L., (2008), Sources of Variation in Baseline Gene Expression Levels from Toxicogenomic Study Control Animals across Multiple Laboratories. BMC Genomics, 9:285.

    Li, J., Bushel, P., Chu, T.-M., and Wolfinger, R.D. (2009) Principal Variance Components Analysis: Estimating Batch Effects in Microarray Gene Expression Data, Batch Effects and Noise in Microarray Experiments: Sources and Solutions, ed. A. Scherer, John Wiley & Sons.

    Mudge, J., Miller, N.A., Khrebtukova, I., Lindquist, I.E., May, G.D., Huntley, J.J., Luo, S., Zhang, L., van Velkinburgh, J.C., Farmer, A.D., Lewis, S., Beavis, W.D., Schilkey, F. D., Virk, S.M., Black, C.F., Myers, M.K., Madar, L.C. Langley, R.J., Utsey, J.P., Kim, R.W., Roberts, R.C., Khalsa, S.K., Garcia, M., Ambriz-Griffith, V., Harlan, R., Czika, W., Martin, S., Wolfinger, R.D., Perrone-Bizzozero, N.I., Schroth, G. P. Kingsmore, S. F. (2008) Genomic Convergence Analysis of Schizophrenia: mRNA Sequencing Reveals Altered Synaptic Vesicular Transport in Post-Mortem Cerebellum PLos One 2008 PMID: 18985160

    Community Member

    JMH wrote:

    Thanks, Russ. The PLS option is intriguing, but MI fits in well with some other analyses that we want to try with similar sets of data.I tried PROC MI with MCMC replacement of missing values in several iterations, and I got good results.

    Community Member

    Russ Wolfinger wrote:

    Thanks JMH. PROC MI sounds okay, or if you use PROC PLS to compute the PCA (specify the same variables as both Y and X), then it has the MISSING=AVG or MISSING=EM options.

    Community Member

    JMH wrote:

    This is fantastic! I was looking for a way to analyze multivariate data that were collected within a designed experiment in agriculture, and I was specifically looking for a way to include fixed effects (e.g. hybrids) and random effects (e.g. locations) in the same analyses to quantify the relative variation from various sources. It's nice to have the validation to treat traditionally fixed effects as random for this purpose.

    One question: For missing values, is it better to delete variables with missing values from the PVCA procedure or to estimate missing values with, say, PROC MI and to do the PVCA analyses on the completed data set? Thanks!

    Community Member

    Russ Wolfinger wrote:

    Sean, I believe each of the referenced papers have links to their data, although this method works with any multivariate data set for which you have factors grouping the variables. Also, JMP Genomics ships with a couple of simple examples. Regards, Russ

    Community Member

    sean s wrote:

    Where can I get the data to try this analysis out myself?