For JMP Genomics 3.2, several users asked us to implement GCRMA background correction in our CEL import process. As with the request to implement RMA for JMP Genomics 3.0, the main motivation was customers’ desire to process relatively large data sets without being limited by RAM. Several customers (and our own testers) reported that they could process only about 60 to 80 HU133 CEL files at a time using the GCRMA implemented in R/Bioconductor on a 32-bit Windows XP machine with 2 GB of RAM. We saw this limit both when running R/Bioconductor as a standalone application and when calling it through the Bioconductor Expresso wrapper in JMP Genomics.
The JMP Genomics GCRMA implementation overcomes this memory limitation and allows processing of hundreds or even a thousand CEL files at a time. To get there, developer Tzu-Ming Chu worked through the intermediate steps of the algorithm and implemented them in SAS code. Like other commercial implementations of GCRMA, ours uses version 2.1 of the algorithm, which relies on mismatch probes to estimate the background correction.
JMP Genomics GCRMA gives results that are highly correlated with those of the R/Bioconductor implementation. For the several data sets we tested, correlations between the log2 GCRMA-normalized intensities generated by the two implementations for the same arrays ranged from 0.995 to 0.999. Incidentally, Tzu-Ming found that several steps in the R implementation (e.g., line fits to a subset of the data for estimating non-specific binding affinity) were sensitive to which random subsets were chosen, so results can vary between runs of the algorithm. He also found that the currently released version of the gcrma package differs from the beta version under development, so users of this package should be aware that changes may be coming in a later version of the algorithm. Although GCRMA normalization was introduced in Wu et al. (2004), the Bioconductor implementation varies from version to version and may not match the original paper exactly.
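To make the comparison above concrete, here is a minimal sketch of how one might compute the correlation between the log2 intensities that two implementations produce for the same array. It is not the code we used for our tests; the function names `pearson` and `log2_correlation` are hypothetical, and the inputs are assumed to be positive normalized intensities listed in the same probe-set order.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences of numbers."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def log2_correlation(intensities_a, intensities_b):
    """Correlate two implementations' results for one array on the log2 scale.

    Assumption: both inputs hold positive intensities for the same
    probe sets, in the same order.
    """
    log_a = [math.log2(v) for v in intensities_a]
    log_b = [math.log2(v) for v in intensities_b]
    return pearson(log_a, log_b)
```

Running this once per array and looking at the range of the results gives the kind of summary quoted above. A high correlation says the two sets of values move together; it does not by itself show that they agree in absolute level.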
Tzu-Ming and the other developers always urge me to point out that normalization methods that include a quantile normalization step, such as RMA and GCRMA, are rather severe. In a Nature Biotechnology publication from MAQC I, Tong et al. (2006) showed clearly that Affymetrix external controls (probe sets with the prefix AFFX) tend to perform inconsistently after GCRMA and RMA normalization. Although the correlation among arrays does improve after these normalizations are applied, users risk over-normalizing their data. Our primary motivations for implementing these methods in JMP Genomics were demand from commercial customers and the popularity of these algorithms in the genomics analysis market. As I sometimes remind the development team, we can’t tell the market what it wants. For our young product, this is the best example yet of this phenomenon.
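For readers wondering why the developers call quantile normalization severe, the following is a minimal sketch of the quantile step on its own. It is an illustration under simplifying assumptions, not the JMP Genomics SAS code or the Bioconductor code: it leaves out background correction and probe-set summarization, breaks ties by position rather than averaging them, and the function name `quantile_normalize` is hypothetical.

```python
def quantile_normalize(arrays):
    """Force every array to share one intensity distribution.

    `arrays` is a list of equal-length lists, one list per chip.
    Each value is replaced by the mean, across chips, of the values
    holding the same rank. Ties are broken by position (a
    simplification for this sketch).
    """
    n = len(arrays[0])
    if any(len(a) != n for a in arrays):
        raise ValueError("all arrays must have the same length")

    # Reference distribution: mean of the i-th smallest value across chips.
    sorted_arrays = [sorted(a) for a in arrays]
    reference = [
        sum(s[i] for s in sorted_arrays) / len(arrays) for i in range(n)
    ]

    normalized = []
    for a in arrays:
        # Indices of this chip's values, ordered from smallest to largest.
        order = sorted(range(n), key=lambda i: a[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized
```

After this step, the sorted values of every array are identical. Only the ranks within each chip survive, which is why correlations among arrays improve and also why a probe set whose true level is the same on every chip, such as an external control, can end up with different values depending on how the rest of each chip behaves.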