Support for Larger Genotype Data Sets in JMP Genomics 3.2
Jul 23, 2008 8:27 AM
SNP data sets keep growing, driven by the introduction of 1 million SNP chips from Affy and Illumina and by NextGen sequencing. While JMP Genomics 3.1 supported analysis of SNP data sets as large as 1 million SNPs x 4,000 individuals, JMP Genomics 3.2 includes a number of processes that support data sets as large as 1 million SNPs x 10,000 individuals: Marker Properties, Missing Genotype by Trait Summary, Recode Genotypes, Case-Control Association, SNP-Trait Association and PCA.
Because of numerous code improvements in JMP Genomics genetics processes, performance for large data sets has also improved dramatically. The main limitation for working with these data sets is hard drive space rather than RAM, since they are processed in SAS code, and SAS reads and writes data on disk instead of holding it all in memory. A simulated 1 million SNP x 10,000 sample data set is about 80 GB and requires about twice as much free space on the hard drive to analyze. Users working with such data sets commonly use 1 terabyte or larger hard drives.
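For a rough sense of how these numbers scale, here is a minimal back-of-the-envelope sketch in Python. It is not part of JMP Genomics. The 8 bytes per genotype is simply what the 80 GB figure above works out to (80 GB / 10 billion genotypes, taking 1 GB as 10^9 bytes), and the workspace factor of 2 reflects one reading of "about twice as much free space." The function name and parameters are hypothetical, and real file sizes will depend on how the data are stored.

```python
def estimate_storage_gb(n_snps, n_samples, bytes_per_genotype=8, workspace_factor=2):
    """Rough disk estimate for a SNP x sample genotype data set.

    bytes_per_genotype=8 is inferred from the 80 GB figure quoted in the
    post, not a documented JMP Genomics constant (assumption).
    workspace_factor=2 is one reading of "about twice as much free space"
    (assumption).
    Returns (data set size in GB, free space needed in GB), with 1 GB = 1e9 bytes.
    """
    data_bytes = n_snps * n_samples * bytes_per_genotype
    data_gb = data_bytes / 1e9
    free_gb = data_gb * workspace_factor
    return data_gb, free_gb


if __name__ == "__main__":
    # The two size limits mentioned in the post: 3.1 (4,000 individuals)
    # and 3.2 (10,000 individuals), each with 1 million SNPs.
    for n_samples in (4_000, 10_000):
        data_gb, free_gb = estimate_storage_gb(1_000_000, n_samples)
        print(f"1M SNPs x {n_samples:,} samples: "
              f"~{data_gb:.0f} GB data, ~{free_gb:.0f} GB free space")
```

Under these assumptions the 10,000-sample case comes out to about 80 GB of data and about 160 GB of working space, which is why a 1 terabyte drive leaves comfortable headroom.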
To test the limits on data set size more efficiently, all JMP Genomics team members recently upgraded their testing and development Windows XP PC workstations to Dell 755s with large dual hard drives, dual-core processors and 3-4 GB RAM. Even so, delivery of a large genetics data set to be analyzed for MAQC sent us searching for extra storage for one of our main testing servers. Although we’re just getting started analyzing that data set, it’s given us a new appreciation for the challenges of storing and analyzing such large amounts of raw data. We are all excited to work with a real, large data set. Simulated sets may help us test the limits of our code, but there is nothing like digging into real data.