Solved: Re: simple random samples of columns

abmayfield · Aug 9, 2018 04:42 PM

Hello,

I know how to use JMP to create a simple random sample (SRS) of rows, but not of columns. I have 118,000 columns as part of a transcriptome, and I want to calculate the standard deviation across all genes (i.e., across rows). Doing so causes JMP to crash on a MacBook Air with 8 GB of RAM. Therefore, I am wondering if I can just pick groups of 5,000 or 10,000 genes and calculate a standard deviation across them. Then I can take the mean standard deviation of several SRSs (each column is a unique gene). If I transpose the table so that the genes are in the rows, the transposition step itself causes JMP to crash! Any ideas?

PS: I would attach the data file, but it's 80 MB!

Anderson B. Mayfield

uday_guntupalli · Aug 9, 2018 04:55 PM

@abmayfield,
If I understand what you are after, you want the ability to randomly select "n" columns in a data table? Try this:

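// Open a sample table (substitute your own data table here)
dt = Open( "$SAMPLE_DATA/Airline Delays.jmp" );

// Total number of columns in the table
nColumns = N Cols( dt );

// Number of columns to sample
nSampleCols = 2;

// Draw nSampleCols distinct column indices from 1 to nColumns
SelCols = As List( Random Index( nColumns, nSampleCols ) );

// Select the sampled columns in the table
dt << Select Columns( SelCols );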

I think this lets you randomly sample "n" columns from the data table. You can then use the random sample however you like. Please let me know if this is what you are after.

Best
Uday

abmayfield · Aug 9, 2018 05:57 PM

Thanks so much. That's exactly what I needed. Now I can do an SRS of columns OR stack the data. Just out of curiosity, do you know if this can be done WITHOUT scripting in JMP 14? Thanks,

Anderson

Anderson B. Mayfield

uday_guntupalli · Aug 9, 2018 06:01 PM

@abmayfield,
I am not aware of a way to do it interactively without using a script.

Best
Uday

abmayfield · Aug 9, 2018 05:02 PM

To answer my own question, 118,000 columns CAN be stacked, which allows me to look at the standard deviation across all genes by my treatment factors. I will try this random column selection script, too.

Anderson B. Mayfield

gzmorgan0 · Aug 10, 2018 04:59 PM

Congratulations on finding a solution!

If you are using JMP 14, you might want to try the following column formula in your table. Since you said you have limited memory and a crash is possible, save anything important first.

Std Dev( Current Data Table()[Row(), Index( c1, c2 )] )

where c1 is the number of the first column and c2 the number of the last, for example, 6 and 118005.

If there is no crash, I would remove the formula.

BTW, the stacked table is likely more efficient for summarizing. Just posing a possible alternative.
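
The two ideas in this thread (random column sampling and a row-wise standard deviation over a set of columns) could be combined in one script. The following is an untested sketch, not a confirmed solution. It assumes data table subscripting is available in your JMP version, that every gene column is numeric, and that the gene columns start at column 6 (the example value above). The names firstGeneCol, nDraws, sdSum, and meanSD are made up for illustration.

```jsl
Names Default To Here( 1 );
dt = Current Data Table();

firstGeneCol = 6;     // assumption: first gene column, as in the example above
nGeneCols = N Cols( dt ) - firstGeneCol + 1;
nSampleCols = 5000;   // genes per simple random sample
nDraws = 10;          // hypothetical number of samples to average over

// Running sum of the per-row standard deviations, one entry per table row
sdSum = J( N Rows( dt ), 1, 0 );

For( d = 1, d <= nDraws, d++,
	// Distinct random column numbers, shifted so they fall on gene columns
	cols = Transpose( Random Index( nGeneCols, nSampleCols ) ) + (firstGeneCol - 1);
	// All rows (0) of the sampled columns, as a numeric matrix
	m = dt[0, cols];
	// V Std works down columns, so transpose to get one value per table row
	sdSum = sdSum + Transpose( V Std( Transpose( m ) ) );
);

// Mean standard deviation across the nDraws samples, one value per row
meanSD = sdSum / nDraws;
Show( meanSD );
```

Each pass reads only the sampled columns into a matrix, so it should need far less memory than a formula that spans all 118,000 columns at once.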

abmayfield · Aug 10, 2018 05:08 PM

Yes, I did try to calculate a standard deviation across 118,000 columns, but it freezes JMP, and I usually give up after a few hours. I may try to let it run overnight tonight, though.

Also, I learned that when you stack 118,000 columns on top of each other for 12 samples, the resulting JMP table is over 2 GB and is unstable! This is weird because a 118,000 column x 12 row table is only 50 MB. I wonder why a 6 column x 1.5 million row table is so much larger?

Anderson B. Mayfield