Craige_Hales

Staff

Joined:

Mar 21, 2013

Load Compressed Data

I just had a request for a way to move data from an external program into JMP. Generally I'd recommend using CSV files for this, but if compression is needed, something else may be called for. Here are a couple of proof-of-concept scripts (you'll need to make sure they do what you need). Since I don't have an external program, each script is in two parts: create a zip file from a table, then recreate the table from the zip file. This assumes you control how the external process packages and presents the data to JMP.

Key points: zip files can hold binary (first example) or printable (second example) data. JSL's matrixToBlob and BlobToMatrix are fast. Don't loop over rows; if you must loop, loop over a few columns rather than over every row of a large table. JMP's zip file API appends new members, possibly changing the actual member name to avoid colliding with earlier members; notice the deleteFile() in the second example and the actualname returned by write() in the first. Clearing the zip (za=0) is NOT required; it is a hint that from that point on the file is outside of JMP (on disk). Converting numbers to printable and back is slow, and can be lossy as well if the formatted values don't have enough digits.
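To make the "lossy" point concrete, here is a small Python sketch (Python only because it is easy to run outside JMP; it says nothing about how JMP itself formats values). It compares a value printed with 12 significant digits against one printed with enough digits to round-trip an 8-byte float. The variable names are just for illustration.

```python
# A float64 needs up to 17 significant digits to survive a text round trip.
x = 1.0 / 3.0

short_text = f"{x:.12g}"   # 12 significant digits
full_text = repr(x)        # shortest text that converts back to the same float64

short_back = float(short_text)
full_back = float(full_text)

print(short_text, short_back == x)
print(full_text, full_back == x)
```

The 12-digit text reads back as a slightly different number, while the full-precision text reads back exactly. The binary approach in Example 1 avoids the question entirely.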


The first script puts raw data in a zip file member in "little-endian" format, a row at a time.  It is fast because it doesn't convert back and forth to printable.

The second script puts formatted data for a column into a zip file member, one column per member.  Numeric data goes in a matrix, character in a list.

You can combine ideas from both scripts.


Example 1

This example assumes binary numeric data.  It should be really fast.  Character data won’t work like this…

// (this code is untested!) make sample numeric data

dt=New Table( "people",  Add Rows( 1e7 ),

  New Column( "fred", Numeric, Continuous, Format( "Best", 12 ), Formula( Random Normal() )  ),

  New Column( "ralph", Numeric, Continuous, Format( "Best", 12 ), Formula( Random Normal() )  ),

  New Column( "george", Numeric, Continuous, Format( "Best", 12 ), Formula( Random Normal() )  )

);

// make sample zip file with binary data

datamat = (dt:fred<<getasmatrix) || (dt:ralph<<getasmatrix) || (dt:george<<getasmatrix);

blobmat = matrixtoblob(datamat,"float",8,"little");

za = open("$temp/deleteme2.zip", "zip");

actualname = za<<write( "data", blobmat );

// clear zip

za = 0;

// re-open zip

za = open("$temp/deleteme2.zip", "zip");

show(za<<dir); // check members

start = tickseconds();

blobextract = za<<read(actualname,format(blob));

dataextract = blobtomatrix( blobextract, "float", 8, "little", 3 /*columns*/);

dtextract = newtable();

dtextract << setmatrix(dataextract);

stop=tickseconds();

show(stop-start);

// log output: stop - start = 2.01666666666279; decompressed+loaded 10,000,000 rows x 3 columns in 2 to 3 seconds
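Since the script above only simulates the external program from inside JMP, here is a rough sketch of what the producing side might look like in Python, using only the standard library. It has not been tested against JMP. The function names write_binary_member and read_binary_member and the temp path are made up for illustration; the member name "data" and the layout (8-byte little-endian floats, stored row by row) mirror what the JSL above writes and reads.

```python
import os
import struct
import tempfile
import zipfile

def write_binary_member(zip_path, rows, member="data"):
    """Pack rows of numbers as little-endian float64, row by row, into one zip member."""
    ncols = len(rows[0])
    payload = bytearray()
    for row in rows:
        if len(row) != ncols:
            raise ValueError("all rows must have the same number of columns")
        payload += struct.pack(f"<{ncols}d", *row)
    # Mode "w" starts a fresh archive, so no stale member with the same name lingers.
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(member, bytes(payload))
    return ncols

def read_binary_member(zip_path, ncols, member="data"):
    """Read the member back into a list of rows (a stand-in for the JMP side)."""
    with zipfile.ZipFile(zip_path) as zf:
        raw = zf.read(member)
    flat = struct.unpack(f"<{len(raw) // 8}d", raw)
    return [list(flat[i:i + ncols]) for i in range(0, len(flat), ncols)]

rows = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
zip_path = os.path.join(tempfile.mkdtemp(), "deleteme2.zip")
ncols = write_binary_member(zip_path, rows)
roundtrip = read_binary_member(zip_path, ncols)
print(roundtrip == rows)
```

The reader has to know the column count (3 in the JSL above) because the blob itself carries no shape information. For millions of rows, a per-row Python loop would be slow; a NumPy array converted with astype("<f8").tobytes() would be the more natural choice.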


Example 2

Here’s another variation, slower but flexible (handles numeric and character):

// (this code is untested!) make sample numeric and character data

dt=New Table( "people",  Add Rows( 1e6 ),

  New Column( "fred", Numeric, Continuous, Format( "Best", 12 ), Formula( Random Normal() )  ),

  new column("fred char", character, formula(char(randominteger(1000,99999)))),

  New Column( "ralph", Numeric, Continuous, Format( "Best", 12 ), Formula( Random Normal() )  ),

  new column("ralph char", character, formula(char(randominteger(1000,99999)))),

  New Column( "george", Numeric, Continuous, Format( "Best", 12 ), Formula( Random Normal() )  ),

  new column("george char", character, formula(char(randominteger(1000,99999))))

);

dt<<runformulas();

// make sample zip file with printable data, one member per column

try(deletefile("$temp/deleteme.zip"));

za = open("$temp/deleteme.zip", "zip");

collist = dt<<getcolumnreference;

for(i=1,i<=nitems(collist),i++,

    data = collist[i]<<getasmatrix;

    name = collist[i]<<getname;

    za<<write(name,char(data));

);

// clear zip

za = 0;

// re-open zip

start = tickseconds();

za = open("$temp/deleteme.zip", "zip");

colnames = za<<dir; // member names double as column names

dtextract = newtable("extracted");

for(i=1,i<=nitems(colnames),i++,

    txt = za<<read( colnames[i]);

    dtextract<<newcolumn( colnames[i], values(parse(txt)));

);

stop=tickseconds();

show(stop-start);

// log output: stop - start = 12.3500000000349; decompressed+loaded 1,000,000 x 6 in 12 to 14 seconds
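For the printable variation, an external program would need to write each column as text that parse() can turn back into a matrix or a list. The Python sketch below is one guess at how that might look, not something tested against JMP. It assumes a numeric column is written as a JSL matrix literal with commas between rows and "." for missing values, that a character column is written as a JSL list of quoted strings, and that an embedded double quote is escaped as \!". The helper names are hypothetical.

```python
import math
import os
import tempfile
import zipfile

def jsl_matrix_text(values):
    """Column vector as JSL matrix-literal text, e.g. [1.5, -2.0, .] (assumes finite or NaN)."""
    parts = ["." if math.isnan(v) else repr(float(v)) for v in values]
    return "[" + ", ".join(parts) + "]"

def jsl_list_text(strings):
    """List of quoted strings as JSL list-literal text, e.g. {"a", "b"}."""
    # Assumption: only embedded double quotes need escaping, written as \!"
    quoted = ['"' + s.replace('"', '\\!"') + '"' for s in strings]
    return "{" + ", ".join(quoted) + "}"

def write_text_members(zip_path, columns):
    """Write one zip member per column; the member name is the column name."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, values in columns.items():
            if all(isinstance(v, (int, float)) for v in values):
                text = jsl_matrix_text(values)
            else:
                text = jsl_list_text([str(v) for v in values])
            zf.writestr(name, text)

columns = {
    "fred": [1.5, -2.0, float("nan")],
    "fred char": ["1234", "98765", "4321"],
}
zip_path = os.path.join(tempfile.mkdtemp(), "deleteme.zip")
write_text_members(zip_path, columns)

with zipfile.ZipFile(zip_path) as zf:
    members = {name: zf.read(name).decode("utf-8") for name in zf.namelist()}
print(members)
```

Using repr() for the numbers keeps enough digits to round-trip an 8-byte float, which sidesteps the lossy-formatting concern from the key points at the cost of longer text.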


