nikles
Level VI

Help on opening hdf5 files

Hi,

 

I'm trying to use JMP to open an HDF5 file using a script.  I can successfully open it using the Open() command:

blobpath = "/Documents/myblob";
dtblob_lis = Open(blobpath);

but I get a pop-up asking which data set I'd like to open: 

[Screenshot: a JMP pop-up listing the data sets in the file and asking which to open]

My question: is there an option in the Open() command I could use to specify which data set within the blob I wish to open, and avoid the pop-up?  Thanks.

txnelson
Super User

Re: Help on opening hdf5 files

Brian,

Where is this capability documented?

Jim

Re: Help on opening hdf5 files

Hi Jim,

 

There is an example of this at https://www.jmp.com/support/help/en/18.1/index.shtml#page/jmp/import-data.shtml under the section "Import HDF5 Files".  It's admittedly a bit challenging to find, but it does show up in Google.
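
Based on that page, you can pass a list of data set names as the second argument to Open(), which skips the pop-up.  A minimal sketch (the data set path below is a placeholder; substitute the names shown in your pop-up):

// Open only the named data sets from the HDF5 file, with no selection dialog.
// "/Group1/MyDataset" is a placeholder path, not a real name from your file.
blobpath = "/Documents/myblob";
dtblob_lis = Open( blobpath, {"/Group1/MyDataset"} );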

 

Brian

nikles
Level VI

Re: Help on opening hdf5 files

This is exactly the solution I was looking for.  Works.  Thanks.

jljmp
Level I

Re: Help on opening hdf5 files

I was looking for faster & more efficient ways of transferring data between Python and JMP 17, because both pandas and JMP are quite slow at reading/writing CSV, and even the GetTable() function seems to use CSV under the hood, at least in JMP 17. Pandas' to_hdf function doesn't work because instead of storing string data in an HDF-compliant dataset, it uses some custom shenanigans to store a NumPy array as a string, which is not only space-inefficient but locks you into reading it with Python.

 

I wrote some proof-of-concept Python code that saves a table to .h5 using h5py, and JSL code that opens the file and reconstructs the original table from the separate datasets.

 

import pandas as pd
import numpy as np
from pathlib import Path
import h5py
from time import perf_counter


last_time = perf_counter()
compression = "gzip"


def log_time(msg):
    global last_time
    current_time = perf_counter()
    elapsed_time = current_time - last_time
    print(f"{elapsed_time:1.3f} s: {msg}")
    last_time = current_time

# read dataframe (just for the development example)
df = pd.read_csv(Path(__file__).parent / "my_table.csv")
log_time("Read df")

string_columns = df.select_dtypes(include=["object"]).columns
# min_itemsize = {col: df[col].str.len().max() for col in string_columns}
max_string_length = np.max([df[col].str.len().max() for col in string_columns]) if len(string_columns) > 0 else 1
max_string_column_length = np.max([len(col) for col in string_columns]) if len(string_columns) > 0 else 1
log_time("max_string_length =" + str(max_string_length))

float_columns = df.select_dtypes(include=["float"]).columns
max_float_column_length = np.max([len(col) for col in float_columns]) if len(float_columns) > 0 else 1
log_time("max_float_column_length =" + str(max_float_column_length))
int_columns = df.select_dtypes(include=["int"]).columns
max_int_column_length = np.max([len(col) for col in int_columns]) if len(int_columns) > 0 else 1
log_time("max_int_column_length =" + str(max_int_column_length))

max_column_length = np.max(
    [max_float_column_length, max_int_column_length, max_string_column_length]
)
log_time("max_column_length =" + str(max_column_length))

file = h5py.File(Path(__file__).parent / "sample.h5", "w")
log_time("Opened h5 file")
string_data = df[string_columns].astype(str).values
log_time("Converted string data to str")
string_data_ascii = np.array(
    [[s.encode("ascii", "ignore") for s in row] for row in string_data]
)
log_time("Converted string data to ascii")
file.create_dataset(
    "string_data",
    data=string_data_ascii,
    dtype=f"S{max_string_length}",
    compression=compression,
)
log_time("Created string_data dataset")
file.create_dataset(
    "string_columns",
    data=np.array(string_columns, dtype=f"S{max_string_column_length}"),
    compression=compression,
)
log_time("Created string_columns dataset")
file.create_dataset(
    "float_data", data=df[float_columns].values, dtype="f4", compression=compression
)
log_time("Created float_data dataset")
file.create_dataset(
    "float_columns",
    data=np.array(float_columns, dtype=f"S{max_float_column_length}"),
    compression=compression,
)
log_time("Created float_columns dataset")
file.create_dataset(
    "int_data", data=df[int_columns].values, dtype="i4", compression=compression
)
log_time("Created int_data dataset")
file.create_dataset(
    "int_columns",
    data=np.array(int_columns, dtype=f"S{max_int_column_length}"),
    compression=compression,
)
log_time("Created int_columns dataset")
file.create_dataset(
    "column_order",
    data=np.array(
        df.columns,
        dtype=f"S{max_column_length}",
    ),
    compression=compression,
)
log_time("Created column_order dataset")

file.flush()
file.close()

 

And here is the JSL side, which opens the datasets and reassembles the original table:

CloseAll(datatables, nosave);
last_time = TickSeconds();
log_time = Function({msg},
	now = TickSeconds();
	print(Char(now - last_time) || " - " || Char(msg));
	last_time = now;
);
	
hdf5_file_path = "my_table.h5";
dataset_names = {
	// "/column_order",
	"/float_columns",
	"/float_data",
	"/int_columns",
	"/int_data",
	"/string_columns",
	"/string_data"
};
// Open the HDF5 file, loading only the listed datasets
hdf5_file = Open(hdf5_file_path, dataset_names, invisible);
log_time("Opened HDF5 file: " || hdf5_file_path);

// Get the column name lists for each data type
// columns = As Column(DataTable("-column_order"), "Column 1"n) << get as matrix;
string_columns = As Column(DataTable("-string_columns"), "Column 1"n) << get as matrix;
float_columns = As Column(DataTable("-float_columns"), "Column 1"n) << get as matrix;
int_columns = As Column(DataTable("-int_columns"), "Column 1"n) << get as matrix;
Close(DataTable("-string_columns"), nosave);
Close(DataTable("-float_columns"), nosave);
Close(DataTable("-int_columns"), nosave);
log_time("Got columns");

dt_int = DataTable("-int_data");
ForEach({cname, i}, int_columns,
	Column(dt_int, "Column " || Char(i)) << SetName(cname)
);
log_time("Assembled int table");

dt_float = DataTable("-float_data");
ForEach({cname, i}, float_columns,
	Column(dt_float, "Column " || Char(i)) << SetName(cname)
);
log_time("Assembled float table");

dt_string = DataTable("-string_data");
ForEach({cname, i}, string_columns,
	Column(dt_string, "Column " || Char(i)) << SetName(cname)
);
log_time("Assembled string table");

dt_string << Update(
	With( dt_int ),
	Replace Columns in Main Table( None )
);
Close(dt_int, nosave);
dt_string << Update(
	With( dt_float ),
	Replace Columns in Main Table( None )
);
Close(dt_float, nosave);
log_time("Combined table");

// For( i = N Items( columns ), i >= 1, i--,
// 	Eval( Parse( "combined_dt << Move Selected Columns(\!"" || columns[i] || "\!"n, To first )" ) )
// );
// log_time("Moved columns");
dt_string << SetName("my_table");
dt_string << ShowWindow(1);

I disabled the code that attempts to reconstruct the original column order, because for wide tables it's very slow. Also, please note that I didn't account for the case where there are no columns of a particular type. That will require a little bit of extra work on the JSL side if you care about it; see the sketch below.
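
For instance, a minimal sketch of such a guard for the int table (untested; the float and string tables would need the same pattern, as would the earlier reads of the "-int_columns"-style tables):

// Hypothetical guard: only rename and merge the int table if it actually opened.
dt_int = Try( Data Table( "-int_data" ), Empty() );
If( !Is Empty( dt_int ),
	For Each( {cname, i}, int_columns,
		Column( dt_int, "Column " || Char( i ) ) << Set Name( cname )
	);
	dt_string << Update( With( dt_int ), Replace Columns in Main Table( None ) );
	Close( dt_int, No Save );
);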

 

Results: for a data table with 80 rows and 115,791 columns, which would normally take pandas 20 seconds to save as CSV and JMP 24 seconds to open, it took my script 1.7 seconds to save and JMP 11.5 seconds to open. Additionally, the file shrank from 88.8 MB to 24.5 MB with gzip compression. That's a total savings of 70% in time and 72% in space.
