nikles
Level VI

Help on opening hdf5 files

Hi,

 

I'm trying to use JMP to open an HDF5 file using a script.  I can successfully open it using the Open() command:

blobpath = "/Documents/myblob";
dtblob_lis = Open(blobpath);

but I get a pop-up asking which data set I'd like to open: 

[Screenshot: a JMP pop-up listing the data sets in the file and asking which to open]

My question: is there an option in the Open() command I could use to specify which data set within the blob I wish to open, and avoid the pop-up?  Thanks.

txnelson
Super User

Re: Help on opening hdf5 files

Brian,

Where is this capability documented?

Jim

Re: Help on opening hdf5 files

Hi Jim,

 

There is an example of this at https://www.jmp.com/support/help/en/18.1/index.shtml#page/jmp/import-data.shtml under the section "Import HDF5 Files".  It's admittedly a bit challenging to find, but it does show up in Google.
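
Based on that page, you can pass a list of data set names as the second argument to Open(), which skips the pop-up.  A minimal sketch (the data set path below is a placeholder; substitute the names shown in your pop-up):

// Open only the named data sets from the HDF5 file, with no selection dialog.
// "/Group1/MyDataset" is a placeholder path, not a real name from your file.
blobpath = "/Documents/myblob";
dtblob_lis = Open( blobpath, {"/Group1/MyDataset"} );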

 

Brian

nikles
Level VI

Re: Help on opening hdf5 files

This is exactly the solution I was looking for.  Works.  Thanks.

jljmp
Level I

Re: Help on opening hdf5 files

I was looking for faster & more efficient ways of transferring data between Python and JMP 17, because both pandas and JMP are quite slow at reading/writing CSV, and even the GetTable() function seems to use CSV under the hood, at least in JMP 17. Pandas' to_hdf function doesn't work because instead of storing string data in an HDF-compliant dataset, it uses some custom shenanigans to store a NumPy array as a string, which is not only space-inefficient but locks you into reading it with Python.

 

I wrote some proof-of-concept Python code that saves a table to .h5 using h5py, and JSL code that opens the file and reconstructs the original table from the separate datasets.

 

import pandas as pd
import numpy as np
from pathlib import Path
import h5py
from time import perf_counter


last_time = perf_counter()
compression = "gzip"


def log_time(msg):
    global last_time
    current_time = perf_counter()
    elapsed_time = current_time - last_time
    print(f"{elapsed_time:1.3f} s: {msg}")
    last_time = current_time

# read dataframe (just for the development example)
df = pd.read_csv(Path(__file__).parent / "my_table.csv")
log_time("Read df")

string_columns = df.select_dtypes(include=["object"]).columns
# min_itemsize = {col: df[col].str.len().max() for col in string_columns}
max_string_length = np.max([df[col].str.len().max() for col in string_columns]) if len(string_columns) > 0 else 1
max_string_column_length = np.max([len(col) for col in string_columns]) if len(string_columns) > 0 else 1
log_time("max_string_length =" + str(max_string_length))

float_columns = df.select_dtypes(include=["float"]).columns
max_float_column_length = np.max([len(col) for col in float_columns]) if len(float_columns) > 0 else 1
log_time("max_float_column_length =" + str(max_float_column_length))
int_columns = df.select_dtypes(include=["int"]).columns
max_int_column_length = np.max([len(col) for col in int_columns]) if len(int_columns) > 0 else 1
log_time("max_int_column_length =" + str(max_int_column_length))

max_column_length = np.max(
    [max_float_column_length, max_int_column_length, max_string_column_length]
)
log_time("max_column_length =" + str(max_column_length))

file = h5py.File(Path(__file__).parent / "sample.h5", "w")
log_time("Opened h5 file")
string_data = df[string_columns].astype(str).values
log_time("Converted string data to str")
string_data_ascii = np.array(
    [[s.encode("ascii", "ignore") for s in row] for row in string_data]
)
log_time("Converted string data to ascii")
file.create_dataset(
    "string_data",
    data=string_data_ascii,
    dtype=f"S{max_string_length}",
    compression=compression,
)
log_time("Created string_data dataset")
file.create_dataset(
    "string_columns",
    data=np.array(string_columns, dtype=f"S{max_string_column_length}"),
    compression=compression,
)
log_time("Created string_columns dataset")
file.create_dataset(
    "float_data", data=df[float_columns].values, dtype="f4", compression=compression
)
log_time("Created float_data dataset")
file.create_dataset(
    "float_columns",
    data=np.array(float_columns, dtype=f"S{max_float_column_length}"),
    compression=compression,
)
log_time("Created float_columns dataset")
file.create_dataset(
    "int_data", data=df[int_columns].values, dtype="i4", compression=compression
)
log_time("Created int_data dataset")
file.create_dataset(
    "int_columns",
    data=np.array(int_columns, dtype=f"S{max_int_column_length}"),
    compression=compression,
)
log_time("Created int_columns dataset")
file.create_dataset(
    "column_order",
    data=np.array(
        df.columns,
        dtype=f"S{max_column_length}",
    ),
    compression=compression,
)
log_time("Created column_order dataset")

file.flush()
file.close()

 

And here is the JSL side, which opens the datasets and reassembles the original table:

CloseAll(datatables, nosave);
last_time = TickSeconds();
log_time = Function({msg},
	now = TickSeconds();
	print(Char(now - last_time) || " - " || Char(msg));
	last_time = now;
);
	
hdf5_file_path = "my_table.h5";
dataset_names = {
	// "/column_order",
	"/float_columns",
	"/float_data",
	"/int_columns",
	"/int_data",
	"/string_columns",
	"/string_data"
};
// Open the HDF5 file, loading only the listed datasets
hdf5_file = Open(hdf5_file_path, dataset_names, invisible);
log_time("Opened HDF5 file: " || hdf5_file_path);

// Get the column name lists for each data type
// columns = As Column(DataTable("-column_order"), "Column 1"n) << get as matrix;
string_columns = As Column(DataTable("-string_columns"), "Column 1"n) << get as matrix;
float_columns = As Column(DataTable("-float_columns"), "Column 1"n) << get as matrix;
int_columns = As Column(DataTable("-int_columns"), "Column 1"n) << get as matrix;
Close(DataTable("-string_columns"), nosave);
Close(DataTable("-float_columns"), nosave);
Close(DataTable("-int_columns"), nosave);
log_time("Got columns");

dt_int = DataTable("-int_data");
ForEach({cname, i}, int_columns,
	Column(dt_int, "Column " || Char(i)) << SetName(cname)
);
log_time("Assembled int table");

dt_float = DataTable("-float_data");
ForEach({cname, i}, float_columns,
	Column(dt_float, "Column " || Char(i)) << SetName(cname)
);
log_time("Assembled float table");

dt_string = DataTable("-string_data");
ForEach({cname, i}, string_columns,
	Column(dt_string, "Column " || Char(i)) << SetName(cname)
);
log_time("Assembled string table");

dt_string << Update(
	With( dt_int ),
	Replace Columns in Main Table( None )
);
Close(dt_int, nosave);
dt_string << Update(
	With( dt_float ),
	Replace Columns in Main Table( None )
);
Close(dt_float, nosave);
log_time("Combined table");

// For( i = N Items( columns ), i >= 1, i--,
// 	Eval( Parse( "combined_dt << Move Selected Columns(\!"" || columns[i] || "\!"n, To first )" ) )
// );
// log_time("Moved columns");
dt_string << SetName("my_table");
dt_string << ShowWindow(1);

I disabled the code that attempts to reconstruct the original column order, because for wide tables it's very slow. Also, please note that I didn't account for the case where there are no columns of a particular type. That will require a little bit of extra work on the JSL side if you care about it; see the sketch below.
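
For instance, a minimal sketch of such a guard for the int table (untested; the float and string tables would need the same pattern, as would the earlier reads of the "-int_columns"-style tables):

// Hypothetical guard: only rename and merge the int table if it actually opened.
dt_int = Try( Data Table( "-int_data" ), Empty() );
If( !Is Empty( dt_int ),
	For Each( {cname, i}, int_columns,
		Column( dt_int, "Column " || Char( i ) ) << Set Name( cname )
	);
	dt_string << Update( With( dt_int ), Replace Columns in Main Table( None ) );
	Close( dt_int, No Save );
);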

 

Results: for a data table with 80 rows and 115,791 columns, which would normally take pandas 20 seconds to save as CSV and JMP 24 seconds to open, it took my script 1.7 seconds to save and JMP 11.5 seconds to open. Additionally, the file shrank from 88.8 MB to 24.5 MB with gzip compression. That's a total savings of 70% in time and 72% in space.
