Apache Parquet file importer

This add-in interactively imports Apache Parquet files (https://parquet.apache.org/) into JMP data tables.

It consists of two commands:

  • Single file: select a single .parquet file; the add-in checks its validity, then after a confirmation message creates and opens the resulting JMP data table, or issues a warning message if the file is not valid;
  • Multiple files: select a folder of .parquet files; the add-in opens the resulting JMP tables (note: the folder must contain only valid .parquet files, with no other files or subfolders).
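The add-in's actual implementation is not shown here, but the two commands can be sketched roughly as follows. This is a hypothetical illustration (`looks_like_parquet` and `parquet_files_in` are made-up names); the validity probe simply checks the 4-byte `PAR1` magic marker that frames every well-formed Parquet file.

```python
from pathlib import Path

def looks_like_parquet(path):
    """Cheap validity probe: a Parquet file starts and ends with b'PAR1'."""
    data = Path(path).read_bytes()
    return len(data) >= 12 and data[:4] == b"PAR1" and data[-4:] == b"PAR1"

def parquet_files_in(folder):
    """Multiple-files mode: collect the .parquet files in a folder that pass the probe."""
    files = sorted(Path(folder).glob("*.parquet"))
    return [f for f in files if looks_like_parquet(f)]
```

A real importer would still hand the file to pyarrow afterwards; the magic-byte check only filters out renamed or truncated files cheaply.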

System requirements:

Use the standard Python update tools to install the required packages (the add-in's script imports numpy, pandas, and pyarrow) for the Python configuration that JMP uses.
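As a quick pre-flight check (the helper name is illustrative, not part of the add-in), you can verify that the packages the add-in's script imports are available before running it:

```python
import importlib.util

def missing_packages(names=("numpy", "pandas", "pyarrow")):
    """Return the subset of `names` that cannot be imported in this Python."""
    return [n for n in names if importlib.util.find_spec(n) is None]

missing = missing_packages()
if missing:
    print("Install with, e.g.: pip install " + " ".join(missing))
else:
    print("All required packages are available.")
```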

Note: to test the add-in, you can find sample Parquet files here: https://github.com/Teradata/kylo/tree/master/samples/sample-data/parquet

Comments
dmmdiego

Massimo,

I tested the add-in in JMP Pro 17.2 to check loading a Parquet file generated through pandas, and it did not work: the add-in does not recognize the file, even though I can read it with Parquet reader programs such as Tad.

Looking forward to any help, thanks!

Hi dmdiego,

did you install all needed Python packages?

If yes, please share the JMP log so we can see better what is happening.

By the way, JMP 18 (expected to ship in March 2024) will have a more direct JMP-Python link, so chances are we will update the add-in.

Cheers, Massimo

 

dmmdiego

Massimo,

I think I got all the packages, this is my setup:

Python version 3.11.7

numpy version 1.26.3

pandas version 2.1.4

matplotlib version 3.8.0

scipy version 1.11.4

sqlite3 (included in Python base)

PyQt5 version 5.15.10

pyarrow version 14.0.2

 

The only thing I can think of: I manage my packages in the base environment through conda, and I am not sure if that causes a conflict, or if it is the way Python was set up on my computer...


This is the log from JMP:

/*:
/**********/

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Read a Table from Parquet format
table = pq.read_table(parquet_file)

# Convert Table to a pandas-compatible DataFrame
df = table.to_pandas()

#print(df)


/**********/

 

Even with simple sample parquet files, I get the same error: 

(screenshot of the error message: dmmdiego_0-1707920721925.png)

Hi dmdiego,

configuration seems ok, and conda should not be an issue.

The log is not saying much.

Could you please share with us the parquet file you are using, so that we can test directly with it?

Thanks, Massimo

 

 

dmmdiego

Using the sample files in your original link (https://github.com/Teradata/kylo/tree/master/samples/sample-data/parquet) did not work either; I cannot load any of those files, and I get the same error message.

Hi, it should not behave like that; it is possibly related to the add-in installation, which may conflict with JMP Pro 17.2.

 

Could you please copy and run the script below in a script window (without the add-in), and tell me if it works?

 

Thanks, Massimo

 

Names Default To Here( 1 );

// Pick Parquet file from directory
parquet_file_pf = Pick File( "Select Parquet File", , {"Parquet Files|parquet"} );

// Create Parquet_file string to pass to Python - exclude initial slash 
parquet_file = Substr( parquet_file_pf, 2 );

// Extract Parquet file name without .parquet extension - to be used as JMP table name
parquet_filename = Word( -2, parquet_file, "/." );

// Show( parquet_file );

// Init Python connection
Python Init();

// Send Parquet complete file string to Python
Python Send( parquet_file );

//Python Submit( "print(parquet_file)" );

// Read Parquet file as a table, then read table as pandas DataFrame
Python Submit(
	"\[
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Read a Table from Parquet format
table = pq.read_table(parquet_file)

# Convert Table to a pandas-compatible DataFrame
df = table.to_pandas()

#print(df)

]\"
);

// Get DataFrame as a JMP Data Table
dt = Python Get( df );

// check dt successfully imported
Try(
	Is Missing( dt );
	table_imported = 0;
	,
	table_imported = 1
);

// if dt is not a valid data table, set error message and close program
If( (table_imported == 0),
	nw = New Window( "Import not successful", Modal, Text Box( "File " || parquet_filename || ".parquet is not a valid Parquet file" ) );
	Stop();
);

// If dt is valid, continue execution
// Rename JMP Data Table as Parquet file name without extension
dt << Set Name( parquet_filename );

// Modal message window to confirm import
import_message = "File " || parquet_filename || ".parquet successfully imported";
nw = New Window( "Import successful", Modal, Text Box( import_message ) );

// Open JMP Data Table
dt << New Data View;

// Save JMP Data Table
// dt << Save();

// Terminate Python session
Python Term();

 

 

dmmdiego

Massimo,

The script works very well, and I can successfully load Parquet files. This will be super useful: instead of the add-in, I can just run the script. Thanks for the kind help!

That said, there is indeed an issue with the add-in, which does not work in my current installation of JMP Pro. But the script will do.

Straight from my JMP Discovery Summit presentation: a JMP 18 script for processing Parquet files, where we go directly from the Parquet column to a jmp.DataTable Python object. JMP 18 allows live creation, reading, and modification of JMP data tables from Python. The following script uses one of the Teradata sample files, accesses the Parquet file's schema, and creates the comparable data table column directly in Python. It runs directly from the Python-aware script editor in JMP 18. The jmp package is only available from within JMP, as it is written as a C++ extension within the JMP executable's code base.

 

 

# parquet.py
# Author: Paul R. Nelson
#         JMP Statistical Discovery LLC
#
# Description:
# 	Directly read a pyarrow table and create a JMP datatable. Discovered 
#   some issues in EA-5, that prevented this from working as desired, 
#   There is a conditional based on jmp.__version__ to check for EA-6
#   or newer. 
# 
# The data for this file comes from a repository that is Apache licensed.
# https://github.com/Teradata/kylo/blob/master/samples/sample-data/parquet/userdata1.parquet
#
import jmp
import pyarrow.parquet as pq
from pyarrow import types as ptypes

# verify JMP 18 EA-6 or newer
ver = jmp.__version__.split('.')
if(ver[0] == '0' and ver[1] < '6'):
	print('This version requires JMP 18.0 EA-6 or newer.')

pq_Table = pq.read_table('/Users/me_me_me/Presentation/Parquet/userdata1.parquet')
	
# Create a helper function to build a JMP Column 				
#   on DataTable 'dt' using parquet 'table', having column 'name'
def dtcol_from_pq_col_name(dt, table, name):
    col = None
    idx = table.schema.get_field_index(name)
    col_type = table.schema[idx].type
    print(f'{name}: {table.schema[idx].type}')
    if ptypes.is_string(col_type):
       col = dt.new_column(name, jmp.DataType.Character)
    elif ptypes.is_date(col_type):
        # This table does not contain a date column, so this branch is unimplemented
        print("FixMe: got a date column")
        return None
    elif ptypes.is_timestamp(col_type):
        col = dt.new_column(name, jmp.DataType.Numeric)
        # add JMP epoch time correction factor - 1Jan1904
        timestamps = [ts.timestamp() + 2082844800.0 for ts in table.column(idx).to_pylist()]
        dt[name] = timestamps
        # change column info to timestamp, and set an appropriate column width.
        jmp.run_jsl(f'Data Table( "{dt.name}" ):{name} << Input Format( "yyyy-mm-dd" ) << Format( "yyyy-mm-ddThh:mm:ss", 19 ) << Set Display Width( 200 );')
    else:
        col = dt.new_column(name, jmp.DataType.Numeric)  # Numeric
        dt[name] = table.column(idx).to_pylist()
    return col

# Create JMP data table with four of the columns from the table by name
dt = jmp.DataTable('From Parquet', pq_Table.num_rows)

# could just as easily create all columns
#   for n in pq_Table.column_names:
# instead of iterating on the list below
col_list = ['registration_dttm', 'first_name', 'last_name', 'salary']
for n in col_list:
    dtcol_from_pq_col_name(dt, pq_Table, n)

# set column properties by calling JSL from Python (formats & widths)
jmp.run_jsl('''
// Change column display width: last_name
Data Table( "From Parquet" ):last_name << Set Display Width( 105 );

// Change column display width: salary
Data Table( "From Parquet" ):salary << Set Display Width( 150 ) << Format( "Currency", "USD", 17, 2 );
''')

# display data about the parquet file
print(f'Table Rows: {len(pq_Table)}')
print(f'Table Shape: {pq_Table.shape}')
print(f'Table Schema:\n{pq_Table.schema}')
print(pq_Table.schema.field('salary').type)
print(f'Column Names: {pq_Table.column_names}')
print(f'Num Columns: {pq_Table.num_columns}')
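The 2082844800.0 correction factor in the timestamp branch is the number of seconds between the JMP epoch (1 Jan 1904, per the script's own comment) and the Unix epoch (1 Jan 1970). A quick standalone check of that constant:

```python
from datetime import datetime, timezone

# Seconds between the JMP epoch (1904-01-01) and the Unix epoch (1970-01-01):
# 66 years including 17 leap days = 24107 days * 86400 s.
JMP_EPOCH_OFFSET = (
    datetime(1970, 1, 1, tzinfo=timezone.utc)
    - datetime(1904, 1, 1, tzinfo=timezone.utc)
).total_seconds()

print(JMP_EPOCH_OFFSET)  # 2082844800.0, the constant used in the script above
```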