Apache Parquet file importer

This add-in interactively imports Apache Parquet files (https://parquet.apache.org/) into JMP data tables.

It consists of two commands:

  • Single file: select a single .parquet file; the add-in checks its validity, then after a confirmation message creates and opens the resulting JMP data table, or issues a warning message if the file is not valid;
  • Multiple files: select a folder of .parquet files; the add-in opens the resulting JMP tables (note: the folder must contain only valid .parquet files, with no other files or subfolders).
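The add-in's actual implementation is not shown here, but the two commands can be sketched roughly as follows. This is a hypothetical illustration (`looks_like_parquet` and `parquet_files_in` are made-up names); the validity probe simply checks the 4-byte `PAR1` magic marker that frames every well-formed Parquet file.

```python
from pathlib import Path

def looks_like_parquet(path):
    """Cheap validity probe: a Parquet file starts and ends with b'PAR1'."""
    data = Path(path).read_bytes()
    return len(data) >= 12 and data[:4] == b"PAR1" and data[-4:] == b"PAR1"

def parquet_files_in(folder):
    """Multiple-files mode: collect the .parquet files in a folder that pass the probe."""
    files = sorted(Path(folder).glob("*.parquet"))
    return [f for f in files if looks_like_parquet(f)]
```

A real importer would still hand the file to pyarrow afterwards; the magic-byte check only filters out renamed or truncated files cheaply.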

System requirements:

Use the standard Python update tools to install the required packages (the add-in's script imports numpy, pandas, and pyarrow) for the Python configuration that JMP uses.
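As a quick pre-flight check (the helper name is illustrative, not part of the add-in), you can verify that the packages the add-in's script imports are available before running it:

```python
import importlib.util

def missing_packages(names=("numpy", "pandas", "pyarrow")):
    """Return the subset of `names` that cannot be imported in this Python."""
    return [n for n in names if importlib.util.find_spec(n) is None]

missing = missing_packages()
if missing:
    print("Install with, e.g.: pip install " + " ".join(missing))
else:
    print("All required packages are available.")
```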

Note: to test the add-in, you can find sample Parquet files here: https://github.com/Teradata/kylo/tree/master/samples/sample-data/parquet

Comments
dmmdiego

Massimo,

I tested the add-in in JMP Pro 17.2 to check loading a Parquet file generated through pandas, and it did not work: the add-in does not recognize the file, even though I can read it with Parquet reader programs such as Tad.

Looking forward to any help, thanks!

Hi dmdiego,

did you install all needed Python packages?

If yes, please share the JMP log so we can see better what is happening.

By the way, JMP 18 (expected to ship in March 2024) will have a more direct JMP-Python link, so chances are we will update the add-in.

Cheers, Massimo

 

dmmdiego

Massimo,

I think I got all the packages, this is my setup:

Python version 3.11.7

numpy version 1.26.3

pandas version 2.1.4

matplotlib version 3.8.0

scipy version 1.11.4

sqlite3 (included in Python base)

PyQt5 version 5.15.10

pyarrow version 14.0.2

 

The only thing I can think of: I manage my packages in the base environment through conda, and I am not sure if that causes a conflict, or if it is the way Python was set up on my computer...


This is the log from JMP:

/*:
/**********/

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Read a Table from Parquet format
table = pq.read_table(parquet_file)

# Convert Table to a pandas-compatible DataFrame
df = table.to_pandas()

#print(df)


/**********/

 

Even with simple sample parquet files, I get the same error: 

(screenshot of the error message: dmmdiego_0-1707920721925.png)

Hi dmdiego,

configuration seems ok, and conda should not be an issue.

The log is not saying much.

Could you please share with us the parquet file you are using, so that we can test directly with it?

Thanks, Massimo

 

 

dmmdiego

Using the sample files in your original link (https://github.com/Teradata/kylo/tree/master/samples/sample-data/parquet) did not work either; I cannot load any of those files, and I get the same error message.

Hi, it should not behave like that; it is possibly related to the add-in installation, which may conflict with JMP Pro 17.2.

 

Could you please copy and run the script below in a script window (without the add-in), and tell me if it works?

 

Thanks, Massimo

 

Names Default To Here( 1 );

// Pick Parquet file from directory
parquet_file_pf = Pick File( "Select Parquet File", , {"Parquet Files|parquet"} );

// Create Parquet_file string to pass to Python - exclude initial slash 
parquet_file = Substr( parquet_file_pf, 2 );

// Extract Parquet file name without .parquet extension - to be used as JMP table name
parquet_filename = Word( -2, parquet_file, "/." );

// Show( parquet_file );

// Init Python connection
Python Init();

// Send Parquet complete file string to Python
Python Send( parquet_file );

//Python Submit( "print(parquet_file)" );

// Read Parquet file as a table, then read table as pandas DataFrame
Python Submit(
	"\[
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Read a Table from Parquet format
table = pq.read_table(parquet_file)

# Convert Table to a pandas-compatible DataFrame
df = table.to_pandas()

#print(df)

]\"
);

// Get DataFrame as a JMP Data Table
dt = Python Get( df );

// check dt successfully imported
Try(
	Is Missing( dt );
	table_imported = 0;
	,
	table_imported = 1
);

// if dt is not a valid data table, set error message and close program
If( (table_imported == 0),
	nw = New Window( "Import not successful", Modal, Text Box( "File " || parquet_filename || ".parquet is not a valid Parquet file" ) );
	Stop();
);

// If dt is valid, continue execution
// Rename JMP Data Table as Parquet file name without extension
dt << Set Name( parquet_filename );

// Modal message window to confirm import
import_message = "File " || parquet_filename || ".parquet successfully imported";
nw = New Window( "Import successful", Modal, Text Box( import_message ) );

// Open JMP Data Table
dt << New Data View;

// Save JMP Data Table
// dt << Save();

// Terminate Python session
Python Term();

 

 

dmmdiego

Massimo,

The script works very well, and I can successfully load Parquet files. This will be super useful: instead of the add-in, I can just run the script. Thanks for the kind help!

That said, there is indeed an issue with the add-in, which does not work in my current installation of JMP Pro. But the script will do.

Straight from my JMP Discovery Summit presentation: a JMP 18 script for processing Parquet files, where we go directly from the Parquet column to a jmp.DataTable Python object. JMP 18 allows live creation, reading, and modification of JMP data tables from Python. The following script uses one of the Teradata sample files, accesses the Parquet file's schema, and creates the comparable data table column directly in Python. It runs directly from the Python-aware script editor in JMP 18. The jmp package is only available from within JMP, as it is written as a C++ extension within the JMP executable's code base.

 

 

# parquet.py
# Author: Paul R. Nelson
#         JMP Statistical Discovery LLC
#
# Description:
# 	Directly read a pyarrow table and create a JMP datatable. Discovered 
#   some issues in EA-5, that prevented this from working as desired, 
#   There is a conditional based on jmp.__version__ to check for EA-6
#   or newer. 
# 
# The data for this file comes from a repository that is Apache licensed.
# https://github.com/Teradata/kylo/blob/master/samples/sample-data/parquet/userdata1.parquet
#
import jmp
import pyarrow.parquet as pq
from pyarrow import types as ptypes

# verify JMP 18 EA-6 or newer
ver = jmp.__version__.split('.')
if(ver[0] == '0' and ver[1] < '6'):
	print('This version requires JMP 18.0 EA-6 or newer.')

pq_Table = pq.read_table('/Users/me_me_me/Presentation/Parquet/userdata1.parquet')
	
# Create a helper function to build a JMP Column 				
#   on DataTable 'dt' using parquet 'table', having column 'name'
def dtcol_from_pq_col_name(dt, table, name):
    col = None
    idx = table.schema.get_field_index(name)
    col_type = table.schema[idx].type
    print(f'{name}: {table.schema[idx].type}')
    if ptypes.is_string(col_type):
       col = dt.new_column(name, jmp.DataType.Character)
    elif ptypes.is_date(col_type):
        # This table does not contain a date column, so this branch is unimplemented
        print("FixMe: got a date column")
        return None
    elif ptypes.is_timestamp(col_type):
        col = dt.new_column(name, jmp.DataType.Numeric)
        # add JMP epoch time correction factor - 1Jan1904
        timestamps = [ts.timestamp() + 2082844800.0 for ts in table.column(idx).to_pylist()]
        dt[name] = timestamps
        # change column info to timestamp, and set an appropriate column width.
        jmp.run_jsl(f'Data Table( "{dt.name}" ):{name} << Input Format( "yyyy-mm-dd" ) << Format( "yyyy-mm-ddThh:mm:ss", 19 ) << Set Display Width( 200 );')
    else:
        col = dt.new_column(name, jmp.DataType.Numeric)  # Numeric
        dt[name] = table.column(idx).to_pylist()
    return col

# Create JMP data table with four of the columns from the table by name
dt = jmp.DataTable('From Parquet', pq_Table.num_rows)

# could just as easily create all columns
#   for n in pq_Table.column_names:
# instead of iterating on the list below
col_list = ['registration_dttm', 'first_name', 'last_name', 'salary']
for n in col_list:
    dtcol_from_pq_col_name(dt, pq_Table, n)

# set column properties by calling JSL from Python (formats & widths)
jmp.run_jsl('''
// Change column display width: last_name
Data Table( "From Parquet" ):last_name << Set Display Width( 105 );

// Change column display width: salary
Data Table( "From Parquet" ):salary << Set Display Width( 150 ) << Format( "Currency", "USD", 17, 2 );
''')

# display data about the parquet file
print(f'Table Rows: {len(pq_Table)}')
print(f'Table Shape: {pq_Table.shape}')
print(f'Table Schema:\n{pq_Table.schema}')
print(pq_Table.schema.field('salary').type)
print(f'Column Names: {pq_Table.column_names}')
print(f'Num Columns: {pq_Table.num_columns}')
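The 2082844800.0 correction factor in the timestamp branch is the number of seconds between the JMP epoch (1 Jan 1904, per the script's own comment) and the Unix epoch (1 Jan 1970). A quick standalone check of that constant:

```python
from datetime import datetime, timezone

# Seconds between the JMP epoch (1904-01-01) and the Unix epoch (1970-01-01):
# 66 years including 17 leap days = 24107 days * 86400 s.
JMP_EPOCH_OFFSET = (
    datetime(1970, 1, 1, tzinfo=timezone.utc)
    - datetime(1904, 1, 1, tzinfo=timezone.utc)
).total_seconds()

print(JMP_EPOCH_OFFSET)  # 2082844800.0, the constant used in the script above
```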