
Importing large Parquet files into JMP

simpml
Level II

Hello,

 

I need to efficiently import large datasets in Parquet format (possibly multiple GB) into JMP. I saw the instruction video for CSV import here and used the same approach, with pd.read_parquet instead. My worry is that this import will be very inefficient, because the table has to occupy memory twice: once as a pandas DataFrame and again as the JMP data table. Is there an approach better suited to PCs with modest specs?

 

Thanks a lot.



Re: Importing large Parquet files into JMP

First and foremost: you will need plenty of RAM, at least 2x the binary file size and probably 3x or more, just to do this in memory directly from Parquet to a JMP data table. Use the Python integration and the PyArrow Python package.

import jmp
from jmputils import jpip
jpip('install','pyarrow')
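
Before loading anything, it may help to check how big the file really is once decompressed. The sketch below is not part of the original script: it reads only the Parquet footer through PyArrow's metadata API and applies the 2x to 3x rule of thumb from above. The helper names (estimated_peak_bytes, describe_parquet) are made up for illustration, and the factor is only a rough guide.

```python
import os


def estimated_peak_bytes(file_bytes: int, factor: float = 3.0) -> int:
    """Rough peak-RAM estimate: file size times a safety factor (2x-3x rule of thumb)."""
    return int(file_bytes * factor)


def describe_parquet(path: str) -> dict:
    """Summarize a Parquet file from its footer only; no column data is loaded."""
    import pyarrow.parquet as pq  # deferred so the estimate helper works without pyarrow

    meta = pq.ParquetFile(path).metadata
    # total_byte_size is the uncompressed size of the column data in one row group
    uncompressed = sum(
        meta.row_group(i).total_byte_size for i in range(meta.num_row_groups)
    )
    return {
        "rows": meta.num_rows,
        "columns": meta.num_columns,
        "row_groups": meta.num_row_groups,
        "uncompressed_bytes": uncompressed,
        "file_bytes": os.path.getsize(path),
    }


# Example (path is a placeholder):
# info = describe_parquet('/Your_path/to/userdata1.parquet')
# print(info, estimated_peak_bytes(info["file_bytes"]))
```

If the estimate is well above the RAM you have free, reading only the columns you need (pq.read_table accepts a columns= list) is probably the cheapest first step.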

 

Here is a sample script for JMP 18 that directly walks a parquet file and builds a data table from Python with the data, all in-memory.

 

In JMP 18 you then need to walk the data from Python one column at a time, filling one data table column at a time. You can walk the Parquet schema and data directly, or you can convert the Parquet table to a pandas DataFrame. An example of converting a pandas DataFrame to a jmp.DataTable object can be found in the $SAMPLE_SCRIPTS/Python/dt2pandas2dt.jsl and .py scripts.

# parquet.py
# Author: Paul R. Nelson
#         JMP Statistical Discovery LLC
#
# Description:
# 	Directly read a pyarrow table and create a JMP datatable. 
# The data for this file comes from a repository that is Apache licensed.
# https://github.com/Teradata/kylo/blob/master/samples/sample-data/parquet/userdata1.parquet
#
import jmp
import pyarrow.parquet as pq

ver = jmp.__version__.split('.')
if(ver[0] == '0' and ver[1] < '5'):
	print('This version requires JMP 18.0 EA-5 or newer.')

pq_Table = pq.read_table('/Your_path/to/userdata1.parquet')

#print(pq_Table)
print(f'Table Rows: {len(pq_Table)}' )

print(f'Table Shape: {pq_Table.shape}')
print(f'Table Schema:\n{pq_Table.schema}')
print(f'Column Names: {pq_Table.column_names}')
#print(pq_Table.columns)
print(f'Num Columns: {pq_Table.num_columns}')

dt = jmp.DataTable('From Parquet', pq_Table.num_rows)
dt.new_column('first_name', jmp.DataType.Character)
dt.new_column('last_name', jmp.DataType.Character)
dt.new_column('Salary')

# set column properties (widths)
jmp.run_jsl('''
// Change column display width: last_name
Data Table( "From Parquet" ):last_name << Set Display Width( 105 );
// Change column display width: Salary
Data Table( "From Parquet" ):Salary << Set Display Width( 90 );	
''')
# Fill each data table column from a Python list
dt[0] = pq_Table.column(2).to_pylist()
dt[1] = pq_Table.column(3).to_pylist()
dt[2] = pq_Table.column(10).to_pylist()
# ===================================================================================
# Copyright © 2025 JMP Statistical Discovery LLC, Cary, NC, USA. All rights reserved.
#
# JMP STATISTICAL DISCOVERY LLC ("JMP") PERMITS THE USE OF THIS COMPUTER SOFTWARE
# CODE ("CODE") ON AN AS-IS BASIS AND AUTHORIZES YOU TO USE THE CODE SUBJECT TO
# THE TERMS LISTED HEREIN. BY USING THE CODE, YOU AGREE TO THESE TERMS. YOUR USE
# OF THE CODE IS AT YOUR OWN RISK. JMP MAKES NO REPRESENTATION OR WARRANTY,
# EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO, WARRANTIES OF
# MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NON-INFRINGEMENT, AND TITLE,
# WITH RESPECT TO THE CODE.
#
# You may use the Code solely as part of a software product you currently have
# licensed from JMP, JMP's parent company, SAS Institute Inc. ("SAS US") or one
# of SAS' subsidiaries (together with SAS US, "SAS") or authorized agents (the
# "Software"), and not for any other purpose. The Code is designed to either
# correct an error in the Software or to add functionality to the Software but
# has not necessarily been tested. Accordingly, JMP makes no representation or
# warranty that the Code (1) will operate error-free or (2) will not contain any
# viruses or other applications or executables (including, without limitation,
# any "trap doors," "worms" and "time bombs") that will degrade or infect any
# software product that you license from JMP or any other software or your
# network or systems. JMP is under no obligation to maintain, support, or
# continue to distribute the Code.
#
# Neither JMP nor its licensors shall be liable to you or any third party for any
# general, special, direct, indirect, consequential, incidental, or other damages
# whatsoever arising out of or related to your use or inability to use the Code,
# even if JMP has been advised of the possibility of such damages. Except as
# otherwise provided above, the Code is governed by the same agreement that
# governs the Software. If you do not have an existing agreement with JMP or SAS
# governing the Software, you may not use the Code.
#
# US export laws and regulations apply to the Code and any other JMP-provided
# technology ("Controlled Material"). The Controlled Material originates from the
# United States. Customer agrees to comply with these and other applicable export
# and import laws and regulations, except as prohibited or penalized by law
# ("Trade Law"). Customer warrants that Customer and its users are not: (a)
# prohibited by Trade Law from accessing Controlled Material without US
# government approval; (b) located in or under control of any country or other
# territory subject to general export or trade embargo under Trade Law; or (c)
# engaged in any of the following end-uses: nuclear, chemical or biological
# weapons; nuclear facilities not under International Atomic Energy Agency
# safeguards; missiles or unmanned aerial vehicles capable of long-range use or
# weapons delivery, military training or assistance, military or intelligence
# end-use in Russia or in any country in Country Group D:5 of the United States
# Export Administration Regulations; deep water, Arctic offshore or shale oil or
# gas exploration involving Russia or Russian companies, or Russian energy export
# pipelines. Customer will not import or use any data within the System that is
# subject to the US International Traffic Arms Regulations. United States export
# classification information for JMP software and its affiliates is available at
# jmp.com/export.
#
# JMP and all other JMP Statistical Discovery LLC product or service names are
# registered trademarks or trademarks of SAS Institute Inc. in the USA and other
# countries. ® indicates USA registration. Other brand and product names are
# registered trademarks or trademarks of their respective companies.
#
# ===================================================================================
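
For files with many columns, the same JMP 18 calls can be driven from the Parquet schema instead of naming each column by hand. This is only a sketch under assumptions: parquet_to_jmp and is_numeric_arrow_type are hypothetical helpers, the numeric test is a plain lookup on Arrow type names, everything non-numeric (dates, booleans, nested types) is simply stored as text, and how JMP treats null values on assignment has not been verified here.

```python
# Arrow type names (as printed by str(field.type)) treated as numeric in this sketch
NUMERIC_ARROW_TYPES = frozenset({
    "int8", "int16", "int32", "int64",
    "uint8", "uint16", "uint32", "uint64",
    "halffloat", "float", "double",
})


def is_numeric_arrow_type(type_name: str) -> bool:
    """True if the Arrow type name is a plain integer or floating-point type."""
    return type_name in NUMERIC_ARROW_TYPES


def parquet_to_jmp(path: str, table_name: str = "From Parquet"):
    """Build a JMP data table from every column of a Parquet file, one column at a time."""
    import jmp                    # only importable inside JMP's Python environment
    import pyarrow.parquet as pq

    pq_table = pq.read_table(path)
    dt = jmp.DataTable(table_name, pq_table.num_rows)
    for i, field in enumerate(pq_table.schema):
        values = pq_table.column(i).to_pylist()
        if is_numeric_arrow_type(str(field.type)):
            dt.new_column(field.name)  # numeric by default, as for 'Salary' above
            dt[i] = values
        else:
            dt.new_column(field.name, jmp.DataType.Character)
            # Assumption: store anything non-numeric as text, nulls as empty strings
            dt[i] = ["" if v is None else str(v) for v in values]
    return dt


# Example, run inside JMP (path is a placeholder):
# dt = parquet_to_jmp('/Your_path/to/userdata1.parquet')
```

Only one column's Python list exists at a time in the loop, although the full Arrow table still stays in memory until the function returns.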

 

For JMP 19 EA-4+ the situation is much better.

 

import jmp
import numpy as np
import pandas as pd
import pyarrow.parquet as pq

# Parquet sample files from
# https://github.com/Teradata/kylo/tree/master/samples/sample-data/parquet
pq_Table = pq.read_table('/Your_path/to/userdata1.parquet')
pd_table = pq_Table.to_pandas()

# Create the table in memory from the pandas DataFrame.
dt = jmp.from_dataframe(pd_table)

#Same Disclaimer applies.
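
On the original worry about holding the data twice: the sketch below shows one possible way to keep the overlap short. It assumes jmp.from_dataframe copies the data into the JMP table (not confirmed here), so the intermediate objects can be released as soon as each step is done. load_parquet_lean is a hypothetical name; split_blocks and self_destruct are PyArrow's documented options for lowering peak memory in to_pandas().

```python
def load_parquet_lean(path, columns=None):
    """Parquet -> pandas -> JMP, releasing each intermediate copy as early as possible."""
    import jmp                    # only importable inside JMP's Python environment
    import pyarrow.parquet as pq

    # columns=None reads every column; pass a list of names to read less
    pq_table = pq.read_table(path, columns=columns)

    # self_destruct frees Arrow buffers while they are converted,
    # so the Arrow table must not be used again after this call
    pd_table = pq_table.to_pandas(split_blocks=True, self_destruct=True)
    del pq_table

    dt = jmp.from_dataframe(pd_table)
    del pd_table  # assumption: the JMP table holds its own copy of the data
    return dt


# Example, run inside JMP 19 EA-4+ (path is a placeholder):
# dt = load_parquet_lean('/Your_path/to/userdata1.parquet')
```

This does not remove the need for headroom; it only tries to avoid having the Arrow table, the DataFrame, and the JMP table all alive at once.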

 

 

Craige_Hales
Super User


Re: Importing large Parquet files into JMP

The Python approach is probably better than what follows. But this does work, at least with a toy parquet file.

There is an Apache Drill project that can read Parquet files. It wants to be accessed through JDBC (Java, unfortunately not ODBC), and the ODBC drivers for it, if they still exist, might not be free. But it has a REST-based API, and JSL can use that.

Steps 1, 2, and 3 of this post pointed in the right direction (thanks @robertspierre). I'm not sure how you'll extract the file on Windows; Linux can do it, and some Windows tool can probably do it too. I used https://drill.apache.org/download/ and picked "Drill for Hadoop 3 and non-Hadoop environments, direct download". I also got microsoft-jdk-21.0.6-windows-x64.msi from https://learn.microsoft.com/en-us/java/openjdk/download .

I left the expanded apache-drill-1.21.2 directory (from the tar.gz) on the desktop and started it like this:

Craige_Hales_0-1744504395897.png

cd into apache-drill-1.21.2, then run bin/drill-embedded.bat. I then played with a query against a supplied Parquet file, which you don't have to do; instead, start JMP and run this script:

fields = Associative Array();
fields["queryType"] = "SQL";
fields["query"] = "SELECT * FROM `dfs`.`C:\Users\c\Desktop\apache-drill-1.21.2\sample-data\nation.parquet`";
s = New HTTP Request( URL( "localhost:8047/query.json" ), Method( "POST" ), JSON( Fields( fields ) ), Headers( {"Accept: application/json"} ) ) <<Send;

dt = open(chartoblob(s),json,guess("tall"))

and get this table:

 

Craige_Hales_1-1744504821964.png

 

When you start the Drill process above, it launches a web server that handles the REST API request from JSL. It also has a web interface, which I don't think is going to be interesting for reading Parquet files.

Craige_Hales_2-1744505945837.png

https://drill.apache.org/docs/rest-api-introduction/#query might help with the api.
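
For anyone who would rather call the same REST endpoint from Python than from JSL, here is a rough standard-library equivalent of the request above. It assumes Drill is running locally on port 8047 as described, and that the reply is JSON with a "columns" list and a "rows" list of objects, which is my reading of the Drill REST documentation. The function names are made up for this sketch.

```python
import json
import urllib.request

DRILL_URL = "http://localhost:8047/query.json"  # same endpoint as the JSL request


def build_drill_request(sql: str, url: str = DRILL_URL) -> urllib.request.Request:
    """Build the POST request; json.dumps escapes backslashes in Windows paths."""
    payload = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


def rows_to_columns(result: dict) -> dict:
    """Pivot a {"columns": [...], "rows": [{...}, ...]} reply into one list per column."""
    return {
        name: [row.get(name) for row in result["rows"]]
        for name in result["columns"]
    }


def query_drill(sql: str) -> dict:
    """Send the query to a running Drill instance and return column lists."""
    with urllib.request.urlopen(build_drill_request(sql)) as response:
        return rows_to_columns(json.load(response))


# Example (needs drill-embedded running; path taken from the JSL script above):
# cols = query_drill(r"SELECT * FROM `dfs`.`C:\Users\c\Desktop\apache-drill-1.21.2\sample-data\nation.parquet`")
```

The column lists could then be assigned to JMP data table columns from Python, but for large files the whole result still travels through JSON, so this is unlikely to be the memory-friendly route.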

 

Interesting reads before you build too much on this:

https://www.starburst.io/blog/the-death-of-apache-drill/

https://stackoverflow.com/questions/59754457/mapr-driver-discontinued-for-apache-drill-what-now

and yet there is still activity:

https://github.com/apache/drill

 

Craige
simpml
Level II


Re: Importing large Parquet files into JMP

Thank you!

simpml
Level II


Re: Importing large Parquet files into JMP

Where can I try the JMP EA version? My Parquet files have too many columns to define each one explicitly, so from_dataframe() is very alluring.


Re: Importing large Parquet files into JMP

Contact JMP sales; there is an early adopter program. But I think that if you have access to JMP 18 through the MyJMP portal, you should have access to the Early Adopter releases. There may be an additional non-disclosure form or step, though.

lala
Level VIII


Re: Importing large Parquet files into JMP

I'm so slow with the grok3 method

2025-04-14_18-46-20.png