MeanChris
Level III

Python in JMP: dataframe column to datatable column best practices and computation speed implications

Background

* Using the Python in JMP 19 scripting environment.

* Creating a pandas dataframe (df) from an existing JMP data table (dt).

* Running a function that requires a df input and returns a new df result. Let's call it df_results.

* Transferring the results back to a JMP data table column.

This works, but two things make me suspect I'm not doing it properly. First, I get a warning in the local log output. Second, the first time I run this on a very large number of rows it takes a long time: my code timer reported 25 seconds for 100k rows, but if I rerun it, the second run takes 3 seconds. I set up the timer to report details on each step so I could find out whether one line of code was taking most of the time.
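For reference, a per-step timer like the one described can be a small stdlib-only helper. This is a hypothetical sketch (the class name and labels are illustrative, not from the original script):

```python
import time

class StepTimer:
    """Record elapsed time between named steps, to isolate a slow line."""
    def __init__(self):
        self.laps = {}
        self._last = time.perf_counter()

    def lap(self, label):
        """Store the time since the previous lap under `label`."""
        now = time.perf_counter()
        self.laps[label] = now - self._last
        self._last = now

timer = StepTimer()
# ... step 1: build the dataframe ...
timer.lap('build df')
# ... step 2: assign the result column ...
timer.lap('assign column')
print(timer.laps)
```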

This is the single line of code I'm questioning. It took 25 seconds for 100k rows ONLY the first time, when the initially constant (or null; I tried both) column was filled out. Every subsequent rerun was almost 10x faster.

dt[ColumnName] = df_results[ColumnName]

This is the warning message in the embedded log

FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`

Also, I did this for 4 different columns total, from 2 different dataframes. They all exhibited the same behavior (and warning): slow first run, fast on every repeat, with a negligible difference between columns 1, 2, 3, and 4.

This is only a subset of a larger table with 1.3 million rows, so those 25 seconds x 4 columns will become 250+ seconds x 4 columns if I can't figure out how to get the faster speed on the first try.

Thanks

 

1 ACCEPTED SOLUTION

Re: Python in JMP: dataframe column to datatable column best practices and computation speed implications

It would be setting the entire column using dt[index], wrapped with JSL Begin/End Data Update:

jmp.run_jsl( 'dt << Begin Data Update;' )
dt[ index ] = list_like_object       # numpy array, list, dataframe column,...
jmp.run_jsl( 'dt << End Data Update;' )
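The pattern above can be wrapped in a reusable helper. Because the jmp module only exists inside JMP's embedded Python, this sketch takes the module and table as parameters; the helper name is illustrative, not part of the jmp API:

```python
def set_column_bulk(jmp_mod, dt, column, values):
    """Assign a whole column between Begin/End Data Update messages."""
    jmp_mod.run_jsl('dt << Begin Data Update;')
    try:
        dt[column] = values          # one bulk assignment, not a row loop
    finally:
        # Always send End Data Update, even if the assignment raises,
        # so the table is not left with display updates suspended.
        jmp_mod.run_jsl('dt << End Data Update;')
```

Inside JMP this would be called as, e.g., set_column_bulk(jmp, dt, 'height', new_values).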

Or by creating the data table invisible until you have filled it, then making it visible:

import jmp
dt = jmp.DataTable('Powered By Python', 40, visibility='Invisible')
...
# fill table
...

jmp.globals['jdt'] = dt     # set jdt in JSL to reference dt
jmp.run_jsl('::jdt << Show Window(1);')

The optional visibility parameter may be JMP 19 and above. The jmp.globals[ ] indexing is a JMP 19 feature.

 


Re: Python in JMP: dataframe column to datatable column best practices and computation speed implications

In JMP 18 it was necessary to build up a data table column by column from a dataframe. JMP 19 supports the portable dataframe protocol, so it is one line of code to go dt -> df, and one line to go df -> dt, all in memory.

Also understand that the very first time a Python script is run in JMP, Python itself needs to be initialized. The first time you import a package it needs to be loaded, possibly including shared libraries or byte-code compilation of the .py files.

The warning is coming from the pandas package. To clear it you need to use the suggested ser.iloc[pos]. The JMP data table dt can be indexed by 0-based column index or by column name; the numeric index is faster since it avoids string comparisons.
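The deprecation in question can be reproduced with plain pandas, independent of JMP: on a Series with integer labels, ser[i] is ambiguous between position and label, and .iloc/.loc resolve it explicitly. A minimal sketch:

```python
import pandas as pd

# A Series with non-default integer labels makes ser[0] ambiguous:
# is 0 a position or a label? pandas deprecated treating it as a position.
ser = pd.Series([10.0, 20.0, 30.0], index=[2, 1, 0])

by_position = ser.iloc[0]   # first element by position
by_label = ser.loc[0]       # element whose label is 0
print(by_position, by_label)
```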

Help -> Scripting Index, 'Python' category, jmp.from_dataframe, 'JMP to Pandas' example.

import jmp
import jmputils

try:
    if not jmputils.is_installed('pandas'):
        jmputils.jpip('install', 'pandas', echo=False)
except Exception as e:
    print(f'Install failed with exception: {e}')

import pandas as pd

dt = jmp.open(jmp.SAMPLE_DATA + "Big Class.jmp")
pandas_df = pd.api.interchange.from_dataframe(dt)
print(pandas_df)

...
# new dt_result data table from dataframe df_result
dt_result = jmp.from_dataframe( df_result )

# or continue to loop over columns if you need to update the original data table from the resulting dataframe.
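pd.api.interchange.from_dataframe accepts any source object that implements the dataframe interchange protocol, which is why the same one-liner works on a JMP table. A JMP-free sketch, using a pandas frame (with Big Class-style column names) as the stand-in source:

```python
import pandas as pd

# Any object with a __dataframe__ method works as the source; a pandas
# DataFrame stands in here for the JMP data table.
src = pd.DataFrame({'name': ['KATIE', 'LOUISE'], 'height': [59.0, 61.0]})

# Round-trip through the interchange protocol back into pandas.
roundtrip = pd.api.interchange.from_dataframe(src)
print(roundtrip)
```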
hogi
Level XIII

Re: Python in JMP: dataframe column to datatable column best practices and computation speed implications

And what is the fastest way to fill a single JMP column with new values from Python?

e.g. for a data table with 100k rows?


Re: Python in JMP: dataframe column to datatable column best practices and computation speed implications

The jmp.run_jsl( 'dt << Begin Data Update;' ) assumes dt is the JSL variable for the table in JSL, and on the Python side as well, such as when Python Send(dt); was used from JSL. That is not necessarily the case if the table was created from Python.

hogi
Level XIII

Re: Python in JMP: dataframe column to datatable column best practices and computation speed implications

Many thanks for the explanation. 

So, the code:

dt[ index ] = list_like_object 

in Python fills the column step by step, which is what causes the speed issue?


It looks similar to 

dt[ 0, columnname] = [values ...]

in JSL, which fills the column in one step and doesn't need to be wrapped in dt << Begin/End Data Update.

 

Good to know. Where can I find further documentation about the details?

hogi
Level XIII

Re: Python in JMP: dataframe column to datatable column best practices and computation speed implications

I don't know how many users know how essential 

dt << Begin Data Update;

is for writing good code. But I am sure that many users don't know.

 

A related wish from @bswedlove for For Each Row():

for each row(data update=1, ...)

It's a great idea to institutionalize the solution, making the issue disappear for the user (the solution is always there). This holds for JSL and Python.

Re: Python in JMP: dataframe column to datatable column best practices and computation speed implications

This all happens within the magic of Python's dataframe interchange protocol, where the data is mapped in memory between the supported blocks of memory. It might not need bracketing with Begin/End Data Update, but if it can't map the entire column at once it will fall back to iterating across the values, so bracketing with the Data Update statements is a good safety net.
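The zero-copy mapping idea can be illustrated with the standard library alone: a memoryview exposes another object's memory block without copying it, which is the same "map the memory, don't iterate the values" behavior described above. This sketch does not involve JMP or pandas:

```python
import array

# A C-style double array and a zero-copy view of its memory block.
values = array.array('d', [1.0, 2.0, 3.0])
view = memoryview(values)

# Writing through the view mutates the original storage: no copy was made.
view[0] = 10.0
print(values.tolist())   # the array sees the change
```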

Re: Python in JMP: dataframe column to datatable column best practices and computation speed implications

Actually, for the second case it needs to be a private table, visibility='Private'. The 'Invisible' table still has a GUI even though it is hidden, and as such is still laggy, because GUI updates take place in the data table on every cell edit.

MeanChris
Level III

Re: Python in JMP: dataframe column to datatable column best practices and computation speed implications

Thanks Paul, I will implement jmp.from_dataframe to get back to data tables and avoid the warning.

The other part of my question was about why there was such a huge speed difference, and to answer that I had to ensure I could reliably cause the SLOW calculation response. At first I couldn't: all runs, even the first after opening JMP and loading Python, were plenty fast. Then I realized there may be one critical difference. If I ADD a new column (manually; I hadn't put it in code yet) and run the code, it will ONLY be slow updating that new column, and ONLY the first time.

As a test I updated 4 columns, with 3 being pre-existing and the 4th being newly created.  It took 1 sec to update 100k rows of Col 1, another second to update Col 2, another to update Col 3, and then 19 seconds to update Col 4.

I made another table and another 'new' 4th column. If I SAVE AND CLOSE the data table with the new column, then REOPEN it, then run the code, the new 4th column updates in about 1 second, just as fast as columns 1, 2, and 3. Critically, if I only save but do NOT close and reopen the data table with the new 4th column, then run the code, it will still take the longer ~19 seconds to update column 4.

This tells me the speed difference has to do with some memory-management quirk of JMP.

I don't know if this is a problem that can or must be solved, and maybe it is only a problem if I skip the from_dataframe step. It is just very odd behavior that I'm glad I was able to recreate. Now I can try the better version of the code and see if the issue persists.
