
Discussions

Solve problems, and share tips and tricks with other JMP users.
MeanChris
Level III

Python in JMP: dataframe column to datatable column best practices and computation speed implications

Background

* Using Python in the JMP 19 scripting environment.

* Creating a pandas DataFrame (df) from an existing JMP data table (dt).

* Running a function that takes the df as input and returns a new DataFrame of results. Let's call it df_results.

* Transferring the results back to a JMP data table column.

This works, but two things make me suspect I'm not doing this properly. First, I get a warning in the local log output. Second, the first time I run this on a very large number of rows it takes a long time: my code timer said 25 seconds for 100k rows. If I rerun it, the second time it takes 3 seconds. I set up the timer to report each step so I could find out whether one line of code was taking most of the time.
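The timer is nothing fancy - roughly this pattern around each step (my_function here is just a stand-in for the real df function):

import time

t0 = time.perf_counter()
df_results = my_function(df)                  # stand-in for the real df -> df_results function
t1 = time.perf_counter()
dt[ColumnName] = df_results[ColumnName]
t2 = time.perf_counter()

print(f"Step df_results creation time: {t1 - t0:.4f} seconds")
print(f"Step xfer time: {t2 - t1:.4f} seconds")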

This is the single line of code I'm questioning. It took 25 seconds for 100k rows ONLY the first time the column (initially constant or null - I tried both) was filled in. Every subsequent rerun was almost 10x faster.

dt[ColumnName] = df_results[ColumnName]

This is the warning message in the embedded log:

FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
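Presumably the Series is being walked by integer position somewhere during the transfer. An (untested) variant that keeps pandas indexing out of the hand-off entirely would be to pass a plain list or array instead:

dt[ColumnName] = df_results[ColumnName].tolist()
# or a NumPy array:
dt[ColumnName] = df_results[ColumnName].to_numpy()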

Also, I did this for 4 different columns total from 2 different dataframes. They all exhibited the same behavior (and warning): slow first run, fast every repeat. Negligible difference between columns 1, 2, 3, or 4.

This is only a subset of a larger table with 1.3 million rows. So each column's 25 seconds at 100k rows scales to roughly 13x that - over 300 seconds per column, times 4 columns - if I can't figure out how to get the faster speed on the first try.

Thanks

 


Re: Python in JMP: dataframe column to datatable column best practices and computation speed implications

One of my colleagues pointed out that in JSL there is Begin Data Update and End Data Update, so...

jmp.run_jsl( 'dt << Begin Data Update;' )

# your python code filling new column
...

jmp.run_jsl( 'dt << End Data Update;' )

Most likely we will want to support this from Python, probably using a with block to suspend the UI updates of the table when filling or updating a large number of rows.
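As a rough sketch of what is possible today with run_jsl (not an official API - jmp.globals just makes the table visible to JSL under a name of your choosing, here jdt; dt, 'NewCol', and values are placeholders):

import jmp
from contextlib import contextmanager

@contextmanager
def suspend_data_update(dt, jsl_name='jdt'):
    # Register the table in jmp.globals so the JSL snippets can resolve it,
    # then bracket the body with Begin/End Data Update. The finally block
    # guarantees End Data Update runs even if the Python body raises.
    jmp.globals[jsl_name] = dt
    jmp.run_jsl(f'{jsl_name} << Begin Data Update;')
    try:
        yield dt
    finally:
        jmp.run_jsl(f'{jsl_name} << End Data Update;')

# usage: fill a column without the data grid refreshing on every value
with suspend_data_update(dt):
    dt['NewCol'] = values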

hogi
Level XIII

Re: Python in JMP: dataframe column to datatable column best practices and computation speed implications


@Paul_Nelson wrote:
...
Most likely we will want to do support this from Python. Probably using a with block to suspend the UI updates of the table when filling / updating large number of rows.

 

Great idea!

Please take care that interactivity is switched back on even if Python throws an error within the "with" block.

 

As an alternative:

Please add an indicator to the data grid to show when interactivity is disabled because there was a Begin Data Update without an End Data Update.

Indicator: "interactivity disabled for this data table"

hogi
Level XIII

Re: Python in JMP: dataframe column to datatable column best practices and computation speed implications

Findings of the previous posts - compressed to: 

import jmp

# open a data table and wait for the "Interactivity Manager" to become active

jmp.run_jsl('''
dt = New Table("test", Add Rows(100000), New Column("col"));
Python Send(dt);
Wait(0); // disable the delay to make it fast (i.e., remove this wait and the update below stays fast)
''');

nr = dt.nrows

# Python data update: slow!
dt[0] = range(nr)

# --- run step by step: -----

# same values: fast
dt[0] = range(nr)

# new values: slow
dt[0] = [1] * nr

 

Updating a JMP data table column via Python is very slow.

Exceptions:

  • write access right after creating/opening the data table (before some kind of "Interactivity Manager" is active)

  • no update at all (values don't change)

    and the workarounds mentioned in @Paul_Nelson's post:

  • private data tables (a rough sketch of this variant is below)

  • Python data update enclosed by JSL Begin/End Data Update()
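One possible shape of the private-table workaround, as a sketch only (it assumes the visible table is reachable in JSL as jdt and has a matching id column; staging, "new col", and the 100000-row size are made up for illustration):

import jmp

# Fill a private (UI-less) staging table from Python - no data grid to refresh -
# then merge the new column into the visible table via a JSL Update().
jmp.run_jsl('''
staging = New Table("staging", private,
    Add Rows(100000),
    New Column("id", Set Each Value(Row())),
    New Column("new col")
);
Python Send(staging);
''')

staging["new col"] = list(range(100000))   # fast: the private table has no UI to update

jmp.run_jsl('jdt << Update( With( staging ), Match Columns( :id = :id ) );')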


 

 

Fast Python data access - just with the support of JSL ; )
I suppose the issues are significant enough to be resolved in the next ".x"  release ...

Re: Python in JMP: dataframe column to datatable column best practices and computation speed implications

Yes, this is significant.  We are indeed looking at a fix.  We want the performance from Python to be such that users want to run from Python and not have to use backdoor magic methods to get performance.  

I'm surprised this didn't show up in JMP 18 or during JMP 19 EA testing. The jmp.DataTable object at its heart is the JSL data table. The Python object is just a reference to the underlying JMP data table, and we are calling the JMP data table's internal methods, which is why the run_jsl('dt << Begin Data Update') makes a difference. The entire reason for jmp.run_jsl(''' code ''') was to allow calling JSL from Python for things that we have not yet implemented, or, as in this case, to rectify an oversight on column update performance.

hogi
Level XIII

Re: Python in JMP: dataframe column to datatable column best practices and computation speed implications

Great to hear that the issue gets fixed.

Why was it not detected before?

I guess many users use Python for data import and data curation before they transfer the data to JMP - and then don't touch the data via Python anymore. If they use Python Get(dt) [JSL] or from_dataframe() [Python], they won't face the issue.

Caveat: column-by-column transfer like in Getting started with Python integration in JMP 18.
Here it depends on the size of the data table: if the table is small enough, the new columns get added before the "Interactivity Manager" is activated.

When we detected the issue on our side, we had access to JMP 19 EA and were happy to see that we could speed up the table generation via from_dataframe() - with the additional benefit of much simpler code.


On the other hand, the issue will show up when Python is used to update individual columns of an existing JMP data table, like in JMPyFacade: Bridging JMP and Python for Seamless Engaging Analysis. As a JSL pro, @jthi definitely added the necessary private-table-merge workaround without wondering too much ;)

hogi
Level XIII

Re: Python in JMP: dataframe column to datatable column best practices and computation speed implications

Info shared in today's Scripters Club:

the speed issue with

dt[ index ] = list_like_object 

will be fixed in JMP 19.0.3 :)

MeanChris
Level III

Re: Python in JMP: dataframe column to datatable column best practices and computation speed implications

I ran numerous experiments with different methods. Some threw warnings, some did not. The simplest approach seemed to be to create a dt from the df and then use that dt to transfer the data. That never threw an error.

However, as to the speed problem: the ONLY thing that made a difference was whether I was writing values to a column that had been newly created, or to a column that already existed when the data table was opened (or created). Newly created columns updated extremely slowly. Preexisting columns updated plenty fast.
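For comparison, this is roughly how the two cases could be timed side by side on sample data (a rough sketch - Wafer Stacked, the "Fresh" column, and the jdt name are arbitrary choices, not what I actually ran):

import time
import jmp

dt = jmp.open(jmp.SAMPLE_DATA + "Wafer Stacked.jmp")
jmp.globals['jdt'] = dt
jmp.run_jsl('jdt << New Column("Fresh", Numeric);')   # column added after the table was opened

values = list(range(dt.nrows))

t0 = time.perf_counter()
dt['defects'] = values          # preexisting column
t1 = time.perf_counter()
dt['Fresh'] = values            # newly created column
t2 = time.perf_counter()

print(f"preexisting column: {t1 - t0:.4f} s, new column: {t2 - t1:.4f} s")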

Results of the last test on 100k rows. 4 parameter columns already existed, and 5 new columns were added before I ran this code; 9 total columns were updated.

#These 'xfers' all used dt['colname'] = dt_results['colname'] syntax. No DataFrames involved after DF was used to create dt_results.

Step df_results creation time: 0.3434 seconds
Step Param 1 xfer time: 0.1795 seconds
Step Param 2 xfer time: 0.1758 seconds
Step df_results creation time: 0.0231 seconds  # This is a new set of results from a different df function.
Step Param 3 xfer time: 0.1838 seconds
Step Param 4 xfer time: 0.1890 seconds


#Code here was: dt['NewCol1Test'] = df_results[stringnameofcolumn]

FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
df to dt with names xfer time: 15.8251 seconds


#Code here was: dt['NewCol2Test'] = df_results.iloc[:,dfColumnIndex]

FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
#I don't know why that warning was still thrown when I used iloc

df to dt with iloc xfer time: 16.0874 seconds


#Code here was: dt['NewCol3Test'] = dt_results[dfColumnIndex]

dt index to dt name xfer time: 14.5659 seconds


#Code here was: dt[NewColumnIndex] = dt_results[dfColumnIndex]

dt index to dt index xfer time: 14.4699 seconds

 

#Code here was:

jmp.run_jsl('dt << Begin Data Update;')
dt['NewCol5Test'] = dt_results[dfColumnIndex]
jmp.run_jsl('dt << End Data Update;')

#Warning/Error Result:

"Name Unresolved: dt in access or evaluation of 'dt' , dt/*###*/

 

at line 1
Name Unresolved: dt in access or evaluation of 'dt' , dt/*###*/

 

at line 1"

 

#This print was actually on the same line as the second 'at line 1' error message.

dt to dt w indices wrapped in begin/end update xfer time: 14.3627 seconds
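The Name Unresolved error presumably just means the JSL snippet has no variable named dt of its own. Registering the table with jmp.globals first (the same trick as in @hogi's examples below) should let the Begin/End pair actually execute - I have not re-timed this variant:

jmp.globals['jdt'] = dt                     # expose the Python table object to JSL as jdt
jmp.run_jsl('jdt << Begin Data Update;')
dt['NewCol5Test'] = dt_results[dfColumnIndex]
jmp.run_jsl('jdt << End Data Update;')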


#Final summary of run time on 100k rows.

Total Execution time: 77.0904 seconds

hogi
Level XIII

Re: Python in JMP: dataframe column to datatable column best practices and computation speed implications

Save your data before running this code (*) - or enable the Begin/End Data Update!

import jmp
import numpy as np
dt = jmp.open(jmp.SAMPLE_DATA + "Wafer Stacked.jmp")

jmp.globals['jdt'] = dt

jmp.run_jsl('Graph Builder( Variables( Y( :defects ) ), Elements( Bar( Y, Label( "Label by Value" ) ) ))');

# jmp.run_jsl('jdt << Begin Data Update;')
dt[ 5 ] = [np.random.randint(1000)]*177875
# jmp.run_jsl('jdt << End Data Update;')

 

 

(*) Just kidding, you will not lose your data - just take a coffee and wait till JMP finishes.

hogi
Level XIII

Re: Python in JMP: dataframe column to datatable column best practices and computation speed implications


@hogi wrote:

... a second time (**)



(**) TS - 00251136
@Dahlia_Watkins noticed that sometimes it takes 2 executions of the code to hit the speed issue.

Looks like a timing "issue".
Maybe when you run the code the first time, the plot is not created until Python finishes the row-by-row data update.


edit:
Actually, an interactive element like the plot is not necessary to make the Python data update slow.
Further examples without interactive elements can be found in the posts below.

