
Discussions

Solve problems, and share tips and tricks with other JMP users.
MeanChris
Level III

Python in JMP: dataframe column to datatable column best practices and computation speed implications

Background

* Using Python in JMP19 scripting environment.

* Creating dataframe (df) from pandas from existing JMP datatable (dt). 

* Running a function that requires df input and returns a new df result. Let's call it df_results.

* Transferring results back to JMP datatable column.

This works, but there are two things that make me suspect I'm not doing this properly.  For one, I get a warning in the local log output.  And two, the first time I run this on a very large number of rows it takes a long time.  My code timer said 25 seconds for 100k rows.  If I rerun it, the second time it takes 3 seconds.  I set up the timer to give me details on each step so I could find out if one line of code was taking most of the time.

This is the single line of code I'm questioning. It took 25 seconds for 100k rows ONLY the first time the initially constant (or null; I tried both) column was filled in, then ran almost 10x faster on every subsequent rerun.

dt[ColumnName] = df_results[ColumnName]

This is the warning message in the embedded log

FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
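One way to sidestep that FutureWarning (and the per-element label lookups behind it) may be to convert the pandas Series to a plain Python list before handing it to the JMP table, so JMP iterates a list instead of calling Series.__getitem__ with integer keys. A minimal sketch, assuming a hypothetical results column named "Result"; the final assignment only works inside JMP's embedded Python, so it is shown as a comment:

```python
import pandas as pd

# Stand-in for the real results frame; "Result" is a hypothetical column name
df_results = pd.DataFrame({"Result": range(5)})

# Convert the Series to a plain list first: purely positional access,
# no integer-key label lookup, so pandas has nothing to warn about
values = df_results["Result"].to_list()

# Inside JMP's embedded Python the transfer would then be:
#   dt["Result"] = values
```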

Also, I did this for 4 different columns total from 2 different dataframes. They all exhibited the same behavior (and warning): slow first run, fast every repeat, with negligible difference between columns 1, 2, 3, and 4.

This is only a subset of a large table with 1.3 million rows. So 25 seconds per column will become 250+ seconds per column (x 4 columns) if I can't figure out how to get the faster speeds on the first try.
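For scale, assuming the slow first fill grows linearly with row count, the back-of-envelope arithmetic is:

```python
rows_tested = 100_000
seconds_tested = 25.0
rows_full = 1_300_000
columns = 4

# Linear extrapolation of the slow first fill to the full table
per_column = seconds_tested * rows_full / rows_tested  # seconds per column
total = per_column * columns                           # all four columns
```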

Thanks

 

hogi
Level XIII

Re: Python in JMP: dataframe column to datatable column best practices and computation speed implications

For the original issue: "slow data access IF there is a new column"

here is an illustrative example:

import jmp

dt = jmp.DataTable('dt_new', 50000)
col = dt.new_column("newCol")

# --------- cut here to make it slow - why? -------------

dt[0] = [1]*50000

 

Cut into 2 steps, it will take JMP several seconds to get the 50,000 entries updated.


Puzzling: when executing both blocks in one run, JMP has no issue.
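One possible workaround for the two-step case is to suspend dependent updates around the bulk fill with JSL's Begin Data Update / End Data Update messages (the same mechanism mentioned further down in this thread). A sketch, not a verified fix; the JSL assumes the table from the example above is exposed to JSL as the global 'jdt', and the jmp import is guarded so the snippet is harmless outside JMP:

```python
# JSL that brackets the bulk fill so JMP defers dependent updates
JSL_BULK_FILL = '''
jdt << Begin Data Update;
Column(jdt, "newCol") << set each value(1);
jdt << End Data Update;
'''

try:
    import jmp  # only importable inside JMP's embedded Python
    jmp.globals['jdt'] = dt   # dt from the example above
    jmp.run_jsl(JSL_BULK_FILL)
except ImportError:
    pass  # outside JMP: nothing to run
```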

hogi
Level XIII

Re: Python in JMP: dataframe column to datatable column best practices and computation speed implications

Besides the workarounds mentioned in @Paul_Nelson's post, I thought there were some additional ones, like:

  1. Use JSL to create the column / the data table
  2. Use JSL to fill the column with dummy values
  3. use JSL to save the data table, close it and re-open it


Many of them look good at first sight - but that's misleading.
The high speed is just because the same values are written into the column again and JMP doesn't trigger a "data update alert":

e.g.

jmp.globals['jdt'] = dt
jmp.run_jsl('''
nc = n cols(jdt);
Column(jdt,nc) << set each value(1)
''');

nc = dt.ncols
data = [1]*50000
dt[nc - 1] = data


The same miracle via Python:

# 1st: slow
nc = dt.ncols
dt[nc - 1] = range(50000)

# fast!  (same values!)
dt[nc - 1] = range(50000)

# new values: slow again
dt[nc - 1] = [5]*50000
hogi
Level XIII

Re: Python in JMP: dataframe column to datatable column best practices and computation speed implications

One can set up a trigger to get a notification when JMP thinks that data gets updated.

import jmp
dt = jmp.open(jmp.SAMPLE_DATA + "Big Class.jmp")

jmp.run_jsl('''
New Column( "trigger",
	Formula(
		If( Row() == 1,
			Try( Caption( Char( ::i=::i+1 ) ), ::i = 1 )
		);
		:age;i
	)
)
''')


dt[1] = range(40)

dt[1] = range(40)  # no data update!

dt[1] = [2]*40

# for loop without begin/end data update -> finished before JMP notices the data update
jmp.run_jsl('''
for each row(:age = row());
wait(5);
''')

# for loop without begin/end data update - and some delay to give JMP the time to notice:
jmp.run_jsl('''
for each row(:age = 2*row();wait(0))
''')


So, indeed, there is no "data update" when the original and new values are identical.
Implemented to combat infinite loops? Great!

 

Interesting, 

import jmp
dt = jmp.open(jmp.SAMPLE_DATA + "Wafer Stacked.jmp")

jmp.run_jsl('''
delete symbols(::i)
New Column( "trigger",
	Formula(
		If( Row() == 1,
			Try( Caption( Char( ::i=::i+1 ) ), ::i = 1 )
		);
		:Y_Die;i
	)
)
''')

nr = dt.nrows
dt[4] = range(nr)


takes more than 1 minute - and i counts up to 2.

 

MeanChris
Level III

Re: Python in JMP: dataframe column to datatable column best practices and computation speed implications

"The JMP data table dt can be indexed by 0-based column index or by column name.  Index would be faster since it would avoid string comparisons."

Trying to rewrite code using indices, but I can't seem to find the equivalent syntax for determining a data table column index from a name, like what I have for dataframes.

dfColumnIndex = df.columns.get_loc('NameofDataFrameColumn')

dtColumnIndex = ???
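If df was created from dt and the column order was never changed, the pandas position can stand in for the data table index, since (as noted below) the assumption is that they share the same index. A sketch with hypothetical column names; the final jmp assignment is commented out because it only runs inside JMP:

```python
import pandas as pd

# Hypothetical frame mirroring the JMP table's columns, in the same order
df = pd.DataFrame({"A": [1], "B": [2], "C": [3]})

# pandas: 0-based position of a column, looked up by name
dtColumnIndex = df.columns.get_loc("B")

# If df mirrors dt column-for-column, the same position indexes the table:
#   dt[dtColumnIndex] = df["B"].to_list()
```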

hogi
Level XIII

Re: Python in JMP: dataframe column to datatable column best practices and computation speed implications

@jthi, what was the name for "adding some code to fix an issue that doesn't exist"?

Re: Python in JMP: dataframe column to datatable column best practices and computation speed implications

In the case where df was created from a dt, the assumption would be that they have the same index. If you've added columns to the data frame, or one was not the origin of the other, then using dt['col_name'] = ... is probably your only recourse. It shouldn't be that costly compared to an index, but it is a string compare vs. direct numeric indexing.
