Creating JSL functions implemented in Python

Recently I gave an internal JMP talk on Python Tips and Tricks. One tip in particular strikes me as powerful, so I'd like to share it with you: creating a JSL function that wraps a call to a Python function. In other words, utility code can be written in Python but given a JSL interface. For ease of illustration, I've implemented a simple Python function that takes two objects and adds them together, then wrapped that call in a JSL function. I start with three different types of data (numeric, string, and list) and a function that adds two Python objects together.

 

Names Default to Here(1);
one = 1;
two = 2;
author = "Paul Nelson";
devel = "Developer: ";
list1 = { 1, 3, 5, 7 };
list2 = { 2, 4, 6, 8 };

Python Submit("\[

def plus(a,b):
   print(a, b)
   return a+b
]\");

Any Python object that supports the "+" operator can be handled by this Python function. 

// Python Execute is an underutilized gem.  You can send a list
// of inputs and get a list of results and execute code all in one 
// JSL statement.
Python Execute( {one, two}, {result}, "result = plus(one,two)" );
show(result);

The result:

result = plus(one,two)
/*:
1.0
2.0

result = 3;

This is where things get fun: a JSL function that wraps the Python-defined plus() function.

// create a JSL function that wraps the Python function
blow_mind = Function( {first, second}, // parameters
    {pyresult}, // local variables
    Python Execute( {first, second}, {pyresult}, "pyresult = plus(first,second)");
    Return( pyresult ); // value to return
); // end of the definition

// Call the Python function through our JSL function.  Good way to extend JSL
// using code written in Python.  Could be useful for add-ins. 
// Notice the same Python function handling different types of Python objects.
show( blow_mind(devel, author) );
show( blow_mind(list1, list2) );

Giving the results:

//:*/
pyresult = plus(first,second)
/*:
Developer:  
Paul Nelson

blow_mind(devel, author) = "Developer: Paul Nelson";
//:*/
pyresult = plus(first,second)
/*:
[1, 3, 5, 7]
 [2, 4, 6, 8]

blow_mind(list1, list2) = {1, 3, 5, 7, 2, 4, 6, 8};

Our trivial plus() function handled summation of numeric values as well as concatenation of both strings and lists, all callable through a single JSL function. The rest of the JSL code is blissfully unaware that the work actually happened in Python.
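
As a quick check outside JMP, the same plus() can be exercised in plain Python. This is only a minimal sketch mirroring the three data types used in the post; the print call from the original function is left out, and the variable names numeric, text, and combined are made up for illustration.

```python
def plus(a, b):
    # Works for any pair of objects whose type implements "+"
    return a + b

numeric = plus(1, 2)                          # numeric addition
text = plus("Developer: ", "Paul Nelson")     # string concatenation
combined = plus([1, 3, 5, 7], [2, 4, 6, 8])   # list concatenation

print(numeric, text, combined)
```

Passing mismatched types, such as a number and a string, would raise a TypeError in Python, so a JSL wrapper may want to guard against that.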

 

Complete file attached.

Last Modified: Apr 30, 2024 9:30 AM
Comments
Craige_Hales
Super User

Thanks for the improved Python support @Paul_Nelson !


Browser Scripting with Python Selenium shows this

// 1: navigate to jmp.com

nav = Function( {url}, {rc},
    Python Execute( {url}, {rc}, 
"\[
try:
    driver.get(url)
    rc = "ok"
except Exception as e:
    rc = repr(e)
]\" );
    return(rc);
);

rc = nav( "https://www.jmp.com/" );
if( rc != "ok", throw("nav: "||char(rc)));

The last two lines are calling through the wrapper function to make a web browser navigate to the JMP web site. The JSL function works exactly the same way as Paul's example.

 

hogi
Level XII

global, multiline Regex search with multiple return values

 

Add Custom Functions(
	New Custom Function(
		"hogi",
		"Regex",
		Function({input, pattern},
		
			SubstituteInto(input,"\!n"," ","\!r", " ") ;
			Python Send( input );
			Python Send( pattern );

			error = Python Submit( "\[
import re
print(input)
matches = re.findall(pattern,input)
]\" );

			matches = Python Get( matches );			
		),
		<<Description(
"Regex via Python with Global setting and multiple return values
	Arguments:
	1) text
	2) pattern
		"
		),
		<<Example( Expr( hogi:Regex( input = "hello\!n hallo hol\!na", pattern = "h.*?l.*?[ao]") ) )
	)
);
Paul_Nelson
Staff

Note regarding @hogi's example: I believe the Function() should have a return(matches); statement.

Also, using Python Execute() does away with the need to do sends and gets.

 

Add Custom Functions(
	New Custom Function(
		"hogi",
		"Regex",
		Function({input, pattern},
		
			SubstituteInto(input,"\!n"," ","\!r", " ") ;
			Python Execute( {input, pattern}, {matches}, "\[
import re
matches = re.findall(pattern,input)
]\" );
			return( matches);
		),
		<<Description(
"Regex via Python with Global setting and multiple return values
	Arguments:
	1) text
	2) pattern
		"
		),
		<<Example( Expr( hogi:Regex( input = "hello\!n hallo hol\!na", pattern = "h.*?l.*?[ao]") ) )
	)
);

show( hogi:Regex( input = "hello\!n hallo hol\!na", pattern = "h.*?l.*?[ao]") );
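
For readers who want to see what the Python side returns, here is a small standalone sketch of the same re.findall call. It assumes the JSL escapes \!n and \!r correspond to newline and carriage return, and the helper name find_all is hypothetical. It also shows why the newlines are replaced first: by default "." does not match a newline, so the last match would otherwise be lost.

```python
import re

def find_all(text, pattern):
    # Mirror the JSL SubstituteInto step: newlines and carriage returns become spaces
    cleaned = text.replace("\n", " ").replace("\r", " ")
    return re.findall(pattern, cleaned)

sample = "hello\n hallo hol\na"
pattern = "h.*?l.*?[ao]"

with_cleanup = find_all(sample, pattern)
without_cleanup = re.findall(pattern, sample)

print(with_cleanup)     # ['hello', 'hallo', 'hol a']
print(without_cleanup)  # ['hello', 'hallo']
```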
hogi
Level XII

ah, right. Python Execute, much nicer : )

In JSL functions, the last expression is the return value - even without enclosing  return().

john_madden
Level VI

I'm not adept at Python, but there are some Python libraries (like xxhash) that I'd like to use.

I have a question about memory usage. If I JSL-function-wrap some Python in this way, and a Python object gets created in the wrapped Python code, will a call to my JSL wrapper function create a *new* persistent Python object every time the function is called (e.g. using up memory if I call the function tens of millions of times)?

What if I create a named Python object once in my (e.g.) jmpstart.jsl code, and reference that one object inside my procedure; would that be better?

 

Below is an example of what I want to do with xxhash. (Note I import xxhash into the Python environment in my jmpstart.jsl script. I'm stuck emitting a JSL float from this function because JSL doesn't support 8-byte integers...)

Add Custom Functions(
	New Custom Function(
		"Global",
		"xxhash64",
		Function( {str},
			{digest},
			Python Execute( {str}, {digest}, "digest = xxhash.xxh64( str ).hexdigest()" );
			Return( Hex To Number( digest ) );
		)
	)
)

 

john_madden
Level VI

PS. I noticed this is lousy code because there is a ~0.8% incidence of the Hex digest being an invalid IEEE float, resulting in the procedure returning a JSL missing value. That said, I'm still wondering about memory management as indicated.

Paul_Nelson
Staff

hashlib and base64 are both builtin Python libraries.

I meant to look into this case specifically yesterday.  I can tell you that at worst, it should 'leak' a single reference to digest during the entire time JMP is running.  When JMP shuts down that variable will be released as Python shuts down. The Python Set() code creates the digest variable and places it in the Python environment's globals() dictionary.  This allows the Python Get() to be able to look up the value by name and return the value to JSL. Python Execute() is basically a convenience function for Python Send/Set(), Python Submit(), Python Get().  So the dictionary has a reference to digest, however once Python Execute() goes out of scope, the dictionary should be the only one holding a reference.  On the second loop, the value is overwritten with the new value so the old reference should be released, allowing garbage collection to return that memory.  
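
The rebinding behavior described above can be illustrated in plain Python. This is a rough model, not JMP's actual implementation: an ordinary dict stands in for the globals() dictionary, the Digest class is a hypothetical placeholder for whatever object the wrapped code creates, and the immediate release relies on CPython's reference counting.

```python
import gc
import weakref

class Digest:
    # Hypothetical stand-in for an object created by the wrapped Python code
    def __init__(self, value):
        self.value = value

# An ordinary dict plays the role of the Python globals() dictionary
namespace = {"Digest": Digest}

# First "call": the name digest is bound to a new object
exec("digest = Digest('first')", namespace)
old_ref = weakref.ref(namespace["digest"])  # observe the object without keeping it alive

# Second "call": the same name is rebound, so the old object loses its last reference
exec("digest = Digest('second')", namespace)
gc.collect()  # not needed in CPython for this case, included for good measure

print(old_ref() is None)          # the first object has been reclaimed
print(namespace["digest"].value)  # only the latest value remains
```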

john_madden
Level VI

Thank you Paul. Regarding memory, that was what I hoped. I'll look at hashlib and base64. 

Background is: I wanted an 8-byte hash, and I wrote one that used the JSL native MD5 functions and truncated them to 8 bytes, but with my string inputs the hashes seem to end up being clustered. Somebody recommended xxhash64 to me as giving very randomly distributed hash values. So I used that and it seemed to give well-distributed hashes.

As mentioned, xxhash64 can emit a longint, but I can't get that across into JSL. I can get the result across as a JSL BLOB, but JMP won't let me use BLOBS as LINK IDs – which is what I want to use the hashes for. It's a bit off the original question, but if you happen to have any thoughts, those would be much appreciated.

john_madden
Level VI

PS. could just use the string value of the hex as my Link ID, but I hankered for something small and fast. I'm dealing with tables with ~80 million rows.

Craige_Hales
Super User

@john_madden  JMP's 64-bit double precision numbers can hold a signed integer from -2^52 .. 2^52. I think you can keep 13 hex digits of the 16 hex digit hash by using hextonumber and avoid any question about invalid floating point numbers. But 52 bits probably isn't enough.
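
A minimal sketch of the 13-hex-digit idea follows. It assumes hashlib.blake2b with an 8-byte digest as a stand-in for xxhash64 (chosen only because it ships with Python), and the function name hash52 is made up for illustration. The point is that a 52-bit integer survives the trip through a double unchanged.

```python
import hashlib

def hash52(text):
    # Assumption: blake2b with an 8-byte digest stands in for xxhash64
    hexdigest = hashlib.blake2b(text.encode("utf-8"), digest_size=8).hexdigest()
    # Keep 13 of the 16 hex digits: 13 * 4 = 52 bits
    return int(hexdigest[:13], 16)

value = hash52("This is a string")
as_double = float(value)  # what JSL would hold after Hex To Number

# The integer is below 2^52, so the conversion to double is exact
print(value, int(as_double) == value)
```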

I've been playing in this neighborhood too and found this birthday attack table useful. I'm using a hash of a file to make a unique key and wanted to know how many bits I'd need. From the table, I see adding 6,100,000 items into a 64 bit hash table has a 1/1,000,000 chance of a collision.

For my use case, the file server's disk is 1e11 and a typical file is 1e4, or a limit of 1e7 files. The 64 bit hash is too small and the 96 would be ok. I went with 256 of course...
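
The table values can be approximated with the usual birthday bound, p ≈ 1 - exp(-n(n-1) / (2 · 2^bits)). The sketch below is an approximation only, and the function name collision_probability is hypothetical.

```python
import math

def collision_probability(items, bits):
    # Birthday approximation: p ~= 1 - exp(-n(n-1) / (2 * 2^bits))
    space = 2.0 ** bits
    # expm1 keeps precision when the probability is tiny
    return -math.expm1(-items * (items - 1) / (2.0 * space))

# 6,100,000 items in a 64-bit hash space: roughly one in a million
p64 = collision_probability(6_100_000, 64)

# 10,000,000 items in a 96-bit hash space: far smaller
p96 = collision_probability(10_000_000, 96)

print(p64, p96)
```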

[Image: birthday attack table (Craige_Hales_0-1735014374620.png)]

 

john_madden
Level VI

Hi Craige. This is great! What JMP type are you using for your 256 bit values?

 

As a sidelight, it would be nice if the developers would add a more modern hash function than MD5 to JSL at some point.

john_madden
Level VI

Here's my latest attempt, which seems to work okay:

 

"Accepts a string as input, and returns a JSL floating point number that encodes a 'nearly'-64-bit hash of the string. The hash is generated using the Python xxhash package with the default (0) seed.
Python returns the 64-bit hash as a 16-character hexadecimal string. If the bits 2-12 are 11111111111, they are changed to 01111111111 before being processed further by JSL.
This avoids generating an IEEE 754 floating point value of NaN or ∞, which JSL would interpret as Missing.
This trick cuts the number of possible floating point exponent values
from 2048 to 2046, resulting in a hash with 2^63 * 2046 possible values (= 1.89e22).
For 10 million different inputs, this should yield a collision probability of roughly 1 in 380 million
according to Craige's reference, if I calculated right. On my M3 Max MacBook Pro, the routine takes around 26 μsec per iteration with random strings of length 11.
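
The bit manipulation described above can be sketched in Python to see its effect on the resulting double. This is an illustration under the assumption that the 16 hex digits are read as a big-endian IEEE 754 double; the function name hex_to_finite_double is hypothetical and is not part of the JSL routine.

```python
import math
import struct

def hex_to_finite_double(hexdigest):
    bits = int(hexdigest, 16)
    exponent = (bits >> 52) & 0x7FF  # the 11 exponent bits (bits 2-12)
    if exponent == 0x7FF:
        # All ones would mean NaN or infinity, so clear the top exponent bit:
        # "7ff..." becomes "3ff..." and "fff..." becomes "bff..."
        bits &= ~(1 << 62)
    # Reinterpret the 64 bits as a big-endian double
    return struct.unpack(">d", bits.to_bytes(8, "big"))[0]

for digest in ("7ff0000000000000", "fff8000000000001", "4000000000000000"):
    value = hex_to_finite_double(digest)
    print(digest, value, math.isfinite(value))
```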

 

Names Default To Here( 1 );

Add Custom Functions(
	New Custom Function(
		"Global",
		"xxhash64",
		Function( {str},
			{digest, startChars},
			Python Execute( {str}, {digest}, "digest = xxhash.xxh64( str ).hexdigest()", echo( 0 ) );
			startChars = Substr( digest, 1, 3 );
			Hex To Number(
				Match( startChars,
					"7ff", "3ff" || Substr( digest, 4, 13 ),
					"fff", "bff" || Substr( digest, 4, 13 ),
					digest
				)
			);
		)
	) << Description(
		"Accepts a string as input, and returns a JSL floating point number that encodes a 'nearly'-64-bit hash of the string.
	
	Python must have already been initialized as follows:
	
Python Execute( {}, {},
	\!"
import jmputils
jmputils.jpip('install', 'numpy pandas scikit-learn xxhash')
import jmp
import numpy
import pandas as pd
import xxhash
	\!""
	) //
	<<Prototype( "::xxhash64 (string)" ) //
	<<Example( "z = ::xxhash64 ( \!"This is a string\!" ); Show ( Hex ( z ) )" ) //
	<<Parameter( "String", "Input string" ) // 
	<<Formula Category( "Hash" ) //
	<<Scripting Index Category( "Utility" ) //
);
Paul_Nelson
Staff

My guess would be that Craige is using string of hex digits or base64 encoded digits.

 

One of the great things about the Python integration is that you don't need JMP to 'catch up' and supply more modern hash functions.  You have the power via Python.  One of the reasons I mention hashlib and base64 is that they are part of the Python standard library, so all JMP users will have them available without installing an external package.
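
As a rough sketch of what the standard library offers, the snippet below hashes a string with hashlib.sha256 and shows the digest both as hex digits and as base64 text. SHA-256 is just one possible choice here, and the helper name digest_forms is made up for illustration.

```python
import base64
import hashlib

def digest_forms(text):
    # SHA-256 produces a 32-byte (256-bit) digest
    raw = hashlib.sha256(text.encode("utf-8")).digest()
    hex_form = raw.hex()                                # 64 hex characters
    b64_form = base64.b64encode(raw).decode("ascii")    # 44 characters with padding
    return hex_form, b64_form

hex_form, b64_form = digest_forms("abc")
print(hex_form)
print(b64_form)
```

Either string form can be returned to JSL as an ordinary character value.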

Craige_Hales
Super User

@john_madden  - you want 2^52 * 2046 rather than 2^63 * 2046 . Or just use the 64 bit row.

A long time ago I knew how JSL strings were kept in memory; as I recall there was a significant performance change around 24 characters. You might want to use the 96 bit version and make sure it stays at or below 24 hex digits. For 10,000,000 strings the space is 1/4 GB on your 16 GB machine...and you've gone from 1e-6 to 1e-15 chance of hitting an error.  If it wasn't for bad luck, I'd have no luck at all - https://quoteinvestigator.com/2020/09/02/bad-luck/ .

I'm working with python/django/postgresql currently and I store the hex digits as a database key...a 64 character string gives me the 256 bits.

john_madden
Level VI

Again, thanks Paul & Craige. Incidentally, the Python integration has me excited enough to now try and learn Python, since my IT colleagues here want to understand some of my JSL code and always ask me what the Python equivalent would be. Super grateful to Paul for all this work.

I also need to correct the calculation of the number of hash values in my script above - I think it's more like 2^64 - 2^63, a lot less than I said.

Regarding storing the hash values, you convinced me not to try to be so cutesy: just storing the hexadecimal characters is better in the end. I'll have to experiment with some joins on huge tables to see if it truly makes any speed difference whether I join on columns of hex strings or columns of floats, or whether I use strings or floats as Link IDs.

If I ever export my big JMP tables to an RDB, I could convert them to longints at that point.

john_madden
Level VI

Craige, apropos the number of hash values: you are absolutely right, and I was confused again and hadn't read your post. 2^52 * 2046 is right.