john_madden
Level V

Hashing a string

I want a hash function in JMP for short strings. (I need it to create anonymized integer database keys from row data.)
It doesn't need to be cryptographic quality; it just needs a 64-bit (or 128-bit) result and a reasonably low collision rate. I tried using Blob MD5() (with truncation to 64 bits), but I can't believe how slow it is. (Just try it on a table with 500,000 rows!) Has anybody created or linked to a fast hash algorithm in JSL? What about seeding JSL's random number generator with some function of the input string? I don't need NIST quality, but I do need something quick and not terribly dirty that will give the same result on all platforms.

Can you call from within JSL one of the many C++ or Java hash algorithms that are out there? How?

P.S. Maybe JMP 14.1 could include some better hash function(s) in its standard library. What about including SHA?
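
To put "reasonably low collision rate" in numbers, here is a minimal sketch in Python (not JSL) of the standard birthday approximation. It assumes the hash behaves like a uniform random function, and the function name `collision_probability` is made up for illustration. The 53-bit case is included because a value stored as a double-precision number keeps at most 53 significant bits.

```python
import math


def collision_probability(n_keys: int, bits: int) -> float:
    """Approximate chance of at least one collision among n_keys values
    drawn uniformly from 2**bits possibilities (birthday approximation)."""
    pairs = n_keys * (n_keys - 1) / 2
    # 1 - exp(-x), computed with expm1 so tiny probabilities stay accurate
    return -math.expm1(-pairs / 2.0 ** bits)


if __name__ == "__main__":
    for bits in (32, 53, 64, 128):
        print(bits, collision_probability(500_000, bits))
```

For 500,000 keys, a 32-bit hash is almost certain to collide somewhere, while a 64-bit hash has a collision chance below one in a hundred million under this approximation.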

1 ACCEPTED SOLUTION

gzmorgan0
Super User

Re: Hashing a string

John_Madden,

I am not an expert, and this might not meet your needs. Below is a VBScript for both SHA256 and MD5 hashing. You can find it at this link:

https://gist.github.com/jermity/557de47a978f7a7c4a74

 

I only know Windows. Copy the attached file and change the extension to .vbs. Open a Command window, cd to the directory where you saved it, and type jerhash.vbs to get the usage syntax. The embedded script below hashes the state names. This VBScript does not appear to accept a list of values, so the script calls it once per row. The 52 state names took 8.2 seconds, and I am sure the overhead of launching Run Program for each one adds to that. The attached script shows an alternate method that creates a .bat file; it took about 0.1 seconds on average for the 52 items. I did not read in the file. Another approach: generate the 500k hashes offline, then read in the file.

 

That's all I have.  Regards.


mytbl = New Table("testit", New Column("States", Character, 
	Values({"Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado", "Connecticut",
"District of Columbia", "Delaware", "Florida", "Georgia", "Hawaii", "Idaho",
"Illinois", "Indiana", "Iowa", "Kansas", "Kentucky", "Louisiana", "Maine",
"Maryland", "Massachusetts", "Michigan", "Minnesota", "Mississippi", "Missouri",
"Montana", "Nebraska", "Nevada", "New Hampshire", "New Jersey", "New Mexico",
"New York", "North Carolina", "North Dakota", "Ohio", "Oklahoma", "Oregon",
"Pennsylvania", "Rhode Island", "South Carolina", "South Dakota", "Tennessee",
"Texas", "Utah", "Vermont", "Virginia", "Washington", "West Virginia", "Wisconsin",
"Wyoming"}) ), NewColumn("Code",Character)
);

t1 = tick seconds();
for(i=1, i<=nrow(mytbl), i++,
   :Code[i] = Trim(RunProgram(Executable("cscript"),
	Options(EvalInsert("\[//nologo c:\temp\jerhash.vbs /A:sha256 /S:"^:States[i]^"]\") ),
	ReadFunction("text")
   ));

);
t2= tickseconds();
show(t2-t1);
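
The "generate offline, then read in the file" idea mentioned above could look roughly like the following. This is a minimal Python sketch rather than part of the original post: it starts one process for the whole list instead of one per row, and it is not tied to Windows. The function names and the command-line file arguments are placeholders. It hashes the UTF-8 bytes of each line, so the digests may differ from what jerhash.vbs produces if that script encodes strings differently.

```python
import hashlib
import sys


def sha256_hex(text: str) -> str:
    """Return the SHA-256 digest of the UTF-8 bytes of text as hex."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def hash_lines(lines):
    """Yield (value, digest) pairs for each non-empty input line."""
    for line in lines:
        value = line.rstrip("\r\n")
        if value:
            yield value, sha256_hex(value)


# Hypothetical usage: python hash_lines.py states.txt hashes.tsv
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1], encoding="utf-8") as src, \
         open(sys.argv[2], "w", encoding="utf-8") as dst:
        for value, digest in hash_lines(src):
            dst.write(f"{value}\t{digest}\n")
```

The resulting tab-delimited file can then be opened in JMP and joined back to the original table on the string column.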

 

 

john_madden
Level V

Re: Hashing a string

Thanks for this solution!
Unfortunately, I'm on Mac, so I can't actually use vbs. But now I see from your example how to use RunProgram, and I'm going to try and get it to work for CityHash (https://opensource.googleblog.com/2011/04/introducing-cityhash.html).

Craige_Hales
Staff (Retired)

Re: Hashing a string

This runs in about 10 seconds on 1,000,000 rows.

start = Tick Seconds();
dt = New Table( "Untitled",
    Add Rows( 1e6 ),
    New Column( "Column 1", Numeric, "Continuous", Format( "Best", 12 ), Formula( Row() ) ),
    New Column( "Column 2", Character, "Nominal", Formula( Hex( Blob MD5( Char To Blob( Char( :Column 1 ) ) ) ) ) )
);

dt << runformulas; // run the formulas now, not in the background
stop = Tick Seconds();

Show( stop - start );

[Image: Using MD5 for hash]

Craige

john_madden
Level V

Re: Hashing a string

Craige,

Maybe I was being too hard on Blob MD5. It's slower than that on my machine with my input strings. Nevertheless, thanks for the objective check.

John

john_madden
Level V

Re: Hashing a string

By the way, here's the solution I finally settled on:

 

// ==========================
// ::String Hash()
// ==========================
myStringHash = New Custom Function(
	"Global",
	"String Hash",
	Function( {str, length = 64},
		{bytes},
		bytes = Match( length, 64, 8, 32, 4, Throw( "Bad argument in ::String Hash; hash length must be 64 (default) or 32." ) );
		Subscript( Blob To Matrix( Blob Peek( Blob MD5( Char To Blob( str, "utf-8" ) ), 0, bytes ), "int", bytes, "little" ), 1 );
	)
);


myStringHash << Description(
	"Accepts a string as input, and returns a signed 64-bit or 32-bit integer hash of the string, generated by truncation of the (128-bit) MD5 hash of the input string.
	
	The first parameter is the input string to be hashed.
	
	The second, optional parameter is either the number 64 or the number 32. This selects whether the result will be a 64-bit (8-byte) integer, or a 32-bit (4-byte) integer. If not specified, the function defaults to 64-bit (8-byte). If a value other than 64 or 32 is entered, an error results."
);
myStringHash << Prototype( "::String Hash (string, <64|32>)" );
myStringHash << Example( "z = ::String Hash ( \!"This is a string\!" ); Show ( Format( z, 20, 0) )" );
myStringHash << Parameter( "String", "Input string" );
myStringHash << Parameter( "Number", "<64 or 32 (default is 64)>" );
myStringHash << Formula Category( "Database" );

Add Custom Functions( {myStringHash} );
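
For anyone who needs to reproduce these keys outside JMP (for example, on the database side), here is a hedged Python sketch of the same truncation: the first 8 or 4 bytes of the MD5 digest of the UTF-8 string, read as a little-endian integer. It assumes that the "int" type in Blob To Matrix means a signed integer, and it has not been checked against actual JMP output. The name `string_hash` is made up for this example.

```python
import hashlib


def string_hash(text: str, length: int = 64) -> int:
    """Truncate the MD5 of text to a 64-bit or 32-bit little-endian integer.

    Assumption: mirrors Blob To Matrix( ..., "int", bytes, "little" ),
    with "int" taken to mean a signed integer.
    """
    if length == 64:
        nbytes = 8
    elif length == 32:
        nbytes = 4
    else:
        raise ValueError("hash length must be 64 (default) or 32")
    digest = hashlib.md5(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:nbytes], byteorder="little", signed=True)


if __name__ == "__main__":
    print(string_hash("This is a string"))
    print(string_hash("This is a string", 32))
```

One caveat: Python returns an exact integer here, whereas JMP stores numbers as double-precision values, so a 64-bit result larger than 2^53 in magnitude would be rounded inside JMP. Comparing against `float(string_hash(s))` is one way to mimic that rounding.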