Re: Hashing a string

john_madden

Community Trekker

Joined:

Aug 29, 2016

I want a hash function in JMP for short strings. (I need it to create anonymized integer database keys from row data.)
It doesn't need to be cryptographic quality; it just needs to have a 64-bit (or 128-bit) result and a reasonably low collision rate. I tried using Blob MD5() (with truncation to 64 bits), but I can't believe how slow it is. (Just try it on a table with 500,000 rows!) Has anybody created or linked to a fast hash algorithm in JSL? What about seeding JSL's random number generator with some function of the input string? I don't need NIST quality, but I do need something quick and not terribly dirty that will give the same result on all platforms.

Can you call from within JSL one of the many C++ or Java hash algorithms that are out there? How?

P.S. Maybe JMP 14.1 could include some better hash function(s) in its standard library. What about including SHA?

1 ACCEPTED SOLUTION

gzmorgan0

Community Trekker

Joined:

Jul 25, 2016

Solution

John_Madden,

I am not an expert, and this might not meet your needs. There is a VBScript that computes both SHA256 and MD5 hashes. You can find it at this link:

https://gist.github.com/jermity/557de47a978f7a7c4a74

I only know Windows. Copy the attached file and change the extension to .vbs. Open a Command window, cd to the directory where you saved it, and type jerhash.vbs to get the usage syntax. The embedded script below hashes the state names. This VBScript does not appear to accept a list of values, so the script launches it once per row. The 52 state names took 8.2 seconds, and I am sure the overhead of launching Run Program for every row adds to that. The attached script shows an alternate method that creates a .bat file; it took about 0.1 seconds on average for the 52 items, but I did not read the output file back in. Another approach is to generate the 500k hashes offline and then read in the file.

That's all I have. Regards.

mytbl = New Table("testit", New Column("States", Character, 
	Values({"Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado", "Connecticut",
"District of Columbia", "Delaware", "Florida", "Georgia", "Hawaii", "Idaho",
"Illinois", "Indiana", "Iowa", "Kansas", "Kentucky", "Louisiana", "Maine",
"Maryland", "Massachusetts", "Michigan", "Minnesota", "Mississippi", "Missouri",
"Montana", "Nebraska", "Nevada", "New Hampshire", "New Jersey", "New Mexico",
"New York", "North Carolina", "North Dakota", "Ohio", "Oklahoma", "Oregon",
"Pennsylvania", "Rhode Island", "South Carolina", "South Dakota", "Tennessee",
"Texas", "Utah", "Vermont", "Virginia", "Washington", "West Virginia", "Wisconsin",
"Wyoming"}) ), NewColumn("Code",Character)
);

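// Launch cscript once per row and capture its text output as the hash.
t1 = Tick Seconds();
For( i = 1, i <= N Row( mytbl ), i++,
	// Eval Insert substitutes the expression between the ^ characters;
	// the surrounding quotes keep multi-word names as a single /S: argument.
	:Code[i] = Trim( Run Program(
		Executable( "cscript" ),
		Options( Eval Insert( "\[//nologo c:\temp\jerhash.vbs /A:sha256 /S:"^:States[i]^"]\" ) ),
		Read Function( "text" )
	) );
);
t2 = Tick Seconds();
Show( t2 - t1 ); // elapsed seconds for all rows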
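
The offline approach mentioned in this reply (generate the hashes outside JMP, then read in the file) could be sketched roughly as follows. This is only an illustration in Python, not something from the attached scripts: the function names `hash_key` and `write_keys` and the column names `Value` and `Code` are made up for the example. It truncates an MD5 digest to 64 bits by keeping the first 16 hex digits, and keeps the key as a hex string, since a full 64-bit integer cannot be stored exactly in a double-precision number.

```python
import csv
import hashlib


def hash_key(text, hex_digits=16):
    """Return the first `hex_digits` hex characters of the MD5 digest of `text`.

    16 hex digits correspond to 64 bits. The string is encoded as UTF-8
    before hashing (an assumption; use whatever encoding your data has).
    """
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    return digest[:hex_digits].upper()


def write_keys(values, out_path):
    """Write one row per input value: the value and its truncated hash."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Value", "Code"])  # hypothetical column names
        for value in values:
            writer.writerow([value, hash_key(value)])
```

Calling, for example, `write_keys(["Alabama", "Alaska", "New York"], "state_keys.csv")` produces a small CSV file that can then be opened in JMP and joined back to the original table on the `Value` column. Because everything runs in one process, there is no per-row program launch. Whether the keys match what Blob MD5() gives inside JMP depends on both sides hashing the same bytes, which I have not checked here.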

4 REPLIES
john_madden

Community Trekker

Joined:

Aug 29, 2016

Thanks for this solution!
Unfortunately, I'm on a Mac, so I can't actually use VBScript. But now I see from your example how to use RunProgram, and I'm going to try to get it to work for CityHash (https://opensource.googleblog.com/2011/04/introducing-cityhash.html).

Craige_Hales

Staff

Joined:

Mar 21, 2013

This runs in about 10 seconds on 1,000,000 rows.

start = Tick Seconds();
dt = New Table( "Untitled",
    Add Rows( 1e6 ),
    New Column( "Column 1", Numeric, "Continuous", Format( "Best", 12 ), Formula( Row() ) ),
    New Column( "Column 2", Character, "Nominal", Formula( Hex( Blob MD5( Char To Blob( Char( :Column 1 ) ) ) ) ) )
);

dt << runformulas; // run the formulas now, not in the background
stop = Tick Seconds();

Show( stop - start );

[Image: Using MD5 for hash]

Craige
john_madden

Community Trekker

Joined:

Aug 29, 2016

Craige,

Maybe I was being too hard on Blob MD5. It's slower than that on my machine with my input strings. Nevertheless, thanks for the objective check.

John