Jun 23, 2011

Slow performance of WORDS function for long strings

Hello fellow JMPers.  Thought I'd share my JSL performance improvement story with you.  If you're using WORDS to parse very long strings (> 200,000 characters), this will help.

I was having some performance problems in JSL, and I traced them to the WORDS function being slow on really long strings.  The WORDS function parses a string into a list using either a space or a supplied delimiter.  Normally this function is quite fast, but it starts to slow down for strings around 200,000 characters in length.  Yes, I have strings that are that long and longer.  Here's a graph of performance:

[Image: Words Performance.png]

So I wrote a better mousetrap.  In fact I wrote two mousetraps, as you can see from the graph.  The first one (MYWORDS1) uses OPEN to open a text string into a dataset.  The second one (MYWORDS2) chops the data into chunks of approximately 100,000 bytes and runs WORDS on each chunk.  Both perform much better than WORDS at the higher character counts.
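To make the chunking idea behind MYWORDS2 concrete, here is a minimal sketch of how it could work. It is written in Python rather than JSL purely as an illustration, and it is not the attached code: the function name `chunked_words`, its parameters, and the single-character delimiter are all assumptions. Each chunk is stretched to the next delimiter so that no word gets cut in half, and empty pieces are dropped on the assumption that runs of delimiters should count as one break.

```python
def chunked_words(text, delim=" ", chunk_size=100_000):
    """Split text on a single-character delimiter, one chunk at a time.

    Hypothetical illustration of the chunking approach; not the author's code.
    """
    words = []
    start = 0
    n = len(text)
    while start < n:
        stop = min(start + chunk_size, n)
        if stop < n:
            # Extend the chunk to the next delimiter so no word is split.
            nxt = text.find(delim, stop)
            stop = n if nxt == -1 else nxt
        # Split only this chunk; drop empty strings left by repeated delimiters.
        words.extend(w for w in text[start:stop].split(delim) if w)
        # Skip past the delimiter that ended the chunk.
        start = stop + 1
    return words


if __name__ == "__main__":
    sample = " ".join("w%d" % i for i in range(300_000))
    parsed = chunked_words(sample)
    print(len(parsed), parsed[0], parsed[-1])
```

The chunk size is a tuning knob: the point is only to keep each individual split call down in the range where the built-in parser is still fast.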

Although the MYWORDS1 function is slightly better, it has trouble if the text has carriage returns and line feeds, so I’ll probably go with MYWORDS2 for production usage. 

The technique used in MYWORDS1 has lots of possibilities!  Thanks to Peter Wiebe for the original idea of using OPEN, and for reviewing and then improving my code.  Here's a short example showing how to open a text string into a dataset, parsing each delimited string into a separate row:

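big_string = "This is a sample string delimited by spaces";

// Prepend a column name, then read the string as if it were a text file.
// Treating the space as the end-of-line character puts each word on its own row.
tmpdt = open(char to blob("WordColumn" || " " || big_string), text,
            endofline(other), eolother(" "), endoffield(tab), invisible);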

BTW these results were with JMP 9.0.2 on a 32-bit XP machine with 3 GB of RAM and a 2.50 GHz i5-2520M CPU.  I've tested the functions under Windows 7 using JMP 9 and JMP 10 beta, and the results were comparable.

I’ve attached the code for the functions and some testing code.