Fast JSON Parsing using Pattern Matching

Craige_Hales · Jan 2, 2015 12:50 PM

Xan posted very cool code to read JSON data into JMP. Xan noted the JSL was slow loading large JSON files; here's a look at some replacement JSL to make it faster. (JSON is better suited for small objects, but if you must load 100K or more, this will help.)

Xan's original code has a parsing routine to read the JSON string a character at a time:

the variable pos walks through the length of the string, to end, one character at a time. The real magic in Xan's code is in the handling of the [, ], {, } characters. I just copied the magical parts into the replacement code. Above, you can see the handling for quotation mark, colon, t, f. Those characters direct the parser's handling of subsequent characters.

JMP's pattern matcher is a better choice for this job. The following JSL is a single pattern match that processes the entire string from beginning to end. As each chunk is matched, some JSL embedded in the match handles the chunk.

Snippet of the replacement code (don't copy this! Better code in attachment):

The replacement parse function is now only 5 lines, but the code from the original parse function is now in PatternParse, which appears quite complicated at first glance. But really it is just about the same logic as in the original (the magic is copied almost exactly!) PatternParse begins with PatPos(0) and ends with PatRPos(0) which force the pattern to match everything. The original code had a while-loop. PatternParse has PatRepeat instead. PatRepeat is repeating a concatenation ( + ) of three operations: (1) PatPos(no argument) records the current character position into failpos so Parse can say something useful if the JSON data is unintelligible. (2) PatFence speeds up the match by telling the matcher that everything up to the fence is correct and should not be revisited. (3) The parenthesized list of alternatives ( | ) is where the fun happens; skip white space, match a name, match a colon, etc. Pattern Name and Pattern Number (not shown) are in the attachment and do what you'd guess.

The last piece of the puzzle is PatTest. PatTest evaluates its argument to 0 or 1; if 1, then the test succeeds, matching no characters, and the match continues. (If it fails with a 0 then the matcher backs up and tries alternatives, but that never happens here: notice every PatTest ends with ;1 for the return value.) In this program, PatTest is used to inject some JSL into the middle of a match, not to actually test anything.

With these changes and a change Xan suggested to replace a slow insertion at the end of a list with a fast insertion at the beginning of a list, JMP loads 180MB of JSON data in 90 seconds (vs .25MB in 5 minutes), an improvement of 1000X. Better, maybe, since it might be linear with file size now.

The attached code is not well tested, post a note back if it works for you, thanks!

Craige_Hales · ‎08-22-2021

If you find yourself here: this was before JMP had good builtin JSON (and XML) parsing. This example is still a great example of how you can use JMP's pattern matching efficiently.

The builtin wizard for JSON and XML import will probably out-perform this code and help you make a data table with the data you need.