Subscribe Bookmark RSS Feed
Craige_Hales

Staff

Joined:

Mar 21, 2013

Backtracking Secrets

8075_UnderTheHood.jpg

The aphorism "If at first you don't succeed, try, try again" is formalized in pattern matching in a process called backtracking. Backtracking is how the matcher checks out all the possible combinations that might match. This post uses some secret features in JMP's pattern matcher to expose the backtracking process. The single statement below is a moderately complicated pattern match.  But even simple matches (and regex) will use backtracking.

Secret 1: the Log() function on line 21 produces output in the JMP log window. The >> before it is taking the previously matched text and printing it, along with a zero-based position in the source and the text label supplied. This has nothing to do with backtracking and everything to do with debugging. You can use this secret to debug your own patterns.

Secret 2: PatTest() has three arguments, even though only one is ever used. This is all about backtracking. The JSL in the second argument is executed in the forward-tracking direction and the JSL in the third argument is executed in the backtracking direction.

Secret 3: PatMatch() normally optimizes failure, as best it can. When it sees there won't be enough text remaining in the source string to match the remaining minimum length of the pattern, it stops. The FULLSCAN option turns off that optimization. For this example, the backtracking will seem to stop prematurely without the FULLSCAN option.

Secret 4: PatFail() can be used to find all possible matches by forcing each match to fail and alternatives to be considered.

Here's the code and output with an explanation below.


Pat Match(
  "EDACGDE", // a contrived string to exercise backtracking
  (
    Pat Len( 1 ) >> p // match a character, store in p
    +
    Pat Test(
      p > "B", // test
      Write( "p=", p, " " ), // advance
      Write( "retry\!n" ) // backtrack
    )
    +
    Pat Arb() // skip zero or more characters
    +
    Pat Len( 1 ) >> q // match another, store in q
    +
    Pat Test(
      p < q & q < "F", // test
      Write( "q=", q, " " ), // advance
      Write( "retry " ) // backtrack
    )
  ) >> Log( "FOUND" ) // a debugging tool
  +
  Pat Fail(), // try alternatives
  NULL, // no replacement
  FULLSCAN // to the bitter end
);

p=E q=D retry q=A retry q=C retry q=G retry q=D retry q=E retry retry

p=D q=A retry q=C retry q=G retry q=D retry q=E

1(FOUND) DACGDE

retry retry

p=A retry

p=C q=G retry q=D

3(FOUND) CGD

retry q=E

3(FOUND) CGDE

retry retry

p=G q=D retry q=E retry retry

p=D q=E

5(FOUND) DE

retry retry

p=E retry

0

Explanation

First, the final zero. That's the return code, and it says the match failed. The match actually FOUND four matches, but the PatFail() caused it to look for another. This is expected.

Next, the Ps and Qs. p starts at E and the test for >B is happy. PatArb() is reluctant and initially skips nothing. q gets the D following E. But the test is unhappy this time; q between p and "F" is not true. Time to backtrack! The LEN(1) that put the D into q gives the D back and the PatArb() is given a chance to do something different. This time it reluctantly takes one character, the D. The PatLen(1) then gives the A to q. At the end of the first line, there are two retries; the second is for p. The next line shows p getting the second character, D. Since the pattern match is not anchored, the beginning of the match is allowed to float forward through the text as part of the backtracking process; this time the match goes forward with the PatArb taking more and more until the first success happens. The >>Log("FOUND") records the success in the log, then the PatFail() causes the backtracking to continue.

If it isn't clear, you can try simpler tests and instrument them with the >>LOG("tag") method to get a feel for what's happening.

At this point it should be clear why this match runs for a very, very long time:

8068_slow.PNG

while you can see it can't possibly succeed because the string of aaaa...aaa has no b anywhere, the matcher has to discover that by trying lots of silly combinations: skip 2 characters followed by skip 3... and skip 3 followed by skip 2... etc, etc, etc.... Backtracking can't fix a badly written pattern. That progress bar does not mean it is almost finished. (There is no reason to put the patarb inside the patrepeat, except to see this dialog. Patarb already matches an arbitrary number of characters.)

update 1Feb2017 - repair formatting for new community.

Article Tags