Mar 21, 2013

Pattern Matching

[Image: a pattern cut with a laser]

JMP's pattern matching functions can do almost everything regex can do and a little bit more. Internally, regex is built from the pattern matching functions and sports an easy-to-use interface. You'll probably never need the little bit more, so if you have experience with regex, it might be a good place to stay. Or, you might find the pattern matching language more attractive. Those regular expressions can be quite opaque.

The simplest pattern match might look like this: 

source = "the old fox had eggs for breakfast";
ok = patMatch( source, "eggs" ); // ok will be 1 because the pattern "eggs" is found

Changing eggs to bacon:

source = "the old fox had eggs for breakfast";
ok = patMatch( source, "eggs", "bacon" ); // 3rd argument is the replacement text; source is changed in place
"the old fox had bacon for breakfast"

And finding out what the fox eats:

morningMeal = .;
source = "the old fox had yogurt for breakfast";
foods = "eggs" | "bacon" | "yogurt" | "honey";
ok = patMatch( source, foods >? morningMeal ); // >? is an assign-to-the-right operator for patterns
print(morningMeal );

If you haven't seen the patterns before, you might be puzzling over the odd-looking assignment to foods. Those single vertical bars mean "or". They are not double vertical bars for concatenating strings. foods is a pattern value that will match any of those four words. There is also a + operator for concatenating patterns (which is not the same as concatenating strings). Let's beef up this example:

source = "the old fox had yogurt for breakfast, honey for dinner, and bacon for lunch, but eggs at bedtime.";
foods = "eggs" | "bacon" | "yogurt" | "honey";
ok1 = patMatch( source, ( foods >? morningMeal + " for breakfast" ) );
ok2 = patMatch( source, ( foods >? middayMeal + " for lunch" ) );
ok3 = patMatch( source, ( foods >? eveningMeal + " for dinner" ) );
show(morningMeal, middayMeal, eveningMeal);
morningMeal = "yogurt";
middayMeal = "bacon";
eveningMeal = "honey";

That takes three separate matches, each finding a food followed by a time of day (using the + to concatenate the patterns). You can also build up more complicated patterns in easy steps.

First, a little pattern to match a span of digits:

digits = patSpan("0123456789");

Then a pattern that uses the digits pattern to match numbers like -314.15E-2. The "" alternatives allow matching nothing, so the sign, decimal point, fraction, and exponent are optional:

number = ( "" | "-" ) // optional sign
   + ( digits + ( "." + ( ( digits ) | "" ) | "" ) ) // leading digit not optional
   + ( ( ("E"|"e") + ( "" | "-" ) + digits ) | "" ) ; // another optional sign in the optional exponent
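To try the number pattern on its own, here is a small sketch. The sample sentence and the variable name found are made up for illustration, and the two patterns are repeated so the snippet runs by itself.

```jsl
digits = patSpan( "0123456789" );
number = ( "" | "-" ) // optional sign
   + ( digits + ( "." + ( ( digits ) | "" ) | "" ) ) // leading digit not optional
   + ( ( ( "E" | "e" ) + ( "" | "-" ) + digits ) | "" ); // optional exponent

// the sample text and the name found are assumptions for this sketch
found = "";
ok = patMatch( "the total is -314.15E-2 units", number >? found );
show( ok, found );
```

The match scans from left to right, so found should end up holding -314.15E-2, sign and exponent included.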

This one is not a pattern, just a string. Quotation marks are ugly (they need escaping), so make a variable:

quote = "\!""; // nightmare -> thing of beauty

Next is a pattern to match text in quotation marks. In the sourceFile text (below), an embedded quotation mark is represented as two quotation marks, so this pattern works a bit harder to make sure pairs of quotation marks are kept together. patRepeat() will repeat its argument as many times as possible. patBreak() will match up to, but not including, its argument. patLen() is used instead of "" just to show it off; it matches the number of characters requested. So, a literal starts and ends with a quote, and in between has runs of characters up to a quote. Remember patBreak doesn't include the quote, so after each run either there are two quotes and the pattern repeats some more, or it stops repeating and finds the final quote.

literal = quote + patRepeat( patBreak(quote) + ( ( quote + quote ) | patLen(0) ) ) + quote;
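A quick way to poke at the literal pattern by itself might look like the sketch below. The names sample and quoted and the sample sentence are hypothetical; the bracket escape is used so the embedded quotation marks need no further escaping.

```jsl
quote = "\!"";
literal = quote + patRepeat( patBreak( quote ) + ( ( quote + quote ) | patLen( 0 ) ) ) + quote;

// sample and quoted are made-up names for this sketch
sample = "\[she wrote "a ""quoted"" word" here]\";
quoted = "";
ok = patMatch( sample, literal >? quoted );
show( ok, quoted );
```

The doubled quotation marks inside the text are the interesting part: the pattern is meant to run through them to the closing quotation mark rather than stopping at the first pair.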

This little pattern matches the various end-of-line characters used by different operating systems:

newline = "\!r\!n" | "\!r" | "\!n" | "\!N";

Fields in the sample input are delimited by commas, or the beginning of the file, or the end of the file, or by a newline. patPos() and patRPos() match a zero-length string if the current position matches the argument. patRPos(0) is the right end of the string. patPos(0) is the beginning of the string.

endOfField = "," | newline | patPos(0) | patRPos(0);
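patPos(0) and patRPos(0) are also handy on their own for insisting that a pattern cover the whole string. The sketch below is only an illustration: wholeNumber, candidates, and s are made-up names, and the digits and number patterns are repeated so it runs by itself.

```jsl
digits = patSpan( "0123456789" );
number = ( "" | "-" )
   + ( digits + ( "." + ( ( digits ) | "" ) | "" ) )
   + ( ( ( "E" | "e" ) + ( "" | "-" ) + digits ) | "" );

// pin the pattern to both ends so the entire string must be a number
wholeNumber = patPos( 0 ) + number + patRPos( 0 );

// candidates is a hypothetical list of test strings
candidates = {"-314.15E-2", "42", "abc", "1.5x"};
for( i = 1, i <= nItems( candidates ), i++,
   s = candidates[i];
   write( s || " -> " || char( patMatch( s, wholeNumber ) ) || "\!n" );
);
```

With both ends pinned, the first two candidates should report 1 and the last two 0, because "1.5x" has text left over after the number.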

Here's the sample text to match, using the special bracket escape so the quotes don't need more escaping:

sourcefile = "\[
-123.456e-78,"I said,""how are you?""",11,555-5555,"fine, thank you",3.1459,1.414
]\";

Finally! A while-loop finds the pattern and replaces it with a comma, printing the found text each time. At the end, the unmatched text is displayed.

while( patMatch( sourcefile, endOfField + ( number | literal ) >? found + endOfField, "," ),
   write( found || "\!n" ) // display the found text and a newline
);
write( "Non matches: ", sourceFile, "\!n" );
-123.456e-78
"I said,""how are you?"""
11
"fine, thank you"
3.1459
1.414
Non matches: ,555-5555,

The >? operator is the conditional assignment within a pattern. The assignment is made only after the whole pattern succeeds and patMatch() returns 1. There is also a >> operator for immediate assignments, which are made as soon as text is matched (possibly many times if the pattern matcher must back up and retry).
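The difference between the two assignments shows up most clearly when the overall match fails. In this sketch the variable names immediateCopy and conditionalCopy are hypothetical, and both patterns are built to fail because "z" never follows "abc".

```jsl
// immediateCopy and conditionalCopy are made-up names for this sketch
immediateCopy = "unset";
conditionalCopy = "unset";

// >> assigns as soon as "abc" is matched, before the "z" is tried
ok1 = patMatch( "abc", ( "abc" >> immediateCopy ) + "z" );

// >? waits for the whole pattern to succeed, which never happens here
ok2 = patMatch( "abc", ( "abc" >? conditionalCopy ) + "z" );

show( ok1, ok2, immediateCopy, conditionalCopy );
```

If the operators behave as described above, both matches return 0, immediateCopy holds "abc", and conditionalCopy is still "unset".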

Pattern Matching in JSL can do other interesting things, like looping over a long string in a single statement, which might speed up an otherwise slow process. The PatRegex() function lets you use a regular expression as part of a pattern match. The scripting index (from the Help menu) has a list of pattern matching functions and descriptions.
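As a small illustration of mixing the two styles, the sketch below uses PatRegex() for the digits and an ordinary string pattern for the text in front of them. The sample text and the variable name orderNumber are made up.

```jsl
// orderNumber and the sample text are assumptions for this sketch
orderNumber = "";
ok = patMatch(
   "order 12345 shipped",
   "order " + ( patRegex( "\d+" ) >? orderNumber ) // the regex supplies just the digits
);
show( ok, orderNumber );
```

Here the regular expression plays the role that the digits pattern played earlier, so orderNumber should end up holding 12345.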

update 30jan2017: repair formatting for new community.
