Craige_Hales

Back Reference

\1\1

The JSL Regex function has a format argument that defaults to "\0". \0 is the back reference to the entire matched text.

sentence = "the quick brown fox jumped over the lazy dog";
regex( sentence, "\w{6,6}", "\0"); // look for exactly 6 word characters

"jumped"

Other back references are created by parentheses.

sentence = "the quick brown fox jumped over the lazy dog";
regex( sentence, "(\b\w{4,4}\b).*(\b\w{4,4}\b)",
    "this part: \0\!nhas this: \1\!nand this: \2\!nfour letter words");

"this part: over the lazy
has this: over
and this: lazy
four letter words"

The six-letter example worked only because the sentence has no words longer than six letters; a longer word earlier in the sentence would have matched on its first six characters. The four-letter-word example is more complicated: it uses \b to match word boundaries. Without the boundary test, quic would appear to be a four-letter word. The format argument produced all four lines of text using the back references \0, \1, and \2. \1 is the first parenthesized group and \2 is the second; \0 is still the entire matched text. (\!n is a newline character. The format string is run together because any spaces in it would become part of the output.)
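As a further illustration of numbered groups, here is a minimal sketch that reorders the parts of a date. The date string and the variable names are made up for this example and are not from the original post.

```jsl
// Hypothetical input: a year-month-day date string.
isoDate = "2017-02-04";
// Three parenthesized groups capture the year (\1), month (\2), and day (\3).
// The format argument emits them in a different order.
reordered = Regex( isoDate, "(\d+)-(\d+)-(\d+)", "\3/\2/\1" );
Show( reordered ); // reordered = "04/02/2017";
```

Because the back references are numbered by the order of the opening parentheses, the format string is free to use them in any order, or more than once.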

Regex is case sensitive by default. This example fails to match and returns a missing value:

sentence = "the quick brown fox jumped over the lazy dog";
regex( sentence, "DOG");

.

The IGNORECASE option makes the match case insensitive.

sentence = "the quick brown fox jumped over the lazy dog";
regex( sentence, "DOG", "\0", IGNORECASE);

"dog"

The format string also has escape sequences for changing case.

sentence = "tHe QUick BROWN fox jUMPED over the lAZy DoG";
regex( sentence, "\b(\w)(\w*)\b", "\U\1\E\L\2\E", IGNORECASE, GLOBALREPLACE);

"The Quick Brown Fox Jumped Over The Lazy Dog"

The regex has converted the ragged-case input to initial capitals. In the format string, \U begins upper casing, \L begins lower casing, and \E ends the case change.

Grammar for regex replacement strings (similar to Perl):

  • \L - begin lower casing the output until \E
  • \U - begin upper casing the output until \E
  • \E - end of casing and quoting modifications
  • \l - (lower case letter L) - lower case the next character
  • \u - upper case the next character
  • \0 ... \999 - substitute one of the matched expressions from the pattern
  • \e - end of a number if more digits follow: \7\e123 means expression 7 followed by 123 (JMP extension)
  • \\ - a single backslash
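A few of the entries in the list above can be tried one at a time. This is a small sketch with made-up input strings; the values in the comments are what the grammar above describes, not output copied from the original post.

```jsl
// \u upper cases only the next character of the output.
Show( Regex( "dog", "(\w+)", "\u\1" ) );                 // "Dog"
// \l lower cases only the next character.
Show( Regex( "DOG", "(\w+)", "\l\1" ) );                 // "dOG"
// \U ... \E upper cases a span; text after \E is left alone.
Show( Regex( "lazy dog", "(\w+) (\w+)", "\U\1\E \2" ) ); // "LAZY dog"
// \e ends the group number so the digits that follow are literal text.
Show( Regex( "item", "(\w+)", "\1\e42" ) );              // "item42"
```

None of these calls uses GLOBALREPLACE, so each one returns only the formatted text for the first match, not the whole source string.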

Found a bug: the following does not work currently, but will in the 12.2 maintenance release. It loops because of a bad interaction between IGNORECASE, GLOBALREPLACE, and the zero-length match at the end of the string.

sentence = "tHe QUick BROWN fox jUMPED over the lAZy DoG";
regex( sentence, "\b(\w*)\b", "\L\u\1", IGNORECASE, GLOBALREPLACE);

Since the search pattern contains no case-specific characters, IGNORECASE is unnecessary, and leaving it out avoids the loop:

sentence = "tHe QUick BROWN fox jUMPED over the lAZy DoG";
regex( sentence, "\b(\w*)\b", "\L\u\1", GLOBALREPLACE);

"The Quick Brown Fox Jumped Over The Lazy Dog"

Using \w+ rather than \w* also avoids the problem by insisting words have at least one character.
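The \w+ alternative could be written as in the sketch below. IGNORECASE is left out because the pattern has no case-specific characters; the variable name titled is only for illustration.

```jsl
sentence = "tHe QUick BROWN fox jUMPED over the lAZy DoG";
// \w+ requires at least one word character, so the pattern cannot
// produce a zero-length match at the end of the string.
titled = Regex( sentence, "\b(\w+)\b", "\L\u\1", GLOBALREPLACE );
Show( titled ); // titled = "The Quick Brown Fox Jumped Over The Lazy Dog";
```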

The Tennessee Waltz is on the front side.

Update 4 Feb 2017: repaired formatting (and picture) for new community.
