JSL REGEX word boundaries \b

shaulke (Community Trekker, joined Jan 17, 2012)
Hello,

Does JSL support REGEX word boundaries (\b)?

I'm trying to replace complete words in text using PAT MATCH and a "\bword\b" expression. For some reason JSL does not process \b, though it does recognize and process \W. Unfortunately, \W is not good enough for my task: it doesn't match single-word entries, among other issues that it introduces.

Example:

a = "window is: open";

p = pat regex("\bis\b");

pat match(a, p, "was");

I would expect it to return:   a = "window was: open";

Thanks!  - Shaul

5 REPLIES
Craige_Hales (Staff, joined Mar 21, 2013)

JMP10 does not support the \b word boundary in regex; JMP11 will. In the following examples, {4} means exactly 4 occurrences of the preceding item (\w, a word character, in the first example), which is intended to match the 4-letter word “very” by requiring a word boundary on each side. The regexMatch function interprets the 2nd argument as a regular expression and the 3rd argument as replacement text, in which substitutions are numbered after the capturing parenthesis groups (used in the 2nd example).

print(JMPVERSION());

x = "the large,very quick fox";

regexMatch( x, "\b\w{4}\b", "fairly" );

print(x);

"11.0.0"

"the large,fairly quick fox"

JMP10 might do it well enough, depending on your needs, by capturing the delimiter characters and inserting them back into the replacement:

print(JMPVERSION());

x = "the large,very quick fox";

regexMatch( x, "([^a-zA-Z0-9_]|^)([a-zA-Z0-9_]{4})([^a-zA-Z0-9_]|$)", "\1fairly\3" );

print(x);

"10.0.1"

"the large,fairly quick fox"

\b is difficult, if not impossible, to write as an ordinary pattern: it must match both before and after a word yet consume no text at all, and the word can sit at the very beginning (^) or the very end ($) of the text. The second example works around the “match no text at all” requirement by putting the captured delimiters back with the \1 and \3 parenthesis groups in the replacement text.

Craige

shaulke

This is very helpful since I'm still on JMP9.  

I will need to dynamically adjust the {n} based on the length of string to match.

Can you elaborate on how the numbering of substitutions works? I'm not sure I understand what \1 and \3 do.

Thank you!  - Shaul

Craige_Hales

Each parenthesized expression in the regex forms a capturing group that can be referred to by number; count just the open parens, from left to right, to get the number. The text matched by a capturing group can be used as a back reference later in the regex or in the replacement string. \1 is a back reference to the first capturing group.

regexMatch( x, "([^a-zA-Z0-9_]|^)([a-zA-Z0-9_]{4})([^a-zA-Z0-9_]|$)", "\1fairly\3" );

has 3 open parens; the middle set (around the word that gets matched) was not really needed, but since they are there I use \1 and \3 in the replacement string.  \0 refers to the entire match.  If you nest the parenthesis groups, you still count just the open parens, from left to right, to get the number.
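
To make the counting rule concrete, here is a small sketch with nested groups. The sample string and the variable names are made up for illustration, and the script was not run as part of this thread. It assumes that Regex Match, called without a replacement argument, returns a list holding the entire match followed by the text of each group in order.

```jsl
// Hypothetical sample text, not from the thread.
d = "2013-03-21";

// Groups are numbered by their opening parens, left to right:
//   ( ( [0-9]+ ) - ( [0-9]+ ) ) - ( [0-9]+ )
//   1 2            3              4
parts = Regex Match( d, "(([0-9]+)-([0-9]+))-([0-9]+)" );

// parts[1] is \0 (the entire match), parts[2] is \1, parts[3] is \2, and so on.
Show( parts );
```

Under that assumption, the outer group \1 would capture "2013-03", while the groups nested inside it, \2 and \3, would capture "2013" and "03".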

I used {4} repeats to make an example that would not find the three and five letter words in the source text that came first.  You might need some other way to specify the word.  Here's an example that chews the words off one at a time:

Print( JMP Version() );

x = "the large,very quick fox";

While( 1,

  results = Regex Match( x, "([^a-zA-Z0-9_]|^)([a-zA-Z0-9_]+)([^a-zA-Z0-9_]|$)", "\1\3" );

  If( N Items( results ) >= 3,

    Print( results[3] ),

    Break()

  );

);

" 9.0.2"

"the"

"large"

"very"

"quick"

"fox"

In this example, the + means "one or more of the previous expression" so you might not need the explicit length.  RegexMatch returns a list of back references; the first item in the list is \0 which represents the entire match.  results[3] is \2, which is the word that was matched.  The replacement is \1\3, which cuts \2 out of the string.

There are a number of websites with regex tutorials; Regular-Expressions.info is one example.

JMP11's \b will simplify the examples, a lot.

Craige

shaulke

I simplified the Regex by using \W (JMP9 accepted \W):


x = "the large, very quick fox";

a = Regex Match( x, "(\W|^)(very)(\W|$)", "\1fairly\3" );

print(x);

My only remaining issue is that it does the replacement only once, so I need to loop until no more matches are returned. Is there a flag, like FULLSCAN, that I could specify in Regex Match? I couldn't get Pat Match to work for some reason.

In any case, you helped me a lot already!

Craige_Hales

Ah, very good.  Much better, thanks.  JMP10 has a simplified regex function that does the global replace:

regex("hello","[^aeiou]","?",GLOBALREPLACE);

"?e??o"

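Applied to the whole-word replacement discussed in this thread, a sketch might look like the following. It was not run as part of the thread. The first call assumes JMP10, where \b is unavailable, so it reuses the captured-delimiter trick; the second assumes JMP11's \b support. The variable names y and z are only for illustration.

```jsl
x = "the large, very quick fox";

// JMP10 sketch: put the delimiters back with \1 and \2 because \b is not available.
// Caveat: each match consumes its delimiters, so two target words separated by
// a single delimiter ("very very") would not both be replaced in one pass.
y = Regex( x, "(\W|^)very(\W|$)", "\1fairly\2", GLOBALREPLACE );

// JMP11 sketch: \b matches at the word edges without consuming any text.
z = Regex( x, "\bvery\b", "fairly", GLOBALREPLACE );

Show( y, z );
```

Unlike Regex Match with a replacement argument, Regex returns the modified text and leaves x unchanged.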

RegexMatch() requires an external loop (the while() from the earlier example).  PatMatch() can use a regex as part of a pattern but does not use the back references except within a patRegex(); patMatch is intended to build up a pattern using the patXXXX functions rather than the regex language, and it has different ways to capture partial matches.
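
For JMP9, where only RegexMatch() is available, the external loop could be sketched as follows. The variable names and sample text are hypothetical, and the script was not run as part of this thread. It assumes Regex Match returns an empty list when nothing matches, and that the target word contains no regex metacharacters.

```jsl
x = "a very, very quick fox";
target = "very";        // whole word to find (assumed to contain no regex metacharacters)
replacement = "fairly"; // must not contain target as a whole word, or the loop never ends

// Build the pattern from the word itself, so no explicit {n} length is needed.
pattern = "(\W|^)" || target || "(\W|$)";

// Regex Match edits x in place, one match per call; stop when nothing matches.
n = 0;
While( N Items( Regex Match( x, pattern, "\1" || replacement || "\2" ) ) > 0,
	n++
);

Show( x, n );
```

Because every pass rescans the text from the start, adjacent occurrences that share a delimiter are still picked up on a later pass.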

Craige