Back Reference

Craige_Hales · May 18, 2015 03:52 PM

\1

The JSL Regex function has a format argument that defaults to "\0". \0 is the back reference to all of the matched text.

sentence = "the quick brown fox jumped over the lazy dog";
regex( sentence, "\w{6,6}", "\0"); // look for exactly 6 word characters

"jumped"
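Because the format argument defaults to "\0", leaving it out should give the same result. A minimal sketch reusing the sentence above (the variable names a and b are only for illustration):

```jsl
sentence = "the quick brown fox jumped over the lazy dog";
a = Regex( sentence, "\w{6,6}" );        // format omitted, so the default "\0" applies
b = Regex( sentence, "\w{6,6}", "\0" );  // format given explicitly
Show( a == b );                          // both calls should return "jumped"
```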

Other back references are created by parentheses.

sentence = "the quick brown fox jumped over the lazy dog";
regex( sentence, "(\b\w{4,4}\b).*(\b\w{4,4}\b)",
    "this part: \0\!nhas this: \1\!nand this: \2\!nfour letter words");

"this part: over the lazy
has this: over
and this: lazy
four letter words"

The first six-letter example worked because there were no words longer than six letters. The four-letter-word example is more complicated; it uses \b to match word boundaries. Without the boundary test, quic would appear to be a four-letter word. The format argument produced all four lines of text using back references \0, \1, and \2. \1 is the first parenthesized group, and \2 the second. \0 is still the entire text matched. (\!n is a newline character. The format string is all run together because any spaces in it become part of the output.)
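To see what the boundary test buys, here is a small sketch of the same pattern with the \b tests removed. The format text ("first:", "last:") is made up for this illustration.

```jsl
sentence = "the quick brown fox jumped over the lazy dog";
// same groups as above, but without the \b word-boundary tests
Regex( sentence, "(\w{4,4}).*(\w{4,4})", "first: \1, last: \2" );
// expected: "first: quic, last: lazy"
```

The first group now grabs the first four word characters it can find, which are the start of quick, while the greedy .* still pushes the second group as far right as possible.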

Regex is case sensitive by default. This example fails to match and returns a missing value:

sentence = "the quick brown fox jumped over the lazy dog";
regex( sentence, "DOG");

.

There is an option for that.

sentence = "the quick brown fox jumped over the lazy dog";
regex( sentence, "DOG", "\0", IGNORECASE);

"dog"

And there are specifications for changing case in the format string.

sentence = "tHe QUick BROWN fox jUMPED over the lAZy DoG";
regex( sentence, "\b(\w)(\w*)\b", "\U\1\E\L\2\E", IGNORECASE, GLOBALREPLACE);

"The Quick Brown Fox Jumped Over The Lazy Dog"

The regex has made the ragged case input into initial capitals. The format string contains \U to begin upper casing, \L to begin lower casing, and \E to end the case change.

Grammar for regex replacement strings (similar to Perl):

\L - begin lower casing the output until \E
\U - begin upper casing the output until \E
\E - end of casing and quoting modifications
\l - (lower case letter L) - lower case the next character
\u - upper case the next character
\0 ... \999 - substitute one of the matched expressions from the pattern
\e - end of a number if more digits follow: \7\e123 means expression 7 followed by 123 (JMP extension)
\\ - a single backslash
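A short sketch of two of the less obvious entries, \u and \e. The input strings "dog" and "item" are invented for the example, and the results in the comments are what the grammar above implies rather than output copied from a JMP session.

```jsl
// \u upper cases only the next character of the substituted text
Regex( "dog", "(\w+)", "\u\1" );       // should give "Dog"

// \e ends the group number, so the digits after it are literal text
Regex( "item", "(\w+)", "\1\e2015" );  // should give "item2015"
```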

Found a bug: the following doesn't work currently, but will in the 12.2 maintenance release. It loops because of a bad interaction between IGNORECASE, GLOBALREPLACE, and the zero-length match at the end of the string.

sentence = "tHe QUick BROWN fox jUMPED over the lAZy DoG";
regex( sentence, "\b(\w*)\b", "\L\u\1", IGNORECASE, GLOBALREPLACE);

Since there are no case-specific characters in the search string, this will work:

sentence = "tHe QUick BROWN fox jUMPED over the lAZy DoG";
regex( sentence, "\b(\w*)\b", "\L\u\1", GLOBALREPLACE);

"The Quick Brown Fox Jumped Over The Lazy Dog"

Using \w+ rather than \w* also avoids the problem by insisting words have at least one character.
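As a sketch of that second workaround, only the quantifier changes. Assuming the behavior described above, this should produce the same initial-capitals result even with IGNORECASE left in:

```jsl
sentence = "tHe QUick BROWN fox jUMPED over the lAZy DoG";
// \w+ requires at least one word character, so the pattern
// can no longer produce a zero-length match
Regex( sentence, "\b(\w+)\b", "\L\u\1", IGNORECASE, GLOBALREPLACE );
```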

The Tennessee Waltz is on the front side.

Update 4 Feb 2017: repaired formatting (and picture) for new community.

gianpaolo · ‎12-05-2017

Hello Craige_Hales,

I have found the Regex function very useful for my work.

I need to modify the column names used by my script. Specifically, I would like to insert a character when a column name starts with a number.

e.g. if it is: "00001 DATA no 1"

then it will be: "*00001 DATA no 1" (in this case I added '*')

if it is: "DATA no 1"

then it will be: "DATA no 1" (in this case no changes)

Is it possible to do this with Regex? Can you help me?

thanks a lot

ciao Gianpaolo

uday_guntupalli · ‎12-05-2017

@gianpaolo ,

Here is one way to address this. My suggestion is to post such questions in "Discussions" and provide a link to the blog. The community will be able to address your question faster:

Clear Globals(); Clear Log();

dt = Current Data Table(); 
ColNames = dt << Get Column Names("String");

for( i = 1 , i <= N Items(ColsList) , i++, 
		If(!IsMissing(Regex(Char(ColsList[i]),"[0-9]")),
			Col = Column(dt,i); // where dt is your data table 
			ColName = Col << Get Name; 
			NewName = Concat("*",ColName); 
			Col << Set Name(NewName);
		  );
   );

Provided is a screenshot of data table (dt) I made for running the example

Craige_Hales · ‎12-05-2017

You might need to change the regex to

"^[0-9]"

to force it to find the character at the start of the string. Otherwise a name like Column7 would match.
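A quick way to compare the two patterns, using the column names mentioned in this thread as plain strings:

```jsl
Show( Regex( "Column7", "[0-9]" ) );           // matches the 7 in the middle of the name
Show( Regex( "Column7", "^[0-9]" ) );          // no leading digit, so the result is missing
Show( Regex( "00001 DATA no 1", "^[0-9]" ) );  // leading digit, so this returns "0"
```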

gianpaolo · ‎12-06-2017

Thank you very much for the help.

I think the script only works if I use "ColNames" instead of "ColsList" in the For loop:

Clear Globals();
Clear Log();

dt = Current Data Table();
ColNames = dt << Get Column Names( "String" );

For( i = 1, i <= N Items( ColNames ), i++,
	If( !Is Missing( Regex( Char( ColNames[i] ), "^[0-9]" ) ),
		Col = Column( dt, i ); // where dt is your data table
		ColName = Col << Get Name;
		NewName = Concat( "*", ColName );
		Col << Set Name( NewName );
	)
);

I'm a beginner in JMP scripting... forgive me if I'm wrong.

uday_guntupalli · ‎12-06-2017

@gianpaolo:
Your observation is totally correct. It was a typo and an oversight on my part. The script you posted is correct.