Uncharted

Craige_Hales · Feb 11, 2015 05:05 AM

A bunch of JMP character functions appeared in a previous post: length, substr, contains, etc. They are great for simple things, and you could do hard things with them if you wanted to. But Regex might be better.

Regex, short for Regular Expression, is a powerful JSL function for manipulating text.

result = regex( sourceText, searchPattern, replaceTemplate, options );

(There are two other functions in JSL with Regex in their name; for now just look at vanilla "regex". It's the one you want, especially if you are using it in a column formula.)

This is a very incomplete introduction to regular expressions (but a reasonably complete description of JMP's regex function). You can find many tutorials for regular expressions on the web; JMP's regular expression searchPatterns are similar to Perl's regular expressions.

The searchPattern is the regular expression and describes the text to be found. The replaceTemplate describes how the matched text will be replaced and can use bits of the matched text in the template. The following examples use a literal string (in quotation marks) for the first argument, for clarity, but a variable holding the sourceText would almost always make more sense.

regex( "123 456 789", "[0-9]*")

"123"

This example has two arguments, sourceText and searchPattern. The searchPattern uses square brackets to create a set of characters. The hyphen shortcut specifies a range. You could use [a-zA-Z] to make a set of all the upper and lower case ASCII characters. The asterisk (*) means "repeat the previous pattern zero or more times". \d is a shortcut for the set of all digits.

regex( "123 456 789", "\d*")

"123"

A character set like [0123456789] can be negated by starting it with a hat: [^0123456789] matches any character that is not a digit. Some of the shortcuts like \d can be represented as a character set:

\w	[a-zA-Z0-9_]	word characters
\W	[^a-zA-Z0-9_]	not word characters
\d	[0-9]	digits
\D	[^0-9]	not digits
\s	[ \t\r\n\v\f]	whitespace includes tab, return, newline, vertical tab, and formfeed
\S	[^ \t\r\n\v\f]	not whitespace

All of the shortcuts above match a single character. You can add the asterisk

\s*

to match a run of zero or more whitespace characters. Or, you can add a plus

\D+

to match a run of one or more non-digit characters. Or, you can add a question mark

\w?

to match zero or one word characters.

Asterisk, plus, and question mark are repetition operators, and they are greedy. Greedy means they grab as much text as possible, then back off later if needed. They can be made reluctant by adding question mark. Generally greedy is the behavior you want, but not always.

Parentheses, Back References, and replaceTemplate

A lot of the power in regex comes from the interaction between the searchPattern and the replaceTemplate. The interaction is created by the use of back references, represented as \0, \1, \2, ... The replaceTemplate default value is "\0". Back reference \0 is the entire text that was matched. Parentheses are used in regular expressions to group bits of a pattern together to form larger patterns. For example

regex( "cat dog catog cadog cat dog", "cat|dog", "***", GLOBALREPLACE)

"*** *** ***og ca*** *** ***"

regex( "cat dog catog cadog cat dog", "ca(t|d)og", "***", GLOBALREPLACE)

"cat dog *** *** cat dog"

The vertical bar separates alternatives and has very low precedence; the second example uses parentheses to group the t or d together. The replaceTemplate is just constant text, ***. But each open parenthesis begins a new back reference, and the back references can be used in the template:

regex( "cat dog catog cadog cat dog", "ca(t|d)og", "  \1  ", GLOBALREPLACE)

"cat dog t d cat dog"

GLOBALREPLACE causes the regex to find and replace all matches; the pattern matches either catog or cadog. The back reference is either t or d, because that is the part of the pattern inside the parentheses. The template adds some spaces around the \1 value, and the template is then substituted for each matched catog or cadog.

Multiple parenthesis pairs can be used, both nested and in sequence. Just count the open parens to find the back reference number.

Data Table Formula

Here's the lastname, firstname example using regex.Column formula using regex()

This example uses the anchors ^ and $ (underlined in green). ^ matches the beginning of the string and $ matches the end of the string. \w* matches zero or more word characters. This example may be too simple; it won't handle names with special characters or blanks. You can rework it to split on commas using back references: Regex(name, "([^,]*)", "\1") and Regex(name, "\s*([^,]*)$", "\1") perhaps. "([^,]*)" means zero or more characters that are not commas and "\s*([^,]*)$" means skip zero or more spaces, keep all the non-commas, and be at the end. In the "([^,]*)" case \1 and \0 are the same. But in "\s*([^,]*)$", \0 may have leading spaces that \1 doesn't have because \s* is outside the parens for \1.

There's a lot more to regular expressions, well described elsewhere on the web (search for regex). And the Alice and Jerry example in the scripting index is pretty good too, if you are puzzling out how the back references can be used.

Next: pattern matching functions.

Update 29Jan2017: repaired formatting for new community.