tsl
Level III

Regex finding optional characters

I have text strings that can look like Str1 and Str2 below:

Str1 = "ParamList:Param1,Param2,Param3";
Str2 = "ParamList:Param1,Param2,Param3;Var2;Var3";

So there's a comma-delimited list of values prefixed with a keyword and a colon. The list might or might not be followed by additional variables. If it is, the list and each subsequent variable are separated by semicolons, as shown in Str2.

I want to grab the comma-delimited list. I've tried the following regex:

ParamLst1 = RegexMatch(Str1, "(ParamList:)(.*);?")[3];
ParamLst2 = RegexMatch(Str2, "(ParamList:)(.*);?")[3];

I'm thinking it will find "ParamList:", then (.*), being greedy, will match everything to the end of the string and then back up until it finds zero or one semicolons.

I must be thinking about it wrong, since with Str2 I get everything through the end of the string ("Param1,Param2,Param3;Var2;Var3").

I can use the Word function to bail myself out:

ParamLst2 = Word(1, RegexMatch(Str2, "(ParamList:)(.*)")[3], ";");

This works for both Str1 and Str2, but it adds complexity that I'm sure is unnecessary if I could grasp regex a little better!

Any ideas?

ACCEPTED SOLUTION
Craige_Hales
Super User

Re: Regex finding optional characters

Using regex:

Str1 = "ParamList:Param1,Param2,Param3";
Str2 = "ParamList:Param1,Param2,Param3;Var2;Var3";

regex(Str2, "^([^:]*):([^;]*);?(.*)$", "part1='\1' part2='\2' part3='\3'" )

With Str1 as the first argument, the result is

part1='ParamList' part2='Param1,Param2,Param3' part3=''

and with Str2 it is

part1='ParamList' part2='Param1,Param2,Param3' part3='Var2;Var3'

The leading ^ and trailing $ force the regex to match the whole string, from start to finish. [^:] matches any character that is not a colon, and [^;] matches any character that is not a semicolon. The lone : consumes the required colon, and ;? is an optional semicolon. The .* matches the remaining characters, if any.

The parentheses make capturing groups. The groups are referred to in the third parameter as \1, \2, and \3 and are inserted into the string that becomes the result.

Note that all of the * operators are written so they never go "too far" and need to back up. This makes matching efficient. Use negated character classes when you can to match everything up to a delimiter.
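
This also explains the original attempt: in (.*);? the greedy .* runs to the end of the string, and ;? is then satisfied by matching nothing, so the match succeeds without ever backing up. As a minimal sketch (not the only way to write it), the same negated-class idea applied to Regex Match could look like this, with the pattern reduced to a single capturing group:

```jsl
Str1 = "ParamList:Param1,Param2,Param3";
Str2 = "ParamList:Param1,Param2,Param3;Var2;Var3";

// Regex Match returns a list: {whole match, group 1, ...}.
// [^;]* stops at the first ";" or at the end of the string if there is none.
ParamLst1 = Regex Match( Str1, "ParamList:([^;]*)" )[2];
ParamLst2 = Regex Match( Str2, "ParamList:([^;]*)" )[2];
Show( ParamLst1, ParamLst2 );
```

Because the pattern has only one group, the list is at index 2 rather than 3.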

The result string could be "\2" if that's all you want.
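
For instance, a short sketch that keeps only the second group for both inputs; the names pat, list1, and list2 are made up for illustration:

```jsl
Str1 = "ParamList:Param1,Param2,Param3";
Str2 = "ParamList:Param1,Param2,Param3;Var2;Var3";

// Same pattern as above; the format "\2" returns only the second group.
pat = "^([^:]*):([^;]*);?(.*)$";
list1 = Regex( Str1, pat, "\2" );
list2 = Regex( Str2, pat, "\2" );
Show( list1, list2 );
```

Both values should match the part2 text shown above.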

 

Craige


REPLIES
pmroz
Super User

Re: Regex finding optional characters

Here's a simple brute force approach using JSL's string functions.

Str1 = "ParamList:Param1,Param2,Param3";
Str2 = "ParamList:Param1,Param2,Param3;Var2;Var3";

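// "ParamList:" is 10 characters, so the list starts at position 11.
s1 = substr(str1, 11);
// Split on ";" and keep the first piece, then split that piece on ",".
w1 = words(s1, ";");
plist1 = words(w1[1], ",");

s2 = substr(str2, 11);
w2 = words(s2, ";");
plist2 = words(w2[1], ",");
show(plist1, plist2);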

Output:

plist1 = {"Param1", "Param2", "Param3"};
plist2 = {"Param1", "Param2", "Param3"};
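
The offset 11 above assumes the prefix is always the 10-character "ParamList:". If the keyword length can vary, one possible variant is to locate the colon first. This is only a sketch; colonPos is a name introduced here for illustration:

```jsl
Str2 = "ParamList:Param1,Param2,Param3;Var2;Var3";

// Contains returns the 1-based position of the first ":" (0 if not found).
colonPos = Contains( Str2, ":" );
// Everything after the colon.
s2 = Substr( Str2, colonPos + 1 );
// First ";"-delimited piece, split on ",".
plist2 = Words( Word( 1, s2, ";" ), "," );
Show( plist2 );
```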