lehaofeng
Level V

Regex: can't get all parts of the string that match

I'm trying to extract every part of the string that matches the pattern, but Regex Match() only gives me the first match, not the "BWA005, BWAZ006" that I want.

ex="Select Where( :\!"BWA\!"n == \!"BWA005\!" | :\!"BWA\!"n == \!"BWAZ006\!" )";
Regex match( ex, "[\!"](BWA|BWAZ|BWAZ_)([0-9]{1,})(.*?)[\!"]" )

 

 


Accepted Solutions
mmarchandTSI
Level V

Re: Regex: can't get all parts of the string that match

I'm sure this is not the best solution, but Regex Match() doesn't seem to have an equivalent of Python's re.findall. You can get a list of all matches with this crude script, which finds the first match, stores it, deletes it from the string, and repeats until nothing matches. Hopefully, someone knows a better way.

 

Names Default To Here( 1 );
ex = "Select Where( :\!"BWA\!"n == \!"BWA005\!" | :\!"BWA\!"n == \!"BWAZ006\!" )";
matchlist = {};
While( !Is Missing( Regex( ex, "(BWA|BWAZ|BWAZ_)(\d{1,})" ) ),
	a = Regex( ex, "(BWA|BWAZ|BWAZ_)(\d{1,})" );
	Insert Into( matchlist, a );
	Substitute Into( ex, a, "" );
);

 


Craige_Hales
Super User

Re: Regex: can't get all parts of the string that match

Better for a future maintainer of the JSL, or better for the computer, or some other measure? Here's a couple of ideas; I lean towards the first one, partly because I imagine you started with a JSL expression, not a text string.

A note about comments and seeing trees vs seeing forests: my comments are tree-level. They don't describe your goal (the forest), they describe the trees. Anyone maintaining your JSL in the future will want both kinds. A forest level comment should explain why you are thinking about these values. The tree level is about how a non-obvious bit of code works.

ex = "Select Where( :\!"BWA\!"n == \!"BWA005\!" | :\!"BWA\!"n == \!"BWAZ006\!" )";
// write(ex) // Select Where( :"BWA"n == "BWA005" | :"BWA"n == "BWAZ006" )
// Regex match( ex, "[\!"](BWA|BWAZ|BWAZ_)([0-9]{1,})(.*?)[\!"]" )

// desired result: "BWA005,BWAZ006"
// 
// it isn't 100% clear what the rules are. These examples make different assumptions.

// assume the text is valid JSL that can be parsed back into an expression
// below, also assumes | but not & and no nested parens
expression = Parse( ex ); // Select Where( :BWA == "BWA005" | :BWA == "BWAZ006" )
// get the argument of the select where(...)
expression = Arg( expression, 1 ); // :BWA == "BWA005" | :BWA == "BWAZ006"
// assume there are only | (or) operators, change x|y|z => {x,y,z}
Substitute Into( expression, Expr( Or() ), {} ); // {:BWA == "BWA005", :BWA == "BWAZ006"}
result = {}; // accumulate answer in a result list
// process the expression list
For Each( {op}, expression, // :BWA == "BWA005"   etc.
	// the Right Hand Side of :BWA == "BWA005" is "BWA005"
	RHS = Arg( Name Expr( op ), 2 ); // Equal(:BWA,"BWA005"), want 2nd arg
	// apply the test to see if this is one to keep
	If( !Is Missing( Regex( RHS, "(BWA|BWAZ|BWAZ_)([0-9]+)" ) ),
		Insert Into( result, Arg( op, 2 ) ); // yes: keep it in the result list
	);
);
// join the list of RHS strings, separated by commas
Show( Concat Items( result, "," ) ); // Concat Items(result, ",") = "BWA005,BWAZ006";


// alternate example

// assume the strings can't collide with column names, perhaps because no
// column name has a suffix number and all strings do have a suffix number
result = {};
Pat Match(
	ex,
	// this is the pattern; it repeats as long as it can
	Pat Repeat(
		// either match the BWA... pattern, in quotes, stashing the regex match into the result:
		( "\!"" + Pat Regex( "(BWA|BWAZ|BWAZ_)([0-9]+)" ) >> result[N Items( result ) + 1] + "\!"" )
		// >> is the patImmediate() operator that copies the LHS match into the RHS location
		// the list is initially 0 items long, and [nitems+1] extends the list by one more item
		| // OR the alternative...
		Pat Len( 1 ) // advance one character. This is how most of the text is matched.
	)
);
Show( Concat Items( result, "," ) ); // Concat Items(result, ",") = "BWA005,BWAZ006";
// you *could* write a pattern match that would parse the text as an expression,
// but that is what the first example did.
Craige


mmarchandTSI
Level V

Re: Regex: can't get all parts of the string that match

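Actually, if the string can contain duplicates that you want to keep, like { "BWA005", "BWAZ006", "BWA005" }, do it this way instead. Rather than deleting every occurrence of the match, this version cuts the string off just past the first occurrence, so later repeats are still found: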

 

Names Default To Here( 1 );
ex = "Select Where( :\!"BWA\!"n == \!"BWA005\!" | :\!"BWA\!"n == \!"BWAZ006\!" )";
matchlist = {};
While( !Is Missing( Regex( ex, "(BWA|BWAZ|BWAZ_)(\d{1,})" ) ),
	a = Regex( ex, "(BWA|BWAZ|BWAZ_)(\d{1,})" );
	Insert Into( matchlist, a );
	b = Contains( ex, a );
	c = Length( a );
	ex = Substr( ex, b + c );
);
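
For comparison, here is a minimal Python sketch of the same two strategies (Python being the language a later reply in this thread uses from JMP). The function names and the sample string are hypothetical, made up for illustration. It assumes that JSL's Substitute Into() replaces every occurrence of the matched text, which is why the first loop loses duplicates.

```python
import re

# Longest alternative first, so the prefix choice does not depend on backtracking.
PATTERN = re.compile(r'(?:BWAZ_|BWAZ|BWA)\d+')

def collect_by_removal(text):
    """Mimics the first script: delete every occurrence of each match."""
    found = []
    while (m := PATTERN.search(text)):
        found.append(m.group())
        text = text.replace(m.group(), "")  # removes duplicates too
    return found

def collect_by_advancing(text):
    """Mimics the second script: continue searching just past each match."""
    found = []
    pos = 0
    while (m := PATTERN.search(text, pos)):
        found.append(m.group())
        pos = m.end()
    return found

sample = '"BWA005" | "BWAZ006" | "BWA005"'  # hypothetical input with a repeat
print(collect_by_removal(sample))    # ['BWA005', 'BWAZ006']
print(collect_by_advancing(sample))  # ['BWA005', 'BWAZ006', 'BWA005']
```

The advancing version also never modifies the input string, which makes it easier to reuse the original text afterwards.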
lehaofeng
Level V

Re: Regex: can't get all parts of the string that match

Thank you! It works!

I want to know if there is a better way to do this.

jthi
Super User

Re: Regex: can't get all parts of the string that match

For Regex Match there is a wish list item: "Add flag to Regex Match() to find all non-overlapping occurances of pattern". Depending on what you are trying to do (and why and where), there are many ways to handle this. Loops and JMP's own pattern matching have already been mentioned, but other options may suit your use case better.

-Jarmo
hogi
Level XII

Re: Regex: can't get all parts of the string that match

With JMP 18, it became much easier to simply use Python's re.findall:

 

ex="Select Where( :\!"BWA\!"n == \!"BWA005\!" | :\!"BWA\!"n == \!"BWAZ006\!" )";
Python send(ex);
Python Submit ("
import re
matches = re.findall(r'(BWA|BWAZ)([0-9]+)',ex)
");
Python get (matches)
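
One thing to watch for with the pattern above: because it contains two capturing groups, re.findall returns (prefix, digits) tuples rather than whole strings. A minimal sketch of the difference, run as plain Python outside JMP. The variable names `pairs` and `joined` and the non-capturing rewrite of the pattern are my own additions, not part of the reply above.

```python
import re

ex = 'Select Where( :"BWA"n == "BWA005" | :"BWA"n == "BWAZ006" )'

# Two capturing groups: findall yields one (prefix, digits) tuple per match.
pairs = re.findall(r'(BWA|BWAZ)([0-9]+)', ex)
print(pairs)    # [('BWA', '005'), ('BWAZ', '006')]

# A non-capturing group (?:...) makes findall return each whole match instead.
matches = re.findall(r'(?:BWAZ_|BWAZ|BWA)[0-9]+', ex)
print(matches)  # ['BWA005', 'BWAZ006']

# Join into the comma-separated form the original question asked for.
joined = ",".join(matches)
print(joined)   # BWA005,BWAZ006
```

The column reference `:"BWA"n` is not matched in either case, because the pattern requires at least one digit after the prefix.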