Regex: can't get all parts of the string that match

lehaofeng — Fri, 08 Mar 2024 09:22:45 GMT

I'm trying to extract every part of the string that matches, but with Regex Match I can't get the "BWA005, BWAZ006" that I want.

ex="Select Where( :\!"BWA\!"n == \!"BWA005\!" | :\!"BWA\!"n == \!"BWAZ006\!" )";
Regex match( ex, "[\!"](BWA|BWAZ|BWAZ_)([0-9]{1,})(.*?)[\!"]" )

Re: Regex: can't get all parts of the string that match

mmarchandTSI — Fri, 08 Mar 2024 13:28:16 GMT

I'm sure this is not the best solution, but Regex Match() doesn't seem to do what I'd like it to do, which is to behave like Python's re.findall. You can get a list of all matches with this crude script. Hopefully someone knows a better way.

Names Default To Here( 1 );
ex = "Select Where( :\!"BWA\!"n == \!"BWA005\!" | :\!"BWA\!"n == \!"BWAZ006\!" )";
matchlist = {};
While( !Is Missing( Regex( ex, "(BWA|BWAZ|BWAZ_)(\d{1,})" ) ),
	a = Regex( ex, "(BWA|BWAZ|BWAZ_)(\d{1,})" );
	Insert Into( matchlist, a );
	Substitute Into( ex, a, "" );
);

Re: Regex: can't get all parts of the string that match

mmarchandTSI — Fri, 08 Mar 2024 15:44:12 GMT

Actually, in case there are duplicates in there that you want to see, like { "BWA005", "BWAZ006", "BWA005" }, you would want to do it this way instead:

Names Default To Here( 1 );
ex = "Select Where( :\!"BWA\!"n == \!"BWA005\!" | :\!"BWA\!"n == \!"BWAZ006\!" )";
matchlist = {};
While( !Is Missing( Regex( ex, "(BWA|BWAZ|BWAZ_)(\d{1,})" ) ),
	a = Regex( ex, "(BWA|BWAZ|BWAZ_)(\d{1,})" );
	Insert Into( matchlist, a );
	b = Contains( ex, a );
	c = Length( a );
	ex = Substr( ex, b + c );
);
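For readers more familiar with Python, the same "find a match, record it, then continue after it" idea can be sketched as below. This is only an illustration of the loop above, not a translation of the JSL: the helper name `find_all_by_advancing` is made up for this example, and the groups are written as non-capturing because only the whole match is needed.

```python
import re

# Same alternatives and digit run as the JSL pattern; groups are non-capturing
# because only the whole match is collected.
PATTERN = re.compile(r"(?:BWA|BWAZ|BWAZ_)\d+")

def find_all_by_advancing(text):
    """Hypothetical helper: search, record the match, then resume after it."""
    matches = []
    pos = 0
    while True:
        m = PATTERN.search(text, pos)
        if m is None:
            break                    # no further match: done
        matches.append(m.group(0))   # keep the whole match, duplicates included
        pos = m.end()                # continue right after this match
    return matches

ex = 'Select Where( :"BWA"n == "BWA005" | :"BWA"n == "BWAZ006" | :"BWA"n == "BWA005" )'
print(find_all_by_advancing(ex))  # ['BWA005', 'BWAZ006', 'BWA005']
```

Because the search position only moves forward, a value that occurs twice is reported twice, which is the behavior the second script is after.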

Re: Regex: can't get all parts of the string that match

lehaofeng — Sun, 10 Mar 2024 10:44:10 GMT

Thank you! It works!

I want to know if there is a better way to do this.

Re: Regex: can't get all parts of the string that match

Craige_Hales — Mon, 11 Mar 2024 11:59:08 GMT

Better for a future maintainer of the JSL, or better for the computer, or some other measure? Here's a couple of ideas; I lean towards the first one, partly because I imagine you started with a JSL expression, not a text string.

A note about comments and seeing trees vs seeing forests: my comments are tree-level. They don't describe your goal (the forest), they describe the trees. Anyone maintaining your JSL in the future will want both kinds. A forest level comment should explain why you are thinking about these values. The tree level is about how a non-obvious bit of code works.

ex = "Select Where( :\!"BWA\!"n == \!"BWA005\!" | :\!"BWA\!"n == \!"BWAZ006\!" )";
// write(ex) // Select Where( :"BWA"n == "BWA005" | :"BWA"n == "BWAZ006" )
// Regex match( ex, "[\!"](BWA|BWAZ|BWAZ_)([0-9]{1,})(.*?)[\!"]" )

// desired result: "BWA005,BWAZ006"
// 
// it isn't 100% clear what the rules are. These examples make different assumptions.

// assume the text is valid JSL that can be parsed back into an expression
// below, also assumes | but not & and no nested parens
expression = Parse( ex ); // Select Where( :BWA == "BWA005" | :BWA == "BWAZ006" )
// get the argument of the select where(...)
expression = Arg( expression, 1 ); // :BWA == "BWA005" | :BWA == "BWAZ006"
// assume there are only | (or) operators, change x|y|z => {x,y,z}
Substitute Into( expression, Expr( Or() ), {} ); // {:BWA == "BWA005", :BWA == "BWAZ006"}
result = {}; // accumulate answer in a result list
// process the expression list
For Each( {op}, expression, // :BWA == "BWA005"   etc.
	// the Right Hand Side of :BWA == "BWA005" is "BWA005"
	RHS = Arg( Name Expr( op ), 2 ); // Equal(:BWA,"BWA005"), want 2nd arg
	// apply the test to see if this is one to keep
	If( !Is Missing( Regex( RHS, "(BWA|BWAZ|BWAZ_)([0-9]+)" ) ),
		Insert Into( result, RHS ); // yes: keep it in the result list
	);
);
// join the list of RHS strings, separated by commas
Show( Concat Items( result, "," ) ); // Concat Items(result, ",") = "BWA005,BWAZ006";


// alternate example


// assume the strings can't collide with column names, perhaps because no
// column name has a suffix number and all strings do have a suffix number
result = {};
Pat Match(
	ex,
	// this is the pattern; it repeats as long as it can
	Pat Repeat(
		// either match the BWA... pattern, in quotes, stashing the regex match into the result:
		( "\!"" + Pat Regex( "(BWA|BWAZ|BWAZ_)([0-9]+)" ) >> result[N Items( result ) + 1] + "\!"" )
		// >> is the Pat Immediate() operator, which copies the LHS match into the RHS location
		// the list is initially 0 items long, and [nitems+1] extends the list by one more item
	| // OR the alternative... 
		Pat Len( 1 ) // advance one character. This is how most of the text is matched.
	)
);
Show( Concat Items( result, "," ) ); // Concat Items(result, ",") = "BWA005,BWAZ006";

// you *could* write a pattern match that would parse the text as an expression,
// but that is what the first example did.
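
The "match the quoted token here, otherwise advance one character" idea from the alternate example can be sketched in Python as below. This is a rough analogue under the same assumption (wanted values are quoted and end in digits), not a description of how Pat Match works internally; `scan_quoted` is a name invented for this illustration.

```python
import re

# A quoted token: opening quote, BWA-style name with digits, closing quote.
QUOTED = re.compile(r'"((?:BWA|BWAZ|BWAZ_)[0-9]+)"')

def scan_quoted(text):
    """Hypothetical helper: walk the text, collecting quoted BWA-style tokens."""
    result = []
    i = 0
    while i < len(text):
        m = QUOTED.match(text, i)      # try the quoted pattern at position i
        if m:
            result.append(m.group(1))  # keep the token without its quotes
            i = m.end()                # jump past the closing quote
        else:
            i += 1                     # otherwise advance one character
    return result

ex = 'Select Where( :"BWA"n == "BWA005" | :"BWA"n == "BWAZ006" )'
print(",".join(scan_quoted(ex)))  # BWA005,BWAZ006
```

The column reference :"BWA"n is skipped because the text between its quotes has no digits, which is the assumption stated at the top of the alternate example.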

Re: Regex: can't get all parts of the string that match

jthi — Mon, 11 Mar 2024 14:14:59 GMT

There is a wish list item for Regex Match. Depending on what you are trying to do (and why and where), there are many different ways of handling this (to name a few already mentioned: loops and JMP's own pattern matching), and there may be other options better suited to your use case.

Re: Regex: can't get all parts of the string that match

hogi — Sat, 13 Apr 2024 19:11:43 GMT

With JMP 18, this got much easier: just use Python's re.findall:

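ex = "Select Where( :\!"BWA\!"n == \!"BWA005\!" | :\!"BWA\!"n == \!"BWAZ006\!" )";
Python Send( ex );
Python Submit( "
import re
# non-capturing group, so findall returns whole matches rather than (prefix, digits) tuples
matches = re.findall(r'(?:BWA|BWAZ)[0-9]+', ex)
" );
Python Get( matches );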
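
One detail of re.findall worth keeping in mind: when the pattern contains capturing groups, it returns the groups (as tuples when there is more than one) rather than the whole match. The standalone Python below, run outside JMP, is a small sketch of the difference; the variable names are only for illustration.

```python
import re

ex = 'Select Where( :"BWA"n == "BWA005" | :"BWA"n == "BWAZ006" )'

# Two capturing groups: findall returns one (prefix, digits) tuple per match.
pairs = re.findall(r"(BWA|BWAZ)([0-9]+)", ex)
print(pairs)   # [('BWA', '005'), ('BWAZ', '006')]

# Non-capturing group: findall returns the whole matched text.
whole = re.findall(r"(?:BWA|BWAZ)[0-9]+", ex)
print(whole)   # ['BWA005', 'BWAZ006']

# The tuples can also be joined back into whole strings after the fact.
joined = ["".join(p) for p in pairs]
print(joined)  # ['BWA005', 'BWAZ006']
```

Either the non-capturing form or the join produces the "BWA005", "BWAZ006" values asked for in the original question.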