Solved: Re: How can use the regex substitution <*>?

Jun 9, 2023 3:07 PM

How can I use regex to replace each <*> in the text of dt[1,"A"] with a tab, so that I get the multi-column result shown in row 2?

Thanks!

abc<ahref="/cindex/list->897</a>"id="s897"</dd>

Craige_Hales · Mar 7, 2021 06:06 AM

I think you want separate formulas for columns b, c, ...

Column b's formula should extract part of the href string.

Column c's formula should extract part of the id string.

The exact formulas might be quite simple if the data is the same shape on every row. For example, if there is always a number just before </a> for column b, (untested code follows)

regex(a, "(\d+)</a>", "\1")

might work. Breaking that pattern down:

(\d+) - the parens create the \1 capture group, \d is a digit, + is one or more
</a> - literal text that must follow the numbers
"\1" - insert capture group 1 into the result
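
As a quick check of that breakdown, here is a minimal sketch that runs the pattern on a shortened, quote-free fragment of the sample string from the question. The variable a stands in for the column value, and the fragment is an assumption made to keep the example small.

```jsl
// Shortened fragment of the question's sample text (assumption: the real
// column value has the same "digits followed by </a>" shape).
a = "list->897</a>";

// With a third argument, Regex() returns that format string with \1 filled in
// from capture group 1, rather than returning the whole match.
b = Regex( a, "(\d+)</a>", "\1" );
Show( b ); // b = "897";
```

If the pattern does not match, Regex() returns a missing value, so a formula column would simply be empty for that row.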

Similarly, if the column c text is always identified by id=,

regex(a,"id=\"[^\"]*\"")

might work.

This pattern uses \" in three places to represent a quotation mark inside a JSL string surrounded by quotation marks. If we remove the outer quotation marks and unescape the inner ones, the string regex sees is id="[^"]*" which breaks down to

match id= and the "
[^"] is a character class that matches any character that is NOT a "
* means zero or more of the not-" characters
followed by "

Simple parsing like this will usually work on well behaved data. In the general case you might need more complicated patterns (if there could be another id= string, etc).

Craige

lala · Mar 7, 2021 06:31 AM

Thanks, Craige!
I tried it, but the JSL failed to run.

Craige_Hales · Mar 7, 2021 06:33 AM

Told you it was untested... use

\!"

for the escaped "

Craige

Craige_Hales · Mar 7, 2021 06:37 AM

I've been writing C code lately. C uses \" to escape a " but JSL uses \!" to escape a ".

All three of the inside-the-string " need the JSL escape.
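
Putting that correction together, a sketch of the column c idea with all three quotation marks escaped the JSL way might look like the following. The sample string is the one from the question. The second call, which uses a capture group to return only the value between the quotes, is an added variation rather than something posted earlier in the thread.

```jsl
// Sample text from the question, with each embedded " written as \!"
a = "abc<ahref=\!"/cindex/list->897</a>\!"id=\!"s897\!"</dd>";

// After JSL unescapes the string, the regex engine sees: id="[^"]*"
c = Regex( a, "id=\!"[^\!"]*\!"" );
Show( c ); // the whole match, including id= and both quotation marks

// Variation (an assumption about what column c should hold):
// capture only the text between the quotation marks and return it via \1.
cValue = Regex( a, "id=\!"([^\!"]*)\!"", "\1" );
Show( cValue );
```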

Craige

lala · Mar 7, 2021 06:54 AM

Thanks, Craige!

I'm using the regex replace feature in EmEditor:

<(.[^>]{0,})>

Script:

document.selection.Replace("<(.[^>]{0,})>","\\t",eeReplaceAll | eeFindReplaceRegExp,eeExFindSeparateCRLF);
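
A JSL counterpart of that EmEditor replacement could be sketched with Regex() and its GLOBALREPLACE option, the same call used later in this thread to strip tags. The sample string is the one from the question. The tag pattern is the simpler <[^>]*> form, not the exact EmEditor expression, and splitting the result with Words() is an added suggestion for getting one value per column.

```jsl
// Sample text from the question, with each embedded " written as \!"
a = "abc<ahref=\!"/cindex/list->897</a>\!"id=\!"s897\!"</dd>";

// Replace every <...> run with a tab; \!t is JSL's escape for a tab character.
tabbed = Regex( a, "<[^>]*>", "\!t", GLOBALREPLACE );
Show( tabbed );

// Words() splits on the tab delimiter and skips empty pieces,
// giving one list item per non-empty field.
fields = Words( tabbed, "\!t" );
Show( fields );
```

Note that Substitute() replaces literal text only, so a regex pattern passed to it is searched for character by character instead of being interpreted.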

Craige_Hales · Mar 7, 2021 11:19 AM

I'm not sure if you are still asking a question. If you are still trying to make something work, I'd need to know what you are trying to do.

Are you trying to use a text editor to reformat the file into a tab-delimited file that JMP can read?

Craige

lala · Mar 8, 2021 04:46 AM

I want to use JSL in JMP directly to get the page title and the list of links, for example from this JMP community web page.

I tried writing it this way, but it didn't work.
Thanks!

u="https://community.jmp.com/t5/Discussions/bd-p/discussions";txt=loadtextfile(u);

t1="";offset=Contains(txt,t1);
If(offset,txt=SubStr(txt,offset+4,Length(txt)));
t2="";offset=Contains(txt,t2,-1);
If(offset,txt=SubStr(txt,1,offset-1));

txt=Substitute(txt," ","","message">","","Discussions/","Discussions/>");
txt=Substitute(txt,"<(.[^>]{0,})>","","\!n","","\!r","");

Craige_Hales · Mar 9, 2021 09:09 PM

/*
THIS IS NOT AN HTML PARSER

HTML parsers are hard to write. There are better ways. 
Neither of these JSL ideas can properly skip comments or
many other HTML tags that should be skipped. They will
find things they shouldn't and miss other things they
should find. They've only been minimally tested on one site.

This code is presented as "how can I use contains() vs patmatch()
to efficiently work through a large string looking for something?"

The second example is FIVE TIMES FASTER and a LITTLE MORE ACCURATE.

The third example uses text explorer to grab links. I don't think
there is a pre-existing regex for the link text descriptions. It works
very similarly to this code; it is NOT an HTML parser either.

If you can use python, you might want to investigate "beautiful soup".
I've not used it, but believe it addresses the "hard to write" issue.


*/


u = "https://community.jmp.com/t5/Discussions/bd-p/discussions";
txt = Load Text File( u );

// a link on a page has at least two parts: 
// the URL and some descriptive text.

// <a ... href="url" ... > description </a>
// p1                    p4            p5

// to parse ALL the links on a page, you'll want 
// some sort of loop. There are many ways to
// write that loop; here are two choices

// using contains and regex. "<a " will be our search token
// and contains() will be our workhorse. Use regex where appropriate. 
// this will find links that it should not find because it does not skip script sections!

dt = New Table( "regex", New Column( "description", Character, "Nominal" ), New Column( "link", Character, "Nominal" ) );
Wait( .1 );
dt << beginDataUpdate;
startTime = Tick Seconds();
pos = 0;
while( (p1 = Contains( txt, "<a ", pos )) != 0, // as long as we can find the start of a tag
	p4 = Contains( txt, ">", p1 + 3 ); // find the end of the opening tag
	p5 = Contains( txt, "</a>", p4 + 1 ); // find the ending tag
	if( p5 > p4, // as long as the end is not zero, we found one, see break() below
		desc = Substr( txt, p4 + 1, p5 - p4 - 1 ); // the visible link description. Images can be here too.
		desc = Regex( desc, "<[^>]*>", "", globalreplace ); // remove span, image, etc tags
		linktext = Substr( txt, p1, p4 - p1 + 1 ); // <a href = "/">
		hreftext = Regex( linktext, "href\s*=\s*(\!"|')(.*?)(\1)", "\2" );// use regex to get the href value
		if( !Is Missing( hreftext ), // sometimes there isn't one
			dt << addrows( 1 );//
			irow = N Rows( dt );//
			dt:link[irow] = hreftext;//
			dt:description[irow] = desc; // 
		);//
	, // else
		Break() // no more end tags. there is maybe a mess of javascript.
	);//
	pos = Max( pos, p5 + 3 ); // advance past the one just found
	Wait( 0 );
);
stoptime = Tick Seconds();
dt << endDataUpdate;
Show( stoptime - starttime ); // .5 sec, 127 links




// using pattern matching and regex
// this is about 5X faster and skips over the garbage in the <script> sections

// two patterns, one for <a> links and one for <script> to skip over
linkpat = "<a " + Pat Break( ">" ) >> linktext + ">" + Pat Arb() >> desctext + "</a>";
scriptpat = "<script" + Pat Break( ">" ) + ">" + Pat Arb() + "</script>";

dt = New Table( "patmatch", New Column( "description", Character, "Nominal" ), New Column( "link", Character, "Nominal" ) );

Wait( .1 );
dt << beginDataUpdate;
startTime = Tick Seconds();

rc = patmatch( txt,
	patrepeat( // repeat the pattern until no more
		(linkpat + pattest(// run some JSL for the linkpat...
			desc = Regex( desctext, "<[^>]*>", "", globalreplace ); // remove span, etc tags
			hreftext = Regex( linktext, "href\s*=\s*(\!"|')(.*?)(\1)", "\2" );//
			if( !Is Missing( hreftext ), //
				dt << addrows( 1 );//
				irow = N Rows( dt );//
				dt:link[irow] = hreftext;//
				dt:description[irow] = desc; // 
			);//
			1; // pattest needs this to succeed
		)) | scriptpat /* maybe skip a script */ | /* maybe skip to the next tag */ (Pat Len( 1 ) + Pat Break( "<" ))
	),
	NULL,IGNORECASE
);

stoptime = Tick Seconds();
dt << endDataUpdate;
Show( stoptime - starttime ); // .5 sec, 128 links

Show( rc ); // should be 1, not 0




///////////////////////////////////////
// text explorer can also do this:

dt=New Table( "textexplorer",
	Add Rows( 1 ),
	New Column( "text", Character, "Nominal", Set Values( evallist({txt}) ) )
);

dt<<Text Explorer(
	Text Columns( :text ),
	Set Regex( Library( "HTML Link Grabber" ) ),
	Language( "English" ),
	SendToReport(
		Dispatch(
			{"Term and Phrase Lists"},
			"",
			TableBox,
			{Sort By Column( 2, 1 )}
		)
	)
);
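
To see what the pieces of linkpat do in isolation, the following sketch runs the same pattern shape against a short made-up string instead of a downloaded page. The string and the names sample, demoPat, rcDemo, attrs, and inner are invented for illustration.

```jsl
// Made-up HTML fragment, for illustration only.
sample = "<p>see <a href='/x'>the <b>docs</b></a> here</p>";

// Same shape as linkpat above:
//   Pat Break( ">" ) consumes characters up to (not including) the next ">"
//   Pat Arb() matches as little as it can before the following "</a>"
//   >> stores the text matched by the pattern on its left into a variable
demoPat = "<a " + Pat Break( ">" ) >> attrs + ">" + Pat Arb() >> inner + "</a>";

rcDemo = Pat Match( sample, demoPat );
Show( rcDemo, attrs, inner );

// The description still contains the <b> tags; strip them as the solution does.
Show( Regex( inner, "<[^>]*>", "", GLOBALREPLACE ) );
```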

Craige

Craige_Hales · Mar 10, 2021 06:10 AM

Ignore the ".5 sec, # links" comments in the JSL. Run your own test.

Comments quickly go bad, especially when copy/paste!
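
One way to run your own test is to wrap the code in the same Tick Seconds() calls the scripts above use. The sketch below times a global tag-stripping Regex() on a synthetic string. The string, the repeat count, and the variable names are made up, and the elapsed time will depend on the machine and the JMP version.

```jsl
// Build a synthetic test string by repeating a small made-up link.
big = Repeat( "<a href='/x'>text <b>bold</b></a> ", 20000 );

startTime = Tick Seconds();
stripped = Regex( big, "<[^>]*>", "", GLOBALREPLACE );
elapsed = Tick Seconds() - startTime;

// Report the measured time rather than trusting a number pasted in a comment.
Show( elapsed, Length( big ), Length( stripped ) );
```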

Craige