lala
Level VIII

How can use the regex substitution <*>?

How can I use regex to replace each <*> in the text of dt[1,"A"] with a tab, so that I get the multi-column result shown in row 2?

2021-03-07_16-36-24.png

Thanks!

abc<ahref="/cindex/list->897</a>"id="s897"</dd>

 

1 ACCEPTED SOLUTION

Craige_Hales
Super User

Re: How can use the regex substitution <*>?

/*
THIS IS NOT AN HTML PARSER

HTML parsers are hard to write. There are better ways. 
Neither of these JSL ideas can properly skip comments or
many other HTML tags that should be skipped. They will
find things they shouldn't and miss other things they
should find. They've only been minimally tested on one site.

This code is presented as "how can I use contains() vs patmatch()
to efficiently work through a large string looking for something?"

The second example is FIVE TIMES FASTER and a LITTLE MORE ACCURATE.

The third example uses text explorer to grab links. I don't think
there is a pre-existing regex for the link text descriptions. It works
very similar to this code; it is NOT an HTML parser either.

If you can use python, you might want to investigate "beautiful soup".
I've not used it, but believe it addresses the "hard to write" issue.


*/


u = "https://community.jmp.com/t5/Discussions/bd-p/discussions";
txt = Load Text File( u );

// a link on a page has at least two parts: 
// the URL and some descriptive text.

// <a ... href="url" ... > description </a>
// p1                    p4            p5

// to parse ALL the links on a page, you'll want 
// some sort of loop. There are many ways to
// write that loop; here are two choices

// using contains and regex. "<a " will be our search token
// and contains() will be our workhorse. Use regex where appropriate. 
// this will find links that it should not find because it does not skip script sections!

dt = New Table( "regex", New Column( "description", Character, "Nominal" ), New Column( "link", Character, "Nominal" ) );
Wait( .1 );
dt << beginDataUpdate;
startTime = Tick Seconds();
pos = 0;
while( (p1 = Contains( txt, "<a ", pos )) != 0, // as long as we can find the start of a tag
	p4 = Contains( txt, ">", p1 + 3 ); // find the end of the opening tag
	p5 = Contains( txt, "</a>", p4 + 1 ); // find the ending tag
	if( p5 > p4, // as long as the end is not zero, we found one, see break() below
		desc = Substr( txt, p4 + 1, p5 - p4 - 1 ); // the visible link description. Images can be here too.
		desc = Regex( desc, "<[^>]*>", "", globalreplace ); // remove span, image, etc tags
		linktext = Substr( txt, p1, p4 - p1 + 1 ); // <a href = "/">
		hreftext = Regex( linktext, "href\s*=\s*(\!"|')(.*?)(\1)", "\2" );// use regex to get the href value
		if( !Is Missing( hreftext ), // sometimes there isn't one
			dt << addrows( 1 );//
			irow = N Rows( dt );//
			dt:link[irow] = hreftext;//
			dt:description[irow] = desc; // 
		);//
	, // else
		Break() // no more end tags. there is maybe a mess of javascript.
	);//
	pos = Max( pos, p5 + 3 ); // advance past the one just found
	Wait( 0 );
);
stoptime = Tick Seconds();
dt << endDataUpdate;
Show( stoptime - starttime ); // .5 sec, 127 links




// using pattern matching and regex
// this is about 5X faster and skips over the garbage in the <script> sections

// two patterns, one for <a> links and one for <script> to skip over
linkpat = "<a " + Pat Break( ">" ) >> linktext + ">" + Pat Arb() >> desctext + "</a>";
scriptpat = "<script" + Pat Break( ">" ) + ">" + Pat Arb() + "</script>";

dt = New Table( "patmatch", New Column( "description", Character, "Nominal" ), New Column( "link", Character, "Nominal" ) );

Wait( .1 );
dt << beginDataUpdate;
startTime = Tick Seconds();

rc = patmatch( txt,
	patrepeat( // repeat the pattern until no more
		(linkpat + pattest(// run some JSL for the linkpat...
			desc = Regex( desctext, "<[^>]*>", "", globalreplace ); // remove span, etc tags
			hreftext = Regex( linktext, "href\s*=\s*(\!"|')(.*?)(\1)", "\2" );//
			if( !Is Missing( hreftext ), //
				dt << addrows( 1 );//
				irow = N Rows( dt );//
				dt:link[irow] = hreftext;//
				dt:description[irow] = desc; // 
			);//
			1; // pattest needs this to succeed
		)) | scriptpat /* maybe skip a script */ | /* maybe skip to the next tag */ (Pat Len( 1 ) + Pat Break( "<" ))
	),
	NULL,IGNORECASE
);

stoptime = Tick Seconds();
dt << endDataUpdate;
Show( stoptime - starttime ); // .5 sec, 128 links

Show( rc ); // should be 1, not 0




///////////////////////////////////////
// text explorer can also do this:

dt=New Table( "textexplorer",
	Add Rows( 1 ),
	New Column( "text", Character, "Nominal", Set Values( evallist({txt}) ) )
);

dt<<Text Explorer(
	Text Columns( :text ),
	Set Regex( Library( "HTML Link Grabber" ) ),
	Language( "English" ),
	SendToReport(
		Dispatch(
			{"Term and Phrase Lists"},
			"",
			TableBox,
			{Sort By Column( 2, 1 )}
		)
	)
);
Craige


11 REPLIES
Craige_Hales
Super User

Re: How can use the regex substitution <*>?

I think you want separate formulas for columns b, c, ...

Column b's formula should extract part of the href string.

Column c's formula should extract part of the id string.

The exact formulas might be quite simple if the data is the same shape on every row. For example, if there is always a number just before </a> for column b, (untested code follows)

 

regex(a, "(\d+)</a>", "\1")

might work. Breaking that pattern down:

 

  • (\d+) - the parens create the \1 capture group, \d is a digit, + is one or more
  • </a> - literal text that must follow the numbers
  • "\1" - insert capture group 1 into the result
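As a minimal sketch of the breakdown above, the pattern can be tried on the sample string from the original question. The variable name a is only an illustrative local variable (in a column formula it would be the column reference instead), and the quotation marks inside the sample are written with JSL's \!" escape.

```jsl
// Sample text from the question; inner quotation marks are escaped as \!"
a = "abc<ahref=\!"/cindex/list->897</a>\!"id=\!"s897\!"</dd>";

// (\d+) captures the run of digits that sits directly before </a>.
// The third argument "\1" asks Regex to return only that captured group.
b = Regex( a, "(\d+)</a>", "\1" );
Show( b );
```

When the pattern does not match, Regex returns a missing value, so a formula built this way leaves the cell empty rather than failing.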

 

Similarly, if the column c text is always identified by id=,

 

regex(a,"id=\"[^\"]*\"")

might work.

 

This pattern uses \" in three places to represent a quotation mark inside a JSL string that is itself surrounded by quotation marks. If we remove the outer quotation marks and unescape the inner ones, the string that regex sees is id="[^"]*", which breaks down to

  •  match id= and the "
  • [^"] is a character class that matches any character that is NOT a "
  • * means zero or more of the not-" characters
  • followed by "

 

Simple parsing like this will usually work on well behaved data. In the general case you might need more complicated patterns (if there could be another id= string, etc).
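For the original request (a tab in place of every <...> tag), one possible sketch uses Regex with GLOBALREPLACE and then splits the result with Words(). The names a, tabbed, and parts are illustrative only, and this assumes the tags never contain a ">" inside a quoted attribute value.

```jsl
// Sample text from the question; inner quotation marks are escaped as \!"
a = "abc<ahref=\!"/cindex/list->897</a>\!"id=\!"s897\!"</dd>";

// <[^>]*> matches "<", then any run of characters that are not ">", then ">".
// GLOBALREPLACE replaces every match with a tab (\!t), not just the first one.
tabbed = Regex( a, "<[^>]*>", "\!t", GLOBALREPLACE );

// Words() splits the text at tabs and returns the pieces as a list.
// Adjacent tabs are treated as a single delimiter, so no empty items appear.
parts = Words( tabbed, "\!t" );
Show( parts );
```

Each item of parts could then be assigned to its own column. Separate per-column formulas, as suggested above, are usually easier to maintain when the data has a fixed shape.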

 

Craige
lala
Level VIII

Re: How can use the regex substitution <*>?

Thanks, Craige!
I tried it, but the JSL failed to run.

2021-03-07_19-27-18.png

2021-03-07_19-28-16.png

Craige_Hales
Super User

Re: How can use the regex substitution <*>?

Told you it was untested... use

\!"

for the escaped "

Craige
Craige_Hales
Super User

Re: How can use the regex substitution <*>?

I've been writing C code lately. C uses \" to escape a " but JSL uses \!" to escape a ".

All three of the inside-the-string " need the JSL escape.
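With the JSL escape applied to all three inner quotation marks, the id= pattern might look like the sketch below. The sample string is the one from the original question, and the variable names a, c, and idvalue are illustrative.

```jsl
// Sample text from the question; inner quotation marks are escaped as \!"
a = "abc<ahref=\!"/cindex/list->897</a>\!"id=\!"s897\!"</dd>";

// After JSL unescapes the string, the regex engine sees:  id="[^"]*"
c = Regex( a, "id=\!"[^\!"]*\!"" );
Show( c );

// Adding a capture group and a "\1" format returns only the text
// between the quotation marks instead of the whole id="..." match.
idvalue = Regex( a, "id=\!"([^\!"]*)\!"", "\1" );
Show( idvalue );
```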

 

Craige
lala
Level VIII

Re: How can use the regex substitution <*>?

Thanks, Craige!

 

I'm using the regular-expression replace in EmEditor:

<(.[^>]{0,})>

 Script:

document.selection.Replace("<(.[^>]{0,})>","\\t",eeReplaceAll | eeFindReplaceRegExp,eeExFindSeparateCRLF);
Craige_Hales
Super User

Re: How can use the regex substitution <*>?

I'm not sure if you are still asking a question. If you are still trying to make something work, I'd need to know what you are trying to do.

 

Are you trying to use a text editor to reformat the file into a tab-delimited file that JMP can read? 

Craige
lala
Level VIII

Re: How can use the regex substitution <*>?

I want to use JSL directly in JMP to get the titles and the list of links from a page.
For example: this JMP community web page,

2021-03-08_17-30-03.png

I tried to write it this way, but it didn't work.

Thanks!
u="https://community.jmp.com/t5/Discussions/bd-p/discussions";txt=loadtextfile(u);

t1="";offset=Contains(txt,t1);
If(offset,txt=SubStr(txt,offset+4,Length(txt)));
t2="";offset=Contains(txt,t2,-1);
If(offset,txt=SubStr(txt,1,offset-1));

txt=Substitute(txt," ","","message">","","Discussions/","Discussions/>");
txt=Substitute(txt,"<(.[^>]{0,})>","","\!n","","\!r","");

 

Craige_Hales
Super User

Re: How can use the regex substitution <*>?

Ignore the ".5 sec, # links" comments in the JSL. Run your own test.

Comments quickly go bad, especially when copied and pasted!

Craige