@lala asked in "How can use the regex substitution <*>?" about grabbing URLs from HTML. Here's a little more detail on three proposed answers. First, the disclaimer: none of these examples parses HTML properly, so all of them can find URLs that are not really there and miss URLs that are.
Parsing HTML is mostly a matter of finding matching start and end tags and looking at the data between them. All of the JSL presented here can be tripped up by HTML features. It might still be good enough; do your own testing. Also see Beautiful Soup, which might be a better answer.
Here's a snip from an HTML web page showing a link (the <a> tag) with a URL (the href= part) and a bit of text (the Discussions part) that will be visible on the page.
[Image: <a> tag with URL and text]
The goal is to make a data table of the URLs and link text, one link per row. A typical web page may have 100 or so links and be a half megabyte of text. One slow way to attack the problem (not shown) is to write a simple regex that finds one link at a time, remove that link from the text, and repeat until no more can be found. If there are only one or two links, great. But if there could be hundreds, every pass searches and copies the whole half megabyte again, so a few hundred megabytes of data end up being manipulated and searched.
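To make the cost of that slow approach concrete, here is a minimal sketch in Python rather than JSL (Python only because it is easy to run anywhere). The function name, the sample text, and the `copied` counter are all made up for illustration; the point is only that every pass searches from the start and builds a whole new copy of the text.

```python
import re

# Hypothetical sample; a real page would be about half a megabyte.
SAMPLE = '<p><a href="/one">One</a> and <a href="/two">Two</a></p>'

# A simple "one link" pattern, not a real HTML parser.
LINK = re.compile(r"<a [^>]*>.*?</a>", re.IGNORECASE | re.DOTALL)

def grab_links_slowly(text):
    """Find one link, cut it out of the text, and repeat from the start."""
    found = []
    copied = 0  # total characters copied while rebuilding the text
    while True:
        m = LINK.search(text)  # always searches from the beginning
        if m is None:
            break
        found.append(m.group(0))
        text = text[:m.start()] + text[m.end():]  # builds a new string
        copied += len(text)
    return found, copied

links, copied = grab_links_slowly(SAMPLE)
print(links, copied)
```

With n links in a page of size s, the copying alone is roughly n times s characters, which is where the hundreds of megabytes come from.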
A faster way is to search through the data, always moving forward, building the data table as links are found. The first example uses the contains() function which has a third argument to specify a starting position. Contains() also returns a found position. By updating the start position to just beyond the found position, contains() can efficiently look through the text without needing to back up or modify the text.
// Positions used below, for a link such as <a href="...">Discussions</a>:
// p1 = the "<a " that starts the tag, p4 = the ">" that ends it, p5 = the closing "</a>"
u = "https://community.jmp.com/t5/Discussions/bd-p/discussions";
txt = Load Text File( u );
dt = New Table( "regex", New Column( "description", Character, "Nominal" ), New Column( "link", Character, "Nominal" ) );
Wait( .1 );
dt << beginDataUpdate;
startTime = Tick Seconds();
pos = 0;
while( (p1 = Contains( txt, "<a ", pos )) != 0,
    p4 = Contains( txt, ">", p1 + 3 );
    p5 = Contains( txt, "</a>", p4 + 1 );
    if( p4 > 0 & p5 > p4, // the end of the tag and the closing </a> were both found
        desc = Substr( txt, p4 + 1, p5 - p4 - 1 );
        desc = Regex( desc, "<[^>]*>", "", globalreplace );
        linktext = Substr( txt, p1, p4 - p1 + 1 );
        hreftext = Regex( linktext, "href\s*=\s*(\!"|')(.*?)(\1)", "\2" );
        if( !Is Missing( hreftext ),
            dt << addrows( 1 );
            irow = N Rows( dt );
            dt:link[irow] = hreftext;
            dt:description[irow] = desc;
        );
    , // else: incomplete link, so there is nothing more to find
        Break()
    );
    pos = Max( pos, p5 + 3 );
);
stopTime = Tick Seconds();
dt << endDataUpdate;
Show( stopTime - startTime );
The heart of that code is the line
while( (p1 = Contains( txt, "<a ", pos )) != 0,
which is doing a bunch of jobs: it uses pos in the 3rd argument to skip over any previous work, it finds the next <a that begins a link tag, it stores that position in p1 (see comment at top of JSL), and it compares the position to zero. Zero means nothing found, so the loop stops. The next line
p4 = Contains( txt, ">", p1 + 3 );
searches for the > that ends the <a tag; p4 (see comment) points there. Starting at p1+3 makes contains() find a > after p1 rather than re-finding an earlier occurrence. The next line is similar, looking for the closing tag after p4:
p5 = Contains( txt, "</a>", p4 + 1 );
p5 will point to the </a> tag that ends the link text. Many, but not all, HTML tags work like that: a start tag without the slash and an end tag with it. Sometimes the start and end tag are the same, using <tagname ... />, but this JSL ignores that possibility.
The if statement determines whether something was found and, if so, picks up the description like this:
desc = Substr( txt, p4 + 1, p5 - p4 - 1 );
p4 and p5 point to the > and the <, so p4+1 is the first description character and p5-p4-1 is the length of the description. The next line cleans up the description by removing embedded tags:
desc = Regex( desc, "<[^>]*>", "", globalreplace );
GlobalReplace means the regex finds the pattern and replaces it with nothing, as many times as possible. The pattern matches <, followed by zero or more characters that are not >, followed by >. That might be an <img ... > tag or a bunch of text styling tags. The end result is the text of the link without any picture/color/font/etc.
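The same tag-stripping idea can be tried outside JMP. The sketch below uses Python's re.sub, which replaces every match by default, as an analogy to the GlobalReplace call; it is not JSL, and the function name strip_tags and the sample string are hypothetical.

```python
import re

def strip_tags(fragment):
    """Remove every <...> run, keeping only the text between the tags."""
    return re.sub(r"<[^>]*>", "", fragment)

# Hypothetical link body with an embedded image and a styling tag.
body = '<img src="icon.png"> <b>Discussions</b>'
print(repr(strip_tags(body)))
```

Note that whitespace between the removed tags survives, so the result may need trimming.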
The next two lines are similar, for grabbing the URL rather than the description
linktext = Substr( txt, p1, p4 - p1 + 1 );
hreftext = Regex( linktext, "href\s*=\s*(\!"|')(.*?)(\1)", "\2" );
p1 and p4 (see comment) mark the start and end of the entire link tag; linktext is the link tag, including a number of unwanted bits. The hreftext line uses regex() to extract just the URL between the href quotation marks or apostrophes. The regex pattern uses the backreference \1 to make sure the trailing apostrophe or quotation mark matches the leading one. The first open paren in the pattern makes a capture group (for \1) that matches " or '. The second open paren makes a capture group (for \2) that matches reluctantly, moving forward only as far as needed to reach the closing delimiter. The third open paren makes capture group 3; those parens and that grouping are not really needed. The result, \2, is just the URL.
All but done: if a URL was found, add a row to the table, and then, very important, advance the pos variable!
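To experiment with the backreference outside JMP, here is a small Python sketch of the same pattern (without the unneeded third group). It is an analogy, not JSL; extract_href and the sample tags are hypothetical names and data.

```python
import re

# Group 1 is the opening quote character, group 2 is the URL, and \1
# requires the closing quote to be the same character as the opening one.
HREF = re.compile(r"""href\s*=\s*("|')(.*?)\1""")

def extract_href(link_tag):
    """Return the URL inside href="..." or href='...', or None if absent."""
    m = HREF.search(link_tag)
    return m.group(2) if m else None

print(extract_href('<a class="x" href="https://example.com/a">'))
print(extract_href("<a href='it\"s.html'>"))
print(extract_href('<a name="top">'))
```

The second call shows why the backreference matters: a quotation mark inside an apostrophe-delimited value does not end the URL early.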
pos = Max( pos, p5 + 3 );
Max makes sure pos never moves backwards at the end, which would leave the loop stuck re-finding the same link.
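For comparison, the whole forward scan can be sketched in Python, whose str.find also takes a starting position. This is an analogy under a few assumptions, not a translation of the JSL: Python positions are 0-based and find returns -1 when nothing is found (JSL positions are 1-based and contains() returns 0), and grab_links and the sample page are hypothetical.

```python
import re

TAGS = re.compile(r"<[^>]*>")
HREF = re.compile(r"""href\s*=\s*("|')(.*?)\1""")

def grab_links(text):
    """Scan forward once, collecting (description, url) pairs."""
    rows = []
    pos = 0
    while True:
        p1 = text.find("<a ", pos)       # next opening tag at or after pos
        if p1 == -1:
            break
        p4 = text.find(">", p1 + 3)      # the > that ends the opening tag
        if p4 == -1:
            break
        p5 = text.find("</a>", p4 + 1)   # the closing tag
        if p5 == -1:
            break
        desc = TAGS.sub("", text[p4 + 1:p5])   # link text without embedded tags
        m = HREF.search(text[p1:p4 + 1])       # look only inside the <a ...> tag
        if m:
            rows.append((desc, m.group(2)))
        pos = p5 + 4                     # resume just past "</a>"
    return rows

# Hypothetical page fragment: two real links and one anchor without an href.
page = ('<p><a href="/one"><b>One</b></a> x <a name="n">skip</a> '
        "<a href='/two'>Two</a></p>")
print(grab_links(page))
```

Because pos only ever increases, the text is searched once and never modified.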
A Faster Way
The code above is pretty fast, but this is even better.
u = "https://community.jmp.com/t5/Discussions/bd-p/discussions";
txt = Load Text File( u );
// linkpat stores the inside of the <a ...> tag in linktext and the text before </a> in desctext
linkpat = "<a " + Pat Break( ">" ) >> linktext + ">" + Pat Arb() >> desctext + "</a>";
// scriptpat matches a whole <script ...> ... </script> section so it can be skipped
scriptpat = "<script" + Pat Break( ">" ) + ">" + Pat Arb() + "</script>";
dt = New Table( "patmatch", New Column( "description", Character, "Nominal" ), New Column( "link", Character, "Nominal" ) );
Wait( .1 );
dt << beginDataUpdate;
startTime = Tick Seconds();
rc = patmatch( txt,
    patrepeat(
        (linkpat + pattest(
            desc = Regex( desctext, "<[^>]*>", "", globalreplace );
            hreftext = Regex( linktext, "href\s*=\s*(\!"|')(.*?)(\1)", "\2" );
            if( !Is Missing( hreftext ),
                dt << addrows( 1 );
                irow = N Rows( dt );
                dt:link[irow] = hreftext;
                dt:description[irow] = desc;
            );
            1; // always succeed so the matcher keeps moving forward
        )) | scriptpat | (Pat Len( 1 ) + Pat Break( "<" ))
    ),
    NULL, IGNORECASE
);
stopTime = Tick Seconds();
dt << endDataUpdate;
Show( stopTime - startTime );
Show( rc );
This time the loop is inside the pattern matcher, using patrepeat() to push the pattern through the string. The two patterns
linkpat = "<a " + Pat Break( ">" ) >> linktext + ">" + Pat Arb() >> desctext + "</a>";
scriptpat = "<script" + Pat Break( ">" ) + ">" + Pat Arb() + "</script>";
are doing two different jobs: linkpat matches some text and stores parts of the match in linktext and desctext. Scriptpat matches some text between <script ... and ... </script> and ignores it. The <script> tag in HTML defines a JavaScript code section that should be ignored. In this example case the JavaScript contains a lot of <a sequences that make false positives, and skipping the JavaScript eliminates them. More on that in a moment. The looping is now done by the patmatch() that begins
rc = patmatch( txt,
At the end the show(rc) verifies the patmatch succeeded. But first, the second argument to patmatch is the pattern
patrepeat(
    (linkpat + pattest(
        desc = Regex( desctext, "<[^>]*>", "", globalreplace );
        hreftext = Regex( linktext, "href\s*=\s*(\!"|')(.*?)(\1)", "\2" );
        if( !Is Missing( hreftext ),
            dt << addrows( 1 );
            irow = N Rows( dt );
            dt:link[irow] = hreftext;
            dt:description[irow] = desc;
        );
        1;
    )) | scriptpat | (Pat Len( 1 ) + Pat Break( "<" ))
),
which can be simplified into three parts
patrepeat(
    linkpat
    |
    scriptpat
    |
    (Pat Len( 1 ) + Pat Break( "<" ))
)
which means that at each step along the way one of those three patterns will match some text. First see if linkpat matches; if so, it also does the work that was simplified away. If not, see if scriptpat matches; that will skip some JavaScript. Otherwise, skip over one character (usually a <), plus enough characters to reach the next <. Then loop for more.
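The three-way choice can be mimicked with a regex alternation in Python, shown here only as an analogy to the patrepeat() structure and not as JSL. The names STEP and grab_links_single_pass and the sample page are hypothetical. finditer plays the role of the repeat: each match starts where the previous one ended, and the last alternative always consumes at least one character.

```python
import re

STEP = re.compile(
    r"<a (?P<attrs>[^>]*)>(?P<desc>.*?)</a>"   # a link: keep attributes and body
    r"|<script[^>]*>.*?</script>"              # a script section: matched, then ignored
    r"|.[^<]*",                                # otherwise: one char, then up to the next <
    re.IGNORECASE | re.DOTALL,
)
HREF = re.compile(r"""href\s*=\s*("|')(.*?)\1""")
TAGS = re.compile(r"<[^>]*>")

def grab_links_single_pass(text):
    """Walk the text once, trying link, then script, then filler at each step."""
    rows = []
    for m in STEP.finditer(text):
        if m.group("attrs") is None:
            continue  # a script section or filler text: nothing to record
        href = HREF.search(m.group("attrs"))
        if href:
            rows.append((TAGS.sub("", m.group("desc")), href.group(2)))
    return rows

# Hypothetical page: the link inside the script is a false positive to avoid.
page = ("<script>var s = \"<a href='/fake'>x</a>\";</script>"
        '<p><A href="/real">Real <i>link</i></A></p>')
print(grab_links_single_pass(page))
```

Because the script alternative swallows the whole JavaScript section in one step, the <a sequence inside it is never offered to the link alternative.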
The part that was simplified away is the JSL that runs when the linkpat succeeds (remember linkpat saves some text in linktext and desctext as it is matching them)
linkpat + pattest(
    desc = Regex( desctext, "<[^>]*>", "", globalreplace );
    hreftext = Regex( linktext, "href\s*=\s*(\!"|')(.*?)(\1)", "\2" );
    if( !Is Missing( hreftext ),
        dt << addrows( 1 );
        irow = N Rows( dt );
        dt:link[irow] = hreftext;
        dt:description[irow] = desc;
    );
    1;
)
That should look very much like the JSL in the first example that writes rows to the data table. pattest() is a special pattern matching function that returns 1 if the match should succeed and 0 if it should fail. Here pattest is used to run some JSL and always returns 1 so the matcher can move forward. The JSL it runs isn't really testing if the matcher needs to back up and retry; the JSL is just updating the table.
I believe this is faster than the previous example because the previous example calls contains() many times, and each call has some setup time even when it is given a starting position. Here, there is a single call to patmatch, with only a single setup overhead.
Both examples use regex for smaller jobs within the bigger picture: extracting the link text and URL from a short string.
Text Explorer, also proposed in the answers, is another possible solution. It has a library of pre-written regexes that includes a link grabber, but not a descriptive text grabber. It might be easier for some cases, but probably won't be faster.