Grab URL from HTML

Craige_Hales · Mar 11, 2021 06:40 PM

@lala asked in How can use the regex substitution <*>? about grabbing URLs from HTML. Here's a little more detail on three proposed answers. First, the disclaimer: none of these examples do a proper job of parsing HTML, so all of them can find URLs that are not really there and miss other URLs that are really there.

/*
THIS IS NOT AN HTML PARSER

HTML parsers are hard to write. There are better ways. 
Neither of these JSL ideas can properly skip comments or
many other HTML tags that should be skipped. They will
find things they shouldn't and miss other things they
should find. They've only been minimally tested on one site.

This code is presented as "how can I use contains() vs patmatch()
to efficiently work through a large string looking for something?"

The second example is FIVE TIMES FASTER and a LITTLE MORE ACCURATE.

The third example uses text explorer to grab links. I don't think
there is a pre-existing regex for the link text descriptions. It works
very similar to this code; it is NOT an HTML parser either.

If you can use python, you might want to investigate "beautiful soup".
I've not used it, but believe it addresses the "hard to write" issue.
*/

The problems in parsing HTML are mostly about finding matching start and end tags and looking at the data between them. All of the JSL presented here can be tripped up by HTML features. But, it might be good enough; do your own testing. Also see Beautiful Soup which might be a better answer.

Here's a snip from an HTML web page showing a link (the <a> tag) with a URL (the href= part) and a bit of text (the Discussions part) that will be visible on the page.

[Image: an <a> tag with URL and text]

The goal is to make a data table of the URLs and link text, one per row. A typical web page may have 100 or so links and be a half megabyte of text. One slow way to attack the problem (not shown) is to write a simple regex that finds one link at a time, remove that link from the text, and repeat until no more can be found. If there are only one or two links, great. But if there could be hundreds, the whole text is copied and searched again for every link, and a few hundred megabytes of data are manipulated and searched.

A faster way is to search through the data, always moving forward, building the data table as links are found. The first example uses the contains() function which has a third argument to specify a starting position. Contains() also returns a found position. By updating the start position to just beyond the found position, contains() can efficiently look through the text without needing to back up or modify the text.

u = "https://community.jmp.com/t5/Discussions/bd-p/discussions";
txt = Load Text File( u );

// a link on a page has at least two parts: 
// the URL and some descriptive text.

// <a ... href="url" ... > description </a>
// p1                    p4            p5

// using contains and regex. "<a " will be the search token
// and contains() will do the work. Use regex where appropriate. 
// this will find links that it should not find because it does not skip script sections!

dt = New Table( "regex", New Column( "description", Character, "Nominal" ), New Column( "link", Character, "Nominal" ) );
Wait( .1 );
dt << beginDataUpdate;
startTime = Tick Seconds();
pos = 0;
while( (p1 = Contains( txt, "<a ", pos )) != 0, // as long as we can find the start of a tag
	p4 = Contains( txt, ">", p1 + 3 ); // find the end of the opening tag
	p5 = Contains( txt, "</a>", p4 + 1 ); // find the ending tag
	if( p5 > p4, // as long as the end is not zero, we found one, see break() below
		desc = Substr( txt, p4 + 1, p5 - p4 - 1 ); // the visible link description. Images can be here too.
		desc = Regex( desc, "<[^>]*>", "", globalreplace ); // remove span, image, etc tags
		linktext = Substr( txt, p1, p4 - p1 + 1 ); // <a href = "/">
		hreftext = Regex( linktext, "href\s*=\s*(\!"|')(.*?)(\1)", "\2" );// use regex to get the href value
		if( !Is Missing( hreftext ), // sometimes there isn't one
			dt << addrows( 1 );
			irow = N Rows( dt );
			dt:link[irow] = hreftext;
			dt:description[irow] = desc;
		);
	, // else
		Break() // no more end tags. there is maybe a mess of javascript.
	);
	pos = Max( pos, p5 + 3 ); // advance past the one just found
);
stoptime = Tick Seconds();
dt << endDataUpdate;
Show( stoptime - starttime );

The heart of that code is the line

while( (p1 = Contains( txt, "<a ", pos )) != 0,

which is doing a bunch of jobs: it uses pos in the 3rd argument to skip over any previous work, it finds the next <a that begins a link tag, it stores that position in p1 (see comment at top of JSL), and it compares the position to zero. Zero means nothing found, so the loop stops. The next line

p4 = Contains( txt, ">", p1 + 3 );

searches for the matching > that ends the <a tag. p4 (see comment) points there. By starting at p1+3, contains() finds one after p1, not re-finding an earlier occurrence. The next line is similar, looking for the closing tag after p4

p5 = Contains( txt, "</a>", p4 + 1 );

p5 will point to the </a> tag that ends the link text. Many, but not all, HTML tags work like that, a start tag without the / and an end tag with the slash. Sometimes the start and end tag are the same, using <tagname ... />, but this JSL is ignoring that possibility.
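To make the three positions concrete, here is a small worked example on a made-up one-line string. The string is an assumption for illustration and is not from the page above; positions in a JSL string start at 1.

```jsl
// made-up sample text, not from the real page
txt = "x <a href=\!"/u\!">Go</a> y";
p1 = Contains( txt, "<a ", 1 );       // 3, the < that starts the link tag
p4 = Contains( txt, ">", p1 + 3 );    // 15, the > that ends the opening tag
p5 = Contains( txt, "</a>", p4 + 1 ); // 18, the < that starts the closing tag
Show( p1, p4, p5 );
```

Each call starts just past the previous find, so nothing is searched twice.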

The if statement determines if something was found, and if so, picks up the description like this

desc = Substr( txt, p4 + 1, p5 - p4 - 1 );

p4 and p5 point at the > and the <, so p4+1 is the first description character and p5-p4-1 is the length of the description. The next line cleans up the description by removing embedded tags

desc = Regex( desc, "<[^>]*>", "", globalreplace );

GlobalReplace means the regex will find the pattern, replace that text with nothing, as many times as possible. The pattern matches <, followed by zero or more characters that are not >, followed by > which might be an <img ... > tag or a bunch of text styling tags. The end result is the text of the tag without any picture/color/font/etc.
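A quick way to see the effect is to try the same call on a short string. The description text here is made up for illustration.

```jsl
// made-up description text with a styling tag and an image tag
desc = "<span class='t'>JMP</span> <img src='logo.png'>Discussions";
clean = Regex( desc, "<[^>]*>", "", GLOBALREPLACE );
Show( clean ); // "JMP Discussions"
```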

The next two lines are similar, for grabbing the URL rather than the description

linktext = Substr( txt, p1, p4 - p1 + 1 ); // <a href = "/">
hreftext = Regex( linktext, "href\s*=\s*(\!"|')(.*?)(\1)", "\2" );// use regex to get the href value

p1 and p4 (see comment) are the positions of the start and end of the opening link tag; linktext is that whole tag, including a number of unwanted bits. The hreftext line uses regex() to extract just the URL between the href quotation marks or apostrophes. The regex pattern uses the backref \1 to make sure the trailing apostrophe or quotation mark matches the leading one. The first open paren in the regex pattern makes a capture group (for \1) that matches " or '. The second open paren makes a capture group (for \2) that matches reluctantly, moving forward only as far as needed to reach the closing delimiter. The third open paren is capture group 3; those parens and that grouping are not really needed. The result, \2, is just the URL.
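Here is a small sketch of that href pattern on its own. The opening tags are made up; only the pattern comes from the script above. It shows the quotation mark case, the apostrophe case, and a tag with no href at all.

```jsl
// the pattern is the one used above; the sample tags are assumptions
hrefpat = "href\s*=\s*(\!"|')(.*?)(\1)";
Show( Regex( "<a class=\!"nav\!" href=\!"/t5/Discussions\!">", hrefpat, "\2" ) ); // "/t5/Discussions"
Show( Regex( "<a href = '/t5/Uncharted'>", hrefpat, "\2" ) ); // "/t5/Uncharted"
Show( Is Missing( Regex( "<a name=\!"top\!">", hrefpat, "\2" ) ) ); // 1, this tag has no href
```

The missing result in the last case is what the Is Missing() test in the loop relies on.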

All but done: if there is a URL (hreftext is not missing), add a row to the table, and then, very important, advance the pos variable!

pos = Max( pos, p5 + 3 );

Max makes sure it doesn't go backwards at the end and get stuck in a loop.

A Faster Way

The code above is pretty fast, but this is even better.

u = "https://community.jmp.com/t5/Discussions/bd-p/discussions";
txt = Load Text File( u );
// using pattern matching and regex
// this is about 5X faster and skips over the garbage in the <script> sections

// two patterns, one for <a> links and one for <script> to skip over
linkpat = "<a " + Pat Break( ">" ) >> linktext + ">" + Pat Arb() >> desctext + "</a>";
scriptpat = "<script" + Pat Break( ">" ) + ">" + Pat Arb() + "</script>";

dt = New Table( "patmatch", New Column( "description", Character, "Nominal" ), New Column( "link", Character, "Nominal" ) );

Wait( .1 );
dt << beginDataUpdate;
startTime = Tick Seconds();

rc = patmatch( txt,
	patrepeat( // repeat the pattern until no more
		(linkpat + pattest(// run some JSL for the linkpat...
			desc = Regex( desctext, "<[^>]*>", "", globalreplace ); // remove span, etc tags
			hreftext = Regex( linktext, "href\s*=\s*(\!"|')(.*?)(\1)", "\2" );
			if( !Is Missing( hreftext ),
				dt << addrows( 1 );
				irow = N Rows( dt );
				dt:link[irow] = hreftext;
				dt:description[irow] = desc;
			);
			1; // pattest needs this to succeed
		)) | scriptpat /* maybe skip a script */ | /* maybe skip to the next tag */ (Pat Len( 1 ) + Pat Break( "<" ))
	),
	NULL,IGNORECASE
);

stoptime = Tick Seconds();
dt << endDataUpdate;
Show( stoptime - starttime );

Show( rc ); // should be 1, not 0

This time the loop is inside the pattern matcher, using patrepeat() to push the pattern through the string. The two patterns

linkpat = "<a " + Pat Break( ">" ) >> linktext + ">" + Pat Arb() >> desctext + "</a>";
scriptpat = "<script" + Pat Break( ">" ) + ">" + Pat Arb() + "</script>";

are doing two different jobs: linkpat matches some text and stores parts of the match in linktext and desctext. Scriptpat matches some text between <script ... and ... </script> and ignores it. The <script> tag in HTML defines a JavaScript code section that should be ignored. In this example case the JavaScript contains a lot of <a sequences that make false positives, and skipping the JavaScript eliminates them. More on that in a moment. The looping is now done by the patmatch() that begins

rc = patmatch( txt,

At the end the show(rc) verifies the patmatch succeeded. But first, the second argument to patmatch is the pattern

	patrepeat( // repeat the pattern until no more
		(linkpat + pattest(// run some JSL for the linkpat...
			desc = Regex( desctext, "<[^>]*>", "", globalreplace ); // remove span, etc tags
			hreftext = Regex( linktext, "href\s*=\s*(\!"|')(.*?)(\1)", "\2" );//
			if( !Is Missing( hreftext ), //
				dt << addrows( 1 );//
				irow = N Rows( dt );//
				dt:link[irow] = hreftext;//
				dt:description[irow] = desc; // 
			);//
			1; // pattest needs this to succeed
		)) | scriptpat /* maybe skip a script */ | /* maybe skip to the next tag */ (Pat Len( 1 ) + Pat Break( "<" ))
	),

which can be simplified into three parts

patrepeat(
   linkpat // part 1
|
   scriptpat // part 2
|
   (Pat Len( 1 ) + Pat Break( "<" )) // part 3
)

which means each step along the way one of those three patterns will match some text. First see if the linkpat matches; that will do some of the work that was simplified away. If not, see if the scriptpat matches; that will skip some JavaScript. Otherwise, skip over one character (usually a <), plus enough characters to reach the next <. Then loop for more.

The part that was simplified away is the JSL that runs when the linkpat succeeds (remember linkpat saves some text in linktext and desctext as it is matching them)

linkpat + pattest(
			desc = Regex( desctext, "<[^>]*>", "", globalreplace );
			hreftext = Regex( linktext, "href\s*=\s*(\!"|')(.*?)(\1)", "\2" );
			if( !Is Missing( hreftext ),
				dt << addrows( 1 );
				irow = N Rows( dt );
				dt:link[irow] = hreftext;
				dt:description[irow] = desc;
			);
			1; // pattest needs this to succeed
		)

That should look very much like the JSL in the first example that writes rows to the data table. pattest() is a special pattern matching function that returns 1 if the match should succeed and 0 if it should fail. Here pattest is used to run some JSL and always returns 1 so the matcher can move forward. The JSL it runs isn't really testing if the matcher needs to back up and retry; the JSL is just updating the table.

I believe this is faster than the previous example because the previous example calls contains() with a starting position, and it still has some set up time to do that. Here, there is a single call to patmatch, with only a single setup overhead.

Both examples use regex for smaller jobs within the bigger picture: extracting the link text and URL from a short string.

The answer also mentions Text Explorer as a possible solution. It has a library of pre-written regexes that includes a link grabber, but not a descriptive text grabber. It might be easier for some cases, but probably won't be faster.

ron_horne · ‎12-01-2022

Thanks @Craige_Hales

I have found this https://uibakery.io/regex-library/url online,

how can i use it in a column formula assuming i have another column with texts that include urls?

Craige_Hales · ‎12-01-2022

I suspect that regex is good, but not perfect. Probably good enough. I think there may be Unicode URLs and, without doing some research, I'm not positive the character sets are perfect even for non-Unicode cases. And I'm not sure the top level domain names (.com, .net, etc.) are still limited to 6 characters; Wikipedia suggests .academy, for example. I think many people have attempted to make this regex with varying degrees of success; there is a Stack Overflow question about it. As presented, it includes escaped forward slashes, which I simplified, and ^ and $, which I removed because my test case below needs to find a URL in the midst of a longer text.

// modified from  https://uibakery.io/regex-library/url 
rpat = "https?://(?:www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b(?:[-a-zA-Z0-9()@:%_\+.~#?&/=]*)";
// what it matches: 
// http with an optional s
// ://
// an optional www
// between 1 and 256 characters from the set shown
// a period, followed by a 1 to 6 character top level domain name
// a word boundary
// a run of zero or more characters from the set
regex("some arbitrary text https://community.jmp.com/t5/Uncharted/Grab-URL-from-HTML/bc-p/576161#M359 and more text",rpat);
// -> "https://community.jmp.com/t5/Uncharted/Grab-URL-from-HTML/bc-p/576161#M359"

So I might make a column formula to find the first URL in another column by writing something like

regex(another,"https?://(?:www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b(?:[-a-zA-Z0-9()@:%_\+.~#?&/=]*)")

If you expect multiple URLs in the text column, you need to decide what to do with the extra ones. One choice could be to use a formula column of type expression (instead of a character) with a { list } for the expression to hold zero or more URLs as the list items. It is possible to use patmatch with a regex in the formula to populate the list for each row, or you could brute force with a for loop that chops away URLs from a source column string one at a time. Brute force is good enough (and easier to maintain) if the strings are short and not too many URLs.
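A minimal sketch of the brute force idea, written as a plain script rather than a column formula. The sample text and the variable names s, u, p, and urls are made up for illustration; the pattern is the one above.

```jsl
// pattern from above; sample text and variable names are assumptions
rpat = "https?://(?:www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b(?:[-a-zA-Z0-9()@:%_\+.~#?&/=]*)";
s = "see https://community.jmp.com/t5/Uncharted and http://www.jmp.com/en_us/home.html for more";
urls = {};
u = Regex( s, rpat ); // first URL, or missing if there is none
While( !Is Missing( u ),
	Insert Into( urls, u ); // keep the URL just found
	p = Contains( s, u ); // where that URL starts in the remaining text
	s = Substr( s, p + Length( u ) ); // chop away everything through the end of it
	u = Regex( s, rpat ); // look for the next one
);
Show( N Items( urls ), urls );
```

Each pass copies the rest of the string, which is the slow pattern described at the top of the post, so this only makes sense for short strings.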

ron_horne · ‎12-02-2022

thank you very much @Craige_Hales

with one little amendment it did the job.

you gave me this:

"https?://(?:www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b(?:[-a-zA-Z0-9()@:%_\+.~#?&/=]*)"

which i changed to:

"https?://(?:www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b(?:[-a-zA-Z0-9()@:%_\+.~#?&=]*)"

by removing the last / between the & and the =

could you please help me solve the following?

1) text with just www (and no http) are not retrieved

2) if at the end of the domain there is a ? (and not / ) it brings the whole rest of the url

the following script give an example.

thank you!

New Table( "regex2",
	Add Rows( 7 ),
	New Column( "another",
		Character,
		"Nominal",
		Set Values(
			{"https://www.youtube.com/watch?v=fYUDuLN3a_k",
			"{\!"https://www.youtube.com/watch?v=fYUDuLN3a_k\!"}",
			"https://www.youtube.com?watch?v=fYUDuLN3a_k",
			"www.bbc.co.uk/news/world-europe-63832151",
			"{\!"https://www.bbc.co.uk/news/world-europe-63832151\!"}",
			"bbc.co.uk/news/world-europe-63832151",
			"https://www.bbc.co.uk/news/world-europe-63832151"}
		),
		Set Display Width( 440 )
	),
	New Column( "regex",
		Character,
		"Nominal",
		Formula(
			Regex(
				:another,
				"https?://(?:www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b(?:[-a-zA-Z0-9()@:%_\+.~#?&=]*)"
			)
		),
		Set Selected,
		Set Display Width( 424 )
	)
)

Craige_Hales · ‎12-02-2022

In community.jmp.com, community is called a subdomain. www is another example of a very common subdomain.

.com is the top level domain (TLD).

JMP is the site, or server.

The http: vs https: vs ftp: vs file: etc. part is the protocol.

I'm not positive, but I think the / after the TLD is required. Maybe not.

The text after the TLD is pretty much free form, using a limited set of characters. Most servers use that text like a path in a file system because that's how most servers work, or at least that was true in the beginning.

/t5/Uncharted/Grab-URL-from-HTML/bc-p/576503

The JMP server doesn't really have a directory structure like that. It is a virtual directory path that the server will use with a database to generate this page. Interestingly, there is a similar URL you can get from the red triangle permalink:

/t5/Uncharted/Grab-URL-from-HTML/bc-p/576503/highlight/true#M361

The # tells the browser to scroll to a tagged part of the HTML when the URL opens.

Then there is the ? and & parts. Most servers interpret the URL data after the ? as a set of parameters separated by &, but that is largely a convention. You probably won't run into anything else. Depending on your use case, it may make sense to throw away the parameters. Here's a search URL:

https://community.jmp.com/t5/forums/searchpage/tab/message?advanced=false&allow_punctuation=false&filter=location&location=blog-board:chales-blog&q=argyle
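If the parameters matter, one possible way to pull that search URL apart is Words() with both ? and & as delimiters. This is only a sketch and assumes the usual ?name=value&name=value convention described above; the variable names are made up.

```jsl
url = "https://community.jmp.com/t5/forums/searchpage/tab/message?advanced=false&allow_punctuation=false&filter=location&location=blog-board:chales-blog&q=argyle";
parts = Words( url, "?&" ); // every ? or & is a delimiter
Show( parts[1] ); // the URL without its parameters
For( i = 2, i <= N Items( parts ), i++,
	Write( parts[i], "\!N" ); // one name=value pair per line
);
```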

Here are some ideas.

New Table( "regex2",
	Add Rows( 7 ),
	New Column( "another",
		Character,
		"Nominal",
		Set Values(
			{"https://www.youtube.com/watch?v=fYUDuLN3a_k",
			"{\!"https://www.youtube.com/watch?v=fYUDuLN3a_k\!"}",
			"https://www.youtube.com?watch?v=fYUDuLN3a_k",
			"www.bbc.co.uk/news/world-europe-63832151",
			"{\!"https://www.bbc.co.uk/news/world-europe-63832151\!"}",
			"bbc.co.uk/news/world-europe-63832151",
			"https://www.bbc.co.uk/news/world-europe-63832151"}
		),
		Set Display Width( 440 )
	),
	New Column( "regexA",
		Character,
		"Nominal",
		Formula(
			Regex(
				:another, // original, with the http or https made optional
				"(?:https?://)?(?:www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b(?:[-a-zA-Z0-9()@:%_\+.~#?/&=]*)"
			)
		),
		Set Selected,
		Set Display Width( 424 )
	),
	New Column( "regexB",
		Character,
		"Nominal",
		Formula(
			Regex(
				:another,// ignore anything after the .com or .net or etc
				"(?:https?://)?(?:www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b"
			)
		),
		Set Selected,
		Set Display Width( 424 )
	),
	New Column( "regexC",
		Character,
		"Nominal",
		Formula(
			Regex(
				:another,// the added plain parens identify the \1 backreference
				"(?:https?://)?((?:www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b)(?:[-a-zA-Z0-9()@:%_\+.~#?/&=]*)","\1"
			)
		),
		Set Selected,
		Set Display Width( 424 )
	),
	New Column( "regexD",
		Character,
		"Nominal",
		Formula(
			Regex(
				:another, // drop the parameters after the ?...by removing the ? from the pattern...
				"(?:https?://)?(?:www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b(?:[-a-zA-Z0-9()@:%_\+.~#/&=]*)"
			)
		),
		Set Selected,
		Set Display Width( 424 )
	)
)

ron_horne · ‎12-02-2022

Thank you @Craige_Hales this is an outstanding contribution.

I have learned a lot from your examples and can get all I need.