Problem Parsing XML

peri_a · Oct 17, 2018 09:59 AM

I need to parse an HTML page to extract some info, but I am getting unexpected results. See the script below.

HTMLPageAsText = "
<html>
	<head>
		<title>String 1 i want to get</title>
	</head>
	<body>String 2 i want to get
		<h2>Possibly also this</h2>
		<h2>why i get only this</h2>
	</body>
</html>
";

PageTitle = "";
PageBody = "";

Parse XML( HTMLPageAsText,
	On Element( "title",
		End Tag( PageTitle = XML Text(); Show( "Found title" ) )
	),
	On Element( "body",
		End Tag( PageBody = XML Text(); Show( "Found body" ) )
	)
);
Show( PageTitle, PageBody );

What I ideally need in the variable PageBody is:

PageBody="String 2 i want to get
<h2>Possibly also this</h2>
<h2>why i get only this</h2>"

Or, if it is only possible to get the content of the tag excluding the subtags, I would expect to get:

PageBody="String 2 i want to get"

Instead, what I get is:

PageBody="why i get only this"

What am I doing wrong?

Craige_Hales · Oct 17, 2018 11:20 AM

As far as I can tell, you are only missing something that may never have been documented: Text(), which is used like End Tag().

HTMLPageAsText = "
<html>
	<head>
		<title>one of four</title>
	</head>
	<body>two of four
		<h2>three of four</h2>
		<h2>four of four</h2>
	</body>
</html>
";

title = "";
body = "";
	
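Parse XML( HTMLPageAsText,
	On Element( "title",
		// End Tag() runs once, when </title> is reached
		End Tag( title = XML Text() )
	),
	On Element( "body",
		// Text() runs for every snippet of text, so append each one
		Text( body = body || XML Text() )
	)
);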

show(title,body);

title = "one of four";
body = "two of four
three of fourfour of four";

Text() runs each time a new snippet of text is processed, which is why the body handler appends instead of assigning.
HTML and XML are usually not the same; your example works because it also happens to be valid XML. A number of HTML tags have no matching closing tag, and those break an XML reader.
If the XML is a bit more complicated, you might need to track the nesting levels too.
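
One way the nesting idea could look, as a minimal untested sketch: a flag (the hypothetical variable inH2) is set while the parser is inside an <h2>, and the body handler only keeps text seen while the flag is 0. It relies on the Text() handler described above, which may be undocumented; the flag name and the choice to skip <h2> text are assumptions made for illustration.

```jsl
xmlText = "<body>two of four<h2>three of four</h2><h2>four of four</h2></body>";

bodyOnly = "";
inH2 = 0; // hypothetical flag: 1 while the parser is inside an <h2>

Parse XML( xmlText,
	On Element( "h2",
		Start Tag( inH2 = 1 ),
		End Tag( inH2 = 0 )
	),
	On Element( "body",
		// keep a snippet only when it is not nested inside an <h2>
		Text( If( inH2 == 0, bodyOnly = bodyOnly || XML Text() ) )
	)
);

Show( bodyOnly );
```

For deeper structures, a counter incremented in Start Tag() and decremented in End Tag() would generalize the single flag.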

(I reworked your example while I was puzzling over how to do it. I'll see if I can get this documented. Thanks!)

Craige

peri_a · Oct 17, 2018 11:42 AM

Thanks!

This will do for now.

However, for future development, an additional command for On Element(), called something like XML Body(), that returns the whole content of a tag (including nested tags, as text) could be really useful.

On the comment about HTML vs. XML, I get the point. If the web page becomes more complicated, I will try to handle it by substituting unclosed tags with their XML-compliant equivalents. I could even envision a loop that tracks the unclosed tags and substitutes the correct XML version. However, it would be great if the XML parser raised a warning but continued executing when it hits an error, similar to what web browsers do with faulty HTML.
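
The substitution idea might be sketched with Regex() and its GLOBALREPLACE option. This is an untested illustration, not a general HTML fixer: the tag list (br, hr, img) is an assumption chosen for the example, and an attribute value containing a > character would defeat the pattern.

```jsl
html = "<body>line one<br>line two<hr><img src='a.png'></body>";

// rewrite <br>, <hr> and <img ...> as self-closing tags so an XML parser accepts them;
// tags that are already self-closing keep their single slash
fixed = Regex(
	html,
	"<(br|hr|img)\b([^>]*?)/?>",
	"<\1\2/>",
	GLOBALREPLACE, IGNORECASE
);

Show( fixed );
```

The cleaned string could then be passed to Parse XML() in place of the raw page text.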

Finally, regarding documentation: I agree that the documentation for parsing XML is somewhat minimal, so while you are updating it, please also include the second argument of XML Attr(). At the moment the documentation only describes calling it with 0 or 1 arguments, but I found (and used) a piece of code that passes 2 arguments, where the second one is the string returned if the attribute is not found.

I think this is a useful feature that is not documented at the moment.
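
For readers who have not seen the two-argument form, it would presumably be used as below. This is a sketch based only on the description in this post: the second argument acting as a not-found default has not been checked against the documentation, and the element and attribute names are made up for the example.

```jsl
xmlText = "<items><item id=\!"1\!" unit=\!"kg\!"/><item id=\!"2\!"/></items>";

units = {};

Parse XML( xmlText,
	On Element( "item",
		// second argument: string reported to be returned when "unit" is missing
		Start Tag( Insert Into( units, XML Attr( "unit", "n/a" ) ) )
	)
);

Show( units );
```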

Thanks.
