Solved: Re: Extracting a section of a webpage

ron_horne · Feb 21, 2020 3:43 PM

Dear Members of the community,

I am trying to extract a section of text from a long and messy string. I would like to extract the description part of a YouTube video.

For example, in this video: https://youtu.be/yvoddqG-lm8 I would like to extract just the description part:

“Mia Stephens shows how to perform basic statistical analyses in JMP. She covers using Distribution to analyze data one variable at a time. Using Fit Y by X for analyses involving two variables, and using Fit Model for analyses involving more than two variables. She also reviews tools for summarizing and graphing data. This video is part three is a series on learning the basics of using JMP to make the most of your JMP 30-day free trial or your new JMP license. JMP Academic Ambassador Mia Stephens demonstrates how to navigate the JMP menus and data tables, import data into JMP, summarize and graph data and perform basic statistical analyses. This demo uses JMP 11, which will be available in September. See what's coming in JMP 11: http://www.jmp.com/software/preview-j...”

I managed to get the whole page source as a string using the following command:

page = Open( "https://youtu.be/yvoddqG-lm8" );

I have noticed that the description part I am looking for appears a few times in the page, in particular after these terms:

\\!"description\\!":{\\!"runs\\!":[{\\!"text\\!":\\!"

Or: \\!"description\\!":{\\!"simpleText\\!":\\!"

Any suggestions?

Am I heading in the right direction by extracting the whole page code, or is there a way to extract just the section I am looking for directly?

Thank you.

txnelson · Feb 22, 2020 12:16 AM

Ron,

I have written a couple of parsers along the lines of what you appear to need. I typically don't use the Open() function; instead I use Load Text File() to read the file into one long string. I then use Contains() to find the starting point and Substr() to keep all characters after the offset that Contains() returns. Since you are reading an HTML file, you can use the structure of its markup language to choose your start points and delimiters.

Best of luck

Jim

ron_horne · Feb 22, 2020 07:48 PM

Thank you, @txnelson for the good directions. Using the most basic character functions I have managed to do this:

Names Default To Here( 1 );
page = open ("https://youtu.be/yvoddqG-lm8");

startingpoint = Contains( page, "description\\!":{\\!"simpleText\\!":\\!"" );
rightremain = Substr( page, startingpoint + 32, Length( page ) );
endpoint = Contains( rightremain, "\\!"},\\!"lengthSeconds" );
final = Substr( rightremain, 0, endpoint - 1 );

Show( final );

At this point, when I tried to duplicate this to other YouTube pages, I found out that the prefix and suffix for the full description are somewhat inconsistent so I would have to use a variety of them to find the correct startingpoint and endpoint for each.
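One way to cope with the inconsistent delimiters might be to keep lists of candidate prefixes and suffixes and try them in order. The following is only a sketch: the sample string is a hypothetical stand-in for the page text (the backslashes that appear before the quotation marks in the real page are left out for readability), and the two lists are assumptions that would need extending as new variants turn up.

```jsl
Names Default To Here( 1 );

// Hypothetical stand-in for the page text; a real page would come from Open() or Load Text File().
page = "...\!"description\!":{\!"simpleText\!":\!"Full description text\!"},\!"lengthSeconds\!":\!"600\!"...";

// Candidate delimiters, tried in order (assumed values, extend as needed).
prefixes = {"\!"description\!":{\!"runs\!":[{\!"text\!":\!"", "\!"description\!":{\!"simpleText\!":\!""};
suffixes = {"\!"}]", "\!"},"};

final = "";
For( i = 1, i <= N Items( prefixes ) & final == "", i++,
	startingpoint = Contains( page, prefixes[i] );
	If( startingpoint > 0,
		// Keep everything after the prefix that was found.
		rightremain = Substr( page, startingpoint + Length( prefixes[i] ), Length( page ) );
		For( j = 1, j <= N Items( suffixes ) & final == "", j++,
			endpoint = Contains( rightremain, suffixes[j] );
			// Cut at the first suffix that occurs in the remainder.
			If( endpoint > 1,
				final = Substr( rightremain, 1, endpoint - 1 )
			);
		);
	);
);

Show( final );
```

Using Length( prefixes[i] ) instead of a hard-coded offset such as 32 keeps the starting point correct when a prefix of a different length matches.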

Mark_Bailey · Feb 22, 2020 3:41 AM

I like @txnelson's suggestion. You could also use a regular expression on the character string. You could steal, er, I mean study the regex for HTML tags in the Text Explorer. Launch it on any column with character data. (You're not really analyzing text here.) Click the red triangle at the top of the platform and select Parsing Options > Customize Regex. There are several expressions related to HTML. Click the Add button (+) below the list of expressions to go to the library. Add the ones that have promising names. The editor will let you examine and modify the expressions. You will get immediate feedback with the sample text at the top or the examples in the definition.

This one might be as straightforward as:

Names Default to Here( 1 );

page = Load Text File( "https://youtu.be/yvoddqG-lm8" );

description = Regex( page, "<meta name=\!"description\!" content=\!"(.+?)\!">", "\1" );

Like I said, I like the previous suggestion. But it is good to have options.

Craige_Hales · Feb 24, 2020 4:55 AM

From a while back: Twitter Screen Scraping and FFT-Based Cross Fade

It uses a really simple regex to locate URLs for images:

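	// Excerpt: dt (a data table with a url column) and duplicate (an associative array) are presumably created earlier in the original script.
	x = Load Text File( "https://twitter.com/search?q=football&src=typd" );//cat%20OR%20dog%20OR%20turtle%20OR%20rabbit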
	Pat Match( x, // look for links to jpgs in the html
		Pat Regex( "https://[^\!"']+?\.jpg" ) >> link + Pat Test(
			// reject some links to small images and duplicates
			If( !Contains( link, "bigger" ) & !Contains( link, "normal" ) & duplicate << Contains( link ) == 0,
				duplicate[link] = 1; // remember this keeper
				dt << addrows( 1 ); // make a row for it
				dt:url = link;
				dt << runformulas;// make sure it loads
			);
			1;
		) + Pat Fail()// this forces the match to keep trying, recording all links
	);

It loops, without using PatRepeat, with PatFail(). PatFail never matches and causes the matcher to retry at the next position. Each time it finds that https://<text without " or '>.jpg pattern, it runs some JSL to process the URL.
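The same looping idea can be tried on a small literal string, without going to the network. This is only a sketch: the sample text, the example.com URLs, and the links list are made up for illustration, and the duplicate check mirrors the one in the script above.

```jsl
Names Default To Here( 1 );

// Hypothetical sample text standing in for a downloaded page.
x = "<img src='https://example.com/a.jpg'> text <img src=\!"https://example.com/b.jpg\!">";

links = {};
Pat Match( x,
	Pat Regex( "https://[^\!"']+?\.jpg" ) >> link + Pat Test(
		// Record each link once.
		If( Contains( links, link ) == 0,
			Insert Into( links, link )
		);
		1; // always succeed so the matcher goes on to Pat Fail()
	) + Pat Fail() // never matches, so the matcher retries at the next position
);

Show( links );
```

After the run, links should hold the two sample URLs. Pat Match itself reports no match, because Pat Fail() is the last element of the pattern; the useful work happens inside Pat Test.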

Just tried it, apparently football is still a viable search term. You'll get around 20 pictures on the first query, and a new picture or so every few minutes after that.

Mark's on the right path, using the HTML tags to identify the content you want.

Craige

ron_horne · Feb 22, 2020 07:51 PM

Thank you @Mark_Bailey for your solution. It is robust in the sense that it retrieves a clean string from any YouTube video I tried it on. Yet it returns a shortened version of the description (as it turns out, the code for each video includes the full description twice and the short description three times).

I tried to amend the regex to capture the long version, but I am not sure why my attempt does not work as well as yours. Do you notice what is wrong with my script?

Names Default To Here( 1 );
page = open ("https://youtu.be/yvoddqG-lm8");

// this works well for the short version of the description.
final = Regex ( page, "<meta name=\!"description\!" content=\!"(.+?)\!">", "\1" );

// not working for the long version of the description.
final = Regex ( page, "description\\!":{\!"simpleText\!":\!"(.+?)\\!"},", "\1" );

Craige_Hales · Feb 22, 2020 08:46 PM

You are trying to grab the text from a JavaScript JSON snippet. I think it will be easier to grab it from the HTML further down. I'm using . in place of quotation marks just to avoid \ escaping. This <div>...</div> works because there are no nested <div> elements within this one that would make it end early. Regex doesn't handle nested structures by itself (but pattern matching can).

Note the div's id= is identifying the text you want, so it might be fairly robust, for a while.

page = open ("https://youtu.be/yvoddqG-lm8");

division = Regex ( page, "<div id=.watch-description-text. class=..>(.*?)</div","\1"  );
withouttags = regex(division,"<[^>]*>","",globalreplace);

"Mia Stephens shows how to perform basic statistical analyses in JMP. She covers using Distribution to analyze data one variable at a time. Using Fit Y by X for analyses involving two variables, and using Fit Model for analyses involving more than two variables. She also reviews tools for summarizing and graphing data.This video is part three is a series on learning the basics of using JMP to make the most of your JMP 30-day free trial or your new JMP license. JMP Academic Ambassador Mia Stephens demonstrates how to navigate the JMP menus and data tables, import data into JMP, summarize and graph data and perform basic statistical analyses. This demo uses JMP 11, which will be available in September. See what's coming in JMP 11: http://www.jmp.com/software/preview-j..."

Craige

ron_horne · Feb 23, 2020 05:14 AM

Thank you @Craige_Hales

This strikes like lightning! It extracts exactly what I wanted and is robust across different YouTube pages.

This is my final version:

New Table( "Video list",
	Add Rows( 3 ),
	New Script(
		"bring description from YouTube",
		For( i = 1, i <= N Rows(), i++,
			page = Open( :link[i] );
			division = Regex(
				page,
				"<div id=.watch-description-text. class=..>(.*?)</div",
				"\1"
			);
			withouttags = Regex( division, "<[^>]*>", "", globalreplace );
			:webpagetext[i] = withouttags;
		)
	),
	New Column( "Video",
		Character,
		"Nominal",
		Set Values(
			{"Getting Started With JMP, Part 1", "Getting Started With JMP, Part 2",
			"Getting Started With JMP, Part 3"}
		)
	),
	New Column( "Link",
		Character( 83 ),
		"Nominal",
		Set Values(
			{
			"https://www.youtube.com/watch?v=xge-f1KV_oc&list=PL411D719858B57C47&index=2&t=5s",
			"https://www.youtube.com/watch?v=xhZVuDrKiEA&list=PL411D719858B57C47&index=3&t=0s",
			"https://www.youtube.com/watch?v=1M8LzJ8bjwg&list=PL411D719858B57C47&index=4&t=0s"
			}
		)
	),
	New Column( "webpagetext", Character, "Nominal", Set Values( {"", "", ""} ) )
)

Craige_Hales · Feb 23, 2020 4:49 AM

Great!

For anyone reading along: Way up at the top, @Mark_Bailey introduced the reluctant ? operator:

description = Regex( page, "<meta name=\!"description\!" content=\!"(.+?)\!">", "\1" );

I recycled it as .*? but it is the same idea and pretty important. The ? following + or * makes + or * reluctant instead of greedy. + means one or more and * means zero or more of whatever is to its left side... in this case a period. Period matches anything. Greedy means the * or + will repeat the . (that matches any character) all the way to the end of the string, which is ~50,000 characters. In the example above, regex still needs to find "> . It can't match at the end of the text, so regex begins releasing the greedily acquired characters, one at a time, testing for "> . Most likely it will find "> that belongs to some other quoted string near the end of the string and wind up accepting text you want plus a bunch of extra text.

In the reluctant version, .*? initially matches zero characters, but does not find the "> . So the regex pushes the .*? forward one character and tests again. Reluctant will find the very next "> , not one near the end of the text.

Often the greedy behavior and the reluctant behavior get the same answer, and greedy is faster. But in this example the answers are different and reluctant is faster (because it will reluctantly advance a few dozen characters, while the greedy case might have to go forward thousands of characters and then back up over most of them).
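The difference is easy to see on a short literal string. In this sketch the text variable is a made-up stand-in for the page, containing two meta tags so that there is more than one "> to find.

```jsl
Names Default To Here( 1 );

// Hypothetical two-tag sample; the real page is tens of thousands of characters long.
text = "<meta name=\!"description\!" content=\!"short text\!"><meta name=\!"keywords\!" content=\!"jmp, statistics\!">";

// Greedy: .+ runs to the end of the string, then backs up to the LAST "> it can find.
greedy = Regex( text, "content=\!"(.+)\!">", "\1" );

// Reluctant: .+? grows one character at a time and stops at the FIRST "> it meets.
reluctant = Regex( text, "content=\!"(.+?)\!">", "\1" );

Show( greedy, reluctant );
// greedy holds:    short text"><meta name="keywords" content="jmp, statistics
// reluctant holds: short text
```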

Craige

LNitz · Jul 12, 2021 6:23 AM

I have tried this neat procedure and have run it without errors. But I do not know how to embed it so that the text I recover gets posted to a JMP file. When I run this, or a slightly different version pointing to a different web page, the file I defined (called Page) receives no data from the web page.

Here is my code (including the original source file as a comment):

/*page = open ("https://youtu.be/yvoddqG-lm8");;*/

page = Open( "https://www.theatlantic.com/ideas/archive/2021/07/republicans-anti-history-marjorie-taylor-greene/619403" );
division = Regex( page, "<div id=.watch-description-text. class=..>(.*?)</div", "\1" );
withouttags = Regex( division, "<[^>]*>", "", globalreplace );
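For anyone reading along, here is one possible way to get the recovered text into a data table. It is a sketch under assumptions: the page string is a hypothetical stand-in for the downloaded text, and the table name "Page" and column name "description" are made up. Note that the div id watch-description-text comes from the YouTube page discussed earlier in the thread, so a page from another site most likely has no such div; in that case Regex() finds no match and returns a missing value, which the sketch checks for.

```jsl
Names Default To Here( 1 );

// Hypothetical stand-in for the page text; normally this would come from Open() or Load Text File().
page = "<html><div id=\!"watch-description-text\!" class=\!"\!"><p>First line.<br />Second line.</p></div></html>";

division = Regex( page, "<div id=.watch-description-text. class=..>(.*?)</div", "\1" );

// Regex() returns missing when the pattern is not found, so guard before stripping tags.
withouttags = If( Is Missing( division ),
	"",
	Regex( division, "<[^>]*>", "", GLOBALREPLACE )
);

// Put the result into a one-row data table (table and column names are assumptions).
dt = New Table( "Page", New Column( "description", Character ) );
dt << Add Rows( 1 );
dt:description[1] = withouttags;
```

If withouttags comes back empty for a real page, the pattern probably needs to be adapted to the tags that surround the wanted text on that particular site.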