ron_horne
Super User

Extracting a section of a webpage

Dear Members of the community,

I am trying to extract a section of text from a long and messy string. I would like to extract the description part of a YouTube video.

 

For example, in this video: https://youtu.be/yvoddqG-lm8 I would like to extract just the description part:

 

“Mia Stephens shows how to perform basic statistical analyses in JMP. She covers using Distribution to analyze data one variable at a time. Using Fit Y by X for analyses involving two variables, and using Fit Model for analyses involving more than two variables. She also reviews tools for summarizing and graphing data. This video is part three is a series on learning the basics of using JMP to make the most of your JMP 30-day free trial or your new JMP license. JMP Academic Ambassador Mia Stephens demonstrates how to navigate the JMP menus and data tables, import data into JMP, summarize and graph data and perform basic statistical analyses. This demo uses JMP 11, which will be available in September. See what's coming in JMP 11: http://www.jmp.com/software/preview-j...”

 

I managed to get the whole web page source as a string using the following command:

page = Open( "https://youtu.be/yvoddqG-lm8" );

I have noticed that the description I am looking for appears several times in the page, in particular after these terms:

\\!"description\\!":{\\!"runs\\!":[{\\!"text\\!":\\!"

Or: \\!"description\\!":{\\!"simpleText\\!":\\!"

 

Any suggestions?

 

Am I heading in the right direction by extracting the whole page source, or is there a way to extract just the section I am looking for directly?

Thank you.

1 ACCEPTED SOLUTION: the reply by Craige_Hales further down that matches the watch-description-text <div>.

9 REPLIES
txnelson
Super User

Re: Extracting a section of a webpage

Ron,

I have written a couple of parsers along the lines of what you appear to need. I typically don't use the Open() function; instead I use Load Text File() to read the file into one long string. I then use Contains() to find the starting point, and Substr() to keep all characters after the offset that Contains() returns. Since you are reading an HTML file, you can use the structure of its markup language to choose your start points and delimiters.

 

Best of luck

Jim
ron_horne
Super User

Re: Extracting a section of a webpage

Thank you, @txnelson  for the good directions. Using the most basic character functions I have managed to do this:

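Names Default To Here( 1 );
page = Open( "https://youtu.be/yvoddqG-lm8" );

// Find the marker that precedes the full description.
startingpoint = Contains( page, "description\\!":{\\!"simpleText\\!":\\!"" );
// Skip past the marker itself (it is 32 characters long).
rightremain = Substr( page, startingpoint + 32, Length( page ) );
// The description ends where the lengthSeconds field begins.
endpoint = Contains( rightremain, "\\!"},\\!"lengthSeconds" );
// JSL string positions are 1-based, so start at 1.
final = Substr( rightremain, 1, endpoint - 1 );

Show( final );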

At this point, when I tried to duplicate this to other YouTube pages, I found out that the prefix and suffix for the full description are somewhat inconsistent so I would have to use a variety of them to find the correct startingpoint and endpoint for each.
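One way to cope with inconsistent prefixes and suffixes is to try several candidate markers in turn and stop at the first pair that is found. The following is only a minimal sketch, run on a made-up sample string rather than a live page. The helper name extractBetween and the marker lists are illustrative assumptions, not values confirmed for every YouTube page.

```jsl
Names Default To Here( 1 );

// Hypothetical helper: return the text between the first start marker found
// in `text` and the first end marker that follows it, or "" if nothing fits.
extractBetween = Function( {text, starts, ends},
	{Default Local},
	For( s = 1, s <= N Items( starts ), s++,
		pos = Contains( text, starts[s] );
		If( pos > 0,
			// keep everything after the start marker
			rest = Substr( text, pos + Length( starts[s] ), Length( text ) );
			For( e = 1, e <= N Items( ends ), e++,
				stop = Contains( rest, ends[e] );
				If( stop > 0,
					Return( Substr( rest, 1, stop - 1 ) )
				);
			);
		);
	);
	"";
);

// Made-up sample standing in for the page source.
sample = "x\\!"description\\!":{\\!"simpleText\\!":\\!"Hello world\\!"},\\!"lengthSeconds\\!":\\!"10\\!"";

// Candidate markers, most specific first (assumed, extend as needed).
starts = {
	"\\!"description\\!":{\\!"simpleText\\!":\\!"",
	"\\!"description\\!":{\\!"runs\\!":[{\\!"text\\!":\\!""
};
ends = {"\\!"},\\!"lengthSeconds", "\\!"}"};

Show( extractBetween( sample, starts, ends ) );
```

Listing the most specific markers first matters, because the loop returns on the first start and end pair that it finds.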

markbailey

Re: Extracting a section of a webpage

I like @txnelson's suggestion. You could also use a regular expression on the character string. You could steal, er, I mean study the regex for HTML tags in the Text Explorer. Launch it on any column with character data. (You're not really analyzing text here.) Click the red triangle at the top of the platform and select Parsing Options > Customize Regex. There are several expressions related to HTML. Click the Add button (+) below the list of expressions to go to the library. Add the ones that have promising names. The editor will let you examine and modify the expressions. You will get immediate feedback with the sample text at the top or the examples in the definition.

 

This one might be as straightforward as:

 

Names Default to Here( 1 );

page = Load Text File( "https://youtu.be/yvoddqG-lm8" );

description = Regex( page, "<meta name=\!"description\!" content=\!"(.+?)\!">", "\1" );

 

Like I said, I like the previous suggestion. But it is good to have options.

Learn it once, use it forever!
Craige_Hales
Staff (Retired)

Re: Extracting a section of a webpage

From a while back: Twitter Screen Scraping and FFT-Based Cross Fade

It uses a really simple regex to locate URLs for images:

 

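	// Excerpt: dt (a data table with a character column url) and duplicate
	// (an Associative Array of links already kept) are created earlier in the full script.
	x = Load Text File( "https://twitter.com/search?q=football&src=typd" );//cat%20OR%20dog%20OR%20turtle%20OR%20rabbit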
	Pat Match( x, // look for links to jpgs in the html
		Pat Regex( "https://[^\!"']+?\.jpg" ) >> link + Pat Test(
			// reject some links to small images and duplicates
			If( !Contains( link, "bigger" ) & !Contains( link, "normal" ) & duplicate << Contains( link ) == 0,
				duplicate[link] = 1; // remember this keeper
				dt << addrows( 1 ); // make a row for it
				dt:url = link;
				dt << runformulas;// make sure it loads
			);
			1;
		) + Pat Fail()// this forces the match to keep trying, recording all links
	);

It loops, without using Pat Repeat, by means of Pat Fail(). Pat Fail() never matches, which causes the matcher to retry at the next position. Each time it finds the pattern https://<text without " or '>.jpg (in the JSL source, \!" is just the escape for a quotation mark), it runs some JSL to process the URL.
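To see the Pat Fail() loop in isolation, here is a cut-down sketch of the same construction that only collects the matched URLs into a list. The HTML fragment and its URLs are made up, and the list name links is an assumption for illustration. The pattern expression itself mirrors the one in the script above.

```jsl
Names Default To Here( 1 );

// Made-up HTML fragment; the URLs are placeholders, not real images.
html = "<img src='https://example.com/a.jpg'> text <img src=\!"https://example.com/b.jpg\!">";

links = {};
Pat Match(
	html,
	// capture each candidate URL into link
	Pat Regex( "https://[^\!"']+?\.jpg" ) >> link + Pat Test(
		Insert Into( links, link ); // record the URL just matched
		1; // a nonzero result lets the match carry on to Pat Fail()
	) + Pat Fail() // never matches, so the matcher retries further along
);

Show( links );
```

Because the overall match is designed to fail, all useful work has to happen as a side effect inside Pat Test().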

Just tried it, apparently football is still a viable search term. You'll get around 20 pictures on the first query, and a new picture or so every few minutes after that.

Mark's on the right path, using the HTML tags to identify the content you want.

 

 

Craige
ron_horne
Super User

Re: Extracting a section of a webpage

Thank you @markbailey  for your solution. It is robust in the sense that it retrieves a clean string from every YouTube video I tried it on. However, it returns a shortened version of the description (as it turns out, the code for each video includes the full description twice and the short description three times).

I tried to amend the regex to capture the long version, but I am not sure why my attempt does not work as well as yours. Do you notice what is wrong with my script?

Names Default To Here( 1 );
page = open ("https://youtu.be/yvoddqG-lm8");

// this works well for the short version of the description.
final = Regex ( page, "<meta name=\!"description\!" content=\!"(.+?)\!">", "\1" );

// not working for the long version of the description.
final = Regex ( page, "description\\!":{\!"simpleText\!":\!"(.+?)\\!"},", "\1" );

 

Craige_Hales
Staff (Retired)

Re: Extracting a section of a webpage

You are trying to grab the text from a JavaScript JSON snippet. I think it will be easier to grab it from the HTML further down. I'm using . (which matches any single character) in place of quotation marks just to avoid \ escaping. This <div>...</div> match works because there are no nested <div> elements within this one that would make it end early. Regex doesn't handle nested structures by itself (but pattern matching can).

Note the div's id= is identifying the text you want, so it might be fairly robust, for a while.

page = open ("https://youtu.be/yvoddqG-lm8");

division = Regex ( page, "<div id=.watch-description-text. class=..>(.*?)</div","\1"  );
withouttags = regex(division,"<[^>]*>","",globalreplace);

"Mia Stephens shows how to perform basic statistical analyses in JMP. She covers using Distribution to analyze data one variable at a time. Using Fit Y by X for analyses involving two variables, and using Fit Model for analyses involving more than two variables. She also reviews tools for summarizing and graphing data.This video is part three is a series on learning the basics of using JMP to make the most of your JMP 30-day free trial or your new JMP license. JMP Academic Ambassador Mia Stephens demonstrates how to navigate the JMP menus and data tables, import data into JMP, summarize and graph data and perform basic statistical analyses. This demo uses JMP 11, which will be available in September. See what's coming in JMP 11: http://www.jmp.com/software/preview-j..."

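For anyone who wants to try the two steps without fetching a page, here is a small sketch that runs the same two Regex() calls on a made-up fragment. The sample markup is an assumption shaped like the part of the page being matched, not a copy of YouTube's HTML.

```jsl
Names Default To Here( 1 );

// Made-up fragment: one target div followed by an unrelated div.
sample = "<div id=\!"watch-description-text\!" class=\!"\!"><p>First line.<br />Second <a href=\!"#\!">link</a></p></div><div>other</div>";

// Step 1: keep only what sits between the opening tag and the first </div.
division = Regex( sample, "<div id=.watch-description-text. class=..>(.*?)</div", "\1" );

// Step 2: delete every remaining tag.
withouttags = Regex( division, "<[^>]*>", "", GLOBALREPLACE );

Show( division, withouttags );
```

Tags are deleted rather than replaced with a space, so text on either side of a <br /> runs together. That is why the result above reads "graphing data.This video".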
Craige

ron_horne
Super User

Re: Extracting a section of a webpage

Thank you @Craige_Hales 

This strikes like lightning! It extracts exactly what I wanted and is robust across different YouTube pages.

This is my final version:

New Table( "Video list",
	Add Rows( 3 ),
	New Script(
		"bring description from YouTube",
		For( i = 1, i <= N Rows(), i++,
			page = Open( :link[i] );
			division = Regex(
				page,
				"<div id=.watch-description-text. class=..>(.*?)</div",
				"\1"
			);
			withouttags = Regex( division, "<[^>]*>", "", globalreplace );
			:webpagetext[i] = withouttags;
		)
	),
	New Column( "Video",
		Character,
		"Nominal",
		Set Values(
			{"Getting Started With JMP, Part 1", "Getting Started With JMP, Part 2",
			"Getting Started With JMP, Part 3"}
		)
	),
	New Column( "Link",
		Character( 83 ),
		"Nominal",
		Set Values(
			{
			"https://www.youtube.com/watch?v=xge-f1KV_oc&list=PL411D719858B57C47&index=2&t=5s",
			"https://www.youtube.com/watch?v=xhZVuDrKiEA&list=PL411D719858B57C47&index=3&t=0s",
			"https://www.youtube.com/watch?v=1M8LzJ8bjwg&list=PL411D719858B57C47&index=4&t=0s"
			}
		)
	),
	New Column( "webpagetext", Character, "Nominal", Set Values( {"", "", ""} ) )
)
Craige_Hales
Staff (Retired)

Re: Extracting a section of a webpage

Great!

For anyone reading along: Way up at the top, @markbailey  introduced the reluctant ? operator:

 

description = Regex( page, "<meta name=\!"description\!" content=\!"(.+?)\!">", "\1" );

 

I recycled it as .*? but it is the same idea and pretty important. The ? following + or * makes + or * reluctant instead of greedy. + means one or more and * means zero or more of whatever is to its left side... in this case a period. Period matches anything. Greedy means the * or + will repeat the . (that matches any character) all the way to the end of the string, which is ~50,000 characters. In the example above, regex still needs to find "> . It can't match at the end of the text, so regex begins releasing the greedily acquired characters, one at a time, testing for "> .  Most likely it will find "> that belongs to some other quoted string near the end of the string and wind up accepting text you want plus a bunch of extra text.

 

In the reluctant version, .*? initially matches zero characters, but does not find the "> . So the regex pushes the .*? forward one character and tests again. Reluctant will find the very next "> , not one near the end of the text.

Often the greedy behavior and the reluctant behavior get the same answer, and greedy is faster. But in this example the answers are different and reluctant is faster, because it will reluctantly advance a few dozen characters, while the greedy case might have to go forward thousands of characters and then back up over most of them.
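The difference is easy to see on a short string. This is a minimal sketch using made-up text with two quoted attributes, so there is more than one "> for the regex to find. The variable names are only for illustration.

```jsl
Names Default To Here( 1 );

// Made-up text: two meta tags, each ending in ">
text = "<meta name=\!"description\!" content=\!"short text\!"><meta name=\!"keywords\!" content=\!"a, b\!">";

// Greedy: .+ runs to the end of the string, then backs up to the LAST ">
greedy = Regex( text, "content=\!"(.+)\!">", "\1" );

// Reluctant: .+? grows one character at a time and stops at the FIRST ">
reluctant = Regex( text, "content=\!"(.+?)\!">", "\1" );

Show( greedy, reluctant );
```

Here the reluctant version captures only short text, while the greedy version also swallows the start of the second tag, up to a, b.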

 

Craige

Re: Extracting a section of a webpage

If you don't mind adding new tools to your toolbox (and want to avoid the complexity of parsing HTML with regular expressions), you can take advantage of JMP's Python Bridge (here is a helper to use it with Anaconda's distribution) and use a package that was built to handle HTML parsing, such as BeautifulSoup.

This package has tons of documentation and examples out there, so once you figure out how to transfer the data between JMP and Python (details here) you can focus on what you want to extract, and then what to do with it in JMP.
