Extracting text data from a website...

ERJ-2015

Community Member

Joined:

Oct 19, 2017

I would like to quickly analyze the text on a website (well... on a large number of websites, really). Is this possible in JMP's new Text Explorer? Can I provide JMP with a column of website addresses and have it read/import the associated text on those websites? (Or is it just a dream?)

1 REPLY
Craige_Hales

Staff

Joined:

Mar 21, 2013

loadTextFile("https://google.com")
/*:

"<!doctype html><html itemscope=\!"\!" i...

 

You'll get the HTML for the page, not just the text. Text Explorer does have some example regex that might help you parse the HTML.

You can use these regexes as they are or modify them to meet your needs.

The regex tokenizer uses regular expressions to break the input text into tokens (words, or units of text that act like a word). You can look at the Grabber and Remover examples to see how to remove text from the stream of tokens and how to (for example) convert all phone-number-like strings into "#phone".
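If you would rather strip the markup before the text reaches Text Explorer, something like the following might work as a starting point. It is only a rough sketch: the sample string and the variable names (html, noTags, text) are made up for illustration, the pattern is a crude tag stripper rather than a real HTML parser, and it will leave the contents of script and style blocks in place.

```jsl
// Sketch only: hypothetical sample string standing in for Load Text File( url ).
html = "<html><body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>";

// Replace every <...> tag with a space (crude; does not understand comments or scripts).
noTags = Regex( html, "<[^>]*>", " ", GLOBALREPLACE );

// Collapse runs of whitespace into single spaces, then trim the ends.
text = Trim( Regex( noTags, "\s+", " ", GLOBALREPLACE ) );

Show( text );
```

The same two Regex calls could be wrapped around Load Text File in a column formula, but real pages are messy enough that you will probably need to adjust the patterns for your sites.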

You can make a table of web site content:

New Table( "content",
	Add Rows( 2 ),
	New Column( "site",
		Character,
		"Nominal",
		Set Values( {"https://google.com", "https://bing.com"} )
	),
	New Column( "text", Character, "Nominal", Formula( Load Text File( :site ) ) )
);

Data table with HTML loaded in formula column

Craige