
Extracting text data from a website...


Community Member


Oct 19, 2017

I would like to quickly analyze the text on a website (well... on a large number of websites, really). Is this possible in JMP's new Text Explorer? Can I provide JMP with a column of website addresses and have it read/import the associated text on those websites? (Or is it just a dream?)




Mar 21, 2013


"<!doctype html><html itemscope=\!"\!" i...


You'll get the HTML for the page, not just the text. Text Explorer does have some example regex that might help you parse the HTML.

(Screenshot: Capture.PNG)

You can use these regexes or modify them to meet your needs.

The regex tokenizer uses regular expressions to break the input text into tokens (words, or units of text that act like a word). You can look at the Grabber and Remover examples to see how to remove text from the stream of tokens and how to (for example) convert all phone-number-like strings into "#phone".

You can make a table of web site content:

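New Table( "content",
	Add Rows( 2 ),
	// One row per site; put the addresses to read in the empty strings.
	New Column( "site", Character, "Nominal",
		Set Values( {"", ""} )
	),
	// The formula loads the raw HTML for the address in each row.
	New Column( "text", Character, "Nominal", Formula( Load Text File( :site ) ) )
);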

(Screenshot: Capture.PNG, data table with HTML loaded in formula column)
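If you want something closer to plain text before handing the column to Text Explorer, one option is to strip the tags in JSL first. This is only a rough sketch: the sample string and the variable names (html, noTags, plain) are made up for illustration, and the pattern is a simple "anything between < and >" match, not a real HTML parser. It will not remove the contents of script or style blocks, and it will not decode entities such as &amp;.

```jsl
// Hypothetical sample input; in practice this would come from Load Text File().
html = "<html><body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>";

// Replace every tag-like run (< ... >) with a single space.
noTags = Regex( html, "<[^>]*>", " ", GLOBALREPLACE );

// Trim the ends and squeeze runs of whitespace down to single spaces.
plain = Collapse Whitespace( noTags );

Show( plain );
```

The same two calls could go inside the column formula, wrapped around Load Text File( :site ), if you would rather keep everything in the data table.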