Community Member

Extracting text data from a website...

I would like to quickly analyze the text on a website (well... on a large number of websites, really). Is this possible in JMP's new Text Explorer? Can I provide JMP with a column of website addresses and have it read/import the associated text on those websites? (Or is it just a dream?)

Staff (Retired)

Re: Extracting text data from a website...


"<!doctype html><html itemscope=\!"\!" i...


You'll get the HTML for the page, not just the text. Text Explorer does have some example regex that might help you parse the HTML.
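If you only need a rough plain-text version before handing the column to Text Explorer, one option is to strip the markup with JSL's Regex() function. This is a minimal sketch, not a full HTML parser: the sample string is a hypothetical stand-in for what Load Text File() returns, and anything inside script or style elements would still be left behind as text.

```jsl
// Hypothetical sample standing in for the HTML returned by Load Text File().
html = "<html><body><h1>Title</h1><p>Some text.</p></body></html>";

// Replace every tag with a space so adjacent words do not run together.
noTags = Regex( html, "<[^>]*>", " ", GLOBALREPLACE );

// Collapse runs of whitespace into single spaces and trim the ends.
plain = Trim( Regex( noTags, "\s+", " ", GLOBALREPLACE ) );

Show( plain );
```

The same two Regex() calls could be placed inside a column formula if you want the cleaned text stored alongside the raw HTML.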

You can use these regex or modify them to meet your needs.

The regex tokenizer uses regular expressions to break the input text into tokens (words, or units of text that act like a word). You can look at the Grabber and Remover examples to see how to remove text from the stream of tokens and how to (for example) convert all phone-number-like strings into "#phone".
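To get a feel for what that kind of substitution does, here is a rough equivalent written with the standalone Regex() function rather than inside Text Explorer. It is only an illustration: the sample sentence and the NNN-NNN-NNNN pattern are assumptions, and the built-in phone example in Text Explorer may use a different pattern.

```jsl
// Hypothetical sample text; the number is made up for illustration.
txt = "Call 919-555-0100 for details.";

// Replace every NNN-NNN-NNNN run with the placeholder token "#phone".
masked = Regex( txt, "\d{3}-\d{3}-\d{4}", "#phone", GLOBALREPLACE );

Show( masked );
```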

You can make a table of web site content:

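New Table( "content",
	Add Rows( 2 ),
	// One web address per row (the addresses are blank in the original post)
	New Column( "site", Character, "Nominal",
		Set Values( {"", ""} )
	),
	// Formula column: Load Text File() fetches the page at each address
	New Column( "text", Character, "Nominal", Formula( Load Text File( :site ) ) )
);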

Screenshot: data table with HTML loaded in the formula column