Problem
You have a text string that contains a repeating pattern, and you want to extract data from it into a data table. Each repetition in the string becomes one row in the table. The string might also contain a lot of text that does not belong in the table and must be skipped.
Solution
Use the JSL Pat Match function, with a pattern built from the other Pat... functions (here Pat Repeat, Pat Break, Pat Arb, Pat Fence, and Pat Test). The following example makes a data table of the links found on a web page.
site = "https://en.wikipedia.org";
url = "/wiki/JMP_(statistical_software)";
html = Load Text File( site || url ); // Dec 2017: 200+ links here
quote = "\!""; // simplify escaping of quotation mark elsewhere
dt = New Table( url, New Column( "link", Character ), New Column( "text", Character ) );
// A typical link looks like this on Wikipedia:
//   <a href="//st.wikipedia.org/" lang="st">Sesotho</a>
// The following pattern will need tweaking for other sites.
rc = Pat Match(
	html,
	Pat Repeat( // until there are no more
		Pat Break( "<" ) + // match up to, but not including, the <
		( // either we've found a link with <a
			( // this will fail if the HTML link format changes much:
				// HTML does not require href to follow <a after exactly one
				// space, but this pattern does. The href value is between
				// quotation marks, and lang=... is thrown away by this simple pattern.
				"<a href=" + quote + Pat Break( quote ) >> vlink + Pat Break( ">" ) // store match in vlink
				// throw away the closing > then capture everything up to the
				// closing </a>; this may include other tags
				+ ">" + Pat Arb() >> vtext + "</a>" + // store match in vtext
				Pat Fence() + // fence off previously parsed data; back-up-and-retry is pointless
				Pat Test( // inject some JSL into the match to save the results
					If( Starts With( vlink, "#" ), // ignore in-page anchors
						{} // nothing
					, // else
						If(
							Starts With( vlink, "//" ), // protocol-relative links begin with //
								vlink = "https:" || vlink, // add only the scheme
							Starts With( vlink, "/" ), // within-site links begin with /
								vlink = site || vlink // fully qualified
						);
						dt << Add Rows( 1 ); // extend the table by one row
						dt:link = vlink; // vlink and vtext are the JSL variables
						dt:text = vtext; // link and text are the table columns
					);
					1; // Pat Test needs a true result to keep going
				)
			)
			| // or we found something else and can just skip it
			Pat Break( ">" ) // match up to the closing > and throw it away
		)
	)
);
Data table of links and associated text
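The comments in the pattern note that the text captured into vtext may include other tags. A minimal sketch of one way to strip them, assuming a hypothetical sample value for vtext and using the JSL Regex function, which the original script does not use:

```jsl
// Hypothetical captured text with nested markup (not taken from the page).
vtext = "<i>JMP</i> <b>Pro</b>";

// Remove every <...> run. Regex returns missing when nothing matches,
// so fall back to the original text in that case.
stripped = Regex( vtext, "<[^>]*>", "", GLOBALREPLACE );
clean = If( Is String( stripped ), stripped, vtext );

Show( clean ); // print the cleaned text to the log
```

If this cleanup is wanted in the table, the Regex and If lines could run inside Pat Test just before dt:text is assigned.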
Discussion
Your pattern will be different. The < and > characters are part of the HTML specification, and this pattern uses them to find the links in the text; your data will follow some other pattern. You'll still want the Pat Fence and the Pat Test, but you'll need to change the special cases handled inside Pat Test. In this HTML example, links beginning with # are anchors within the same page (clicking one scrolls the page to a section), so the code ignores them. Every other link is saved, and links that start with / have the site prepended to make them fully qualified. Your data might need similar handling.
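As a rough illustration of adapting the same skeleton to other data, the sketch below assumes a made-up string in which each record looks like [sensor=..., temp=...] and everything else is noise. The input string, the table and column names, and the variables txt, dt2, rc2, vsensor, and vtemp are all hypothetical; only the Pat... functions and the overall structure come from the example above.

```jsl
// Hypothetical input: bracketed records mixed with text to skip.
txt = "boot ok [sensor=7, temp=21.5] noise here [sensor=8, temp=22.1] [status] done";

dt2 = New Table( "readings",
	New Column( "sensor", Character ),
	New Column( "temp", Character )
);

rc2 = Pat Match(
	txt,
	Pat Repeat( // until there are no more brackets
		Pat Break( "[" ) + // skip ahead to, but not including, the next [
		(
			( // a record of the assumed form [sensor=..., temp=...]
				"[sensor=" + Pat Break( "," ) >> vsensor +
				", temp=" + Pat Break( "]" ) >> vtemp + "]" +
				Pat Fence() + // nothing before this point needs re-parsing
				Pat Test( // save the captured values as a new row
					dt2 << Add Rows( 1 );
					dt2:sensor = vsensor;
					dt2:temp = vtemp;
					1; // Pat Test needs a true result to keep going
				)
			)
			| // or some other bracketed item: skip to its closing ]
			Pat Break( "]" )
		)
	)
);
```

The intent is that only the two sensor/temp records become rows, while the [status] item falls through to the skip branch. Here Pat Test has no special cases at all; any filtering or clean-up your data needs would go there, in the same place the HTML example handles # and / links.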
See Also
https://community.jmp.com/t5/Uncharted/Pattern-Matching/ba-p/21005