Uncharted

Craige_Hales · May 8, 2022 09:59 PM

Much of the web uses RESTful APIs to move data to and from servers. REST is a simple concept that has nothing to do with sleeping; REST means representational state transfer, but this article is not about REST. This article is about a kludgy mechanism for working around the lack of a REST API when you really need to retrieve some data from a site.

Web sites might not want you to do this for various reasons: bandwidth for data costs money, data licensing costs money, and not watching the advertisements might cost money too. The tool used here, Selenium, is nominally for testing a web site, not for speeding over a speed bump. Most sites have terms of use; you can find JMP's terms at the bottom of this page.

A complete JSL file is attached. It is written for Firefox, Windows, and the JMP web site as it looked on 8May2022. The Firefox part can probably be changed to work with many other browsers. It might work on Mac too; no testing was done. The JMP web site will change over time and the JSL will need tweaking. That's the downside of not using an official API.

Before starting, download a driver and install Selenium as shown in the comments below. You'll need Firefox too, or you'll need to do some research on the driver for your preferred browser.

// JMP + Python + Selenium + Firefox (probably works with Chrome or Edge with a few tweaks)
// https://firefox-source-docs.mozilla.org/testing/geckodriver/index.html
// https://selenium-python.readthedocs.io/

// downloaded            https://github.com/mozilla/geckodriver/releases    -- geckodriver-...-win64.zip (pick latest)
// expand                on desktop   geckodriver.exe
// installed selenium    python -m pip install selenium

// the following code is welded to JMP.COM's HTML as of 8May2022. It could change at any time.
// the functions are sprinkled through the JSL below, near where I first needed them, reused later.
// you'll need similar functions, probably, and want to consult the Python Selenium bindings to write them.
// I only wrote enough to make this simple example work. There is no actual need to be logged in to JMP.
//
// Using try...except on snips of python code makes debugging much easier. JMP loses the exception message without it.
//
// the JSL functions are thin wrappers around the python calls to selenium which (I believe) is
// a thin wrapper to generate an API call to the gecko (etc) driver. I *think* it might be possible to
// remove the python layer and load/call the driver directly from JSL. But Selenium is documented.

Python Init(); // one-time startup...

// 0: startup 

xrc = Python Execute( {}, {By_ID, By_XPATH, rc},
"\[
try:
    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.common.keys import Keys
    from selenium.webdriver.common.by import By
    from selenium.webdriver.firefox.service import Service
    from selenium.webdriver.firefox.options import Options
    #
    options = webdriver.FirefoxOptions()
    #options.add_argument("--private") # example. you most likely don't want private, some things don't work.
    #
    service=Service(r'C:\Users\v1\Desktop\geckodriver.exe')
    # sometimes people use "browser" rather than "driver". It will be used below.
    driver = webdriver.Firefox(service=service,options=options)
    # return two magic values. you may need some others, just add them in the same way...
    By_ID = By.ID
    By_XPATH = By.XPATH
    rc = "ok"
except Exception as e:
    rc = repr(e)
]\"
);

If( xrc != 0 | rc != "ok", Throw( "start up Selenium failed" || Try( ": " || Char( rc ), "" ) ) );

Python Init() only needs to be done once; it connects JMP to Python and takes a few seconds the first time. You can call it again with no penalty. Python Execute(...) sends no variables in but gets three back from the code it runs. It takes a bit to load everything and start the browser.
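Every Python snippet in this article has the same shape: run the Selenium call inside try...except and hand back either "ok" or the exception text, so JMP can see what went wrong. Here is a minimal pure-Python sketch of that pattern. The name run_step is hypothetical (it is not in the attached JSL or in Selenium), and the failing step is a stand-in for a real Selenium error.

```python
def run_step(step):
    """Run a zero-argument callable; return "ok" or the exception's repr."""
    try:
        step()
        return "ok"
    except Exception as e:
        # repr() keeps the exception type as well as the message
        return repr(e)


def failing_step():
    # stand-in for a Selenium call that raises
    raise ValueError("no such element")


rc_good = run_step(lambda: None)
rc_bad = run_step(failing_step)
print(rc_good)
print(rc_bad)
```

The JSL side then only has to compare the returned string with "ok", which is what the If( rc != "ok", Throw(...) ) lines do.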

You are looking at an empty browser controlled by JMP

Open the JMP.COM page next. You might see a redirect that normally goes unnoticed.

// 1: navigate to jmp.com

nav = Function( {url}, {rc},
    Python Execute( {url}, {rc}, 
"\[
try:
    driver.get(url)
    rc = "ok"
except Exception as e:
    rc = repr(e)
]\" );
    return(rc);
);

rc = nav( "https://www.jmp.com/" );
if( rc != "ok", throw("nav: "||char(rc)));

The nav function returns "ok" or an error message. The JMP web page loads in the browser. Ignore the people in the screenshot.

The icon means the browser is remote controlled.

Script the sign-in to the JMP site. Right-click the Sign in button to find out the button's HTML id value. Remember how to do this; I'll skip this explanation at the end...

F12 might bring you to the next screen, but this way the control will already be selected.

And the developer console opens with the control's id showing. Further down there will be controls that have a class but not an id. Selenium's XPATH can handle it. The trick is similar to displaybox navigation--finding a path that is not too brittle and still specific enough.

Use the button id in the JSL that follows.

IDs are usually the best choice when available because they are unique on the page. WaitID waits for up to 10 seconds for the login button to appear. It might not be necessary to wait. It takes no time if it is already there.
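The "wait up to N seconds, but return at once if it is already there" behavior can be pictured as a polling loop. The sketch below is a simplified stand-in, not Selenium's actual WebDriverWait code; wait_until, the poll interval, and the counter that simulates an element appearing are all assumptions made for illustration.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Poll condition() until it is truthy; return "ok" or "timeout"."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return "ok"  # no waiting at all when the condition already holds
        if time.monotonic() >= deadline:
            return "timeout"
        time.sleep(poll)


# Stand-in for "the element shows up on the third look".
looks = {"count": 0}


def appears_on_third_look():
    looks["count"] += 1
    return looks["count"] >= 3


rc_found = wait_until(appears_on_third_look, timeout=2.0, poll=0.01)
rc_missing = wait_until(lambda: False, timeout=0.05, poll=0.01)
print(rc_found, rc_missing)
```

The two return values mirror the "ok" and "timeout" strings that waitID hands back to JSL.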

// 2: login with userid/password credentials

// wait for an id to be available
waitID = Function( {id, timeout = 10, BYformat=By_ID}, {rc},
    Python Execute( {id, timeout, BYformat},  {rc},
"\[
try:
    myElem = WebDriverWait(driver, timeout).until(EC.presence_of_element_located((BYformat, id)))
    rc = "ok"
except TimeoutException:
    rc = "timeout"
except Exception as e:
    rc = repr(e)
]\"
    );
    Return( rc );
);
rc = waitID( "loginButton", 5 );
If( rc != "ok", Throw( "no login button: " || char(rc) ) );

// click a button ID
clickID = Function( {id, BYformat=By_ID}, {rc},
    Python Execute( {id, BYformat}, {rc},
"\[
try:
    driver.find_element(BYformat, id).click()
    rc = "ok"
except Exception as e:
    rc = repr(e)
]\"
    );
    return(rc);
);
rc = clickID( "loginButton" );
if( rc != "ok", throw("login button: "||char(rc)));

Cool! The sign on screen pops up. Find the user name field next.

Now get the id for the user name field by right-click...

Again, wait for the expected field. Now a keystroke function is needed...

rc = waitID( "idp-discovery-username", 10 );
If( rc != "ok", Throw( "no username field: " || char(rc) ) );

// type a value into a field
keysToID = Function( {id, txt}, {rc},
    Python Execute( {id, txt}, {rc},
"\[
try:
    driver.find_element(By.ID, id).send_keys(txt)
    rc = "ok"
except Exception as e:
    rc = repr(e)
]\"
    );
    return(rc);
);
rc = keysToID( "idp-discovery-username", Include( "$documents/UserID.jsl" ) ); // file contains "AliBaba@1000&OneNights.com", in quotation marks, possibly encrypted
if( rc != "ok", throw("keysToID username: "||char(rc)));

Fake user name for fake password.

My userid is scrolled off the screen and the next button is visible...find its name...

Click the Next button to get the password prompt.

click it, then wait for the password field...

// click Next button
rc = clickID( "idp-discovery-submit" );
if( rc != "ok", throw("click submit user name: "||char(rc)));

// wait for password field
rc = waitID( "okta-signin-password", 5 );
If( rc != "ok", Throw( "no password field: " || char(rc) ) );

then enter the password

// enter password
rc = keysToID( "okta-signin-password", Include( "$documents/password.jsl" ) ); // file contains "OpenSesame", in quotation marks, possibly encrypted
if( rc != "ok", throw("keysToID password: "||char(rc)));

and repeat the process...find the sign in button...

After entering the password, click the sign in button.

click it

// click Sign In
rc = clickID( "okta-signin-submit" );
if( rc != "ok", throw("click signin submit: "||char(rc)));

We are signed in.

Must be signed in, there is an edit profile choice.

There is a search field in the picture. Type in JSL and click the magnifier. "searchField" is the id. The magnifier could be clicked, but Selenium has a submit form mechanism that will work off the searchField, which is an input field in the form.

The search field needs a bigger window to be visible, here it is.

There is some asynchronous JavaScript that loads some parts of the page. Waiting for any particular field might not be necessary if the field is loaded as part of the page.

// 3: query for articles about JSL

rc = waitID( "searchField" );
If( rc != "ok", Throw( "no search field: " || char(rc) ) );

// enter a search string. "jsl" currently returns 3 pages
rc = keysToID( "searchField", "jsl" );
if( rc != "ok", throw("keysToID search field: "||char(rc)));

submitForm = Function( {id}, {rc},
    Python Execute( {id}, {rc},
"\[
try:
    driver.find_element(By.ID, id).submit()
    rc = "ok"
except Exception as e:
    rc = repr(e)
]\"
    );
    return(rc);
);
rc = submitForm( "searchField" ); // submit form works OK off of this input field.
if( rc != "ok", throw("submitForm searchField: "||char(rc)));

Now get ready to page through the results. The multi-page listing elements look like this

The pink outer element holds three inner parts: title, description, link.

There is a list of the pink-circled data items that spans multiple pages.
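To make the "outer card, three inner parts" idea concrete, here is a small offline sketch that uses Python's standard xml.etree.ElementTree instead of a live browser. The class names are the ones used in the JSL below; the surrounding markup, the text, and the example.com links are assumptions, and the real page is certainly more complicated.

```python
import xml.etree.ElementTree as ET

# Assumed, simplified markup; the real page's structure may differ.
SAMPLE = """
<div id="searchresults">
  <div class="result-card">
    <a class="result-title_txt_all">First title</a>
    <section class="result-description_txt_all">First description</section>
    <a class="result-url" href="https://example.com/one">link</a>
  </div>
  <div class="result-card">
    <a class="result-title_txt_all">Second title</a>
    <section class="result-description_txt_all">Second description</section>
    <a class="result-url" href="https://example.com/two">link</a>
  </div>
</div>
"""

root = ET.fromstring(SAMPLE.strip())
rows = []
# Find each outer card, then look up the three parts relative to that card.
for card in root.findall("div[@class='result-card']"):
    rows.append({
        "title": card.find("a[@class='result-title_txt_all']").text,
        "description": card.find("section[@class='result-description_txt_all']").text,
        "link": card.find("a[@class='result-url']").get("href"),
    })

for row in rows:
    print(row)
```

Looking things up relative to each card is what keeps a title, description, and link together as one row.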

// 4: page through the results to capture them

getElements = Function( {id, BYformat=By_ID}, {rc},
    Python Execute( {id, BYformat},  {rc},
"\[
try:
    list = driver.find_elements(BYformat,id)
    rc = "ok"
except Exception as e:
    rc = repr(e)
]\"
    );
    Return( rc );
);

getNElements = function({},{n},
    Python Execute( {}, {n},
"\[
try:
    n = len(list)
except Exception as e:
    print(repr(e))
    n = -1
]\"
    );	
    return(n);
);

getElementItext = function({i,id, BYformat=By_ID},{txt},
    Python Execute( {i, id, BYformat}, {txt},
"\[
try:
    txt = list[int(i)].find_element(BYformat, id).text
except Exception as e:
    txt = "Error: getElementItext: " + repr(e)
]\"
    );	
    return(txt);
);

getElementIattribute = function({i,id, BYformat=By_ID, attr},{txt},
    Python Execute( {i, id, BYformat, attr}, {txt},
"\[
try:
    txt = list[int(i)].find_element(BYformat, id).get_attribute(attr)
except Exception as e:
    txt = "Error: getElementIattribute: " + repr(e)
]\"
    );	
    return(txt);
);

Above: some functions to use in the loop below. There are buttons at the bottom of the page to go to the next page; they run some JavaScript that destroys and recreates the list of items. The functions are called again to recapture the new list. The JSL and Python are good enough for this example. They will break down if there is more than one list to keep track of at the same time--see the Python list variable. I'm pretty sure it is necessary to wait for the data to load after each next page...
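If more than one list had to be tracked at the same time, one possible approach is to keep the captured lists in a dictionary under caller-chosen names, which also avoids reusing the name of Python's built-in list. This is a sketch only: element_lists, store_elements, count_elements, and element_at are hypothetical names, and plain strings stand in for Selenium elements.

```python
# Hypothetical replacement for the single global `list` variable:
# each captured set of elements is stored under a caller-chosen name.
element_lists = {}


def store_elements(name, elements):
    """Remember a captured set of elements under a name."""
    element_lists[name] = list(elements)
    return "ok"


def count_elements(name):
    """Return the number of stored elements, or -1 if the name is unknown."""
    try:
        return len(element_lists[name])
    except KeyError:
        return -1


def element_at(name, i):
    """Fetch one element; int() mirrors the int(i) used in the JSL wrappers."""
    return element_lists[name][int(i)]


# Plain strings stand in for Selenium elements here. With a live driver,
# the second argument would be the result of driver.find_elements(...).
store_elements("cards", ["card-a", "card-b", "card-c"])
store_elements("pager", ["prev", "next"])
print(count_elements("cards"), count_elements("pager"))
```

The JSL wrappers would then pass the list name along with the index.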

dt = New Table( "articles",
    New Column( "link", character,
        Set Property("Event Handler",
            Event Handler(
                Click(JSL Quote(Function( {thisTable, thisColumn, iRow}, Web( Char( thisTable:thisColumn[ iRow ] ) ); );)),
                Tip(JSL Quote(Function( {thisTable, thisColumn, iRow}, "Open " || Char( thisTable:thisColumn[ iRow ] ) || " in your browser."; );)),
                Color(JSL Quote(Function( {thisTable, thisColumn, iRow}, RGBColor("link"); );))
            )
        )
    ),
    New Column( "title", character ),
    New Column( "description", character )
);

// this while loop will grab screens of answers and break() when the NextScreen button goes dim
while(1,
    // not sure, yet, what to wait for. maybe the card(s)...
    // this html is updated in the background by a json ajax mechanism that
    // is hard to see. 
    rc = waitID("//div[@id='searchresults']//div[@class='result-card']",10,By_XPATH);
    If( rc != "ok", Throw( "no search results" ) );
    
    // get the current set
    rc = getElements("//div[@id='searchresults']//div[@class='result-card']",By_XPATH);
    if( rc != "ok", throw("getElements: "||char(rc)));
    n = getNElements();
    for(i=0,i<n,i+=1,
        // fetch the elements
        dt<<addrows(1);
        dt:title[nrows(dt)] = getElementItext(i,"a[@class='result-title_txt_all']",By_XPATH);
        dt:description[nrows(dt)] = getElementItext(i,"section[@class='result-description_txt_all']",By_XPATH);
        dt:link[nrows(dt)] = getElementIattribute(i,"a[@class='result-url']",By_XPATH,"href");
    );
     
    // advance to next page via pager-next button within the id=pager
    rc = waitID("//ul[@id='pager']//a[@class='pager-next']",1,By_XPATH);
    if(rc != "ok", // check for end vs error
        rc = waitID("//ul[@id='pager']//span[@class='pager-disabled pager-next']",1,By_XPATH);
        if(rc=="ok", break(/*normal exit with disabled span*/), throw("did not find expected pager button disabled"));
    ,//
        rc = clickID("//ul[@id='pager']//a[@class='pager-next']",By_XPATH);
        if( rc != "ok", throw("click pager next: "||char(rc)));
    );
    
);

At this point the browser is open and this table is on the screen.

Today there were 57 entries spanning three pages.
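Stripped of the browser, the paging loop above is: read the current page, stop if the Next button is disabled, otherwise click Next and repeat. Here is a runnable sketch of that control flow. FakePager and its three pages of data are made up for illustration and stand in for the live result list and pager buttons.

```python
# Fake "site": three pages of results, standing in for the live browser.
PAGES = [["a1", "a2"], ["b1", "b2"], ["c1"]]


class FakePager:
    """Hypothetical stand-in for the browser's result list and Next button."""

    def __init__(self, pages):
        self.pages = pages
        self.index = 0

    def current_items(self):
        return list(self.pages[self.index])

    def next_enabled(self):
        return self.index < len(self.pages) - 1

    def click_next(self):
        self.index += 1


def collect_all(pager):
    collected = []
    while True:
        # re-read the list on every page; the old one is gone after a click
        collected.extend(pager.current_items())
        if not pager.next_enabled():
            break  # normal exit: Next is disabled on the last page
        pager.click_next()
    return collected


items = collect_all(FakePager(PAGES))
print(items)
```

The live version needs the extra waits because the page content arrives asynchronously; the fake pager has no such delay.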

Time to shut down the browser.

// 5: quit
 
Python Submit( 
"\[
driver.quit() # close the browser
]\" );
Python Term();
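
The script above only reaches step 5 if nothing throws along the way; after a Throw() the browser is left open. One way to guarantee the shutdown on the Python side is try...finally. Here is a minimal sketch, assuming a stand-in driver object: FakeDriver and scrape are hypothetical names, and only the quit() call corresponds to the real driver.

```python
class FakeDriver:
    """Hypothetical stand-in for a Selenium driver; records whether quit() ran."""

    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True


def scrape(driver, work):
    """Run work(driver) and always quit the driver, even on error."""
    try:
        work(driver)
        return "ok"
    except Exception as e:
        return repr(e)
    finally:
        driver.quit()  # runs on both success and failure


def broken_work(driver):
    raise RuntimeError("no search results")


good = FakeDriver()
bad = FakeDriver()
rc_good = scrape(good, lambda d: None)
rc_bad = scrape(bad, broken_work)
print(rc_good, good.closed)
print(rc_bad, bad.closed)
```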

Towards the end there is an XPATH

rc = clickID("//ul[@id='pager']//a[@class='pager-next']",By_XPATH);

that means

// - somewhere below the root of the document, find a
ul - a <ul> tag (some sort of HTML list)
[@id='pager'] - the list has this id
// - more nested tags, followed by...
a - an <a> tag (link)
[@class='pager-next'] - the link has this class

It might not need to be that complicated. It is an example that uses a unique id to find an item that might not be unique if only the class was considered.
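The difference between searching by class alone and anchoring on the id first can be tried offline with Python's standard xml.etree.ElementTree, which supports this subset of XPath (with a leading "." for "start here"). The markup below is an assumed, simplified page in which a second, unrelated link happens to share the class; it is not the real JMP.COM HTML.

```python
import xml.etree.ElementTree as ET

# Assumed markup: two links share the class, but only one sits inside the pager.
PAGE = """
<html><body>
  <div id="promo"><a class="pager-next" href="/elsewhere">another link</a></div>
  <ul id="pager">
    <li><span class="pager-disabled pager-prev">Prev</span></li>
    <li><a class="pager-next" href="/search?page=2">Next</a></li>
  </ul>
</body></html>
"""

root = ET.fromstring(PAGE.strip())

# Class alone is ambiguous on this page.
by_class_only = root.findall(".//a[@class='pager-next']")

# Anchoring on the unique id first narrows it to the pager's link.
by_id_then_class = root.findall(".//ul[@id='pager']//a[@class='pager-next']")

# @class='...' compares the whole attribute string, not one class out of several.
partial_class = root.findall(".//span[@class='pager-disabled']")
full_class = root.findall(".//span[@class='pager-disabled pager-prev']")

print(len(by_class_only), len(by_id_then_class))
print(by_id_then_class[0].get("href"))
print(len(partial_class), len(full_class))
```

The last two lookups show why the disabled button in the loop is matched with the full string 'pager-disabled pager-next' rather than just one of its class names.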