When it comes to deployment, flexibility is another important feature to consider. Converting the model to scoring code is an important step; next you have to deal with the specific needs of each production scenario. That might lead to completely different solutions, from providing an API that can be called by remote applications, to creating simple tools that can score and visualize Excel spreadsheets, to creating distributed applications that can score massive amounts of data.
One answer to this diversity of deployment requirements is to use a language that can tap into a rich ecosystem of supporting libraries and frameworks. This is where Python stands out. One can find Python libraries (not to mention books and tutorials) to handle all of the challenges described above, and many more.
The examples below illustrate how simple solutions can be built in Python by leveraging JMP-generated scoring code and readily available libraries to solve scenarios that would otherwise require a big investment in custom software development. The examples were tested on Windows, using the Anaconda Python distribution from Continuum Analytics™.
A common but difficult requirement is to make scoring available to many users over a network. You might not want the scoring code to be visible as part of the web application itself (see our JavaScript example). You might also want to log the scoring calls, or combine the input data sent by the user (or by an automated process) with data retrieved from a database.
All these requirements can be addressed by deploying the Python scoring code as part of a server-side application on a web server. By exposing the scoring code through an API callable over HTTP, you are effectively implementing a "scoring-as-a-service" solution that can serve users (through a web or even mobile application) and automated processes alike.
The code in the WebService directory shows a simple way to implement this solution. The main file, app.py, uses the Flask microframework to create a web server application. That application exposes a single entry point named score that calls a JMP-generated Python model to score the input data provided as URL arguments.
In the same directory we have both the JMP-generated Python scoring code and a copy of the jmp_score.py support file provided with the JMP install.
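To give an idea of how little code this takes, here is a minimal sketch of such a Flask application. It is not the actual app.py: the module name iris_model and its score(indata, outdata) entry point are assumptions standing in for the JMP-generated code in the directory.

```python
# A minimal sketch of a Flask scoring service in the spirit of app.py.
# The module name iris_model and the score(indata, outdata) entry point
# are assumptions, not the literal contents of the WebService directory.
from flask import Flask, request, jsonify

import iris_model  # hypothetical name of the JMP-generated scoring module

app = Flask(__name__)

@app.route("/score")
def score():
    try:
        # Input values arrive as URL arguments, e.g. ?Petal+length=5.1&...
        indata = {name: float(value) for name, value in request.args.items()}
        outdata = {}
        iris_model.score(indata, outdata)  # assumed scoring entry point
        return jsonify(success=outdata)
    except Exception as error:
        return jsonify(error=str(error)), 400

if __name__ == "__main__":
    app.run(port=9004)
```

Keeping the scoring module behind the service means the model logic never leaves the server; only inputs and results travel over HTTP.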
The solution also includes a web client application that illustrates how the scoring service can be called from a browser. It is basically the already mentioned JavaScript example, modified to call the scoring service instead of calculating the score locally.
The provided WebService/run.cmd script starts the scoring web service and then opens two browser tabs. The first contains the web client; try interacting with it and check the server window to see the requests and replies being printed. The second has a URL that points to the scoring service, passing along the encoded input values as arguments:
http://localhost:9004/score?Petal+length=5.1&Petal+width=1.9&Sepal+length=5.8&Sepal+width=2.789
The result should be a page displaying a JSON object with the scoring results:
{"success": {"Most Likely Species": "virginica", "Prob[setosa]": 1.1654444266441007e-25, "Prob[versicolor]": 0.000699496836649252, "Prob[virginica]": 0.9993005031633508}}
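The same entry point can of course be called from code rather than a browser, which is how an automated process would use the service. The sketch below uses the requests library and the field names from the Iris example above.

```python
# Calling the scoring service programmatically with the requests library.
import requests

params = {
    "Petal length": 5.1,
    "Petal width": 1.9,
    "Sepal length": 5.8,
    "Sepal width": 2.789,
}
reply = requests.get("http://localhost:9004/score", params=params)
result = reply.json()

# e.g. prints "virginica"
print(result["success"]["Most Likely Species"])
```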
Jupyter is the most popular open source implementation of the literate programming paradigm. Project Jupyter was born out of the IPython Project in 2014 as it evolved to support interactive data science and scientific computing across not just Python but also other programming languages. From the Jupyter site:
"The Jupyter Notebook is a web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text."
Jupyter was covered by a recent Nature article, which found that many scientists are now "publishing their notebooks alongside papers". The IPython GitHub page keeps a list of examples.
The example under the Notebook directory contains a Jupyter notebook that shows how to use a JMP-generated Python scoring model along with other core Python libraries for data analysis and visualization, all in a literate programming context. First launch the notebook using the provided run.cmd script.
That should launch your browser pointing to the URL:
http://localhost:8888/notebooks/JMP_scoring_Excel_Bokeh.ipynb
Follow the document, and optionally re-evaluate the associated code cells, to learn how to load data from an Excel spreadsheet, score it with the JMP-generated Python model, and visualize the results with Bokeh.
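The cells in the notebook follow roughly the pattern sketched below. This is not the notebook's literal contents: the file name iris.xlsx, the scoring module name iris_model and its score(indata, outdata) entry point are illustrative assumptions.

```python
# Sketch of the notebook workflow: read Excel data with pandas, score it with
# a JMP-generated Python model, and visualize the results with Bokeh.
import pandas as pd
from bokeh.plotting import figure, show, output_notebook

import iris_model  # hypothetical name of the JMP-generated scoring module

# Read the input data from an Excel spreadsheet
data = pd.read_excel("iris.xlsx")

# Score each row with the JMP model and collect the outputs
results = []
for _, row in data.iterrows():
    outdata = {}
    iris_model.score(row.to_dict(), outdata)  # assumed scoring entry point
    results.append(outdata)
scored = pd.concat([data, pd.DataFrame(results)], axis=1)

# Plot the predicted probabilities inline in the notebook
output_notebook()
p = figure(title="Predicted probability of virginica")
p.circle(scored["Petal length"], scored["Prob[virginica]"])
show(p)
```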
Sometimes even the largest server is not enough. When your scalability requirements reach the level reserved for "Big Data" applications, you have to start considering distributed solutions.
In recent years, Apache Spark has gained a lot of traction in that space due to its speed, ease of use and generality. Thanks to a Python interface called PySpark, we can use Spark clusters to execute JMP-generated Python models and score large datasets.
The (Windows, Spark 1.6.1) example in the Spark directory includes data, models and the following scripts that implement the required steps in the setup and execution of a Spark application:
spark-submit --master spark://%1:7077 ^
Calls the Spark application launcher. Make sure the spark-submit script from the Spark install is in your path. It takes many arguments, the first one being the address of the cluster master node. %1 will contain the local address computed by the call to myip.cmd.
--py-files .\models\airlines_models.zip ^
Points to a .zip file containing the code that will be executed on each node (the JMP models), plus all its dependencies (in our case, just the jmp_score.py support file).
--packages com.databricks:spark-csv_2.11:1.5.0 ^
Additional packages required by the main application code, specified as Maven coordinates - here we point to the Spark CSV parser library.
airline_delay.py .\data\2008_100k.csv
The application entry point and its arguments - in this case, the data file we want to load and score. The data file must be on a disk and in a directory accessible to all the cluster nodes. We used a slice of the famous Airlines dataset, as using a Spark cluster to score the Iris dataset is the very definition of overkill. ☺
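To make the moving parts more concrete, here is a hedged sketch of what a driver script like airline_delay.py could look like under Spark 1.6. The module name airlines_model and its score(indata, outdata) entry point are assumptions rather than the actual contents of the models zip file.

```python
# Sketch of a PySpark driver that scores a CSV file with a JMP-generated model.
from __future__ import print_function
import sys

from pyspark import SparkContext
from pyspark.sql import SQLContext

def score_partition(rows):
    # Import inside the function so the module shipped via --py-files
    # is resolved on the worker nodes.
    import airlines_model  # hypothetical name of the JMP-generated model
    for row in rows:
        outdata = {}
        airlines_model.score(row.asDict(), outdata)  # assumed entry point
        yield outdata

if __name__ == "__main__":
    sc = SparkContext(appName="airline_delay")
    sqlContext = SQLContext(sc)

    # Load the CSV file passed on the command line using the spark-csv package
    df = (sqlContext.read
          .format("com.databricks.spark.csv")
          .options(header="true", inferSchema="true")
          .load(sys.argv[1]))

    # Score every partition in parallel and show a small sample of the results
    scored = df.rdd.mapPartitions(score_partition)
    for result in scored.take(5):
        print(result)

    sc.stop()
```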
For long-running examples, and especially for deployment in production, you have to consider how you will monitor the application. To that end, Spark provides both a REST API and a web UI. Once the application is running, open a browser on the cluster master machine and point it to the Spark web UIs (by default, the master UI is on port 8080 and the running application's UI is on port 4040).
The timeline visualization is a great way to debug and understand large scale Spark applications.
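The REST API can also be polled from Python, for example to feed your own dashboards. The sketch below assumes the default application UI port (4040) and the standard /api/v1 endpoints.

```python
# Polling Spark's monitoring REST API with the requests library.
import requests

api = "http://localhost:4040/api/v1"

# List the applications known to this Spark context, then their jobs
for app in requests.get(api + "/applications").json():
    print(app["id"], app["name"])
    jobs_url = "{0}/applications/{1}/jobs".format(api, app["id"])
    for job in requests.get(jobs_url).json():
        print("  job", job["jobId"], job["status"])
```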
I hope this example makes you curious for more. I for one was very excited to see a JMP model running in parallel on multiple machines! But please note that this is just a proof-of-concept implementation. Configuring Spark clusters and applications for a production environment is beyond the scope of this code exercise.