Abstract
Finding the right model to predict new outcomes given new data is an important accomplishment. But many times it is a step in a journey where the final goal is to share the model’s predictive power with a much larger audience. JMP Pro has many tools to facilitate fitting, comparison, and selection of predictive models. JMP Pro 13 added the Formula Depot: an efficient way to collect models, apply them to new tables, and access the Model Comparison and Profiler platforms. To help with deploying models to production, the Formula Depot can also convert them to scoring code in a number of different programming languages. In this tutorial, we use a real estate case study to illustrate the predictive modeling workflow. We’ll compile data, prepare the data for modeling, generate predictive models, publish models to the Formula Depot, and explore and select the best model(s). Then, we’ll generate scoring code to support the creation of web applications that can calculate housing prices “on the spot.” We will also explore different methods for scoring data, and provide an overview of current deployment architectures.
Tutorial Content
The end result of this tutorial is a web application, hosted on AWS, that predicts housing prices. You can find it here. The rest of this section describes the tutorial materials, which are attached to this page.
The client tier (see attached file RedFinWeb.zip) uses Bootstrap for the user interface and OpenLayers for the mapping capabilities. It is implemented as a static website hosted on AWS S3. The interface collects user input through a customized HTML form and passes the entered values as inputs to the model evaluation, which is triggered by a REST call to the compute layer.
The compute layer (see attached file RedFinServerless.zip) uses two Amazon services: API Gateway is the entry point for the REST calls, which are mapped to an AWS Lambda function. Lambda is Amazon's implementation of the Serverless Architecture paradigm, also referred to as Function-as-a-Service (FaaS). Serverless provides performance at scale with low management costs for stateless, low-latency, high-throughput applications. That makes it a great fit for scoring applications, which are embarrassingly parallel because each record is scored independently of the others.
The Lambda service was the final deployment destination for the Python scoring code generated by JMP. Applications can be deployed to AWS services manually through the AWS management console, but to streamline the process we recommend using one of the many open-source wrappers available. In this exercise, we used the Serverless application framework.
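To make the deployment target concrete, here is a minimal sketch of a Lambda handler that wraps a scoring function. It is not the code in RedFinServerless.zip. It assumes API Gateway's Lambda proxy integration (the request body arrives as a JSON string in `event["body"]`), and the `score` function, its coefficients, and the field names `sqft`, `beds`, and `predicted_price` are all hypothetical stand-ins for the JMP-generated code.

```python
import json


def score(features):
    """Hypothetical stand-in for the JMP-generated scoring function."""
    # Made-up linear formula, used only so the sketch runs end to end.
    return 50000.0 + 150.0 * features["sqft"] + 10000.0 * features["beds"]


def handler(event, context):
    """Entry point that Lambda invokes for each API Gateway request."""
    try:
        # Assumption: proxy integration, so the body is a JSON string.
        features = json.loads(event.get("body") or "{}")
        prediction = score(features)
    except (KeyError, TypeError, ValueError) as exc:
        # Missing fields, wrong types, or malformed JSON all map to a 400.
        return {
            "statusCode": 400,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps({"error": f"bad input: {exc}"}),
        }
    return {
        "statusCode": 200,
        "headers": {
            "Content-Type": "application/json",
            # Lets a page served from another origin (such as S3) call the API.
            "Access-Control-Allow-Origin": "*",
        },
        "body": json.dumps({"predicted_price": prediction}),
    }
```

The handler keeps no state between calls, which is what allows Lambda to run as many copies in parallel as the request volume requires.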
Note how the Python code captures both the feature engineering tasks (clustering, binning, imputation) and the model built on top of them. This allows the same raw data sources used to create the model to also be used to score new data in production, an important consideration when building a maintainable analytics pipeline.
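The idea of bundling feature engineering with the model can be sketched as follows. This is an illustration of the pattern, not the code that JMP generates: every constant, cluster center, bin boundary, coefficient, and field name below is invented, and the model is reduced to a simple additive formula.

```python
import math

# All values below are invented for illustration; real ones would come
# from the model fitted in JMP.
MEDIAN_LOT_SIZE = 6000.0
CLUSTER_CENTERS = {  # hypothetical (latitude, longitude) cluster centers
    0: (35.78, -78.64),
    1: (35.91, -78.94),
}
CLUSTER_EFFECT = {0: 25000.0, 1: -5000.0}
YEAR_BIN_EFFECT = {"new": 20000.0, "mid": 0.0, "old": -15000.0}


def impute_lot_size(lot_size):
    """Imputation: replace a missing lot size with a fixed median."""
    if lot_size is None or (isinstance(lot_size, float) and math.isnan(lot_size)):
        return MEDIAN_LOT_SIZE
    return float(lot_size)


def bin_year_built(year_built):
    """Binning: map the construction year to a coarse category."""
    if year_built >= 2000:
        return "new"
    if year_built >= 1970:
        return "mid"
    return "old"


def assign_cluster(lat, lon):
    """Clustering: pick the nearest center by squared Euclidean distance."""
    return min(
        CLUSTER_CENTERS,
        key=lambda k: (lat - CLUSTER_CENTERS[k][0]) ** 2
        + (lon - CLUSTER_CENTERS[k][1]) ** 2,
    )


def score(raw):
    """Score one raw record: feature engineering first, then the model."""
    lot = impute_lot_size(raw.get("lot_size"))
    year_bin = bin_year_built(raw["year_built"])
    cluster = assign_cluster(raw["lat"], raw["lon"])
    return (
        40000.0
        + 120.0 * raw["sqft"]
        + 2.0 * lot
        + YEAR_BIN_EFFECT[year_bin]
        + CLUSTER_EFFECT[cluster]
    )
```

Because `score` accepts the raw record, including a missing `lot_size`, the caller never has to reproduce the preprocessing steps, and the production inputs stay identical in form to the training data.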
Attached you will also find a Jupyter Notebook (RedFinNotebook.zip) used to test the model before deployment; a collection of Python models generated by JMP for the housing scenario (RedFinModels.zip); and the original JMP table with the data clean-up and the scripts to generate the models (RedFinData.zip). This last .zip file also includes a spreadsheet that illustrates how other applications can call the REST backend directly for scoring, in this case from an Excel formula.
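Any application that can issue an HTTP request can call the backend in the same way the spreadsheet does. The sketch below uses only the Python standard library. The URL is a placeholder, and the request shape (a JSON body sent with POST) is an assumption; the attached backend may expect a different method or parameter format, such as query-string parameters.

```python
import json
import urllib.request

# Placeholder endpoint; substitute the URL assigned by API Gateway.
SCORING_URL = "https://example.com/score"


def build_request(features, url=SCORING_URL):
    """Package the feature values as a JSON POST request (assumed format)."""
    payload = json.dumps(features).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_prediction(features, url=SCORING_URL, timeout=10):
    """Send the request and return the decoded JSON response."""
    request = build_request(features, url)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Separating `build_request` from `fetch_prediction` makes it possible to inspect the outgoing request without touching the network, for example `build_request({"sqft": 1500}).data`.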
References