Using the Neural platform in JMP Pro to create a validation column automatically
Apr 9, 2014 8:02 AM
JMP Pro is a great tool for quickly building multiple models with your data using a variety of techniques, namely tree-based methods (the Bootstrap Forest and Boosted Tree options in the Partition platform), neural networks and penalized regression (using the Generalized Regression personality in Fit Model).
When building predictive models, you need sound ways to validate them, or you can easily run into trouble with overfitting. Many modeling platforms in JMP Pro support a validation column, which splits your data into training and validation portions. Training data is used to build the model, and validation data is used to tune the model. Sometimes a third split, test, is held out to simulate new data, so you can see how the model performs on data it has not seen before.
Let’s say you want to use 70 percent of your data to build the model and save 30 percent to validate or fine-tune the model. You might think that taking a simple random sample of all of your rows is the best way to go, but that can easily lead to problems if you are dealing with lots of outliers or a rare event. A simple random sample could place all of the important data points (like the rare events) in the training set alone or in the validation set alone. This creates suboptimal modeling conditions and may lead you to build models that are not very useful.
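To make the risk concrete, here is a minimal Python sketch that simulates many simple random 70/30 splits and counts how many rare rows end up in the validation set each time. The row count (500), the number of rare rows (10), and the function name are hypothetical values chosen only for illustration; nothing here comes from JMP.

```python
import random

def rare_counts_in_validation(n_rows=500, n_rare=10, holdback=0.3,
                              n_trials=1000, seed=1):
    """Simulate simple random splits; count rare rows landing in validation."""
    rng = random.Random(seed)
    n_valid = round(n_rows * holdback)
    rows = list(range(n_rows))  # rows 0..n_rare-1 stand in for the rare events
    counts = []
    for _ in range(n_trials):
        valid = rng.sample(rows, n_valid)  # simple random sample, no stratification
        counts.append(sum(1 for r in valid if r < n_rare))
    return counts

counts = rare_counts_in_validation()
print("rare rows expected in validation on average:", 10 * 0.3)
print("fewest and most seen in a single split:", min(counts), max(counts))
```

On average about three of the ten rare rows fall in validation, but individual splits vary around that number, and the spread between the fewest and the most is the problem a balanced split is meant to remove.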
The Neural platform in JMP Pro can help you create an unbiased and balanced validation column automatically in just a few steps. I’m using the Boston Housing data set, which is in the Sample Data in JMP (Help > Sample Data). For this data, I want to predict a house’s mvalue based on a number of possible predictors. The Distribution below shows that the response has a number of high mvalues that we want to make sure are divided proportionally between our training and validation sets.
In the Neural Model Launch, I can specify a holdback proportion for the validation method. Because I want a 70/30 split, I’ll put .3 into the field. The Neural platform will automatically sort the response from lowest to highest value and then randomly assign each record to either the training or validation set according to the proportion I specified.
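The idea of sorting the response and then assigning rows at random can be sketched in Python as follows. This is only an illustration of one way such a stratified holdback could work, not JMP's actual implementation: the block size of 10, the function name, and the synthetic response values are all assumptions.

```python
import random

def stratified_holdback(response, holdback=0.3, block_size=10, seed=1):
    """Label rows Training/Validation so every slice of the sorted
    response contributes roughly `holdback` of its rows to validation."""
    rng = random.Random(seed)
    # Row indices ordered from the lowest to the highest response value.
    order = sorted(range(len(response)), key=lambda i: response[i])
    labels = ["Training"] * len(response)
    for start in range(0, len(order), block_size):
        block = order[start:start + block_size]
        n_valid = round(len(block) * holdback)
        # Random assignment happens only within a block of similar values.
        for i in rng.sample(block, n_valid):
            labels[i] = "Validation"
    return labels

# Synthetic response (not the Boston Housing data): 90 ordinary values, 10 high ones.
response = [20 + (i % 17) for i in range(90)] + [50.0] * 10
labels = stratified_holdback(response)
high_in_validation = sum(1 for i in range(90, 100) if labels[i] == "Validation")
print(labels.count("Validation"), "validation rows;",
      high_in_validation, "of the 10 high values are in validation")
```

Because the ten high values form their own block in the sorted order, exactly three of them go to validation and seven to training, whatever the random seed.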
I’m not particularly interested in the actual Neural model here, so I can just accept the defaults and click Go. Then, from the red triangle menu on the fit, I select the “Save Validation” option.
This will automatically create a new Validation column in my data table, with each row tagged “Training” or “Validation.”
If we again fit a Distribution to the response, this time with the Validation column as the By variable, you can see that the properties of the training and validation sets are very close. The Neural platform has done a great job of dividing my data up in an intelligent way, automatically.
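Outside of JMP, the same check amounts to comparing summary statistics of the response by group. Here is a minimal Python sketch; the ten response values and their labels are made up for illustration, and the function name is hypothetical.

```python
from statistics import mean, median, stdev

def summarize_by_split(response, labels):
    """Return basic summary statistics of the response for each split label."""
    summary = {}
    for group in sorted(set(labels)):
        values = [y for y, g in zip(response, labels) if g == group]
        summary[group] = {
            "n": len(values),
            "mean": mean(values),
            "median": median(values),
            "std": stdev(values),
            "max": max(values),
        }
    return summary

# Made-up example: two high values (50), one in each split.
response = [10, 12, 14, 16, 18, 20, 22, 24, 50, 50]
labels = ["Training", "Validation", "Training", "Training", "Validation",
          "Training", "Training", "Training", "Training", "Validation"]

summary = summarize_by_split(response, labels)
for group, stats in summary.items():
    print(group, stats)
```

If the split is balanced, the mean, median, spread and maximum should look similar across the two groups, which is what the By-group Distribution report shows graphically.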
Now I’m ready to build my models, knowing that I have a solid data splitting scheme that will let me build the most informative and useful model with my data. Thanks to Chris Gotwalt, the developer of the Neural platform, for showing me this powerful capability. It has certainly been the quickest and most reliable way to build a validation column in JMP Pro that I have found so far.