This fall, we introduced a new member of the JMP family: JMP Pro. For this first version of JMP Pro, the main intention was to start to make predictive (as opposed to exploratory) modeling more accessible to those who are drawn to the JMP style of working. Indeed, JMP Pro is part of the SAS Predictive Analytics Suite for this reason.
Predictive modeling draws on a body of knowledge that may or may not be familiar to traditional JMP users. However, one of the important aspects of the approach is the need to split data to support different phases of the predictive model-building process.
As always, "all models are wrong, but some are useful," as George E.P. Box said. The utility of a model intended to make predictions is, of course, tied up with its ability to do exactly that. But because we don’t actually have data from the future, the best we can do is "hold out" some of the data we do currently have and then use it to test the models we build to get an indication of how they are likely to perform.
In fact, it’s useful to think of two sequential steps: model selection and model assessment, the "test data" we hold out being used in the latter. Model selection is required because each type or family of model we might want to try will usually have various fitting or tuning parameters, and we need to adjust these to get the best model within this family. This gives rise to splitting the non-test data into "training data" and "validation data." Naturally, the success or otherwise of any predictive modeling effort is somewhat bound up with exactly how the observations to hand are split into training, validation and test data.
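The two steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not anything JMP Pro does internally: the data are synthetic, the model family (k-nearest-neighbor averaging) and the candidate values of k are hypothetical choices, and the 60/20/20 split is arbitrary.

```python
import math
import random


def knn_predict(train, x, k):
    """Average the responses of the k training points nearest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def mean_squared_error(train, data, k):
    """Mean squared prediction error on `data` for a model fit to `train`."""
    return sum((y - knn_predict(train, x, k)) ** 2 for x, y in data) / len(data)


# Synthetic data (an assumption for illustration): a noisy sine curve.
rng = random.Random(0)
points = []
for _ in range(300):
    x = rng.uniform(0.0, 6.0)
    points.append((x, math.sin(x) + rng.gauss(0.0, 0.3)))

# Split the rows at random into training, validation and test data.
rng.shuffle(points)
training, validation, test = points[:180], points[180:240], points[240:]

# Model selection: tune k using the validation data only.
candidate_k = [1, 3, 5, 9, 15, 25, 45]
validation_mse = {k: mean_squared_error(training, validation, k) for k in candidate_k}
best_k = min(validation_mse, key=validation_mse.get)

# Model assessment: the held-out test data is used once, at the end.
test_mse = mean_squared_error(training, test, best_k)
print(f"selected k = {best_k}, test MSE = {test_mse:.3f}")
```

The point of the design is that the test rows play no part in choosing k, so the final error is an honest indication of how the selected model might perform on new data.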
To make this data splitting easy, JMP 9 allows you to build a new random indicator column:
As you see, JMP will build a numeric column and allows you to specify the proportion of the data to be assigned a given indicator value. Note that the modeling type has been changed from Continuous to Nominal. More importantly, note that the JMP analysis platforms will assume that rows assigned an indicator value of 0, 1 or 2 are part of the training, validation or test data, respectively. You could make use of the Value Label column property to make this more intelligible. As you might expect, once you have specified the proportions you want, the assignment of indicator values to rows is random.
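For readers who think in code, the idea behind such a column can be sketched in Python. This is not how JMP implements it: the function name, the default proportions and the choice to fix the count of each code (rather than draw a code independently for each row) are all assumptions made for illustration.

```python
import random


def random_indicator(n_rows, proportions=(0.6, 0.2, 0.2), seed=None):
    """Return one code per row: 0 = training, 1 = validation, 2 = test."""
    rng = random.Random(seed)
    # Assumption: fix the number of rows per code (rounded down), and give
    # any rows left over by rounding to the last code.
    counts = [int(p * n_rows) for p in proportions[:-1]]
    counts.append(n_rows - sum(counts))
    codes = [code for code, count in enumerate(counts) for _ in range(count)]
    rng.shuffle(codes)  # which rows receive which code is random
    return codes


# Plays the role of a value-label mapping: readable names for the codes.
VALUE_LABELS = {0: "Training", 1: "Validation", 2: "Test"}

codes = random_indicator(10, seed=1)
labels = [VALUE_LABELS[code] for code in codes]
print(labels)
```

Only the counts per code are controlled here; nothing about the rows themselves influences which code they receive.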
In predictive modeling, the response ("target variable") of interest will be either continuous or discrete. In data mining circles, the first case is usually called a regression problem, and the second a classification problem. For classification problems, it’s not uncommon to try to predict relatively rare events. Put differently, it’s unlikely that the data to hand will have a roughly equal number of rows within each level of the discrete target variable.
For example, we might be interested in trying to predict the occurrence of a manufacturing failure when we have data for 9,000 passes and 1,000 failures. In cases like this, the random indicator column described above is probably not the best choice: the share of failures in the training, validation and test data will vary by chance, and with rarer events or smaller tables we might end up with very few failures, or even none, in the test data. What we would like to do is to split the data at random within each level of the discrete target variable, which gives rise to the idea of a Stratified Split.
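As a rough illustration of the idea, here is a minimal Python sketch of a stratified split applied to the 9,000 passes and 1,000 failures mentioned above. It is not the add-in's JSL; the function name, the 60/20/20 proportions and the rounding rule are assumptions made for this example.

```python
import random
from collections import Counter, defaultdict


def stratified_indicator(target, proportions=(0.6, 0.2, 0.2), seed=None):
    """Assign 0/1/2 (training/validation/test) at random within each level."""
    rng = random.Random(seed)

    # Group row indices by the level of the discrete target variable.
    rows_by_level = defaultdict(list)
    for row, level in enumerate(target):
        rows_by_level[level].append(row)

    codes = [None] * len(target)
    for rows in rows_by_level.values():
        rng.shuffle(rows)  # random order within this level only
        start, cumulative = 0, 0.0
        for code, p in enumerate(proportions):
            cumulative += p
            if code == len(proportions) - 1:
                end = len(rows)  # last code takes whatever remains
            else:
                end = round(cumulative * len(rows))
            for row in rows[start:end]:
                codes[row] = code
            start = end
    return codes


target = ["Pass"] * 9000 + ["Fail"] * 1000
codes = stratified_indicator(target, seed=42)

# Cross-tabulate level by code to verify the outcome.
table = Counter(zip(target, codes))
for key in sorted(table):
    print(key, table[key])
```

Because the shuffle happens separately within each level, every level contributes the requested share of its rows to each of the three subsets, so the failures cannot all land in one of them by chance.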
Although it is possible in principle to do these manipulations by hand, in practice it is too burdensome to bother. Fortunately, a little JMP Scripting Language (JSL) and the new JMP 9 add-in architecture can easily give us what we need. Download my Stratified Split add-in at the JMP File Exchange. (SAS login is required for access.)
Installing this add-in puts a new Stratified Split item under the Rows menu, which opens this dialog:
Hitting OK adds a new column and brings up a Distribution so you can verify the outcome:
(Note that, like the in-built equivalent, "Stratified Split" ignores any row state information. In the screenshot above, the "N Missing, 3" relates to the fact that the table has three excluded rows.)
The JSL used is rather unremarkable but will be dissected in a subsequent post. If you would like to see how splitting data is actually used in predictive modeling (with or without stratification), watch the JMP Pro launch webcast.