Jeff_Perkinson

Community Manager

Train, Validate and Test for Data Mining in JMP

Data mining is looking for patterns and relationships in (sometimes large volumes of) data. Many methods, such as recursive partitioning and neural nets, are extremely sensitive to the sample of data being mined. How do you know if you are creating a model that would be useful for predicting future outcomes?

 

We want a model that is repeatable: if used on future data, it should predict the outcome about as well as it did on the data used to build it. To check this, data miners separate their data into three subsets: training data, validation data, and testing data. There is no set rule for the proportion of data in each subset, but a good rule of thumb is 40/40/20.

 

Let’s create two new variables. The first holds one draw per row from a random uniform distribution, a value between 0 and 1. Next, use the formula editor to create a second variable called “Subset” that groups the data into the three subsets based on that first variable. If the value is less than 0.40, the row is used to train. Rows with values from 0.40 up to 0.80 are used to validate, and rows with values of 0.80 or more are used to test. The Subset formula should reference the first column rather than call Random Uniform() again for each comparison; otherwise each comparison sees a different random number and the proportions drift away from 40/40/20. See the Formula Editor below. Stay tuned for an easier way to do this in the next version of JMP.

[Image: TVTFormulaEditorNew.jpg (the Formula Editor showing the Subset formula)]
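The same two-step logic can be sketched outside JMP. The Python below is a minimal illustration only; it is not JSL and not the formula in the screenshot. The names `assign_subset` and `split_rows`, the seed, and the row count are assumptions made for the example. The point it shows is that each row gets a single uniform draw, which is stored first and then categorized against the 0.40 and 0.80 cutoffs.

```python
import random
from collections import Counter


def assign_subset(u: float) -> str:
    """Map one uniform draw u to a subset label using the 0.40 / 0.80 cutoffs."""
    if u < 0.40:
        return "Train"
    if u < 0.80:
        return "Validate"
    return "Test"


def split_rows(n_rows: int, seed: int = 1) -> list[str]:
    """Label n_rows rows; each row is categorized from one stored draw."""
    rng = random.Random(seed)
    # First "column": one uniform draw per row.
    draws = [rng.random() for _ in range(n_rows)]
    # Second "column": the label derived from that same draw.
    return [assign_subset(u) for u in draws]


labels = split_rows(10_000)
counts = Counter(labels)
for name in ("Train", "Validate", "Test"):
    # Observed shares should land close to 0.40, 0.40 and 0.20.
    print(name, counts[name] / len(labels))
```

Because the label is computed from a stored draw, re-evaluating the categorization cannot change which subset a row belongs to.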

 

The training data is used to build your model. When you believe you have modeled the pattern and not the noise, run the model on your validation data set. The results, especially if the data sets are large, should be similar. If the results are different, the model you created over-fit the training data (i.e., modeled the noise). There will be some back and forth in this process. Once the training and validation results are similar, you have found a good model that you would expect to be repeatable. This model is then put to the test with the test data set. The model should fit all three data sets similarly; otherwise, you have not uncovered the underlying pattern in the data.
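As a rough illustration of that comparison, the sketch below fits a model on the training rows only and then reports the same fit statistic on each subset. Everything in it is an assumption made for the example: the data are synthetic, the model is a simple least-squares line rather than a JMP platform, and R² stands in for whatever fit measure you use.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def r_squared(a, b, xs, ys):
    """Share of variance in ys explained by the line a + b*x."""
    my = sum(ys) / len(ys)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


rng = random.Random(7)
rows = []
for _ in range(3000):
    x = rng.uniform(0, 10)
    y = 2.0 + 0.5 * x + rng.gauss(0, 1)  # synthetic pattern plus noise
    u = rng.random()                     # one draw decides the subset
    subset = "Train" if u < 0.40 else "Validate" if u < 0.80 else "Test"
    rows.append((x, y, subset))


def subset_xy(name):
    """Return the x and y values of the rows labeled with this subset."""
    xs = [x for x, _, s in rows if s == name]
    ys = [y for _, y, s in rows if s == name]
    return xs, ys


# Fit on the training rows only, then score every subset with that one model.
a, b = fit_line(*subset_xy("Train"))
fit = {name: r_squared(a, b, *subset_xy(name))
       for name in ("Train", "Validate", "Test")}
for name, r2 in fit.items():
    print(f"{name}: R^2 = {r2:.3f}")
```

A large drop in fit from the training rows to the validation or test rows is the signal of over-fitting described above.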

 

Train your model on the training data by using the Data Filter in JMP to hide and exclude the validation and test data. Then rerun the model on the validation data, and finally on the test data, by switching the Data Filter selection to each subset in turn.

[Image: TVTDataFilterNew.jpg (the Data Filter used to switch between subsets)]

 

Using the train-validate-test method is especially important when you are comparing models from different methods. If you find that, for example, the models are repeatable for neural nets but not for recursive partitioning, you will have identified both the model and the method best suited to predicting the outcome in the future.
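To make the idea of comparing methods concrete, here is a small sketch under stated assumptions. It does not use neural nets or recursive partitioning; a least-squares line and a one-nearest-neighbor predictor are swapped in as simple stand-ins, the data are synthetic, and the training and validation samples are drawn separately for brevity. The quantity of interest is the gap between training and validation fit for each method.

```python
import random


def r_squared(preds, ys):
    """Share of variance in ys explained by the predictions."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for p, y in zip(preds, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


def make_linear(xs, ys):
    """Least-squares line; returns a function that predicts y from x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return lambda x: my + slope * (x - mx)


def make_nearest_neighbor(xs, ys):
    """Predict the y of the closest training x; this memorizes the noise."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


rng = random.Random(11)


def sample(n):
    """Draw n synthetic (x, y) points: a linear pattern plus noise."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2.0 + 0.5 * x + rng.gauss(0, 1) for x in xs]
    return xs, ys


train_x, train_y = sample(400)
valid_x, valid_y = sample(400)

gaps = {}
methods = (("linear", make_linear), ("nearest neighbor", make_nearest_neighbor))
for name, build in methods:
    model = build(train_x, train_y)
    r2_train = r_squared([model(x) for x in train_x], train_y)
    r2_valid = r_squared([model(x) for x in valid_x], valid_y)
    gaps[name] = r2_train - r2_valid
    print(f"{name}: train R^2 = {r2_train:.2f}, validation R^2 = {r2_valid:.2f}")
```

The method with the small gap is the repeatable one; the method that fits the training data almost perfectly but loses much of that fit on validation data has modeled the noise.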

 

Editor's note: This entry was originally posted by a different blogger.

2 Comments
Community Member

Michael Joner wrote:

In the formula editor screenshot, there are actually two "Random Uniform()" calls. This results in getting 40% in the Train, but then doing a 40/60 split on the remaining (non-train) 60% of the data. Thus, you get a 40/24/36 split with that code. You can see this directly by creating the formula column as described and then running the Distribution platform against that formula column. There are a few solutions to this problem. One way involves creating two columns: one column of Random Uniform() numbers and the second column to categorize.
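The arithmetic behind the 40/24/36 figure can be checked with a short sketch. This is Python rather than JSL, and the function name `sequential_split` is hypothetical. The second probability of 0.40 is inferred from the 40/60 split described in the comment, not read from the screenshot.

```python
def sequential_split(p_first: float, p_second: float) -> tuple[float, float, float]:
    """Expected shares when each comparison uses its own independent draw."""
    first = p_first                          # rows caught by the first comparison
    second = (1 - p_first) * p_second        # remainder caught by the second one
    third = (1 - p_first) * (1 - p_second)   # whatever is left over
    return first, second, third


train, validate, test = sequential_split(0.40, 0.40)
print(f"{train:.0%} / {validate:.0%} / {test:.0%}")  # prints 40% / 24% / 36%
```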

Community Member

Making better predictive models quickly with JMP wrote:

[...] while back, this blog featured a great post on the concept of training, validation and test sets to build your models. To build a model that is not only descriptive but also predictive, [...]