Data mining is looking for patterns and relationships in (sometimes large volumes of) data. Many methods, such as recursive partitioning and neural nets, are extremely sensitive to the sample of data being mined. How do you know if you are creating a model that would be useful for predicting future outcomes?
We want a model that is repeatable. That is, if used on future data, you would have similar success in predicting the outcome. Data miners separate their data into three different subsets to ensure model accuracy. They are training data, validation data, and testing data. There is no set rule as to the proportion of data in each data set, but a good rule of thumb is 40/40/20.
Let’s create two new variables. The first variable assigns values from a random uniform distribution, which randomly assigns a value between 0 and 1, inclusive. Next, create a new variable called “Subset” using the formula editor that groups our data into the three subsets. If the value is less than 0.40, then this data will be used to train. The data between 0.40 and 0.80 is used to validate. Data over 0.80 is used to test. See the Formula Editor below. Stay tuned for an easier way to do this in the next version of JMP.
The training data is used to build your model. When you feel like you have modeled the pattern and not the noise, run the model on your validation data set. The results, especially if the data sets are large, should be similar. If the results are different, the model you created over-fit the data (i.e., modeled the noise). There will be some back and forth in this process. Once both the training and validation results are similar, you have found a good model that you would expect to be repeatable. This model is put to test with the test data set. All three models should have a similar fit; otherwise, you have not uncovered the underlying pattern in the data.
Train your model using the training data by hiding and excluding the validation and test data using the Data Filter in JMP. Rerun your model using the validation and then test data sets by toggling between the validation and test data using the Data Filter.
Using the train-validate-test method is especially important when you are comparing models from different methods. If you find that, for example, the models are repeatable for neural nets but not recursive partitioning, you will have uncovered the model and the best method to predict the outcome in the future.
Editor's note: This entry was originally posted by a different blogger.