Fitting a Decision Tree with Validation

In this video, we use the Chemical Manufacturing example and fit a regression tree for the continuous response, Yield. We use JMP Pro for this demo.

The acceptable yield for this process is 80%.

The data have been partitioned into training and validation data. 60% of the observations have been randomly assigned to the training set, and 40% of the observations are in the validation set.
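Outside JMP, a random 60/40 assignment like this one can be sketched in a few lines of Python. This is only an illustration: the function name, the labels, and the seed are assumptions, and it is not how the validation column in this data set was actually created.

```python
import random

def assign_validation(n_rows, train_fraction=0.6, seed=1):
    """Return one 'Training' or 'Validation' label per row (hypothetical helper)."""
    n_train = round(n_rows * train_fraction)
    labels = ["Training"] * n_train + ["Validation"] * (n_rows - n_train)
    # Shuffle so that the assignment to each set is random, with a fixed seed
    # so the same partition can be reproduced.
    random.Random(seed).shuffle(labels)
    return labels

labels = assign_validation(100)
print(labels.count("Training"), labels.count("Validation"))
```

Fixing the counts first and then shuffling gives exactly the requested proportions, rather than proportions that are only approximately 60/40.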

We will fit the regression tree to the training data and determine when to stop tree growth using the validation data.
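To give a feel for what one step of tree growth does, here is a minimal sketch of a split search on a single continuous predictor. It assumes the split is chosen to minimize the within-group sum of squared errors of the response. JMP ranks candidate splits with its own criterion, so treat this as a simplified stand-in and not as JMP's algorithm. The function names and the toy data are made up for illustration.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(x, y):
    """Return (threshold, sse_after_split) for the best cut on x, or None."""
    pairs = sorted(zip(x, y))
    best = None
    for i in range(1, len(pairs)):
        # A cut is only possible between two distinct x values.
        if pairs[i][0] == pairs[i - 1][0]:
            continue
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [yv for _, yv in pairs[:i]]
        right = [yv for _, yv in pairs[i:]]
        total = sse(left) + sse(right)
        if best is None or total < best[1]:
            best = (threshold, total)
    return best

# Toy data, not the Chemical Manufacturing data.
print(best_split([1, 2, 3, 4], [70, 70, 90, 90]))
```

A full tree repeats this search over every predictor and every current leaf, then splits wherever the improvement is largest.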

To do this, we select Predictive Modeling from the Analyze menu, and then Partition. We select Yield as the Y, Response variable and select the two groups of predictors as the X, Factors.

There are different options for using model validation in JMP Pro. One method is to specify a holdout portion. For example, we can hold 40% of the data out of the model to use for validation.

Instead, we'll use the validation column. We drag this column to the Validation role and click OK.

Because we are using validation, JMP provides a Go button. We'll click this button to automatically build the tree and then prune it back.

The tree is a little large, so we'll hide it. To do this, we use the top red triangle, Display Options, and deselect Show Tree.

The final model has seven splits.

You can see the Split History at the bottom. This is a small data set, so only eight splits are possible. The model was pruned back to seven splits. This is the model with the maximum RSquare value on the validation set.
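The idea of pruning back to the best validation result can be sketched as follows. The RSquare values in the list are invented for illustration and are not the values from this Split History. The function is a hypothetical helper, and JMP's own stopping rule for the Go button may differ in its details.

```python
def best_split_count(validation_rsquare):
    """Return the number of splits with the highest validation RSquare.

    validation_rsquare[k] is the validation RSquare after k splits.
    On ties, max() keeps the first maximum, which is the smaller tree.
    """
    return max(range(len(validation_rsquare)), key=lambda k: validation_rsquare[k])

# Invented values for 0 through 8 splits.
history = [0.00, 0.20, 0.30, 0.35, 0.33, 0.36, 0.38, 0.41, 0.40]
print(best_split_count(history))
```

With these made-up numbers the validation RSquare peaks at seven splits and drops at the eighth, so the eighth split would be pruned.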

The RSquare and RMSE are both better for the training data, which were used to build the model: RSquare is higher and RMSE is lower. This means that the model fits the training data better than it fits the validation data.
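For reference, the two statistics can be computed from actual and predicted values as in the sketch below. It uses the standard definitions, with RMSE taken as the square root of the average squared error. The exact denominator JMP uses is not stated in this video, so that choice is an assumption.

```python
import math

def rsquare(actual, predicted):
    """1 - SSE/SST: the share of response variation explained by the model."""
    mean_y = sum(actual) / len(actual)
    sse = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    sst = sum((y - mean_y) ** 2 for y in actual)
    return 1 - sse / sst

def rmse(actual, predicted):
    """Square root of the average squared prediction error (assumed denominator: n)."""
    sse = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    return math.sqrt(sse / len(actual))
```

Applying both functions separately to the training rows and to the validation rows gives the kind of side-by-side comparison shown in the report.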

To see this, we'll select Plot Actual by Predicted from the top red triangle. Notice that the data points are much more tightly clustered for the training set than they are for the validation set.

This plot tells you that there is still a lot of unexplained variation in the data.

Let's see which variables are involved in this model. To do this, we select Column Contributions from the top red triangle. Only five of the variables are involved in the model.

Finally, let's look at the Leaf report. We'll select this option from the top red triangle. Remember that the acceptable yield for this process is 80%. The highest predicted yield is 87.3%.
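One way to read a leaf report against a target is to keep only the leaves whose predicted mean meets it. In the sketch below, the 80% target and the 87.3 maximum come from this demo. Every rule, predictor name (X1, X2), count, and other mean is a made-up placeholder and not the actual Leaf report.

```python
ACCEPTABLE_YIELD = 80.0  # acceptable yield stated for this process

# Hypothetical leaves: rules, counts, and all means except 87.3 are invented.
leaves = [
    {"rule": "X1 >= 10 & X2 < 5", "mean": 87.3, "count": 12},
    {"rule": "X1 >= 10 & X2 >= 5", "mean": 81.0, "count": 9},
    {"rule": "X1 < 10", "mean": 74.5, "count": 15},
]

def leaves_meeting_target(leaves, target):
    """Return the leaves whose predicted mean is at least the target, best first."""
    keep = [leaf for leaf in leaves if leaf["mean"] >= target]
    return sorted(keep, key=lambda leaf: leaf["mean"], reverse=True)

for leaf in leaves_meeting_target(leaves, ACCEPTABLE_YIELD):
    print(leaf["rule"], leaf["mean"])
```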

Given the current process, the rules provide some indication as to how higher yields might be achieved.