
Fitting a Regression Tree


Learn more in our free online course:
Statistical Thinking for Industrial Problem Solving

 

In this video, we use the Chemical Manufacturing example and fit a regression tree for the continuous response, Yield.

 

To do this, we select Predictive Modeling from the Analyze menu, and then Partition.

 

We select Yield as the Y, Response variable. Then we select the two groups of predictors as the X, Factors.
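
The same kind of model can also be sketched in code. The snippet below is a minimal, hypothetical analogue using scikit-learn's DecisionTreeRegressor rather than JMP's Partition platform; the column names mirror the predictors mentioned in this video, but the data values are made up, and the real Chemical Manufacturing table contains more predictors.

```python
# Hypothetical analogue of the Partition launch using scikit-learn (not JMP).
# Column names mirror the video; the data values are invented.
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

df = pd.DataFrame({
    "Vessel Size":      [2000, 2000, 750, 500, 750, 2000, 500, 750],
    "Carbamate Amount": [1.0,  1.3,  1.2, 1.1, 1.5, 0.9,  1.6, 1.4],
    "Water Content":    [1.4,  1.7,  1.6, 1.3, 1.8, 1.5,  1.9, 1.2],
    "Yield":            [78.0, 82.5, 86.0, 83.5, 86.5, 77.5, 84.0, 82.0],
})

X = df[["Vessel Size", "Carbamate Amount", "Water Content"]]  # X, Factors
y = df["Yield"]                                               # Y, Response

# max_leaf_nodes=4 allows at most three splits, like the tree built in this video
tree = DecisionTreeRegressor(max_leaf_nodes=4, random_state=0).fit(X, y)
```

Note that scikit-learn chooses splits by reducing squared error, whereas JMP's Partition ranks candidates by LogWorth, so the two tools will not necessarily pick identical cutpoints.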

 

The horizontal line in the partition graph shows the mean of Yield, 82.86. Each point is plotted at its Yield value on the Y axis and is randomly scattered on the X axis. The points are colored by Performance: rejected batches are red, and accepted batches are green.

 

The potential split variables and values are listed in the Candidates outline. For each predictor, there is a candidate SS, a LogWorth statistic, and a cutpoint. The candidate SS is the variation in the response that would be explained by splitting the data at the cutpoint for that variable.

 

The candidate with the highest LogWorth is Vessel Size, at a cutpoint of 2000.
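
To make the candidate SS idea concrete, here is a small, self-contained sketch that computes the variation explained by splitting Yield at one candidate cutpoint: the total sum of squares about the grand mean minus the within-group sums of squares after the split. The Yield values and the Vessel Size grouping below are invented for illustration, and LogWorth (a p-value-based statistic) is not computed here.

```python
import numpy as np

# Invented Yield values and one candidate split (large vessel vs. smaller vessels)
yield_vals    = np.array([78.0, 82.5, 86.0, 83.5, 86.5, 77.5, 84.0, 82.0])
in_big_vessel = np.array([True, True, False, False, False, True, False, False])

def candidate_ss(y, split_mask):
    """Variation in y explained by splitting it into two groups:
    SS about the grand mean minus the within-group SS after the split."""
    ss_total = np.sum((y - y.mean()) ** 2)
    left, right = y[split_mask], y[~split_mask]
    ss_within = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
    return ss_total - ss_within

print(candidate_ss(yield_vals, in_big_vessel))  # SS explained by this split
print(yield_vals[in_big_vessel].mean(), yield_vals[~in_big_vessel].mean())  # branch means
```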

 

When we click Split, we see that this candidate is used for the first split. If Vessel Size is 2000, the mean of Yield is 80.67. But if Vessel Size is 750 or 500, the mean is 84.19.

 

We'll split again.

 

The second split is in the first branch. When Vessel Size is 2000 and Carbamate Amount is less than 1.12, the mean is 77.679. But when Carbamate Amount is greater than or equal to 1.12, the mean is 82.1.

 

The next split is in the second branch, when Vessel Size is 750 or 500. For the smaller vessel sizes, when Water Content is greater than or equal to 1.59, the mean is 86.2.

 

With each split, the partition graph updates. This graph provides a summary of the splits in the tree.

 

We'll select Leaf Report from the top red triangle to see the prediction rules. The current model results in four predicted values, ranging from 77.67 to 86.21.
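
As an analogue of the Leaf Report, a fitted scikit-learn tree can be printed as a set of rules with export_text. This continues the hypothetical model from the earlier sketch (the tree and X objects defined there); each terminal node in the printout ends with one predicted value, the mean of Yield in that leaf.

```python
from sklearn.tree import export_text

# Text "leaf report" for the tree fitted in the earlier sketch:
# each leaf line ends with that leaf's predicted value.
print(export_text(tree, feature_names=list(X.columns)))
```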

 

To see this model, we'll save the formula for the model to the data table. To do this, we select Save Columns and then Save Prediction Formula from the top red triangle. This adds a new formula column to the data table. Let's look at this formula.

 

You can see that the regression tree formula is simply a series of nested If-Then statements. This model, which has three splits, results in one of four possible predicted values.
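
To illustrate that structure, here is a hand-written sketch of such a formula as nested if-then statements in Python. The cutpoints and three of the leaf means come from the values quoted in this video; the mean for the small-vessel, lower-water-content leaf is not stated, so it appears as a labeled placeholder.

```python
# Sketch of the saved prediction formula as nested if-then statements.
# Cutpoints and three leaf means are the values quoted in the video;
# the fourth leaf mean is a placeholder, not a real result.
PLACEHOLDER_LEAF_MEAN = 84.0  # small vessels, Water Content < 1.59 (value not given)

def predicted_yield(vessel_size, carbamate_amount, water_content):
    if vessel_size >= 2000:                # Vessel Size is 2000
        if carbamate_amount < 1.12:
            return 77.679
        else:
            return 82.1
    else:                                  # Vessel Size is 750 or 500
        if water_content >= 1.59:
            return 86.21
        else:
            return PLACEHOLDER_LEAF_MEAN

print(predicted_yield(2000, 1.0, 1.5))     # 77.679
print(predicted_yield(750, 1.3, 1.7))      # 86.21
```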