julian, Community Manager

## Creating a Classification Tree

Statistical Thinking for Industrial Problem Solving

In this video, we use the Chemical Manufacturing example and fit a classification tree for the categorical response, Performance.

To do this, we select Predictive Modeling from the Analyze menu, and then select Partition.

We select Performance as the Y, Response variable.

Then we select the two groups of predictors as the X, Factors.

The horizontal line in the partition graph shows the overall acceptance rate. When Performance is Accept, the points are randomly scattered below the line, and when Performance is Reject, the points are randomly scattered above the line. We'll click Color Points so that you can see this better.

Our initial model has one node. Let's add a few more details. To do this, we select Display Options, and then Show Split Count, from the top red triangle.

You can see the rates and the number of observations for both outcomes. Fourteen of the ninety batches were rejected.

To create the classification tree, you can use the red triangle for the node to apply a specified split. Or you can use the Split button. We'll click the Split button one time.

The first split is on the variable Base Assay. When Base Assay is below 97.9, the reject rate is 0.4167. Otherwise, the reject rate is 0.0606.
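The video reports the two reject rates but not the counts behind them. As a quick consistency check, the sketch below searches for whole-number counts that reproduce both reported rates, given fourteen rejects in ninety batches. The resulting counts are inferred from the rounded rates, not read from the JMP report, and the variable names are our own.

```python
TOTAL_BATCHES = 90   # from the transcript
TOTAL_REJECTS = 14   # from the transcript
LOW_RATE = 0.4167    # reported reject rate when Base Assay < 97.9
HIGH_RATE = 0.0606   # reported reject rate when Base Assay >= 97.9
TOL = 5e-5           # half a unit in the fourth decimal place


def consistent_counts():
    """Return every (rejects_low, n_low, rejects_high, n_high) that matches both rates."""
    matches = []
    for n_low in range(1, TOTAL_BATCHES):
        n_high = TOTAL_BATCHES - n_low
        for r_low in range(0, min(TOTAL_REJECTS, n_low) + 1):
            r_high = TOTAL_REJECTS - r_low
            if r_high > n_high:
                continue
            low_ok = abs(r_low / n_low - LOW_RATE) < TOL
            high_ok = abs(r_high / n_high - HIGH_RATE) < TOL
            if low_ok and high_ok:
                matches.append((r_low, n_low, r_high, n_high))
    return matches


matches = consistent_counts()
print(matches)  # [(10, 24, 4, 66)]
```

Only one combination fits: 10 rejects in 24 batches on the low Base Assay side, and 4 rejects in 66 batches on the other side.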

Notice that JMP also reports a Prob value. This is the predicted probability of class membership based on the model. This value is slightly different from the rates because it includes a weighting factor to make sure that the probabilities are always nonzero.
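The video does not give the formula behind the Prob value. To show the general idea of a weighting factor that keeps probabilities away from zero, here is a minimal sketch of generic additive smoothing toward a prior. The function name, the `weight` parameter, and the node counts are assumptions for illustration, and this is not presented as JMP's exact calculation.

```python
def smoothed_probs(counts, prior, weight=1.0):
    """Blend observed counts with a prior so that no class gets probability zero.

    counts: observed count per response level in a node
    prior:  prior probability per response level (should sum to 1)
    weight: how many "pseudo-observations" the prior is worth (assumed value)
    """
    total = sum(counts.values()) + weight
    return {
        level: (counts[level] + weight * prior[level]) / total
        for level in counts
    }


# Prior taken from the overall rates in the transcript: 14 of 90 batches rejected.
prior = {"Reject": 14 / 90, "Accept": 76 / 90}

# Hypothetical node in which every observed batch was rejected.
node_counts = {"Reject": 6, "Accept": 0}

probs = smoothed_probs(node_counts, prior)
print(probs)
```

In this sketch the raw Accept rate in the node is zero, but the smoothed probability of Accept stays slightly above zero, which is the behavior the Prob value is described as having.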

Let's split again.

This time, the model splits on Carbamate Amount within the branch where Base Assay is less than 97.9. In this branch, if Carbamate Amount is less than 1.1, the probability of Reject is 1.0, but if Carbamate Amount is greater than or equal to 1.1, the probability of Reject is 0.22.

Before we split further, let's take a closer look at the model we are building. To do this, we'll select Leaf Report from the top red triangle.

You can see that a classification tree is a series of If-Then statements. Let's split one more time.
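To make the If-Then idea concrete, the two splits described so far can be written as nested conditions. This is only a sketch of the rules as stated in the video, not output from JMP. The function name is our own, and the returned values are the figures quoted above for each leaf.

```python
def predicted_reject_rate(base_assay, carbamate_amount):
    """Return the reject value quoted in the video for the leaf a batch falls into."""
    if base_assay < 97.9:
        if carbamate_amount < 1.1:
            return 1.0      # low Base Assay and low Carbamate Amount
        return 0.22         # low Base Assay, Carbamate Amount >= 1.1
    return 0.0606           # Base Assay >= 97.9 (not yet split further)


print(predicted_reject_rate(97.0, 1.0))   # 1.0
print(predicted_reject_rate(97.0, 1.5))   # 0.22
print(predicted_reject_rate(98.5, 1.0))   # 0.0606
```

Each path from the top of the tree to a leaf corresponds to one branch of the nested `if` statements.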

Now the model has split on Vessel Size. Based on this model, the lowest probability of Reject is when Base Assay is greater than or equal to 97.9 and the Vessel Size is 500 or 750. This predicted probability is close to zero!

The highest probability of Reject is when Base Assay is less than 97.9 and Carbamate Amount is less than 1.1.

Notice that some of these nodes don't have a lot of observations. This is a small data set for a classification tree. We could continue to split until no further splits are possible. But this would lead us to overfit the model.

To make sure the model doesn't split on nodes with very few observations, you can set a minimum split size before you build the model.
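Here is a rough sketch of how a minimum size constraint limits the splits a tree can make. It assumes the constraint means each side of a split must contain at least `min_size` observations, and it uses Gini impurity as a simple stand-in for the splitting criterion, which the video does not describe. The data and function names are made up for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(values, labels, min_size):
    """Return (threshold, impurity_reduction) for the best allowed split, or None.

    Rows with value < threshold go to the left child, the rest go right.
    Candidates that leave either child with fewer than min_size rows are skipped.
    """
    n = len(values)
    parent = gini(labels)
    best = None
    for t in sorted(set(values))[1:]:  # the smallest value would leave the left child empty
        left = [y for x, y in zip(values, labels) if x < t]
        right = [y for x, y in zip(values, labels) if x >= t]
        if len(left) < min_size or len(right) < min_size:
            continue
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        gain = parent - weighted
        if best is None or gain > best[1]:
            best = (t, gain)
    return best


# Made-up data: eight batches, two of them rejected.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = ["Reject", "Reject", "Accept", "Accept", "Accept", "Accept", "Accept", "Accept"]

print(best_split(x, y, min_size=1))  # splits off the two rejects exactly
print(best_split(x, y, min_size=3))  # forced to a coarser split
print(best_split(x, y, min_size=5))  # None: no split leaves 5 rows on both sides
```

With a small minimum size, the sketch isolates a two-row node perfectly. A larger minimum forces a less pure split or no split at all, which is how the setting guards against building nodes on very few observations.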

For illustration, we'll use the Prune button to prune the model back to one node.

We'll use 10 for the minimum split size. To set this, we'll select Minimum Split Size from the top red triangle, enter the value, and click OK.

Now, we'll click the Split button a few times.

Our final model has four splits. You can easily see that the combination of low Base Assay and low Carbamate Amount is a problem. More than half of the rejects are in this branch!
