Variable Selection with a Bootstrap Forest


In this video, we use the Chemical Manufacturing example and fit a bootstrap forest for the continuous response, Yield. We use JMP Pro for this demo.

The acceptable yield for this process is 80%. There are 17 potential predictors but only 90 observations. We'll use a bootstrap forest to identify the most important predictors.

To do this, we select Predictive Modeling from the Analyze menu, and then Bootstrap Forest.

We select Yield as the Y, Response variable and select the two groups of predictors as the X, Factors.

This takes us to a specification window. Here, we can specify the number of trees in the forest, the number of terms to be sampled per split, and the proportion of the data to be resampled, with replacement, for each tree. We can also limit the size of each tree and set the minimum split size.

We'll keep the default settings but enter a random seed of 1234 so that the analysis is repeatable.
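To make the role of these settings concrete, here is a minimal Python sketch of the two kinds of random draws a bootstrap forest makes: rows resampled with replacement for each tree, and a subset of candidate predictors for a split. It is not JMP's implementation. The function name and the values used (100 trees, 4 terms per split, a sample rate of 1.0) are illustrative assumptions, not JMP's defaults. Only the 90 rows, 17 predictors, and seed 1234 come from the demo.

```python
import random

def draw_tree_inputs(n_rows, n_predictors, n_trees, terms_per_split,
                     sample_rate, seed):
    """Hypothetical helper: for each tree, draw bootstrap row indices and
    one example set of candidate predictors, using a seeded generator."""
    rng = random.Random(seed)                 # fixed seed -> repeatable draws
    n_sampled = round(sample_rate * n_rows)   # rows drawn per tree
    plans = []
    for _ in range(n_trees):
        # Bootstrap sample: rows are drawn WITH replacement.
        rows = [rng.randrange(n_rows) for _ in range(n_sampled)]
        # Candidate predictors for a split: drawn WITHOUT replacement.
        candidates = sorted(rng.sample(range(n_predictors), terms_per_split))
        plans.append((rows, candidates))
    return plans

# Illustrative settings (assumed, not JMP defaults).
first = draw_tree_inputs(90, 17, 100, 4, 1.0, seed=1234)
second = draw_tree_inputs(90, 17, 100, 4, 1.0, seed=1234)

print(first == second)   # same seed, same draws
print(first[0][1])       # candidate predictors drawn for the first tree
```

Because every random choice flows from the seeded generator, rerunning with the same seed reproduces the same forest inputs, which is the point of entering a seed in the specification window.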

You can see that the RSquare for this model is 0.665. However, we're primarily interested in identifying important variables.
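As a reminder of what RSquare measures, the sketch below computes it as one minus the ratio of the error sum of squares to the total sum of squares. The function name and the four yield values are made up for illustration. They are not taken from the Chemical Manufacturing data.

```python
def r_square(actual, predicted):
    """R-square = 1 - SS(error) / SS(total)."""
    mean = sum(actual) / len(actual)
    ss_total = sum((a - mean) ** 2 for a in actual)
    ss_error = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return 1 - ss_error / ss_total

# Made-up yields and predictions, for illustration only.
actual = [78.0, 82.0, 80.0, 84.0]
predicted = [79.0, 81.0, 80.0, 83.0]

print(round(r_square(actual, predicted), 2))  # 0.85
```

A value of 0.665 therefore means the model accounts for roughly two thirds of the variation in Yield.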

We'll select Column Contributions from the top red triangle. You can see that there are five or six important variables.
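One common way to score a predictor's contribution in a forest is to add up the reduction in sum of squares from every split made on that predictor, across all trees, and report each total as a portion of the whole. The sketch below does this with a small forest written in plain Python. It assumes that scoring rule and is not a claim about JMP's exact computation. The data are synthetic (90 rows, 17 predictors, with only the first two predictors driving the response), and all function names and settings are hypothetical.

```python
import random

def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(X, y, rows, columns, min_leaf):
    """Return (gain, column, cut) with the largest SS reduction, or None."""
    parent = sse([y[i] for i in rows])
    best = None
    for col in columns:
        ordered = sorted(rows, key=lambda i: X[i][col])
        for k in range(min_leaf, len(ordered) - min_leaf + 1):
            cut = X[ordered[k]][col]
            if X[ordered[k - 1]][col] == cut:
                continue  # tied values cannot be separated
            left = [y[i] for i in ordered[:k]]
            right = [y[i] for i in ordered[k:]]
            gain = parent - sse(left) - sse(right)
            if best is None or gain > best[0]:
                best = (gain, col, cut)
    return best

def grow(X, y, rows, rng, terms_per_split, min_leaf, depth, contrib):
    """Grow one tree, adding each split's SS reduction to contrib."""
    if depth == 0:
        return
    # Only a random subset of predictors is considered at each split.
    columns = rng.sample(range(len(X[0])), terms_per_split)
    split = best_split(X, y, rows, columns, min_leaf)
    if split is None:
        return
    gain, col, cut = split
    contrib[col] += gain
    left = [i for i in rows if X[i][col] < cut]
    right = [i for i in rows if X[i][col] >= cut]
    grow(X, y, left, rng, terms_per_split, min_leaf, depth - 1, contrib)
    grow(X, y, right, rng, terms_per_split, min_leaf, depth - 1, contrib)

def column_contributions(X, y, n_trees, terms_per_split, min_leaf,
                         max_depth, seed):
    """Return each column's portion of the total SS reduction."""
    rng = random.Random(seed)
    contrib = [0.0] * len(X[0])
    n = len(X)
    for _ in range(n_trees):
        rows = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        grow(X, y, rows, rng, terms_per_split, min_leaf, max_depth, contrib)
    total = sum(contrib)
    return [c / total for c in contrib]

# Synthetic stand-in data: only predictors 1 and 2 affect the response.
data_rng = random.Random(0)
X = [[data_rng.gauss(0, 1) for _ in range(17)] for _ in range(90)]
y = [80 + 5 * row[0] + 3 * row[1] + data_rng.gauss(0, 0.5) for row in X]

portions = column_contributions(X, y, n_trees=25, terms_per_split=4,
                                min_leaf=5, max_depth=3, seed=1234)
ranking = sorted(range(17), key=lambda c: portions[c], reverse=True)
for c in ranking[:5]:
    print(f"X{c + 1}: {portions[c]:.3f}")
```

Sorting by portion gives the same kind of ranked list as the Column Contributions report: predictors that are chosen often and produce large reductions rise to the top, while noise predictors collect only small, incidental contributions.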

If we want to improve yield, we might study this subset of variables further.

An alternative to using the Bootstrap Forest platform for variable selection is to use the Predictor Screening platform.

We'll launch this platform from Analyze and then Screening.

Predictor Screening is just what its name suggests: a platform for screening potentially important predictors. To do this, it uses a bootstrap forest behind the scenes.

We'll select Yield as the Y, Response, and select both groups of predictors for the X role.

We'll change the number of trees to 1000 to create a bigger forest, and click OK.

You can see that, once again, the top five or six predictors stand out. These predictors can be copied and pasted into another analysis window for further study.