Variable Selection in Generalized Regression

Learn more in our free online course:
Statistical Thinking for Industrial Problem Solving

In this video, we use the Chemical Manufacturing data. We fit a logistic regression model for the categorical response, Performance, using generalized regression. Then we show how to do variable selection to reduce this model.

First, we select Fit Model from the Analyze menu. We select Performance for the Y role and both sets of predictors as model effects.

Then we select Generalized Regression from the menu for Personality. Because we have a two-level categorical response, the default response distribution is Binomial. We select Reject as the target level, and run this model.

JMP has fit a logistic regression model. Notice that, when we start with all of the terms in the model, none of them are significant.

What if we want to reduce this model? Several variable selection methods are available, such as backward elimination, forward selection, pruned forward selection (or mixed), and best subset (or all possible models).

Because we have many predictors, we use pruned forward selection for this example. This method both adds and removes terms from the model, in a series of steps. At each step, it adds the most significant term that is not in the model, and it removes the least significant term that is in the model.
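To make the add-and-remove idea concrete, here is a minimal Python sketch. It is a simplification and not JMP's algorithm: instead of significance tests it uses a single hypothetical criterion function (lower is better), and the names pruned_forward_path, best_model, and toy_criterion are made up for illustration.

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple

Step = Tuple[FrozenSet[str], float]


def pruned_forward_path(
    candidates: Sequence[str],
    criterion: Callable[[FrozenSet[str]], float],
    max_steps: int = 50,
) -> List[Step]:
    """Return every visited (term set, criterion value) pair; lower is better."""
    current: FrozenSet[str] = frozenset()
    path: List[Step] = [(current, criterion(current))]
    for _ in range(max_steps):
        remaining = [t for t in candidates if t not in current]
        if not remaining:
            break
        # Forward step: add the term that gives the lowest criterion value.
        current = min((current | {t} for t in remaining), key=criterion)
        path.append((current, criterion(current)))
        # Pruning step: drop one term, but only if that lowers the criterion.
        if len(current) > 1:
            pruned = min((current - {t} for t in current), key=criterion)
            if criterion(pruned) < criterion(current):
                current = pruned
                path.append((current, criterion(current)))
    return path


def best_model(path: List[Step]) -> Step:
    """Pick the visited model with the lowest criterion value."""
    return min(path, key=lambda step: step[1])


# Hypothetical criterion: missing a useful term costs 10, every term costs 1.
USEFUL = frozenset({"A", "B"})


def toy_criterion(terms: FrozenSet[str]) -> float:
    return 10 * len(USEFUL - terms) + len(terms)


path = pruned_forward_path(["A", "B", "C"], toy_criterion, max_steps=4)
best_terms, best_value = best_model(path)
print(sorted(best_terms), best_value)
```

In this toy run, term C is added and then pruned again, because it raises the criterion without helping. The whole path is kept, so the best model can be chosen afterward, which mirrors how the solution path is used later in this video.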

You can see that four of the terms are significant and that several of the terms have been removed from the model. To better see this, we’ll sort on the p-value (Prob > ChiSquare).

This is our reduced model.

This analysis also provides two “solution path” graphs to help you visualize and explore your model.

In these graphs, the vertical red line is drawn at our current reduced model.

The blue lines in the first graph show you the size of the parameter estimates. These estimates are for predictors that have been centered by their means and scaled by their sums of squares, which makes the estimates comparable across predictors.
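As a rough illustration of centering and scaling, the Python sketch below subtracts the mean from a column and divides by the square root of its centered sum of squares. The exact scaling that JMP applies is an assumption here, and center_and_scale is a hypothetical helper name.

```python
import math
from typing import List


def center_and_scale(values: List[float]) -> List[float]:
    """Center a column at zero and scale it to a unit sum of squares."""
    mean = sum(values) / len(values)
    centered = [v - mean for v in values]
    norm = math.sqrt(sum(c * c for c in centered))
    if norm == 0.0:
        raise ValueError("a constant column cannot be scaled")
    return [c / norm for c in centered]


# Hypothetical predictor values, for illustration only.
scaled = center_and_scale([1.0, 2.0, 3.0])
print(scaled)
```

After this transformation every predictor has mean zero and the same sum of squares, so the sizes of the estimates can be compared directly.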

To better see this, we select Regression reports from the red triangle for the model, and then Parameter Estimates for Centered and Scaled Predictors.

This line represents the parameter estimate for Assay. The value of this estimate, for this reduced model, is 14.46.

Let’s drag the red line to the beginning, when no terms are in the model, and see how the model changes as we add terms.

The first term that is added is Base Assay. Then Vessel Size is added. At the next step, Inlet Temp is added. This process continues. At some steps, the least significant term is also removed.

Let’s drag this line to step 45. Notice that most of the terms are in the model at this step. Also, notice how large many of the parameter estimates have gotten. For example, look at Base Assay. With all of the other terms in the model at this step, the parameter estimate for Base Assay has exploded.

This is evidence of multicollinearity. Multicollinearity can cause both the parameter estimates and their standard errors to become inflated.
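One common way to quantify multicollinearity is the variance inflation factor (VIF). The Python sketch below covers only the two-predictor case, where VIF = 1 / (1 - r^2) and r is the correlation between the two predictors. The data are hypothetical, and this check is not part of the JMP analysis shown in the video.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Sample correlation between two equally long columns."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x: Sequence[float], y: Sequence[float]) -> float:
    """VIF when the model contains exactly these two predictors."""
    r = pearson_r(x, y)
    return 1.0 / (1.0 - r * r)


# Hypothetical columns: the second nearly duplicates the first.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.1, 1.9, 3.2, 3.9, 5.1]
# Hypothetical column that is uncorrelated with x1.
x3 = [2.0, -1.0, -2.0, -1.0, 2.0]

print(vif_two_predictors(x1, x2))  # large: strong collinearity
print(vif_two_predictors(x1, x3))  # 1.0: no inflation
```

A VIF near 1 means the predictor carries its own information. A large VIF means that the predictor is nearly a combination of the others, which is when estimates and standard errors tend to inflate.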

Let’s look at the second graph. This graph shows the value of the AICc statistic, which is the default validation method for this model.

Because we aren’t using holdout validation in this example, AICc is used to select the best model. The lower the value of this statistic, the better.
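For reference, the usual definition is AICc = -2 log L + 2k + 2k(k + 1) / (n - k - 1), where log L is the maximized log-likelihood, k is the number of estimated parameters, and n is the number of observations. Here is a minimal Python sketch of that formula. The input values are hypothetical and are not taken from the Chemical Manufacturing data.

```python
def aicc(log_likelihood: float, k: int, n: int) -> float:
    """Corrected AIC: AIC plus a small-sample penalty term."""
    if n - k - 1 <= 0:
        raise ValueError("AICc needs n > k + 1")
    aic = -2.0 * log_likelihood + 2.0 * k
    correction = (2.0 * k * (k + 1)) / (n - k - 1)
    return aic + correction


# Hypothetical fit: log-likelihood -50 with 3 parameters and 100 rows.
print(aicc(-50.0, 3, 100))  # 106.25
```

The correction term grows as k approaches n, so AICc penalizes large models more heavily than AIC does when the sample is small.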

Pruned forward selection has selected the model with the lowest AICc value.

However, models with AICc values in the green zone are comparable to our reduced model. We could, for example, reduce this model further.
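The idea of a zone of comparable models can be sketched as a simple filter: keep every model whose AICc is within some distance of the minimum. In the Python sketch below, the cutoff of 4 units is an assumption rather than something stated in this video, and the model labels and AICc values are hypothetical.

```python
from typing import Dict, List


def comparable_models(aicc_by_model: Dict[str, float], delta: float = 4.0) -> List[str]:
    """Names of models whose AICc is within delta of the smallest AICc."""
    best = min(aicc_by_model.values())
    return sorted(
        name for name, value in aicc_by_model.items() if value - best <= delta
    )


# Hypothetical AICc values for three candidate models.
example = {"7 terms": 101.3, "5 terms": 103.0, "3 terms": 108.9}
print(comparable_models(example))  # ['5 terms', '7 terms']
```

With these made-up numbers, the five-term model is treated as comparable to the seven-term model, so choosing the smaller one would be a defensible simplification.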

Let’s return to the model selected by pruned forward selection. To see this model, we select Show Prediction Expression from the red triangle for the model. This is simply a logistic regression model with seven predictors.
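A logistic prediction expression has a simple general form: a linear combination of the predictors passed through the logistic function. The Python sketch below shows that form only. The coefficient values and the predictor names x1 and x2 are hypothetical, and the sign convention for the target level is an assumption.

```python
import math
from typing import Dict


def predict_probability(
    intercept: float,
    coefficients: Dict[str, float],
    row: Dict[str, float],
) -> float:
    """Probability of the target level for one row of predictor values."""
    linear = intercept + sum(
        coefficients[name] * row[name] for name in coefficients
    )
    return 1.0 / (1.0 + math.exp(-linear))


# Hypothetical coefficients and one hypothetical row.
coefficients = {"x1": 0.8, "x2": -1.5}
row = {"x1": 2.0, "x2": 1.0}
print(predict_probability(-0.1, coefficients, row))
```

When the linear part equals zero, the predicted probability is exactly 0.5. Larger positive values push the probability toward 1, and larger negative values push it toward 0.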