Fitting a Multiple Linear Regression Model with Validation

In this example, we fit a predictive model for Impurity using Fit Model with all main effects and two-way interaction terms.

The data have been partitioned into training and validation sets: 60% of the observations were randomly assigned to the training set, and the remaining 40% to the validation set.
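Outside of JMP, a 60/40 random partition like this one can be sketched in a few lines of Python. This is only an illustration: the function name, the 0/1 coding of the column, and the seed are assumptions, not how the Validation column in this data set was actually created.

```python
import numpy as np

def assign_validation(n_rows, train_fraction=0.6, seed=0):
    """Return an array of labels: 0 = training, 1 = validation.

    The 0/1 coding and the seed are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    n_train = int(round(train_fraction * n_rows))
    labels = np.ones(n_rows, dtype=int)           # start with every row in validation
    train_idx = rng.choice(n_rows, size=n_train, replace=False)
    labels[train_idx] = 0                         # move a random 60% to training
    return labels

validation = assign_validation(100)
print("training rows:", int((validation == 0).sum()))
print("validation rows:", int((validation == 1).sum()))
```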

We will fit the model to the training data and evaluate model performance on the validation data.

We start by adding Impurity as the Y variable.

We select all the predictor variables. Then under Macros, we select Factorial to Degree, with Degree set to 2. This adds all main effects and all two-way interactions to the model.
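To make concrete which terms end up in the model, here is a minimal Python sketch that builds every main effect and every two-way interaction from a set of predictor columns. The predictor names A, B, and C are hypothetical, and the sketch uses plain uncentered products, so the columns (and the coefficients fit to them) may not match what JMP builds internally.

```python
import numpy as np
from itertools import combinations

def factorial_degree_two(X, names):
    """Build columns for all main effects and all two-way interactions.

    Interactions are plain products of the raw columns (an assumption;
    no centering is applied here).
    """
    columns = [X[:, j] for j in range(X.shape[1])]    # main effects
    labels = list(names)
    for i, j in combinations(range(X.shape[1]), 2):   # every pair of predictors
        columns.append(X[:, i] * X[:, j])
        labels.append(f"{names[i]}*{names[j]}")
    return np.column_stack(columns), labels

# Hypothetical predictors A, B, C with made-up values.
X = np.arange(12, dtype=float).reshape(4, 3)
design, terms = factorial_degree_two(X, ["A", "B", "C"])
print(terms)
```

With p predictors this gives p main effects plus p(p-1)/2 interactions, so the number of terms grows quickly as predictors are added.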

Finally, we select the Validation column as the Validation variable.

We run the model with main effects and two-way interactions.

The parameters are estimated using only the observations in the training set. The observations in the validation set are marked with a "v" in the Actual by Predicted plot and the other plots, to distinguish them from the observations in the training set.

Measures of the performance of the model are reported under Crossvalidation.

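RSquare for the training set is much higher than it is for the validation set. And RASE, or root average squared error, is much lower on the training set than it is on the validation set. A model that fits the data it was trained on far better than it fits held-out data is picking up noise in the training set, so this is a good indication that our model is overfit.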
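For readers who want to see how these two measures are computed, the following Python sketch fits a least squares model on training rows only and then reports RSquare and RASE separately for each set. The data are synthetic, all function names are hypothetical, and the formulas are the conventional definitions; JMP's exact computation for the validation set may differ in detail.

```python
import numpy as np

def fit_ols(X, y):
    """Least squares coefficients, with an intercept column prepended."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(X, coef):
    return np.column_stack([np.ones(len(X)), X]) @ coef

def rsquare(y, y_hat):
    """1 - SSE/SST, with SST taken about the mean of the supplied y."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

def rase(y, y_hat):
    """Root average squared error: sqrt(mean((y - y_hat)^2))."""
    return np.sqrt(np.mean((y - y_hat) ** 2))

# Synthetic data (an assumption): only the first of ten predictors matters.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 10))
y = 2.0 * X[:, 0] + rng.normal(size=50)
is_train = np.arange(50) < 30                     # first 30 rows train, last 20 validate

coef = fit_ols(X[is_train], y[is_train])          # estimated on training rows only
for label, mask in (("training", is_train), ("validation", ~is_train)):
    y_hat = predict(X[mask], coef)
    print(label, "RSquare:", rsquare(y[mask], y_hat), "RASE:", rase(y[mask], y_hat))
```

Note that RASE divides by the number of observations rather than by the error degrees of freedom, so it can be compared directly between the training and validation sets.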

We'll reduce this model step by step using the Effect Summary table, removing one term at a time.
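The idea of removing one term at a time can be sketched as a simple backward elimination loop in Python. This is a rough stand-in, not what the Effect Summary table does: it drops the term with the smallest absolute t-ratio at each step (for single-degree-of-freedom terms this is the term with the largest p-value), it ignores effect heredity, and the number of terms to remove is chosen by hand. The data and names are synthetic and hypothetical.

```python
import numpy as np

def t_ratios(X, y):
    """Absolute t-ratios for each non-intercept column of an OLS fit."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    dof = len(y) - A.shape[1]                     # error degrees of freedom
    sigma2 = resid @ resid / dof                  # estimated error variance
    cov = sigma2 * np.linalg.inv(A.T @ A)         # coefficient covariance matrix
    return np.abs(coef[1:] / np.sqrt(np.diag(cov)[1:]))

def backward_steps(X, y, names, n_remove):
    """Refit and drop the weakest term, n_remove times; return surviving names."""
    keep = list(range(X.shape[1]))
    for _ in range(n_remove):
        t = t_ratios(X[:, keep], y)
        keep.pop(int(np.argmin(t)))               # remove the least significant term
    return [names[j] for j in keep]

# Synthetic data (an assumption): only x0 truly affects the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 4))
y = 5.0 * X[:, 0] + rng.normal(size=40)
survivors = backward_steps(X, y, ["x0", "x1", "x2", "x3"], n_remove=3)
print(survivors)
```

In practice, the model is refit after each removal and the validation RSquare and RASE are checked again, which is how we can tell when the smaller model starts to perform better.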

Notice that the smaller, less complicated model performs better on the validation set than the full model!

This means that the smaller model generalizes better to new data than the full model.