Fitting a Logistic Model with Validation

In this video, we use the Impurity example and fit a model for the categorical response, Outcome, with three continuous main effects: Temp, Catalyst Conc, and Reaction Time.

The data have been partitioned into training and validation sets. Of the observations, 60% have been randomly assigned to the training set, and the remaining 40% are in the validation set.
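For readers who want to see the idea outside of JMP, here is a minimal Python sketch of a random 60/40 partition. It is not how the Validation column in the Impurity data was actually created. The function name `partition`, the seed, and the row count of 100 are assumptions made for illustration (100 is implied by 40 validation rows making up 40% of the data).

```python
import random

def partition(n_rows, train_fraction=0.6, seed=1):
    """Randomly label each row as Training or Validation (hypothetical helper)."""
    n_train = round(n_rows * train_fraction)
    labels = ["Training"] * n_train + ["Validation"] * (n_rows - n_train)
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    rng.shuffle(labels)         # random assignment of rows to the two sets
    return labels

# Assumed row count: 100
labels = partition(100)
print(labels.count("Training"), labels.count("Validation"))
```

Fixing the counts first and then shuffling gives an exact 60/40 split, rather than a split that is only 60/40 on average.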

We will fit the logistic model to the training data and evaluate model performance on the validation data.

Let's begin by selecting Fit Model from the Analyze menu.

We select Outcome as the Y variable.

Then we select Temp, Catalyst Conc, and Reaction Time as the model effects. The default personality is Nominal Logistic, and the Target Category is Fail.
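In symbols, a logistic model with these three main effects is usually written in the standard log-odds form shown here. The coefficient names β0 through β3 and the symbol p are notation introduced for this sketch, not labels from the JMP report, and p is taken to be the probability of the target category, Fail.

```latex
\log\!\left(\frac{p}{1-p}\right)
  = \beta_0
  + \beta_1\,\text{Temp}
  + \beta_2\,\text{Catalyst Conc}
  + \beta_3\,\text{Reaction Time},
\qquad p = P(\text{Outcome} = \text{Fail})
```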

Finally, we select Validation as the Validation variable.

The model is fit using only the training data.
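To make the train-then-validate idea concrete, here is a rough Python sketch. It is not JMP's fitting algorithm and it does not use the Impurity data: the data are simulated, the three predictors are stand-ins for Temp, Catalyst Conc, and Reaction Time, and all function names, the learning rate, and the simulation coefficients are assumptions. The point is only that the coefficients are estimated from the training rows, and the validation rows are used afterward to score the model.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Estimate intercept and slopes by batch gradient descent (illustrative only)."""
    w = [0.0] * (len(X[0]) + 1)              # w[0] is the intercept
    for _ in range(n_iter):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            p = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))
            err = p - yi                      # gradient of the negative log-likelihood
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad)]
    return w

def misclassification_rate(w, X, y):
    """Share of rows whose predicted class (cutoff 0.5) differs from the actual class."""
    wrong = 0
    for xi, yi in zip(X, y):
        p = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))
        wrong += (1 if p >= 0.5 else 0) != yi
    return wrong / len(y)

# Simulated data (NOT the Impurity data): 1 = Fail, 0 = not Fail
rng = random.Random(0)
X, y = [], []
for _ in range(100):
    xi = [rng.gauss(0, 1) for _ in range(3)]
    p_fail = sigmoid(-1.0 + 2.0 * xi[0] + 1.0 * xi[1] - 1.0 * xi[2])
    X.append(xi)
    y.append(1 if rng.random() < p_fail else 0)

# Random 60/40 split into training and validation rows
is_train = [True] * 60 + [False] * 40
rng.shuffle(is_train)
X_train = [x for x, t in zip(X, is_train) if t]
y_train = [v for v, t in zip(y, is_train) if t]
X_valid = [x for x, t in zip(X, is_train) if not t]
y_valid = [v for v, t in zip(y, is_train) if not t]

w = fit_logistic(X_train, y_train)            # only the training rows are used to fit
train_rate = misclassification_rate(w, X_train, y_train)
valid_rate = misclassification_rate(w, X_valid, y_valid)
print(f"training: {train_rate:.3f}  validation: {valid_rate:.3f}")
```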

To see the overall error rate, we open the Fit Details outline.

The misclassification rate for the training data is 3.3%, but the misclassification rate for the holdout validation data is 20%. The model performs much worse on data that were not used to fit it.
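It can help to translate these rates back into counts. The short Python sketch below does that arithmetic. The rates and the 40 validation rows come from the report. The total of 100 rows, and therefore 60 training rows, is an inference from 40 rows being 40% of the data.

```python
# Values reported in the video
train_rate = 0.033      # 3.3% misclassified in the training data
valid_rate = 0.20       # 20% misclassified in the validation data
n_valid = 40            # observations in the validation data

# Assumption: 40 validation rows are exactly 40% of the data, so 100 rows in total
n_total = round(n_valid / 0.40)
n_train = n_total - n_valid

# Convert rates to whole numbers of misclassified rows
train_errors = round(train_rate * n_train)
valid_errors = round(valid_rate * n_valid)

print(n_train, train_errors, valid_errors)   # prints: 60 2 8
```

Under that assumption, the model misclassifies about 2 of 60 training rows but 8 of 40 validation rows.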

Let's take a closer look at model classifications.

We select Confusion Matrix from the top red triangle menu.

There are 40 observations in the validation data. Ten of these are fails.

The model correctly classified 7 out of 10, or 70%, of the fails in the validation set. This means that it incorrectly classified, or missed, 30% of the fails.
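The remaining cells of the validation confusion matrix can be worked out from the numbers already given. The Python sketch below does this under two assumptions that are not stated in the video: that Outcome has only two levels (the second level is called "pass" here for illustration), and that the 20% validation misclassification rate corresponds to exactly 8 of the 40 rows.

```python
# Reported for the validation data
n_valid = 40
actual_fail = 10
fails_caught = 7                 # fails correctly classified as Fail

# Derived: 20% of 40 validation rows are misclassified
total_errors = 8

fails_missed = actual_fail - fails_caught          # fails classified as pass
actual_pass = n_valid - actual_fail                # assumes a two-level response
passes_flagged = total_errors - fails_missed       # passes classified as Fail
passes_correct = actual_pass - passes_flagged

sensitivity = fails_caught / actual_fail           # share of fails that were caught
miss_rate = fails_missed / actual_fail             # share of fails that were missed

print(f"actual fail: predicted fail={fails_caught}, predicted pass={fails_missed}")
print(f"actual pass: predicted fail={passes_flagged}, predicted pass={passes_correct}")
print(f"sensitivity={sensitivity:.0%}, miss rate={miss_rate:.0%}")
```

Under these assumptions, the 8 validation errors split into 3 missed fails and 5 passes that were flagged as fails.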

Can we improve the misclassification rate by fitting a more complicated model?