Adele
Level III

Neural nets_validation sets get higher model fits than training sets

Problem: Why did my validation sets fit better than the training sets?

Data: 200 data points; 90% for training and 10% for validation

Analysis:

    1. Make a validation column:

           Make Validation Column(
               Training Set( 0.90 ),
               Validation Set( 0.10 ),
               Test Set( 0.00 ),
               Formula Random
           );

    2. Fit a neural net:

           One hidden layer with 3 TanH nodes
           Boosting: Number of Models (10), Learning Rate (0.1)
           Fitting options: Number of Tours (10)

    3. Repeat steps 1 and 2 a total of 1,000 times.

    4. Calculate the mean R square for the training sets and for the validation sets (a combined JSL sketch follows this list).
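For reference, a rough JSL sketch of one iteration of this workflow is below. The response and predictor names are placeholders, and the boosting option names inside Fit() are assumptions based on the dialog labels, so check a saved Neural script from your own table for the exact arguments.

// One iteration of the workflow above (placeholder column names)
dt = Current Data Table();

// Step 1: 90/10 random validation column, as in the call shown above
dt << Make Validation Column(
	Training Set( 0.90 ),
	Validation Set( 0.10 ),
	Test Set( 0.00 ),
	Formula Random
);

// Step 2: neural net with one hidden layer of 3 TanH nodes and 10 tours.
// The boosting settings (10 models, learning rate 0.1) belong in the same
// Fit() clause; their exact JSL names may differ by JMP version.
nn = dt << Neural(
	Y( :y ),
	X( :x1, :x2, :x3 ),
	Validation( :Validation ),
	Fit( NTanH( 3 ), Number of Tours( 10 ) )
);

// Steps 3 and 4: wrap the code above in a For() loop, record the training
// and validation R squares from each fit, and average them over the 1,000 runs.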

Description of the problems:

    The mean R square for the validation sets was much higher than that for the training sets. Why would that happen? We even tried it without a penalty method, and the validation sets still fit better than the training sets.

    How does JMP optimize the parameter estimates with respect to the validation sets?

    Why is this not the case when we use bootstrap forests? That is, with bootstrap forests the training sets fit better than the validation sets.

 

Thanks.

18 Replies
Adele
Level III

Re: Neural nets_validation sets get higher model fits than training sets

Thank you for your suggestions. We tried k-fold, and it works better in our case. However, a new question comes up: how can we do k-fold cross-validation for bootstrap forests?

Thanks.

Re: Neural nets_validation sets get higher model fits than training sets

You cannot use K-fold cross-validation with Bootstrap Forest model selection in JMP. You can only use hold-out sets.

 

Remember that most predictive model types are intended for large data sets. You might do as well with multiple regression in your case if the data meet the model assumptions.
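If it helps, a minimal JSL sketch of launching Bootstrap Forest with a hold-out validation column might look like the following. The column names are placeholders and the option names are assumptions based on the launch dialog; the platform's saved script shows the exact arguments for your JMP version.

dt = Current Data Table();

// Bootstrap Forest with a hold-out validation column (no K-fold option here)
bf = dt << Bootstrap Forest(
	Y( :y ),
	X( :x1, :x2, :x3 ),
	Validation( :Validation ),
	Number of Trees( 100 )
);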

Adele
Level III

Re: Neural nets_validation sets get higher model fits than training sets

Thank you. Yes, I found only the script below, but I could not combine it with the bootstrap forest model.

 

// Launch the Partition platform on the Car Poll sample data
obj = Open( "$SAMPLE_DATA/Car Poll.jmp" ) << Partition(
	Y( :country ),
	X( :sex, :marital status, :age, :type, :size )
);

// Request 5-fold cross-validation, then let the platform split automatically
obj << K Fold Crossvalidation( 5 );

obj << Go;

 

 

Thank you so much for your help.

Adele
Level III

Re: Neural nets_validation sets get higher model fits than training sets

Dear Mr. Bailey,

    Thanks. I have one more question:  can we use K-fold cross-validation with standard least squares regression in JMP?

Re: Neural nets_validation sets get higher model fits than training sets

No, the Fit Least Squares platform in JMP does not provide any cross-validation. The Generalized Regression platform in JMP Pro, which includes the standard linear regression model, provides both hold-out cross-validation in the launch dialog box and K-fold cross-validation as a platform validation option.

 

You might consider using AICc for model selection in place of cross-validation because of the small sample size.

 

You might also consider using open-source software if it provides the flexibility you need. You can connect JMP to R or Python, for example, and work with both programs together.
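For example, here is a minimal sketch of calling R from JSL through JMP's R integration. It assumes R is installed and configured for JMP, and the model is just a placeholder fit on R's built-in mtcars data.

// Fit a model in R from JSL and bring one number back into JMP
R Init();                                             // open the R connection
R Submit( "fit <- lm( mpg ~ wt, data = mtcars )" );   // fit a model on R's mtcars data
R Submit( "rsq <- summary( fit )$r.squared" );        // compute the R square on the R side
rsq = R Get( rsq );                                   // retrieve the value into JSL
R Term();                                             // close the connection
Show( rsq );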

Re: Neural nets_validation sets get higher model fits than training sets

JMP Pro does offer k-fold for regression analysis under the Stepwise Platform. It is used to determine which terms to keep in the model. From the JMP Documentation:

 

K-Fold Crossvalidation (Available only for continuous responses.) Performs K-Fold cross validation in the selection process. When selected, this option enables the Max K-Fold RSquare stopping rule (“Stepwise Regression Control Panel” on page 250). For more information about validation, see “Using Validation” on page 275.
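A minimal JSL sketch of getting to that point is below. The column names are placeholders; the K-fold stopping rule itself is then chosen in the Stepwise control panel, and its scripting message may differ by JMP version.

dt = Current Data Table();

// Launch Fit Model with the Stepwise personality for a continuous response
obj = dt << Fit Model(
	Y( :y ),
	Effects( :x1, :x2, :x3 ),
	Personality( Stepwise ),
	Run
);

// In the Stepwise control panel, select K-Fold Crossvalidation, which enables
// the Max K-Fold RSquare stopping rule described in the documentation above.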

Dan Obermiller

Re: Neural nets_validation sets get higher model fits than training sets

I forgot to mention that K-fold cross-validation is intended specifically for cases with small sample sizes. Did you try this method instead of a hold-out validation set?

Adele
Level III

Re: Neural nets_validation sets get higher model fits than training sets

Dear Mr. Bailey,
    As recommended, I used k-fold cross-validation with neural nets and with stepwise regression (enter all). I wrote a loop and ran each analysis 100 times. The results show that the mean R squares for the validation sets are no longer higher than those for the training sets.
    Now I am wondering why k-fold cross-validation works better with small sample sizes, and why hold-out validation with neural nets gives this odd result, i.e., validation sets getting higher mean R squares. What is the theoretical or mathematical reason?
    Thank you, and sorry for bothering you again.

Re: Neural nets_validation sets get higher model fits than training sets

You are not bothering anyone here! Questions like yours are one of the main reasons for having the Discussions area in the JMP Community! On the other hand, some discussions might be beyond the scope or capability of this format.

 

NOTE: we are discussing predictive models, not explanatory models. As such, we only care about the ability of the model to predict the future. The future is unavailable while training the model and selecting a model.

 

Cross-validation of any kind is not the result of a theory. It is a solution to the problem of how to perform an honest assessment of the performance of models before and after selecting one model. The hold out set approach was developed first for cases of large data sets. Inference is not helpful to model selection in such cases. The future is not available. What to do? Set aside, or hold out, a portion of the data (validation). These data are not available to the training procedure. Then apply the fitted model to the hold out set to obtain model predictions. How did the model do? How do other models do?

 

Since the validation data was used to select the model, some modelers hold out a third portion (test) to evaluate the selected model.

 

Hold-out cross-validation is a very good solution, but it relies on the large size of the original set of observations. What to do when the sample size is small? The K-fold approach was developed specifically for such cases. Some modelers take it to the extreme and set K equal to the sample size; this version is known as 'leave one out.' K-fold works better than hold-out sets in such cases because it uses the data more efficiently: with your 200 observations and a 10% hold-out, for example, the validation R square is computed from only 20 points, whereas 5-fold cross-validation eventually uses every observation for both training and validation.