Problem: Why did my validation sets fit better than the training sets?
Data: 200 data points; 90% (180 points) for training and 10% (20 points) for validation
Statistical procedure:
1. Make a validation column:
Make Validation Column(
    Training Set( 0.90 ),
    Validation Set( 0.10 ),
    Test Set( 0.00 ),
    Formula Random
);
2. Fit neural nets:
   One hidden layer, 3 TanH nodes
   Boosting: Number of Models( 10 ), Learning Rate( 0.1 )
   Fitting options: Number of Tours( 10 )
3. Repeat steps 1 and 2 1,000 times.
4. Calculate the mean of the R squares across the 1,000 training sets and across the 1,000 validation sets (a runnable sketch of this repeat-and-average procedure follows below).
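For anyone who wants to reproduce the repeat-and-average logic outside JMP, here is a minimal Python/scikit-learn sketch, assuming a made-up regression data set. MLPRegressor (one hidden layer, 3 tanh nodes) stands in for JMP's neural platform; it omits boosting, tours, and penalty methods, so its numbers will not match JMP's, but the split-repeat-average structure is the same as steps 1-4 above.

# Hypothetical Python/scikit-learn re-creation of steps 1-4 (not JMP/JSL).
# MLPRegressor with one 3-node tanh hidden layer stands in for JMP's
# neural platform; boosting, tours, and penalty methods are omitted.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))                 # 200 points, as in the question
y = np.tanh(X @ np.array([1.0, -0.5, 0.3])) + rng.normal(scale=0.2, size=200)

r2_train, r2_valid = [], []
for i in range(1000):                         # step 3: repeat 1,000 times
    # step 1: random 90/10 training/validation split
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.10,
                                              random_state=i)
    # step 2: small neural net, one hidden layer with 3 tanh nodes
    nn = MLPRegressor(hidden_layer_sizes=(3,), activation="tanh",
                      max_iter=2000, random_state=i).fit(X_tr, y_tr)
    r2_train.append(r2_score(y_tr, nn.predict(X_tr)))
    r2_valid.append(r2_score(y_va, nn.predict(X_va)))

# step 4: mean R square over the training sets and the validation sets
print("mean training R2:  ", np.mean(r2_train))
print("mean validation R2:", np.mean(r2_valid))

Note that with 200 points, a 10% holdout leaves only 20 validation points per split, so each split's validation R square is very noisy; averaging over 1,000 repeats is what makes the comparison of the two means meaningful.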
Description of the problem:
The mean R square for the validation sets was much larger than that for the training sets. Why would that happen? Even when we fit without a penalty method, the validation sets still fit better than the training sets.
How does JMP optimize the parameter estimates with respect to the validation sets?
Why is this not the case with bootstrap forests? That is, with bootstrap forests, the training sets fit better than the validation sets.
Thanks.