Hi @c_blanken,
Just to add a few comments to the great answer from @SDF1.
- Why do some rows have missing values in the validation column? Were they added after the column was created?
- I'm not surprised that the "Bootstrap Forest" platform is able to deal with missing values in the validation column, since this model fits many decision trees on bootstrap samples of the training data. Neural Networks and other methods use the data "as is", which is why you may encounter the warning message you get (see the short sketch just below this list for the general bagging idea).
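If it helps to see the bagging idea outside JMP, here is a minimal Python/scikit-learn sketch of fitting many trees on bootstrap samples and averaging their predictions. This is only a conceptual analogue, not JMP's Bootstrap Forest implementation; the synthetic data and tree settings are assumptions for illustration.

```python
# Conceptual sketch of bagging (not JMP's implementation):
# fit many trees, each on a bootstrap sample of the rows, then average.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=8, noise=5.0, random_state=0)
rng = np.random.default_rng(0)

trees = []
for _ in range(50):                        # fit 50 trees...
    idx = rng.integers(0, len(X), len(X))  # ...each on rows drawn with replacement
    trees.append(DecisionTreeRegressor(max_depth=6).fit(X[idx], y[idx]))

# The ensemble prediction is the average over all bootstrapped trees
y_hat = np.mean([t.predict(X) for t in trees], axis=0)
```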
Instead of creating several validation columns to assess how training set variability influences model performance, there are (at least) 2 other interesting options for you:
1. Create a "formula" Validation Column (jmp.com) instead of a "fixed" validation column:
This approach has two benefits:
- Even in the case of stratified/grouped validation, it will automatically add a training, validation, or test label in the validation column for any new rows added, depending on the stratification/grouping method and the ratio between the sets that you have used.
- When modeling, you can also more easily try different resamplings of the training and validation sets: right-click on the performance metric you would like to evaluate with various training and validation sets, click "Simulate", and there you will have the option to switch your validation column in and out:
By doing this you'll get a data table with all the simulations run on various training and validation sets, and you can plot the performance variability obtained across these simulations (here for R² on the toy dataset "Boston Housing" with a Random Forest):
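For readers who want the idea outside JMP, here is a rough Python/scikit-learn analogue of what those simulations amount to: repeatedly redraw the training/validation split, refit the model, and collect the validation R² so its variability can be summarized or plotted. The synthetic data, the plain random forest, and the 75/25 split are my assumptions, not the platform's internals.

```python
# Conceptual analogue of "Simulate" over the validation column (not JMP code):
# redraw the train/validation split many times and keep each validation R².
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=1)

r2_values = []
for seed in range(30):                              # 30 simulated resamplings
    X_tr, X_va, y_tr, y_va = train_test_split(
        X, y, test_size=0.25, random_state=seed)    # new 75/25 split each time
    model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
    r2_values.append(r2_score(y_va, model.predict(X_va)))

print(f"Validation R²: mean={np.mean(r2_values):.3f}, sd={np.std(r2_values):.3f}")
```

The spread of those R² values is exactly the "performance variability" you would plot from the simulation data table in JMP.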
2. Use the Model Screening (jmp.com) platform with only the model you want selected and the validation method you want (K-fold, nested K-fold, with the option to repeat them, or a validation column), and launch it. You'll get access to the results summary, but also to the individual fold results:
When selecting one fold to run the model (here I tried fold 3, for example), a new column will also appear in your data table, so that you know which rows in this fold were used as training or validation (0: Training, 1: Validation):
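To make that per-fold indicator concrete, here is a small Python/scikit-learn sketch of K-fold cross-validation where one chosen fold is turned into a 0/1 column, similar in spirit to the column JMP adds. The data, the 5 folds, and the choice of fold 3 are illustrative assumptions only.

```python
# Conceptual analogue of the per-fold indicator column (not JMP code):
# for one chosen fold, mark each row as 0 (training) or 1 (validation).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold

X, y = make_regression(n_samples=100, n_features=5, random_state=2)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

chosen_fold = 3                                    # e.g. "fold 3"
for fold, (train_idx, valid_idx) in enumerate(kf.split(X), start=1):
    if fold == chosen_fold:
        indicator = np.zeros(len(X), dtype=int)    # 0: Training
        indicator[valid_idx] = 1                   # 1: Validation
        print(indicator)
```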
I think these methods may be useful for you, with less manual work and greater flexibility.
Victor GUILLER
L'Oréal Data & Analytics
"It is not unusual for a well-designed experiment to analyze itself" (Box, Hunter and Hunter)