Hi @sreekumarp,
Interesting question, and I'm afraid I can't give a definitive answer, as it depends on the dataset, the types of models considered, and the practices/habits of the analyst (or person doing the analysis).
First, it's important to understand the use and purpose of each set (a short sketch illustrating these roles follows the list below):
- Training set: Used for the actual training of the model(s).
- Validation set: Used for model optimization (hyperparameter fine-tuning, feature/threshold selection, etc.) and model selection.
- Test set: Used to assess the generalization and predictive performance of the selected model on new/unseen data.
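To make these roles concrete, here is a minimal sketch of the workflow (in Python with scikit-learn rather than JMP, with an invented dataset and model purely for illustration): the training set fits the candidate models, the validation set picks the hyperparameter, and the test set is touched only once at the end.

```python
# Minimal sketch (Python/scikit-learn, outside JMP): roles of the three sets.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic data, purely for illustration.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# Split off a test set first, then carve a validation set out of the remainder.
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=0)

# Training set: fit candidate models. Validation set: select the hyperparameter.
best_alpha, best_score = None, -float("inf")
for alpha in (0.01, 0.1, 1.0, 10.0):
    score = Ridge(alpha=alpha).fit(X_train, y_train).score(X_val, y_val)  # R^2 on validation
    if score > best_score:
        best_alpha, best_score = alpha, score

# Test set: assess the selected model once, on data never used for training or selection.
final_model = Ridge(alpha=best_alpha).fit(X_train, y_train)
print("validation R^2:", round(best_score, 3), "| test R^2:", round(final_model.score(X_test, y_test), 3))
```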
There are several methods for splitting your data, depending on your objectives and the size of your dataset (see the cross-validation sketch after this list):
- Train/Validation/Test sets: Fixed sets to train, optimize, and assess model performance. Recommended for larger datasets.
- K-fold cross-validation: Split the dataset into K folds. The model is trained K times, and each fold is used K-1 times for training and once for validation. This makes it possible to assess model robustness, as performance should be similar across all folds.
- Leave-One-Out cross-validation: The extreme case of K-fold cross-validation, where K = N (the number of observations). It is used when you have a small dataset and want to assess whether your model is robust.
- Autovalidation/Self-Validating Ensemble Model (SVEM): Instead of separating observations into different sets, each observation is given a weight for training and a weight for validation (a larger training weight implies a smaller validation weight, meaning that observation is used mainly for training and less for validation), and the procedure is repeated while varying the weights. It is used for very small datasets, and/or datasets where observations can't be split independently between sets. For example, in Design of Experiments the set of runs to perform can be based on a model; if so, you can't independently split runs between training and validation without biasing the model negatively, because the runs needed to estimate some parameters won't be available, dramatically reducing the model's performance.
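To illustrate the cross-validation variants above outside of JMP, here is a small sketch using Python/scikit-learn with a synthetic dataset (the model and data are assumptions for illustration only; the SVEM weighting scheme is not shown here):

```python
# Minimal sketch of K-fold and Leave-One-Out cross-validation (Python/scikit-learn, outside JMP).
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

X, y = make_regression(n_samples=60, n_features=5, noise=5.0, random_state=1)

# K-fold: each observation is used K-1 times for training and once for validation.
kfold_scores = cross_val_score(Ridge(alpha=1.0), X, y,
                               cv=KFold(n_splits=5, shuffle=True, random_state=1))
print("5-fold R^2 per fold:", [round(s, 3) for s in kfold_scores])  # similar values suggest robustness

# Leave-One-Out: the extreme case K = N, often used for very small datasets.
loo_scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=LeaveOneOut(),
                             scoring="neg_mean_squared_error")  # R^2 is undefined on a single point
print("LOO mean squared error:", round(-loo_scores.mean(), 3))
```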
All these approaches are supported by JMP: Launch the Make Validation Column Platform (jmp.com)
As a rule of thumb, a 70/20/10 ratio is often used; you can read the paper "Optimal Ratio for Data Splitting" here for more details. Generally, the higher the number of parameters in the model, the bigger your training set needs to be, since you'll need more data to estimate each parameter precisely, so the complexity/type of model is also something to consider when creating training/validation/test sets.
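For instance, a 70/20/10 split can be obtained with two chained random splits; below is a sketch of the arithmetic (again Python/scikit-learn with made-up data, outside JMP, where the Make Validation Column platform handles this directly):

```python
# Sketch of a 70/20/10 train/validation/test split via two chained splits (Python/scikit-learn).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=8, random_state=2)

# First split: 70% training, 30% held aside.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.3, random_state=2)
# Second split: the remaining 30% becomes 20% validation and 10% test (1/3 of the remainder).
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=1/3, random_state=2)

print(len(X_train), len(X_val), len(X_test))  # roughly 700 / 200 / 100
```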
If you have a more specific use case, sharing it could help me give you more targeted (and less general) guidance.
I also highly recommend the playlist "Making Friends with Machine Learning" by Cassie Kozyrkov to learn more about model training, validation, and testing: Making Friends with Machine Learning - YouTube
Hope this first answer will help you,
Victor GUILLER
L'OrƩal Data & Analytics
"It is not unusual for a well-designed experiment to analyze itself" (Box, Hunter and Hunter)