When using the validation column method for cross-validation, we split the data set into training, validation, and test sets. The split ratio is specified by the user. Is there any guideline/reference for deciding on the split ratio (such as 60:20:20, 70:15:15, 50:25:25, or 80:10:10)? Is it also chosen based on the total number of observations, N?
To your original question: no, there are no specific rules about how much data to leave out. In the JMP Education analytics courses, we advise you to hold out as much data as you are comfortable with, with at least 20% held out. If you feel the training set is too small to hold back that many rows, consider k-fold cross-validation. How many rows are you willing to sacrifice to validation? Use k = n / that many rows. If k < 5 using that formula, consider leave-one-out cross-validation.
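To make that heuristic concrete, here is a minimal sketch in Python with scikit-learn (general-purpose, not JMP-specific); the function name and the example numbers are illustrative assumptions, not part of the course advice:

```python
# Illustrative sketch: choose a cross-validation scheme from how many rows
# you are willing to sacrifice to validation, per the heuristic above.
from sklearn.model_selection import KFold, LeaveOneOut

def cv_from_budget(n_rows, rows_per_fold):
    """Use k = n / rows held out per fold; fall back to leave-one-out when k < 5."""
    k = n_rows // rows_per_fold
    if k < 5:
        return LeaveOneOut()
    return KFold(n_splits=k, shuffle=True, random_state=1)

print(cv_from_budget(n_rows=200, rows_per_fold=20))  # KFold with k = 10
print(cv_from_budget(n_rows=12, rows_per_fold=4))    # k = 3 < 5, so LeaveOneOut
```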
Hi @sreekumarp,
Interesting question, and I'm afraid I don't have a definitive answer, as it depends on the dataset, the types of models considered, and the practices/habits of the analyst (or whoever is doing the analysis).
First, it's important to understand the use of and need for each set:
There are several choices/methods for splitting your data, depending on your objectives and the size of your dataset:
All of these approaches are supported by JMP: Launch the Make Validation Column Platform (jmp.com)
As a rule of thumb, a 70/20/10 ratio is often used. You can read the paper "Optimal Ratio for Data Splitting" here for more details. Generally, the higher the number of parameters in the model, the larger your training set needs to be, as you'll need more data to estimate each parameter precisely, so the complexity/type of model is also something to consider when creating the training/validation/test sets.
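As an illustration of such a split outside JMP, here is a minimal Python/scikit-learn sketch of a 70/20/10 partition; the data are random placeholders and the two-step split ratios are simply my way of hitting those proportions:

```python
# Illustrative sketch: a 70/20/10 train/validation/test split with scikit-learn.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X, y = rng.normal(size=(1000, 5)), rng.normal(size=1000)

# First carve off the 10% test set, then split the remainder 70/20 (i.e. 7/9 vs 2/9).
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.10, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=2 / 9, random_state=1)

print(len(X_train), len(X_val), len(X_test))  # 700 200 100
```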
If you have a more specific use case, sharing it might let us give you guidance that is more helpful and less general.
I also highly recommend the playlist "Making Friends with Machine Learning" from Cassie Kozyrkov to learn more about model training, validation, and testing: Making Friends with Machine Learning - YouTube
Hope this first answer helps,
Thank you for providing detailed input on splitting data sets in machine learning. I am sure this will help my research.
Sreekumar Punnappilly
You're welcome @sreekumarp.
If you consider one or several of these answers to be solutions, don't hesitate to mark them as such, to help visitors of the JMP Community find the answers they are looking for more easily.
If you have more questions, or a concrete case on which you would like some advice, don't hesitate to reply to this topic or create a new one.
Hi @sreekumarp ,
In addition to what @Victor_G wrote, I would also highly recommend splitting off your test data set and making a new data table with it. That way, when you train and validate your models, you can compare them on the test hold-out data table to see which model performs best. This reduces any chance that the test data set could accidentally be used in the training or validation sets.
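As a rough illustration of this outside JMP, here is a Python/pandas sketch of physically separating the hold-out data into its own file; the file names are hypothetical:

```python
# Illustrative sketch: split off a hold-out set into its own table/file so it
# cannot accidentally leak into training or validation.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("my_data.csv")                            # hypothetical input table
work, holdout = train_test_split(df, test_size=0.20, random_state=1)

work.to_csv("my_data_train_validate.csv", index=False)     # used for training/validation
holdout.to_csv("my_data_holdout.csv", index=False)         # only touched at the very end
```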
There are also other approaches, such as using simulated data to train your models and then testing them on the real data. I sometimes use this approach when the original data set is small and I need to keep the correlation structure of the inputs.
Good luck!
DS
Hi @SDF1,
One minor correction to the great addition you provided: the validation set is the one used for model comparison and selection, not the test set (though the names "test" and "validation" are sometimes used interchangeably or confusingly).
These two sets have very different purposes:
An explanation is given here: machine learning - Can I use the test dataset to select a model? - Data Science Stack Exchange
Two reasons for this difference in use:
The test set is often the last step between model creation/development and its deployment in production or publication, so the final assessment needs to be as fair and unbiased as possible. (A small sketch of this "select on validation, assess once on test" workflow follows the resources below.)
Some resources on the sets: Train, Test, and Validation Sets (mlu-explain.github.io)
MFML 071 - What's the difference between testing and validation? - YouTube
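As announced above, here is a minimal Python/scikit-learn sketch (not JMP-specific) of selecting among candidate models on the validation set and then scoring the chosen model once on the untouched test set; the candidate models and data are placeholders:

```python
# Illustrative sketch: compare candidates on the VALIDATION set, then report
# the chosen model's performance ONCE on the untouched TEST set.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=1)
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=1)

candidates = {"ridge": Ridge(alpha=1.0), "forest": RandomForestRegressor(random_state=1)}
val_scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    val_scores[name] = r2_score(y_val, model.predict(X_val))  # model comparison happens here

best = max(val_scores, key=val_scores.get)
print("selected on validation:", best, val_scores)
print("final one-time test R^2:", r2_score(y_test, candidates[best].predict(X_test)))
```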
I hope this avoids any confusion about the naming and use of the sets,
Hi @Victor_G ,
Sorry if it wasn't clear, but we are actually referring to the same thing. When I mentioned models before, I was talking about generating models using the same test/validation setup, but also across different platforms in JMP. I never build models with any one platform in mind; instead, I generate and tune models using the different platforms (e.g. NN, BT, Bootstrap Forest, SVM, KNN, and XGBoost) and cross-validation with the training and validation sets. Then, after I've generated the different models, I compare their ability to predict on the hold-out (test) data set, a data set that none of the models have seen during their training and validation steps. Sometimes the NN works best, and sometimes it's XGBoost or some other platform.
Lastly, @sreekumarp, one thing I forgot to mention in my previous post is that it really helps to stratify your cross-validation column using your output column (the column you're trying to predict). This keeps a similar distribution structure across your training, validation, and test data sets. Sometimes this can't be done, especially with highly imbalanced data sets, but if possible, I highly recommend it. If not, the data is randomly placed in each type of data set, which might lead to a poorly divided validation column, which can often lead to poor models.
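To illustrate stratification outside JMP, here is a small Python/scikit-learn sketch using the `stratify` option of `train_test_split`; the imbalanced response and the split ratios are placeholder assumptions:

```python
# Illustrative sketch: stratify the splits on the response column so each set
# keeps a similar class distribution (here with an imbalanced 0/1 response).
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 4))
y = rng.choice([0, 1], size=1000, p=[0.8, 0.2])        # imbalanced response

X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=1)

for name, labels in [("train", y_train), ("validation", y_val), ("test", y_test)]:
    print(name, round(labels.mean(), 3))               # fraction of class 1 in each set
```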
DS
Good point on the stratified sampling @SDF1 !
I'm sorry, but either we are not using the same naming or there is a misunderstanding.
If you bring several models to the test set to compare them (and select one), that's an issue and not the purpose of this set, no matter the platforms or models used. Comparing models should be done on the validation set (to keep a clean and "pure" test set without any information leakage or bias), hence my previous answer with some resources on this topic.
See the JMP Help, which also emphasizes the differences between the sets: "The testing set checks the model's predictive ability after a model has been chosen." https://www.jmp.com/support/help/en/17.1/index.shtml#page/jmp/overview-of-the-make-validation-column...
This is also why it's important to define the different sets carefully and to fix either the sets themselves or the splitting method, so that different models are properly evaluated on the same validation set (or with the same method).
Hi @Victor_G ,
Yeah, I think there might be some kind of misunderstanding. Perhaps I should say that there is a hold-out data set rather than a test data set. The hold-out data set is data that was never used in the training, validation, or selection of a model from a given algorithm (the test set). I use this hold-out set to compare the different algorithms against each other to see which one performs best. JMP's Model Comparison platform is very helpful for comparing the different algorithms against each other using a hold-out set. This allows for an unbiased, leakage-free comparison of the different algorithms. It therefore becomes a sort of "test" data set in the sense that it is being used to see which of the different algorithms performs best on this "pure" data.
This is the only way I am aware of to compare different algorithms against each other in JMP. What I mean is that this is the only way JMP can compare a neural net algorithm against an XGBoost algorithm, for example. I would never use, nor recommend using, the hold-out data set to compare different models from within a platform. I wouldn't compare the 20 different tuned models within the Bootstrap Forest platform against each other using the hold-out set; that would be the purpose of the test data set.
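For readers outside JMP, here is a rough Python/scikit-learn analogue of this final comparison (not JMP's Model Comparison platform itself): the already-tuned finalist from each algorithm is scored on a single untouched hold-out set. The models and data below are placeholders:

```python
# Illustrative sketch: compare the tuned finalist from each algorithm on one
# untouched hold-out set, analogous to using a hold-out table for comparison.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=800, n_features=12, random_state=1)
X_work, X_holdout, y_work, y_holdout = train_test_split(X, y, test_size=0.2, random_state=1)

# Pretend these are the already-tuned winners from each platform/algorithm.
finalists = {
    "neural net": MLPClassifier(max_iter=1000, random_state=1),
    "boosted trees": GradientBoostingClassifier(random_state=1),
    "bootstrap forest": RandomForestClassifier(random_state=1),
}
for name, model in finalists.items():
    model.fit(X_work, y_work)              # training/validation happened on this part
    acc = accuracy_score(y_holdout, model.predict(X_holdout))
    print(f"{name}: hold-out accuracy = {acc:.3f}")
```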
So, sorry if there was any misunderstanding/confusion. I'll try to refer to it as the hold-out set from now on to avoid confusion with the test set.
Thanks for the discussion!
DS
Thank you for your input on the cross-validation column.
Sreekumar Punnappilly