When using the validation column method for cross-validation, we split the data set into training, validation, and test sets. The split ratio is specified by the user. Is there any guideline/reference for deciding on the split ratio (such as 60:20:20, 70:15:15, 50:25:25, or 80:10:10)? Is it also chosen based on the total number of observations, N?
To your original question: no, there are no specific rules about how much data to leave out. In the JMP Education analytics courses, we advise you to hold out as much data as you are comfortable with, with at least 20% held out. If you feel the training set is too small to hold back that many rows, consider k-fold cross-validation. How many rows are you willing to sacrifice to validation? Use k = n / that many rows. If k < 5 using that formula, consider leave-one-out cross-validation.
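To make that heuristic concrete, here is a minimal sketch in Python with scikit-learn (general-purpose, not JMP-specific); the function name and the example numbers are illustrative assumptions, not part of the course advice:

```python
# Illustrative sketch: choose a cross-validation scheme from how many rows
# you are willing to sacrifice to validation, per the heuristic above.
from sklearn.model_selection import KFold, LeaveOneOut

def cv_from_budget(n_rows, rows_per_fold):
    """Use k = n / rows held out per fold; fall back to leave-one-out when k < 5."""
    k = n_rows // rows_per_fold
    if k < 5:
        return LeaveOneOut()
    return KFold(n_splits=k, shuffle=True, random_state=1)

print(cv_from_budget(n_rows=200, rows_per_fold=20))  # KFold with k = 10
print(cv_from_budget(n_rows=12, rows_per_fold=4))    # k = 3 < 5, so LeaveOneOut
```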
Hi @sreekumarp,
Interesting question, and I'm afraid I don't have a definitive answer, as it depends on the dataset, the types of models considered, and the practices/habits of the analyst (or whoever is doing the analysis).
First, it's important to understand the use of and need for each set:
There are several choices/methods for splitting your data, depending on your objectives and the size of your dataset:
All of these approaches are supported by JMP: Launch the Make Validation Column Platform (jmp.com)
As a rule of thumb, a 70/20/10 ratio is often used. You can read the paper "Optimal Ratio for Data Splitting" here for more details. Generally, the higher the number of parameters in the model, the larger your training set needs to be, as you'll need more data to estimate each parameter precisely, so the complexity/type of model is also something to consider when creating the training/validation/test sets.
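As an illustration of such a split outside JMP, here is a minimal Python/scikit-learn sketch of a 70/20/10 partition; the data are random placeholders and the two-step split ratios are simply my way of hitting those proportions:

```python
# Illustrative sketch: a 70/20/10 train/validation/test split with scikit-learn.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X, y = rng.normal(size=(1000, 5)), rng.normal(size=1000)

# First carve off the 10% test set, then split the remainder 70/20 (i.e. 7/9 vs 2/9).
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.10, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=2 / 9, random_state=1)

print(len(X_train), len(X_val), len(X_test))  # 700 200 100
```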
If you have a more specific use case, sharing it might let us give you guidance that is more helpful and less general.
I also highly recommend the playlist "Making Friends with Machine Learning" from Cassie Kozyrkov to learn more about model training, validation, and testing: Making Friends with Machine Learning - YouTube
Hope this first answer helps,
Thank you for providing detailed input on splitting data sets in machine learning. I am sure this will help my research.
Sreekumar Punnappilly
You're welcome @sreekumarp.
If you consider one or several of these answers to be solutions, don't hesitate to mark them as such, to help visitors of the JMP Community find the answers they are looking for more easily.
If you have more questions, or a concrete case on which you would like some advice, don't hesitate to reply to this topic or create a new one.
Hi @sreekumarp ,
In addition to what @Victor_G wrote, I would also highly recommend splitting off your test data set and making a new data table with it. That way, when you train and validate your models, you can compare them on the test hold-out data table to see which model performs best. This reduces any chance that the test data set could accidentally be used in the training or validation sets.
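As a rough illustration of this outside JMP, here is a Python/pandas sketch of physically separating the hold-out data into its own file; the file names are hypothetical:

```python
# Illustrative sketch: split off a hold-out set into its own table/file so it
# cannot accidentally leak into training or validation.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("my_data.csv")                            # hypothetical input table
work, holdout = train_test_split(df, test_size=0.20, random_state=1)

work.to_csv("my_data_train_validate.csv", index=False)     # used for training/validation
holdout.to_csv("my_data_holdout.csv", index=False)         # only touched at the very end
```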
There are also other approaches, such as using simulated data to train your models and then testing them on the real data. I sometimes use this approach when the original data set is small and I need to keep the correlation structure of the inputs.
Good luck!
DS
Hi @SDF1,
One minor correction to the great addition you provided: the validation set is the one used for model comparison and selection, not the test set (though the names "test" and "validation" are sometimes used interchangeably or confusingly).
These two sets have very different purposes:
An explanation is given here: machine learning - Can I use the test dataset to select a model? - Data Science Stack Exchange
Two reasons for this difference in use:
The test set is often the last step between model creation/development and its deployment in production or publication, so the final assessment needs to be as fair and unbiased as possible. (A small sketch of this "select on validation, assess once on test" workflow follows the resources below.)
Some resources on the sets: Train, Test, and Validation Sets (mlu-explain.github.io)
MFML 071 - What's the difference between testing and validation? - YouTube
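As announced above, here is a minimal Python/scikit-learn sketch (not JMP-specific) of selecting among candidate models on the validation set and then scoring the chosen model once on the untouched test set; the candidate models and data are placeholders:

```python
# Illustrative sketch: compare candidates on the VALIDATION set, then report
# the chosen model's performance ONCE on the untouched TEST set.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=1)
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=1)

candidates = {"ridge": Ridge(alpha=1.0), "forest": RandomForestRegressor(random_state=1)}
val_scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    val_scores[name] = r2_score(y_val, model.predict(X_val))  # model comparison happens here

best = max(val_scores, key=val_scores.get)
print("selected on validation:", best, val_scores)
print("final one-time test R^2:", r2_score(y_test, candidates[best].predict(X_test)))
```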
I hope this avoids any confusion about the naming and use of the sets,
Hi @Victor_G ,
Sorry if it wasn't clear, but we are actually referring to the same thing. When I mentioned models before, I was talking about generating models using the same test/validation setup, but also across different platforms in JMP. I never build models with any one platform in mind; instead, I generate and tune models using the different platforms (e.g. NN, BT, Bootstrap Forest, SVM, KNN, and XGBoost) and cross-validation with the training and validation sets. Then, after I've generated the different models, I compare their ability to predict on the hold-out (test) data set, a data set that none of the models have seen during their training and validation steps. Sometimes the NN works best, and sometimes it's XGBoost or some other platform.
Lastly, @sreekumarp, one thing I forgot to mention in my previous post is that it really helps to stratify your cross-validation column using your output column (the column you're trying to predict). This keeps a similar distribution structure across your training, validation, and test data sets. Sometimes this can't be done, especially with highly imbalanced data sets, but if possible, I highly recommend it. If not, the data is randomly placed in each type of data set, which might lead to a poorly divided validation column, which can often lead to poor models.
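To illustrate stratification outside JMP, here is a small Python/scikit-learn sketch using the `stratify` option of `train_test_split`; the imbalanced response and the split ratios are placeholder assumptions:

```python
# Illustrative sketch: stratify the splits on the response column so each set
# keeps a similar class distribution (here with an imbalanced 0/1 response).
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 4))
y = rng.choice([0, 1], size=1000, p=[0.8, 0.2])        # imbalanced response

X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=1)

for name, labels in [("train", y_train), ("validation", y_val), ("test", y_test)]:
    print(name, round(labels.mean(), 3))               # fraction of class 1 in each set
```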
DS
Good point on the stratified sampling @SDF1 !
I'm sorry, but either we are not using the same naming or there is a misunderstanding.
If you bring several models to the test set to compare them (and select one), that's an issue and not the purpose of this set, no matter the platforms or models used. Comparing models should be done on the validation set (to keep a clean and "pure" test set without any information leakage or bias), hence my previous answer with some resources on this topic.
See the JMP Help, which also emphasizes the differences between the sets: "The testing set checks the model's predictive ability after a model has been chosen." https://www.jmp.com/support/help/en/17.1/index.shtml#page/jmp/overview-of-the-make-validation-column...
This is also why it's important to define the different sets carefully and to fix either the sets themselves or the splitting method, so that different models are properly evaluated on the same validation set (or with the same method).
Hi @Victor_G ,
Yeah, I think there might be some kind of misunderstanding. Perhaps I should say that there is a hold-out data set rather than a test data set. The hold-out data set is data that was never used in the training, validation, or selection of a model from a given algorithm (the test set). I use this hold-out set to compare the different algorithms against each other to see which one performs best. JMP's Model Comparison platform is very helpful for comparing the different algorithms against each other using a hold-out set. This allows for an unbiased, leakage-free comparison of the different algorithms. It therefore becomes a sort of "test" data set in the sense that it is being used to see which of the different algorithms performs best on this "pure" data.
This is the only way I am aware of to compare different algorithms against each other in JMP. What I mean is that this is the only way JMP can compare a neural net algorithm against an XGBoost algorithm, for example. I would never use, nor recommend using, the hold-out data set to compare different models from within a platform. I wouldn't compare the 20 different tuned models within the Bootstrap Forest platform against each other using the hold-out set; that would be the purpose of the test data set.
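For readers outside JMP, here is a rough Python/scikit-learn analogue of this final comparison (not JMP's Model Comparison platform itself): the already-tuned finalist from each algorithm is scored on a single untouched hold-out set. The models and data below are placeholders:

```python
# Illustrative sketch: compare the tuned finalist from each algorithm on one
# untouched hold-out set, analogous to using a hold-out table for comparison.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=800, n_features=12, random_state=1)
X_work, X_holdout, y_work, y_holdout = train_test_split(X, y, test_size=0.2, random_state=1)

# Pretend these are the already-tuned winners from each platform/algorithm.
finalists = {
    "neural net": MLPClassifier(max_iter=1000, random_state=1),
    "boosted trees": GradientBoostingClassifier(random_state=1),
    "bootstrap forest": RandomForestClassifier(random_state=1),
}
for name, model in finalists.items():
    model.fit(X_work, y_work)              # training/validation happened on this part
    acc = accuracy_score(y_holdout, model.predict(X_holdout))
    print(f"{name}: hold-out accuracy = {acc:.3f}")
```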
So, sorry if there was any misunderstanding/confusion. I'll try to refer to it as the hold-out set from now on to avoid confusion with the test set.
Thanks for the discussion!
DS
Thank you for your input on the cross-validation column.
Sreekumar Punnappilly