<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: CROSS VALIDATION - VALIDATION COLUMN METHOD in Discussions</title>
    <link>https://community.jmp.com/t5/Discussions/CROSS-VALIDATION-VALIDATION-COLUMN-METHOD/m-p/588349#M79332</link>
    <description>Discussion on how to choose train/validation/test split ratios when using the validation column method for cross-validation in JMP.</description>
    <pubDate>Tue, 10 Jan 2023 13:28:41 GMT</pubDate>
    <dc:creator>Victor_G</dc:creator>
    <dc:date>2023-01-10T13:28:41Z</dc:date>
    <item>
      <title>CROSS VALIDATION - VALIDATION COLUMN METHOD</title>
      <link>https://community.jmp.com/t5/Discussions/CROSS-VALIDATION-VALIDATION-COLUMN-METHOD/m-p/588298#M79328</link>
      <description>&lt;P&gt;When using the validation column method for cross validation , we split the data set into training , validation and test sets. This split ratio is specified by the user. Is there any guideline /reference to decide on the split ratio (such as 60:20:20 / 70:15:15 / 50:25:25 / 80 :10:10). Is it chosen also based on the total number of observations -N ?&lt;/P&gt;</description>
      <pubDate>Thu, 08 Jun 2023 16:43:59 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/CROSS-VALIDATION-VALIDATION-COLUMN-METHOD/m-p/588298#M79328</guid>
      <dc:creator>sreekumarp</dc:creator>
      <dc:date>2023-06-08T16:43:59Z</dc:date>
    </item>
    <item>
      <title>Re: CROSS VALIDATION - VALIDATION COLUMN METHOD</title>
      <link>https://community.jmp.com/t5/Discussions/CROSS-VALIDATION-VALIDATION-COLUMN-METHOD/m-p/588349#M79332</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.jmp.com/t5/user/viewprofilepage/user-id/39387"&gt;@sreekumarp&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Interesting question, and I'm afraid I don't have a definitive answer, as it depends on the dataset, the types of models considered, and the practices/habits of the analyst.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;First, it's important to know the use and purpose of each set:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Training set: used for&amp;nbsp;&lt;SPAN&gt;the actual training of the model(s),&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;Validation set: used for model optimization (for example, hyperparameter fine-tuning and feature/threshold selection) and model selection,&lt;/LI&gt;&lt;LI&gt;Test set: used to assess the generalization and predictive performance of the selected model on new/unseen data.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;There are several methods to split your data, depending on your objectives and the size of your dataset:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;A href="https://towardsdatascience.com/how-to-split-data-into-three-sets-train-validation-and-test-and-why-e50d22d3e54c" target="_self"&gt;Train/Validation/Test sets&lt;/A&gt;: fixed sets to train, optimize, and assess model performance. Recommended for larger datasets.&lt;/LI&gt;&lt;LI&gt;&lt;A href="https://machinelearningmastery.com/k-fold-cross-validation/" target="_self"&gt;K-fold cross-validation&lt;/A&gt;: split the dataset into K folds. The model is trained K times; each fold is used K-1 times for training and once for validation. This enables an assessment of model robustness, as performance should be equivalent across all folds.&lt;/LI&gt;&lt;LI&gt;&lt;A href="https://link.springer.com/referenceworkentry/10.1007/978-0-387-30164-8_469#:~:text=Definition,a%20single%2Ditem%20test%20set" target="_self"&gt;Leave-One-Out cross-validation&lt;/A&gt;: extreme case of K-fold cross-validation where K = N (the number of observations). It is used when you have a small dataset and want to assess whether your model is robust.&lt;/LI&gt;&lt;LI&gt;Autovalidation/Self-Validating Ensemble Model: instead of separating observations into different sets, you associate each observation with a weight for training and validation (a bigger weight in training induces a lower weight in validation, meaning that observation is used mainly for training and less for validation), and then repeat this procedure while varying the weights. It is used for very small datasets, and/or datasets where you can't independently split observations between sets: for example, in Design of Experiments the set of experiments to run can be based on a model, and if so, you can't independently split runs between training and validation, as that would bias the model negatively; the runs needed for estimating parameters wouldn't be available, dramatically reducing the performance of the model.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;All these approaches are supported by JMP:&amp;nbsp;&lt;A href="https://www.jmp.com/support/help/en/17.1/index.shtml#page/jmp/launch-the-make-validation-column-platform.shtml" target="_blank" rel="noopener"&gt;Launch the Make Validation Column Platform (jmp.com)&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;As a rule of thumb, a 70/20/10 ratio is often used.&amp;nbsp;See the paper "&lt;A href="https://arxiv.org/pdf/2202.03326.pdf" target="_self"&gt;Optimal Ratio for Data Splitting&lt;/A&gt;" for more details.&amp;nbsp;Generally, the higher the number of parameters in the model, the bigger your training set needs to be, as you'll need more data to estimate each parameter precisely, so the complexity/type of model is also something to consider when creating training/validation/test sets.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;If you have a more specific use case, I may be able to provide more helpful and less general guidance.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I also highly recommend the playlist "Making Friends with Machine Learning" from Cassie Kozyrkov to learn more about model training, validation, and testing:&amp;nbsp;&lt;A href="https://www.youtube.com/playlist?list=PLRKtJ4IpxJpDxl0NTvNYQWKCYzHNuy2xG" target="_blank" rel="noopener"&gt;Making Friends with Machine Learning - YouTube&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Hope this first answer helps,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 10 Jan 2023 13:28:41 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/CROSS-VALIDATION-VALIDATION-COLUMN-METHOD/m-p/588349#M79332</guid>
      <dc:creator>Victor_G</dc:creator>
      <dc:date>2023-01-10T13:28:41Z</dc:date>
    </item>
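For readers who want to see the mechanics of the fixed-split and K-fold schemes described above outside of JMP, here is a minimal pure-Python sketch (the function names and the 70/20/10 default are illustrative, not part of any JMP API):

```python
import random

def make_validation_column(n, ratios=(0.7, 0.2, 0.1), seed=42):
    """Assign each of n rows to 'Training', 'Validation' or 'Test'
    according to the given ratios (a fixed three-way split)."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    n_train = round(n * ratios[0])
    n_val = round(n * ratios[1])
    labels = [None] * n
    for pos, i in enumerate(idx):
        if pos < n_train:
            labels[i] = "Training"
        elif pos < n_train + n_val:
            labels[i] = "Validation"
        else:
            labels[i] = "Test"
    return labels

def kfold_column(n, k=4, seed=42):
    """Assign each row a fold number 0..k-1; each fold serves once as
    the validation fold while the other k-1 folds are used for training."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    return {i: pos % k for pos, i in enumerate(idx)}
```

With n = 100 and the default ratios, this yields exactly 70/20/10 rows per set; with n = 72 and k = 4 (as in a later post in this thread), each fold holds 18 rows.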
    <item>
      <title>Re: CROSS VALIDATION - VALIDATION COLUMN METHOD</title>
      <link>https://community.jmp.com/t5/Discussions/CROSS-VALIDATION-VALIDATION-COLUMN-METHOD/m-p/588453#M79341</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.jmp.com/t5/user/viewprofilepage/user-id/39387"&gt;@sreekumarp&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp; In addition to what&amp;nbsp;&lt;a href="https://community.jmp.com/t5/user/viewprofilepage/user-id/11568"&gt;@Victor_G&lt;/a&gt;&amp;nbsp;wrote, I would also highly recommend that you split off your test data set and make a new data table with it. That way, when you train and validate your models, you can compare them on the test holdout data table to see which model performs best. This reduces any chance that the test data set could accidentally be used in the training or validation sets.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp; There are also ways to use simulated data to train your models and then test the models on the real data. I sometimes use this approach when the original data set is small and I need to keep the correlation structure of the inputs.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Good luck!&lt;/P&gt;&lt;P&gt;DS&lt;/P&gt;</description>
      <pubDate>Tue, 10 Jan 2023 15:20:25 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/CROSS-VALIDATION-VALIDATION-COLUMN-METHOD/m-p/588453#M79341</guid>
      <dc:creator>SDF1</dc:creator>
      <dc:date>2023-01-10T15:20:25Z</dc:date>
    </item>
    <item>
      <title>Re: CROSS VALIDATION - VALIDATION COLUMN METHOD</title>
      <link>https://community.jmp.com/t5/Discussions/CROSS-VALIDATION-VALIDATION-COLUMN-METHOD/m-p/588524#M79348</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.jmp.com/t5/user/viewprofilepage/user-id/12549"&gt;@SDF1&lt;/a&gt;,&lt;BR /&gt;&lt;BR /&gt;One minor correction to the great addition you provided: the validation set is the set used for model comparison and selection, not the test set (though the names "test" and "validation" are sometimes used interchangeably or confusingly).&lt;BR /&gt;&lt;BR /&gt;These two sets have very different purposes:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;The test set is the holdout part of the data, not used &lt;STRONG&gt;before&lt;/STRONG&gt; a model has been selected, in order to provide an unbiased estimate of the model's generalization and predictive performance.&amp;nbsp;&lt;/LI&gt;&lt;LI&gt;The validation set is a portion of the data used for model fine-tuning and model comparison, in order to select the best candidate model.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;An explanation is given here:&amp;nbsp;&lt;A href="https://datascience.stackexchange.com/questions/43210/can-i-use-the-test-dataset-to-select-a-model" target="_blank" rel="noopener"&gt;machine learning - Can I use the test dataset to select a model? - Data Science Stack Exchange&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Two explanations for this difference in use:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Practical: in data science competitions you don't have access to the test dataset, so in order to create and select the best-performing algorithm you have to split your data into training and validation sets, and "hope" for good performance (and generalization) on the unseen test set.&lt;/LI&gt;&lt;LI&gt;Theoretical: if you use the test set to compare models, you are actually creating data/information leakage, as you can improve your results on the test set over time by selecting the best-performing algorithm on the test set (and then perhaps continuing to fine-tune it based on its performance on test data, or trying other models...). So your test set is no longer unbiased, as the choice of algorithm (and perhaps other actions taken after the selection) is based on it.&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The test set is often the last step between model creation/development and deployment in production or publication, so the final assessment needs to be as fair and unbiased as possible.&lt;/P&gt;&lt;P&gt;Some resources on the sets:&amp;nbsp;&lt;A href="https://mlu-explain.github.io/train-test-validation/" target="_blank" rel="noopener"&gt;Train, Test, and Validation Sets (mlu-explain.github.io)&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="https://www.youtube.com/watch?v=-yda0fNYYG8" target="_blank" rel="noopener"&gt;MFML 071 - What's the difference between testing and validation? - YouTube&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;I hope this avoids any confusion in the naming and use of the sets,&lt;/P&gt;</description>
      <pubDate>Tue, 10 Jan 2023 17:15:24 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/CROSS-VALIDATION-VALIDATION-COLUMN-METHOD/m-p/588524#M79348</guid>
      <dc:creator>Victor_G</dc:creator>
      <dc:date>2023-01-10T17:15:24Z</dc:date>
    </item>
    <item>
      <title>Re: CROSS VALIDATION - VALIDATION COLUMN METHOD</title>
      <link>https://community.jmp.com/t5/Discussions/CROSS-VALIDATION-VALIDATION-COLUMN-METHOD/m-p/588574#M79354</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.jmp.com/t5/user/viewprofilepage/user-id/11568"&gt;@Victor_G&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp; Sorry if it wasn't clear, but we are actually referring to the same thing. When I mentioned models before, I was talking about generating models using the same test/validation setup, but also across different platforms in JMP. I never build models with any one platform in mind, but generate and tune models using the different platforms (e.g. NN, BT, Bootstrap Forest, SVM, KNN, and XGBoost) and cross-validation with the training and validation sets. Then, after I've generated the different models, I compare their ability to predict on the holdout (test) data set, a data set that none of the models have seen during their training and validation steps. Sometimes the NN works best, and sometimes it's XGBoost, or some other platform.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp; Lastly &lt;a href="https://community.jmp.com/t5/user/viewprofilepage/user-id/39387"&gt;@sreekumarp&lt;/a&gt;&amp;nbsp;, one thing I forgot to mention in my previous post is that it really helps to stratify your cross-validation column using your output column -- the column you're trying to predict. This keeps a similar distribution structure across your training, validation, and test data sets. Sometimes this can't be done, especially with highly imbalanced data sets, but if possible, I highly recommend it. If not, the data is randomly placed in each type of data set, which can lead to a poorly divided validation column and, in turn, to poor models.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;DS&lt;/P&gt;</description>
      <pubDate>Tue, 10 Jan 2023 18:18:33 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/CROSS-VALIDATION-VALIDATION-COLUMN-METHOD/m-p/588574#M79354</guid>
      <dc:creator>SDF1</dc:creator>
      <dc:date>2023-01-10T18:18:33Z</dc:date>
    </item>
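The stratification idea in the post above (keeping the response distribution similar across training, validation, and test sets) can be sketched in plain Python; this is an illustrative version for a categorical response, not JMP's implementation, and the function name is made up:

```python
import random
from collections import defaultdict

def stratified_validation_column(y, ratios=(0.7, 0.2, 0.1), seed=1):
    """Split rows into Training/Validation/Test so that each class of the
    response y keeps roughly the same proportions in every set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    out = [None] * len(y)
    # Split each class separately with the same ratios.
    for label, rows in by_class.items():
        rng.shuffle(rows)
        n = len(rows)
        n_train = round(n * ratios[0])
        n_val = round(n * ratios[1])
        for pos, i in enumerate(rows):
            if pos < n_train:
                out[i] = "Training"
            elif pos < n_train + n_val:
                out[i] = "Validation"
            else:
                out[i] = "Test"
    return out
```

With a balanced two-class response of 50 rows each, every set ends up with the same class balance as the full table, which is the point of stratifying.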
    <item>
      <title>Re: CROSS VALIDATION - VALIDATION COLUMN METHOD</title>
      <link>https://community.jmp.com/t5/Discussions/CROSS-VALIDATION-VALIDATION-COLUMN-METHOD/m-p/588585#M79355</link>
      <description>&lt;P&gt;Good point on the stratified sampling&amp;nbsp;&lt;a href="https://community.jmp.com/t5/user/viewprofilepage/user-id/12549"&gt;@SDF1&lt;/a&gt;!&lt;/P&gt;&lt;P&gt;I'm sorry, but either we aren't using the same naming or there is a misunderstanding.&lt;BR /&gt;&lt;BR /&gt;If you bring several models into the test set to compare them (and select one), that's an issue and not the purpose of this set, no matter which platforms or models are used. Comparing models should be done on the validation set (to keep a clear and "pure" test set without any information leakage or bias), hence my previous answer with some resources on this topic.&lt;BR /&gt;&lt;BR /&gt;See JMP Help, which also emphasizes the differences between the sets: "The testing set checks the model’s predictive ability &lt;STRONG&gt;after&lt;/STRONG&gt; a model has been chosen.":&amp;nbsp;&lt;A href="https://www.jmp.com/support/help/en/17.1/index.shtml#page/jmp/overview-of-the-make-validation-column-platform.shtml#ww269465" target="_blank" rel="noopener"&gt;https://www.jmp.com/support/help/en/17.1/index.shtml#page/jmp/overview-of-the-make-validation-column-platform.shtml#ww269465&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;This is also why it's important to define the different sets well, and to fix either the sets or the method, so that different models are properly evaluated on the same validation set (or with the same method).&lt;/P&gt;</description>
      <pubDate>Tue, 10 Jan 2023 19:21:29 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/CROSS-VALIDATION-VALIDATION-COLUMN-METHOD/m-p/588585#M79355</guid>
      <dc:creator>Victor_G</dc:creator>
      <dc:date>2023-01-10T19:21:29Z</dc:date>
    </item>
    <item>
      <title>Re: CROSS VALIDATION - VALIDATION COLUMN METHOD</title>
      <link>https://community.jmp.com/t5/Discussions/CROSS-VALIDATION-VALIDATION-COLUMN-METHOD/m-p/588595#M79357</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.jmp.com/t5/user/viewprofilepage/user-id/11568"&gt;@Victor_G&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp; Yeah, I think there might be some kind of misunderstanding. Perhaps I should say that there is a hold-out data set rather than a test data set. The hold-out data set is data that was never used in training, validation, or selection of a model from a given algorithm (test set). I use this hold-out set to compare the different algorithms against each other to see which algorithm performs best. JMP's Model Comparison platform is very helpful for comparing the different algorithms against each other using a hold-out set. This allows for an unbiased, leakage-free comparison of the different algorithms. It therefore becomes a sort of "test" data set in the sense that the hold-out data set is being used to see which of the different algorithms performs best on this "pure" data set.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp; This is the only way I am aware of to compare different algorithms against each other in JMP. What I mean is that this is the only way JMP can compare a neural net algorithm against an XGBoost algorithm, for example. I would never use, nor recommend, using the hold-out data set to compare different models from within a platform. I wouldn't compare the 20 different tuned models within the Bootstrap Forest platform against each other using the hold-out set; that would be the purpose of the test data set.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp; So, sorry if there was any misunderstanding/confusion. I'll try to refer to it as the hold-out set from now on to avoid confusion with the test set.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks for the discussion!&lt;/P&gt;&lt;P&gt;DS&lt;/P&gt;</description>
      <pubDate>Tue, 10 Jan 2023 19:30:55 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/CROSS-VALIDATION-VALIDATION-COLUMN-METHOD/m-p/588595#M79357</guid>
      <dc:creator>SDF1</dc:creator>
      <dc:date>2023-01-10T19:30:55Z</dc:date>
    </item>
    <item>
      <title>Re: CROSS VALIDATION - VALIDATION COLUMN METHOD</title>
      <link>https://community.jmp.com/t5/Discussions/CROSS-VALIDATION-VALIDATION-COLUMN-METHOD/m-p/588813#M79374</link>
      <description>&lt;P&gt;Thank you for providing a detailed input on the splitting of the data sets in machine learning. I am sure this will help in my research.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Sreekumar Punnappilly&lt;/P&gt;</description>
      <pubDate>Wed, 11 Jan 2023 10:29:02 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/CROSS-VALIDATION-VALIDATION-COLUMN-METHOD/m-p/588813#M79374</guid>
      <dc:creator>sreekumarp</dc:creator>
      <dc:date>2023-01-11T10:29:02Z</dc:date>
    </item>
    <item>
      <title>Re: CROSS VALIDATION - VALIDATION COLUMN METHOD</title>
      <link>https://community.jmp.com/t5/Discussions/CROSS-VALIDATION-VALIDATION-COLUMN-METHOD/m-p/588816#M79375</link>
      <description>&lt;P&gt;Thank you for your input on the cross validation column.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Sreekumar Punnappilly&lt;/P&gt;</description>
      <pubDate>Wed, 11 Jan 2023 10:33:49 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/CROSS-VALIDATION-VALIDATION-COLUMN-METHOD/m-p/588816#M79375</guid>
      <dc:creator>sreekumarp</dc:creator>
      <dc:date>2023-01-11T10:33:49Z</dc:date>
    </item>
    <item>
      <title>Re: CROSS VALIDATION - VALIDATION COLUMN METHOD</title>
      <link>https://community.jmp.com/t5/Discussions/CROSS-VALIDATION-VALIDATION-COLUMN-METHOD/m-p/588830#M79378</link>
      <description>&lt;P&gt;You're welcome&amp;nbsp;&lt;a href="https://community.jmp.com/t5/user/viewprofilepage/user-id/39387"&gt;@sreekumarp&lt;/a&gt;.&lt;BR /&gt;&lt;BR /&gt;If you consider one or several of these answers to be solutions, don't hesitate to mark them as such, to help visitors of the JMP Community find the answers they are looking for more easily.&lt;BR /&gt;If you have more questions, or a concrete case on which you would like some advice, don't hesitate to reply to this topic or create a new one.&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 11 Jan 2023 12:11:28 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/CROSS-VALIDATION-VALIDATION-COLUMN-METHOD/m-p/588830#M79378</guid>
      <dc:creator>Victor_G</dc:creator>
      <dc:date>2023-01-11T12:11:28Z</dc:date>
    </item>
    <item>
      <title>Re: CROSS VALIDATION - VALIDATION COLUMN METHOD</title>
      <link>https://community.jmp.com/t5/Discussions/CROSS-VALIDATION-VALIDATION-COLUMN-METHOD/m-p/588874#M79381</link>
      <description>&lt;P&gt;To your original question, no, there are not specific rules about how much data to leave out. In the JMP Education analytics courses, we advise you to hold out as much data as you are comfortable with, with at least 20% held out. If you feel the training set is too small to hold back that many rows, consider k-fold cross validation. How many rows are you willing to sacrifice to validation? Use k = n / that many rows. If k &amp;lt; 5 using that formula, consider leave-one-out cross validation.&lt;/P&gt;</description>
      <pubDate>Wed, 11 Jan 2023 13:31:22 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/CROSS-VALIDATION-VALIDATION-COLUMN-METHOD/m-p/588874#M79381</guid>
      <dc:creator>Di_Michelson</dc:creator>
      <dc:date>2023-01-11T13:31:22Z</dc:date>
    </item>
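The rule of thumb in the post above (hold out at least 20%; otherwise use k-fold CV with k = n divided by the rows you are willing to sacrifice, falling back to leave-one-out when that k would be below 5) can be written as a tiny hypothetical helper:

```python
def choose_validation_scheme(n, holdout_rows):
    """Heuristic from the post above: use k-fold CV with
    k = n / (rows sacrificed per fold); if that k falls below 5,
    fall back to leave-one-out CV (k = n)."""
    k = n // holdout_rows  # integer division as a practical approximation
    if k < 5:
        return ("leave-one-out", n)
    return ("k-fold", k)
```

For example, holding out 20 rows of a 100-row table suggests 5-fold CV, while holding out 10 rows of a 30-row table pushes you to leave-one-out.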
    <item>
      <title>Re: CROSS VALIDATION - VALIDATION COLUMN METHOD</title>
      <link>https://community.jmp.com/t5/Discussions/CROSS-VALIDATION-VALIDATION-COLUMN-METHOD/m-p/937143#M109242</link>
      <description>&lt;P&gt;Thanks.&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;There are actually two points to be clarified for K-fold cross-validation (CV) and leave-one-out (LOO) validation using routine regression in JMP Pro.&amp;nbsp;&lt;BR /&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Both methods are directly available in the PLSR part (partial least-squares regression), but it's unclear what to do and how to visualize the metrics when CV and/or LOO are applied to a simple regression. For example, I'm analyzing spectra, and PCR and PLSR are routinely involved in this chemometric process.&lt;BR /&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="4-fold validation.png" style="width: 729px;"&gt;&lt;img src="https://community.jmp.com/t5/image/serverpage/image-id/96968i384A80B0C82A7ECB/image-dimensions/729x256?v=v2" width="729" height="256" role="button" title="4-fold validation.png" alt="4-fold validation.png" /&gt;&lt;/span&gt;&lt;BR /&gt;I've had a somewhat strange experience: after I create a 4-fold validation column (above, 4 folds across 72 observations) and run PCR (numeric response and principal components as predictors), I receive the following error response:&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="4-fold validation1.png" style="width: 826px;"&gt;&lt;img src="https://community.jmp.com/t5/image/serverpage/image-id/96969i9ACAB8291B12FC76/image-dimensions/826x224?v=v2" width="826" height="224" role="button" title="4-fold validation1.png" alt="4-fold validation1.png" /&gt;&lt;/span&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Hence, no real CV is performed, right?&lt;BR /&gt;&lt;BR /&gt;The second point: how do I build LOO validation (a column?) for PCR?&lt;BR /&gt;&lt;BR /&gt;On the whole, please advise what has to be done to perform cross-validation and leave-one-out validation adequately for a typical regression and get proper metrics of fit. It's really important and not yet obvious to me.&amp;nbsp;&lt;BR /&gt;Thanks a lot, colleagues!&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 24 Mar 2026 15:09:53 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/CROSS-VALIDATION-VALIDATION-COLUMN-METHOD/m-p/937143#M109242</guid>
      <dc:creator>Nazarkovsky</dc:creator>
      <dc:date>2026-03-24T15:09:53Z</dc:date>
    </item>
    <item>
      <title>Re: CROSS VALIDATION - VALIDATION COLUMN METHOD</title>
      <link>https://community.jmp.com/t5/Discussions/CROSS-VALIDATION-VALIDATION-COLUMN-METHOD/m-p/937275#M109261</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.jmp.com/t5/user/viewprofilepage/user-id/19209"&gt;@Nazarkovsky&lt;/a&gt;,&lt;/P&gt;
&lt;P&gt;It would probably have been easier to start a new post instead of replying to this old one, as that would have provided more visibility and enabled other JMP users to join the discussion about your question.&lt;/P&gt;
&lt;P&gt;Yes, there are sometimes errors in JMP modeling platforms when using a fixed K-fold cross-validation column, as JMP mostly recognizes 2 levels (training and validation or test) or 3 levels (training, validation and test) in a validation column. Given the error message you got, no CV is done; a linear model is fit on all your data.&lt;BR /&gt;Since you're using a very simple linear model, you can perhaps try the&amp;nbsp;&lt;A href="https://www.jmp.com/support/help/en/18.1/#page/jmp/model-screening.shtml?_gl=1*s54ft*_up*MQ..*_ga*MzU2OTAwNjE5LjE3NzQ0MzEwNTg.*_ga_BRNVBEC1RS*czE3NzQ0MzEwNTckbzEkZzAkdDE3NzQ0MzEwNTckajYwJGwwJGgw#" target="_blank" rel="noopener"&gt;Model Screening&lt;/A&gt;&amp;nbsp;platform (which also includes PLS if you want to compare the different models' results), specifying your X's, your response Y, your 4-fold validation column (or doing the cross-validation directly from the platform), and the type of model you want to fit (linear regression model options + modeling options):&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Capture d'écran 2026-03-25 103300.png" style="width: 400px;"&gt;&lt;img src="https://community.jmp.com/t5/image/serverpage/image-id/97060iD1309145A69E4320/image-size/medium?v=v2&amp;amp;px=400" role="button" title="Capture d'écran 2026-03-25 103300.png" alt="Capture d'écran 2026-03-25 103300.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;Or directly with the K-folds crossvalidation option from the platform, you can obtain these results:&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Capture d'écran 2026-03-25 130627.png" style="width: 400px;"&gt;&lt;img src="https://community.jmp.com/t5/image/serverpage/image-id/97077iC247DDF3ACB5895C/image-size/medium?v=v2&amp;amp;px=400" role="button" title="Capture d'écran 2026-03-25 130627.png" alt="Capture d'écran 2026-03-25 130627.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;If you want to run Leave-One-Out cross-validation, simply specify K = n (n being the number of runs in your table). If you want access to individual fold results, you can adapt the solution I provided here to your modeling platform:&amp;nbsp;&lt;LI-MESSAGE title="Accessing out of fold metrics for K-Fold CV" uid="932392" url="https://community.jmp.com/t5/Discussions/Accessing-out-of-fold-metrics-for-K-Fold-CV/m-p/932392#U932392" discussion_style_icon_css="lia-mention-container-editor-message lia-img-icon-forum-thread lia-fa-icon lia-fa-forum lia-fa-thread lia-fa"&gt;&lt;/LI-MESSAGE&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Finally, I see at least one major problem in your modeling workflow that could lead to data leakage: you need to apply the cross-validation (or whatever splitting/validation strategy you use) to your principal component analysis as well. If you don't, the PCA sees the entire dataset and learns the correlations between the factors from all rows, so the linear model you fit afterwards (and evaluate through cross-validation) benefits from information from the whole dataset, not just from 3 out of 4 folds. Since JMP hasn't implemented a validation role for PCA (yet!), I added this Wishlist item, to prevent data leakage and possible errors in modeling workflows like yours:&amp;nbsp;&lt;LI-MESSAGE title="Add validation role option in Principal Component Analysis platform" uid="638160" url="https://community.jmp.com/t5/JMP-Wish-List/Add-validation-role-option-in-Principal-Component-Analysis/m-p/638160#U638160" discussion_style_icon_css="lia-mention-container-editor-message lia-img-icon-idea-thread lia-fa-icon lia-fa-idea lia-fa-thread lia-fa"&gt;&lt;/LI-MESSAGE&gt;&amp;nbsp;To summarize, you should:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;Define and apply a validation strategy (CV, LOO, or a standard train/validation or train/validation/test split).&lt;/LI&gt;
&lt;LI&gt;Preprocess the data on the training set only (or preprocess the data several times, once per training fold, when cross-validating).&lt;/LI&gt;
&lt;LI&gt;Fit the model using the same validation strategy, with the data preprocessed from the training set (or from the corresponding training folds).&lt;/LI&gt;
&lt;/OL&gt;
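The three steps above can be sketched as follows. This is a hedged illustration in Python/NumPy rather than JSL (the function names are mine, and plain least squares on the PCA scores stands in for whatever model you fit in JMP); the point is only that the PCA is refit inside each fold, on training rows only:

```python
import numpy as np

def pca_fit(X_train, n_comp):
    # Center and decompose using TRAINING rows only, never the full dataset
    mu = X_train.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_train - mu, full_matrices=False)
    return mu, Vt[:n_comp].T          # training mean and loadings (p x n_comp)

def cv_rmse_no_leakage(X, y, folds, n_comp):
    # Steps 1-3: for each fold, preprocess (PCA) on the training rows,
    # fit least squares on the scores, then score the held-out fold.
    folds = np.asarray(folds)
    errors = []
    for f in np.unique(folds):
        train, valid = folds != f, folds == f
        mu, W = pca_fit(X[train], n_comp)       # PCA never sees fold f
        Z_train = (X[train] - mu) @ W
        Z_valid = (X[valid] - mu) @ W           # projected with training mean/loadings
        A = np.column_stack([np.ones(len(Z_train)), Z_train])
        beta, *_ = np.linalg.lstsq(A, y[train], rcond=None)
        pred = np.column_stack([np.ones(len(Z_valid)), Z_valid]) @ beta
        errors.append(np.sqrt(np.mean((y[valid] - pred) ** 2)))
    return float(np.mean(errors))
```

Fitting one global PCA first and then cross-validating only the regression would let every validation row influence the loadings, which is exactly the leakage described above.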
&lt;P&gt;Hope this answer helps,&lt;/P&gt;</description>
      <pubDate>Wed, 25 Mar 2026 12:07:20 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/CROSS-VALIDATION-VALIDATION-COLUMN-METHOD/m-p/937275#M109261</guid>
      <dc:creator>Victor_G</dc:creator>
      <dc:date>2026-03-25T12:07:20Z</dc:date>
    </item>
    <item>
      <title>Re: CROSS VALIDATION - VALIDATION COLUMN METHOD</title>
      <link>https://community.jmp.com/t5/Discussions/CROSS-VALIDATION-VALIDATION-COLUMN-METHOD/m-p/937334#M109265</link>
      <description>&lt;P&gt;Anyway, the section for k-fold crossvalidation in JMP is still weak. I feel curious why it doesn't concern the developers, as this validation method is a primary tool for a preliminary assessment of any model, especially at low number of data. Moreover, in the proper option to build a K-fold CV column, it's impossible to build 3-fold cross-validation - the minimal value is 4.&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;I also do not like JMP's report of the metrics for CV, since there is no final option/button to build a 'Predicted' column after all the runs as an average result and compare to the predicted values delivered by Generalized Regression or PLSR.&lt;BR /&gt;&lt;BR /&gt;As for the validation of PCA, this option is encountered in Unscrambler, a traditional software for chemometrics (multivariate analysis of chemical data, like spectra). Yes, the validation is scanned across the pre-set numbers of components vs. explained variance. For this test of leakage I use Unscrambler, aha.&lt;BR /&gt;&lt;BR /&gt;There's a series of JMP posts here in Community dedicated to the analysis of spectra, but still short in comprehensive discussion for CV and PCR.&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;However, in chemometrics, SVR (support vector regression), PLSR, and PCR are always applied and their metrics are compared.&lt;BR /&gt;&lt;BR /&gt;For LOO thanks, it really works properly.&lt;/P&gt;</description>
      <pubDate>Wed, 25 Mar 2026 17:17:47 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/CROSS-VALIDATION-VALIDATION-COLUMN-METHOD/m-p/937334#M109265</guid>
      <dc:creator>Nazarkovsky</dc:creator>
      <dc:date>2026-03-25T17:17:47Z</dc:date>
    </item>
  </channel>
</rss>

