
Discussions

Solve problems, and share tips and tricks with other JMP users.
sreekumarp
Level I

CROSS VALIDATION - VALIDATION COLUMN METHOD

When using the validation column method for cross validation, we split the data set into training, validation, and test sets. The split ratio is specified by the user. Is there any guideline or reference for deciding on the split ratio (such as 60:20:20, 70:15:15, 50:25:25, or 80:10:10)? Should the choice also depend on the total number of observations, N?
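For concreteness, here is what a 60:20:20 split would look like outside JMP, as a minimal sketch in Python with scikit-learn (the ratio and all data here are just illustrative):

```python
# Sketch of a 60/20/20 train/validation/test split via two successive splits.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))      # 1000 observations, 5 predictors (made up)
y = rng.integers(0, 2, size=1000)   # binary response (made up)

# First carve off the 20% test set, then split the remainder 75/25,
# so the final proportions are 60/20/20 of the original N.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, random_state=1)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=1)

print(len(X_train), len(X_valid), len(X_test))  # 600 200 200
```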

14 REPLIES
SDF1
Super User

Re: CROSS VALIDATION - VALIDATION COLUMN METHOD

Hi @Victor_G,

 

  Sorry if it wasn't clear, but we are actually referring to the same thing. When I mentioned models before, I was talking about generating models using the same test/validation setup, but across different platforms in JMP. I never build models with any one platform in mind; I generate and tune models using the different platforms (e.g., NN, BT, Bootstrap Forest, SVM, KNN, and XGBoost) and cross validation with the training and validation sets. Then, after I've generated the different models, I compare their ability to predict on the holdout (test) data set, a data set that none of the models have seen during their training and validation steps. Sometimes the NN works best, and sometimes it's XGBoost or some other platform.

 

  Lastly, @sreekumarp, one thing I forgot to mention in my previous post is that it really helps to stratify your cross-validation column using your output column -- the column you're trying to predict. This keeps a similar distribution structure across your training, validation, and test data sets. Sometimes this can't be done, especially with highly imbalanced data sets, but if possible, I highly recommend it. Otherwise, the data is placed randomly in each type of data set, which can produce a poorly divided validation column and, in turn, poor models.
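For anyone doing this outside JMP, here is a minimal sketch of the stratification idea (Python/scikit-learn; the imbalance and all numbers are made up for illustration):

```python
# Passing the response to `stratify` keeps the class proportions
# (nearly) identical in every subset, even with imbalanced data.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
y = rng.choice([0, 1], size=1000, p=[0.9, 0.1])  # imbalanced response
X = rng.normal(size=(1000, 3))

X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=1)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=1)

for name, part in [("train", y_train), ("valid", y_valid), ("test", y_test)]:
    print(name, part.mean())  # each subset keeps ~10% of class 1
```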

 

DS

Victor_G
Super User

Re: CROSS VALIDATION - VALIDATION COLUMN METHOD

Good point on the stratified sampling, @SDF1!

I'm sorry, but either we aren't using the same naming or there is a misunderstanding.

If you bring several models to the test set to compare them (and select one), that's an issue, and not the purpose of this set, no matter the platforms or models used. Comparing models should be done on the validation set (to keep a clean and "pure" test set without any information leakage or bias), hence my previous answer with some resources on this topic.

See the JMP Help, which also emphasizes the differences between the sets: "The testing set checks the model's predictive ability after a model has been chosen." https://www.jmp.com/support/help/en/17.1/index.shtml#page/jmp/overview-of-the-make-validation-column...

 

This is also why it is important to define the different sets carefully, and to fix either the sets themselves or the splitting method, so that different models are properly evaluated on the same validation set (or with the same method).
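As a minimal sketch of this workflow (in Python/scikit-learn rather than JMP; the candidate models are just placeholders): the candidates are compared on the validation set, and only the single chosen model ever touches the test set.

```python
# Model selection happens on the validation set; the test set is
# touched exactly once, by the chosen model.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, random_state=1)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=1)

candidates = {
    "logistic": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(random_state=0),
}
val_scores = {
    name: accuracy_score(y_valid, model.fit(X_train, y_train).predict(X_valid))
    for name, model in candidates.items()
}

best = max(val_scores, key=val_scores.get)  # selection uses validation only
print("chosen:", best,
      "test accuracy:", accuracy_score(y_test, candidates[best].predict(X_test)))
```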

Victor GUILLER

"It is not unusual for a well-designed experiment to analyze itself" (Box, Hunter and Hunter)
SDF1
Super User

Re: CROSS VALIDATION - VALIDATION COLUMN METHOD

Hi @Victor_G,

 

  Yeah, I think there might be some kind of misunderstanding. Perhaps I should say that there is a hold-out data set rather than a test data set. The hold-out data set is data that was never used in training, validation, or selection of a model from a given algorithm (the test set's job). I use this hold-out set to compare the different algorithms against each other to see which one performs best. JMP's Model Comparison platform is very helpful for comparing the different algorithms using a hold-out set. This allows for an unbiased, leakage-free comparison of the different algorithms. The hold-out set therefore becomes a sort of "test" data set in the sense that it is used to see which of the different algorithms performs best on this "pure" data.

 

  This is the only way I am aware of to compare different algorithms against each other in JMP; that is, it's the only way JMP can compare a neural net algorithm against an XGBoost algorithm, for example. I would never use, nor recommend using, the hold-out data set to compare different models from within a platform. I wouldn't compare the 20 different tuned models within the bootstrap forest platform against each other using the hold-out set; that is the purpose of the test data set. A sketch of this two-level setup is below.
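Here is a minimal sketch of the two-level setup (again in Python/scikit-learn rather than JMP's platforms; the algorithms and tuning grids are just placeholders): tune within each algorithm on the validation set, then compare each algorithm's winner on the hold-out set.

```python
# Level 1: pick each algorithm's best tuning on the validation set.
# Level 2: compare the per-algorithm winners on the hold-out set.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_rest, X_hold, y_rest, y_hold = train_test_split(
    X, y, test_size=0.20, random_state=1)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=1)

def best_of(models):
    """Fit each candidate and keep the one with the best validation score."""
    fitted = [m.fit(X_train, y_train) for m in models]
    return max(fitted, key=lambda m: accuracy_score(y_valid, m.predict(X_valid)))

winners = {
    "forest": best_of([RandomForestClassifier(n_estimators=n, random_state=0)
                       for n in (50, 100, 200)]),
    "knn": best_of([KNeighborsClassifier(n_neighbors=k) for k in (3, 5, 9)]),
}

# The hold-out set is seen only now, by each algorithm's winner.
for name, model in winners.items():
    print(name, accuracy_score(y_hold, model.predict(X_hold)))
```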

 

  So, sorry if there was any misunderstanding/confusion. I'll try to refer to it as the hold-out set from now on to avoid confusion with the test set.

 

Thanks for the discussion!

DS

sreekumarp
Level I

Re: CROSS VALIDATION - VALIDATION COLUMN METHOD

Thank you for your input on the cross validation column. 

 

Sreekumar Punnappilly

Re: CROSS VALIDATION - VALIDATION COLUMN METHOD

To your original question: no, there are no specific rules about how much data to leave out. In the JMP Education analytics courses, we advise holding out as much data as you are comfortable with, with at least 20% held out. If you feel the training set would be too small after holding back that many rows, consider k-fold cross validation. How many rows are you willing to sacrifice to validation? Use k = n / that many rows. If k < 5 by that formula, consider leave-one-out cross validation.
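As a quick arithmetic sketch of that rule of thumb (the function and the numbers below are purely illustrative, not part of any course material):

```python
def choose_k(n_rows, rows_for_validation):
    """k-fold count implied by how many rows you'd sacrifice per fold."""
    k = n_rows // rows_for_validation
    # If that gives fewer than 5 folds, fall back to leave-one-out (k = n).
    return n_rows if k < 5 else k

print(choose_k(1000, 100))  # k = 10 folds of 100 rows each
print(choose_k(40, 10))     # k would be 4 (< 5), so leave-one-out: k = 40
```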
