Solved: Re: How good is K-fold cross validation for small datasets?

tnad · Mar 2, 2020 11:01 PM

I have a small data set with 154 molecules (attached). I'm trying to predict KV40 using 6 factors. From looking at similar studies in the literature, many use the leave-one-out validation method to build models for such smaller data sets, so to replicate this in JMP, I did the following:

Fit-model: Stepwise regression
- used response surface for factors
- used k (k-fold cross-validation) = number of samples = 154
- left everything else as default

Result:
r2 = 0.84; r2 (k-fold) = 0.72

After box-cox transformation: r2 = 0.87 (r2_adj = 0.86)

I tried to do this with a validation column (70:30) and did not get a result as good as this. I have JMP Pro. My concern: am I creating a fake or overfit model when I do this type of cross-validation? Is there a better way to do this? Is there anything I should watch or test for? Thanks.
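To illustrate why the cross-validated r2 comes out lower than the ordinary r2, here is a minimal Python sketch on synthetic data. It is not the attached data set and not JMP's own computation: the one-factor linear model, the noise level, and the definition of the leave-one-out r2 as 1 - PRESS/SST are all assumptions made for illustration.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b

def r_squared(sum_sq_error, ys):
    """1 - (error sum of squares) / (total sum of squares about the mean)."""
    mean_y = sum(ys) / len(ys)
    sst = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - sum_sq_error / sst

# Synthetic stand-in data: 154 rows, one factor, noisy linear response.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(154)]
ys = [1.5 * x + rng.gauss(0, 3) for x in xs]

# Training r2: fit and score on the same rows.
a, b = fit_line(xs, ys)
sse_train = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Leave-one-out: refit without row i, then predict row i.
press = 0.0
for i in range(len(xs)):
    a_i, b_i = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
    press += (ys[i] - (a_i + b_i * xs[i])) ** 2

r2_train = r_squared(sse_train, ys)
r2_loo = r_squared(press, ys)
print(f"training r2 = {r2_train:.3f}, leave-one-out r2 = {r2_loo:.3f}")
```

For a least-squares fit, each held-out prediction error is at least as large as the corresponding training residual, so the leave-one-out value cannot exceed the training value. The gap grows as the model gets more flexible relative to the sample size.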

Mark_Bailey · Mar 3, 2020 08:16 AM

Right, honest assessment for model selection with cross-validation is always valid, but using hold-out sets for validation and testing is only practical and rewarding if you have a large data set. (There is also the issue of rare targets, even with large data sets.) That is why K-fold cross-validation was invented. Leave-one-out is just taking K-fold to the extreme. I would not expect the 70:30 hold-out approach to work well with a sample this small.

You have used cross-validation for model selection. You will need new data to test the selected model to see if the training generalizes to the larger population.

Byron_JMP · Mar 9, 2020 02:09 PM

@Mark_Bailey You mentioned, "You have used cross-validation for model selection. You will need new data to test the selected model to see if the training generalizes to the larger population."

Is this because cross-validation tends to overfit?

JMP Systems Engineer, Health and Life Sciences (Pharma)

Mark_Bailey · Mar 9, 2020 04:52 PM

Over-fitting is the concern, but the reason for my statement is the whole scheme of 'honest assessment.' One uses some data to train the model and separate, entirely new data to validate or select the model (two hold-out sets). This approach is valid but is still limited to the data already seen during learning and selecting. It cannot speak to generalization to new data. New data is needed to test the chosen model. This data can be a third hold-out set or future data. The risk of waiting for future data depends on the situation, so if a large amount of data is available, the third hold-out set is preferred.
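The three-set scheme described above can be sketched in Python as follows. This is only an illustration: the 60/20/20 proportions, the function name, and the seed are assumptions, not something JMP prescribes.

```python
import random

def three_way_split(n_rows, frac_train=0.6, frac_validate=0.2, seed=0):
    """Shuffle row indices and cut them into train / validate / test sets.

    The fractions are illustrative assumptions; the test set is whatever
    remains after the training and validation rows are taken.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_train = int(n_rows * frac_train)
    n_validate = int(n_rows * frac_validate)
    train = indices[:n_train]                         # used to fit each candidate model
    validate = indices[n_train:n_train + n_validate]  # used to select among models
    test = indices[n_train + n_validate:]             # untouched until the final check
    return train, validate, test

train, validate, test = three_way_split(154)
print(len(train), len(validate), len(test))
```

With 154 rows, each hold-out set ends up with only about 30 rows, which is why this scheme is better suited to large data sets.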

Nazarkovsky · Aug 28, 2020 7:58 PM

Hello! May I ask why you used k = 154 for your cross-validation?

Normally I set k to 5 (20% for validation), depending on the dataset, although all of them were small enough for normal training-validation-testing.

Thanks!

In the future, it would be desirable for JMP Pro to offer an option to visualize the fold distribution while processing data this way.

Reaching New Frontiers

Mark_Bailey · Aug 29, 2020 4:54 AM

Honest assessment by cross-validation is commonly implemented in one of three ways: hold out sets (train, validate, test), k-fold, or leave one out. Using k = N (sample size) is a way of using k-fold CV to achieve the last approach.

Nazarkovsky · Aug 29, 2020 08:30 AM

Wow! It is quite surprising to me, as the idea of K-fold cross-validation rests on dividing a dataset into K folds, where K-1 are used for training and 1 for validation. I may be confused, though. From your post, it looks like this strategy is really "leave one out", isn't it?

Reaching New Frontiers

Craige_Hales · Aug 30, 2020 02:01 PM

I just figured this out...

cooking: fold the egg into the mixture

geology: intense folding in the earth's mantle

business: the club folded after a year

sports: the runner folded after a mile

cards: know when to fold 'em

geography: the town lies in a fold in the hills

web design: above the fold text doesn't require scrolling

change: a 10-fold increase in accidents

paper: origami folds to make a duck

idiom: fold your hands

farming: put the sheep in the fold

It has to be the farming version! It means pen! The data is divided up into K pens/subsets/folds.

Craige

Mark_Bailey · Aug 31, 2020 04:56 PM

Yes, K = N produces 'leave one out' cross-validation.
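A small Python sketch of the fold bookkeeping makes this concrete. It only illustrates the partitioning idea; the function name and the round-robin assignment of rows to folds are assumptions, not how JMP assigns folds.

```python
import random

def k_fold_splits(n_rows, k, seed=0):
    """Yield (training_rows, held_out_rows) pairs for k-fold cross-validation."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled rows into k folds, round-robin.
    folds = [indices[i::k] for i in range(k)]
    for held_out in folds:
        held_out_set = set(held_out)
        training = [i for i in indices if i not in held_out_set]
        yield training, held_out

five_fold = list(k_fold_splits(154, k=5))
leave_one_out = list(k_fold_splits(154, k=154))

# With k = 5, each fold holds out roughly 20% of the rows.
print([len(held_out) for _, held_out in five_fold])
# With k = N = 154, every fold holds out exactly one row.
print(len(leave_one_out), {len(held_out) for _, held_out in leave_one_out})
```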

russ_wolfinger · Oct 30, 2023 09:04 AM

Hello @tnad I just came across this post and hope you don't mind an additional reply 3.5 years later!

I think setting K = 154 for Stepwise in this case is not recommended since the algorithm searches for the best fold, which in this case would consist of only a single observation. In general it is important to perform stepwise nested within each fold to avoid overfitting, and this is available in the Nested K-Fold option in Model Screening. Also I think 154 observations is plenty for k-fold and that repeated nested k-fold is a great way to compare models.
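The idea of running model selection inside each fold can be sketched in Python as below. This is a toy under stated assumptions, not JMP's Nested K-Fold implementation: the data are synthetic, "selection" is reduced to picking the single best factor for a straight-line fit (a stand-in for stepwise), and the fold counts and all function names are illustrative.

```python
import random

def make_folds(rows, k, rng):
    """Shuffle a copy of the row indices and deal them into k folds."""
    rows = rows[:]
    rng.shuffle(rows)
    return [rows[i::k] for i in range(k)]

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b

def fold_error(X, y, j, train, held_out):
    """Fit factor j on the train rows; return squared error on held_out rows."""
    a, b = fit_line([X[i][j] for i in train], [y[i] for i in train])
    return sum((y[i] - (a + b * X[i][j])) ** 2 for i in held_out)

def select_factor(X, y, rows, k, rng):
    """Toy stand-in for stepwise: the factor with the lowest k-fold error on `rows`."""
    folds = make_folds(rows, k, rng)
    errors = []
    for j in range(len(X[0])):
        total = 0.0
        for held_out in folds:
            held = set(held_out)
            train = [i for i in rows if i not in held]
            total += fold_error(X, y, j, train, held_out)
        errors.append(total)
    return errors.index(min(errors))

def nested_cv_r2(X, y, outer_k=5, inner_k=4, seed=0):
    """Outer folds score the model; selection is redone inside each outer fold."""
    rng = random.Random(seed)
    all_rows = list(range(len(y)))
    press = 0.0
    for held_out in make_folds(all_rows, outer_k, rng):
        held = set(held_out)
        train = [i for i in all_rows if i not in held]
        j = select_factor(X, y, train, inner_k, rng)  # never sees held_out rows
        press += fold_error(X, y, j, train, held_out)
    mean_y = sum(y) / len(y)
    sst = sum((v - mean_y) ** 2 for v in y)
    return 1.0 - press / sst

# Synthetic stand-in data: 154 rows, 6 factors, only the first one matters.
data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1) for _ in range(6)] for _ in range(154)]
y = [2.0 * row[0] + data_rng.gauss(0, 0.5) for row in X]
r2 = nested_cv_r2(X, y)
print(round(r2, 2))
```

The point of the design is that the rows used to score the model in the outer loop play no part in choosing it, so the reported r2 is not inflated by the selection step.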

Model Screening reveals that Neural Boosted works better than Stepwise for these data, with Test R2 = 0.87 vs 0.77 using Log10(KV40) as Y, K = 5, and L = 4. I also tried the new Torch Deep Learning add-in (available by request at JMP Early Adopter) and, with a little tuning, I am getting similar if not better results with its 5-fold cross-validation.

In general, Neural, Torch, XGBoost, and other platforms in JMP can be valuable for QSAR predictive modeling.