tnad
Level II

How good is K-fold cross validation for small datasets?

I have a small data set with 154 molecules (attached). I'm trying to predict KV40 using 6 factors. Many similar studies in the literature use leave-one-out validation to build models for data sets this small, so to replicate this in JMP, I did the following:

Fit Model: Stepwise regression
- used response surface for factors
- used k (k-fold cross-validation) = number of samples = 154
- left everything else as default

Result:
R² = 0.84; R² (k-fold) = 0.72

After a Box-Cox transformation: R² = 0.87 (adjusted R² = 0.86)

I also tried this with a validation column (70:30) and did not get results as good as these. I have JMP Pro. My concern is: am I creating a fake or overfit model when I do this type of cross-validation? Is there a better way to do this? Is there anything I should watch out for or test? Thanks.
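
(For readers without JMP: the sketch below shows the same k = N setup in Python with scikit-learn. The data are synthetic stand-ins for the attached table, so only the shapes, 154 molecules by 6 factors, match the post.)

```python
# Minimal sketch of leave-one-out CV (k = N = 154) outside of JMP, using
# scikit-learn on synthetic stand-in data; the real molecular data set is
# not reproduced here.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(154, 6))                      # 154 molecules, 6 factors
y = X @ rng.normal(size=6) + rng.normal(scale=0.5, size=154)

model = LinearRegression()
# Each fold holds out a single molecule, so R^2 cannot be computed per fold;
# instead, pool all 154 held-out predictions and score them once (LOO "q^2").
y_loo = cross_val_predict(model, X, y, cv=LeaveOneOut())
print(f"LOO (k-fold) R^2: {r2_score(y, y_loo):.2f}")
print(f"Training R^2:     {model.fit(X, y).score(X, y):.2f}")
```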

8 REPLIES
markbailey

Re: How good is K-fold cross validation for small datasets?

Right, honest assessment for model selection with cross-validation is always valid, but using hold-out sets for validation and testing is only practical and rewarding if you have large data sets. (There is also an issue of rare targets even with large data sets.) That is why k-fold cross-validation was invented. Leave-one-out just takes the number of folds to the extreme. I would not expect the 70:30 hold-out approach to work well with this small sample size.

You have used cross-validation for model selection. You will need new data to test the selected model to see if the training generalizes to the larger population.
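
To make the contrast concrete, here is a sketch on synthetic data (not the attached table) of why a single 70:30 hold-out is unreliable at this sample size: the hold-out R² swings with the random split, while k-fold CV validates on every row exactly once.

```python
# Sketch: on n = 154, a single 70:30 hold-out R^2 is much noisier than k-fold CV.
# Synthetic data; all names and numbers here are illustrative, not from the thread.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_predict, KFold
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)
X = rng.normal(size=(154, 6))
y = X @ rng.normal(size=6) + rng.normal(scale=0.5, size=154)

# Repeat the 70:30 split with different random seeds to see the spread.
holdout = []
for seed in range(20):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    holdout.append(LinearRegression().fit(X_tr, y_tr).score(X_te, y_te))
print(f"70:30 hold-out R^2 across 20 splits: {np.mean(holdout):.2f} +/- {np.std(holdout):.2f}")

# 5-fold CV uses every row for validation exactly once.
y_cv = cross_val_predict(LinearRegression(), X, y,
                         cv=KFold(n_splits=5, shuffle=True, random_state=0))
print(f"5-fold CV R^2 (all rows held out once): {r2_score(y, y_cv):.2f}")
```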

Learn it once, use it forever!
Byron_JMP
Staff

Re: How good is K-fold cross validation for small datasets?

@markbailey When you mentioned, "You have used cross-validation for model selection. You will need new data to test the selected model to see if the training generalizes to the larger population."

Is this because cross-validation tends to overfit?

JMP Systems Engineer, Pharm and BioPharm Sciences
markbailey

Re: How good is K-fold cross validation for small datasets?

Over-fitting is the concern, but the reason for my statement is the whole scheme of 'honest assessment.' One uses some data to train the model and separate, entirely new data to validate or select the model (two hold-out sets). This approach is valid but is still limited to the data already seen during training and selection. It cannot speak to generalization to new data. New data are needed to test the chosen model. That data can be a third hold-out set or future data. The risk of waiting for future data depends on the situation, so if a large amount of data is available, a third hold-out set is preferred.
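
As an illustration of the scheme, here is a sketch of a three-way split; the 60:20:20 proportions are my own choice, not a recommendation from this thread. On 154 rows, each partition ends up quite small, which is exactly why this becomes impractical.

```python
# Sketch of the train/validate/test scheme described above, on synthetic data.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(154, 6))
y = rng.normal(size=154)

# Carve off the test set first, then split the rest into train/validate.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25,
                                                  random_state=0)
print(len(X_train), len(X_val), len(X_test))  # roughly 92 / 31 / 31 rows
```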

Learn it once, use it forever!
Nazarkovsky
Level III

Re: How good is K-fold cross validation for small datasets?

Hello! May I ask why you chose k = 154 for your cross-validation?

Normally I use k = 5 (20% for validation), depending on the data set, although the data sets I have worked with were all small enough for normal training-validation-testing.

Thanks!

In the future, it would be desirable for JMP Pro to offer an option to visualize the fold distribution while processing data this way.
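
In the meantime, here is a sketch of how one might inspect the fold assignment outside of JMP Pro (scikit-learn, with only the row count matching this data set):

```python
# Sketch: which rows land in which fold (the "fold distribution" mentioned above),
# assuming scikit-learn rather than JMP Pro.
import numpy as np
from sklearn.model_selection import KFold

n = 154
kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_of = np.empty(n, dtype=int)
for fold, (_, val_idx) in enumerate(kf.split(np.zeros((n, 1)))):
    fold_of[val_idx] = fold                   # which fold validates each row
print(np.bincount(fold_of))                   # -> [31 31 31 31 30], ~20% each
```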

Reaching New Frontiers
markbailey

Re: How good is K-fold cross validation for small datasets?

Honest assessment by cross-validation is commonly implemented in one of three ways: hold-out sets (train, validate, test), k-fold, or leave-one-out. Using k = N (the sample size) is a way of using k-fold CV to achieve the last approach.
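
A quick way to verify the equivalence, sketched in scikit-learn rather than JMP: with k equal to the sample size, k-fold and leave-one-out generate identical splits.

```python
# Sketch: KFold(n_splits=N) reproduces LeaveOneOut() exactly.
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.zeros((154, 1))                        # placeholder; only the row count matters
kfold_splits = list(KFold(n_splits=154).split(X))
loo_splits = list(LeaveOneOut().split(X))
# Every fold's held-out set is the same single row under both schemes.
assert all(np.array_equal(kf_test, loo_test)
           for (_, kf_test), (_, loo_test) in zip(kfold_splits, loo_splits))
print(len(loo_splits), "folds, each holding out exactly one row")
```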

Learn it once, use it forever!
Nazarkovsky
Level III

Re: How good is K-fold cross validation for small datasets?

Wow! That is quite surprising to me, as the idea of k-fold cross-validation rests on dividing a data set into k folds, where k - 1 folds are used for training and 1 for validation. I may be confused, though. From your post it looks like this strategy is more like 'leave-one-out', isn't it?
Reaching New Frontiers
Craige_Hales
Staff (Retired)

Re: How good is K-fold cross validation for small datasets?

I just figured this out...

cooking: fold the egg into the mixture

geology: intense folding in the earth's mantle

business: the club folded after a year

sports: the runner folded after a mile

cards: know when to fold 'em

geography: the town lies in a fold in the hills

web design: above the fold text doesn't require scrolling

change: a 10-fold increase in accidents

paper: origami folds to make a duck

idiom: fold your hands

farming: put the sheep in the fold

It has to be the farming version! It means pen! The data is divided up into K pens/subsets/folds.

 

Craige
markbailey

Re: How good is K-fold cross validation for small datasets?

Yes, k = N produces 'leave-one-out' cross-validation.

Learn it once, use it forever!