tnad
Level III

How good is K-fold cross validation for small datasets?

I have a small data set with 154 molecules (attached). I'm trying to predict KV40 using 6 factors. Similar studies in the literature often use the leave-one-out validation method to build models for data sets of this size, so to replicate this in JMP I did the following:

Fit Model: Stepwise regression
- used response surface for factors
- used k (k-fold cross-validation) = number of samples = 154
- left everything else as default

Result:
R² = 0.84; k-fold R² = 0.72

After Box-Cox transformation: R² = 0.87 (adjusted R² = 0.86)

 

I tried to do this with a validation column (70:30) and did not get a result as good as this. I have JMP Pro. My concern is that I may be creating a fake or overfit model when I do this type of cross-validation. Is there a better way to do this? Is there anything I should watch or test for? Thanks.
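For reference, here is a minimal scikit-learn sketch (outside JMP) of what the k = N, i.e. leave-one-out, R² corresponds to: each row is predicted by a model fit to the other 153 rows, and R² is computed on those held-out predictions. Synthetic data stands in for the attached molecule table, and plain least squares stands in for the stepwise response-surface fit; with the real table, KV40 would be y and the six factors would be X.

```python
# Sketch of leave-one-out R²: every observation is predicted from a model
# trained on the remaining N-1 rows, then R² is computed on those predictions.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Synthetic stand-in for the attached 154-row molecule table (6 factors).
X, y = make_regression(n_samples=154, n_features=6, noise=15.0, random_state=0)

model = LinearRegression()                       # stand-in for the stepwise/RSM fit
loo_pred = cross_val_predict(model, X, y, cv=LeaveOneOut())

print("In-sample R²:     ", round(r2_score(y, model.fit(X, y).predict(X)), 3))
print("Leave-one-out R²: ", round(r2_score(y, loo_pred), 3))
```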


9 REPLIES

Re: How good is K-fold cross validation for small datasets?

Right, honest assessment for model selection with cross-validation is always valid, but using hold-out sets for validation and testing is only practical and rewarding if you have large data sets. (There is also an issue of rare targets even with large data sets.) That is why K-fold cross-validation was invented. Leave-one-out is just taking K-fold to the extreme. I would not expect the 70:30 hold-out approach to work well with this small sample size.

 

You have used cross-validation for model selection. You will need new data to test the selected model to see if the training generalizes to the larger population.
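As a rough illustration of why a single 70:30 split is risky at N = 154, the sketch below (synthetic data, not the molecules in question) repeats the split with many random seeds and compares the spread of the hold-out R² with a 5-fold cross-validated R²: the single hold-out estimate depends heavily on which 30% happened to be held out.

```python
# Illustration: variability of a single 70:30 hold-out estimate vs. 5-fold CV
# on a small synthetic data set of the same size (154 rows, 6 features).
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

X, y = make_regression(n_samples=154, n_features=6, noise=20.0, random_state=0)

# Repeat the 70:30 split with different seeds; the validation R² moves around.
holdout_r2 = []
for seed in range(50):
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.30, random_state=seed)
    holdout_r2.append(LinearRegression().fit(X_tr, y_tr).score(X_va, y_va))

# 5-fold CV uses every row for validation exactly once.
kfold_r2 = cross_val_score(LinearRegression(), X, y, cv=KFold(5, shuffle=True, random_state=0))

print("70:30 hold-out R² across 50 seeds:", round(min(holdout_r2), 3), "to", round(max(holdout_r2), 3))
print("5-fold CV R² (mean):", round(kfold_r2.mean(), 3))
```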

Byron_JMP
Staff

Re: How good is K-fold cross validation for small datasets?

@Mark_Bailey When you mentioned, "You have used cross-validation for model selection. You will need new data to test the selected model to see if the training generalizes to the larger population."

 

Is this because cross-validation tends to overfit?

JMP Systems Engineer, Health and Life Sciences (Pharma)

Re: How good is K-fold cross validation for small datasets?

Over-fitting is the concern, but the reason for my statement is based on the whole scheme of 'honest assessment.' One uses some data to train the model and separate, entirely new data to validate or select the model (two hold-out sets). This approach is valid but is still limited to the data already seen during learning and selection. It cannot speak to generalization to new data. New data is needed to test the chosen model. This data can be a third hold-out set or future data. The risk of waiting for future data depends on the situation, so if a large amount of data is available, the third hold-out set is preferred.
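A minimal sketch of that three-way scheme, using scikit-learn on synthetic data (the sizes, seeds, and split fractions are arbitrary choices, not a recommendation): the test set is carved off first and scored only once, after selection on the validation set.

```python
# Sketch of train / validate / test hold-out sets.
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; a real application would supply its own X and y.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 6))                    # assumes a reasonably large data set
y = X @ rng.normal(size=6) + rng.normal(scale=0.5, size=1000)

# 1) Carve off a test set that is touched only once, after model selection.
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.20, random_state=1)

# 2) Split the remainder into training (fit the candidate models)
#    and validation (compare and select among them).
X_train, X_valid, y_train, y_valid = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=1)

# Candidates are fit on (X_train, y_train), compared on (X_valid, y_valid),
# and only the final chosen model is scored once on (X_test, y_test).
```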

Nazarkovsky
Level IV

Re: How good is K-fold cross validation for small datasets?

Hello! May I ask why you used k = 154 for your cross-validation?

 

Normally I set k = 5 (20% for validation), depending on the dataset, although the data sets I have worked with were all small, yet still enough for normal training-validation-testing.

 

Thanks!

 

In the future, it would be desirable for JMP Pro to implement an option to visualize the fold distribution while processing data this way.

Reaching New Frontiers

Re: How good is K-fold cross validation for small datasets?

Honest assessment by cross-validation is commonly implemented in one of three ways: hold-out sets (train, validate, test), k-fold, or leave-one-out. Using k = N (the sample size) is a way of using k-fold CV to achieve the last approach.
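A quick check of that equivalence on toy data, using scikit-learn's splitters: k-fold with k = N yields exactly the leave-one-out splits, one held-out row per fold.

```python
# Demonstration that KFold with n_splits = N is the same as LeaveOneOut.
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.arange(20).reshape(-1, 1)     # toy data set with N = 20 rows
N = len(X)

kfold_holdouts = [tuple(test) for _, test in KFold(n_splits=N).split(X)]
loo_holdouts = [tuple(test) for _, test in LeaveOneOut().split(X)]

print(kfold_holdouts == loo_holdouts)   # True: each of the N folds holds out one row
```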

Nazarkovsky
Level IV

Re: How good is K-fold cross validation for small datasets?

Wow! That is quite surprising to me, as the idea of K-fold cross-validation rests on dividing a dataset into K folds, where K-1 folds are used for training and 1 for validation. I may be confused, though. From your post it looks like this strategy is attributed more to "leave one out," isn't it?
Reaching New Frontiers
Craige_Hales
Super User

Re: How good is K-fold cross validation for small datasets?

I just figured this out...

 

cooking: fold the egg into the mixture

geology: intense folding in the earth's mantle

business: the club folded after a year

sports: the runner folded after a mile

cards: know when to fold 'em

geography: the town lies in a fold in the hills

web design: above the fold text doesn't require scrolling

change: a 10-fold increase in accidents

paper: origami folds to make a duck

idiom: fold your hands

farming: put the sheep in the fold

 

It has to be the farming version! It means pen! The data is divided up into K pens/subsets/folds.

 

Craige

Re: How good is K-fold cross validation for small datasets?

Yes, K = N produces 'leave one out' cross-validation.

Re: How good is K-fold cross validation for small datasets?

Hello @tnad, I just came across this post and hope you don't mind an additional reply 3.5 years later!

I think setting K = 154 for Stepwise in this case is not recommended, since the algorithm searches for the best fold, which in this case would consist of only a single observation. In general, it is important to perform stepwise selection nested within each fold to avoid overfitting, and this is available in the Nested K-Fold option in Model Screening. Also, I think 154 observations is plenty for k-fold, and repeated nested k-fold is a great way to compare models.
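For readers without JMP Pro's Model Screening, a rough scikit-learn analogue of the nested idea (not JMP's implementation) is sketched below: wrapping a forward-selection step in a Pipeline re-runs the variable search inside every training fold, so the outer cross-validation score is not flattered by selection that has already seen the held-out rows. The synthetic data and forward selection here merely stand in for the stepwise response-surface search.

```python
# Nested cross-validation sketch: the feature-selection step is refit on each
# training fold because it lives inside the Pipeline that the outer CV clones.
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in: 154 rows, 20 candidate predictors, 6 of them informative.
X, y = make_regression(n_samples=154, n_features=20, n_informative=6,
                       noise=10.0, random_state=0)

# Forward selection (a stand-in for stepwise) nested inside the modeling pipeline.
selector = SequentialFeatureSelector(LinearRegression(), n_features_to_select=6,
                                     direction="forward", cv=5)
pipe = make_pipeline(selector, LinearRegression())

# Outer 5-fold CV: selection + fit happen independently within each training fold.
scores = cross_val_score(pipe, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0),
                         scoring="r2")
print("Nested 5-fold R² per fold:", scores.round(3))
```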

Model Screening reveals that Neural Boosted works better than Stepwise for these data, with Test R² = 0.87 vs. 0.77, using Log10(KV40) as Y, K = 5, and L = 4. I also tried the new Torch Deep Learning add-in (available by request at JMP Early Adopter) and, with a little tuning, am getting similar if not better results with its 5-fold cross-validation.

In general, Neural, Torch, XGBoost, and other platforms in JMP can be valuable for QSAR predictive modeling.