I have a small data set of 154 molecules (attached). I'm trying to predict KV40 using six factors. Many similar studies in the literature use leave-one-out validation to build models for small data sets like this, so to replicate that in JMP I did the following:
Fit Model: stepwise regression
- used a response surface for the factors
- set k (k-fold cross-validation) = number of samples = 154
- left everything else at the default
R² = 0.84; R² (k-fold) = 0.72
After a Box-Cox transformation: R² = 0.87 (adjusted R² = 0.86)
I tried this with a validation column (70:30) and did not get results as good as these. I have JMP Pro. My concern is: am I creating a fake or overfit model when I do this type of cross-validation? Is there a better way to do this? Is there anything I should watch for or test? Thanks.
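(For reference, here is a minimal sketch, in Python with scikit-learn rather than JMP, of what the k = N setting computes. The data are synthetic stand-ins for the attached file, and the response-surface expansion is an assumption about the model form; the "KV40" array here is just a placeholder.)

```python
# Minimal sketch of leave-one-out (k = N) cross-validation for a
# response-surface regression. Synthetic data stand in for the attachment.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
n, p = 154, 6
X = rng.normal(size=(n, p))                                  # six factors
y = X @ rng.normal(size=p) + rng.normal(scale=0.5, size=n)   # KV40 stand-in

# Response-surface terms: main effects, two-way interactions, squares
X_rs = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)

model = LinearRegression().fit(X_rs, y)
r2_train = model.score(X_rs, y)              # analogous to the 0.84 above

# Leave-one-out: each molecule is predicted by a model fit to the other 153
y_loo = cross_val_predict(LinearRegression(), X_rs, y, cv=LeaveOneOut())
ss_res = np.sum((y - y_loo) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_loo = 1 - ss_res / ss_tot                 # analogous to the 0.72 above

print(f"training R^2 = {r2_train:.2f}, LOO R^2 = {r2_loo:.2f}")
```

The gap between the training R² and the out-of-fold R² is the same kind of gap as 0.84 vs. 0.72 in the question.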
Right. Honest assessment for model selection with cross-validation is always valid, but hold-out sets for validation and testing are only practical and rewarding with large data sets. (There is also an issue of rare targets even with large data sets.) That is why k-fold cross-validation was invented; leave-one-out is just k-fold taken to the extreme (k = N). I would not expect the 70:30 hold-out approach to work well with this small sample size.
You have used cross-validation for model selection. You will need new data to test the selected model to see if the training generalizes to the larger population.
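To see how noisy a single 70:30 split can be at this sample size, here is a rough sketch (Python/scikit-learn again, with synthetic data standing in for the molecules) that repeats the split under different seeds:

```python
# Repeating a 70:30 hold-out with different random splits at n = 154
# shows how much the validation R^2 moves from split to split.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n, p = 154, 6
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(scale=0.5, size=n)

scores = []
for seed in range(20):
    X_tr, X_va, y_tr, y_va = train_test_split(
        X, y, test_size=0.3, random_state=seed)
    scores.append(LinearRegression().fit(X_tr, y_tr).score(X_va, y_va))

print(f"70:30 validation R^2 ranges from {min(scores):.2f} to {max(scores):.2f}")
```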
@markbailey When you mentioned, "You have used cross-validation for model selection. You will need new data to test the selected model to see if the training generalizes to the larger population."
Is this because cross-validation tends to overfit?
Over-fitting is the concern, but the reason for my statement is the whole scheme of 'honest assessment.' One uses some data to train the model and separate, entirely new data to validate or select the model (two hold-out sets). This approach is valid but is still limited to the data already seen by learning and selection; it cannot speak to generalization to new data. New data is needed to test the chosen model. This can be a third hold-out set or future data. The risk of waiting for future data depends on the situation, so if a large amount of data is available, the third hold-out set is preferred.
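A minimal sketch of that scheme, assuming a large data set and using scikit-learn (the candidate models here are illustrative placeholders, not a recommendation):

```python
# Three hold-out sets: train to fit, validate to select among candidates,
# test (touched once) to report how the selected model generalizes.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(600, 6))               # pretend we had a large data set
y = X @ rng.normal(size=6) + rng.normal(scale=0.5, size=600)

# 60% train, 20% validate, 20% test
X_tr, X_tmp, y_tr, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_va, X_te, y_va, y_te = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

candidates = {"ols": LinearRegression(), "ridge": Ridge(alpha=1.0)}
fitted = {name: m.fit(X_tr, y_tr) for name, m in candidates.items()}

# Select on the validation set...
best = max(fitted, key=lambda name: fitted[name].score(X_va, y_va))
# ...and report the test-set score for the selected model only.
print(best, fitted[best].score(X_te, y_te))
```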
Hello! May I ask why you used k = 154 for your cross-validation?
Normally I set k = 5 (20% held out for validation in each fold), depending on the data set, although those data sets were all small enough for a normal training-validation-testing split.
In the future, it would be desirable for JMP Pro to offer an option to visualize the fold distribution while processing data this way.
Honest assessment by cross-validation is commonly implemented in one of three ways: hold-out sets (train, validate, test), k-fold, or leave-one-out. Using k = N (the sample size) is a way of using k-fold CV to achieve the last approach.
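As a quick sanity check that k = N and leave-one-out coincide, here is a sketch using scikit-learn's splitters (an assumption standing in for JMP Pro's internal fold assignment, which I have not inspected):

```python
# Verify that k = N k-fold and leave-one-out produce identical splits.
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.zeros((154, 6))  # shapes only; values do not matter to the splitters

kfold_tests = [tuple(test) for _, test in KFold(n_splits=154).split(X)]
loo_tests = [tuple(test) for _, test in LeaveOneOut().split(X)]

assert kfold_tests == loo_tests   # each fold holds out exactly one row
print(f"{len(loo_tests)} folds, each with one held-out observation")
```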
I just figured this out...
cooking: fold the egg into the mixture
geology: intense folding in the earth's mantle
business: the club folded after a year
sports: the runner folded after a mile
cards: know when to fold 'em
geography: the town lies in a fold in the hills
web design: above the fold text doesn't require scrolling
change: a 10-fold increase in accidents
paper: origami folds to make a duck
idiom: fold your hands
farming: put the sheep in the fold
It has to be the farming version! It means pen! The data is divided up into K pens/subsets/folds.
Yes, k = N produces 'leave-one-out' cross-validation.