Solved: Re: Proportion between size of training and test data set

StefanC · Jun 8, 2023 5:18 PM

Hi, I have a relatively small data set with 95 observations (records) and about 70 variables per record. The goal is to detect which variables influence the response and to find the optimal settings of those variables. Because of the small number of observations, I decided to divide the data into only two sets, training and test, omitting the validation set. My strategy is to fit a decision tree with the Partition platform, followed by a least squares regression on the variables that seem most important. Is there any general advice on the relative sizes of the training and test sets? I have tried a split with 75% of the observations in the training set and 25% in the test set, but I am not sure whether this is the optimal proportion.

I am also wondering whether some bootstrap procedure could be applied to the division into training and test sets. The idea is to make a large number of random divisions and somehow average the results. I am not sure whether this makes sense. I am running JMP 15.
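
The repeated-split idea above can be sketched in a few lines. Strictly speaking this is repeated random hold-out (often called Monte Carlo cross-validation) rather than a bootstrap, since the rows are split without replacement. The sketch is Python, not JSL, and everything in it is an assumption for illustration: the synthetic one-predictor data, the `repeated_holdout` helper, the 200 repetitions, and the choice of test RMSE as the quantity to average.

```python
import random
import statistics

def repeated_holdout(xs, ys, train_frac=0.75, n_splits=200, seed=0):
    """Average the test RMSE of a one-variable least squares fit over many random splits."""
    rng = random.Random(seed)
    n = len(xs)
    n_train = round(train_frac * n)
    rmses = []
    for _ in range(n_splits):
        idx = list(range(n))
        rng.shuffle(idx)
        train, test = idx[:n_train], idx[n_train:]
        # Fit y = a + b * x using the training rows only.
        mx = statistics.fmean(xs[i] for i in train)
        my = statistics.fmean(ys[i] for i in train)
        sxx = sum((xs[i] - mx) ** 2 for i in train)
        sxy = sum((xs[i] - mx) * (ys[i] - my) for i in train)
        b = sxy / sxx
        a = my - b * mx
        # Score the fit on the rows that were held out.
        mse = statistics.fmean((ys[i] - (a + b * xs[i])) ** 2 for i in test)
        rmses.append(mse ** 0.5)
    return statistics.fmean(rmses), statistics.stdev(rmses)

# Synthetic stand-in for the real table: 95 rows, one predictor, noise SD of 1.
data_rng = random.Random(1)
xs = [data_rng.uniform(0, 10) for _ in range(95)]
ys = [2.0 + 0.5 * x + data_rng.gauss(0, 1.0) for x in xs]

mean_rmse, sd_rmse = repeated_holdout(xs, ys)
print(f"mean test RMSE over splits: {mean_rmse:.3f}")
print(f"SD of test RMSE across splits: {sd_rmse:.3f}")
```

The spread across splits (`sd_rmse`) gives a feel for how much the result of a single 75/25 split can vary when the test set has only 24 rows.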

Mark_Bailey · Jul 17, 2020 01:22 PM

You might use K-fold cross-validation, an adaptation of CV for small data sets.
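
As a rough illustration of what K-fold CV does with 95 rows, here is a minimal Python sketch (not JSL, and not how JMP implements it). The helper name `k_fold_splits`, the choice of K = 5, and the seed are assumptions made for the example.

```python
import random

def k_fold_splits(n_rows, k, seed=0):
    """Return a list of (train_rows, test_rows) pairs, one pair per fold."""
    rows = list(range(n_rows))
    random.Random(seed).shuffle(rows)        # randomize fold membership
    folds = [rows[i::k] for i in range(k)]   # k nearly equal-sized folds
    splits = []
    for i, test_rows in enumerate(folds):
        # Training rows are all rows that are not in the current fold.
        train_rows = [r for j, fold in enumerate(folds) if j != i for r in fold]
        splits.append((train_rows, test_rows))
    return splits

splits = k_fold_splits(n_rows=95, k=5)
for train_rows, test_rows in splits:
    print(len(train_rows), "training rows,", len(test_rows), "held-out rows")

# Every row is held out exactly once across the folds.
held_out = sorted(r for _, test_rows in splits for r in test_rows)
print(held_out == list(range(95)))
```

Each model is fit on the training rows of one split and scored on that split's held-out rows, and the scores are then averaged over the folds.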

Also, with JMP Pro, you could use Generalized Regression with LASSO for variable selection.
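
For readers without JMP Pro, the idea behind LASSO variable selection can be shown with a small coordinate-descent sketch in Python. This is not the Generalized Regression implementation. The function names, the synthetic data (20 candidate variables, of which only the first two affect the response), and the penalty `lam = 0.3` are all assumptions chosen for illustration.

```python
import random

def standardize(X):
    """Center each column and scale it to unit (population) variance."""
    n, p = len(X), len(X[0])
    out = [row[:] for row in X]
    for j in range(p):
        mean = sum(row[j] for row in X) / n
        sd = (sum((row[j] - mean) ** 2 for row in X) / n) ** 0.5
        for i in range(n):
            out[i][j] = (X[i][j] - mean) / sd
    return out

def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def lasso_cd(X, y, lam, n_sweeps=100):
    """Minimize (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 by coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X b, which equals y while b is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            # Covariance of column j with the residual, adding back j's own contribution.
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new_beta = soft_threshold(rho, lam) / col_sq[j]
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new_beta
    return beta

# Synthetic data: 95 rows, 20 candidate variables, only columns 0 and 1 matter.
rng = random.Random(0)
n, p = 95, 20
X = standardize([[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)])
raw_y = [3.0 * row[0] - 2.0 * row[1] + rng.gauss(0, 0.5) for row in X]
y_mean = sum(raw_y) / n
y = [v - y_mean for v in raw_y]  # center the response so no intercept is needed

beta = lasso_cd(X, y, lam=0.3)
selected = [j for j, b in enumerate(beta) if b != 0.0]
print("selected variables:", selected)
print("their coefficients:", [round(beta[j], 3) for j in selected])
```

The L1 penalty sets the coefficients of weak variables to exactly zero, which is what makes LASSO usable for variable selection. In practice the penalty would be chosen by cross-validation rather than fixed by hand as it is here.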

StefanC · Jul 20, 2020 05:48 AM

Thanks, it makes sense to apply K-fold cross-validation. I suppose I should use stepwise regression then, since K-fold cross-validation is not an option with standard least squares. Unfortunately, I am not a Pro user (yet), so I do not have access to Generalized Regression.

I suppose you would still reserve a portion of the data to test the predictive ability of the final model?

Mark_Bailey · Jul 20, 2020 9:14 AM

A hold-out test set is useful for honest assessment of a model, but there are trade-offs. You are using K-fold CV because you have a small data set. Holding out a portion for testing might compromise both training and selection of the best model. Cross-validation is based on the idea of confirming the model with future data, but it uses some of the existing data instead. Do you intend to evaluate the model with new observations? If so, the test set might not provide a unique benefit.
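
The trade-off can be made concrete by counting rows. The sketch below (Python; the helper name `fold_training_sizes` and the choice of 5 folds are assumptions for illustration) compares the number of training rows per fold when all 95 observations go into K-fold CV against the case where 25% is first set aside as a test set.

```python
def fold_training_sizes(n_rows, k):
    """Number of training rows in each of k folds when n_rows are split as evenly as possible."""
    base, extra = divmod(n_rows, k)
    fold_sizes = [base + 1 if i < extra else base for i in range(k)]
    return [n_rows - size for size in fold_sizes]

n_total, n_vars, k = 95, 70, 5      # 95 rows and about 70 variables, as in the question
n_test = round(0.25 * n_total)      # rows set aside if a 25% test set is held out
n_dev = n_total - n_test            # rows left for K-fold CV

all_rows = fold_training_sizes(n_total, k)
after_holdout = fold_training_sizes(n_dev, k)

print("training rows per fold, no test set:  ", all_rows)
print("training rows per fold, 25% held out: ", after_holdout)
print("candidate variables:", n_vars)
```

With all 95 rows, each fold trains on 76 observations. After a 25% hold-out, each fold trains on only 56 or 57, which is fewer than the roughly 70 candidate variables.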