
Proportion between size of training and test data set


Jul 17, 2020 4:31 AM

Hi, I have a relatively small data set with 95 observations/records and about 70 variables per record. The goal is to detect which variables influence the response and to find the optimal settings of those variables. Because of the small number of observations, I decided to divide the data into only two sets, training and test (omitting a validation set). My strategy is to fit a decision tree with the Partition platform, followed by a least squares regression on the variables that seem most important. My question is whether there is general advice on the relative sizes of the training and test sets. I have tried a division with 75% of observations in the training set and 25% in the test set, but I am not sure this is the optimal proportion.

I am also wondering whether some bootstrap procedure could be applied to the division into training and test sets. The idea is to make a large number of divisions and somehow average the results, though I am not sure whether this makes sense. I am running JMP 15.
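[Editor's illustration] JMP 15 does not automate this directly, but the idea of averaging results over many random train/test divisions is sometimes called Monte Carlo cross-validation, and the mechanics can be sketched in plain Python. Everything here is a hypothetical stand-in: the 75/25 split, the number of repeats, and the trivial "model" that just predicts the training mean.

```python
import random

def repeated_split_error(ys, n_splits=200, train_frac=0.75, seed=0):
    """Average test MSE over many random train/test divisions.

    For each repeat: shuffle the indices, take the first 75% as
    training data, fit a trivial model (predict the training mean),
    and score it on the held-out 25%. The per-split test errors are
    then averaged, which stabilizes the estimate compared with a
    single arbitrary split.
    """
    rng = random.Random(seed)
    n = len(ys)
    n_train = int(n * train_frac)
    errors = []
    for _ in range(n_splits):
        idx = list(range(n))
        rng.shuffle(idx)
        train, test = idx[:n_train], idx[n_train:]
        y_hat = sum(ys[i] for i in train) / len(train)  # stand-in model
        mse = sum((ys[i] - y_hat) ** 2 for i in test) / len(test)
        errors.append(mse)
    return sum(errors) / n_splits

# Toy response values standing in for the 95-observation data set.
ys = [float(i) for i in range(20)]
print(f"Averaged test MSE over 200 splits: {repeated_split_error(ys):.3f}")
```

With a fixed seed the procedure is reproducible, and increasing `n_splits` reduces the variance of the averaged error estimate.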

1 ACCEPTED SOLUTION


Created: Jul 20, 2020 9:12 AM | Last Modified: Jul 20, 2020 9:14 AM
Posted in reply to message from StefanC 07-20-2020

A test hold-out set is useful for honest assessment alongside cross-validation, but there are trade-offs. You would use K-fold CV because you have a small data set; holding out a further portion for testing might compromise both training the model and selecting the best one. Cross-validation mimics confirming the model with future data by reusing some of the existing data instead. Do you intend to evaluate the model with new observations later? If so, a test set might not provide a unique benefit.

Learn it once, use it forever!

3 REPLIES


Re: Proportion between size of training and test data set

You might use k-fold cross-validation, the adaptation of cross-validation for small data sets.
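[Editor's illustration] Inside JMP the Partition platform handles the folds, but the mechanics of k-fold cross-validation can be sketched in plain Python. All names here are hypothetical, and the "model" is deliberately trivial (predict the training mean) to keep the focus on the fold logic: each fold in turn is held out while the other k-1 folds train the model.

```python
import random

def k_fold_cv(ys, k=5, seed=0):
    """Estimate test error by k-fold cross-validation.

    Shuffle the row indices, partition them into k roughly equal
    folds, then hold out each fold in turn: train a trivial model
    (predict the training mean) on the remaining folds and compute
    the mean squared error on the held-out fold. Returns the MSE
    averaged over the k folds.
    """
    idx = list(range(len(ys)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal folds
    errors = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        y_hat = sum(ys[i] for i in train) / len(train)  # stand-in model
        mse = sum((ys[i] - y_hat) ** 2 for i in fold) / len(fold)
        errors.append(mse)
    return sum(errors) / k

# Toy response values; with 95 observations, k = 5 would give
# folds of about 19 records each.
ys = [float(i) for i in range(10)]
print(f"5-fold CV estimate of MSE: {k_fold_cv(ys, k=5):.3f}")
```

Because every observation is used for both training and testing (in different folds), no data is permanently sacrificed to a hold-out set, which is the appeal for small data sets like this one.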

Also, with JMP Pro, you could use Generalized Regression with LASSO for variable selection.

Learn it once, use it forever!


Re: Proportion between size of training and test data set

Thanks, it makes sense to apply k-fold cross-validation. I suppose I should use stepwise regression then, since k-fold cross-validation is not an option with standard least squares. Unfortunately, I am not a Pro user (yet), so I do not have access to Generalized Regression.

I suppose you would still reserve a portion of the data to test the predictive ability of the final model?
