tuo88138
Level II

One hold out cross validation

Hi, I want to do one hold-out cross-validation for a random forest, but I don't know whether JMP Pro has this option. If it does, where can I find it?

10 REPLIES
P_Bartell
Level VIII

Re: One hold out cross validation

When you say '...hold-out cross-validation...', do you mean what is more commonly known as 'leave one out'?

tuo88138
Level II

Re: One hold out cross validation

Yes, exactly: leave-one-out.

Mark_Bailey

Re: One hold out cross validation

Have you read the documentation for Bootstrap Forest models?

tuo88138
Level II

Re: One hold out cross validation

Yes, I did. The problem is this: I have a small dataset, so I need to use leave-one-out.

Victor_G
Super User

Re: One hold out cross validation

Hi @tuo88138,

 

As @Mark_Bailey suggests, you can have a look at the documentation and info behind Random Forest models, as this type of model is already robust and doesn't specifically need cross-validation, thanks to the bootstrapping process it uses:

  • A Random Forest fits an ensemble model by averaging many decision trees, each of which is fit to a bootstrap sample of the training data. Because bootstrapping is done on the training data, not all rows are used to build each individual tree; the samples left out of a given tree (called "Out Of Bag" (OOB) samples) are used to estimate the model's error, and can also be used to evaluate the performance of each tree in the Random Forest (a code sketch of this idea follows below): Per Tree Summaries (jmp.com)
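
JMP's Bootstrap Forest computes the OOB error for you, but if it helps to see the mechanism outside JMP, here is a minimal sketch in Python with scikit-learn. The synthetic dataset and all parameter choices are illustrative assumptions, not JMP's implementation:

```python
# Minimal sketch of the out-of-bag (OOB) idea behind a random forest,
# using scikit-learn on a synthetic dataset (illustration only; this is
# not JMP's implementation).
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Small synthetic table standing in for real data.
X, y = make_regression(n_samples=40, n_features=5, noise=5.0, random_state=1)

# oob_score=True scores each row using only the trees that did NOT see
# that row in their bootstrap sample, giving a "free" validation estimate.
rf = RandomForestRegressor(n_estimators=500, oob_score=True, random_state=1)
rf.fit(X, y)

print(f"OOB R^2 estimate: {rf.oob_score_:.3f}")
```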

 

However, if you are interested in relaunching the Random Forest platform several times with various training and validation samples, it is possible to do so as follows (a conceptual code sketch follows the list):

  1. Create a Validation formula column in your dataset (you can specify the size of each set, so you can do something similar to K-fold cross-validation or leave-one-out validation).
  2. Create a Random Forest model and specify your validation formula column in the "Validation" part of the launch dialog.
  3. Once the model is fit, right-click on any metric you would like to evaluate through cross-validation (for example, Rsquare or RASE for the Training and Validation sets, or Sum of Squares in the column contributions, to see whether important factors are ranked the same way regardless of the training/validation split) and click "Simulate". The "column to switch out" will be your validation formula column, and so will the "column to switch in" (in order to generate and exchange various training and validation sets). You can also specify the number of simulations to run, and set a random seed if you want a reproducible simulation.
  4. A new data table with the individual simulations and their results will be created.
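
For readers who want to see the same loop in code form, here is a rough outside-JMP analogue of this resampling idea in Python/scikit-learn. The dataset is synthetic and the metric choices are my own assumptions; in JMP, the Simulate feature does all of this through the GUI:

```python
# Rough analogue of the "validation column + Simulate" workflow:
# repeatedly hold out one row, refit the forest, and collect metrics.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import LeaveOneOut

X, y = make_regression(n_samples=30, n_features=5, noise=5.0, random_state=2)

squared_errors = []  # one held-out squared error per refit (RASE ingredient)
importances = []     # column contributions per refit

for train_idx, test_idx in LeaveOneOut().split(X):
    rf = RandomForestRegressor(n_estimators=200, random_state=2)
    rf.fit(X[train_idx], y[train_idx])
    pred = rf.predict(X[test_idx])[0]
    squared_errors.append((pred - y[test_idx][0]) ** 2)
    importances.append(rf.feature_importances_)

# Validation RASE over the held-out rows, and the average factor ranking.
print("Leave-one-out RASE:", np.sqrt(np.mean(squared_errors)))
print("Mean column contributions:", np.mean(importances, axis=0).round(3))
```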

The whole process can be seen in this presentation from Chris Gotwalt: Different goals, different models: How to use models to sharpen up your questio... - JMP User Commun... (around 17 minutes in)

 

Cross-validation of Random Forests is indeed possible in JMP, but Random Forests are robust to messy/noisy data and accurate on small datasets thanks to bootstrapping, so unless you have a specific use in mind for this cross-validation, you can get a good estimate of accuracy (with OOB samples) and reliable results from the model "as it is".

I hope this answer will help you,  

Victor GUILLER
Scientific Expertise Engineer
L'Oréal - Data & Analytics
ih
Super User (Alumni)

Re: One hold out cross validation

The Model Screening platform actually makes this pretty simple: just set K in the Folded Crossvalidation section equal to the number of active rows (a short sketch of this equivalence appears after the screenshot).

 

[Screenshot: Model Screening launch settings (ih_0-1671657073977.png)]

Find details for each fold in the Details section of the report. If you want to know which row goes with which fold, open the Training section, select a row corresponding to that fold, and press 'Save Script Selected' under the table; a new column indicating the validation rows will be added to the table.
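
To see why setting K equal to the number of rows reproduces leave-one-out, here is a small Python/scikit-learn sketch (illustration only; Model Screening handles the equivalent splitting internally):

```python
# Why K = number of rows is the same as leave-one-out: each fold's
# held-out set contains exactly one row.
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.arange(12).reshape(6, 2)  # tiny table with 6 active rows

kfold_test_sets = [test for _, test in KFold(n_splits=len(X)).split(X)]
loo_test_sets = [test for _, test in LeaveOneOut().split(X)]

assert all((a == b).all() for a, b in zip(kfold_test_sets, loo_test_sets))
print(f"K = {len(X)} folds, each holding out one row: same as leave-one-out")
```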

Victor_G
Super User

Re: One hold out cross validation

Hi @ih,

From my side, using "Model Screening" with the setup you proposed does work for K-fold cross-validation, but not always for the leave-one-out method (it depends on the dataset).

If I specify K = number of observations, an error message appears: "The validation sets inside each of the folds are too small to support some methods", even if only Bootstrap Forest is checked in the "Method" panel. So I also thought about using the Model Screening platform, but it may not be possible depending on the dataset: on the "Boston housing prices" dataset with this method, I get no summary of the folds and missing values in each fold's details. It does work for the "Big Class" dataset.

Hi @tuo88138 

You can follow the method described by Chris Gotwalt; I just tested it and it worked perfectly. You might have to uncheck "Early Stopping" in the Bootstrap Forest analysis panel, in order to avoid blank values for the different metrics in the output (Rsquare, RASE, etc.).

This technique can be interesting for comparing the contribution importance of variables across several simulations (see the attached capture "Contribution-importance_simulations"), or for providing confidence intervals on metrics such as Rsquare (see the capture "Confidence_Intervals_Rsquare" for Training Rsquare on the same dataset with the same number of simulations); a rough sketch of the interval calculation follows below.
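
As a sketch of the confidence-interval idea: once Simulate has produced a table of per-simulation Rsquare values, a percentile interval can be read directly off them. The Python below uses made-up stand-in values purely for illustration, not results from any real simulation:

```python
# Sketch: percentile confidence interval from per-simulation Rsquare values.
# The values below are made-up stand-ins for the column Simulate produces.
import numpy as np

rng = np.random.default_rng(3)
rsquares = rng.normal(loc=0.85, scale=0.04, size=200)  # stand-in data

low, high = np.percentile(rsquares, [2.5, 97.5])
print(f"95% percentile interval for Rsquare: [{low:.3f}, {high:.3f}]")
```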

Hope this answer will help you,

Victor GUILLER
Scientific Expertise Engineer
L'Oréal - Data & Analytics
ih
Super User (Alumni)

Re: One hold out cross validation

Hi @Victor_G, I don't believe that is an error, just a warning message. I do not think the unavailable methods will affect this analysis.

Victor_G
Super User

Re: One hold out cross validation

Hi @ih,

 

Yes, the message itself is no problem (it doesn't stop the analysis), but depending on which data table you use, you might get blank spaces everywhere instead of the expected results (see the screenshot "Error_RandomForest_Model-screening_LOOcrossval", done on the housing prices dataset with leave-one-out cross-validation).

As I mentioned before, I have the same problem when using the "regular" Bootstrap Forest platform with a validation formula column (with a single observation in the validation set, to do leave-one-out cross-validation) and the "Early Stopping" option checked, so this might be the same issue in the Model Screening platform, where "Early Stopping" is probably checked by default.

 

You can have a look at the attached housing prices table, which includes several scripts to illustrate the problems mentioned:

  • "Model Screening Random Forest of Price" : To illustrate the problem of the "Model Screening" platform with no results displayed,
  • "Bootstrap Forest of Price (with Early Stopping)" : To illustrate the similar error problem with the platform "Bootstrap Forest" when "Early Stopping" is checked,
  • "Bootstrap Forest of Price (without Early Stopping)" : To illustrate the solution shown in my response and done by Chris Gotwalt in his example,
  • "Rsquare training Leave-One-Out" and "Columns contribution (Portion) Leave-One-Out" : To show datatables created with this method and leave-one-out crossvalidation.

 

Hope the problem I mentioned is now clearer.

Depending on the dataset of @tuo88138, there are therefore two methods available to perform leave-one-out cross-validation.

 

Victor GUILLER
Scientific Expertise Engineer
L'Oréal - Data & Analytics