dlehman1
Level V

Has the bootstrap forest changed in JMP Pro18?

I've used the bootstrap forest on many different data sets (both continuous and nominal response variables) for a number of years, and I generally understand the settings for that algorithm - I usually use the defaults.  My understanding of random (bootstrap) forests is that each tree is run using a random selection of factors and a random selection of observations (with replacement).  In the past, when I looked at the individual trees, there were always some that used only a couple of variables and were very sparse - others were far more complex.  When teaching, I've used this to show how the individual trees are often nonsensical - it is the averaging that produces good predictions. 

 

However, when I run bootstrap forests in JMP Pro 18, every data set I use shows every one of the trees with hundreds of splits.  There always used to be a few trees with <10 splits, but now there are no simple trees at all.  Has the method of implementing the model changed? 

Victor_G
Super User

Re: Has the bootstrap forest changed in JMP Pro18?

Hi @dlehman1 

 

The Bootstrap Forest platform fits an ensemble model by averaging many decision trees, each of which is fit to a bootstrap sample (with replacement) of the training data. Each split in each tree considers a random subset of the predictors (from Bootstrap Forest).

Randomising both the training samples and the subset of features reduces the risk of overfitting (by creating multiple, largely independent trees from slightly different training data), improves accuracy and robustness to noise, and helps handle missing values, outliers, and correlated features/collinearity (thanks to the feature subset drawn at each split of each tree).
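To make the two sources of randomness concrete, here is a minimal Python sketch (not JMP's implementation). The function names and the predictor names are made up for illustration. It draws one bootstrap sample of row indices and one random subset of predictors for a single split.

```python
import random

def bootstrap_sample(n_rows, rng):
    """Draw n_rows row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def candidate_predictors(predictors, n_per_split, rng):
    """Pick the random subset of predictors considered at a single split."""
    return rng.sample(predictors, n_per_split)

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# One bootstrap sample: some rows appear several times, others not at all.
rows = bootstrap_sample(1000, rng)
unique_fraction = len(set(rows)) / 1000  # about 0.63 in expectation (1 - 1/e)

# Hypothetical predictor names, for illustration only.
predictors = ["x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"]
candidates = candidate_predictors(predictors, 3, rng)

print(f"fraction of distinct rows in the sample: {unique_fraction:.2f}")
print(f"predictors considered at this split: {candidates}")
```

Note that the predictor subset is redrawn at every split, not once per tree, so a tree with many splits gets many chances to see each predictor.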

 

Concerning your original question: since I no longer have JMP 17 on my computer, I can't compare the outcomes. Judging from the JMP 17 and JMP 18 documentation on the default hyperparameters used when launching Bootstrap Forest, there doesn't appear to be any difference in the default settings.

If the individual trees seem too complex, you can modify some of the default hyperparameters:

  • Minimum splits per tree: The default is 10, but on small datasets I tend to reduce this value to 2. A large minimum number of splits per tree tends to create complex individual trees, which may not be beneficial and may be prone to overfitting.
  • Maximum splits per tree: The default is 2000, but 2000 splits on any individual tree is a lot (and may lead to the same problems mentioned above)! I tend to reduce this value to around 100-1000, depending on the size of the dataset and the complexity of the task.
  • Minimum size split: The default is 5 (the minimum number of samples needed for a candidate split). On small datasets this may not be appropriate, and you might reduce it, even though that may raise the risk of overfitting. On bigger datasets you might want to increase this value to make each individual tree more robust.
  • Early Stopping: Make sure this option is checked if you have used a validation set, so that tree creation stops when adding new trees no longer improves the validation metrics.
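As a rough back-of-the-envelope check on how these limits interact, here is a small Python sketch. It assumes that "Minimum size split" means every node produced by a split must hold at least that many rows. That is my reading, not something confirmed in this thread. The function name is hypothetical, and the result is only an upper bound, not what JMP will actually grow.

```python
def max_possible_splits(n_rows, min_size_split, max_splits_per_tree):
    """Upper bound on splits in one binary tree.

    Assumption: every leaf must contain at least min_size_split rows,
    so there can be at most n_rows // min_size_split leaves, and a
    binary tree with k leaves has k - 1 splits.
    """
    max_leaves = n_rows // min_size_split
    return max(0, min(max_leaves - 1, max_splits_per_tree))

# Defaults quoted above: minimum size split 5, maximum splits per tree 2000.
for n_rows in (100, 10_000, 100_000):
    bound = max_possible_splits(n_rows, 5, 2000)
    print(f"{n_rows} rows -> at most {bound} splits per tree")
```

Under this assumption, a 10,000-row table with the defaults allows up to 1,999 splits per tree, so trees with hundreds of splits are well within the limits. Lowering "Maximum splits per tree" or raising "Minimum size split" tightens the bound.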

 

Sorry for not having a more precise or definitive answer to your question. Do you have any screenshots, datasets, or anything else that shows the differences? That might help with debugging the situation.

Best,

Victor GUILLER

"It is not unusual for a well-designed experiment to analyze itself" (Box, Hunter and Hunter)
dlehman1
Level V

Re: Has the bootstrap forest changed in JMP Pro18?

Your description is exactly my understanding and I have a good feel for how those tuning parameters will affect the trees that are built.  I can't find any of my old examples, but I know that I've demonstrated to classes how the individual trees don't make sense, but the averaging of independent trees is what makes the bootstrap forest a good predictor.  However, the part of the description that seems at odds with the resulting trees is "Each split in each tree considers a random subset of the predictors."  I would expect the predictors in each tree to vary, not only in the random selection but in how many are selected.  That seems to match what I used to get from the bootstrap forest - some trees used 2 predictors, others use 10, etc.  Now all the trees seem to have the same number of splits (and in so doing, use a similar number of predictors in each tree).  And the number of splits is on the order of hundreds in all of the data sets I have looked at (modest sizes, around 10K - 100K rows).  I know I used to get some trees that only had a handful of splits - and I don't get that any more.  Hence my question.
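One way to see why trees with many splits end up using a similar number of predictors is a toy simulation in Python. This is only an illustration: it assumes the predictor chosen at each split is picked uniformly at random from the candidate subset, which real trees do not do (they pick the best candidate), and the function name and the numbers are made up.

```python
import random

def predictors_used(n_splits, n_predictors, subset_size, rng):
    """Count distinct predictors used by one toy tree.

    At each split a fresh random subset of predictors is considered and,
    as a simplifying assumption, one of them is chosen at random.
    """
    used = set()
    for _ in range(n_splits):
        candidates = rng.sample(range(n_predictors), subset_size)
        used.add(rng.choice(candidates))
    return len(used)

rng = random.Random(1)  # fixed seed for reproducibility

# Compare sparse trees (3 splits) with complex ones (300 splits),
# using 10 hypothetical predictors and 3 candidates per split.
small_trees = [predictors_used(3, 10, 3, rng) for _ in range(20)]
large_trees = [predictors_used(300, 10, 3, rng) for _ in range(20)]

print("distinct predictors, 3-split trees:  ", small_trees)
print("distinct predictors, 300-split trees:", large_trees)
```

In this toy setup, a tree with only a few splits can touch only a few predictors, while a tree with hundreds of splits touches essentially all of them. So if every tree has hundreds of splits, they will all look alike in how many predictors they use, whatever the reason for the large split counts.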

Victor_G
Super User

Re: Has the bootstrap forest changed in JMP Pro18?

Hi @dlehman1,

 

I tried to play with two sample datasets, Diamonds Data.jmp (regression task) and Mushroom.jmp (classification task).

I got the same impression as you with the regression task: even when playing with the hyperparameters, the individual trees were quite complex. However, I got simpler and more satisfying individual trees with the classification task, with diverse depths and complexities (screenshot: Victor_G_0-1746450084662.png).


So again, this doesn't directly answer your question about a possible change in the bootstrap forest algorithm between JMP versions (that may be best answered by technical staff at JMP, @Mark_Bailey?), but I think the complexity of the individual trees is linked to the precision needed for the task (regression being equivalent to a very large number of classes, vs. classification) and to the data available (quantity and quality). 

Hope this answer may still help you,

Victor GUILLER

"It is not unusual for a well-designed experiment to analyze itself" (Box, Hunter and Hunter)
dlehman1
Level V

Re: Has the bootstrap forest changed in JMP Pro18?

I've attached a different example:  similar size to the mushroom data set, classification model, similar number of variables (but a few continuous factors whereas all the mushroom factors are nominal).  The trees in the bootstrap forest all have hundreds of splits.  I'm still mystified.

Re: Has the bootstrap forest changed in JMP Pro18?

I don't support this platform directly, so I have no immediate explanation or help to offer. I am also committed to finishing the development of a new platform and new features in JMP 19 at the moment, so I think the fastest result will be to reach out to JMP Support (support@jmp.com) for answers. Please reply here with any insights you receive.
