Re: Bootstrap forest tuning parameters, number of splits, and prediction formula

SDF1 · Feb 19, 2020 01:36 PM

Dear JMP & Community,

(W10 Enterprise, JMP Pro 15.0.0)

I'm curious to know which of the tuning parameters for bootstrap forest modeling affect the number of splits on each factor. I'm familiar with how to use a tuning table and so forth to optimize the fit, e.g. to maximize R^2 for the validation set. Of the six parameters highlighted below, which one(s) can affect the Number of Splits (from Column Contributions)?

If you look at the report output from bootstrap forest, there is a column in Column Contributions called "Number of Splits". If you sum up these splits, it gives the total number of splits in the forest. Taken individually, these are the number of times the model split on a given parameter.

If you then start to look into detail in the formula that you can save in the data table, there can sometimes be differences between the number of times the parameter is counted in a formula and the number of splits on that parameter.

For example, if a parameter has 1000 splits, it should show up in the formula twice per split, for a total of 2000 times, which makes sense. This doesn't always happen, though. I suspect that it's due to one of the tuning parameters, but I'm not sure. If not, what else could affect this?

To give a specific example, below is a screen shot of the column contributions from a fit, comparing 2*number of splits with the number of times each parameter is counted in the formula. As can be seen, the number of times the parameter shows up in the formula (e.g. 12626 for X_1) doesn't match 2*number of splits (12636 for X_1). Sometimes the difference is small, and sometimes it's quite large (see column "Difference"). Also, what's interesting is that the differences sum to zero. So, whatever is causing these differences is "weighted" in such a way that it doesn't actually change the total number of splits, just how they are allocated.

I find it very interesting that in the formula the parameter X_11 is found 276 times more than 2*number of splits. To me, this suggests that at some splits, it's not evenly deciding on X_11, but rather on another parameter. And, it's doing this in such a way that the total number of splits remains constant. Could it be related to the Terms Sampled per Split or min split size? I tried some manual tests and couldn't determine how either of those affect the number of splits.

By the way, I copy the formula into Word and do a simple count on each parameter name (X_1, X_2, etc.). Given the size of the formula and the number of trees, there's no way I can go in and manually search through it to find what's causing this. Unfortunately, there's no way I know of to evaluate the formula with Text Explorer or anything like that within JMP.
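For anyone who wants to repeat the count without Word, here is a minimal Python sketch. It assumes the saved formula has been pasted into a plain-text string; the sample string below is made up for illustration and is not real JMP output, and `count_mentions` is a hypothetical helper name. It counts whole names only, so X_1 is not also counted inside X_11, which a plain substring search would do.

```python
import re
from collections import Counter

def count_mentions(formula_text, names):
    """Count whole-name occurrences of each factor in the formula text."""
    counts = Counter()
    for name in names:
        # Lookarounds stop X_1 from matching inside X_11 or X_10.
        pattern = r"(?<![A-Za-z0-9_])" + re.escape(name) + r"(?![A-Za-z0-9_])"
        counts[name] = len(re.findall(pattern, formula_text))
    return counts

# Made-up stand-in for the pasted prediction formula (assumption).
sample = (
    "If(Is Missing(X_1) | X_1 <= 32, 41, 35) + "
    "If(Is Missing(X_11) | X_11 > 16, 20, 18)"
)
mentions = count_mentions(sample, ["X_1", "X_11"])
print(mentions["X_1"], mentions["X_11"])  # whole-name counts
print(sample.count("X_1"))                # naive substring count, for comparison
```

In this toy string the whole-name count for X_1 is 2, while the naive substring count is 4 because it also picks up the two X_11 mentions. Whether that matters for a given count depends on how the search was set up.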

Any feedback from JMP developers, or modelers in the community, is much appreciated.

Thanks,

DS

Byron_JMP · Feb 24, 2020 10:44 AM

This might not be a very satisfying answer, but the number of splits is just a count of how many times the variable was the best place to split a branch.

There isn't really a control on how many times a variable can be split.

Maybe a couple of questions back to you:

1. How are you building your tuning table, and how many runs do you use?

2. When you tune, do you tune against the whole data set, or a random subset?

- What proportion of the data do you use in a subset (if you do)?

Your observations on the parameter names used in formulas are interesting. I might look into that a little.

JMP Systems Engineer, Health and Life Sciences (Pharma)

SDF1 · Feb 24, 2020 04:03 PM

Hi @Byron_JMP,

Yes, I understand that the number of splits is the number of times the algorithm chooses to split a tree on that particular variable. If you dig a bit deeper into the formula that's written to the column, the number of splits that's reported in the Column Contributions isn't always exactly 1/2 of the number of times the X factor is counted in the formula.

You might consider that a tree decision might be something like (just as an example):

If(Is Missing(X_1) | X_1 <=32, 41, 35)

If that's the case for each decision, then you'd expect the number of times X_1 shows up in the formula to be exactly 2*the number of splits associated with that factor.

But, if instead the decision tree looks more like:

If(Is Missing(X_1) | X_3 > 16, 20, 18)

Then you could have a situation where the number of times a tree is split on that variable differs from the number of times it appears in the formula. What might drive that when the decision trees are built?
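To make the two cases concrete, here is a small Python sketch. The node strings are just the two illustrative examples above, not text from a real JMP formula, and whether JMP ever writes the second form is exactly the open question. The sketch assumes the split variable is the one in the comparison, and counts every whole-name mention separately.

```python
import re
from collections import Counter

NAME = re.compile(r"X_\d+")
# Assumption: the split variable is the one that appears in the comparison.
COMPARISON = re.compile(r"(X_\d+)\s*(?:<=|>=|<|>)")

def tally(nodes):
    """Return (splits per variable, mentions per variable) for the nodes."""
    splits, mentions = Counter(), Counter()
    for node in nodes:
        splits[COMPARISON.search(node).group(1)] += 1
        mentions.update(NAME.findall(node))
    return splits, mentions

nodes = [
    "If(Is Missing(X_1) | X_1 <= 32, 41, 35)",  # same variable in both places
    "If(Is Missing(X_1) | X_3 > 16, 20, 18)",   # hypothetical mixed node
]
splits, mentions = tally(nodes)
# Difference = mentions in formula - 2 * number of splits, per variable.
difference = {v: mentions[v] - 2 * splits[v] for v in sorted(mentions)}
print(difference)                 # {'X_1': 1, 'X_3': -1}
print(sum(difference.values()))   # 0
```

In this toy case each mixed node moves one mention from the split variable to the other variable, so the per-variable differences are nonzero while their sum stays at zero. That would be consistent with the pattern in the "Difference" column, but it doesn't show that this is what JMP is actually doing.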

In the particular situation of the original post, you can see that the factors X_1, X_2, and X_7 balance each other out such that the values added together in the "Difference" column add up to 0. Same is true for all the other X-factors when their differences are added up.

So, in this case, factor X_1 is actually used 5 times LESS (Difference/2) in the formula than is reported by the Number of Splits column in the Column Contributions. Similarly, X_2 and X_7 are used 124 times MORE and 119 times LESS in the formula, respectively. If you add those up, counting LESS as negative: (-5) + 124 + (-119) = 0.

Could it be informative missing? This is too "perfect" to be coincidence, IMHO. I just can't figure out what part(s) of the tuning process is responsible for this.

As for your questions:

1. I use the DOE space filling Uniform option to generate a large set of at least 60 runs (10X the number of factors I'm changing). Then, I replicate the DOE a couple hundred times and run that -- I do it through scripting so it doesn't consume all my RAM. It can take a couple days to run.

2. I take my original data table and split it 80/20 into training and test sets, which I save as separate training-only and test-only tables. On the training table, I use a random holdout of between about 0.18 and 0.25 of the rows to validate the model fit. Again, I do this all through scripting so I can find the optimal tuning values and the optimal validation portion. I often find that an exact 80/20 training/validation split is not as optimal as something slightly off, say .79/.21. You can often get a bit better R^2 on the validation set.
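For readers who want to see the row bookkeeping in point 2 spelled out, here is a rough Python sketch. It is only an illustration outside JMP, not the JSL scripting used here; the function name, seed, and row count are made up, and rows are assumed to be identified by index.

```python
import random

def split_rows(n_rows, test_fraction=0.20, holdout_fraction=0.21, seed=1):
    """Return (train, validation, test) lists of row indices."""
    rng = random.Random(seed)
    rows = list(range(n_rows))
    rng.shuffle(rows)
    # First carve off the test rows, which are never used for fitting.
    n_test = round(n_rows * test_fraction)
    test, remaining = rows[:n_test], rows[n_test:]
    # The validation holdout is drawn from the training table only.
    n_valid = round(len(remaining) * holdout_fraction)
    validation, train = remaining[:n_valid], remaining[n_valid:]
    return train, validation, test

train, validation, test = split_rows(1000)
print(len(train), len(validation), len(test))
```

Sweeping `holdout_fraction` over values between 0.18 and 0.25 would mirror treating the validation portion as one more thing to tune.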

Thanks,
DS