Dear JMP & Community,
(W10 Enterprise, JMP Pro 15.0.0)
I'm curious to know which of the tuning parameters for bootstrap forest modeling impact the number of splits on each factor. I'm familiar with how to use a tuning table and so forth to optimize the fit, e.g. to maximize R^2 for the validation set. Of the six parameters highlighted below, which one(s) can affect the Number of Splits (from Column Contributions)?
If you look at the report output from bootstrap forest, there is a column in Column Contributions called "Number of Splits". If you sum up these splits, you get the total number of splits in the forest. Taken individually, each value is the number of times the model split on a given parameter.
If you then look in detail at the formula that you can save to the data table, there can sometimes be differences between the number of times a parameter appears in the formula and the number of splits on that parameter.
For example, if a parameter has 1000 splits, it should show up in the formula twice per split, for a total of 2000 times, which makes sense. This doesn't always happen, though. I suspect that it's due to one of the tuning parameters, but I'm not sure. If not, what else could affect this?
To give a specific example, below is a screenshot of the column contributions from a fit, comparing 2*number of splits with the number of times each parameter is counted in the formula. As can be seen, the number of times the parameter shows up in the formula (e.g. 12626 for X_1) doesn't match 2*number of splits (12636 for X_1). Sometimes the difference is small, and sometimes it's quite large (see column "Difference"). Also, what's interesting is that the differences sum to zero. So, whatever is causing these differences is "weighted" in such a way that it doesn't actually change the total number of splits, just how they are allocated.
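To make the comparison explicit, here is a minimal Python sketch of the arithmetic behind the "Difference" column. It assumes the difference is defined as (formula count) minus 2*(number of splits); the sign convention and the function name are my own choices for illustration. Only the X_1 values quoted above are used, with 6318 splits derived from 12636 / 2.

```python
def split_count_differences(splits, formula_counts):
    """Return formula count minus 2 * number of splits for each predictor.

    Assumed convention: a positive value means the predictor appears in the
    saved formula more often than 2 * (number of splits) would suggest.
    """
    return {name: formula_counts[name] - 2 * splits[name] for name in splits}


# Values for X_1 taken from the example above (6318 = 12636 / 2).
splits = {"X_1": 6318}
formula_counts = {"X_1": 12626}

diffs = split_count_differences(splits, formula_counts)
print(diffs)

# With every predictor included, this total is what I observed to be zero.
# With only X_1 filled in here, it is just the X_1 difference.
print(sum(diffs.values()))
```

Filling in the remaining predictors from the Column Contributions report would reproduce the full "Difference" column and the sum-to-zero check.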
I find it very interesting that in the formula the parameter X_11 appears 276 more times than 2*number of splits. To me, this suggests that at some splits, it's not evenly deciding on X_11, but rather on another parameter. And, it's doing this in such a way that the total number of splits remains constant. Could it be related to the Terms Sampled per Split or min split size? I tried some manual tests and couldn't determine how either of those affects the number of splits.
By the way, I copied the formula into Word and did a simple count on each parameter, X_1, X_2, etc. Given the size of the formula and the number of trees, there's no way I can go in and manually search through it to find what's causing this. Unfortunately, there's no way I know of to evaluate the formula with Text Explorer or anything like that within JMP.
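As an alternative to counting in Word, the count could be scripted outside JMP. Below is a minimal Python sketch, assuming the saved formula has been pasted into a plain text string. The function name `count_predictors` and the toy formula are hypothetical and are not real JMP output. The sketch matches whole names only, because a plain substring search for X_1 also matches X_10, X_11, and so on.

```python
import re
from collections import Counter


def count_predictors(formula_text, names):
    """Count whole-name occurrences of each predictor in a formula string."""
    counts = Counter()
    for name in names:
        # Reject matches that are part of a longer identifier, so that
        # "X_1" is not counted inside "X_10" or "X_11".
        pattern = r"(?<![A-Za-z0-9_])" + re.escape(name) + r"(?![A-Za-z0-9_])"
        counts[name] = len(re.findall(pattern, formula_text))
    return counts


# Toy stand-in for a saved formula (hypothetical, for illustration only).
toy_formula = "If( :X_1 < 3, If( :X_11 < 2, 1.0, 2.0 ), If( :X_10 >= 5, 3.0, 4.0 ) )"

counts = count_predictors(toy_formula, ["X_1", "X_10", "X_11"])
print(dict(counts))

# A naive substring count of "X_1" also picks up X_10 and X_11.
naive_x1 = toy_formula.count("X_1")
print(naive_x1)
```

In this toy string each predictor appears once, while the naive substring count for X_1 comes out higher, so it is worth checking that whatever tool does the counting uses whole-word matching.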
Any feedback from JMP developers, or modelers in the community, is much appreciated.
Thanks,
DS