I am not a statistician per se, but I have gained a good deal of experience in biostatistics over many years, thanks largely to using the standard version of JMP (currently v13). I certainly cannot imagine using anything else and regularly recommend it.
I am currently defining partitions across a variety of continuous variables to optimise the outcome in a binomial categorical variable (fail versus pass). I have read and watched a great deal about this platform, and I understand the need to include validation and to limit the covariates to a practical but logical short list.
However, the minimum size split is one parameter I am unsure about. My intention is to define partitions that minimise, or perhaps even eliminate, failure in the outcome variable. The resulting partition seems to depend heavily on the minimum split size/proportion chosen. My feeling is that one should choose as large a size as possible (striving towards 50%) while keeping an eye on the LogWorth values generated by each split; too small a split size seems to risk spurious splits, which strikes me as analogous to a Type I error. How then, other than what I have suggested, does one determine what proportion to use?
Also, assuming the partition model applies no correction for multiple comparisons, why does the p-value corresponding to the LogWorth for the initial split not match the Fisher's exact result when one saves the resulting split and performs a 2x2 contingency analysis?
Do you have JMP Pro or access to JMP Pro? I ask because the Partition platform isn't the best (most stable) approach for generating prediction formulas. The Boosted Tree and Bootstrap Forest platforms are much better. Or, if you have access to newer versions, you can also run the XGBoost platform, which is very stable.
But, on to your question about minimum size split. The minimum size split controls the complexity of the tree "stump": if the number of observations in a candidate split would be less than the value you set, the split is not generated. It is less relevant when you use boosting in your decision tree, but it can still be used to control it. You don't want it too small (e.g. 1 observation), and you also don't want it too big (e.g. 1000 observations). A good typical starting point is around 5 for a simple tree method. If you're using something like the Bootstrap Forest platform, you'll want to use N/2 for classification problems like yours, where N is the number of observations. You may be able to tune your model by adjusting this parameter and then comparing the different models on a test data set.
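To make the "size gate" concrete, here's a toy stdlib-Python sketch (not JMP's implementation; the function name and the midpoint-cut convention are just illustrative): a cutpoint on a continuous predictor is only eligible when both resulting child nodes would hold at least the minimum number of observations, so raising the minimum shrinks the pool of candidate splits.

```python
# Toy sketch (NOT JMP's algorithm): a cut between two sorted values is
# only eligible if both sides of the cut meet the minimum size.
def eligible_splits(x, min_size):
    """Return candidate cutpoints on one continuous predictor."""
    xs = sorted(x)
    n = len(xs)
    cuts = []
    for i in range(1, n):
        if xs[i - 1] == xs[i]:
            continue  # no cutpoint between tied values
        if i >= min_size and n - i >= min_size:
            cuts.append((xs[i - 1] + xs[i]) / 2)
    return cuts

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(len(eligible_splits(x, min_size=1)))  # 9 candidate cuts
print(len(eligible_splits(x, min_size=4)))  # only 3 survive
```

With ten observations, a minimum size of 1 leaves nine candidate cuts, while a minimum of 4 leaves only the three central ones, which is why the chosen minimum can change which split wins.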
If you have enough data, I'd recommend splitting off a portion, maybe 20-25%, and keeping it as a test data set that isn't used to train or validate your model. With the remaining data, you can generate a stratified validation column with training and validation portions (say 80/20) to optimize the complexity of your model while keeping it from overfitting. After generating several models, you can use the held-out test data to compare which model actually predicts the outcomes better. Since you're building a binary decision model, you could optimize it further by looking into cost-sensitive learning.
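In JMP this is what a stratified validation column does for you; outside JMP, the same idea can be sketched in a few lines of stdlib Python (function name, fractions, and the toy data are all illustrative): split each outcome class separately so the fail/pass mix is preserved in all three subsets.

```python
import random

def stratified_split(rows, label_of, fractions=(0.6, 0.2, 0.2), seed=1):
    """Split rows into (train, validate, test), preserving the class mix
    by partitioning each class separately. Fractions are approximate."""
    random.seed(seed)
    by_class = {}
    for r in rows:
        by_class.setdefault(label_of(r), []).append(r)
    train, validate, test = [], [], []
    for members in by_class.values():
        random.shuffle(members)
        n = len(members)
        n_train = round(fractions[0] * n)
        n_val = round(fractions[1] * n)
        train += members[:n_train]
        validate += members[n_train:n_train + n_val]
        test += members[n_train + n_val:]
    return train, validate, test

# Toy data: 20 fails and 80 passes out of 100 rows.
rows = [("fail" if i % 5 == 0 else "pass", i) for i in range(100)]
tr, va, te = stratified_split(rows, label_of=lambda r: r[0])
print(len(tr), len(va), len(te))  # 60 20 20
```

Because each class is split on its own, the 20% failure rate carries through to the training, validation, and test subsets, which matters when failures are the rare class you're trying to predict.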
I'm not an expert in how JMP calculates the LogWorth or Fisher's exact test, so hopefully someone from JMP can answer that, but my guess is that it has to do with how LogWorth is calculated over the whole data set versus what happens when you save the split formula and run a contingency table.
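For what it's worth, JMP's documentation describes LogWorth as -log10 of an *adjusted* p-value, where the adjustment accounts for the many candidate cutpoints the platform tries, so it generally won't equal the raw p-value from a saved 2x2 table. As a reference point for what the contingency analysis side computes, here's a stdlib-Python sketch of the raw two-sided Fisher's exact test (summing hypergeometric probabilities no larger than the observed table's):

```python
from math import comb, log10

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]]:
    sum hypergeometric probabilities <= that of the observed table."""
    row1, col1, n = a + b, a + c, a + b + c + d
    def p_table(x):  # P(X = x) under the hypergeometric null
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)
    p_obs = p_table(a)
    lo, hi = max(0, row1 - (n - col1)), min(row1, col1)
    return sum(p_table(x) for x in range(lo, hi + 1)
               if p_table(x) <= p_obs + 1e-12)

p = fisher_exact_2x2(3, 0, 0, 3)  # perfectly separated 3-vs-3 table
print(p)           # 0.1
print(-log10(p))   # 1.0, the "worth" of a raw p of 0.1
```

A raw p of 0.1 corresponds to a LogWorth-style value of 1.0; an adjusted p-value would be larger, giving a smaller LogWorth, which is consistent with the mismatch you observed.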
Hope this helps -- or at least gives some things to think about.
Thanks for your detailed reply. Unfortunately I don't have the Pro version ($$$). The approach I took in the interim was to specify 20% for validation and then start with a 45% minimum size split. If that failed, I would reduce it by 5% to 40%, and so on, until I achieved a split. I would then fine-tune it with 1% upward increments, but only to the point where the LogWorth value still indicated significance. Alternatively, I would set the split size to the failure count in the original set, or a little higher, to force 100% of the failed cases into one of the resulting subsets defined by the split. I take your point regarding a test set and will try that, provided sufficient numbers remain for validation and training. I tried XGBoost after installing Python, but there's a learning curve to get it working.
XGBoost is really powerful and very stable. If you re-run it multiple times, you get the same answer, which isn't always the case for bootstrap forest or boosted tree.
I found this website to be VERY helpful on how to tune XGBoost and improve the models. And, since it's written for people running Python, you should be good to go.
Have you considered scripting your fit routines so that you can automate some of the process and let it run unsupervised? It might be quicker in the long run. I do that when searching the tuning-parameter space for the other tree-based platforms in JMP, often running 30,000+ model fits. Depending on the platform, it can take a good 8-10 hrs, but I just let it run while I do other stuff.
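In JMP that loop would be written in JSL, but the pattern is the same in any language. Here's a stdlib-Python sketch of the idea (the grid keys and the dummy scorer are hypothetical, not actual platform option names): enumerate every combination of tuning settings, fit and score each one against validation data, and keep the best.

```python
from itertools import product

# Hypothetical tuning grid -- parameter names are illustrative only.
grid = {
    "min_size_split": [5, 10, 25, 50],
    "max_splits": [10, 20, 40],
    "learning_rate": [0.05, 0.1, 0.3],
}

def run_grid(fit_and_score):
    """Call fit_and_score(settings) for every combination in the grid
    and return (best_score, best_settings); higher score is better."""
    best = (float("-inf"), None)
    names = list(grid)
    for values in product(*(grid[k] for k in names)):
        settings = dict(zip(names, values))
        score = fit_and_score(settings)
        if score > best[0]:
            best = (score, settings)
    return best

# Dummy scorer just to exercise the loop; in practice this would fit a
# model and return its validation metric.
score, settings = run_grid(lambda s: -abs(s["min_size_split"] - 25))
print(settings["min_size_split"])  # 25
```

The grid above is only 36 combinations; a real search over several platforms and finer grids is how the run count climbs into the tens of thousands.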
You'll definitely want to monitor your R^2 for training and validation to make sure you're not overfitting the data. I'm trying to run some models now, and I'm finding that so far, with the Neural Net and GenReg platforms, the R^2 values come back larger for the validation set than for the training set, which is a big warning sign. Personally, I think it's because of the data set I have and the highly non-normal distribution of the data.
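The comparison itself is simple to do by hand if you ever export predictions; here's a stdlib-Python sketch (the sample numbers are made up purely for illustration) of the R^2 calculation applied to training and validation predictions separately:

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_tot = sum((a - mean) ** 2 for a in actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return 1 - ss_res / ss_tot

# Made-up numbers: the fit tracks the training data closely but the
# validation predictions are rougher -- the usual, expected direction.
train_actual = [1.0, 2.0, 3.0, 4.0]
train_pred   = [1.1, 1.9, 3.2, 3.8]
val_actual   = [1.5, 2.5, 3.5]
val_pred     = [1.0, 3.0, 2.8]

r2_train = r_squared(train_actual, train_pred)  # about 0.98
r2_val = r_squared(val_actual, val_pred)        # about 0.51
```

Training R^2 a bit above validation R^2, as here, is the normal pattern; validation coming in well above training, as described above, suggests something odd in how the data were partitioned or distributed.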
One very nice thing about setting the test data set aside is actually removing those rows from the training data table, so there's no way JMP can access them when running the fit. I've had the experience before that, even though I hid and excluded certain rows, the data were still used during cross-validation, so an actual-vs-predicted plot for the test data set looked better than it really was because some of the test data had leaked into the training.
Again, thank you for your detailed response, which has been most useful. I must admit that, despite being quite accomplished as a programmer in FileMaker, I've never really explored JSL, but this is perhaps the incentive I need to do just that. Regarding R^2, I was aware of that and certainly monitor the split history with respect to this statistic. I did luckily stumble across the little-publicized use of excluded rows for validation, which seemed most counterintuitive to what one is used to during normal operation.