I contacted JMP support because I assumed I had found a bug in JMP. However, I learned the following:
- If rows are marked as "excluded" (whether or not they are also hidden), they are removed from many computations (e.g., a histogram does not take excluded rows into account).
- If rows are marked as "excluded" (whether or not they are also hidden), they are used as the validation set by platforms like Boosted Tree or Bootstrap Forest, unless a validation set is specified some other way.
Apparently this is intended behavior. It confused me, and fortunately I contacted support and learned about it before I published my results, since in my case the excluded rows are invalid data that I keep only for tracking purposes.
So, finally my wish: could JMP be fully consistent in the use of excluded rows?
Here's a quote that JMP support sent me:
"The Bootstrap Forest and several other platforms in JMP Pro have a feature that if some rows are excluded, and you do not otherwise specify a Validation set, those rows are used as the Validation set. To avoid those rows from being included at all, you could:
1) Subset the data table so it doesn't include those rows, then re-run the Bootstrap Forest
2) Use a different Validation method (Holdback or Validation Column)
In general, it is often a good idea to devote some rows to a Validation set. This would give you the ability to use the Early Stopping option, which can help avoid overfitting."
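
To make the inconsistency concrete, here is a toy Python sketch of the two behaviors described above and of workaround 1 from the quote. This is not JMP code and not how JMP is implemented. The row layout and the function names (`summary_mean`, `split_for_model`, `subset_without_excluded`) are made up purely for illustration.

```python
from statistics import mean

# Hypothetical rows: (value, excluded flag). The 100.0 row stands in for
# invalid data that is kept only for tracking purposes.
rows = [(1.0, False), (2.0, False), (3.0, False), (100.0, True)]


def summary_mean(rows):
    """Mimics a computation that simply drops excluded rows."""
    return mean(value for value, excluded in rows if not excluded)


def split_for_model(rows):
    """Mimics the described default: excluded rows become the validation set."""
    training = [value for value, excluded in rows if not excluded]
    validation = [value for value, excluded in rows if excluded]
    return training, validation


def subset_without_excluded(rows):
    """Workaround 1: build a subset that no longer contains excluded rows."""
    return [(value, excluded) for value, excluded in rows if not excluded]


# The summary ignores the invalid row entirely.
print(summary_mean(rows))

# The model split silently reuses the invalid row as validation data.
training, validation = split_for_model(rows)
print(training, validation)

# After subsetting, the invalid row cannot reach the model at all.
clean = subset_without_excluded(rows)
training_clean, validation_clean = split_for_model(clean)
print(training_clean, validation_clean)
```

In this sketch the same "excluded" flag means "ignore this row" in one function and "validate against this row" in the other, which is exactly what surprised me.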