I have the following problem. I want to do classification using the SAS JMP Partition platform. The dataset has two classes: positive and negative. I wrote a JSL script that loads the dataset from a file. The test set is defined by excluding a certain number of rows, and the decision tree is trained on the training set (all included rows). Training is started by hitting the "Go" button via the JSL script (i.e. by sending the Go message to the Partition instance).
Now my question: how does JMP Partition know when to stop growing the tree? According to the R-squared plots, it seems that JMP bases its decision to stop growing the tree on the test set. This would clearly be unsatisfactory, since I want to use the test set performance (the ROC curve and its AUC) to quantify the generalization performance of the tree on unseen data.
Or is the stopping point determined by the cross-validation ROC/AUC? It seems, though, that the AUC of the cross-validation run can easily be pushed to about 0.999 by growing the tree without limit, which would be clear overfitting.
I use the following command to start decision tree learning:

    part = Partition(
        Minimum Size Split( 20 ),
        Show Tree( 1 ),
        ROC Curve( 1 ),
        Column Contributions( 1 ),
        Split History( 1 ),
        Criterion( "Maximize Significance" ),
        K Fold Crossvalidation( 2 ),
        SendToReport( .... )
    );
    part << ColorPoints << Go << LeafReport;
From p. 816 of the JMP Stat and Graph Guide (in JMP: Help > Books > JMP Stat and Graph Guide):
Automatic Splitting

The Go button (shown in Figure 37.12) appears when you have cross-validation enabled. This is done by either using the K Fold Crossvalidation command, or excluding at least 20% of rows as a holdout sample. The Go button provides for repeated splitting without having to repeatedly click the Split button. When you click the Go button, the platform performs repeated splitting until the cross-validation R-Square is better than what the next 10 splits would obtain. This rule may produce complex trees that are not very interpretable, but have good predictive power. Using the Go button turns on the Split History command. Also, if using the Go button results in a tree with more than 40 nodes, the Show Tree command is turned off.
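To make the quoted rule concrete, here is a minimal Python sketch of how I read it. This is not JMP's implementation: the function name `stop_index`, the `lookahead` parameter, and the example R-Square values are all made up for illustration, and treating "better than" as a strict comparison is my assumption.

```python
def stop_index(val_r2, lookahead=10):
    """Return the number of splits at which a look-ahead rule would stop.

    val_r2[k] is the cross-validation R-Square after k splits
    (val_r2[0] is the unsplit tree). Hypothetical helper, not a JMP API.
    """
    if not val_r2:
        raise ValueError("need at least one R-Square value")
    for k, current in enumerate(val_r2):
        # R-Square values of the next `lookahead` candidate splits
        # (the window is shorter when the history runs out).
        window = val_r2[k + 1 : k + 1 + lookahead]
        # Assumption: stop once the current value is strictly better
        # than everything in the look-ahead window.
        if all(r < current for r in window):
            return k
    return len(val_r2) - 1


# Invented validation R-Square history: peaks at 3 splits, then declines.
history = [0.00, 0.30, 0.45, 0.52, 0.50, 0.51, 0.49, 0.48,
           0.47, 0.46, 0.45, 0.44, 0.43, 0.42, 0.41]
print(stop_index(history))  # 3
```

Under this reading, the tree size is chosen by whatever data supplies the cross-validation R-Square. If that is the excluded holdout rows, those rows have influenced the model and are no longer a fully independent test set.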