Hi forum members,
I have the following problem. I want to do classification using the SAS JMP Partition platform.
The dataset has 2 classes: positive and negative. I wrote a JSL script that loads
the dataset from a file. The test set is defined by excluding a certain number of rows,
and the decision tree is trained on the training set (i.e. all included rows).
Training is started by "hitting" the Go button from the JSL script (i.e. by sending the Go
message to the Partition instance).
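In outline, the script does something like this (the file name and the column names below are just placeholders, not the real ones from my data):

dt = Open( "my_dataset.jmp" );       // placeholder file name
dt << Set Row States( [0, 2, 0] );   // illustrative only: 2 = excluded (test row), 0 = included (training row)
part = Partition(
	Y( :class ),                     // placeholder response column (positive/negative)
	X( :x1, :x2 ),                   // placeholder predictor columns
	Minimum Size Split( 20 )
);
part << Go;                          // same effect as pressing the Go button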
Now my question: how does the JMP Partition platform know when to stop growing the tree?
Judging from the R-squared plots, it seems that JMP uses the test set as the basis for
its decision to stop growing the tree. That would clearly be unsatisfactory, since I want
to use the test-set performance (via the ROC curve and its AUC) to quantify the generalization
performance of the tree on unseen data.
Or is the stopping point determined by the cross-validation ROC/AUC? However, it seems the
AUC of the cross-validation run can easily be pushed to around 0.999 simply by growing
the tree arbitrarily large, which would be clear overfitting.
I use the following command to start the decision tree learning:
part = Partition(
	Minimum Size Split( 20 ),
	Show Tree( 1 ),
	ROC Curve( 1 ),
	Column Contributions( 1 ),
	Split History( 1 ),
	Criterion( "Maximize Significance" ),
	K Fold Crossvalidation( 2 ),
	SendToReport( .... )
);
part << ColorPoints << Go << LeafReport;
and the selection of the test-set rows is done using:
Set Row States( [0,2,0,......] );
A value of 2 encodes an excluded row.
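For completeness, the row-state vector itself is built along these lines (just a sketch with a hypothetical 30% test fraction, not my exact code):

n = N Rows( dt );                    // dt is the open data table
states = J( n, 1, 0 );               // start with all rows included (0)
For( i = 1, i <= n, i++,
	If( Random Uniform() < 0.3,      // hypothetical 30% test fraction
		states[i] = 2                // 2 = excluded, i.e. a test row
	)
);
dt << Set Row States( states );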
Thanks in advance for any hints
Marc