Hi forum members,
I have the following problem. I want to do classification using the SAS JMP Partition platform.
The dataset has 2 classes: positive and negative. I wrote a JSL script that loads
the dataset from a file. The test set is defined by excluding a certain number of rows,
and the decision tree is trained on the training set (i.e. all included rows).
Training is started by "hitting" the Go button from the JSL script (i.e. by sending the Go
message to the Partition instance).
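In outline, the script does something like this (the file name and the column names below are just placeholders, not the real ones from my data):

dt = Open( "my_dataset.jmp" );       // placeholder file name
dt << Set Row States( [0, 2, 0] );   // illustrative only: 2 = excluded (test row), 0 = included (training row)
part = Partition(
	Y( :class ),                     // placeholder response column (positive/negative)
	X( :x1, :x2 ),                   // placeholder predictor columns
	Minimum Size Split( 20 )
);
part << Go;                          // same effect as pressing the Go button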
Now my question: how does the JMP Partition platform know when to stop growing the tree?
Judging from the R-squared plots, it seems that JMP uses the test set as the basis for
its decision to stop growing the tree. That would clearly be unsatisfactory, since I want
to use the test-set performance (via the ROC curve and its AUC) to quantify the generalization
performance of the tree on unseen data.
Or is the stopping point determined by the cross-validation ROC/AUC? However, it seems the
AUC of the cross-validation run can easily be pushed to around 0.999 simply by growing
the tree arbitrarily large, which would be clear overfitting.
I use the following command to start the decision tree learning:
part = Partition(
	Minimum Size Split( 20 ),
	Show Tree( 1 ),
	ROC Curve( 1 ),
	Column Contributions( 1 ),
	Split History( 1 ),
	Criterion( "Maximize Significance" ),
	K Fold Crossvalidation( 2 ),
	SendToReport( .... )
);
part << ColorPoints << Go << LeafReport;
and the selection of the test-set rows is done using:
Set Row States( [0,2,0,......] );
A value of 2 encodes an excluded row.
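For completeness, the row-state vector itself is built along these lines (just a sketch with a hypothetical 30% test fraction, not my exact code):

n = N Rows( dt );                    // dt is the open data table
states = J( n, 1, 0 );               // start with all rows included (0)
For( i = 1, i <= n, i++,
	If( Random Uniform() < 0.3,      // hypothetical 30% test fraction
		states[i] = 2                // 2 = excluded, i.e. a test row
	)
);
dt << Set Row States( states );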
Thanks in advance for any hints
Marc