kom
Level IV

Why does my Decision Tree have duplicate attributes in the same branch?

Hi all. Just wondering if anybody knows why the decision tree (in the case below) selects Nozzle Air Pressure >42.09 and then appears to contradict itself at the next node by splitting on Nozzle Air Pressure again? My understanding is that once an attribute is split, it should no longer be considered in subsequent branches. I'm using 10-fold cross validation with a holdback of 30%. Since I have both continuous and nominal data, I assume JMP uses the CART algorithm?

 

[Attached image: kom_0-1610017637065.png (decision tree showing the repeated Nozzle Air Pressure splits)]

 

7 REPLIES
SDF1
Super User

Re: Why does my Decision Tree have duplicate attributes in the same branch?

Hi @kom ,

 

  I don't know the exact algorithm that JMP uses when splitting decision trees; however, conceptually it is working as intended and not contradicting itself. If you look at the other splits in your decision tree, you can see that other attributes are also used multiple times. What isn't reused is the same split value.

 

  In your example, at the node where the air pressure is >=42, JMP is simply using that as a decision point to separate the data. At the next node, JMP is saying that, for all the instances where air pressure is >=42, it makes an additional split: the pressure is still >=42, but it is either (>=42 and <45) or (>=42 and >=45).
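To make the nesting concrete, here is a minimal Python sketch (not JMP's code; the pressure values are made up) showing that the child conditions refine the parent condition rather than contradict it:

# Hypothetical illustration of the two nested splits: each terminal
# node's condition is a refinement of its parent's condition.
def leaf_for(pressure):
    if pressure >= 42:            # parent split
        if pressure >= 45:        # child split refines the parent
            return ">=42 and >=45"
        return ">=42 and <45"
    return "<42"

for p in (40.0, 43.5, 47.2):      # illustrative values only
    print(p, "->", leaf_for(p))

Every observation that reaches the right-hand child already satisfies >=42, so a further cut at 45 narrows the range rather than contradicting the earlier split.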

 

  How are you using BOTH 10-fold cross validation and a holdback of 30%? Typically, you would use only one type of validation method.

 

DS

kom
Level IV

Re: Why does my Decision Tree have duplicate attributes in the same branch?

Hi Diedrich. Many thanks for taking the time to reply. Firstly, you are correct: in my haste to write the post, I incorrectly wrote k-fold AND holdback. I did actually try both and got somewhat similar results.

 

The decision tree follows IF x AND y THEN z logic to classify the data. I get that it makes the first split at air pressure >=42, and I can now accept that >=42 AND <45 makes sense (in fact, this is the first time I've seen a decision tree bound the same attribute between two values). Having said all that, being >=42 AND >=45 makes no practical sense at all. Do you agree?

 

Kieran

Re: Why does my Decision Tree have duplicate attributes in the same branch?

I think that only you can determine if the split of 42 to 45 makes practical sense. But to understand multiple splits on the same variable, suppose you have a situation where the relationship between Y and X is linear. So, for example, we KNOW that Y=10+20*X. In the real world we would not know this true model, so if you fit a tree to data with this relationship, you will get multiple splits for X. It is the only way that the tree can try to mimic a continuous linear relationship.
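This is not JMP's Partition platform, but the effect is easy to reproduce with any CART-style tree. Here is a scikit-learn sketch in Python, with data simulated from the example Y = 10 + 20*X plus a little noise (the seed, sample size, and tree depth are arbitrary choices for illustration):

# Illustrative sketch (scikit-learn, not JMP): fit a regression tree to data
# from a purely linear relationship and print its rules. Every split is on X,
# at a different cut point, because the tree can only staircase the line.
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))              # made-up predictor values
y = 10 + 20 * X.ravel() + rng.normal(0, 1, 200)    # Y = 10 + 20*X plus noise

tree = DecisionTreeRegressor(max_depth=3).fit(X, y)
print(export_text(tree, feature_names=["X"]))      # repeated splits on X

Each leaf predicts a constant, so the only way the tree can follow the slope is to keep cutting X into finer intervals.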

Dan Obermiller

Re: Why does my Decision Tree have duplicate attributes in the same branch?

Oh, and what if it were a quadratic relationship? What do you think the tree splits would look like in that case?

Dan Obermiller
SDF1
Super User

Re: Why does my Decision Tree have duplicate attributes in the same branch? (Accepted Solution)

Hi @kom ,

 

  The decision that some values are >=42 and also >=45 still makes logical sense: if the air pressure is >=45, it is at the same time >=42 and fulfills the logical decision split. I think the more pertinent question is whether or not this extra split makes sense. If you're using either holdback (a validation portion) or a validation column that is stratified on your Y output, you should be able to have the decision tree split automatically, and it should reduce the tree down to the minimum number of splits needed to reach the highest R2 for the validation data. That being said, you still want to determine whether such splits are reasonable for the data, given the science that you're working with.
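As a rough illustration of that validation-guided stopping (a scikit-learn sketch in Python, not JMP's Partition platform; the data and the 42/45 step points are invented to mirror this thread), you can grow trees of increasing size and keep the one with the best validation R2:

# Sketch only: pick the tree size that maximizes R2 on a 30% validation holdback.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(30, 60, size=(300, 1))                    # hypothetical pressures
y = (X.ravel() > 42) * 5 + (X.ravel() > 45) * 3 + rng.normal(0, 1, 300)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

best = max(
    (DecisionTreeRegressor(max_leaf_nodes=k, random_state=0).fit(X_tr, y_tr)
     for k in range(2, 15)),
    key=lambda m: m.score(X_val, y_val))                  # validation R2
print(best.get_n_leaves(), "leaves kept")

With a rule like this, splits that don't improve the validation R2 never get kept, which is the check on whether the extra split at 45 is earning its place.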

 

  Are you using the decision tree to build a predictive model? If so, you might want to try other platforms, like NN and so on, to see which model actually predicts the best on a test data set.

 

Good luck!

DS

Re: Why does my Decision Tree have duplicate attributes in the same branch?

The predictor is a continuous variable. If there is a linear relationship with the response, then it might take many splits with that predictor to approximate the relationship.

kom
Level IV

Re: Why does my Decision Tree have duplicate attributes in the same branch?

@Dan_Obermiller Thankfully, I don't have to give practical direction on this one, but it got me thinking about what I would do going forward if I were to see such sequential 'double-splitting'.

@DiedrichSchmidt I'm beginning to make sense of it now: if pressure is between 42 and 45, then there's a high probability of a choke; if pressure is greater than 45, then there's a 38% probability of a choke. I got hung up on the >42 AND >45, which made no sense in a real-world scenario. It all comes down to how you read the split of the same parameter. No need to use ANN in this case, as I wasn't looking to create a model; rather, I just wanted to identify important variables, their respective split points, and their potential impact on the response variable (via a small tree depth and minimal overfitting).

Thank you all for your input!