Summary
In this post, we compare running a decision tree in Python versus JMP and demonstrate how to create a customized decision tree in JMP by applying mathematical criteria to determine which parameter each branch should split on. This level of customization is not available in Python, making JMP essential for building robust models where the choice of splitting parameters is not left to chance. Moreover, this customization lets you leverage your subject matter expertise to justify the selection of the splitting parameter at each branch, especially when multiple candidates show a similar ability to split the data effectively.
This example uses the Wine data set, a classification example commonly available in the Scikit-learn library. It contains the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars (types of grape), labeled 0, 1, and 2. In the data set, 13 chemical characterization measurements were taken on each sample, including the amount of flavonoids (a type of phenol), the wine's color intensity, the ash content, the hue, and the alcohol content.
The goal? To create a decision tree classification model that identifies the source of the wine using the fewest and cheapest characterization tests possible.
Using Python
First, let's see what we get when creating a decision tree model in Python. (The script is attached to this post.) The resulting model can be graphed with matplotlib, as shown below. The first model classifies according to color intensity, proline, ash content, flavonoids, and alcohol content.
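The attached script does something along these lines; here is a minimal sketch using scikit-learn's built-in copy of the Wine data (the attached script may differ in its details):

```python
# A minimal sketch of fitting and plotting a decision tree on the Wine data.
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Load the Wine data set: 13 chemical features, 3 cultivars (0, 1, 2)
wine = load_wine()

# Fit a classification tree; without a fixed random_state, ties between
# equally good splits can be broken differently on each run
clf = DecisionTreeClassifier().fit(wine.data, wine.target)

# Graph the fitted model with matplotlib
plot_tree(clf, feature_names=wine.feature_names,
          class_names=list(wine.target_names), filled=True)
plt.show()
```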
Unless the random seed is set to a fixed value, the tree will look different with every new run, as seen below (a short sketch reproducing this check appears after the figures). While this is expected behavior, it is not practical when you want to choose a set of characterization tests that are fast and/or cheap to run. In this example, there is no way to control which feature a split uses or at what value it occurs.
[Decision trees for three random seeds]
Random state = 2. Tests needed: color intensity, proline, alcohol content, flavonoids.
Random state = 14. Tests needed: color intensity, proline, ash, alkalinity of ash, malic acid, alcohol content, flavonoids.
Random state = 24. Tests needed: color intensity, proline, od280/od315, flavonoids, alkalinity of ash.
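To see this variability yourself, a short sketch like the following lists which features (i.e., which characterization tests) each seeded tree uses. The exact lists depend on your scikit-learn version, so they may not reproduce the figures above exactly:

```python
# Hedged sketch: list the features used by a tree fitted with each seed.
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier

wine = load_wine()

for seed in (2, 14, 24):
    clf = DecisionTreeClassifier(random_state=seed).fit(wine.data, wine.target)
    # tree_.feature holds one feature index per node; leaf nodes are negative
    used = {wine.feature_names[i] for i in clf.tree_.feature if i >= 0}
    print(f"random_state={seed}: tests needed: {sorted(used)}")
```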
Using JMP
Let's see if JMP can do a better job of selecting characterization tests.
To run the Decision Tree platform in JMP, go to Analyze > Predictive Modeling > Partition. Set the Validation Portion to 0.3 and check Informative Missing and Ordinal Restricts Order.
The initial report will look something like this.
Clicking Go produces this decision tree:
As in Python, rerunning the analysis produces a different structure for the decision tree.
So, what can be done to create a less randomized model, one based on our preferences among the characterization methods? We're going to override the splits and choose the parameters ourselves.
First, click the Prune button; each click steps backward, removing one branch at a time. Next, open the Candidates list and order the features by their G^2 or LogWorth splitting criterion, allowing you to see which parameters are at the top of the list.
Below you can see the different candidates for the split, with flavonoids at the top of the list, followed by od280/od315, total phenols, and proline.
The G^2 values tell us that the model will choose flavonoids as the splitting criterion. If you click Split, the branch will split on a flavonoids value of 1.41.
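As a sanity check on the idea (not JMP's exact math), here is a hedged Python sketch that ranks features by a G^2-style likelihood-ratio statistic over all candidate cut points. JMP's reported values and LogWorth (which is -log10 of an adjusted p-value) will differ, but the ordering should be broadly similar; note scikit-learn spells the feature "flavanoids":

```python
# Hedged sketch: rank candidate splits by a G^2-style statistic, i.e. the
# likelihood-ratio chi-square of the 2 x 3 table (split side vs. cultivar).
import numpy as np
from scipy.stats import chi2_contingency
from sklearn.datasets import load_wine

wine = load_wine()
X, y = wine.data, wine.target

def best_g2_split(x, y):
    """Best (G^2, cut) over all midpoints between consecutive unique values."""
    u = np.unique(x)
    best_g2, best_cut = 0.0, None
    for cut in (u[:-1] + u[1:]) / 2:
        left = x < cut
        table = np.array([np.bincount(y[left], minlength=3),
                          np.bincount(y[~left], minlength=3)])
        g2, _, _, _ = chi2_contingency(table, lambda_="log-likelihood")
        if g2 > best_g2:
            best_g2, best_cut = g2, cut
    return best_g2, best_cut

ranked = sorted(((best_g2_split(X[:, j], y), name)
                 for j, name in enumerate(wine.feature_names)), reverse=True)
for (g2, cut), name in ranked[:4]:
    print(f"{name}: G^2 = {g2:.1f} at cut {cut:.3g}")
```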
Next, expand the Candidates list under flavonoids < 1.41. Color intensity is now at the top of the list. However, perhaps hue is easier to measure than color intensity; if so, click on the red triangle and select Split Specific. You can then choose the parameter for the split -- in this case, hue -- and click OK.
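scikit-learn has no interactive equivalent of Split Specific, but you can approximate the effect by hand: within the branch's rows, fit a depth-1 stump restricted to the feature you prefer. A hedged sketch (the cut it finds may differ from JMP's 0.906, since JMP held out 30% of the rows for validation):

```python
# Hedged sketch: emulate Split Specific by fitting a one-level stump on a
# single, hand-picked feature within one branch of the data.
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier

wine = load_wine()
X, y = wine.data, wine.target
flav = wine.feature_names.index("flavanoids")   # scikit-learn's spelling
hue = wine.feature_names.index("hue")

# Restrict to the flavonoids < 1.41 branch, then split it on hue only
branch = X[:, flav] < 1.41
stump = DecisionTreeClassifier(max_depth=1).fit(X[branch][:, [hue]], y[branch])
print(f"best hue cut in this branch: {stump.tree_.threshold[0]:.3f}")
```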
The new tree, shown below, indicates that if the flavonoids level is less than 1.41 and the hue is less than 0.906, there is a 0.98 probability that the wine is made from cultivar 2. On the other hand, when the hue is 0.906 or higher, the wine most likely comes from cultivar 1. When the flavonoids level is 1.41 or greater, there is nearly a 50-50 chance that the cultivar is either 0 or 1, so let's continue the split there.
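You can sanity-check a leaf probability like the 0.98 against the raw data; the number will be close but not identical, since JMP computed it on the 70% training portion:

```python
# Hedged sketch: check the leaf probability against the full data set.
from sklearn.datasets import load_wine

wine = load_wine()
X, y = wine.data, wine.target
flav = wine.feature_names.index("flavanoids")
hue = wine.feature_names.index("hue")

leaf = (X[:, flav] < 1.41) & (X[:, hue] < 0.906)
print(f"P(cultivar 2 | leaf) ~ {(y[leaf] == 2).mean():.2f}")
```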
Under flavonoids >= 1.41, look at the candidates and their G^2 values. Alcohol and proline have similar G^2 values. Since it's easier to test for alcohol content, make the split on alcohol. It reveals that there is a 91% chance that the wine comes from cultivar 0 if the alcohol content is greater than 13.5.
The final decision tree model is shown below. We now need fewer characterization tests to identify the grape cultivar, and the chosen tests are easier to perform, all without taking a black-box approach to selecting these parameters. The decision rules can then be easily saved and shared with others.
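One way to share the rules is to transcribe them as plain code. A hedged sketch follows; the predicted class of the last leaf (cultivar 1 when flavonoids >= 1.41 and alcohol <= 13.5) is our assumption from the near 50-50 node above, not something stated explicitly in the report:

```python
# Hedged transcription of the final tree's decision rules. Cutoffs come from
# the report above; the flavonoids >= 1.41, alcohol <= 13.5 leaf is ASSUMED
# to predict cultivar 1.
def classify_wine(flavonoids, hue, alcohol):
    if flavonoids < 1.41:
        return 2 if hue < 0.906 else 1
    return 0 if alcohol > 13.5 else 1

# Example: low flavonoids and low hue points to cultivar 2
print(classify_wine(flavonoids=1.2, hue=0.8, alcohol=13.0))  # -> 2
```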