
Introduction to machine learning: an easy-to-understand explanation of decision trees

Today's manufacturing industry is highly competitive, and companies must constantly seek innovative ways to improve efficiency, reduce costs, and maintain product quality. In this dynamic and complex environment, statistical analysis has become indispensable, and one of the most widely used techniques is the decision tree. Decision trees are powerful statistical models that can help manufacturers solve a variety of problems, from product defect analysis and production efficiency optimization to supply chain management and quality control. In this article, we explore what a decision tree is, where it applies, and how to use it to keep up with changing market needs.

What is a decision tree?

A decision tree is a machine learning method that, at each step, selects a feature according to a splitting rule and uses a value of that feature as a threshold to divide the training samples into subsets, which are then split recursively by the same rule. The result is a tree-shaped classifier that can be used for classification decisions. Its main algorithms include ID3, C4.5, and CART. The decision tree is a commonly used classification method that requires supervised learning: we are given a set of samples, each with its input variables and its known classification result, and by learning from these samples we obtain a decision tree that can predict the class of new samples. Next, we use a simple case to help everyone understand how a decision tree analysis works.
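
To make the idea concrete, here is a minimal Python sketch of supervised decision-tree learning. It is only an illustration and not part of the JMP workflow shown later; the tiny data set and its encoding are made up:

```python
# A minimal illustration of supervised decision-tree learning.
# The data below is invented for illustration; it is not the 20-sample
# data set used in the article's case study.
from sklearn.tree import DecisionTreeClassifier

# Each sample: [is_weekend (0/1), mood (0 = bad, 1 = average, 2 = good)]
X = [[1, 2], [1, 1], [0, 2], [0, 0], [1, 0], [0, 1]]
# Known classification result: 1 = watch a movie, 0 = do not
y = [1, 1, 1, 0, 0, 0]

# Learn a tree classifier from the labeled samples (supervised learning)
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# Predict the class of a new sample: a weekend with a good mood
print(tree.predict([[1, 2]]))
```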

Decision tree case sharing

Given the following set of data shown in Figure (1), which contains 20 samples, we hope to predict "whether to go out to watch a movie" from "whether today is the weekend" and "what is your mood today".


Figure (1)

We then train on this set of samples with their known classification results. To simplify the process, we assume the decision tree is a binary tree similar to the one shown in Figure (2) below:


Figure (2)

By studying the sample data, we can see that each variable node has a specific decision criterion, such as weekend vs. non-weekend, or good mood, average mood, and bad mood. We call this criterion the "threshold". For a numeric variable node, the threshold is generally a specific value; for example, if the decision depends on whether a person's age is over 30, then 30 is the threshold.

The generation of a decision tree is achieved by learning from the classification results of known samples, and it generally involves two steps:

  1. When a variable node cannot by itself determine the result, the node is split in two (or, for a non-binary tree, into more branches), and splitting continues until no further split is possible. For example, in this sample data, "whether today is the weekend" cannot by itself determine whether to watch a movie, so a new variable node, "how are you feeling today", is added and split further until no more splits are possible;
  2. An appropriate threshold is chosen at each split so that the misclassification rate is minimized (see the sketch after this list).
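
As a sketch of step 2, the following Python snippet scans candidate thresholds on a numeric variable (the age example mentioned above) and keeps the one with the lowest misclassification rate; the ages and labels are invented for illustration:

```python
# Choose the threshold on a numeric variable (here: age) that minimizes
# the misclassification rate. Data invented for illustration only.
ages   = [22, 25, 28, 31, 35, 40, 45, 52]
labels = [0,  0,  0,  1,  1,  1,  0,  1]   # 1 = positive class

def misclassification_rate(threshold):
    # Predict 1 for samples above the threshold, 0 otherwise,
    # and count how many predictions disagree with the true labels.
    errors = sum((age > threshold) != bool(label)
                 for age, label in zip(ages, labels))
    return errors / len(ages)

# Candidate thresholds are midpoints between consecutive sorted ages
ordered = sorted(ages)
candidates = [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]
best = min(candidates, key=misclassification_rate)
print(best, misclassification_rate(best))
```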

Commonly used decision tree algorithms

As introduced earlier, commonly used decision tree algorithms include ID3, C4.5, and CART (Classification And Regression Tree). Among them, CART generally classifies better than the other algorithms, and it is also the method used by JMP. Next, we introduce each of them:

ID3 algorithm

ID3 uses information gain to evaluate splitting conditions: for a given set of sample data, the greater the gain, the better the classification. Gaining information is really a process of reducing entropy, where entropy measures the degree of disorder in the information. Building a model with ID3 is therefore a process of moving the information from chaos toward order.
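
For reference, here is a small Python sketch of the entropy and information-gain calculations described above; the class counts are placeholders rather than the case-study data:

```python
# Entropy and information gain, the quantities ID3 uses to pick a split.
from collections import Counter
from math import log2

def entropy(labels):
    # H = -sum(p * log2(p)) over the class proportions in `labels`
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, groups):
    # Entropy of the parent node minus the weighted entropy of its children
    n = len(labels)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups)

# Placeholder example: a parent node split into two child groups
parent = [1, 1, 1, 1, 0, 0, 0, 0]
children = [[1, 1, 1, 0], [1, 0, 0, 0]]
print(information_gain(parent, children))
```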

In the above case, two variables are mentioned: whether it is the weekend and what the mood is. Let's see how well each variable node classifies:

  1. "Today is the weekend" to determine the decision to "watch a movie": 11 of the samples were determined to be the weekend, but 10 samples ultimately decided to watch a movie, and one was misclassified;
  2. "In a good mood" to judge the decision to "watch a movie": 6 samples were definitely in a good mood, and 6 samples were finally decided to watch a movie, with 0 misclassifications;
  3. "In a normal mood" to judge the decision to "watch a movie": 6 of the samples were determined to be in a normal mood, but in the end 4 samples decided to watch a movie, and 2 samples were misclassified.

The rule "good mood" predicts "watch a movie" with the lowest misclassification rate, which means its information gain is the largest. Therefore, when ID3 builds this decision tree, it splits on the variable "how are you feeling today" first.
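
Using the counts above, the misclassification rates can be checked with a few lines of Python:

```python
# Misclassification rate of each candidate rule, using the counts above
rules = {
    "today is the weekend -> watch": (11, 10),  # 11 samples match, 10 watched
    "good mood -> watch":            (6, 6),
    "average mood -> watch":         (6, 4),
}
for name, (matched, watched) in rules.items():
    print(f"{name}: {(matched - watched) / matched:.0%} misclassified")
```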

C4.5 algorithm

Looking at the ID3 algorithm above, you will notice that the more levels a splitting variable has, the lower the misclassification rate after the split and the smaller the entropy. If an "ID" column were used as the splitting variable, the subsets would reach the highest possible purity (each subset would contain only one sample) and the information gain would be the largest. But such a splitting variable is obviously meaningless for prediction; this is what we call overfitting.

Therefore, to avoid the "trap" of splitting on variables with too many levels, the C4.5 algorithm uses the information gain ratio to evaluate splitting conditions. The denominator of the gain ratio grows as the number of levels in the split increases, which reduces the gain ratio. If we split according to C4.5, then although using "how are you feeling today" as the splitting variable yields purer subsets, it has three levels (good mood, average mood, bad mood), so its gain ratio ends up lower than that of "whether today is the weekend". C4.5 therefore splits on "whether today is the weekend" first. The exact formula for the information gain ratio is not covered in this article.
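
For readers who want to see the idea in code, here is a minimal sketch of the gain ratio, i.e., the information gain divided by the "split information" of the partition; the grouping counts are placeholders, not a reproduction of the case-study data:

```python
# Gain ratio = information gain / split information.
# The split information grows with the number of branches, which
# penalizes variables that split the data into many small groups.
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c)

def split_information(group_sizes):
    # Entropy of the partition itself: large when there are many branches
    n = sum(group_sizes)
    return -sum(s / n * log2(s / n) for s in group_sizes)

def gain_ratio(parent_counts, groups):
    # groups: list of per-branch class counts, e.g. [[8, 3], [2, 7]]
    n = sum(parent_counts)
    child_entropy = sum(sum(g) / n * entropy(g) for g in groups)
    gain = entropy(parent_counts) - child_entropy
    return gain / split_information([sum(g) for g in groups])

# Placeholder comparison: a two-branch split vs. a three-branch split
print(gain_ratio([10, 10], [[8, 3], [2, 7]]))          # two branches
print(gain_ratio([10, 10], [[5, 1], [3, 3], [2, 6]]))  # three branches
```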

CART algorithm

The CART algorithm is very similar in its ideas to the C4.5 algorithm but differs slightly in implementation and application scenarios: CART can handle not only categorical target responses but also continuous target responses, which is why it is called the classification and regression tree. The ID3 algorithm can only handle categorical responses, that is, it can only build classification trees.

CART can therefore be a regression tree. Ideally, splitting would stop when each child node contains only one category. However, much real data cannot be completely separated, or complete separation would require many splits and a long running time. CART therefore evaluates each child node by the mean and standard deviation of its data: when the standard deviation falls below a certain value, splitting terminates, which reduces the computational cost.

The CART algorithm uses the Gini index (for classification trees) and the standard deviation (for regression trees) as its measure of purity, while the ID3 algorithm uses information entropy. The two are similar in spirit: the more mixed the categories in a population, the larger the Gini index. The CART algorithm can only build binary trees, while the ID3 and C4.5 algorithms can build multi-way trees. (Note: building only binary trees does not mean that a splitting variable can only have two levels; CART merges levels so that each split comes out binary.)
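
As a reference, the Gini index itself is a one-line calculation; the counts below are placeholders:

```python
# Gini index: 1 - sum(p_i^2). The more mixed the classes, the larger it is.
def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

print(gini([10, 0]))    # pure node            -> 0.0
print(gini([5, 5]))     # maximally mixed node -> 0.5
print(gini([3, 3, 4]))  # three mixed classes  -> even larger
```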

CART can keep making ever smaller splits, that is, it can overfit. To address this problem, overly deep trees are pruned: the excess branches are simply cut off. Cross-validation can also be used while building the decision tree to help prevent overfitting.
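
As one way to see pruning and cross-validation in practice outside of JMP, scikit-learn's decision tree exposes cost-complexity pruning through its ccp_alpha parameter; the random data below is purely illustrative:

```python
# Limiting tree growth and using cross-validation to guard against overfitting.
# The random data is purely illustrative.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# An unpruned tree fits the training data perfectly but may overfit;
# cost-complexity pruning (ccp_alpha > 0) cuts back weak branches.
for alpha in [0.0, 0.01, 0.05]:
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0)
    scores = cross_val_score(tree, X, y, cv=5)
    print(f"ccp_alpha={alpha}: mean CV accuracy = {scores.mean():.3f}")
```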

Using JMP to build a CART model

Using JMP's Analyze > Predictive Modeling > Partition platform (i.e., the decision tree), you can quickly build a CART model and grow a tree. For classification trees, JMP uses G² (the likelihood-ratio chi-square, which plays a role similar to the Gini index): the smaller the G², the higher the purity and the better the classification. See Figure (3) below:


Figure (3) shows how to assign the response variable and the factors in the Partition platform dialog box.

Figure (4)

The CART classification tree in Figure (4) above shows that for the decision "whether to watch a movie", "whether today is the weekend" is the most important variable; it is also the variable that makes G² drop the fastest. Splitting the "today is the weekend" branch further then yields a sub-branch with a G² of 0 (it is the weekend and the mood is good, so the decision is 100% to watch a movie), at which point no further splits are needed and the classification tree is complete. This is the process of building a decision tree in JMP. I hope you find it helpful.
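
For readers curious about the G² values in the report, G² is a likelihood-ratio statistic that equals 0 for a perfectly pure node. Below is a minimal sketch of one standard way to compute it for a node (twice the negative multinomial log-likelihood); the counts are placeholders, and JMP's own report may do its bookkeeping slightly differently:

```python
# G^2 of a node: twice the negative multinomial log-likelihood,
# G^2 = 2 * sum(-n_i * ln(p_i)), where p_i is the class proportion in the node.
# A perfectly pure node has G^2 = 0.
from math import log

def node_g2(counts):
    n = sum(counts)
    return 2 * sum(-c * log(c / n) for c in counts if c > 0)

print(node_g2([10, 10]))  # maximally mixed node -> largest G^2 for its size
print(node_g2([10, 1]))   # mostly pure node     -> small G^2
print(node_g2([6, 0]))    # perfectly pure node  -> 0.0
```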

>> How-to video: Using JMP to draw a decision tree

>> Learn more: Predictive modeling and machine learning with JMP

>> Download JMP: Try the software for 30 days
