Classification and regression trees are intuitive methods, often described in graphical or biological terms. A tree is typically shown growing upside down, beginning at its root. An observation passes down the tree through a series of splits, or nodes, at which a decision is made as to which direction to proceed based on the value of one of the explanatory variables. Ultimately, a terminal node or leaf is reached and a predicted response is given.
Trees partition the explanatory variables into a series of boxes (the leaves) that contain the most homogeneous collection of outcomes possible. Creating splits is analogous to variable selection in regression. Trees are typically fit via binary recursive partitioning. The term binary refers to the fact that a parent node is always split into exactly two child nodes. The term recursive indicates that each child node will, in turn, become a parent node, unless it is a terminal node. To start with, a single split is made using one explanatory variable. The variable and the location of the split are chosen to minimize the impurity of the node at that point. There are many ways to measure and minimize the impurity of each node; these are known as splitting rules. Each of the two regions that result from the initial split is then split in turn according to the same criterion, and the tree continues to grow until it is no longer possible to create additional splits or the process is stopped by some user-defined criterion. The tree may then be reduced in size using a process known as pruning.
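One step of this process can be sketched in code. The following is an illustrative implementation (not taken from the article), using the Gini index as the splitting rule: for each explanatory variable and each candidate cut point, the split is scored by the size-weighted impurity of the two child nodes, and the best-scoring split is kept.

```python
def gini(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(rows, labels):
    """Return the (variable index, cut point) pair that minimizes the
    size-weighted impurity of the two child nodes."""
    best = (None, None, float("inf"))
    n = len(rows)
    for j in range(len(rows[0])):                 # each explanatory variable
        for cut in sorted({r[j] for r in rows}):  # each candidate cut point
            left = [y for r, y in zip(rows, labels) if r[j] <= cut]
            right = [y for r, y in zip(rows, labels) if r[j] > cut]
            if not left or not right:
                continue                          # a split must produce two children
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (j, cut, score)
    return best[0], best[1]
```

Applying `best_split` to each resulting child node, until the children are pure or some stopping rule fires, is exactly the recursive step described above.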
Assigning a predicted value to the terminal nodes can be done in a number of ways. Typically, in classification trees, terminal nodes are assigned the class that represents the plurality of cases in that node. The rules of class assignment can be altered based on a cost function, to adjust for the consequences of making a mistake for certain classes, or to compensate for unequal sampling of classes. In the case of regression trees, terminal nodes are assigned the mean of the cases in that node.
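The two default prediction rules just described can be sketched as follows (function names are my own, chosen for illustration):

```python
from collections import Counter

def classify_leaf(classes):
    """Classification tree: predict the plurality (most common) class
    among the cases that fall in this leaf."""
    return Counter(classes).most_common(1)[0][0]

def regress_leaf(responses):
    """Regression tree: predict the mean response of the cases in this leaf."""
    return sum(responses) / len(responses)
```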
As an example, consider the problem of modeling the presence or absence of the tree species Pseudotsuga menziesii (Douglas fir) in the mountains of northern Utah using only information about elevation (ELEV) and aspect (ASP), where the data take the form:
Figure 1 illustrates a simple classification tree for this problem. Beginning with all 1544 observations at the root, the 393 cases that fall below an elevation of 2202 m are classified as having no Douglas fir. If elevation is greater than 2202 m, as is the case for 1151 observations, then more information is needed. The next split is made at an elevation of 2954 m. These very-high-elevation observations above the cutoff are also classified as having no Douglas fir. Turning now to the remaining 928 moderate-elevation observations, yet more fine-tuning is needed. The third split occurs at an elevation of 2444 m. The 622 moderately high elevation cases above 2444 m are classified as having Douglas fir present. The final split uses aspect to determine whether Douglas fir is likely to grow on the remaining 306 moderately low sites, predicting Douglas fir to be present on the cooler, wetter northerly and easterly slopes, and absent on the hotter, drier exposures.
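Read off as a set of rules, the tree in Figure 1 amounts to a short chain of nested conditionals. In this sketch, encoding aspect as a compass category is my own simplification; the article does not state the exact aspect cut point used in the final split.

```python
def douglas_fir_present(elev_m, aspect):
    """Decision rules read off the classification tree in Figure 1.
    `aspect` is a compass category ('north', 'east', 'south', 'west');
    this categorical encoding is an assumption for illustration."""
    if elev_m < 2202:        # low elevation: Douglas fir absent
        return False
    if elev_m > 2954:        # very high elevation: absent
        return False
    if elev_m > 2444:        # moderately high elevation: present
        return True
    # Moderately low sites (2202-2444 m): present only on the cooler,
    # wetter northerly and easterly slopes.
    return aspect in {"north", "east"}
```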
At a minimum, construction of a tree involves making choices about three major issues. The first choice is how splits are to be made: which explanatory variables will be used and where the split will be imposed. These are defined by splitting rules. The second choice involves determining appropriate tree size, generally using a pruning process. The third choice is to determine how application-specific costs should be incorporated. This might involve decisions about assigning varying misclassification costs and/or accounting for the cost of model complexity.
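The third choice, incorporating misclassification costs, can be illustrated with a small sketch (the cost values here are hypothetical): instead of assigning a leaf the plurality class, the leaf is assigned the class that minimizes the total misclassification cost over the cases it contains.

```python
from collections import Counter

def min_cost_class(classes_in_leaf, cost):
    """Assign the class minimizing total misclassification cost.
    `cost[(true, predicted)]` is the hypothetical penalty for predicting
    `predicted` when the true class is `true`; correct predictions cost 0."""
    counts = Counter(classes_in_leaf)
    def total_cost(pred):
        return sum(n * cost.get((true, pred), 0) for true, n in counts.items())
    return min(counts, key=total_cost)
```

With equal costs this reduces to the plurality rule; making one kind of error more expensive (say, predicting a species absent where it is in fact present) can flip the assigned class even when that class is in the minority.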