Classification and regression trees

When a modeler wants to make a minimum number of assumptions about the form of model relationships and has data that include both continuous and categorical variables, the use of classification and regression tree (CART) models may be appropriate. A CART is a predictive model that sequentially maps values of predictor variables to values of the response using a tree-like structure (Figure 4). Starting with the root, each branching point corresponds to a binary decision criterion expressed in terms of the predictors. The final leaves then state the predicted value of the response variable, given the values of the predictors represented by the path from the root.
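
As a rough illustration of the structure described above, the following Python sketch represents a tree as nested nodes and predicts by following binary decisions from the root to a leaf. The predictor names (`x1`, `x2`), thresholds, and leaf values are hypothetical and are not taken from Figure 4.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    # Internal nodes carry a binary split; leaves carry only a prediction.
    feature: Optional[str] = None       # predictor tested at this node
    threshold: Optional[float] = None   # criterion: feature value < threshold
    left: Optional["Node"] = None       # branch taken when the criterion holds
    right: Optional["Node"] = None      # branch taken otherwise
    prediction: Optional[float] = None  # set only on leaves


def predict(node, case):
    """Follow the path from the root to a leaf and return the leaf's value."""
    while node.prediction is None:
        node = node.left if case[node.feature] < node.threshold else node.right
    return node.prediction


# Hypothetical two-split tree, for illustration only.
tree = Node(
    feature="x1", threshold=3.0,
    left=Node(prediction=0.2),
    right=Node(
        feature="x2", threshold=10.0,
        left=Node(prediction=0.5),
        right=Node(prediction=1.8),
    ),
)

print(predict(tree, {"x1": 5.0, "x2": 12.0}))
```

Each prediction corresponds to one root-to-leaf path, which is what makes the fitted model readable as a set of simple rules.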

[Figure 3: diagram with panels labeled Measurement model 1, Measurement model 2, and Structural model; response node: Heat production of animal i]

Figure 3 A structural equation model (SEM) relating ambient air temperature and the metabolic rate of an animal. Square nodes represent measured variables, round nodes represent latent variables, and nodes without borders represent measurement error. Reproduced from Shipley B (2000) Cause and Correlation in Biology: A User's Guide to Path Analysis, Structural Equations, and Causal Inference. Cambridge: Cambridge University Press.

Figure 4 Elements of a regression tree analysis to predict the abundance of the soft coral taxon Efflatounaria in terms of four spatial variables (shelf position, location, reef type, and depth) and four physical variables (sediment, visibility, waves, and slope): (a) Cross-validation plots used to select tree size. (b) Full tree before optimal pruning, in which shading indicates branches that were pruned. (c) The selected final tree with five branches. Each of the splits is labeled with the splitting criterion, and each split and leaf is labeled with the mean abundance rating (0-3) and the number of observations in the group. The relative lengths of the branches in the tree represent the proportion of the total sum of squares explained by each split. Reproduced from De'ath G and Fabricius KE (2000) Classification and regression trees: A powerful yet simple technique for ecological data analysis. Ecology 81: 3178-3192.

The term 'classification tree' is used when the response variable is categorical, and 'regression tree' when the response is continuous; CART refers to both situations. As the name implies, CART models build upon the methods used for regression analysis. Specifically, for continuous response variables, the splits used at each branching point are often determined by minimizing the sum of squared residuals within the resulting groups. For categorical responses, splits are determined by minimizing the misclassification rates of the groups or, in a closely related approach, by maximizing the likelihood of the data assuming a multinomial error distribution.
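
The least-squares criterion for a continuous response can be sketched as follows. This is a minimal illustration for a single numeric predictor, assuming candidate thresholds are taken midway between adjacent observed values; the function names and the small data set are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the group mean (0 for an empty group)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, total_sse) for the split `x < threshold` that
    minimizes the summed within-group sum of squared residuals of y.
    The threshold is None if no split improves on leaving the data unsplit."""
    best_t, best_total = None, sse(list(y))
    xs = sorted(set(x))
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2  # candidate threshold between adjacent values
        left = [yi for xi, yi in zip(x, y) if xi < t]
        right = [yi for xi, yi in zip(x, y) if xi >= t]
        total = sse(left) + sse(right)
        if total < best_total:
            best_t, best_total = t, total
    return best_t, best_total


# Hypothetical data with an obvious break between x = 3 and x = 10.
x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y = [0.0, 0.0, 0.0, 3.0, 3.0, 3.0]
threshold, total = best_split(x, y)
print(threshold, total)
```

A full tree-growing procedure would repeat this search over every predictor at every node and apply it recursively to the two resulting groups.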

Finding the best-sized tree for a particular problem can be a challenge. A tree that is too large will overfit the data, so that its apparent fit underestimates the prediction error incurred when the model is applied to new data. A tree that is too small may sacrifice predictive ability by neglecting important and useful splits. Thus, finding the right tree size amounts to finding an appropriate estimate of prediction error. Various procedures have been used for this purpose, including penalties for model complexity, such as Akaike's information criterion (AIC), and cross-validation. At present, cross-validation methods appear preferable, although they can be computationally demanding.
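
The idea of choosing tree size by cross-validation can be sketched as below. This is an illustrative simplification, not the pruning procedure behind Figure 4: it uses maximum depth as a stand-in for tree size, a single numeric predictor, and simulated data (a noisy step function), all of which are assumptions made for the example.

```python
import random


def mean(values):
    return sum(values) / len(values)


def sse(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def fit(x, y, depth):
    """Grow a least-squares regression tree at most `depth` levels deep.
    A leaf is a float (mean response); a split is (threshold, left, right)."""
    xs = sorted(set(x))
    if depth == 0 or len(xs) < 2:
        return mean(y)
    best_cost, best_t = None, None
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi < t]
        right = [yi for xi, yi in zip(x, y) if xi >= t]
        cost = sse(left) + sse(right)
        if best_cost is None or cost < best_cost:
            best_cost, best_t = cost, t
    left = [(xi, yi) for xi, yi in zip(x, y) if xi < best_t]
    right = [(xi, yi) for xi, yi in zip(x, y) if xi >= best_t]
    return (best_t,
            fit([p[0] for p in left], [p[1] for p in left], depth - 1),
            fit([p[0] for p in right], [p[1] for p in right], depth - 1))


def predict(tree, xi):
    while isinstance(tree, tuple):
        t, left, right = tree
        tree = left if xi < t else right
    return tree


def cv_error(x, y, depth, k=5):
    """Mean squared prediction error estimated by k-fold cross-validation."""
    n = len(x)
    total = 0.0
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        test = [i for i in range(n) if i % k == fold]
        tree = fit([x[i] for i in train], [y[i] for i in train], depth)
        total += sum((y[i] - predict(tree, x[i])) ** 2 for i in test)
    return total / n


# Simulated data: a step at x = 20 plus Gaussian noise (hypothetical).
rng = random.Random(0)
x = list(range(40))
y = [(0.0 if xi < 20 else 3.0) + rng.gauss(0, 0.5) for xi in x]

cv = {depth: cv_error(x, y, depth) for depth in range(5)}
best_depth = min(cv, key=cv.get)
print(cv, best_depth)
```

The computational cost noted in the text is visible here: every candidate size requires refitting the tree once per fold.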

In addition to being able to handle a variety of data types, CART models have a number of advantages related to ease of understanding and interpretation. Most people can quickly understand a CART model because of its graphical representation and because every predicted value of the response can be traced to a short sequence of Boolean (true/false) conditions on the predictors.

The handling of missing values is also fairly straightforward. If a predictor variable used to form a split has missing values, one can either stop such cases at the split, assigning them the average (or most common) response of all cases that reach that split, or use a nonmissing surrogate variable to determine the split criterion for that case. If the chosen surrogate correlates highly with the missing variable, this technique can be quite effective at minimizing information loss.
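
One way to picture the surrogate approach is sketched below: among cases where both variables are observed, choose the surrogate threshold whose split agrees most often with the primary split, then use it to route cases whose primary value is missing. The function names and data are hypothetical, and for brevity the sketch considers only surrogate splits that point in the same direction as the primary split.

```python
def best_surrogate_threshold(primary, primary_threshold, surrogate):
    """Return (threshold, agreement) for the surrogate split `s < threshold`
    that most often sends complete cases the same way as the primary split.
    Missing values are represented by None."""
    pairs = [(p, s) for p, s in zip(primary, surrogate)
             if p is not None and s is not None]
    goes_left = [p < primary_threshold for p, _ in pairs]
    s_values = sorted({s for _, s in pairs})
    best_t, best_agree = None, 0.0
    for lo, hi in zip(s_values, s_values[1:]):
        t = (lo + hi) / 2
        matches = sum((s < t) == g for (_, s), g in zip(pairs, goes_left))
        agree = matches / len(pairs)
        if agree > best_agree:
            best_t, best_agree = t, agree
    return best_t, best_agree


def route_left(p, s, primary_threshold, surrogate_threshold):
    """Use the primary variable when observed, otherwise the surrogate."""
    if p is not None:
        return p < primary_threshold
    return s < surrogate_threshold


# Hypothetical data: the last case is missing its primary value.
primary = [1.0, 2.0, 3.0, 8.0, 9.0, None]
surrogate = [10.0, 12.0, 11.0, 30.0, 31.0, 29.0]

t, agree = best_surrogate_threshold(primary, 5.0, surrogate)
print(t, agree, route_left(None, 29.0, 5.0, t))
```

The agreement score plays the role of the correlation mentioned above: the closer it is to 1, the less information is lost by substituting the surrogate.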

Uncertainty in the predictions of a CART model is described by the distribution of categorical responses at each leaf, or by the mean and variance of continuous responses at each leaf. As described above with reference to tree-size selection, cross-validation methods can be used to generate more realistic estimates of prediction error.
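
The leaf-level summaries mentioned above might be computed as in the following sketch, which assumes the responses of the training cases reaching a leaf are available as a list. The labels and values are hypothetical, and the variance shown is the sample variance.

```python
from collections import Counter
from statistics import mean, variance


def class_distribution(labels):
    """Proportion of each response category among the cases in a leaf."""
    counts = Counter(labels)
    n = len(labels)
    return {label: count / n for label, count in counts.items()}


def mean_and_variance(values):
    """Mean and sample variance (n - 1 denominator) of responses in a leaf."""
    return mean(values), variance(values)


# Hypothetical leaf contents.
leaf_labels = ["present", "present", "absent", "present"]
leaf_values = [1.0, 2.0, 3.0]

dist = class_distribution(leaf_labels)
m, v = mean_and_variance(leaf_values)
print(dist, m, v)
```

These summaries describe the training cases in the leaf, so they tend to be optimistic; the cross-validated error is the more realistic guide to performance on new data.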
