Finding the smallest decision tree that would fit a given data set is known to be computationally expensive (NP-hard). Heuristic search, typically greedy, is thus employed to build decision trees. The common way to induce decision trees is the so-called 'top-down induction of decision trees' (TDIDT). Tree construction proceeds recursively starting with the entire set of training examples (entire table). At each step, an attribute is selected as the root of the (sub)tree and the current training set is split into subsets according to the values of the selected attribute.
For discrete attributes, a branch of the tree is typically created for each possible value of the attribute. For continuous attributes, a threshold is selected and two branches are created based on that threshold. For the subsets of training examples in each branch, the tree construction algorithm is called recursively. Tree construction stops when the examples in a node are sufficiently pure (i.e., all are of the same class) or if some other stopping criterion is satisfied (there is no good attribute to add at that point). Such nodes are called leaves and are labeled with the corresponding values of the class.
Different measures can be used to select an attribute in the attribute selection step. These also depend on whether we are inducing classification or regression trees. For classification, Quinlan uses information gain, which is the expected reduction in entropy of the class value caused by knowing the value of the given attribute. Other attribute selection measures, however, such as the Gini index or the accuracy of the majority class, can and have been used in classification tree induction. In regression tree induction, the expected reduction in variance of the class value can be used.
An important mechanism used to prevent trees from overfitting data is tree pruning. Pruning can be employed during tree construction (prepruning) or after the tree has been constructed (postpruning). Typically, a minimum number of examples in branches can be prescribed for prepruning and a confidence level in accuracy estimates for leaves for postpruning.
Was this article helpful?