As discussed in the pruning section, an overly large tree can easily be grown to some user-defined minimum node size. Often, though, the final tree selected through pruning is substantially smaller than the original overly large tree; in the case of regression trees, it may be 10 times smaller. This means a substantial amount of computing time can be wasted growing branches that are later discarded. Consequently, one can specify a cost-complexity penalty, equal to the resubstitution error rate (the error obtained using just the training data) plus a penalty parameter multiplied by the number of terminal nodes. A very large tree will have a low misclassification rate but a high penalty, while a small tree will have a high misclassification rate but a low penalty. Cost complexity can thus be used to limit the size of the initial tree grown prior to pruning, which can greatly improve computational efficiency, particularly when cross-validation is being used.
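The trade-off above can be sketched with scikit-learn, which implements this CART-style cost-complexity penalty (there called `ccp_alpha`). The dataset and the choice of penalty values here are illustrative assumptions, not part of the original text:

```python
# Sketch of the cost-complexity trade-off: a larger penalty parameter
# (alpha) charges more per terminal node, so the pruned tree is smaller.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # illustrative dataset

# Grow an overly large tree, then compute its pruning path: the sequence
# of effective alphas at which successive subtrees get pruned away.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# alpha = 0: no penalty, keep the full overly large tree.
big_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=0.0).fit(X, y)

# A large alpha from near the end of the path: heavy penalty, small tree.
# (The very last alpha prunes everything down to the root, so use the
# second-to-last.)
small_tree = DecisionTreeClassifier(
    random_state=0, ccp_alpha=path.ccp_alphas[-2]
).fit(X, y)

print(big_tree.get_n_leaves(), small_tree.get_n_leaves())
```

The full tree ends up with many terminal nodes, while the heavily penalized tree keeps only a handful, mirroring the low-error/high-penalty versus high-error/low-penalty trade-off described above.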
One process that combines the cross-validation and cost-complexity ideas is to generate a sequence of trees of increasing size by gradually decreasing the penalty parameter in the cost-complexity approach. Tenfold cross-validation is then applied to this relatively small set of trees to choose the smallest tree whose error falls within one standard error of the minimum. Because each run of the tenfold cross-validation procedure may select a different tree size, the procedure can be repeated many times (e.g., 50), with the most frequently chosen tree size selected.
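The repeated-cross-validation procedure might be sketched as follows. This is a minimal illustration, not the original authors' implementation: the dataset, the use of scikit-learn's pruning path to generate the tree sequence, and the repetition count (kept small here for speed; the text suggests something like 50) are all assumptions:

```python
# One-standard-error rule over the cost-complexity tree sequence, with the
# whole 10-fold CV procedure repeated and the modal tree size chosen.
from collections import Counter
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # illustrative dataset

# Candidate penalties from the pruning path (drop the last alpha, which
# prunes to the root alone; clip tiny negative values from float noise).
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
alphas = np.clip(path.ccp_alphas[:-1], 0.0, None)

def one_se_tree_size(seed):
    """One 10-fold CV pass; return the leaf count picked by the 1-SE rule."""
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    errs, ses, sizes = [], [], []
    for a in alphas:
        clf = DecisionTreeClassifier(random_state=0, ccp_alpha=a)
        fold_err = 1.0 - cross_val_score(clf, X, y, cv=cv)  # fold error rates
        errs.append(fold_err.mean())
        ses.append(fold_err.std(ddof=1) / np.sqrt(len(fold_err)))
        sizes.append(clf.fit(X, y).get_n_leaves())
    errs, sizes = np.array(errs), np.array(sizes)
    threshold = errs.min() + ses[int(errs.argmin())]
    # Smallest tree whose CV error is within one SE of the minimum.
    return int(sizes[errs <= threshold].min())

# Repeat the whole procedure; each repeat reshuffles the folds, so the
# chosen size can vary. Keep the most frequently chosen size.
counts = Counter(one_se_tree_size(seed) for seed in range(5))
best_size = counts.most_common(1)[0][0]
print(best_size)
```

Because the fold assignment is reshuffled on every repeat, the one-standard-error choice can differ from run to run; taking the mode stabilizes the final tree-size selection, which is exactly the motivation given above.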