If the sample size is not large, it is necessary to retain all the data for training purposes. However, pruning and testing must be done using independent data. A way around this dilemma is v-fold cross-validation. Here, all the data are used to fit an initial, overly large tree. The data are then divided into (usually) v = 10 subgroups, and 10 separate models are fit. The first model uses subgroups 1-9 for training and subgroup 10 for testing; the second uses subgroups 1-8 and 10 for training and subgroup 9 for testing, and so on. In every case an independent test subgroup is available. The results from these 10 test subgroups are then combined to give independent error rates for the initial overly large tree that was fit using all the data. Pruning of this initial tree proceeds as in the independent-test-set case: error rates are calculated for the full tree as well as for all smaller subtrees, and the subtree with the smallest cross-validated error rate is chosen as the optimal tree.
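The fold-splitting and error-pooling mechanics described above can be sketched as follows. This is a minimal illustration, not a full tree-pruning implementation: a trivial majority-class predictor stands in for the tree model so that the v-fold bookkeeping is visible on its own, and the helper names (`v_fold_indices`, `cv_error_rate`) are hypothetical.

```python
def v_fold_indices(n, v=10):
    """Assign each of n cases to one of v subgroups, round-robin."""
    return [i % v for i in range(n)]

def majority_class(labels):
    """Stand-in 'model': predict the most common training label."""
    return max(set(labels), key=labels.count)

def cv_error_rate(y, v=10):
    """Pooled v-fold cross-validated error rate for the stand-in model.

    For each fold k, train on the other v-1 subgroups, test on
    subgroup k, and pool the errors over all v test subgroups.
    """
    folds = v_fold_indices(len(y), v)
    errors = 0
    for k in range(v):
        train = [y[i] for i in range(len(y)) if folds[i] != k]
        test = [y[i] for i in range(len(y)) if folds[i] == k]
        pred = majority_class(train)  # fit on the v-1 training subgroups
        errors += sum(1 for t in test if t != pred)
    return errors / len(y)  # combined error over all held-out cases
```

In the full pruning procedure, `cv_error_rate` would be computed for every candidate subtree size, and the subtree minimizing it would be selected.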
Questions often arise as to whether one should use an independent test set or cross-validated estimates of error rates. One thing to consider is that cross-validated error rates are based on models built with only 90% of the data. These models will not be quite as good as a model built with all of the data, so the cross-validated error rates will be consistently slightly higher, giving the modeler a conservative independent estimate of error. In regression tree applications in particular, however, this overestimate of error can be substantially higher than the truth, giving the modeler more incentive to find an independent test set.