What is cross validation?
Cross-validation is a method for estimating what the error rate of a subtree (of the maximal tree) would be if you had test data. Regardless of the value of V you set for V-fold cross validation, CART grows the same maximal tree. The monograph provides evidence that using a V of 10 to 20 gives better results than using a smaller number, but each value of V could result in a slightly different error estimate. The optimal tree, which is derived from the maximal tree by pruning, could differ from one V to another because each cross-validation run will come up with slightly different estimates of the error rates of the subtrees and thus might differ in which tree is judged best.
Normally, a test sample is used to prune the maximal tree down to an "optimal" tree. This is especially recommended for large data sets, from which a test sample can be withdrawn. However, there are times when the size of the data set makes withdrawing a test sample difficult. In the absence of a test sample and without using cross validation, no pruning is done — this is called EXPLORATORY — and the maximal tree is the result. Note that the maximal tree in an exploratory run is identical to the maximal tree when using a test sample, provided that the learn sample is the same for each run.
When you are unwilling to use a test sample but still desire estimates of the error rates of each tree in the sequence, cross validation may be used. In a nutshell, cross validation establishes how much to prune the maximal tree by building a series of "ancillary cross-validation trees" from which error rates of the maximal tree and its subtrees can be estimated. Cross validation does not affect the growth of the maximal tree at all because it is conducted after the maximal tree is grown. The V ancillary cross-validation trees may be similar to the maximal tree, but not necessarily. Here is how it works:

The maximal tree is grown and saved. Note that we do not have any "independent" estimate of the error rates for each node in the maximal tree, because we do not have a test sample. A pruning sequence is defined based on node complexities of the maximal tree, although the error rate for each tree in the sequence is not yet known. In other words, we know which nodes to prune off the tree and in what order, and we have a series of subtrees defined by the pruning sequence, but we do not know how far to prune.

V ancillary cross-validation trees are then grown, each on a different subset of the learn sample. For instance, if 10 cross-validation trees are grown, each uses 90% of the learn sample for tree growth and the remaining 10% as a pseudo test sample with which to estimate error rates for the nodes in that cross-validation tree.

Error rates from each of the V cross-validation trees are combined and mapped to the nodes in the original maximal tree. The V cross-validation trees are then discarded.
Now that estimates of the error/cost for each node in the maximal tree are known, we are in a position to prune the maximal tree and declare an optimal tree.
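The pooling of held-out errors described in the steps above can be sketched in a few lines of Python. This is only an illustration, not CART's implementation: the names vfold_partition, fit_majority, and cross_validated_error are hypothetical, a majority-class predictor stands in for a real tree, and the sketch estimates one overall error rate rather than mapping errors onto every subtree of a pruning sequence.

```python
import random
from collections import Counter


def vfold_partition(n_records, v, seed=0):
    """Shuffle record indices and deal them into v nearly equal folds."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    return [indices[k::v] for k in range(v)]


def fit_majority(train_features, train_labels):
    """Stand-in learner (NOT a CART tree): always predicts the most common label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda _features: majority


def cross_validated_error(features, labels, v=10, fit=fit_majority, seed=0):
    """Grow v ancillary models, each tested on the fold it did not see,
    and pool the held-out misclassifications into one error rate."""
    n = len(labels)
    misclassified = 0
    for held_out in vfold_partition(n, v, seed):
        held = set(held_out)
        train = [i for i in range(n) if i not in held]
        predict = fit([features[i] for i in train], [labels[i] for i in train])
        misclassified += sum(predict(features[i]) != labels[i] for i in held_out)
    return misclassified / n


if __name__ == "__main__":
    xs = list(range(100))        # made-up predictor values
    ys = [0] * 70 + [1] * 30     # made-up class labels
    print(cross_validated_error(xs, ys, v=10))  # prints 0.3
```

With this made-up 70/30 sample the stand-in learner always predicts class 0, so every class 1 record is a held-out error and the pooled rate is 30/100. Each record is held out exactly once, which is why the pooled count can be divided by the full learn sample size.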
Q: We typically use the default of 10-fold cross validation in CART. However, when we change to, say, 20-fold cross validation, CART indicates a different optimal tree. Why?
A: In both cases the maximal tree is the same. 20-fold cross validation will partition the learning sample into 20 subsets and will generate 20 ancillary cross-validation trees. These trees, each with its own error rates, will be combined to yield estimated error rates for the maximal tree. Since we are combining 20 trees rather than 10, it is almost certain that the 20-fold combined error rates estimated for the maximal tree will differ from those estimated by combining the 10-fold cross-validation trees. Although the pruning sequence is the same in both runs, a different tree may be chosen as optimal in each run because the error rate estimates differ. In other words, the maximal tree and the pruning sequence are the same, but the 10-fold and 20-fold cross-validation procedures may result in a different amount of pruning.
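To make the difference between the two runs concrete, the short sketch below works out how many records each ancillary tree is grown on and tested on for a given V. The function name fold_plan and the learn sample size of 1000 are hypothetical, chosen only for illustration, and the rule for spreading leftover records across folds is an assumption rather than CART's documented behavior.

```python
def fold_plan(n_records, v):
    """Return a (grow_size, test_size) pair for each of the v ancillary trees.

    Assumes folds are as equal as possible, with any leftover records
    spread one apiece over the first folds.
    """
    base, extra = divmod(n_records, v)
    test_sizes = [base + 1 if k < extra else base for k in range(v)]
    return [(n_records - t, t) for t in test_sizes]


if __name__ == "__main__":
    for v in (10, 20):
        plan = fold_plan(1000, v)  # 1000 is a made-up learn sample size
        grow, test = plan[0]
        print(f"V={v}: {len(plan)} ancillary trees, "
              f"each grown on {grow} records and tested on {test}")
```

Under these assumptions a 10-fold run grows each ancillary tree on 900 records and tests it on 100, while a 20-fold run uses 950 and 50. The ancillary trees in the two runs are therefore grown on different subsets of the data, so there is no reason to expect their combined error estimates to agree exactly.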
Q: In the tree sequence and on the "select tree" dialog we see "cross-validated relative cost" (with confidence intervals) and "resubstitution relative cost" for each tree in the tree sequence, e.g.:
Sr.   Terminal Tree Nodes   Cross-Validation Relative Cost   Resubstitution Relative Cost   Complexity Parameter
1     15                    0.7457930 +/- 0.0142744          0.6738151                      0.0019035
2     10                    0.7506419 +/- 0.0135887          0.6981514                      0.0024436
3     9                     0.7533725 +/- 0.0136544          0.7033467                      0.0026077
4     7                     0.7476655 +/- 0.0137743          0.7145392                      0.0028081
5**   6                     0.7439012 +/- 0.0135847          0.7221265                      0.0038037
6     3                     0.7605784 +/- 0.0142045          0.7499018                      0.0046392
7     1                     1.0000000 +/- 0.0000896          1.0000000                      0.0625345
A: Cross-validated relative cost is the error rate of the tree, relative to the root node, estimated using the cross-validation method. If you had used a test sample instead of cross validation, you would have been presented with "test sample relative cost." The resubstitution relative cost is the error rate that would be estimated had you used a copy of the learn sample as your test sample. Note that this rate always decreases as the tree gets larger. This is a consequence of estimating errors with the same data that were used to build the tree in the first place. The +/- number is a measure of the uncertainty around the actual (cross-validation or test sample) error rate of the tree in question when confronted with new data. The cross-validation error rate is derived from one cross-validation procedure, whereas a test sample error rate is derived from one test sample. Either way, if you ran another cross-validation procedure or used a different test sample, you would likely see a (slightly) different error rate. The +/- number gives an idea of how much the estimate could move.
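The figures in the example tree sequence can be checked with a short Python sketch. The values are copied from the table in the question; the function names are hypothetical, and treating the +/- number as the half-width of an interval around the estimate is an assumption made here for illustration, not a statement of how CART defines it.

```python
# (tree number, terminal nodes, cross-validated relative cost, +/- value,
#  resubstitution relative cost), copied from the example table
TREE_SEQUENCE = [
    (1, 15, 0.7457930, 0.0142744, 0.6738151),
    (2, 10, 0.7506419, 0.0135887, 0.6981514),
    (3, 9, 0.7533725, 0.0136544, 0.7033467),
    (4, 7, 0.7476655, 0.0137743, 0.7145392),
    (5, 6, 0.7439012, 0.0135847, 0.7221265),
    (6, 3, 0.7605784, 0.0142045, 0.7499018),
    (7, 1, 1.0000000, 0.0000896, 1.0000000),
]


def best_by_cv(sequence):
    """Row with the smallest cross-validated relative cost."""
    return min(sequence, key=lambda row: row[2])


def overlapping_with_best(sequence):
    """Tree numbers whose (cost - value) lies at or below the best tree's
    (cost + value), i.e. whose assumed intervals overlap the best tree's."""
    best = best_by_cv(sequence)
    upper = best[2] + best[3]
    return [row[0] for row in sequence if row[2] - row[3] <= upper]


def resubstitution_strictly_decreases(sequence):
    """True if resubstitution cost falls every time the tree gets larger."""
    costs = [row[4] for row in sorted(sequence, key=lambda row: row[1])]
    return all(a > b for a, b in zip(costs, costs[1:]))


if __name__ == "__main__":
    print(best_by_cv(TREE_SEQUENCE)[0])                      # prints 5
    print(overlapping_with_best(TREE_SEQUENCE))              # prints [1, 2, 3, 4, 5, 6]
    print(resubstitution_strictly_decreases(TREE_SEQUENCE))  # prints True
```

Tree 5, the row flagged with ** in the table, has the lowest cross-validated relative cost, but under the interval assumption trees 1 through 6 all overlap it. That is consistent with the earlier point that a different cross-validation run could select a different tree. The resubstitution cost, by contrast, falls steadily from the 1-node tree to the 15-node tree, so it cannot be used on its own to decide how far to prune.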