In their 1984 monograph, Classification and Regression Trees, Breiman, Friedman, Olshen and Stone discussed at length the need to obtain “honest” estimates of the predictive accuracy of a tree-based model. At the time the monograph was written, many data sets were small, so the authors took great pains to work out an effective way to use cross-validation with CART trees. The result was a major advance for data mining, introducing ideas that at the time were radically new. The main point of the discussion was that the only way to avoid overfitting was to rely on test data. With plentiful data we can always reserve a portion for testing, but with fewer data we might have to rely on cross-validation. In either case, however, only the test or cross-validated results should be trusted. In contrast, earlier approaches tended to rely on performance results from the training data and to ignore test data.
In 1995 Leo Breiman was actively experimenting with his first version of the bagger, and at that time I was in constant contact with him via email. In some cases at Salford Systems we implemented ideas of Leo's as we were discussing them with him. At other times we debated certain details and exchanged ideas in a lively give and take. Leo's initial ideas always took as a given that the bagged trees needed to be pruned, and he was using 10-fold cross-validation to do so. Because this added a substantial computational burden to the process, I suggested that he use the OOB (out-of-bag) data to test and prune each bagged tree. In response, Leo began experimenting with this idea and eventually concluded that the entire training sample (both in-bag and out-of-bag) should be used to prune each bagged tree. Of course, subsequent research showed that unpruned trees were in fact ideal, and thus the topic of using OOB data for pruning trees fell by the wayside. OOB data became very important in Leo's subsequent work on RandomForests four years later.
The emails here are a selection of messages I received from Leo in mid–1995 on the topic. Unfortunately, we do not appear to have any copies of my side of the conversation. We hope to post other messages from Leo here from time to time as his remarks covered a very broad range of topics pertaining to trees and data mining.
Users of cross-validation (CV) in CART, MARS, and TreeNet have become accustomed to simply requesting this testing method when setting up a predictive model and allowing the software to take care of the details. Behind the scenes, the Salford software prepares the data automatically and uses stratified sampling to randomly assign each record to a CV bin. In this default mode the user has no influence and no control over how the bins are managed.
There will be times, however, when it is advantageous to construct these CV bins yourself. This can occur, for example, if you want to compare results across different software tools. By using the same CV bins in every modeling run, you can be sure that any differences in performance are due only to differences in modeling methods and not to the cross-validation process itself. Analysts with repeated observations on subjects will want to assign subjects rather than individual data records to CV bins, keeping all data belonging to a given subject together at all times. In data with a temporal dimension, it may be desirable to break the data into bins along the time dimension (for example, assigning the records from each calendar month to a distinct bin).
Although assigning data records to CV bins needs to be conducted with care to ensure that the right kind of balancing of the data is maintained, the process is quite simple and mechanical. In preparing the data for analysis, you need to create a new column of data in which the bin assignment for every record in the training data will be recorded. If you prefer to work with a numeric bin variable, we recommend that you use the integers from 1 through K, where K is the number of bins you want. If you prefer to work with a text or character variable to record bin assignments, then you are free to come up with any unique labels you like for the bins.
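As an illustration of the bookkeeping involved, here is a minimal Python sketch of two of the schemes mentioned above: subject-level bins numbered 1 through K, and text bins labeled by calendar month. It is not how the Salford software assigns bins. The function names, the record layout, and the column names `CVBIN` and `CVBIN_MONTH` are all hypothetical, and dates are assumed to be ISO-formatted strings (`YYYY-MM-DD`).

```python
import random


def assign_bins_by_subject(subject_ids, k, seed=0):
    """Return one bin number (1..k) per record, keeping each subject in a single bin."""
    # Unique subjects in first-seen order, so a given seed always gives the same bins.
    subjects = list(dict.fromkeys(subject_ids))
    rng = random.Random(seed)
    rng.shuffle(subjects)
    # Deal the shuffled subjects round-robin into bins 1..k.
    subject_bin = {s: (i % k) + 1 for i, s in enumerate(subjects)}
    return [subject_bin[s] for s in subject_ids]


def assign_bins_by_month(dates):
    """Return one text bin label per record: the 'YYYY-MM' prefix of an ISO date."""
    return [d[:7] for d in dates]


# Hypothetical training records with repeated observations per subject.
records = [
    {"subject": "A", "date": "2020-01-15"},
    {"subject": "A", "date": "2020-02-03"},
    {"subject": "B", "date": "2020-01-20"},
    {"subject": "C", "date": "2020-02-11"},
    {"subject": "B", "date": "2020-03-07"},
    {"subject": "D", "date": "2020-03-19"},
]

numeric_bins = assign_bins_by_subject([r["subject"] for r in records], k=2, seed=1)
month_bins = assign_bins_by_month([r["date"] for r in records])

# Record the assignments as new columns alongside the original data.
for record, nbin, mbin in zip(records, numeric_bins, month_bins):
    record["CVBIN"] = nbin
    record["CVBIN_MONTH"] = mbin
```

Note that the round-robin deal balances the number of subjects per bin, not the number of records; if subjects contribute very different numbers of records, the bins may still be uneven in size.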
Once the bin variable has been created, you can use the GUI to let CART, MARS, or TreeNet know that you intend to use your own CV bin variable, on the TEST tab of the MODEL setup dialog. In the screen capture below you can see that we have selected “Variable determines cross-validation bins” as our test method.

Alternatively, you can issue a command of the form:
which is the command that would be generated by the GUI for you from the screen above.
If you are going to create a CVBIN variable, we recommend that you create several versions using different random number seeds, because the results will vary across different partitions of the data and it is useful to know how sensitive the results are to the partitioning.
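One way to prepare several versions is sketched below in Python, assuming a categorical target and a simple stratified assignment that spreads each target class evenly over the bins. This is an illustrative scheme, not the algorithm used inside the Salford software, and the function name and the column names `CVBIN1`, `CVBIN2`, `CVBIN3` are hypothetical.

```python
import random
from collections import defaultdict


def stratified_bins(targets, k, seed):
    """Assign each record a bin in 1..k, spreading every target class across the bins."""
    rng = random.Random(seed)
    # Collect the record positions belonging to each target class.
    by_class = defaultdict(list)
    for idx, y in enumerate(targets):
        by_class[y].append(idx)
    bins = [0] * len(targets)
    for indices in by_class.values():
        rng.shuffle(indices)
        # Random starting bin, so leftover records do not always land in bin 1.
        offset = rng.randrange(k)
        for pos, idx in enumerate(indices):
            bins[idx] = (pos + offset) % k + 1
    return bins


# Hypothetical binary target: 12 records of class 0 and 6 of class 1.
targets = [0] * 12 + [1] * 6

# One bin column per seed; each column is a different partition of the same data.
cvbin_versions = {
    f"CVBIN{seed}": stratified_bins(targets, k=3, seed=seed) for seed in (1, 2, 3)
}
```

Running the same model once per column and comparing the cross-validated results gives a rough sense of how much they depend on the particular partition.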