Random Forests OOB vs. Test Partition Performance
Random Forests is unusual among learning machines in that it has no need of an explicit test sample, because of its use of bootstrap sampling for every tree. Bootstrap sampling ensures that each tree in the forest is built on about 63% of the available data, leaving the remaining approximately 37% for testing [the OOB (out-of-bag) data].
Since the OOB 37% of the data available for testing changes from tree to tree, some care needs to be taken to leverage it into an estimate of the accuracy of the forest on new data. To accomplish this we construct a record-specific forest for every row of data in the learning sample, extracting from the full forest just the subset of trees for which the row in question was OOB. If a record was not used to grow a given tree, then the record can be treated as a true test record with respect to that tree. If we grow 500 trees, then on average a record will be OOB for about .37*500=185 trees; if we use just these trees to generate predictions, then we have what appears to be an honest estimate of the performance of our forest for that record. Repeating this process for every record in the learning sample yields an assessment for the learning sample overall.
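The per-record OOB mechanism described above can be sketched directly. The following is a minimal illustration (not production code) that builds a small bootstrap forest of scikit-learn regression trees by hand; all variable names and settings here are illustrative assumptions, not from the original article:

```python
# Minimal sketch of record-specific OOB prediction: each tree is grown on a
# bootstrap sample, and each record is scored only by the trees for which
# it was out-of-bag. Dataset and hyperparameters are illustrative.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X, y = make_regression(n_samples=200, n_features=5, noise=1.0, random_state=0)
n, n_trees = len(X), 500

trees, oob_masks = [], []
for _ in range(n_trees):
    # Bootstrap sample: ~63% of rows land in-bag, the rest are out-of-bag.
    idx = rng.randint(0, n, n)
    in_bag = np.zeros(n, dtype=bool)
    in_bag[idx] = True
    tree = DecisionTreeRegressor(max_features="sqrt", random_state=rng)
    trees.append(tree.fit(X[idx], y[idx]))
    oob_masks.append(~in_bag)

oob_masks = np.array(oob_masks)                 # shape (n_trees, n)
preds = np.array([t.predict(X) for t in trees])  # shape (n_trees, n)

# For each record, average predictions from only the trees where it was OOB.
oob_prediction = np.nanmean(np.where(oob_masks, preds, np.nan), axis=0)

# Each record is OOB for roughly 0.37 * 500 = 185 trees, as in the text.
print("average OOB trees per record:", oob_masks.sum(axis=0).mean())
```

With 500 trees, every record is virtually certain to be OOB for at least some trees, so the per-record OOB prediction is defined for the whole learning sample.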
Regression models are typically evaluated by mean squared error or mean absolute error. Binary classification models are normally evaluated by AUC/ROC (area under the ROC curve), and multinomial classification models are evaluated by some averaging of the misclassification rates across the classes.
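For the binary classification case, scikit-learn's Random Forest exposes the per-record OOB votes directly, so the OOB AUC can be computed in a few lines. This is a hypothetical sketch (dataset and settings are assumptions for illustration):

```python
# Scoring OOB predictions with AUC, the metric named above for binary
# classification. oob_decision_function_[i] is the class-probability vote
# aggregated over only the trees for which record i was out-of-bag.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=500, random_state=0)
rf = RandomForestClassifier(n_estimators=500, oob_score=True,
                            random_state=0).fit(X, y)

oob_auc = roc_auc_score(y, rf.oob_decision_function_[:, 1])
print(f"OOB AUC: {oob_auc:.3f}")
```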
The question we address here is whether the OOB performance measures can be expected to be reasonably similar to the results we would obtain if we decided to allocate some data to a test partition. The simple answer is that some differences are in fact to be expected for the following reasons:
OOB predictions are based on a rather small subsample of the trees in the forest and are thus at a disadvantage relative to predictions that can be legitimately based on the entire forest. We would expect that a prediction based on 500 trees would be (on average) more accurate than one based on a subset of 185 of those trees.
OOB predictions are expected to exhibit worse performance because their distribution is not the same as the distribution on which the individual trees are grown. This difference is behind the .632 bootstrap, which estimates the prediction error of a single model as a weighted average of the OOB error rate obtained from repeated bootstrap sampling and the naive error rate based on the training data (see Efron and Tibshirani, 1993, p. 253). As they say, "the bootstrap samples used to compute the [OOB error rate] are further away on the average than a typical test sample, by a factor of 1/.632." In other words, Efron and Tibshirani argue that OOB-based error estimates tend to be pessimistic.
Both reasons lead us to expect that OOB results will be pessimistic - but typically only mildly so. A later paper by Leo Breiman (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.45.3712&rep=rep1&type=pdf) shows experimental results comparing OOB and test sample error estimates.
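The comparison discussed above is easy to reproduce empirically. The following sketch, assuming scikit-learn's RandomForestClassifier on a synthetic dataset (all settings are illustrative), puts the OOB estimate side by side with a held-out test partition:

```python
# Compare the OOB accuracy estimate against a 30% held-out test partition.
# Per the discussion above, the two should be close, with OOB often
# slightly pessimistic.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0)

rf = RandomForestClassifier(n_estimators=500, oob_score=True,
                            random_state=0).fit(X_tr, y_tr)

print(f"OOB accuracy:  {rf.oob_score_:.3f}")
print(f"Test accuracy: {rf.score(X_te, y_te):.3f}")
```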