E-mail Conversation with Leo Breiman on using Out of Bag (OOB) data for pruning Bagged Trees
In 1995 Leo Breiman was actively experimenting with his first version of the bagger, and at that time I was in constant contact with him via email. In some cases at Salford Systems we implemented Leo's ideas as we were discussing them with him. At other times we debated certain details and exchanged ideas in a lively give and take. Leo's initial ideas always took as a given that the bagged trees needed to be pruned, and he was using 10-fold cross validation to do so. Because this added a substantial computational burden to the process, I suggested that he use the OOB (out of bag) data to test and prune each bagged tree. In response, Leo began experimenting with this idea and eventually concluded that the entire training sample (both in-bag and out of bag) should be used to prune each bagged tree. Of course, subsequent research showed that unpruned trees were in fact ideal, and thus the topic of using OOB data for pruning trees fell by the wayside. OOB data became very important in Leo's subsequent work on Random Forests four years later.
The emails here are a selection of messages I received from Leo in mid–1995 on the topic. Unfortunately, we do not appear to have any copies of my side of the conversation. We hope to post other messages from Leo here from time to time as his remarks covered a very broad range of topics pertaining to trees and data mining.
Date: Tue, 6 Jun 1995 10:07
I've been doing revisions on my bagging paper (it's been accepted by Machine Learning). What I've noticed is that on some data sets the setting of the minimum node size can have an appreciable effect on the error rate. I suggest that the default for minimum node size be put at one.
Subject: Re: BAGGER
The idea of using the unused portion of the bootstrap as a test sample, perhaps to select a pruned subtree, is intriguing. Bagging is pretty compute intensive, and that idea, if it works, would save a lot of time. I'll give it a spin soon.
The reason I suggested not emphasizing class probability trees is that I haven't seen any convincing evidence that they produce better estimates of class probabilities than straight misclassification pruning. A good example would convince me otherwise. If you want accurate estimates of class probabilities, go to gagging!
(I mean bagging!)
Remember my run in the bagging paper which shows a very large increase in accuracy of class probability estimates if bagging is used. I've got a revision of the bagging paper coming out in the next few days and will send you a copy when it comes off the press.
I implemented your idea of using the cases left out in the bootstrap sample to use as a test set in trimming the tree and ran it on a bunch of data sets. So here is the story:
It works pretty well, but in almost every case the error using it is slightly higher than using CV or using a fixed proportion of the bootstrap sample as a test set (for large data sets). The idea is really promising but I think that we don't yet know what the right way is to implement it. The advantage is that it reduces computation by an order of magnitude—but I don't think that's enough to offset the (small) loss in accuracy in today's market of cheap cycles.
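A note on why the "unused portion of the bootstrap" is large enough to serve as a test set: a bootstrap sample of size n drawn with replacement leaves each case out with probability (1 - 1/n)^n, which approaches 1/e (about 37%) as n grows. A minimal simulation sketch (sample size and trial count are arbitrary choices, not values from the emails):

```python
import random

random.seed(0)
n = 1000        # size of the original training sample (arbitrary)
trials = 200    # number of bootstrap replicates to average over

oob_fractions = []
for _ in range(trials):
    # Draw a bootstrap sample: n case indices drawn with replacement.
    in_bag = {random.randrange(n) for _ in range(n)}
    # Cases never drawn are "out of bag" and can serve as a test set.
    oob_fractions.append((n - len(in_bag)) / n)

avg = sum(oob_fractions) / trials
print(f"average OOB fraction: {avg:.3f}")  # should land near 1/e ~ 0.368
```

So each bagged tree comes with a built-in test set of roughly a third of the data, at no extra sampling cost.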
Here is what works like a charm in bagging. Construct the tree using the bootstrap sample and then select the pruned subtree using the original data as a test sample. Not only does it work in practice, but there are also good theoretical reasons why it works.
The good thing about this is that it cuts the amount of computing needed in bagging by a factor of ten.
Subject: Re: bagging and testing
Glad the tutorial went well. Nothing like thousands of happy CART users.
The reason using the original data works well is this: The bagged sample used to grow the tree is drawn from the underlying bootstrap probability distribution that assigns probability 1/(sample size) to each case in the data set.
To prune the tree grown on the bootstrap sample, you want a test set drawn from the same underlying distribution—i.e. another bootstrap sample. But the test set should be as large as possible. So consider drawing an almost infinite bootstrap sample from the bootstrap probability distribution. In this latter case every case in the original sample appears a number of times nb(case id.) in the bootstrap sample. All of the nb are large, and a little argument based on the binomial (or using intuition) shows that the ratio of the nb for different cases goes to one. Ergo, using an "infinite" bootstrap sample is equivalent to using the original data set.
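Leo's argument is easy to check numerically. In the sketch below (the sample sizes are arbitrary choices for illustration), a very large bootstrap sample is drawn from a small original sample; each case's appearance count nb concentrates around its binomial mean, so the ratio between any two counts sits close to one—which is why weighting cases by nb is nearly the same as weighting every original case equally:

```python
import random
from collections import Counter

random.seed(1)
n = 50           # original sample size (arbitrary)
m = 2_000_000    # the "almost infinite" bootstrap sample (arbitrary)

# nb[i] = number of times original case i appears in the huge
# bootstrap sample; each nb[i] is Binomial(m, 1/n) with mean m/n.
nb = Counter(random.choices(range(n), k=m))

# All counts are large and nearly equal, so their ratios approach one:
# testing on this sample is effectively testing on the original data.
ratio = max(nb.values()) / min(nb.values())
print(f"max/min appearance ratio: {ratio:.3f}")  # close to 1
```

Increasing m drives the ratio closer to 1, matching the "infinite bootstrap sample is equivalent to the original data set" conclusion.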
Pretty neat, huh? I thought of it at night last week sitting in our hot tub.