Download Now! Free 30 Day Trial of Salford System's Predictive Modeling Suite

Upcoming Tradeshows

  • JSM
    July 28, 2012 - August 02, 2012
    San Diego, CA, Booth TBA
  • KDD
    August 12, 2012 - August 16, 2012
    Beijing, China, Booth TBA
  • Statistical Learning and Data Mining III
    October 01, 2012
    Boston, MA
  • DMA
    October 13, 2012 - October 19, 2012
    Las Vegas, NV
  • INFORMS
    October 14, 2012 - October 16, 2012
    Phoenix, AZ
View full calendar
Home Blog Dan Steinberg Dan Steinberg E-mail Conversation with Leo Breiman on using Out of Bag (OOB) data for pruning Bagged Trees

E-mail Conversation with Leo Breiman on using Out of Bag (OOB) data for pruning Bagged Trees

Written by  Dan Steinberg Tuesday, November 01 2011
Rate this item
(0 votes)

In 1995 Leo Breiman was actively experimenting with his first version of the bagger, and that at time I was in constant contact with him via email. In some cases at Salford Systems we implemented ideas of Leo's as we were discussing them with him. At other times we debated certain details and exchanged ideas in a lively give and take. Leo's initial ideas always took as a given that the bagged trees needed to be pruned and he was using 10–fold cross validation to do so. Because this added a substantial computational burden to the process I suggested that he use the OOB (out of bag) data to test and prune each bagged tree. In response, Leo began experimenting with this idea and eventually concluded that the entire training sample (both in–bag and out of bag) should be used to prune each bagged tree. Of course, subsequent research showed that unpruned trees were in fact ideal and thus the topic of using OOB data for pruning trees fell by the wayside. OOB data became very important in Leo”s subsequent work on RandomForests four years later.

The emails here are a selection of messages I received from Leo in mid–1995 on the topic. Unfortunately, we do not appear to have any copies of my side of the conversation. We hope to post other messages from Leo here from time to time as his remarks covered a very broad range of topics pertaining to trees and data mining.

From This e-mail address is being protected from spambots. You need JavaScript enabled to view it. Tue Jun 6 10:07:15 1995
Date: Tue, 6 Jun 1995 10:07:10&ndah;0700
To: dstein

Hi, Dan
I've been doing revisions on my bagging paper (its been accepted in Machine Learning). What I've noticed is that on some data sets the setting of the minimum node size can have an appreciable effect on the error rate. I suggest that the default for minimum node size be put at one.

Leo

From This e-mail address is being protected from spambots. You need JavaScript enabled to view it. Tue Jun 6 22:56:50 1995
To: dstein
Subject: Re: BAGGER

Hi, Dan
The idea of using the unused portion of the bootstrap as a test sample--maybe to select a pruned subtree is intriguing. Bagging is pretty compute intensive, and that idea, if it works, would save a lot of time. I'll give it a spin soon.

The reason I suggested not emphasizing class probability trees is that I haven't seen any convincing evidence that it produces better estimates of class probabilities that straight mis– classification pruning. A good example would convince me other– wise. If you want accurate estimates of class probabilities, go to gagging!

(I mean bagging!)

Rememebr my run in the bagging paper which shows an very large increase in aaccuracy of class proability estimnates if bagging is used. I've got a revision of the bagging paper coming out in the next few days and will send you a copy when it comes of of the press.

Leo

From This e-mail address is being protected from spambots. You need JavaScript enabled to view it. Wed Jun 7 20:04:22 1995
To: dstein

Hi, Dan

I implemented your idea of using the cases left out in the bootstrap sample to use as a test set in trimming the tree and ran it on a bunch of data sets. So here is the story:

It works pretty well, but in almost every case the error using it is slightly higher that using cv or using a fixed proportion of the bootstrap sample as a test set (for large data sets). The idea is really promising but I think that we don't know yet what the rigfht way is to implement it. The advantage is that is reduces computation by an order of magnitude—but I don't think thats enough to offset the (small) lossa in accuracy in todays market of cheap cycles.

Leo

From This e-mail address is being protected from spambots. You need JavaScript enabled to view it. Thu Jun 8 10:04:14 1995
To: dstein
Subject: bagging

Hi, Dan

Here is what works like a charm in bagging. Construct the tree using the bootstrap sample and then select the pruned subtree using the original data as a test sample. Not only does it work in practice, but there are also good theoretical reasons why it works.

The good thing about this is that it cuts the amount of computing needed in bagging by a factor of ten.

Leo

From This e-mail address is being protected from spambots. You need JavaScript enabled to view it. Mon Jun 12 08:42:23 1995
To: dstein
Subject: Re: bagging and testing

Hi, Dan
Glad the tutorial went well. Nothing like thousands of happy CART users.

The reason using the original data works well is this: The bagged sample used to grow the tree is drawn from the underlying bootstrap probability distribution that assigns probability 1/(sample size) to each case in the data set.

To prune the tree grtown on the bootstrap sample, you wasnt a test set drawn from the same underlying distribution—i.e. another bootstrap sample. But the test set should be as large as possible. So consider drawing an almost infinite bootstrap sample from the bootstrap probability distribution. In this latter case every case in the original sample appears a number of times nb(case id.) in the bootsample. All of the nb are large and a little argument based on the binomial (or using intuition) shows that the ratio of the nb for different cases goes to one. Ergo, using an “infinite” bootstrap sample is equivalent to using the original data set.

Pretty neat, huh? I thought of it at night last week sitting in our hot tub.

Leo

Dan Steinberg

Dan Steinberg

Dan Steinberg, President and Founder of Salford Systems, is a well-respected member of the statistics and econometrics communities. In 1992, he developed the first PC-based implementation of the original CART procedure, working in concert with Leo Breiman, Richard Olshen, Charles Stone and Jerome Friedman. In addition, he has provided consulting services on a number of biomedical and market research projects, which have sparked further innovations in the CART program and methodology.

Dr. Steinberg received his Ph.D. in Economics from Harvard University, and has given full day presentations on data mining for the American Marketing Association, the Direct Marketing Association and the American Statistical Association. A book he co-authored on Classification and Regression Trees was awarded the 1999 Nikkei Quality Control Literature Prize in Japan for excellence in statistical literature promoting the improvement of industrial quality control and management.