A Few Comments On Boosting Decision Trees
Boosting is a machine learning strategy that emerged shortly after researchers discovered the value of "ensembles." Ensembles are collections of models that are used as a group to make predictions (or classifications) that are often considerably more accurate than those of any individual model. The models are combined either by averaging their predictions or, for classification, by a voting scheme. Thus, if we built 101 classification models, each of which outputs a prediction of "YES" or "NO," the ensemble prediction might follow a majority-vote rule: predict "YES" for any record that receives at least 51 "YES" votes, and "NO" otherwise. Some ensemble methods use weighted voting, where the weights reflect the predictive accuracy of the individual models. In this post we want to focus on a few key ideas related to Salford products rather than on the scientific field as a whole (we will do that in another post or paper).
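The majority-vote rule above can be sketched in a few lines of Python. This is an illustrative toy, not Salford code; the function name and the simulated vote list are our own inventions.

```python
def majority_vote(predictions):
    """Combine one YES/NO vote per model into a single ensemble prediction."""
    yes_votes = sum(1 for p in predictions if p == "YES")
    # With 101 models, a majority requires at least 51 YES votes.
    return "YES" if yes_votes > len(predictions) // 2 else "NO"

# Simulated output of 101 classifiers: 51 say YES, 50 say NO.
votes = ["YES"] * 51 + ["NO"] * 50
print(majority_vote(votes))  # YES
```

A weighted-voting variant would simply sum each model's accuracy-based weight instead of counting each vote as 1.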
The CART® decision tree had an early version written by Jerome Friedman in 1975 and a completely independently constructed version developed by Leo Breiman and Chuck Stone around 1978. The two sets of ideas were then synthesized by the four authors of the CART monograph (Breiman, Friedman, Olshen, and Stone), who conducted extensive real-world research with the new technology over the subsequent years, culminating in their 1984 monograph, which reported much of their research and thought process. Not long after the publication of the monograph the authors began thinking about ways to improve and extend the original CART machinery. Friedman focused on regression models and developed the CART-inspired methodology of MARS, while Stone separately wrote a series of important papers on related spline regression themes. Olshen observed in the late 1980s that collections of trees might be the way to extract another increment of predictive performance. The challenge facing modelers such as Olshen who were thinking about ensembles in the 1980s was how to conveniently generate an ensemble. It is often hard enough to come up with one good model. How can you rapidly develop dozens, hundreds, or even thousands of decent models?
In 1995, Breiman began working on his highly influential BAGGER, short for "bootstrap aggregation," which became a widely adopted strategy for the automatic generation of ensembles of models. The BAGGER was important precisely because it solved the problem of how to obtain the models for the ensemble. Breiman's idea was based on bootstrap resampling. The bootstrap approach starts with a fixed training data set and then samples randomly from that training set with replacement, meaning that we keep drawing records at random without paying attention to whether a record was previously selected. This allows records to be chosen more than once for the new sample and, when many records are drawn, essentially guarantees that some will be. The beauty of a bootstrap sample of the same size as the original training file is that statistical theory supports the notion that the new sample is a reasonable facsimile of the original while assuredly different from it. We are thus perturbing the training data in a way that, on average, does not alter the essential relationships between the predictors and the target we want to predict.
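Sampling with replacement is simple enough to show directly. The sketch below is a generic illustration (the function name and fixed random seed are ours, chosen for reproducibility):

```python
import random

def bootstrap_sample(training_set, rng=random.Random(0)):
    """Draw a resample of the same size as the original, with replacement."""
    n = len(training_set)
    return [training_set[rng.randrange(n)] for _ in range(n)]

data = list(range(100))
resample = bootstrap_sample(data)
# Same size as the original; some records repeat, others are left out entirely.
```

On average a bootstrap sample of this kind contains only about 63% of the distinct original records, with the rest of the slots filled by repeats.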
The magic of the bootstrap resample is that we can repeat the resampling process automatically, as often as we like, generating a set of revised training files that are all drawn from the same distribution yet all different from each other. Each automatically generated bootstrap resample can then be used to train our learning machine, and each cycle of training should yield an at least slightly different model. Having generated all these models, all that is left is to combine the results into an ensemble prediction, and all of this can be done automatically. With an ensemble we lose the ability to easily display the underlying predictive model, but we may gain considerable predictive accuracy. Further, Breiman devoted considerable effort some years later to extracting a number of very interesting reports from the collection of models to help make ensembles easier to understand.
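Putting the two ideas together, here is a minimal bagging sketch under our own assumptions: the "learning machine" is a one-split decision stump on a single numeric predictor, trained on labeled pairs `(x, label)`, and the ensemble votes by majority. None of these names come from Salford software.

```python
import random

def train_stump(sample):
    """Fit a one-split 'stump': pick the threshold with the fewest training errors."""
    best_t, best_err = 0, float("inf")
    for t, _ in sample:
        err = sum((x > t) != label for x, label in sample)
        if err < best_err:
            best_t, best_err = t, err
    return best_t

def bag(training_set, n_models=25, rng=random.Random(1)):
    """Train each stump on its own bootstrap resample of the training data."""
    n = len(training_set)
    return [train_stump([training_set[rng.randrange(n)] for _ in range(n)])
            for _ in range(n_models)]

def vote(thresholds, x):
    """Majority vote across the bagged stumps."""
    yes = sum(x > t for t in thresholds)
    return yes * 2 > len(thresholds)

data = [(i, i >= 5) for i in range(10)]  # label is True exactly when x >= 5
stumps = bag(data)
# The ensemble should recover the true cutoff: records above it vote True.
```

Each individual stump sees a perturbed copy of the data, so the stumps disagree slightly; the vote averages away much of that variation.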
The BAGGER turned out to be a moderate "hit," and after publication of his paper on the topic in 1996 researchers started bagging everything they could lay their hands on. The method is still in favor, as we shall observe later. However, as Breiman noted in his early papers on the topic, the benefits of bagging were often modest, and he was convinced that there had to be a still better way.

At the same time as Breiman was working on the bagger, the Bell Labs researchers Freund and Schapire came up with an alternative strategy for combining more than one model. Unlike the bagger, where every model is developed separately and without reference to any of the other models, in the work of Freund and Schapire all the models are built in a disciplined sequence. We start in the same way, developing an initial model. If the model is typical of real-world work, it will not predict perfectly. So Freund and Schapire suggested re-weighting the training data to place more emphasis on the records which were misclassified by the first tree. Re-weighting the data bears some resemblance to resampling, which can also be described as a form of re-weighting; but the Freund and Schapire re-weighting is driven by the results of the first tree, while bootstrap resampling re-weights purely at random. After re-weighting, Freund and Schapire rerun the model and then combine the results of the two models to generate new predicted classifications. Any errors made by these new models induce another cycle of re-weighting, and the process can be continued at will. The Freund and Schapire method became known as "boosting," and the new method displayed the potential for dramatic improvements in classification accuracy. Breiman followed the lead of Freund and Schapire and adapted the bagger with a similar mechanism that he called ARCing. ARCing, however, never became as popular as the bagger.
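One re-weighting cycle can be written down concretely. The update formula below follows the well-known AdaBoost scheme of Freund and Schapire, shown here as a sketch for a single cycle; it assumes the current model's weighted error is strictly between 0 and 1, and the function name is our own.

```python
import math

def reweight(weights, mistakes):
    """One boosting cycle: upweight the records the latest model got wrong,
    then renormalize so the weights again sum to 1.
    Assumes the weighted error `err` satisfies 0 < err < 1."""
    err = sum(w for w, wrong in zip(weights, mistakes) if wrong)
    alpha = 0.5 * math.log((1 - err) / err)  # also the model's weight in the vote
    new = [w * math.exp(alpha if wrong else -alpha)
           for w, wrong in zip(weights, mistakes)]
    total = sum(new)
    return [w / total for w in new], alpha

# Four records with equal weights; the first was misclassified by the current tree.
weights, alpha = reweight([0.25] * 4, [True, False, False, False])
# The misclassified record now carries half of the total weight.
```

Run for many cycles, this is exactly the mechanism that lets a few stubborn records accumulate enormous weight, which is one of the failure modes discussed next.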
While boosting gained considerable attention from the research community, its limitations were quickly discovered as well. First, the classical boosting mechanism could frequently fail to converge to any single model, with substantial oscillation in the predictions made for a given record of data. Second, the upweighting of mistakes could lead to later models being dominated completely by records with extremely high weights. Third, the method was ultra-sensitive to data errors, especially in the dependent variable. In real-world applications classical boosting turned out to be unreliable and unstable, and it fell out of favor almost as rapidly as it had gained popularity. All of this was taking place in the second half of the 1990s. In 1999 Jerome Friedman revolutionized the field with his work on Stochastic Gradient Boosting, sometimes known simply as gradient boosting. This methodology, which Salford introduced to the world of data mining in 2000, and which we call TreeNet, has definitively established itself as one of the most important advances ever in predictive analytics, data mining, and machine learning.
In TreeNet®, we also build an ensemble of models, each of which is generally built on a slightly different sample (in the spirit of bootstrap resampling). And in TreeNet, as in classical boosting, the models are built sequentially, with each new model taking into account the mistakes made by earlier versions of the ensemble. But in TreeNet we do NOT reweight records. Instead, we construct residuals from the predecessor versions of the ensemble and use those residuals as the new dependent variable in a new model. Further, the "influence" of large residuals (in absolute value) can be controlled, or even eliminated. The end result is a boosting technology that is reliable, stable, robust to data errors, and phenomenally accurate.
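The "fit the residuals" idea can be sketched in miniature. This is not TreeNet itself: where TreeNet fits a small tree at each stage, the toy below fits a constant (the mean of the current residuals), and the function names and learning rate are illustrative assumptions. It does, however, show the core loop of gradient boosting for squared-error regression: each stage models what the ensemble so far still gets wrong.

```python
def fit_constant(ys):
    """A deliberately trivial base learner: predict the mean of its targets.
    (TreeNet fits a small tree at each stage; a constant stands in here.)"""
    m = sum(ys) / len(ys)
    return lambda x: m

def boost(xs, ys, n_stages=50, learn_rate=0.1):
    """Each stage fits the residuals left behind by the ensemble built so far."""
    stages, preds = [], [0.0] * len(ys)
    for _ in range(n_stages):
        residuals = [y - p for y, p in zip(ys, preds)]
        model = fit_constant(residuals)       # new model targets the residuals
        stages.append(model)
        preds = [p + learn_rate * model(x) for p, x in zip(preds, xs)]
    return lambda x: sum(learn_rate * m(x) for m in stages)

xs, ys = [1, 2, 3, 4], [2.0, 4.0, 6.0, 8.0]
f = boost(xs, ys)
# With a constant learner, predictions converge toward the target mean (5.0).
```

The small multiplier on each stage (the learning rate) is part of what makes this style of boosting stable: no single stage, and no single extreme record, can dominate the ensemble.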