In their 1984 monograph, Classification and Regression Trees, Breiman, Friedman, Olshen and Stone discussed at length the need to obtain “honest” estimates of the predictive accuracy of a tree-based model. At the time the monograph was written, many data sets were small, so the authors took great pains to work out an effective way to use cross-validation with CART trees. The result was a major advance for data mining, introducing ideas that at the time were radically new. The main point of the discussion was that the only way to avoid overfitting was to rely on test data. With plentiful data we can always reserve a portion for testing, but with fewer data we might have to rely on cross-validation. In either case, however, only the test or cross-validated results should be trusted. In contrast, earlier approaches tended to rely on the training data performance results and pay little attention to test data.
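The gap between training performance and an “honest” estimate can be illustrated with a small sketch. This is a generic k-fold cross-validation in plain Python, not the procedure from the monograph or from CART itself; the memorizing model, the toy data, and all function names are assumptions made up for illustration.

```python
from collections import Counter

def k_fold_indices(n, k):
    """Split indices 0..n-1 into k disjoint folds (interleaved assignment)."""
    return [[i for i in range(n) if i % k == fold] for fold in range(k)]

def fit_memorizer(xs, ys):
    """Hypothetical overfit model: memorizes every training pair and
    falls back to the majority training label for unseen inputs."""
    table = dict(zip(xs, ys))
    fallback = Counter(ys).most_common(1)[0][0]
    return lambda x: table.get(x, fallback)

def accuracy(model, xs, ys):
    """Fraction of records the model labels correctly."""
    return sum(model(x) == y for x, y in zip(xs, ys)) / len(xs)

def cross_validated_accuracy(xs, ys, k):
    """Average accuracy over k folds, each scored by a model that never saw it."""
    scores = []
    for test_idx in k_fold_indices(len(xs), k):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        model = fit_memorizer(train_x, train_y)
        test_x = [xs[i] for i in test_idx]
        test_y = [ys[i] for i in test_idx]
        scores.append(accuracy(model, test_x, test_y))
    return sum(scores) / len(scores)

# Toy data (illustrative only): 20 records with alternating labels.
xs = list(range(20))
ys = [i % 2 for i in xs]

# Scoring the model on the same data it was fit to looks perfect...
training_accuracy = accuracy(fit_memorizer(xs, ys), xs, ys)
# ...while the cross-validated score reflects performance on unseen records.
cv_accuracy = cross_validated_accuracy(xs, ys, k=5)

print("training accuracy:       ", training_accuracy)
print("cross-validated accuracy:", cv_accuracy)
```

The memorizing model scores perfectly on its own training data yet does no better than a constant guess on held-out folds, which is why only the test or cross-validated figure is treated as trustworthy.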
At Salford Systems we are frequently asked what the difference is between the trademarked decision tree CART® and the various clones that have been created by other companies or contributed as user-written packages to community-oriented systems. Our website contains a variety of essays and FAQs on this matter, and we’ve linked to them below. But here is a very brief summary of the details:
The original and true CART was written entirely by Stanford University Professor Jerome H. Friedman, and has always been proprietary source code available only to Salford Systems. Friedman is one of the inventors of CART and widely regarded as one of the most influential and important researchers in data mining. He is also considered one of the world's best algorithm writers and scientific programmers. In other words, we offer the only true CART written by a creator of this revolutionary technology. It contains everything discussed in the original CART monograph and much more that was not touched upon in the book.
Boosting is a machine learning strategy that came into being shortly after researchers discovered the value of “ensembles.” Ensembles are collections of models which are used as a group to make predictions (and classifications) that are often considerably more accurate than individual models. The models are combined either by averaging predictions or using a voting scheme (for classification). Thus, if we built 101 classification models where the output of each model is a prediction of “YES” or “NO” then the ensemble prediction might follow a majority vote rule: predict YES for any record that obtains at least 51 YES votes, and predict “NO” otherwise. Some ensemble methods use weighted voting where the weights reflect the predictive accuracy of the individual models. In this post we want to focus on a few key ideas related to Salford products rather than the scientific field (we will do that in another post or paper).
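The combination rules described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not code from any Salford product: the function names are hypothetical, and the weighted rule (predict YES when the YES votes carry more than half of the total weight) is one plausible reading of “weighted voting.”

```python
def majority_vote(votes, positive="YES", negative="NO"):
    """Predict `positive` when it receives a strict majority of the votes.

    With 101 models the threshold is 101 // 2 + 1 = 51 YES votes.
    """
    yes_votes = sum(1 for v in votes if v == positive)
    threshold = len(votes) // 2 + 1
    return positive if yes_votes >= threshold else negative

def weighted_vote(votes, weights, positive="YES", negative="NO"):
    """Predict `positive` when its votes carry more than half the total weight.

    The weights are assumed to reflect each model's predictive accuracy.
    """
    if len(votes) != len(weights):
        raise ValueError("votes and weights must have the same length")
    yes_weight = sum(w for v, w in zip(votes, weights) if v == positive)
    return positive if yes_weight > sum(weights) / 2 else negative

def average_prediction(predictions):
    """Combine numeric predictions (e.g. regression outputs) by averaging."""
    return sum(predictions) / len(predictions)

# 101 models: 51 YES votes are just enough for the ensemble to predict YES.
print(majority_vote(["YES"] * 51 + ["NO"] * 50))
# One accurate model (weight 0.9) outvotes two weaker ones (weight 0.3 each).
print(weighted_vote(["YES", "NO", "NO"], [0.9, 0.3, 0.3]))
```

Note that in the weighted example a simple majority vote would have predicted NO; weighting lets a single strong model override several weak ones.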
In 1995 Leo Breiman was actively experimenting with his first version of the bagger, and at that time I was in constant contact with him via email. In some cases at Salford Systems we implemented ideas of Leo's as we were discussing them with him. At other times we debated certain details and exchanged ideas in a lively give and take. Leo's initial ideas always took as a given that the bagged trees needed to be pruned, and he was using 10-fold cross-validation to do so. Because this added a substantial computational burden to the process, I suggested that he use the OOB (out-of-bag) data to test and prune each bagged tree. In response, Leo began experimenting with this idea and eventually concluded that the entire training sample (both in-bag and out-of-bag) should be used to prune each bagged tree. Of course, subsequent research showed that unpruned trees were in fact ideal, and thus the topic of using OOB data for pruning trees fell by the wayside. OOB data became very important in Leo's subsequent work on RandomForests four years later.
The emails here are a selection of messages I received from Leo in mid–1995 on the topic. Unfortunately, we do not appear to have any copies of my side of the conversation. We hope to post other messages from Leo here from time to time as his remarks covered a very broad range of topics pertaining to trees and data mining.
When we began work on our first release of CART (a 1993 command-line version running very nicely on UNIX), I was startled by some (now long forgotten) articles claiming to describe a new technology that was more accurate, or faster, on some class of analytic problem. At the time, I assumed that such articles needed to be taken seriously because they represented peer-reviewed, solidly researched scientific advances.
SAN DIEGO – Salford Systems, a pioneer in developing data mining and predictive analytics software, has once again provided the winning technology in a major competitive analytics event, this time at the 2010 Direct Marketing Association (DMA) Analytic Challenge, sponsored by the CAC Group, Inc.