The theory behind the CART decision tree, as laid out in detail in the classic monograph Classification and Regression Trees by Breiman, Friedman, Olshen, and Stone, dictates that CART trees always be grown to their largest possible size before pruning. This means that the smallest allowed terminal node will have only one record in it! In theory, this is not a problem because large CART trees are just the raw material that the pruning engine starts with to arrive at the optimal tree. Thus, it is likely that all the small nodes will be pruned off anyway.
In practice, however, real world data sets do not always work that way. In fact, in small to moderate sized data sets it can pay to manipulate the controls that govern the sample sizes allowed in CART nodes. In this note we discuss this topic and provide some examples and instruction on how to use the controls.
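To make the idea of a node size control concrete, here is a minimal Python sketch of a split search on a single numeric predictor that refuses any split leaving fewer than `min_child` records in a child node. It is an illustration only, not SPM's implementation: the function names, the `min_child` parameter, and the toy data are all assumptions made for this example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(xs, ys, min_child=1):
    """Return (weighted impurity, threshold) for the best split on one
    numeric predictor, or None if no split leaves at least `min_child`
    records in both children. `min_child` is a hypothetical control."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between two equal values
        if i < min_child or n - i < min_child:
            continue  # one of the children would be too small
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (i * gini(left) + (n - i) * gini(right)) / n
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        if best is None or score < best[0]:
            best = (score, threshold)
    return best

# Toy data: a single "a" record sits at the low end of the predictor.
xs = [1, 2, 3, 4, 5, 6]
ys = ["a", "b", "b", "b", "b", "b"]

unconstrained = best_split(xs, ys, min_child=1)  # isolates the lone record
constrained = best_split(xs, ys, min_child=2)    # must keep 2+ per child
```

With `min_child=1` the search isolates the single "a" record in its own node; with `min_child=2` it has to settle for a less pure split that keeps at least two records on each side.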
In their 1984 monograph, Classification and Regression Trees, Breiman, Friedman, Olshen and Stone discussed at length the need to obtain “honest” estimates of the predictive accuracy of a tree–based model. At the time the monograph was written, many data sets were small, so the authors took great pains to work out an effective way to use cross–validation with CART trees. The result was a major advance for data mining, introducing ideas that at the time were radically new. The main point of the discussion was that the only way to avoid overfitting was to rely on test data. With plentiful data we can always reserve a portion for testing, but with fewer data we might have to rely on cross–validation. In either case, however, only the test or cross–validated results should be trusted. In contrast, earlier approaches tended to rely on the training data performance results and to neglect test data.
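The gap between training results and cross–validated results can be seen in a small Python sketch. It uses a 1-nearest-neighbour lookup as a stand-in for an overgrown model, because such a model reproduces its training data perfectly; it is not a CART tree, and the function names, the interleaved (unshuffled) fold assignment, and the toy data are assumptions made for this example.

```python
def k_fold_indices(n, k):
    """Partition range(n) into k interleaved test folds (no shuffling)."""
    return [list(range(i, n, k)) for i in range(k)]

def fit_1nn(xs, ys):
    """Fit a 1-nearest-neighbour lookup, which memorises its training data."""
    train = list(zip(xs, ys))
    def predict(x):
        return min(train, key=lambda p: abs(p[0] - x))[1]
    return predict

def error_rate(predict, xs, ys):
    """Fraction of records that `predict` gets wrong."""
    return sum(predict(x) != y for x, y in zip(xs, ys)) / len(xs)

def cross_validated_error(xs, ys, k):
    """Error rate where each record is scored by a model that never saw it."""
    wrong = 0
    for test_idx in k_fold_indices(len(xs), k):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        predict = fit_1nn(train_x, train_y)
        wrong += sum(predict(xs[i]) != ys[i] for i in test_idx)
    return wrong / len(xs)

# Toy data with noisy labels.
xs = list(range(10))
ys = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1]

train_error = error_rate(fit_1nn(xs, ys), xs, ys)  # scored on data it has seen
cv_error = cross_validated_error(xs, ys, k=5)      # scored on held-out data
```

The training error is zero because every record is its own nearest neighbour, while the cross–validated error is not, which is the reason only the held-out figure should be trusted.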
Learn how to grow CART trees, view tree details, understand CART's color–coding mechanism and print your results.
Salford Systems Predictive Modeler, including CART®, MARS®, TreeNet®, and RandomForests®, can handle any number of variables you care to work with. By default, the software launches prepared to work with up to 32,768 variables, which is sufficient for many users. If you need to work with more, you just need to tell the software when the application is launched.
If you are working with the non–GUI version, you use command line arguments to inform the application of your preferences. For example, the command line syntax is:
SPM.EXE -v<N>    Specifies a maximum of N variables for the session.
With the GUI version you do essentially the same thing, adding the command line arguments by modifying the properties of the application.
Follow these steps, for example, to tell SPM that you expect to work with up to 50,000 variables:
The value used for this parameter sets the number of variables that can be used in the application. For example, if you need to use 75,000 variables, you would set this parameter to -V75000.
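As a rough illustration of how the launch argument is put together, the Python sketch below builds the argument list for a session with a given variable limit. The helper name `spm_launch_args` and the executable path are hypothetical, and the flag spelling follows the 75,000-variable example above. The sketch only constructs the arguments; it does not start SPM.

```python
def spm_launch_args(executable, max_variables):
    """Build an argument list of the form [executable, "-V<N>"].

    `executable` is a hypothetical path to the SPM program; the flag
    spelling is taken from the example in the text, not verified here.
    """
    if max_variables < 1:
        raise ValueError("max_variables must be a positive integer")
    return [executable, f"-V{max_variables}"]

# Hypothetical install location, used for illustration only.
args = spm_launch_args(r"C:\SPM\SPM.EXE", 75000)

# To actually launch, one could pass the list to subprocess.run(args).
```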
There are two ways to interpret your question:
Does CART® allow multi–class targets (e.g., a class label with values 1, 2, 3, etc.)?
CART has been used in real world classification problems with more than 400 classes.
In one project our goal was to predict which specific model of new car a given person actually bought. In the project there were more than 400 different car models available and the predictors were drawn from a lengthy set of attitude and interest questions.
For such models to be useful you need to have a decent sample size for each level of the target. In the car purchase study some models had been bought by more than 2000 people (a good sample size) while some exotic and expensive cars had been bought by fewer than 10 people (the total sample size was over 50,000 records). Naturally, we could not place much faith in the predictions concerning the least frequently bought cars. However, overall, the models built were both quite accurate and generated considerable insight into the factors influencing consumer choice in car purchases.
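Before modeling a target with many levels, it can help to check how many records each level has. The Python sketch below counts records per class and lists the classes that fall under a chosen minimum; the function name, the threshold, and the toy labels are assumptions for illustration and are not taken from the car purchase study.

```python
from collections import Counter

def thin_classes(labels, min_records):
    """Return the sorted class labels that have fewer than `min_records` records."""
    counts = Counter(labels)
    return sorted(label for label, n in counts.items() if n < min_records)

# Toy target column: one class is far rarer than the others.
labels = ["sedan"] * 5 + ["suv"] * 3 + ["exotic"] * 1

rare = thin_classes(labels, min_records=2)
```

Predictions for any class that appears in `rare` deserve less trust, in the same way as the least frequently bought cars in the study.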
The most recent versions of Salford Predictive Modeler™ SPM PRO EX include a new BATTERY to invoke bootstrapped replication of most model types available in SPM. One of our reasons for adding this BATTERY was to provide access to the full CART engine when generating RandomForests® (RF) models. The principal advantages of this are:
Breiman’s original RF uses a stripped down and simplified tree growing algorithm designed for speed. It lacks tree growing options and missing value handling, and for many users Breiman's RF is confined to classification problems. By accessing the full CART engine with all of its Salford extensions and customized controls, modelers can accomplish far more sophisticated analyses, handle missing values with surrogates, and apply penalties and constraints. Most importantly for those interested in continuous dependent variables, BATTERY BOOTSTRAP gives access to both Least Squares (LS) and Least Absolute Deviation (LAD) regression trees.
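The practical difference between the two regression criteria shows up in how a terminal node is summarized. Under the standard definitions, an LS node predicts the mean of its target values and is scored by squared deviations, while an LAD node predicts the median and is scored by absolute deviations. The Python sketch below applies those definitions to a single node; the function names and the toy values are assumptions for illustration, not SPM code.

```python
from statistics import mean, median

def node_prediction(values, criterion):
    """Node summary: the mean for "LS", the median for "LAD"."""
    if criterion == "LS":
        return mean(values)
    if criterion == "LAD":
        return median(values)
    raise ValueError("criterion must be 'LS' or 'LAD'")

def node_loss(values, criterion):
    """Sum of squared (LS) or absolute (LAD) deviations from the node summary."""
    center = node_prediction(values, criterion)
    if criterion == "LS":
        return sum((v - center) ** 2 for v in values)
    return sum(abs(v - center) for v in values)

# Toy node containing one extreme target value.
node = [10, 12, 11, 13, 200]

ls_pred = node_prediction(node, "LS")    # pulled toward the extreme value
lad_pred = node_prediction(node, "LAD")  # stays with the bulk of the node
```

In this example the single extreme value drags the LS prediction far above the typical record, while the LAD prediction is unaffected by it.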
The principal drawback of BATTERY BOOTSTRAP is that the extra machinery comes with a computational price: RF runs under BATTERY BOOTSTRAP are much slower than under Breiman–RF. The extra robustness, ability to handle huge problems, and added controls should often make the slower runs worthwhile. Also note that at the moment the RF post–model visualization machinery is not available.
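The bootstrapped replication underlying this approach can be pictured with a short Python sketch: each replicate draws as many records as the data set contains, with replacement, and the records that are never drawn form the out-of-bag set for that replicate. This shows the general resampling idea only and makes no claim about how BATTERY BOOTSTRAP is implemented; the function name, the seed, and the sample sizes are assumptions.

```python
import random

def bootstrap_sample(n_records, rng):
    """Draw n_records row indices with replacement.

    Returns (in_bag, out_of_bag): the drawn indices, duplicates included,
    and the sorted indices that were never drawn.
    """
    in_bag = [rng.randrange(n_records) for _ in range(n_records)]
    out_of_bag = sorted(set(range(n_records)) - set(in_bag))
    return in_bag, out_of_bag

rng = random.Random(0)  # fixed seed so the sketch is repeatable
replicates = [bootstrap_sample(1000, rng) for _ in range(5)]

# Share of records left out of each replicate.
oob_fractions = [len(oob) / 1000 for _, oob in replicates]
```

On average a little over a third of the records are left out of any one replicate, so every model in the ensemble is built on a somewhat different view of the data.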
At Salford Systems we are frequently asked what the difference is between the trademarked decision tree CART® and the various clones that have been created by other companies or contributed as user–written packages to community–oriented systems. Our website contains a variety of essays and FAQs on this matter, and we’ve linked to them below. But here is a very brief summary of the details:
The original and true CART was written entirely by Stanford University Professor Jerome H. Friedman, and has always been proprietary source code available only to Salford Systems. Friedman is one of the inventors of CART and widely regarded as one of the most influential and important researchers in data mining. He is also considered one of the world's best algorithm writers and scientific programmers. In other words, we offer the only true CART written by a creator of this revolutionary technology. It contains everything discussed in the original CART monograph and much more that was not touched upon in the book.
SAN DIEGO — CART® and RandomForests® co–developers include two of the prominent speakers for Salford Systems’ Analytics and Data Mining Conference, which will be held in San Diego, CA May 24–25, 2012.
CART co–developer Dr. Richard Olshen’s research interests are in statistics and mathematics and their applications to medicine and biology. Much of his work has concerned binary tree–structured algorithms for classification, regression, survival analysis, and clustering. Those for classification and survival analysis have been used with success in computer–aided diagnosis and prognosis, especially in cardiology, oncology, and toxicology.