CART® vs. The Clones
At Salford Systems we are frequently asked what the difference is between the trademarked decision tree CART® and the various clones that have been created by other companies or contributed as user-written packages to community-oriented systems. Our website contains a variety of essays and FAQs on this matter, and we link to them below. But here is a very brief summary of the details:
The original and true CART was written entirely by Stanford University Professor Jerome H. Friedman, and has always been proprietary source code available only to Salford Systems. Friedman is one of the inventors of CART and widely regarded as one of the most influential and important researchers in data mining. He is also considered one of the world's best algorithm writers and scientific programmers. In other words, we offer the only true CART written by a creator of this revolutionary technology. It contains everything discussed in the original CART monograph and much more that was not touched upon in the book.
Sometimes data analysts ask if the clones could be acceptable given that the methodology is described in the CART monograph. It is important to keep these things in mind: the monograph is a conceptual work designed to convey some of the core thinking of the authors Breiman, Friedman, Olshen, and Stone. For many of the essential CART processes, it does not provide full details on how they should be programmed. Since the programming was done solely by Professor Friedman while the book was written by the other three co-authors, it should not be surprising that the book and the software sometimes diverge, and that the explanations in the monograph are sometimes cryptic, to say the least.
In short, it would be impossible to reproduce the original CART from the monograph, as the monograph is neither an engineering blueprint nor complete. The book is a brilliant discussion of many important concepts central to modern approaches to analytics, along with excellent coverage of perhaps 20 percent of what goes on in the CART software. The choice is between the code written by the inventor and internationally acclaimed data mining visionary Jerome Friedman and whatever was put together by the employees of the clone makers.
Even so, you might ask if the clones might be "good enough." We first want to point out that we used real CART to win two first-place prizes in the hotly contested KDD Cup 2000 data mining competition. And CART has played a role in other wins we have garnered during the last decade. You will not find such a track record for any decision tree knock-off.
There is also the matter of features. We have been working with the CART authors since 1990 (yes, more than twenty years of collaboration). During that time we have worked out enhancements, new controls, new splitting options, new reports, and new uses for CART, all of which are built into Salford's latest software. You can always run CART without enhancements if you like and compare the results with those obtained using some of the new features, such as our patented constraints on tree structures and new linear combination splits. The new features are part of our ongoing research and development effort to continually upgrade and improve the core algorithmic functionality and the user features. The end result is a decision tree with more features, better performance, more reliability, and more capability than you will find in any other decision tree.
Finally, there is the matter of speed, which becomes important with today's huge data sets. The true CART was written for speed from the beginning, and CART can accomplish the complete growing and pruning process remarkably quickly. We find that clones simply cannot compete. As a result, clones often rely on secret short-cuts that might not be to your liking. The most popular clone "cheats":
—Stop tree growing after a shallow tree is built to save run times
You always have the freedom to do this with the true CART. But with the clones you either don't have a choice, or the clone tries to fool you into thinking the shallow tree is good enough. True CART requires the growing of a substantially oversized tree followed by cost-complexity pruning. Only true CART can do this at lightning speed. The clones typically skip the essential details and hope you don't understand the technology well enough to realize what they have done.
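To make the grow-then-prune idea concrete, here is a minimal Python sketch of weakest-link (cost-complexity) pruning as it is commonly described in textbooks. It is an illustration only and not the CART source code: the Node class, the function names, and the error counts in the example tree are all hypothetical. For simplicity it collapses one weakest link per step, whereas a full implementation would collapse all tied nodes together.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    # Hypothetical node: `error` is the number of training records that would
    # be misclassified if this node were turned into a leaf.
    error: float
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def is_leaf(self) -> bool:
        return self.left is None


def leaves_and_error(node):
    """Return (number of leaves, summed leaf error) for the subtree at node."""
    if node.is_leaf():
        return 1, node.error
    n_left, e_left = leaves_and_error(node.left)
    n_right, e_right = leaves_and_error(node.right)
    return n_left + n_right, e_left + e_right


def weakest_link(node):
    """Return (g, node) for the internal node with the smallest
    g = (error as a leaf - error of its subtree) / (leaves in subtree - 1)."""
    if node.is_leaf():
        return None
    n_leaves, subtree_error = leaves_and_error(node)
    best = ((node.error - subtree_error) / (n_leaves - 1), node)
    for child in (node.left, node.right):
        candidate = weakest_link(child)
        if candidate is not None and candidate[0] < best[0]:
            best = candidate
    return best


def pruning_sequence(root):
    """Collapse the weakest link repeatedly, modifying the tree in place.
    Returns a list of (alpha, leaves remaining) from the full tree to the root."""
    sequence = [(0.0, leaves_and_error(root)[0])]
    while not root.is_leaf():
        alpha, node = weakest_link(root)
        node.left = node.right = None  # collapse the subtree into a leaf
        sequence.append((alpha, leaves_and_error(root)[0]))
    return sequence


def build_example_tree():
    """A made-up, deliberately oversized tree with four leaves."""
    return Node(40, Node(10, Node(4), Node(5)), Node(25, Node(5), Node(6)))


print(pruning_sequence(build_example_tree()))
# prints [(0.0, 4), (1.0, 3), (9.5, 1)]
```

Each entry pairs a complexity penalty with the size of the subtree that is optimal at that penalty. The right-sized tree is then picked from this sequence using test data or cross-validation, which is only possible if the oversized tree was grown in the first place.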
—Radically down sample from the training data set
Again, if you want to do this, you are free to do so in real CART. But some clones do not give you a choice, and they further attempt to hide this from you. Suppose you have a dataset with 100,000 records. One clone trick is to take a random sample of 5,000 records in order to search for the split at the root node. (Yes, the sampling ratio is just 5 percent.) If you think this is a good idea, you can ask the true CART to do this. But if not, true CART will use all 100,000 records to search for the best splitter. In some clones you simply do not have a choice. The reason is speed: true CART is so fast that it can do what CART is supposed to do, whereas the fakes cannot execute the core CART procedures in reasonable time, so they give you something that they can do in reasonable time. The problem is that you do not get a choice, and the sellers of the clones either try to keep this from you or preach it as a virtue. In our opinion you need the real CART, which gives you total control of when and where to sample (if ever) and when and where to cut tree growth short (if ever).
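The difference between searching all records and searching a 5 percent sample can be tried out with a small Python sketch. It is a generic, textbook-style exhaustive search for the best Gini threshold on a single numeric predictor, run on synthetic data. It is not the CART implementation and does not reproduce any particular clone; the function names, the synthetic data, and the 10 percent label noise are assumptions made purely for illustration.

```python
import random


def gini(pos, n):
    """Gini impurity of a two-class node with n records, pos of them in class 1."""
    if n == 0:
        return 0.0
    p = pos / n
    return 2.0 * p * (1.0 - p)


def best_split(xs, ys):
    """Scan every distinct cut point on one numeric predictor.
    Returns (threshold, weighted Gini impurity of the two children)."""
    order = sorted(range(len(xs)), key=xs.__getitem__)
    n, total_pos = len(xs), sum(ys)
    best_thr, best_imp = None, gini(total_pos, n)
    left_pos = 0
    for rank, i in enumerate(order[:-1], start=1):
        left_pos += ys[i]
        next_x = xs[order[rank]]
        if xs[i] == next_x:
            continue  # no cut point between equal values
        imp = (rank * gini(left_pos, rank)
               + (n - rank) * gini(total_pos - left_pos, n - rank)) / n
        if imp < best_imp:
            best_thr, best_imp = (xs[i] + next_x) / 2.0, imp
    return best_thr, best_imp


def compare_full_vs_sample(n_records=100_000, sample_size=5_000, seed=0):
    """Find the root split on all records and on a random subsample."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n_records)]
    # Synthetic target: class 1 above x = 0.5, with 10 percent of labels flipped.
    ys = [int((x > 0.5) != (rng.random() < 0.10)) for x in xs]
    idx = rng.sample(range(n_records), sample_size)
    full = best_split(xs, ys)
    sampled = best_split([xs[i] for i in idx], [ys[i] for i in idx])
    return full, sampled


if __name__ == "__main__":
    full, sampled = compare_full_vs_sample()
    print("all 100,000 records:", full)
    print("5,000-record sample:", sampled)
```

With a single strong predictor the two thresholds will usually be close. The point of the sketch is only that the choice of sample size is an explicit parameter that the analyst can see and change.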
A final note. In the end, what matters is accuracy, reliability, robustness, ease-of-use, and analytical controls. On every one of these dimensions we outperform the universe of decision trees. We are lucky to be standing on the shoulders of the giants in the fields of data mining, machine learning, and predictive analytics. Those giants have enabled us to offer the world's best data mining software. So be part of the data analysis revolution inaugurated by Breiman, Friedman, Olshen, and Stone, and go with real CART.