CART Classification And Regression Trees

Ultimate Classification Tree:
CART is the ultimate classification tree that has revolutionized the entire field of advanced analytics and inaugurated the current era of data mining. CART, which is continually being improved, is one of the most important tools in modern data mining. Others have tried to copy CART but no one has succeeded as evidenced by unmatched accuracy, performance, feature set, built-in automation and ease of use. Designed for both non-technical and technical users, CART can quickly reveal important data relationships that could remain hidden using other analytical tools.
Proprietary Code:
Technically, CART is based on landmark mathematical theory introduced in 1984 by four world-renowned statisticians at Stanford University and the University of California at Berkeley. Salford Systems' implementation of CART is the only decision tree software embodying the original proprietary code. The CART creators continue to collaborate with Salford Systems to continually enhance CART with proprietary advances.
Fast and Versatile:
Patented extensions to CART are specifically designed to enhance results for market research and web analytics. CART supports high-speed deployment, allowing Salford models to predict and score in real time on a massive scale. Over the years CART has become known as the fastest and most versatile predictive modeling algorithm available to analysts. It is also used as a foundation for many modern data mining approaches based on bagging and boosting.

 

 


Features

CART Features available in Basic, Pro, ProEx, and Ultra.


Components by edition (Basic, Pro, ProEx, Ultra)

Available in all editions (Basic, Pro, ProEx, and Ultra):

  • Modeling Engine: CART (Decision Trees)
  • Linear Combination Splits
  • Optimal tree selection based on area under ROC curve

Available in Pro, ProEx, and Ultra:

  • User defined splits for the root node and its children
  • Automation: Generate models with alternative handling of missing values (Battery MVI)
  • Automation: Build a series of models using all available splitting strategies (six for classification, two for regression) (Battery RULES)
  • Automation: Build a series of models varying the depth of the tree (Battery DEPTH)
  • Automation: Build a series of models changing the minimum required size on parent nodes (Battery ATOM)
  • Automation: Build a series of models changing the minimum required size on child nodes (Battery MINCHILD)
  • Automation: Explore accuracy versus speed trade-off due to potential sampling of records at each node in a tree (Battery SUBSAMPLE)

Available in ProEx and Ultra:

  • Multiple user defined lists for linear combinations
  • Constrained trees
  • Ability to create and save dummy variables for every node in the tree during scoring
  • Report basic stats on any variable of user choice at every node in the tree
  • Comparison of learn vs. test performance at every node of every tree in the sequence
  • Hot-Spot detection to identify the richest nodes across multiple trees
  • Automation: Vary the priors for the specified class (Battery PRIORS)
  • Automation: Build a series of models limiting the number of nodes in a tree (Battery NODES)
  • Automation: Build a series of models trying each available predictor as the root node splitter (Battery ROOT)
  • Automation: Explore the impact of favoring equal sized child nodes (Battery POWER)
  • Automation: Build a series of models by progressively removing misclassified records, thus increasing the robustness of trees and possibly reducing model complexity (Battery REFINE)
  • Automation: Bagging and ARCing using the legacy code (COMBINE)

Available in Ultra only:

  • Build a CART tree utilizing the TreeNet engine to gain speed as well as alternative reporting
  • Build a Random Forests model utilizing the CART engine to gain alternative handling of missing values via surrogate splits (Battery BOOTSTRAP RSPLIT)


Requirements

Minimum System Requirements for Windows

We suggest the following minimum and recommended system requirements:

  • 80486 processor or higher.
  • 512 MB of random-access memory (RAM). This value depends on the "size" you have purchased (64 MB, 128 MB, 256 MB, 512 MB, 1 GB). While all versions may run with a minimum of 32 MB of RAM, we CANNOT GUARANTEE they will. We highly recommend that you follow the recommended memory configuration that applies to the particular version you have purchased. Using less than the recommended memory configuration results in hard drive paging, which significantly reduces performance, or in application instability.
  • Hard disk with 40 MB of free space for program files, data file access utility, and sample data files.
  • Additional hard disk space for scratch files (with the required space contingent on the size of the input data set).
  • CD-ROM or DVD drive.
  • Windows XP/2003/2008 and Windows 7.

Recommended System Requirements

Because Salford tools are extremely CPU intensive, the faster your CPU the faster they will run. For optimal performance, we strongly recommend they run on a machine with a system configuration equal to, or greater than, the following:

  • Pentium 4 processor running 2.0+ GHz.
  • 2 GB of random-access memory (RAM). This value depends on the "size" you have purchased (64 MB, 128 MB, 256 MB, 512 MB, 1 GB). While all versions may run with a minimum of 32 MB of RAM, we CANNOT GUARANTEE they will. We highly recommend that you follow the recommended memory configuration that applies to the particular version you have purchased. Using less than the recommended memory configuration results in hard drive paging, which significantly reduces performance, or in application instability.
  • Hard disk with 40 MB of free space for program files, data file access utility, and sample data files.
  • Additional hard disk space for scratch files (with the required space contingent on the size of the input data set).
  • CD-ROM or DVD drive.
  • Windows XP/2003/2008 and Windows 7.
  • 2 GB of additional hard disk space available for virtual memory and temporary files.

Ensuring Proper Permissions

If you are installing on a machine that uses security permissions, please read the following note.

  • You must belong to the Administrator group on Windows XP, Windows 2003/2008, and Windows 7 to be able to properly install and license the application. Once the application is installed and licensed, any user with read/write/modify permissions to the application's /bin and temp directories can run the application.

Licensing Application

CART is licensed through an application system ID and an associated unlock key. When installation is complete, the user will need to email the application "system ID." This system ID is clearly shown in the License Information window that appears the first time the application is started. You can also open this window by selecting the Help->License menu option.

Method 1: Fixed License
With a fixed license, each machine must have its own copy of the licensed program installed. If your license terms permit more than one copy, then the license must be activated on each machine that will be used.

Method 2: Floating License
This method of licensing is used if you intend the application to be used by more than one user concurrently over a network. A floating license tracks the number of copies "checked out." When that number exceeds your license terms, a message informs the user that "all copies are checked out." The licensed program may be installed on a machine that each client machine can access. Machines that are not connected to the network must be issued a fixed license (Method 1 above).

A floating license is particularly useful when the number of potential users exceeds the number of seats specified in your license terms.
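
The check-out behavior described above can be illustrated with a small seat counter. This is only a sketch of the general idea, not Salford Systems' license manager: the class name `FloatingLicense`, its methods, and the user names are hypothetical and chosen for illustration.

```python
class FloatingLicense:
    """Hypothetical seat counter illustrating floating-license check-out."""

    def __init__(self, seats):
        self.seats = seats          # number of concurrent copies the license permits
        self.checked_out = set()    # users currently holding a copy

    def check_out(self, user):
        """Give the user a copy if a seat is free; return True on success."""
        if user in self.checked_out:
            return True             # this user already holds a copy
        if len(self.checked_out) >= self.seats:
            print("all copies are checked out")
            return False
        self.checked_out.add(user)
        return True

    def check_in(self, user):
        """Return the user's copy so another user can check it out."""
        self.checked_out.discard(user)


if __name__ == "__main__":
    lic = FloatingLicense(seats=2)
    lic.check_out("analyst_a")
    lic.check_out("analyst_b")
    lic.check_out("analyst_c")      # refused: both seats are in use
    lic.check_in("analyst_a")
    lic.check_out("analyst_c")      # succeeds once a seat is free
```

With three potential users and two seats, the third user is refused only while both seats are in use, which is the situation where a floating license is most useful.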

Minimum System Requirements for UNIX/Linux

Supported Architectures

  • Alpha: DEC 3000 or AlphaServer running Tru64 UNIX 5.0 or higher
  • Linux/i386: i586 or higher processor; Linux 2.4 or higher kernel; glibc 2.3 or higher
  • Linux/AMD64: AMD64 or Intel EM64T processor; Linux 2.6 or higher kernel; glibc 2.3 or higher
  • Sun: UltraSPARC processor; Solaris 2.6 or higher
  • RS/6000: POWER or PowerPC processor; AIX 4.2 or higher
  • HP 9000: PA/RISC 1.1 or higher processor; HP/UX 11.x
  • SGI: MIPS 4 or higher processor; IRIX 6.5

Minimum System Requirements

  • The minimum requirement for all non-GUI applications is 32 MB of random-access memory (RAM). This value depends on the "size" you have purchased (64 MB, 128 MB, 256 MB, 512 MB, 1 GB).
  • Hard disk with 40 MB of free space for program files, data file access utility, and sample data files.
  • Additional hard disk space for scratch files (with the required space contingent on the size of the input data set).

Recommended System Requirements

  • Recommended random-access memory (RAM) is 1.5 times the licensed data limit (32 MB, 64 MB, etc), up to the maximum permitted by the target architecture. On UNIX systems, it is generally recommended that there be at least twice as much swap space as there is RAM.
  • Hard disk with 40 MB of free space for program files, data file access utility, and sample data files.
  • Additional hard disk space for scratch files (with the required space contingent on the size of the input data set).

All Salford apps are CPU intensive, so more memory and a faster CPU are always helpful.
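
The UNIX/Linux sizing guidance above (RAM of 1.5 times the licensed data limit, capped by the architecture, and swap of at least twice the RAM) can be written as a short calculation. The function names and the optional `arch_max_mb` parameter are assumptions made for this sketch; they do not come from any Salford tool.

```python
def recommended_ram_mb(data_limit_mb, arch_max_mb=None):
    """Recommended RAM: 1.5 times the licensed data limit, capped by the architecture."""
    ram = 1.5 * data_limit_mb
    if arch_max_mb is not None:
        ram = min(ram, arch_max_mb)   # cannot exceed what the target architecture permits
    return ram


def recommended_swap_mb(ram_mb):
    """General UNIX guidance: at least twice as much swap space as RAM."""
    return 2 * ram_mb


if __name__ == "__main__":
    ram = recommended_ram_mb(64)
    print(ram, recommended_swap_mb(ram))
```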


Price


Download

The SPM Salford Predictive Modeler® software suite is a highly accurate and ultra-fast platform for creating predictive, descriptive, and analytical models from databases of any size, complexity, or organization. The SPM® software suite has automation that accelerates the process of model building by conducting substantial portions of the model exploration and refinement process for the analyst. While the analyst is always in full control, we optionally anticipate the analyst's next best steps and package a complete set of results from alternative modeling strategies for easy review. Do in one day what normally requires a week or more using other systems.

The Salford Predictive Modeler® software suite includes:

CART:
The definitive classification tree, developed by world-renowned statisticians including Drs. Jerome Friedman and Leo Breiman. CART is one of the best-known data mining algorithms and is considered the algorithm responsible for bringing data mining out of the university and into business.
MARS:
Ideal for users who prefer results in a form similar to traditional regression while capturing essential non-linearities and interactions.
TreeNet:
TreeNet is Salford's most flexible and powerful data mining tool, capable of consistently generating extremely accurate models. TreeNet has been responsible for the majority of Salford's modeling competition awards and demonstrates remarkable performance for both regression and classification. The algorithm typically generates thousands of small decision trees built in a sequential error-correcting process to converge on an accurate model.
RandomForests:
RandomForests features include prediction, cluster and segment discovery, anomaly detection and tagging, and multivariate class description. The method was developed by Leo Breiman of the University of California, Berkeley, and Adele Cutler of Utah State University.
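
The TreeNet description above mentions many small trees built in a sequential error-correcting process. The following is a minimal, generic sketch of that idea (gradient boosting with one-split regression trees on a single predictor, squared-error loss). It is not TreeNet's proprietary implementation; every function name, the learning rate, and the toy data are assumptions made for illustration.

```python
def fit_stump(xs, residuals):
    """Find the one-split tree (threshold, left mean, right mean) with the lowest squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:                 # dropping the max keeps both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1:]


def predict_stump(stump, x):
    t, lm, rm = stump
    return lm if x <= t else rm


def boost(xs, ys, n_trees=200, learn_rate=0.1):
    """Build trees sequentially; each new tree is fitted to the current errors."""
    base = sum(ys) / len(ys)                       # start from the mean response
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Take a small step toward correcting the remaining error.
        preds = [p + learn_rate * predict_stump(stump, x) for p, x in zip(preds, xs)]
    return base, learn_rate, stumps


def predict(model, x):
    base, learn_rate, stumps = model
    return base + learn_rate * sum(predict_stump(s, x) for s in stumps)


def mse(model, xs, ys):
    return sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


if __name__ == "__main__":
    xs = [float(i) for i in range(10)]
    ys = [x * x for x in xs]                       # toy non-linear target
    for n in (1, 10, 200):
        print(n, "trees, training MSE:", mse(boost(xs, ys, n_trees=n), xs, ys))
```

Each tree on its own is a weak model; the training error shrinks as more trees are added because every tree is fitted to what the earlier trees still get wrong.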


New Components & Features available in version 7.0!

GPS:
Generalized Path Seeker is Jerry Friedman's approach to regularized regression. This technology offers a high-speed lasso for extreme data set configurations with upwards of 100,000 predictors and possibly very few rows. Such data sets are commonplace in gene research and text mining. The new component is both supremely fast and efficient.
RuleLearner:
RuleLearner is a powerful post-processing technique which selects the most influential subset of nodes, thus reducing model complexity. RuleLearner allows the modeler to take advantage of the increased accuracy of very complicated TreeNet and RandomForests models while still yielding the simplicity of CART models.

University Program

Salford Systems' University Program provides CART at significantly reduced licensing fees to the educational community. Eligible educational institutions are colleges, universities, community colleges, technical schools, and science centers. Additionally, a 90-day free evaluation is available upon request.

The University Program gives eligible educational institutions the right to distribute right-to-use licenses for CART and other Salford tools to all faculty, staff, and students for personal computers, and to install UNIX versions of these tools on university workstations and servers. For more information on this special program, please contact our sales department.

Salford Systems is committed to supporting education and research in universities worldwide and offers special packaging and pricing.

We also offer academics cost-free access to our tutorial materials for classroom use.

 


Product Versions

SPM 7 Product Versions

Ultra
The best of the best. For the modeler who must have access to the leading-edge technology available and the fastest run times, including major advances in ensemble modeling, interaction detection, and automation. Ultra also provides advance access to new features as they become available in frequent upgrades.
ProEx
For the modeler who needs cutting-edge data mining technology, including extensive automation of workflows typical for experienced data analysts and dozens of extensions to the Salford data mining engines.
Pro
A true predictive modeling workbench designed for the professional data miner. Includes a variety of supporting conventional statistical modeling tools, a programming language, reporting services, and a modest selection of workflow automation options.
Basic
Literally the basics. Salford Systems' award-winning data mining engines without extensions, automation, or the surrounding statistical services, programming language, and sophisticated reporting. Designed for small budgets while still delivering our world-famous engines.


Scalability

A user's license sets a limit on the amount of learn sample data that can be analyzed. The learn sample is the data used to build the model. The limit is counted in data values: the rows (observations) multiplied by the columns (variables) used to build the model. Variables not used in the model do not count. Observations reserved for testing, or excluded for other reasons, do not count either, so there is no limit on the number of test sample data points that may be analyzed.

For example, our 32 MB version sets a learn sample limitation of 8 MB. Each data point occupies 4 bytes, so an 8 MB capacity license allows up to 8 * 1024 * 1024 / 4 = 2,097,152 learn sample data points to be analyzed. A data point is a single value: 1 variable by 1 observation (1 row by 1 column).
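
The arithmetic above can be checked with a few lines of code. This is only a sketch of the stated formula (data limit in MB, 1,048,576 bytes per MB, 4 bytes per value); the function names are hypothetical and are not part of any Salford tool.

```python
BYTES_PER_MB = 1024 * 1024   # 1 MB = 1,048,576 bytes
BYTES_PER_VALUE = 4          # each data point occupies 4 bytes


def max_learn_values(data_limit_mb):
    """Number of learn sample values (rows by columns) a data limit allows."""
    return data_limit_mb * BYTES_PER_MB // BYTES_PER_VALUE


def fits_license(learn_rows, model_columns, data_limit_mb):
    """True if the learn sample fits within the licensed data limit.

    Only learn sample rows and the variables used in the model are counted.
    """
    return learn_rows * model_columns <= max_learn_values(data_limit_mb)


if __name__ == "__main__":
    print(max_learn_values(8))               # 2097152
    print(fits_license(100_000, 20, 8))      # 2,000,000 values: True
    print(fits_license(100_000, 21, 8))      # 2,100,000 values: False
```

Because only the learn sample counts, a data set with more rows than the limit suggests may still fit if some rows are reserved for testing or some variables are left out of the model.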

The following is a table that describes the current set of "sizes" available. Please note that the minimum required RAM is **not** the same as the learn sample limitation.

Columns:
  Size: minimum required physical memory (RAM) in MB
  Data Limit MB: licensed learn sample data size in MB (1 MB = 1,048,576 bytes)
  Data Limit # of values: licensed number of learn sample values (rows by columns)

Size    Data Limit MB    Data Limit # of values
32      8                2,097,152
64      18               4,718,592
128     45               11,796,480
256     100              26,214,400
512     200              52,428,800
1024    400              104,857,600
2048    800              209,715,200    **64-bit only
3072    1200             314,572,800    **64-bit only

Additional larger capacity is available under 64-bit operating systems, using our non-GUI (command-line) builds. The non-GUI builds are very flexible and can be licensed for large data limits not currently available in the GUI product line. The current MAXIMUM is 8 GB of data capacity for our non-GUI builds.


Videos

Introduction to CART
By: Mikhail Golovnya, Salford Systems.


Training in CART

Six Part Video Presentation

Part 1

Part 2

Part 3

Part 4

Part 5

Part 6


Training in Advanced CART

Multi-part video presentation

Part 1

Part 2

Part 3


Testimonials

Adrian Gepp, Australia

Bond University:

The failure of businesses is an enduring and costly concern. Business failure prediction models attempt to provide early warnings to mitigate some of the costs of future failure, if not avoid it altogether. Research has shown that CART (by Salford Systems) is a good choice for building such models.

In research published in a top academic journal in 2010, empirical evidence was presented to suggest that decision-tree techniques are superior predictors of business failure. On the hold-out data, the CART decision trees were found to outperform See5 decision trees and discriminant analysis at predicting business failure.

In peer-reviewed research presented at a 2012 academic conference, CART decision trees were compared with a semi-parametric Cox survival analysis model for predicting corporate financial distress over a variety of misclassification costs and prediction intervals. The results from the hold-out data suggest that CART decision trees are the superior predictors of financial distress. Using a weighted error cost metric, CART models had a lower cost of prediction for all misclassification costs and prediction intervals.
References
* Gepp, A., Kumar, K. & Bhattacharya, S. (2010). Business failure prediction using decision trees. Journal of Forecasting, 29(6), pp. 536-555.
* Gepp, A. & Kumar, K. (2012). Financial Distress Prediction using Decision Trees and Survival Analysis. Presented at the 7th Annual London Business Research Conference, 9-10 July, London.

Adrian Gepp, Bond University, Australia


Dr. Martin Kidd, IMT, South Africa

Government:

As a statistician in the Naval environment, I have been involved in the field of data mining for the past four years. Classification trees have become one of the primary tools with which I extract useful information from large databases. I have used various classification tree software packages, and have found CART to be the superior product. What I find particularly useful are the following:
* The colour codes of the nodes, which one can use to pick the most important branches (or rules).
* The relative cost vs. number of nodes graph, which I always use to select the 'least complicated' tree with 'low' relative cost.
* The Gains chart, which provides a good graphical view for assessing tree performance.



Steven Li, Senior Manager, Risk Technology, Sears, Roebuck and Co

CART is an important statistical analysis tool that we use to segment our databases and predict risk factors for the Sears Card. The advantage of the decision tree format is that our results are easy to interpret; especially with CART, we are able to see a great deal of detail about each of the nodes, such as the node's misclassification costs, the count of data assigned to that node, and a display of the surrogate values substituted for the node.



Andrea S. Laliberte, Remote Sensing Scientist at Earthmetrics

I have used CART in conjunction with remote sensing and digital image processing for producing vegetation classifications. CART is an excellent approach for determining the most suitable features (image bands, image ratios, elevation, slope, etc.) for image classification, and for reducing the number of input features to a reasonable number. In comparison with other feature reduction and selection methods, the CART approach has always performed better in my applications. I really like the intuitive approach, the easy-to-use manual, and the visual interface, which makes it easy to interpret the data. In addition, all my interactions with the people at Salford Systems have been wonderful. I highly recommend the software.

 Andrea S. Laliberte, Remote Sensing Scientist at Earthmetrics
Oregon, USA


Anneli Anglund, PhD student at University College Cork

I am a PhD student in the field of marine bioacoustics and while I was looking into analysis methods for my thesis I came across CART. I thought it seemed like an interesting approach and when I tried it I was immediately impressed by the easy to use manual. Even though the examples were not necessarily within my field of study, they made sense and I found it easy to apply the methods to my own data. I would very much like to recommend this software and the very helpful staff of Salford Systems.

 Anneli Anglund, PhD student at University College Cork
Ireland


Chris Gooley, Founder and President at eTs Marketing Science

I've used Salford Systems software products ever since 1991 when Dan Steinberg and his team were first developing Salford tools in conjunction with the pioneering data mining scientists at Stanford and Berkeley.

I am an extensive user of SAS and SPSS software products. However, when it comes to decision trees and highly predictive models, I always turn to CART and other Salford Systems software products. Not only is the user interface simple to use, but writing your own syntax is easy to do as well.

The reasons I like Salford Systems tools and CART specifically include:

  1. The large number of options for tuning the algorithm, including statistical methods, tree depth, minimum node size, and cross validation procedures
  2. Easy to use facilities for building ensemble models via bagging, boosting, and arcing methods
  3. Intuitive, easy to understand metrics such as variable importance that are useful for checking if a model makes “business sense”
  4. Scoring and translating models is very fast and easy
  5. Ease of integration with SAS and SPSS

I can guarantee any analyst that invests a modest amount of their time with Salford tools will never regret the experience nor go back to using less powerful alternatives!



Dean Abbott, Founder and President at Abbott Analytics/Abbott Consulting

I've used Salford Systems tools for years and have recommended purchase of the suite to many companies I've worked with. Reasons I like it so much include:
* The trees build super fast, even with large numbers of rows and columns
* CART shows you the entire sequence of trees that have been built; you can customize the depth you find most appropriate or let CART decide the optimum depth
* Default settings are great but you can still customize
* Battery options let you loop over key settings

Dean Abbott, Founder and President at Abbott Analytics/Abbott Consulting
San Diego, CA USA


Eric Weiss, Ph.D., Consultant; Arid Lands Resource Sciences, University of Arizona

Academic:
As a research scientist in both academic and professional environments, I work with databases too large and complex to process manually. CART, unlike multiple linear programming and other methods that are constrained by functional forms, shows me truer characterizations of interrelationships between the data. CART is also a robust program that can support a diverse set of applications ranging, in my case, from food security analyses to pattern recognition and remote sensing problems.



Feng Xu, Senior Manager, AT&T Universal Card Services

Telecommunications:
When we purchased CART, it was the only comprehensive classification and segmentation software available that could handle the large data sets we use for credit card risk management. In addition, CART provides us with a great deal of flexibility by allowing us, for example, to specify a higher penalty for misclassifying a certain data value.



Marsha Wilcox, Ed.D., Vice President, PreVision Marketing

Marketing:
PreVision Marketing's clients include Fortune 500 companies from telecommunications, automotive, retail and packaged goods industries. We apply our database marketing and analysis expertise to turn our clients' usual wealth of customer information into beneficial marketing information and customer relationship programs. At PreVision, this typically includes developing models of customer and prospective customer behavior. CART's recursive partitioning abilities give us a proven statistical method for generating marketing models in an easy-to-understand decision tree format. This format is accessible to all of our clients, even those with limited statistical backgrounds, and the clarity of the decision tree display gives our clients added confidence in the validity and utility of the models we create.



Terence Mak, VP, Lead Analytic Consultant, Fleet Financial Group

Banking/Finance:
CART offers two distinctive advantages that other database segmentation tools do not. First, it allows the analyst to identify the smallest target segment possible, such as ten out of tens of thousands, with exceptional precision. In addition, CART allows us to specify a higher penalty for misclassifying a potentially poor prospect than for rejecting a good one; this makes us more confident that, for products with very thin margins, our segmentation models avoid prospects who would likely be non-profitable. CART is an invaluable data mining and modeling tool for Fleet Financial Group.



Wesley Johnston, Chevron Information Technology Co.

Industrial:
At Chevron, we conduct a lot of exploratory work for oil well drilling. Instead of taking many expensive core samples, we can use monitoring tools to characterize geographic areas; data capture generates small data sets with variables that are complex and interrelated rather than independent. CART, with its v-fold cross-validation capability, is our tool of choice for analyzing these small, complex data sets.



William Burrows, Meteorological Research Scientist, Atmospheric Environment Service

Government:
I use CART to provide Canadian meteorologists with dynamic statistical models for predicting lake effect snowfall, ozone levels and other weather issues that affect Canada. The optimal tree models I create in CART have proven their accuracy many times over when the tree is used with independent data.


