SPM v7.0 - Salford Predictive Modeler® software suite

SPM 7

Brainpower:
56 Pre-packaged scenarios inspired by how leading model analysts structure their work.
Efficiencies:
Cleverly designed automation to relieve the analyst of gruntwork, allowing the analyst to focus on the creative aspects of model development.
Sophistication:
Advanced Algorithms not found anywhere else.
Enhanced Regression:
Regression and Logistic Regression vastly enhanced to incorporate the key concepts of modern data mining approaches specifically geared toward massive datasets.
Improvements:
Ever expanding stream of additions and modifications to our core tools, based on user feedback and new levels of understanding of our flagship products.
Bridging the Gap:
Connecting advances in academic thinking pioneered by Jerome Friedman with real-world applications.

Because Accuracy Matters

The SPM Salford Predictive Modeler® software suite is a highly accurate and ultra-fast analytics and data mining platform for creating predictive, descriptive, and analytical models from databases of any size, complexity, or organization. This suite of data mining tools includes Salford Systems' flagship products of CART, MARS, TreeNet, and Random Forests. The SPM software suite's automation accelerates the process of model building by conducting substantial portions of the model exploration and refinement process for the analyst. While the analyst is always in full control, we optionally anticipate the analyst's next best steps and package a complete set of results from alternative modeling strategies for easy review. Do in one day what normally requires a week or more using other systems!

 

 


What's New

Improvements to Existing Features and Components

  • CART Classification and Regression Trees:
    User-defined linear combination lists for splitting; Constraints on trees; Automatic addition of missing value indicators; Enhanced GUI reporting; User-controlled cross-validation; Out-of-bag performance stats and predictions; Profiling terminal nodes based on user-supplied variables; Comparison of Train vs. Test consistency across nodes; RandomForests-style variable importance.
  • MARS (Automated Nonlinear Regression):
    Updated GUI; Model performance based on an independent test sample or cross-validation; Support for time series models
  • TreeNet (Gradient Boosting, Boosted Trees):
    One-Tree TreeNet (CART alternative); RandomForests via TreeNet (RandomForests regression alternative); Interaction Control Language (ICL); Interaction strength reporting; Enhanced partial dependency plots; RandomForests-style randomized splits
  • RandomForests (Bagging Trees):
    RandomForests regression; Saving out-of-bag scores; Speed enhancements
  • High-Dimensional Multivariate Pattern Discovery:
    Battery Target is now available to identify mutual dependencies in the data
  • Automation (Batteries):
    56 pre-packaged scenarios based on years of high-end consulting
  • Hotspot Detection:
    Segment Extraction (Battery Priors)
  • Interaction Detection
  • Missing Value Handling and Imputation
  • Model Assessment and Selection:
    Unified reporting of various performance measures across different models
  • Model Translation:
    SAS, C, Java, PMML, Classic
  • Data Access (all popular statistical formats supported):
    Updated Stat Transfer Drivers including R workspaces
  • Model Scoring:
    Score Ensemble (combines multiple models into a powerful predictive machine)

New Algorithms and Features Specific to SPM v7.0

  • Unsupervised Learning
    Breiman’s Column Scrambler
  • Text Mining (STM is stand-alone, available upon request)
  • Model Compression and Rule Extraction:
    Unified reporting of various performance measures
  • Parallel Processing:
    Automatic support of multiple cores via multithreading
  • Outlier Detection:
    GUI reports, tables, and graphs
  • Linear Methods for Regression, Recent Advances and Discoveries:
    OLS Regression; Regularized Regression Including: LAR/LASSO Regression; Ridge Regression; Elastic Net Regression
  • Linear Methods for Classification, Recent Advances and Discoveries:
    LOGIT; LAR/LASSO; Ridge; Elastic Net / Generalized Path Seeker
  • Ensemble Learning:
    Battery Bootstrap; Battery Model
  • Time Series Modeling
  • Data Preparation:
    Battery Bin for automatic binning of a user-selected set of variables, with a large number of options
  • Model Simplification Methods
    ISLE, RuleLearner
  • Large Data Handling:
    64 bit support; Large memory capacity limited only by your hardware


Automation

New in SPM 7.0: 56 pre-packaged scenarios, essentially automated experiments, inspired by how leading model analysts structure their work. We call them "batteries." These batteries create multiple models automatically so that the analyst can easily compare the choices.

Example 1: Banking Applications

BATTERY SHAVING

Battery Shaving helps identify subsets of informative variables within large datasets containing correlated predictors, such as account data. With automation, you may accomplish significant model reduction with minimal (if any) sacrifice of model accuracy. For example, start with a complete list of variables and run automated shaving from the top to eliminate variables that look promising on the learn sample but fail to generalize. Later you can run shaving from the bottom to automatically eliminate the bulk of redundant and unnecessary predictors. Then follow up with "shaving error" to quickly zero in on the most informative subset of features.
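The shaving-from-the-bottom loop can be sketched in a few lines of Python. This is a toy illustration only; the importance ranking and test-error functions below are hypothetical stand-ins, not SPM's actual engines:

```python
def shave_from_bottom(predictors, importance, test_error, tolerance=0.01):
    """Drop least-important predictors while test error stays within tolerance."""
    current = list(predictors)
    baseline = test_error(current)
    best = list(current)
    while len(current) > 1:
        # Drop the predictor the importance ranking values least.
        current = sorted(current, key=importance, reverse=True)[:-1]
        if test_error(current) <= baseline + tolerance:
            best = list(current)  # smaller subset, accuracy preserved
        else:
            break                 # accuracy degraded: stop shaving
    return best

# Toy setup: only x1 and x2 carry signal; x3..x5 are redundant.
signal = {"x1": 0.9, "x2": 0.7, "x3": 0.1, "x4": 0.05, "x5": 0.02}
error = lambda subset: 1.0 - sum(signal[v] for v in subset if signal[v] > 0.5)
kept = shave_from_bottom(signal, lambda v: signal[v], error)
```

On the toy data the loop shaves the three redundant predictors and stops once removing another variable would hurt test error, keeping only x1 and x2.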

Unlike typical data mining tools, Battery Shaving offers more than a variable importance list. The analyst is provided with a full set of variable-subset variations, enabling quick selection of the final variable list and eliminating the burden of repetitive testing. Expert modelers typically devote a lot of time and effort to optimizing their variable list; Battery Shaving automates this process.

Example 2: Fraud Detection

BATTERY PRIORS

In typical fraud detection applications the analyst is concerned with identifying different sets of rules leading to a varying probability of fraud. Decision trees and TreeNet gradient boosting technology are typically used to build classification rules for detecting fraud. Any classification tree is constructed based on a specific user-supplied set of prior probabilities.

One set of priors will force trees to search for rules with high levels of fraud, while other sets of priors will produce trees with somewhat relaxed assumptions. To gain the most benefits of tree-based rule searching approaches, analysts will try a large number of different configurations of prior probabilities. This process is fully automated in Battery Priors. The result is a large collection of rules ranging from extremely high confidence fraud segments with low support to moderate indication of fraud segments with very wide support. For example, you can identify small segments with 100% fraud or you may find a large segment with a lesser probability of fraud, and everything in-between.
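The prior-sweeping idea can be illustrated with a small Python sketch. The mapping from a prior to a score threshold and the toy scored records are illustrative assumptions, not SPM's actual tree-building logic:

```python
def segments_by_prior(records, score, priors):
    """For each fraud prior, flag records scoring above 1 - prior and
    report the segment's support (size) and confidence (fraud rate)."""
    results = {}
    for p in priors:
        flagged = [r for r in records if score(r) >= 1.0 - p]
        n_fraud = sum(1 for r in flagged if r["fraud"])
        confidence = n_fraud / len(flagged) if flagged else 0.0
        results[p] = {"support": len(flagged), "confidence": confidence}
    return results

# Toy scored transactions: higher score = more fraud-like.
records = [{"score": s, "fraud": f} for s, f in
           [(0.95, True), (0.90, True), (0.80, True), (0.70, False),
            (0.60, True), (0.50, False), (0.40, False), (0.30, False)]]
segments = segments_by_prior(records, lambda r: r["score"], [0.1, 0.5])
```

An aggressive prior yields a small, 100%-fraud segment; a relaxed prior yields a much wider segment with lower confidence, mirroring the trade-off described above.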

Example 3: Market Research - Surveys

BATTERY MVI (MISSING VALUE INDICATORS)

In any survey, a large fraction of information may be missing. Often, the respondent will not answer questions either because they don't want to or are unable to do so. In addition to Salford Systems' expertise in handling missing values, a new automation feature allows the analyst to automatically generate multiple models including: 1) a model predicting response based solely on the pattern of missing values; 2) a model that automatically creates dummy missing value indicators in addition to the original set of predictors; and/or 3) a model that relies on engine-specific internal handling of missing values.
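Strategy (2), adding dummy missing value indicators, can be sketched as follows. The column-naming convention (`var_mis`) and the fill value are illustrative assumptions, not SPM's internal representation:

```python
def add_missing_indicators(rows, fill=0.0):
    """Return rows augmented with a 0/1 <var>_mis indicator per column."""
    out = []
    for row in rows:
        new = {}
        for var, val in row.items():
            new[var + "_mis"] = 1 if val is None else 0   # indicator column
            new[var] = fill if val is None else val       # filled original
        out.append(new)
    return out

# Toy survey responses with refusals recorded as None.
survey = [{"age": 34, "income": None}, {"age": None, "income": 52000}]
prepared = add_missing_indicators(survey)
```

The indicator columns let a model learn from the pattern of non-response itself, which is the basis of strategy (1) as well.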

Example 4: Engineering Application

BATTERY TARGET

In a modern engineering application, as part of the experimental design, a large collection of sampled points may be gathered under different operating conditions. It can be challenging to identify mutual dependencies among the different parameters. For example, temperatures could be perfectly dependent on each other, or could be some unknown functions of other operating conditions like pressure and/or revolutions. Battery Target gives you powerful means to automatically explore and extract all mutual dependencies among predictors. By the word "dependencies," we mean a potentially nonlinear multivariate relationship that goes way beyond the simplicity of conventional correlations. Furthermore, as a powerful side effect, this Battery provides general means for missing value imputation, which is extremely useful to support those modeling engines that do not directly handle missing values.
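The idea of treating each variable in turn as a target and measuring how well the others predict it can be sketched with a simple linear R² as a stand-in for SPM's nonlinear engines (Battery Target itself goes well beyond pairwise linear fits):

```python
from statistics import mean

def mutual_dependencies(data, threshold=0.9):
    """Flag each variable whose best single-predictor linear R^2,
    taken over all other variables, meets the threshold."""
    def r2(xs, ys):
        mx, my = mean(xs), mean(ys)
        sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sxx = sum((x - mx) ** 2 for x in xs)
        syy = sum((y - my) ** 2 for y in ys)
        return 0.0 if sxx == 0 or syy == 0 else (sxy * sxy) / (sxx * syy)

    return {t: max(r2(data[p], data[t]) for p in data if p != t) >= threshold
            for t in data}

# Toy sensor readings: temp2 is exactly twice temp1; pressure is unrelated.
readings = {"temp1": [1, 2, 3, 4], "temp2": [2, 4, 6, 8],
            "pressure": [5, 1, 4, 2]}
flags = mutual_dependencies(readings)
```

The two perfectly dependent temperatures are flagged; the unrelated pressure reading is not.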

Example 5: Web Advertising

BATTERY SAMPLE

In an online ad placement application one has to balance the amount of data used against the time it takes to complete model building. In web advertising there is a virtually unlimited amount of data, so while ideally you would wish to use all available data, there is always a limit on how much can be used for real-time deployment. Battery Sample allows the analyst to automatically explore the impact of learn sample size on model accuracy. For example, you may discover that using 200,000,000 transactions provides no additional benefit in terms of model accuracy compared to 100,000,000 transactions.
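A learn-sample-size sweep of this kind can be sketched as follows; the saturating accuracy curve below is made up purely for illustration and stands in for retraining a real model at each size:

```python
def sample_size_sweep(sizes, accuracy_at):
    """Return (size, accuracy, gain over previous size) for each size."""
    rows, prev = [], None
    for n in sizes:
        acc = accuracy_at(n)
        rows.append((n, acc, None if prev is None else acc - prev))
        prev = acc
    return rows

# Made-up saturating learning curve standing in for a real model's accuracy.
curve = lambda n: round(0.90 - 50 / n ** 0.5, 4)
sweep = sample_size_sweep([10**6, 10**7, 10**8, 2 * 10**8], curve)
```

The shrinking accuracy gains at the largest sizes are exactly the signal Battery Sample surfaces: past some point, doubling the learn sample buys essentially nothing.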

Example 6: Microarray Application

BATTERY MCT (MONTE CARLO SHUFFLING OF THE TARGET)

Microarray research datasets are characterized by an extremely large number of predictors (genes) and a very limited number of records (patients). This opens up a vast area of ambiguity resulting from the fact that even a random subset of predictors may produce a seemingly good-looking model. Battery MCT (Monte Carlo Shuffling of the Target) allows you to determine whether the model performance is as accurate as it appears to be. Battery MCT automatically constructs a large number of auxiliary models based on randomly shuffled target variables. By comparing the actual model performance with the reference distribution (no-dependency models), a final decision on model performance can be made. This technology could challenge some published microarray research: if a dataset with deliberately destroyed target dependency can give you a model with good accuracy, then relying on the original model becomes rather dubious.
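The target-shuffling test can be sketched as a standard permutation test; the correlation-based scoring function here is a simple stand-in for fitting and evaluating a real model at each shuffle:

```python
import random
from statistics import mean

def abs_corr(x, y):
    """Absolute Pearson correlation; a stand-in for model accuracy."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x) ** 0.5
    syy = sum((b - my) ** 2 for b in y) ** 0.5
    return abs(sxy / (sxx * syy))

def mct_p_value(x, y, fit_score, n_shuffles=200, seed=0):
    """Fraction of shuffled-target fits scoring at least as well as the
    fit to the real target (small values suggest genuine signal)."""
    rng = random.Random(seed)
    real = fit_score(x, y)
    hits = sum(1 for _ in range(n_shuffles)
               if fit_score(x, rng.sample(y, len(y))) >= real)
    return hits / n_shuffles

# Strong real relationship: shuffled targets should almost never match it.
x = list(range(30))
y = [2 * v + (v % 3) for v in x]   # deterministic pseudo-noise
p_value = mct_p_value(x, y, abs_corr)
```

If the real model's score falls inside the shuffled reference distribution (a large p-value), the apparent accuracy should be treated as an artifact of the wide predictor space rather than genuine signal.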


User Guide

The Salford Predictive Modeler Software Suite

Core components include CART, MARS, TreeNet, Random Forests, and Generalized PathSeeker

CART Classification and Regression Trees

Welcome to CART, a robust decision-tree tool for data mining, predictive modeling, and data preprocessing. CART (Classification and Regression Trees) automatically searches for important patterns and relationships, uncovering hidden structure even in highly complex data. CART trees can be used to generate accurate and reliable predictive models for a broad range of applications from bioinformatics to risk management, and new applications are being reported daily.
Salford Systems' CART is the only decision-tree system based on the original CART code developed by world-renowned Stanford University and University of California at Berkeley statisticians Breiman, Friedman, Olshen and Stone.


MARS Multivariate Adaptive Regression Splines

MARS is considered the world’s first truly successful automated regression modeling tool. Multivariate Adaptive Regression Splines (MARS) has become widely known in the data mining and business intelligence worlds only recently through our seminars and the enthusiastic endorsement of leading data mining specialists. MARS is an innovative and flexible modeling tool that automates the building of accurate predictive models for continuous and binary dependent variables. It excels at finding optimal variable transformations and potential interactions within any regression-based modeling solution and easily handles the complex data structure that often hides in high-dimensional data. In doing so, this approach to regression modeling effectively uncovers important data patterns and relationships that are difficult, if not impossible, for other methods to reveal.


TreeNet Stochastic Gradient Boosting

TreeNet is a revolutionary advance in data mining technology developed by Jerome Friedman, one of the world's outstanding data mining researchers. TreeNet offers exceptional accuracy, blazing speed, and a high degree of fault tolerance for dirty and incomplete data. It can handle both classification and regression problems and has been proven to be remarkably effective in traditional numeric data mining and text mining.


Random Forests

This guide describes what’s under the hood, beginning with why RandomForests’ engine is both unique and innovative. Because RandomForests is such a new tool, we assume no prior knowledge of the adaptive modeling methodology underlying RandomForests. To put this methodology into context, the first section discusses the modeler’s challenge and addresses how RandomForests meets this challenge. The remaining sections provide detailed explanations of how the RandomForests model is generated, how RandomForests handles categorical variables and missing values, how the “optimal” model is selected and, finally, how testing regimens are used to protect against overfitting.


GPS Generalized Path Seeker

GPS or Generalized PathSeeker is a highly specialized and flexible regression (and logistic regression) procedure developed by Jerome Friedman (the co-creator of CART and the developer and inventor of MARS and TreeNet, among several other major contributions to data mining and machine learning). GPS is a "regularized regression" procedure meaning that it is designed to handle modeling challenges that are difficult or impossible for everyday regression.
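The flavor of regularized regression GPS performs can be illustrated with a textbook coordinate-descent LASSO. This is the generic algorithm, not Friedman's actual GPS path-finding code, and the toy data is made up:

```python
def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for min (1/2n)||y - Xb||^2 + lam * ||b||_1."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual leaving out feature j's contribution.
            r = [y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j)
                 for i in range(n)]
            rho = sum(X[i][j] * r[i] for i in range(n)) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            # Soft-thresholding sets weak coefficients exactly to zero.
            if rho > lam:
                beta[j] = (rho - lam) / z
            elif rho < -lam:
                beta[j] = (rho + lam) / z
            else:
                beta[j] = 0.0
    return beta

# Toy data: y depends only on the first feature; the second is noise.
X = [[1, 0.10], [2, -0.20], [3, 0.15], [4, -0.10], [5, 0.05]]
y = [3, 6, 9, 12, 15]
beta = lasso_cd(X, y, lam=0.1)
```

The L1 penalty drives the noise feature's coefficient exactly to zero, which is what makes this family of methods attractive for datasets with vastly more predictors than rows.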


Price


Download

The SPM Salford Predictive Modeler® software suite is a highly accurate and ultra-fast platform for creating predictive, descriptive, and analytical models from databases of any size, complexity, or organization. The SPM® software suite has automation that accelerates the process of model building by conducting substantial portions of the model exploration and refinement process for the analyst. While the analyst is always in full control, we optionally anticipate the analyst's next best steps and package a complete set of results from alternative modeling strategies for easy review. Do in one day what normally requires a week or more using other systems.

The Salford Predictive Modeler® software suite includes:

CART:
The definitive classification tree, developed by world-renowned statisticians including Drs. Jerome Friedman and Leo Breiman. CART is one of the most well-known data mining algorithms, widely considered responsible for bringing decision trees out of the university and into business.
MARS:
Ideal for users who prefer results in a form similar to traditional regression while capturing essential non–linearities and interactions.
TreeNet:
TreeNet is Salford's most flexible and powerful data mining tool, capable of consistently generating extremely accurate models. It has been responsible for the majority of our modeling competition awards and demonstrates remarkable performance in both regression and classification. The algorithm typically generates thousands of small decision trees built in a sequential error-correcting process that converges to an accurate model.
RandomForests:
RandomForests features include prediction, cluster and segment discovery, anomaly detection, and multivariate class description. The method was developed by Leo Breiman and Adele Cutler of the University of California, Berkeley.


New Components & Features available in version 7.0!

GPS:
Generalized Path Seeker is Jerry Friedman's approach to regularized regression. This technology offers a high-speed LASSO for extreme dataset configurations with upwards of 100,000 predictors and possibly very few rows; such datasets are commonplace in gene research and text mining. The new engine is both supremely fast and efficient.
RuleLearner:
RuleLearner is a powerful post–processing technique which selects the most influential subset of nodes, thus reducing model complexity. RuleLearner allows the modeler to take advantage of the increased accuracy of very complicated TreeNet and RandomForests models while still yielding the simplicity of CART models.

Webinars

Click on title to open slide

The Evolution of Regression Modeling

The Evolution of Regression Modeling: from Classical Linear Regression to Modern Ensembles


Date/Time: Fridays, March 1, 15, 29, and April 12, 2013, 10am-11am PST


Course Description:
Regression is one of the most popular modeling methods, but the classical approach has significant problems. This webinar series addresses these problems. Are you working with larger datasets? Is your data challenging? Does your data include missing values, nonlinear relationships, local patterns, and interactions? This webinar series is for you! We will cover improvements to conventional and logistic regression, including a discussion of classical, regularized, and nonlinear regression, as well as modern ensemble and data mining approaches. This series will be of value to any classically trained statistician or modeler.

Part 1

Part 1: Regression methods discussed (download slides)

  • Classical Regression
  • Logistic Regression
  • Regularized Regression: GPS Generalized Path Seeker
  • Nonlinear Regression: MARS Regression Splines

Part 2

Step-by-step demonstration

 

Part 3

Part 3: Regression methods discussed (download slides)
*Part 1 is a recommended pre-requisite

  • Nonlinear Ensemble Approaches: TreeNet Gradient Boosting; Random Forests; Gradient Boosting incorporating RF
  • Ensemble Post-Processing: ISLE; RuleLearner

 

Part 4

Part 4: Hands-on demonstration of concepts discussed in part 3 (download slides)

  • Step-by-step demonstration
  • Datasets and software available for download
  • Instructions for reproducing demo at your leisure
  • For the dedicated student: apply these methods to your own data (optional)


Advances in TreeNet Gradient Boosting

Advances in Gradient Boosting: The Power of Post Processing

Click to View / Download PDF


Learn how TreeNet stochastic gradient boosting can be improved by post processing techniques such as GPS Generalized Path Seeker, RuleLearner, and ISLE.

 

 

Course Outline:

 

I. Gradient Boosting and Post-Processing:

  • What is missing from Gradient Boosting?
  • Why are post-processing techniques used?

II. Applications Benefiting from Post-Processing: Examples from a variety of industries.

  • Financial Services
  • Biomedical
  • Environmental
  • Manufacturing
  • Adserving

III. Typical Post-Processing Steps

 

IV. Techniques

  • Generalized Path Seeker (GPS): Modern high-speed LASSO-style regularized regression
  • Importance Sampled Learning Ensembles (ISLE): identify and reweight the most influential trees
  • RuleLearner: ISLE on “steroids.” Identify the most influential nodes and rules
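The ISLE idea of keeping only the most influential trees can be sketched with a greedy forward selection over per-tree predictions. Real ISLE uses regularized regression over the whole ensemble, so this is a simplified stand-in with toy prediction vectors, not an actual TreeNet run:

```python
def isle_select(tree_preds, y, max_trees=None):
    """Greedily keep trees that reduce squared validation error."""
    chosen, current = [], [0.0] * len(y)
    best_err = sum(t * t for t in y) / len(y)   # error of the empty model
    pool = list(range(len(tree_preds)))
    while pool and (max_trees is None or len(chosen) < max_trees):
        # Score every remaining tree when added to the current prediction.
        scored = []
        for j in pool:
            cand = [c + p for c, p in zip(current, tree_preds[j])]
            err = sum((yi - ci) ** 2 for yi, ci in zip(y, cand)) / len(y)
            scored.append((err, j))
        err, j = min(scored)
        if err >= best_err:
            break                                # no tree improves the fit
        best_err, chosen = err, chosen + [j]
        current = [c + p for c, p in zip(current, tree_preds[j])]
        pool.remove(j)
    return chosen

# Toy stage predictions from three 'trees' fitting y = [1, 2, 3, 4].
tree_preds = [[0.5, 1.0, 1.5, 2.0], [0.5, 1.0, 1.5, 2.0], [1.0, 1.0, 1.0, 1.0]]
y = [1.0, 2.0, 3.0, 4.0]
chosen = isle_select(tree_preds, y)
```

The third "tree" adds nothing once the first two are in, so it is dropped: a smaller model with the same validation error, which is the essence of ensemble post-processing.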

V. Case Study Example

  • Output/Results without Post-Processing
  • Output/Results with Post-Processing
  • Demo

Watch the Video


Combining CART and TreeNet

TreeNet Tree Ensembles and CART Decision Trees: A Winning Combination

Click to View/Download PDF

 

Combining CART decision trees with TreeNet stochastic gradient boosting: A winning combination.

Learn about how you can combine the best of both tools in this 1 hour webinar.

 

Course Outline

 

I. Classification and Regression Trees Pros/Cons

II. Stochastic Gradient Boosting: a promising way to overcome the shortcomings of a single tree

III. Introducing Stochastic Gradient Boosting, a powerful modern ensemble of boosted trees

  • Methodology
  • Reporting
  • Interpretability
  • Post-Processing
  • Interaction Detection

IV. Advantages of using both Classification and Regression Trees and Tree Ensembles

 

Watch the Video


University Program

Salford Systems' University Program provides SPM, CART®, MARS®, TreeNet®, and RandomForests® at significantly reduced licensing fees to the educational community. Eligible educational institutions are colleges, universities, community colleges, technical schools, and science centers. The University Program gives eligible educational institutions the right to distribute right-to-use licenses for MARS and the other Salford tools to all faculty, staff, and students for personal computers, and to install UNIX versions of these tools on university workstations and servers. For more information on this special program, please contact our sales department.

Salford Systems is committed to supporting education and research in universities world-wide and offers special packaging and pricing.

We also offer academics cost-free access to our tutorial materials for classroom use.

 


 

Product Versions

SPM 7 Product Versions

Ultra
The best of the best, for the modeler who must have access to the most advanced technology available and the fastest run times, including major advances in ensemble modeling, interaction detection, and automation. ULTRA also provides advance access to new features as they become available in frequent upgrades.
ProEx
For the modeler who needs cutting-edge data mining technology, including extensive automation of workflows typical for experienced data analysts and dozens of extensions to the Salford data mining engines.
Pro
A true predictive modeling workbench designed for the professional data miner, with a variety of supporting conventional statistical modeling tools, a programming language, reporting services, and a modest selection of workflow automation options.
Basic
Literally the basics: Salford Systems' award-winning data mining engines without extensions, automation, surrounding statistical services, a programming language, or sophisticated reporting. Designed for small budgets while still delivering our world-famous engines.


Requirements

 

Windows - Minimum System Requirements

We suggest the following minimum and recommended system requirements:

  • 80486 processor or higher.
  • 512MB of random-access memory (RAM). The appropriate amount depends on the data "size" you have purchased (64MB, 128MB, 256MB, 512MB, 1GIG). While all versions may run with a minimum of 32MB of RAM, we CANNOT GUARANTEE they will. We highly recommend that you follow the recommended memory configuration that applies to the particular version you have purchased. Using less than the recommended memory configuration results in hard drive paging, significantly reduced performance, or application instability.
  • Hard disk with 40 MB of free space for program files, data file access utility, and sample data files.
  • Additional hard disk space for scratch files (with the required space contingent on the size of the input data set).
  • CD-ROM or DVD drive.

Recommended System Requirements

Because Salford Tools are extremely CPU intensive, the faster your CPU, the faster they will run. For optimal performance, we strongly recommend they run on a machine with a system configuration equal to, or greater than, the following:

  • Pentium 4 processor running 2.0+ GHz.
  • 2 GB of random-access memory (RAM). As with the minimum requirements above, the appropriate amount depends on the data "size" you have purchased; we highly recommend following the recommended memory configuration for your version.
  • Hard disk with 40 MB of free space for program files, data file access utility, and sample data files.
  • Additional hard disk space for scratch files (with the required space contingent on the size of the input data set).
  • CD-ROM or DVD drive.
  • 2 GB of additional hard disk space available for virtual memory and temporary files.

Ensuring Proper Permissions

If you are installing on a machine that uses security permissions, please read the following note.

  • You must belong to the Administrator group on Windows 2003 / 2008 or Windows 7 / 8 to install and license the application properly. Once the application is installed and licensed, any user with read/write/modify permissions to the application's /bin and temp directories can run the application.

UNIX/Linux - Minimum System Requirements

Supported Architectures

  • Alpha: DEC 3000 or AlphaServer running Tru64 UNIX 5.0 or higher
  • Linux/i386: i586 or higher processor; Linux 2.4 or higher kernel; glibc 2.3 or higher
  • Linux/AMD64: AMD64 or Intel EM64T processor; Linux 2.6 or higher kernel; glibc 2.3 or higher
  • Sun: UltraSPARC processor; Solaris 2.6 or higher
  • RS/6000: POWER or PowerPC processor; AIX 4.2 or higher
  • HP 9000: PA/RISC 1.1 or higher processor; HP/UX 11.x
  • SGI: MIPS 4 or higher processor; IRIX 6.5

Minimum System Requirements

  • The minimum RAM requirement for all non-GUI apps is 32 MB of random-access memory (RAM). The appropriate amount depends on the data "size" you have purchased (64MB, 128MB, 256MB, 512MB, 1GIG).
  • Hard disk with 40 MB of free space for program files, data file access utility, and sample data files.
  • Additional hard disk space for scratch files (with the required space contingent on the size of the input data set).

Recommended System Requirements

  • Recommended random-access memory (RAM) is 1.5 times the licensed data limit (32 MB, 64 MB, etc), up to the maximum permitted by the target architecture. On UNIX systems, it is generally recommended that there be at least twice as much swap space as there is RAM.
  • Hard disk with 40 MB of free space for program files, data file access utility, and sample data files.
  • Additional hard disk space for scratch files (with the required space contingent on the size of the input data set).

All Salford apps are very CPU intensive, so more memory and a faster CPU are always helpful.

Licensing Application

The application uses a system ID and an associated unlock key. When installation is complete, the user will need to email us the application's system ID. This system ID is clearly displayed in the License Information window shown the first time the application is started. You can also reach this window by selecting the Help->License menu option.

Method 1: Fixed License
With a fixed license, each machine must have its own copy of the licensed program installed. If your license terms permit more than one copy, then the license must be activated on each machine that will be used.

Method 2: Floating License
This method of licensing is used if you intend the application to be used by more than one user concurrently over a network. A floating license tracks the number of copies "checked out." When that number exceeds your license terms, a message informs the user that "all copies are checked out." The licensed program may be installed on a machine that each client machine can access. Machines that are not connected to the network must be issued a fixed license (Method 1 above).

A floating license is particularly useful when the number of potential users exceeds the number of seats specified in your license terms.
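The floating-license bookkeeping described above amounts to a seat counter. A minimal illustrative sketch (not Salford's actual licensing code):

```python
class FloatingLicense:
    """Counts checked-out seats and rejects checkouts past the limit."""

    def __init__(self, seats):
        self.seats = seats
        self.checked_out = 0

    def checkout(self):
        if self.checked_out >= self.seats:
            return "all copies are checked out"
        self.checked_out += 1
        return "ok"

    def checkin(self):
        if self.checked_out > 0:
            self.checked_out -= 1

# Two seats: a third concurrent checkout is refused until a seat frees up.
lic = FloatingLicense(seats=2)
results = [lic.checkout(), lic.checkout(), lic.checkout()]
lic.checkin()
results.append(lic.checkout())
```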


 

General Features

SPM 7 engines - General Features.


Components (o = included): Basic / Pro / ProEx / Ultra
Modeling Engine: CART (Decision Trees) o o o o
Modeling Engine: MARS (Nonlinear Regression) o o o o
Modeling Engine: TreeNet (Stochastic Gradient Boosting) o o o o
Modeling Engine: RandomForests for Classification o o o o
Reporting ROC curves during model building and model scoring o o o o
Model performance stats based on Cross Validation o o o o
Model performance stats based on out of bag data during bootstrapping o o o o
Reporting performance summaries on learn and test data partitions o o o o
Reporting Gains and Lift Charts during model building and model scoring o o o o
Automatic creation of Command Logs o o o o
Built-in support to create, edit, and execute command files o o o o
Translating models into SAS ® -compatible language o o o o
Reading and writing datasets in all current database/statistical file formats o o o o
Option to save processed datasets into all current database/statistical file formats o o o o
Automation: Build a series of models using every available data mining engine (Battery MODELS) o o o o
Additional Modeling Engines: Regression, Logistic Regression, RandomForests for Regression   o o o
Automatic creation of missing value indicators   o o o
Option to treat missing value in a categorical predictor as a new level   o o o
License to any level supported by RAM (currently 32MB to 1TB)   o o o
License for multi-core capabilities   o o o
Using built-in BASIC Programming Language during data preparation   o o o
Automatic creation of lag variables based on user specifications during data preparation   o o o
Automatic creation and reporting of key overall and stratified summary statistics for user supplied list of variables   o o o
Display charts, histograms, and scatter plots for user selected variables   o o o
Command Line GUI Assistant to simplify creating and editing command files   o o o
Translating models into SAS/PMML/C/Java/Classic   o o o
An alternative to variable importance based on Leo Breiman's scrambler   o o o
Unsupervised Learning - Breiman's column scrambler   o o o
Scoring any Battery (pre-packaged scenario of runs) as an ensemble model   o o o
Custom selection of a new predictors list from an existing variable importance report   o o o
User defined bins for Cross Validation   o o o
Automated imputation of all missing values   o o o
Automation: Build two models reversing the roles of the learn and test samples (Battery FLIP)   o o o
Automation: Explore model stability by repeated random drawing of the learn sample from the original dataset (Battery DRAW)   o o o
Automation: For time series applications, build models based on sliding time window using a large array of user options (Battery DATASHIFT)   o o o
Automation: Explore mutual multivariate dependencies among available predictors (Battery TARGET)   o o o
Automation: Explore the effects of the learn sample size on the model performance (Battery SAMPLE)   o o o
Automation: Explore alternative strategies to handling of missing values (Battery MVI)   o o o
Automation: Check the validity of model performance using Monte Carlo shuffling of the target (Battery MCT)   o o o
Automation: Build a series of models varying the number of bins for Cross Validation (Battery CV)   o o o
Automation: Repeat Cross Validation process many times to explore the variance of estimates (Battery CVR)   o o o
Automation: Build a series of models using a user-supplied list of binning variables for cross-validation (Battery CVBIN)   o o o
Automation: Build a series of models by varying the random number seed (Battery SEED)   o o o
Automation: Explore the marginal contribution of each predictor to the existing model (Battery LOVO)   o o o
Save out of bag predictions during Cross Validation     o o
Automation: Generate detailed univariate stats on every continuous predictor to spot potential outliers and problematic records (Battery OUTLIERS)     o o
Automation: Convert (bin) all continuous variables into categorical (discrete) versions using a large array of user options (equal width, weights of evidence, Naïve Bayes, supervised) (Battery BIN)     o o
Automation: Explore model stability by repeated repartitioning of the data into learn, test, and possibly hold-out samples (Battery PARTITION)     o o
Automation: Build a series of models using different backward variable selection strategies (Battery SHAVING)     o o
Automation: Build a series of models using the forward-stepwise variable selection strategy (Battery STEPWISE)     o o
Automation: Explore nonlinear univariate relationships between the target and each available predictor (Battery ONEOFF, Battery XONY)     o o
Automation: Build a series of models using randomly sampled predictors (Battery KEEP)     o o
Automation: Explore the impact of a potential replacement of a given predictor by another one (Battery SWAP)     o o
Automation: Explore the impact of penalty on categorical predictors (Battery PENALTY=HLC)     o o
Automation: Explore the impact of penalty on missing values (Battery PENALTY=MISSING)     o o
Automation: Bootstrapping process (sampling with replacement from the learn sample) with a large array of user options (randomforests style sampling of predictors, saving in-bag and out-of-bag scores, proximity matrix, and node dummies) (Battery BOOTSTRAP)     o o
Automation: Parametric bootstrap process (Battery PBOOT)     o o
Automation: Build a series of models for each strata defined in the dataset (Battery STRATA)     o o
Automation: Build two linked models, where the first one predicts the binary event while the second one predicts the amount (Battery RELATED). For example, predicting whether someone will buy and how much will be spent     o o
Automation: Build a series of models limiting the number of nodes in a tree thus controlling the order of interactions (Battery NODES)     o o
Automation: Build a series of models varying the speed of learning (Battery LEARNRATE)     o o
Automation: Build a series of models by progressively imposing additivity on individual predictors (Battery ADDITIVE)     o o
Automation: Build a series of models utilizing different regression loss functions (Battery TNREG)     o o
Automation: Build a series of models by varying subsampling fraction (Battery TNSUBSAMPLE)     o o
Automation: Build a series of models using varying degree of penalty on added variables (Battery ADDEDVAR)     o o
Modeling Pipelines: RuleLearner, ISLE       o
Build a CART tree utilizing the TreeNet engine to gain speed as well as alternative reporting       o
RandomForests inspired sampling of predictors at each node during model building       o
Build a RandomForests model utilizing the TreeNet engine to gain speed as well as alternative reporting       o
Build a Random Forests model utilizing the CART engine to gain alternative handling of missing values via surrogate splits (Battery BOOTSTRAP RSPLIT)       o
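The Monte Carlo shuffling check behind Battery MCT can be sketched in a few lines. This is an illustrative sketch only, not SPM code: the synthetic data and the trivial one-feature classifier are invented for illustration. The idea is that performance on the real target should clearly beat the distribution of performances obtained after repeatedly shuffling the target.

```python
# Illustrative sketch of the Monte Carlo target-shuffling check (Battery MCT).
# The data and the trivial one-feature classifier are invented; not SPM code.
import random

random.seed(7)

# Synthetic data: x genuinely predicts y, plus noise.
xs = [random.gauss(0, 1) for _ in range(400)]
ys = [1 if x + random.gauss(0, 0.5) > 0 else 0 for x in xs]

def accuracy(xs, ys):
    # Trivial "model": predict class 1 whenever x > 0.
    return sum((x > 0) == bool(y) for x, y in zip(xs, ys)) / len(ys)

real_acc = accuracy(xs, ys)

# Score the same model against many shuffled copies of the target.
# If the real accuracy sits inside the shuffled distribution, the
# apparent performance is indistinguishable from chance.
shuffled_accs = []
for _ in range(200):
    ys_perm = ys[:]
    random.shuffle(ys_perm)
    shuffled_accs.append(accuracy(xs, ys_perm))

p_value = sum(a >= real_acc for a in shuffled_accs) / len(shuffled_accs)
```

A small empirical p-value here means the model's performance cannot be explained by a chance pairing of predictors and target.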

 

 


CART

Additional CART Features are available in Basic, Pro, ProEx, and Ultra.


Components   Basic   Pro   ProEx   Ultra
Modeling Engine: CART (Decision Trees) o o o o
Linear Combination Splits o o o o
Optimal tree selection based on area under ROC curve o o o o
User defined splits for the root node and its children   o o o
Automation: Generate models with alternative handling of missing values (Battery MVI)   o o o
Automation: Build a series of models using all available splitting strategies (six for classification, two for regression) (Battery RULES)   o o o
Automation: Build a series of models varying the depth of the tree (Battery DEPTH)   o o o
Automation: Build a series of models changing the minimum required size on parent nodes (Battery ATOM)   o o o
Automation: Build a series of models changing the minimum required size on child nodes (Battery MINCHILD)   o o o
Automation: Explore accuracy versus speed trade-off due to potential sampling of records at each node in a tree (Battery SUBSAMPLE)   o o o
Multiple user defined lists for linear combinations     o o
Constrained trees     o o
Ability to create and save dummy variables for every node in the tree during scoring     o o
Report basic stats on any variable of user choice at every node in the tree     o o
Comparison of learn vs. test performance at every node of every tree in the sequence     o o
Hot-Spot detection to identify the richest nodes across multiple trees     o o
Automation: Vary the priors for the specified class (Battery PRIORS)     o o
Automation: Build a series of models limiting the number of nodes in a tree (Battery NODES)     o o
Automation: Build a series of models trying each available predictor as the root node splitter (Battery ROOT)     o o
Automation: Explore the impact of favoring equal sized child nodes (Battery POWER)     o o
Automation: Build a series of models by progressively removing misclassified records, thus increasing the robustness of trees and possibly reducing model complexity (Battery REFINE)     o o
Automation: Bagging and ARCing using the legacy code (COMBINE)     o o
Build a CART tree utilizing the TreeNet engine to gain speed as well as alternative reporting       o
Build a Random Forests model utilizing the CART engine to gain alternative handling of missing values via surrogate splits (Battery BOOTSTRAP RSPLIT)       o
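The bagging idea behind the legacy COMBINE option (and the BOOTSTRAP batteries) can be sketched as follows. This is a hedged illustration, not SPM code: the one-split "stump" learner stands in for a real CART tree, and the data is synthetic.

```python
# Illustrative sketch of bagging (bootstrap aggregation): train many
# simple models on bootstrap resamples, then combine by majority vote.
# The threshold "stump" is a stand-in for a real CART tree; not SPM code.
import random

random.seed(1)
xs = [random.gauss(0, 1) for _ in range(300)]
ys = [1 if x + random.gauss(0, 0.8) > 0 else 0 for x in xs]

def fit_stump(xs, ys):
    # Pick the single threshold that best separates the two classes.
    best_t, best_acc = 0.0, -1.0
    for t in sorted(set(xs)):
        acc = sum((x > t) == bool(y) for x, y in zip(xs, ys)) / len(ys)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def bagged_thresholds(xs, ys, n_models=25):
    n = len(xs)
    thresholds = []
    for _ in range(n_models):
        idx = [random.randrange(n) for _ in range(n)]  # sample with replacement
        thresholds.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return thresholds

def predict(thresholds, x):
    # Majority vote across the bootstrap models.
    votes = sum(x > t for t in thresholds)
    return 1 if votes * 2 > len(thresholds) else 0

models = bagged_thresholds(xs, ys)
acc = sum(predict(models, x) == y for x, y in zip(xs, ys)) / len(ys)
```

Averaging over bootstrap resamples is what stabilizes high-variance learners such as deep trees; the same resampling machinery also yields the out-of-bag scores and node dummies mentioned above.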



MARS

Additional MARS Features are available in Pro, ProEx, and Ultra.


Components   Basic   Pro   ProEx   Ultra
Modeling Engine: MARS (Nonlinear Regression) o o o o
Automation: Build a series of models varying the maximum number of basis functions (Battery BASIS)   o o o
Automation: Build a series of models varying the smoothness parameter (Battery MINSPAN)     o o
Automation: Build a series of models varying the order of interactions (Battery INTERACTIONS)     o o
Automation: Build a series of models varying the modeling speed (Battery SPEED)     o o
Automation: Build a series of models using varying degrees of penalty on added variables (Battery PENALTY MARS)       o



TreeNet

Additional TreeNet Features are available in Pro, ProEx, and Ultra.

Components   Basic   Pro   ProEx   Ultra
Modeling Engine: TreeNet (Stochastic Gradient Boosting) o o o o
Spline-based approximations to the TreeNet dependency plots   o o o
Exporting TreeNet dependency plots into XML file   o o o
Automation: Build a series of models changing the minimum required size on child nodes (Battery MINCHILD)   o o o
Flexible control over interactions in a TreeNet model     o o
Interaction strength reporting     o o
Build a CART tree utilizing the TreeNet engine to gain speed as well as alternative reporting       o
Build a RandomForests model utilizing the TreeNet engine to gain speed as well as alternative reporting       o
RandomForests inspired sampling of predictors at each node during model building       o
Automation: Explore the impact of influence trimming (outlier removal) for logistic and classification models (Battery INFLUENCE)       o
Automation: Exhaustive search and ranking for all interactions of the specified order (Battery ICL)       o



Random Forests

Additional Random Forests Features are available in Pro, ProEx, and Ultra.


Components   Basic   Pro   ProEx   Ultra
Modeling Engine: RandomForests for Classification o o o o
Additional Modeling Engine: RandomForests for Regression   o o o
Spline-based approximations to the TreeNet dependency plots   o o o
Exporting TreeNet dependency plots into XML file   o o o
Automation: Build a series of models changing the minimum required size on child nodes (Battery MINCHILD)   o o o
Flexible control over interactions in a TreeNet model     o o
Interaction strength reporting     o o
Build a CART tree utilizing the TreeNet engine to gain speed as well as alternative reporting       o
Build a RandomForests model utilizing the TreeNet engine to gain speed as well as alternative reporting       o
RandomForests inspired sampling of predictors at each node during model building       o
Automation: Explore the impact of influence trimming (outlier removal) for logistic and classification models (Battery INFLUENCE)       o
Automation: Exhaustive search and ranking for all interactions of the specified order (Battery ICL)       o

 


GPS

Additional GPS Generalized Path Seeker Features are available only in ProEx and Ultra.


Components   Basic   Pro   ProEx   Ultra
Modeling Engines: Regularized Regression (LASSO/Ridge/LARS/Elastic Net/GPS)   o o o
Automation: Build a series of models by forcing different limits on the maximum correlation among predictors (Battery MAXCORR)     o o



Testimonials

Brian Griner, Chief Methodologist at Quintiles

The Salford Predictive Modeler Software Suite
Great product! Very easy to test different models, compare results and export code to score a database.

 Brian Griner, Chief Methodologist at Quintiles
New York, USA


Jim Kenyon, Director of Operations for Optimization Group.

We use SPM because it lets us quickly and easily build predictive models that produce useful and usable results for our clients.

 Jim Kenyon, Director of Operations at Optimization Group
Ann Arbor, MI, USA



Adrian Gepp, Australia

Bond University:

The failure of businesses is an enduring and costly concern. Business failure prediction models attempt to provide early warnings to mitigate some of the costs of future failure, if not avoid it altogether. Research has shown that CART (by Salford Systems) is a good choice for building such models.

In research published in a top academic journal in 2010, empirical evidence was presented to suggest that decision-tree techniques are superior predictors of business failure. On the hold-out data, the CART decision trees were found to outperform See5 decision trees and discriminant analysis at predicting business failure.

In peer-reviewed research presented at a 2012 academic conference, CART decision trees were compared with a semi-parametric Cox survival analysis model for predicting corporate financial distress over a variety of misclassification costs and prediction intervals. The results from the hold-out data suggest that CART decision trees are the superior predictors of financial distress. Using a weighted error cost metric, CART models had a lower cost of prediction for all misclassification costs and prediction intervals.
References
* Gepp, A., Kumar, K. & Bhattacharya, S. (2010). Business failure prediction using decision trees. Journal of Forecasting, 29(6): 536-555.
* Gepp, A. & Kumar, K. (2012). Financial distress prediction using decision trees and survival analysis. Presented at the 7th Annual London Business Research Conference, 9-10 July, London.

Adrian Gepp, Bond University, Australia


Dr. Martin Kidd, IMT, South Africa

Government:

As a statistician in the Naval environment, I have been involved in the field of data mining for the past four years. Classification trees have become one of the primary tools with which I extract useful information from large databases. I have used several different classification tree packages and have found CART to be the superior product. What I find particularly useful are the following:
* The colour coding of the nodes, which one can use to pick out the most important branches (or rules).
* The relative cost vs. number of nodes graph, which I always use to select the 'least complicated' tree with 'low' relative cost.
* The Gains chart, which provides a good graphical view for assessing tree performance.

Dr. Martin Kidd, IMT, South Africa


Steven Li, Senior Manager, Risk Technology, Sears, Roebuck and Co

CART is an important statistical analysis tool that we use to segment our databases and predict risk factors for the Sears Card. The advantage of the decision tree format is that our results are easy to interpret; especially with CART, we are able to see a great deal of detail about each of the nodes, such as the node's misclassification costs, the count of data assigned to that node, and a display of the surrogate values substituted for the node.

 Steven Li, Senior Manager, Risk Technology, Sears, Roebuck and Co


Andrea S. Laliberte, Remote Sensing Scientist at Earthmetrics

I have used CART in conjunction with remote sensing and digital image processing for producing vegetation classifications. CART is an excellent approach for determining the most suitable features (image bands, image ratios, elevation, slope, etc.) for image classification, and for reducing the number of input features to a reasonable number. In comparison with other feature reduction and selection methods, the CART approach has always performed best in my applications. I really like the intuitive approach, the easy-to-use manual, and the visual interface, which makes it easy to interpret the data. In addition, all my interactions with the people at Salford Systems have been wonderful. I highly recommend the software.

 Andrea S. Laliberte, Remote Sensing Scientist at Earthmetrics
Oregon, USA


Anneli Anglund, PhD student at University College Cork

I am a PhD student in the field of marine bioacoustics and while I was looking into analysis methods for my thesis I came across CART. I thought it seemed like an interesting approach and when I tried it I was immediately impressed by the easy to use manual. Even though the examples were not necessarily within my field of study, they made sense and I found it easy to apply the methods to my own data. I would very much like to recommend this software and the very helpful staff of Salford Systems.

 Anneli Anglund, PhD student at University College Cork
Ireland


Chris Gooley, Founder and President at eTs Marketing Science

I've used Salford Systems software products ever since 1991 when Dan Steinberg and his team were first developing Salford tools in conjunction with the pioneering data mining scientists at Stanford and Berkeley.

I am an extensive user of SAS and SPSS software products. However, when it comes to decision trees and highly predictive models, I always turn to CART and other Salford Systems software products. Not only is the user interface simple to use, but writing your own syntax is easy to do as well.

The reasons I like Salford Systems tools and CART specifically include:

  1. The large number of options for tuning the algorithm, including statistical methods, tree depth, minimum node size, and cross validation procedures
  2. Easy to use facilities for building ensemble models via bagging, boosting, and arcing methods
  3. Intuitive, easy to understand metrics such as variable importance that are useful for checking if a model makes “business sense”
  4. Scoring and translating models is very fast and easy
  5. Ease of integration with SAS and SPSS

I can guarantee any analyst that invests a modest amount of their time with Salford tools will
never regret the experience nor go back to using less powerful alternatives!

Chris Gooley, Founder and President at eTs Marketing Science


Dean Abbott, Founder and President at Abbott Analytics/Abbott Consulting

I've used Salford Systems tools for years and have recommended purchase of the suite to many companies I've worked with. Reasons I like it so much include:
* The trees build super fast, even with large numbers of rows and columns
* CART shows you the entire sequence of trees that have been built; you can customize the depth you find most appropriate or let CART decide the optimum depth
* Default settings are great but you can still customize
* Battery options let you loop over key settings

Dean Abbott, Founder and President at Abbott Analytics/Abbott Consulting
San Diego, CA USA


Eric Weiss, Ph.D., Consultant; Arid Lands Resource Sciences, University of Arizona

Academic
As a research scientist in both academic and professional environments, I work with databases too large and complex to process manually. CART, unlike multiple linear regression and other methods that are constrained by functional forms, shows me truer characterizations of interrelationships between the data. CART is also a robust program that can support a diverse set of applications ranging, in my case, from food security analyses to pattern recognition and remote sensing problems.

 Eric Weiss, Ph.D., Consultant; Arid Lands Resource Sciences, University of Arizona


Feng Xu, Senior Manager, AT&T Universal Card Services

Telecommunications:
When we purchased CART, it was the only comprehensive classification and segmentation software available that could handle the large data sets we use for credit card risk management. In addition, CART provides us with a great deal of flexibility by allowing us, for example, to specify a higher penalty for misclassifying a certain data value.

 Feng Xu, Senior Manager, AT&T Universal Card Services


Marsha Wilcox, Ed.D., Vice President, PreVision Marketing

Marketing
PreVision Marketing's clients include Fortune 500 companies from telecommunications, automotive, retail and packaged goods industries. We apply our database marketing and analysis expertise to turn our clients' usual wealth of customer information into beneficial marketing information and customer relationship programs. At PreVision, this typically includes developing models of customer and prospective customer behavior. CART's recursive partitioning abilities give us a proven statistical method for generating marketing models in an easy-to-understand decision tree format. This format is accessible to all of our clients, even those with limited statistical backgrounds, and the clarity of the decision tree display gives our clients added confidence in the validity and utility of the models we create.

 Marsha Wilcox, Ed.D., Vice President, PreVision Marketing


Terence Mak, VP, Lead Analytic Consultant, Fleet Financial Group

Banking/Finance
CART offers two distinctive advantages that other database segmentation tools do not. First, it allows the analyst to identify the smallest target segment possible, such as ten out of tens of thousands, with exceptional precision. In addition, CART allows us to specify a higher penalty for misclassifying a potentially poor prospect than for rejecting a good one; this makes us more confident that, for products with very thin margins, our segmentation models avoid prospects who would likely be non-profitable. CART is an invaluable data mining and modeling tool for Fleet Financial Group.

 Terence Mak, VP, Lead Analytic Consultant, Fleet Financial Group


Wesley Johnston, Chevron Information Technology Co.

Industrial:
At Chevron, we conduct a lot of exploratory work for oil well drilling. Instead of taking many expensive core samples, we can use monitoring tools to characterize geographic areas; data capture generates small data sets with variables that are complex and interrelated rather than independent. CART, with its v-fold cross-validation capability, is our tool of choice for analyzing these small, complex data sets.

 Wesley Johnston, Chevron Information Technology Co.


William Burrows, Meteorological Research Scientist, Atmospheric Environment Service

Government:
I use CART to provide Canadian meteorologists with dynamic statistical models for predicting lake effect snowfall, ozone levels and other weather issues that affect Canada. The optimal tree models I create in CART have proven their accuracy many times over when the tree is used with independent data.

 William Burrows, Meteorological Research Scientist, Atmospheric Environment Service



Bill Heavlin, Advanced Micro Devices, Inc.

MARS brings a new generation of statistical modeling technology to industrial statistics. MARS models are much more flexible than conventional response surface methods. The output is much more visual and has proven a source of insights in presentations to engineers. Finally, the Windows-type GUI opens the door to training engineers to use the analysis software effectively.

 Bill Heavlin, Advanced Micro Devices, Inc.


David Broadhurst, Assistant Professor of Biostatistics at University of Alberta

Salford Systems provides a fast and effective solution to many complex multivariate classification/regression tasks. It is particularly effective in isolating influential features in 'omics-based data sets (proteomics/metabolomics, etc.). Although there are open-source versions of much of the underlying mechanics, Salford Systems has provided a very thorough interface which greatly shortens the learning curve for a set of very powerful algorithms. I'm a particular fan of MARS :-)

 David Broadhurst, Assistant Professor of Biostatistics at University of Alberta
Edmonton, Canada Area


Herb Edelstein, President, Two Crows Data Mining Consultancy

For years, I have been predicting that MARS would be one of the hottest algorithms and it will be. MARS addresses some shortcomings of decision trees, and it does so in a fairly elegant fashion.

 Herb Edelstein, President, Two Crows Data Mining Consultancy


Richard DeVeaux, Williams College

MARS is in many cases both more accurate and much faster than neural nets.

Richard DeVeaux, Williams College


Sadi Eserce, Senior Analyst at Chadwick Martin Bailey

I recommend this product.

 Sadi Eserce, Senior Analyst at Chadwick Martin Bailey 
Greater Boston Area


Thomas Brauch, Marketing Manager, Data Driven Marketing Department, Fireman's Fund Insurance

MARS is an essential tool for any data miner. It finds significant effects in complex data structures where other methods simply fail. I use it as both a stand alone solution and as a transformation tool for simpler modeling techniques.

Thomas Brauch, Marketing Manager, Data Driven Marketing Department, Fireman's Fund Insurance


Wayne Danter, University of Western Ontario

The MARS interface is smooth, intuitive and worked well. I think you have hit another home run with this data mining and modeling tool. I look forward to using it in a number of medical research projects. Also, I very much appreciate the outstanding customer support I have received.

Wayne Danter, University of Western Ontario


Broadband propensity project: comparing TreeNet with logistic regression in Enterprise Miner.

We’re seeing these benefits:
1. TreeNet's stochastic gradient boosting injects randomization into the selection of candidate predictors and training data, making the method much more robust than traditional statistical models, especially on messy data. For example, part of the information in our dataset, such as customers' portfolio and usage data, is missing. Although we impute these values for the statistical models, imputation still affects the final results because the whole dataset is used for training. TreeNet instead uses only a subset of the data and predictors at each step, repeated hundreds of times, which greatly reduces the influence of messy data and improves the robustness of the final model. Growing a large number of small trees instead of a single complex tree has also been shown to be more accurate and robust.
Our modeling datasets are always large. In this example, the data contains more than 500,000 records and the initial predictor set holds about 160 predictors. TreeNet is computationally efficient, scales to large datasets, and runs much faster than Enterprise Miner.
In the result analysis, the detailed relationships between predictors and the target are much easier to visualize. Battery automates the process of running multiple experiments, which greatly reduces the effort of predictor selection. In this example, 16 predictors were finally selected after 5 cycles and 2 Battery processes.
2. Insights: TreeNet can dig out very granular information. For example, it helps to find the impact of a specific group of predictors. In this example, predictors related to product holding and usage are emphasized in the TreeNet model, while they are not prominent in the traditional logistic regression model. Information in these areas contributed substantially to improving the prediction of customers' propensity to buy our fixed broadband products.
3. We're seeing various levels of performance gains over traditional statistical models. In the best case, the improvement in lift is consistently about 40%, helping to capture more than 30% of the customers who are willing to buy our product.

Predictive Analytics Manager at Leading Telco in Singapore.
Her team works on scientific marketing initiatives using statistics, data science and optimization methodologies.
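The stochastic gradient boosting loop this testimonial describes (many small trees, each fit to the current residuals on a random subsample of the data) can be sketched as follows. This is a hedged illustration under invented data, with a tiny regression stump standing in for TreeNet's real tree engine.

```python
# Illustrative sketch of stochastic gradient boosting: each stage fits a
# small model to the current residuals on a random subsample of the data.
# The regression stump and synthetic data are stand-ins; not TreeNet code.
import random

random.seed(3)
xs = [random.uniform(-3, 3) for _ in range(400)]
ys = [x * x + random.gauss(0, 0.3) for x in xs]  # noisy quadratic target

def fit_stump(data):
    # Regression stump: split at the candidate threshold that most
    # reduces squared error, predicting the mean on each side.
    best = None
    for t in [-2, -1, 0, 1, 2]:  # coarse candidate splits keep this cheap
        left = [r for x, r in data if x <= t]
        right = [r for x, r in data if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, n_stages=200, rate=0.1, subsample=0.5):
    pred = [0.0] * len(xs)
    for _ in range(n_stages):
        resid = [y - p for y, p in zip(ys, pred)]
        # Stochastic part: each stage sees only a random half of the data.
        idx = random.sample(range(len(xs)), int(subsample * len(xs)))
        stump = fit_stump([(xs[i], resid[i]) for i in idx])
        pred = [p + rate * stump(x) for p, x in zip(pred, xs)]
    return pred

pred = boost(xs, ys)
mse = sum((y - p) ** 2 for y, p in zip(ys, pred)) / len(ys)
mean_y = sum(ys) / len(ys)
baseline = sum((y - mean_y) ** 2 for y in ys) / len(ys)
```

The small learning rate plus per-stage subsampling is what makes the ensemble robust to noisy records: no single stage, and no single record, can dominate the fit.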


David Vogel, CEO Voloridge Investment Management and Captain of the winning Heritage Health prize team

I have tried multiple versions of gradient boosting, including the popular open-source implementations, and TreeNet outperforms them all in predictive accuracy (consistently, across many different kinds of data sets) while maintaining the ability to train models quickly.

David Vogel, CEO Voloridge Investment Management and Captain of the winning Heritage Health prize team
Florida, USA


 

Brad Turner, Vice President of Marketing and Business Development, Inkiru

Every day, the Inkiru product predicts sales for 2000 items in an e-commerce context. In addition, the product generates a customized confidence interval for each prediction. The input is dynamic and consists of one year of historical data. Each record contains approximately 150 features with information about sales, products, customers, and promotions.

The problem was very challenging from a modeling point of view. Important parts of the data were continuous, categorical, highly non-linear, sparse, missing, or noisy. We found Salford Systems' tools well suited to dealing with these characteristics of the data.

Precision was an important goal in this project. Validation with real data shows 90% of the predictions lying within 7 units of actual sales and 50% within 2 units. Salford Systems' tools were definitely important in reaching this degree of accuracy in the product.

 Brad Turner, Vice President of Marketing and Business Development, Inkiru
California, USA


Andrew Russo, Vice President, Modeling and Analytics at AccuData Integrated Marketing

As a traditional modeler, I had primarily been using regression and logistic regression. I began to test TreeNet last fall. Since then I have built several models that are now market-tested and are performing as predicted by TreeNet. The real value of TreeNet has been the speed with which it builds models, the accuracy of its predictions, and the incremental lift we are seeing in side-by-side tests against regressions. It has also proven to be a tremendous data-prep time saver in its ability to deal with outliers and missing data, as well as doing a decent job of distinguishing between scale and categorical data. Importantly, the ability to build models with less hands-on work has enabled us to offer new modeling products to clients who otherwise would not have had the budget for a modeling project. In short, this new, advanced capability is giving my company a competitive advantage.

 Andrew Russo, Vice President, Modeling and Analytics at AccuData Integrated Marketing
Florida, USA


Tom Osborn, Adjunct Professor at University of Technology, Sydney

I've used TreeNet on commercial projects since '04. For customer and prospect targeting, it outperforms logistic-family regression, neural nets and the other methods in my kitbag. Key strengths: handling of missing values, robustness, general non-linearity, variable interactions. Clients like the feedback on variable importance (more general than Shapley or PMVD). They also like seeing how the variables contribute to predictions. Fast and easy to use. Best of all, it is built on Jerry Friedman's great maths.

 Tom Osborn, Adjunct Professor (analytics/data mining) at University of Technology, Sydney
Sydney, Australia



 


