Advances in Gradient Boosting: The Power of Post-Processing

Learn how TreeNet stochastic gradient boosting can be improved by post-processing techniques such as Generalized Path Seeker (GPS), RuleLearner, and ISLE.

Course Outline:

I. Gradient Boosting and Post-Processing:

  • What is missing from Gradient Boosting?
  • Why are post-processing techniques used?

II. Applications Benefiting from Post-Processing: Examples from a variety of industries.

  • Financial Services
  • Biomedical
  • Environmental
  • Manufacturing
  • Ad Serving

III. Typical Post-Processing Steps

 

IV. Techniques

  • Generalized Path Seeker (GPS): Modern high-speed LASSO-style regularized regression
  • Importance Sampled Learning Ensembles (ISLE): identify and reweight the most influential trees
  • RuleLearner: ISLE on “steroids.” Identify the most influential nodes and rules
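
The idea behind ISLE-style post-processing can be sketched in a few lines of Python: treat each tree's predictions as a column in a regression, fit a LASSO-style penalized regression, and keep only the trees that end up with a nonzero weight. The sketch below is an illustration under stated assumptions, not Salford's implementation: it uses a plain coordinate-descent lasso as a stand-in for GPS, and the per-tree predictions, target values, and penalty are made-up numbers.

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero; return 0.0 if it falls within the penalty."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_tree_weights(tree_preds, target, penalty, sweeps=100):
    """Coordinate-descent lasso on per-tree predictions (no intercept).

    tree_preds[i][j] is the prediction of tree j for observation i.
    Minimizes (1/2n) * sum((y - Xw)^2) + penalty * sum(|w|).
    """
    n = len(target)
    n_trees = len(tree_preds[0])
    weights = [0.0] * n_trees
    for _ in range(sweeps):
        for j in range(n_trees):
            rho = 0.0   # correlation of tree j with the partial residual
            norm = 0.0  # squared length of tree j's prediction column
            for i in range(n):
                fit_without_j = sum(
                    weights[k] * tree_preds[i][k]
                    for k in range(n_trees) if k != j
                )
                rho += tree_preds[i][j] * (target[i] - fit_without_j)
                norm += tree_preds[i][j] ** 2
            if norm == 0.0:
                continue  # a tree that always predicts 0 carries no signal
            weights[j] = soft_threshold(rho / n, penalty) / (norm / n)
    return weights


# Hypothetical predictions of 3 trees (columns) on 4 observations (rows).
TREE_PREDS = [
    [1.0, 1.0, 1.0],
    [1.0, -1.0, -1.0],
    [-1.0, 1.0, -1.0],
    [-1.0, -1.0, 1.0],
]
# Hypothetical target: driven mostly by tree 0, slightly by tree 1.
TARGET = [2.1, 1.9, -1.9, -2.1]

weights = lasso_tree_weights(TREE_PREDS, TARGET, penalty=0.25)
kept = [j for j, w in enumerate(weights) if w != 0.0]
print("weights:", weights)
print("trees kept:", kept)
```

With these made-up numbers only the first tree keeps a nonzero weight; the weakly contributing tree and the irrelevant tree are dropped, which is the compression effect the ISLE bullet describes.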

V. Case Study Example

  • Output/Results without Post-Processing
  • Output/Results with Post-Processing
  • Demo


Algorithms

Components and Features

SPM Components and Features | What's New
CART (Classification and Regression Trees) | User defined linear combination lists for splitting; Constraints on trees; Automatic addition of missing value indicators; Enhanced GUI reporting; User controlled Cross Validation; Out-of-bag performance stats and predictions; Profiling terminal nodes based on user supplied variables; Comparison of Train vs. Test consistency across nodes; RandomForests-style variable importance
MARS (Automated Nonlinear Regression) | Updated GUI; Model performance based on independent test sample or Cross Validation; Support for time series models
TreeNet (Gradient Boosting, Boosted Trees) | One-Tree TreeNet (CART alternative); RandomForests via TreeNet (RandomForests regression alternative); Interaction Control Language (ICL); Interaction strength reporting; Enhanced partial dependency plots; RandomForests-style randomized splits
RandomForests (Bagging Trees) | RandomForests regression; Saving out-of-bag scores; Speed enhancements
High-Dimensional Multivariate Pattern Discovery | Battery Target (link) to identify mutual dependencies in the data
Unsupervised Learning (Breiman's Column Scrambler) | New
Text Mining | New
Model Compression and Rule Extraction | New: ISLE; RuleLearner; Hybrid Compression
Automation | 56 pre-packaged scenarios based on years of high-end consulting
Parallel Processing | New: Automatic support of multiple cores via multithreading
Interaction Detection |
Hotspot Detection | Segment Extraction (Battery Priors)
Missing Value Handling and Imputation |
Outlier Detection | New: GUI reports, tables, and graphs
Linear Methods for Regression, Recent Advances and Discoveries | New: OLS Regression; Regularized Regression including LAR/LASSO Regression, Ridge Regression, and Elastic Net Regression / Generalized Path Seeker
Linear Methods for Classification, Recent Advances and Discoveries | New: LOGIT; LAR/LASSO; Ridge; Elastic Net / Generalized Path Seeker
Model Assessment and Selection | Unified reporting of various performance measures across different models
Ensemble Learning | New: Battery Bootstrap; Battery Model
Time Series Modeling | New
Model Simplification Methods |
Data Preparation | New: Battery Bin for automatic binning of a user selected set of variables with a large number of options
Large Data Handling | 64-bit support; Large memory capacity limited only by your hardware
Model Translation (SAS, C, Java, PMML, Classic) | Java
Data Access (all popular statistical formats supported) | Updated Stat Transfer Drivers including R workspaces
Model Scoring | Score Ensemble (combines multiple models into a powerful predictive machine)

MARS - Multivariate Adaptive Regression Splines®

Automated Non-Linear Regression
MARS software is ideal for users who prefer results in a form similar to traditional regression while capturing essential nonlinearities and interactions. The MARS approach to regression modeling effectively uncovers important data patterns and relationships that are difficult, if not impossible, for other regression methods to reveal. MARS builds its model by piecing together a series of straight lines with each allowed its own slope. This permits MARS to trace out any pattern detected in the data.
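
The "series of straight lines, each with its own slope" can be illustrated with hinge basis functions of the form max(0, x - knot) and max(0, knot - x). The following Python sketch is a toy illustration only: the knot, intercept, and coefficients are hypothetical values chosen by hand, whereas MARS selects knots and terms automatically from the data.

```python
def hinge(x, knot, sign=+1):
    """Hinge basis function: max(0, x - knot) for sign=+1,
    max(0, knot - x) for sign=-1."""
    return max(0.0, sign * (x - knot))


def piecewise_linear_predict(x, intercept, terms):
    """Sum of weighted hinge functions; terms are (coef, knot, sign)."""
    return intercept + sum(coef * hinge(x, knot, sign)
                           for coef, knot, sign in terms)


# Hypothetical model: slope 0.5 to the left of x = 5, slope 2.0 to the right.
INTERCEPT = 10.0
TERMS = [
    (2.0, 5.0, +1),   # active only when x > 5
    (-0.5, 5.0, -1),  # active only when x < 5
]

for x in (3.0, 5.0, 8.0):
    print(x, piecewise_linear_predict(x, INTERCEPT, TERMS))
```

Each hinge is zero on one side of its knot, so adding or removing a term changes the slope only on the side where that term is active. This is what lets a sum of such terms trace out a bend in the data.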
High-Quality Probability
The MARS model is designed to predict continuous numeric outcomes such as the average monthly bill of a mobile phone customer or the amount that a shopper is expected to spend in a web site visit. MARS is also capable of producing high quality probability models for a yes/no outcome. MARS performs variable selection, variable transformation, interaction detection, and self-testing, all automatically and at high speed.
High-Performance Results
Areas where MARS has exhibited very high-performance results include forecasting electricity demand for power generating companies, relating customer satisfaction scores to the engineering specifications of products, and presence/absence modeling in geographical information systems (GIS).

 

Product Versions

SPM® 8 Product Versions

Ultra
The best of the best. For the modeler who must have access to the leading-edge technology available and the fastest run times, including major advances in ensemble modeling, interaction detection, and automation. ULTRA also provides advance access to new features as they become available in frequent upgrades.
ProEx
For the modeler who needs cutting-edge data mining technology, including extensive automation of workflows typical for experienced data analysts and dozens of extensions to the Salford data mining engines.
Pro
A true predictive modeling workbench designed for the professional data miner. Includes a variety of supporting conventional statistical modeling tools, a programming language, reporting services, and a modest selection of workflow automation options.
Basic
Literally the basics. Salford Systems' award-winning data mining engines without extensions, automation, or the surrounding statistical services, programming language, and sophisticated reporting. Designed for small budgets while still delivering our world-famous engines.

Megan Sun, Data Mining Analyst, Marketing Department at Genworth Financial

I have 6 years of experience using SAS and other statistical software to conduct academic and business projects. I started using SPM to build predictive models in May 2014. Our team mainly uses SPM TreeNet to build models for direct mail campaigns. I think the SPM software (Salford Predictive Modeler) is S.P.M. - SMART, PRODUCTIVE and MANAGEABLE.

SMART

Fast with big data

We use a lot of data. Most of the time, our model data sets have hundreds of thousands of records and thousands of variables. SPM handles these large data sets very quickly and builds predictive models in as little as a few minutes. It also displays pop-up messages when it finds data issues, so I can identify problems more easily.

Powerful Battery tools to reduce variables

When I build predictive models with thousands of variables, I find one of the hardest tasks is to reduce the variables to the important ones. My goal is to move from over 1,000 variables to fewer than 20.

Models with fewer important variables without losing much lift are much easier to implement in our business environment. SPM provides 31 powerful Battery tools to do this for me. The top 3 Battery options that I use most often are Shaving, LOVO and Keep. All three help remove the least contributing variables so that the model retains the most important predictors.

Machine learning for missing values

When I build TreeNet models in SPM, I don’t need to spend lots of time dealing with missing values because SPM takes care of this for me: it automatically learns the pattern from the build data set and then assigns proper values for the missing records. This feature saves me lots of time and manual work.

User friendly interface and no programming required

SPM makes model building easy for me even though I’m not a programmer or statistician. With its user-friendly interface design, it is easy to build a robust model in a few minutes. No hard coding is needed to build models, which saves me lots of time in programming and code testing.

PRODUCTIVE

Build not only good, but reliable models

SPM provides many algorithms, such as CART, TreeNet, and Random Forests, from which I can choose to build different models. For instance, instead of building one decision tree at a time, TreeNet is able to build hundreds of decision trees in minutes and find the optimal one for me.

The CVR battery tool also helps me validate model performance by building 20 or 30 models with different cross validation sets. This has helped tremendously in improving the reliability of our models when I have thin data.

Easy scoring even with millions of records

The scoring feature from SPM makes scoring data super easy. If I need to score data that is relatively small (hundreds of thousands of records), I can get it done with SPM on my Windows environment in a couple of minutes. If I need to score a large data set that has millions of customer records, I can export the model from SPM into SAS and then do the scoring on a SAS server without any problem.

MANAGEABLE

Manage modeling process

The statistics summary file helps track the modeling process, reports error messages, and shows descriptive statistics to help me manage the modeling and scoring process.

Helpful support team

The support team from SPM has rich knowledge in model building and is very helpful when I have questions. They take my questions or requests by email or phone and always get back to me in a timely manner with helpful answers. Additionally, they provide useful resources that help me better understand the topics.

Random Forests®

Breiman and Cutler’s Random Forests:
Random Forests is a bagging tool that leverages the power of multiple alternative analyses, randomization strategies, and ensemble learning to produce accurate models, insightful variable importance ranking, and laser-sharp reporting on a record-by-record basis for deep data understanding. Its strengths are spotting outliers and anomalies in data, displaying proximity clusters, predicting future outcomes, identifying important predictors, discovering data patterns, replacing missing values with imputations, and providing insightful graphics.
Cluster and Segment:
Much of the insight provided by Random Forests is generated by methods applied after the trees are grown and includes new technology for identifying clusters or segments in data as well as new methods for ranking the importance of variables. The method was developed by Leo Breiman and Adele Cutler of the University of California, Berkeley, and is licensed exclusively to Salford Systems. Ongoing research is being undertaken by Salford Systems in collaboration with Professor Adele Cutler, the surviving co-author of Random Forests.
Suited for Wide Datasets:
Random Forests is a collection of many CART trees that are not influenced by each other when constructed. The predictions of the individual trees are combined (averaged for regression, voted on for classification) to determine the overall prediction of the forest. Random Forests is best suited for the analysis of complex data structures embedded in small to moderate data sets containing fewer than 10,000 rows but potentially millions of columns.
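
How the individual trees' predictions are combined into one forest prediction can be sketched as follows. This is a minimal Python illustration assuming the usual Random Forests convention (average the trees for a numeric target, take the majority vote for a class target). The function names and the per-tree outputs are hypothetical and are not part of SPM.

```python
from collections import Counter
from statistics import mean


def forest_regression(tree_predictions):
    """Forest prediction for a numeric target: average over the trees."""
    return mean(tree_predictions)


def forest_classification(tree_votes):
    """Forest prediction for a class target: the most frequent vote."""
    return Counter(tree_votes).most_common(1)[0][0]


# Hypothetical outputs of three independently grown trees for one record.
print(forest_regression([10.0, 12.0, 14.0]))         # 12.0
print(forest_classification(["yes", "no", "yes"]))   # yes
```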

 

 

SPM® Scalability

A user's license sets a limit on the amount of learn sample data that can be analyzed. The learn sample is the data used to build the model, and its size is counted in data points: the rows by columns of observations and variables used to build the model. Variables not used in the model do not count. Observations reserved for testing, or excluded for other reasons, do not count either, so there is no limit on the number of test sample data points that may be analyzed.

For example, consider our 32 MB version, which sets a learn sample limit of 8 MB. Each data point occupies 4 bytes, so an 8 MB capacity license allows up to 8 * 1024 * 1024 / 4 = 2,097,152 learn sample data points to be analyzed. A data point is a single value: 1 variable by 1 observation (1 row by 1 column).
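
The arithmetic above can be sketched in a few lines of Python. This only illustrates the stated rule (4 bytes per data point, 1 MB = 1,048,576 bytes); the function names and the example data set sizes are hypothetical and are not part of SPM.

```python
BYTES_PER_VALUE = 4          # each data point occupies 4 bytes (from the text)
BYTES_PER_MB = 1024 * 1024   # 1 MB = 1,048,576 bytes


def max_learn_values(data_limit_mb):
    """Number of learn sample data points a license of this size allows."""
    return data_limit_mb * BYTES_PER_MB // BYTES_PER_VALUE


def fits_license(rows, model_columns, data_limit_mb):
    """True if rows x columns used in the model stay within the limit."""
    return rows * model_columns <= max_learn_values(data_limit_mb)


print(max_learn_values(8))            # 2097152
print(fits_license(100_000, 20, 8))   # True: 2,000,000 values
print(fits_license(100_000, 25, 8))   # False: 2,500,000 values
```

Only the variables actually used in the model count toward `model_columns`, so trimming the predictor list is one way to bring a large learn sample under a given limit.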

The following is a table that describes the current set of "sizes" available. Please note that the minimum required RAM is **not** the same as the learn sample limitation.

Size: minimum required physical memory (RAM) in MB | Data Limit MB: licensed learn sample data size in MB (1 MB = 1,048,576 bytes) | Data Limit # of values: licensed # of learn sample values (rows by columns) | Notes
32 | 8 | 2,097,152 |
64 | 18 | 4,718,592 |
128 | 45 | 11,796,480 |
256 | 100 | 26,214,400 |
512 | 200 | 52,428,800 |
1024 | 400 | 104,857,600 |
2048 | 800 | 209,715,200 | 64-bit only
3072 | 1200 | 314,572,800 | 64-bit only

Additional larger capacity is available under 64-bit operating systems, using our non-GUI (command-line) builds. The non-GUI builds are very flexible and can be licensed for large data limits not currently available in the GUI product line. The current maximum is 8 GB of data capacity for our non-GUI builds.

Get In Touch With Us

Contact Us

9685 Via Excelencia, Suite 208, San Diego, CA 92126
Ph: 619-543-8880
Fax: 619-543-8888
info (at) salford-systems (dot) com