Autodiscovery leverages the stability advantages of multiple trees to rank variables for importance and thus select a subset of predictors for modeling. In SPM 7 and earlier Autodiscovery runs a very simple training data only TreeNet model growing out to 200 trees. The variable importance ranking generated from this model is then used to reduce the list of all available predictors down to the top performing predictors in this background model. Autodiscovery is fast and easy, as there are no control parameters to set, but it is just a mechanism for quickly testing whether a substantial refinement in the number of predictors can improve model performance.
In most serious modeling projects we would supplement Autodiscovery with more intensive variable selection mechanisms such as we have built into BATTERY SHAVING, where the model, rank, select, and model again cycle is repeated possibly a very large number of times.
Salford Predictive Modeler™ and its component data mining engines CART®, MARS®, TreeNet®, and RandomForests® contain a variety of tools to help modelers work quickly and efficiently. One of the most effective tools for rapid model development is found in the BATTERY tab of the MODEL Set Up dialog. Because there are so many tools embedded in that dialog we are going to start a series of posts going through the principal BATTERY choices, one at a time.
Let’s start with the idea of the BATTERY. The BATTERY mechanism is an automated system for running experiments and trying out different modeling ideas. Instead of you having to think about how you would like to tweak your model to try to make it better the BATTERY does it for you. Each BATTERY is a planned experiment in which we take some useful modeling control and run a series of models in which we systematically change that control. The best part of this is the SUMMARY which provides you with an executive summary of the results and points you to the best performing model. We recommend that you use the BATTERY often; some modelers don’t do anything without setting up pre–packaged or user customized batteries.
The most recent versions of Salford Predictive Modeler™ SPM PRO EX include a new BATTERY to invoke bootstrapped replication of most model types available in SPM. One of our reasons for adding this BATTERY was to provide access to the full CART engine when generating RandomForests® (RF) models. The principle advantages of this are:
Breiman’s original RF uses a stripped down and simplified tree growing algorithm designed for speed. It lacks tree growing options and missing handling, and fort many users Breiman's RF is confined to classification problems. By accessing the full CART engine with all of its Salford extensions and customized controls, modelers can accomplish far more sophisticated analyses, handle missing values with surrogates, apply penalties and constraints, and most importantly for those interested in continuous dependent variables, BATTERY BOOTSTRAP gives access to both Least Squares (LS) and Least Absolute Deviation (LAD) regression trees.
The principle drawback of BATTERY BOOTSTRAP is that the extra machinery comes with a computational price: RF runs under BATTERY BOOTSTRAP are much slower than under Breiman–RF. The extra robustness, ability to handle huge problems, and added controls should often make the slower runs worthwhile. Also observe that at the moment the RF post–model visualization machinery is not available.
Use Battery SHAVE in the Salford Predictive Modeler™ to improve your model performance, increase model simplicity, and decrease the number of predictors needed for an accurate model. Using this battery will hep streamline and automate your model for optimal results.
We can dig deeper than we did in our previous post into the reasons why more compact predictor lists can improve decision trees. Recall that a CART tree is grown by searching for splits across all predictors and all possible split points in a given partition of the learning data. There is no guarantee that this same split will be as good on the previously-unseen test data. Occasionally, the best split on the learn data will be a lucky draw, and the split will not be confirmed on test data. In the original CART monograph, large sample theory was intended to assure that in very large samples CART will always correct any unfortunate splits made as the tree evolves by making the correct splits lower down in the tree. With sufficiently large samples, enough data always are left to converge to the best model. In most real world situations, however, we will not want to rely on massive data sets to get to the best model, and we may not have enough data to assure the desired result.
The Salford Predictive Modeler™ suite (SPM) includes a number of automated tools to assist in the process of feature selection under the BATTERY mechanism. For example,
BATTERY KEEP
Selects a subset of features at random and builds a model from this random subset only. The GUI will guide you in how to use this option, but from the command line you would issue something like:
BATTERY KEEP=100, 15
Which requests 100 models, each of which includes 15 randomly-selected predictors. If we are sure that we want certain variables included in every such model, the command would look like:
BATTERY KEEP=100, 15 CORE= X1, X2, X3, X4, X5