Autodiscovery leverages the stability advantages of multiple trees to rank variables by importance and thus select a subset of predictors for modeling. In SPM 7 and earlier, Autodiscovery runs a very simple TreeNet model on the training data only, growing out to 200 trees. The variable importance ranking generated from this background model is then used to reduce the list of all available predictors to the top performers. Autodiscovery is fast and easy, as there are no control parameters to set, but it is just a mechanism for quickly testing whether a substantial reduction in the number of predictors can improve model performance.
In most serious modeling projects we would supplement Autodiscovery with more intensive variable selection mechanisms, such as the one built into BATTERY SHAVING, where the cycle of model, rank, select, and model again is repeated, possibly a very large number of times.
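The model, rank, select, and model again cycle can be sketched in a few lines of Python. This is only an illustration of the idea, not SPM's implementation: `fit_and_score` is a hypothetical callable standing in for any modeling engine that returns a test-sample score and a variable importance ranking, and the toy scorer below is invented purely so the loop has something to run against.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical signature: given a predictor list, return (test_score, importances).
FitFn = Callable[[List[str]], Tuple[float, Dict[str, float]]]


def shave(predictors: List[str], fit_and_score: FitFn, min_keep: int = 1):
    """Repeatedly drop the least important predictor and refit.

    Returns the best-scoring predictor list seen and its score.
    """
    current = list(predictors)
    best_vars, best_score = None, float("-inf")
    while len(current) >= min_keep:
        score, importance = fit_and_score(current)       # model
        if score > best_score:
            best_vars, best_score = list(current), score
        if len(current) == min_keep:
            break
        weakest = min(current, key=lambda v: importance[v])  # rank
        current.remove(weakest)                               # select, then model again
    return best_vars, best_score


def toy_fit(variables: List[str]):
    """Invented stand-in for a real model: X1 and X2 carry signal, the rest hurt slightly."""
    signal = {"X1": 0.30, "X2": 0.15}
    score = 0.5 + sum(signal.get(v, -0.01) for v in variables)
    importance = {v: signal.get(v, 0.001) for v in variables}
    return score, importance


best_vars, best_score = shave(["X1", "X2", "X3", "X4", "X5"], toy_fit)
print(best_vars, round(best_score, 3))
```

In a real project the callable would fit a CART or TreeNet model and score it on a held-out test sample; the loop itself does not change.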
Salford Predictive Modeler™ and its component data mining engines CART®, MARS®, TreeNet®, and RandomForests® contain a variety of tools to help modelers work quickly and efficiently. One of the most effective tools for rapid model development is found in the BATTERY tab of the MODEL Set Up dialog. Because there are so many tools embedded in that dialog we are going to start a series of posts going through the principal BATTERY choices, one at a time.
Let’s start with the idea of the BATTERY. The BATTERY mechanism is an automated system for running experiments and trying out different modeling ideas. Instead of you having to work out how to tweak your model to make it better, the BATTERY does it for you. Each BATTERY is a planned experiment in which we take some useful modeling control and run a series of models, systematically changing that control. The best part is the SUMMARY, which gives you an executive summary of the results and points you to the best performing model. We recommend that you use the BATTERY often; some modelers don’t do anything without setting up pre-packaged or user-customized batteries.
The most recent versions of Salford Predictive Modeler™ SPM PRO EX include a new BATTERY to invoke bootstrapped replication of most model types available in SPM. One of our reasons for adding this BATTERY was to provide access to the full CART engine when generating RandomForests® (RF) models. The principal advantages of this are:
Breiman’s original RF uses a stripped down and simplified tree growing algorithm designed for speed. It lacks tree growing options and missing value handling, and for many users Breiman's RF is confined to classification problems. By accessing the full CART engine with all of its Salford extensions and customized controls, modelers can accomplish far more sophisticated analyses, handle missing values with surrogates, and apply penalties and constraints. Most importantly for those interested in continuous dependent variables, BATTERY BOOTSTRAP gives access to both Least Squares (LS) and Least Absolute Deviation (LAD) regression trees.
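To make the LS versus LAD distinction concrete: within a terminal node, a least squares tree predicts the mean of the target (the value that minimizes squared error), while a least absolute deviation tree predicts the median (the value that minimizes absolute error). The short Python sketch below uses invented target values and is not tied to any SPM internals; it only shows why LAD is less sensitive to outliers.

```python
from statistics import mean, median

# Invented target values for the records in one terminal node; 250.0 is an outlier.
node_targets = [10.0, 12.0, 11.0, 13.0, 250.0]

ls_prediction = mean(node_targets)     # minimizes the sum of squared errors
lad_prediction = median(node_targets)  # minimizes the sum of absolute errors


def sse(pred, ys):
    """Sum of squared errors of a constant prediction."""
    return sum((y - pred) ** 2 for y in ys)


def sad(pred, ys):
    """Sum of absolute deviations of a constant prediction."""
    return sum(abs(y - pred) for y in ys)


print("LS  prediction:", ls_prediction, "SSE:", sse(ls_prediction, node_targets))
print("LAD prediction:", lad_prediction, "SAD:", sad(lad_prediction, node_targets))
```

The single outlier pulls the LS prediction far away from the bulk of the node, while the LAD prediction stays with the typical records.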
The principal drawback of BATTERY BOOTSTRAP is that the extra machinery comes with a computational price: RF runs under BATTERY BOOTSTRAP are much slower than under Breiman's RF. The extra robustness, ability to handle huge problems, and added controls should often make the slower runs worthwhile. Also observe that at the moment the RF post-model visualization machinery is not available.
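For readers unfamiliar with bootstrapped replication, the resampling step can be sketched as follows. This is generic bootstrap sampling in Python, not a description of how BATTERY BOOTSTRAP is implemented; the function name, row count, and number of replicates are all assumptions chosen for illustration.

```python
import random


def bootstrap_sample(n_rows: int, rng: random.Random):
    """Draw n_rows row indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    out_of_bag = sorted(set(range(n_rows)) - set(in_bag))
    return in_bag, out_of_bag


rng = random.Random(0)
n_rows, n_replicates = 1000, 5  # illustrative sizes only
for rep in range(n_replicates):
    in_bag, oob = bootstrap_sample(n_rows, rng)
    # A real run would grow one tree on the in-bag rows here, then
    # average (or vote) the predictions of all trees at the end.
    print(f"replicate {rep}: {len(set(in_bag))} distinct in-bag rows, {len(oob)} out-of-bag")
```

Each replicate sees a different resampled copy of the learn data, which is where the cost comes from: every replicate is a full tree-growing run.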
Use Battery SHAVE in the Salford Predictive Modeler™ to improve your model performance, increase model simplicity, and decrease the number of predictors needed for an accurate model. Using this battery will help streamline and automate your modeling for optimal results.
We can dig deeper than we did in our previous post into the reasons why more compact predictor lists can improve decision trees. Recall that a CART tree is grown by searching for splits across all predictors and all possible split points in a given partition of the learning data. There is no guarantee that the split chosen this way will be as good on the previously unseen test data. Occasionally, the best split on the learn data will be a lucky draw, and the split will not be confirmed on test data. The more predictors there are to search, the more opportunities there are for such a lucky draw. In the original CART monograph, large sample theory was intended to assure that in very large samples CART will always correct any unfortunate splits made as the tree evolves by making the correct splits lower down in the tree. With sufficiently large samples, enough data always remain to converge to the best model. In most real world situations, however, we will not want to rely on massive data sets to get to the best model, and we may not have enough data to assure the desired result.
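The exhaustive search described above can be sketched in Python for a binary target. The sketch assumes Gini impurity as the splitting criterion (the text does not say which criterion is in use) and considers the midpoints between adjacent distinct values as candidate split points; the data are invented, and none of this is taken from the CART engine itself.

```python
from typing import List, Sequence, Tuple


def gini(labels: Sequence[int]) -> float:
    """Gini impurity of a set of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(rows: List[Sequence[float]], labels: List[int]) -> Tuple[int, float, float]:
    """Search every predictor and every midpoint between distinct values.

    Returns (predictor_index, threshold, impurity_decrease) for the best split.
    """
    n = len(rows)
    parent = gini(labels)
    best = (-1, 0.0, 0.0)
    for j in range(len(rows[0])):
        values = sorted(set(r[j] for r in rows))
        for lo, hi in zip(values, values[1:]):
            cut = (lo + hi) / 2.0
            left = [y for r, y in zip(rows, labels) if r[j] <= cut]
            right = [y for r, y in zip(rows, labels) if r[j] > cut]
            child = (len(left) * gini(left) + len(right) * gini(right)) / n
            if parent - child > best[2]:
                best = (j, cut, parent - child)
    return best


# Invented learn data: predictor 0 separates the classes, predictor 1 is noise.
learn_x = [(1.0, 5.0), (2.0, 1.0), (3.0, 4.0), (6.0, 2.0), (7.0, 6.0), (8.0, 3.0)]
learn_y = [0, 0, 0, 1, 1, 1]
print(best_split(learn_x, learn_y))
```

Because the winner is simply the largest impurity decrease on the learn data, every additional noise predictor is another chance for a spurious split to come out on top, which is the risk a shorter predictor list reduces.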
The Salford Predictive Modeler™ suite (SPM) includes a number of automated tools to assist in the process of feature selection under the BATTERY mechanism. For example,
BATTERY KEEP
Selects a subset of features at random and builds a model from this random subset only. The GUI will guide you in how to use this option, but from the command line you would issue something like:
BATTERY KEEP=100, 15
Which requests 100 models, each of which includes 15 randomly-selected predictors. If we are sure that we want certain variables included in every such model, the command would look like:
BATTERY KEEP=100, 15 CORE= X1, X2, X3, X4, X5
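The effect of these commands on the predictor lists can be mimicked in Python. The sketch below is only an illustration: the function name is hypothetical, the predictor names X1 to X57 are invented, and it assumes the 15 random predictors are drawn in addition to the CORE variables, which the text above does not confirm.

```python
import random
from typing import List, Sequence


def keep_battery(predictors: Sequence[str], n_models: int, n_random: int,
                 core: Sequence[str] = (), seed: int = 0) -> List[List[str]]:
    """Return n_models predictor lists: the core variables plus n_random others chosen at random.

    Assumption: the random picks come on top of the core variables.
    """
    rng = random.Random(seed)
    pool = [p for p in predictors if p not in core]
    return [list(core) + rng.sample(pool, n_random) for _ in range(n_models)]


all_predictors = [f"X{i}" for i in range(1, 58)]  # hypothetical names X1..X57
core_vars = ["X1", "X2", "X3", "X4", "X5"]
subsets = keep_battery(all_predictors, n_models=100, n_random=15, core=core_vars)
print(len(subsets), subsets[0])
```

Each of the resulting lists would then be handed to the modeling engine, and the models compared on a common test sample.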
Did you know you can easily build a family of CART models with the BATTERY feature? It’s true! BATTERY is one of the most powerful aspects of the Salford Predictive Modeling Suite (SPM). For instance, suppose you wish to consider how the size of your CART tree affects the tree’s predictive accuracy. You might build a series of individual trees yourself, or you can let BATTERY do it for you. Four batteries -- ATOM, MINCHILD, DEPTH and NODES -- work in similar ways by varying the allowable size of the atom, minchild, tree depth and the number of nodes permitted in the maximal tree. These controls constrain how large your CART tree is permitted to grow. Because they are tree-oriented controls, they work with TreeNet and RandomForests models too. For example, by issuing just the following simple series of commands you will find yourself with eight CART trees, which you can easily compare against one another to find a tradeoff between predictive accuracy and tree complexity that works best for you:


The commands above, using BATTERY MINCHILD, will vary the "minchild parameter" in your models. This is a constraint on the minimum child node allowed in the tree: no split is permitted that produces a child node smaller than the minchild. BATTERY ATOM works in a similar way, except that it controls the atom size: a node smaller than the atom will not be split at all. BATTERY NODES varies the number of nodes permitted in the maximal tree, while BATTERY DEPTH varies the maximum depth permitted for the tree. Note that all four of these batteries can be combined, to produce a series of 28 models. The commands:
produce the following:

These batteries also work well with TreeNet and RandomForests models. For instance, you may wish to consider how the number of nodes affects the performance of your TreeNet model. Suppose you wish to try five tree sizes in your TreeNet modeling:
The first model will build a TreeNet model consisting of trees having one split only (structurally precluding any interactions), while the remaining models will allow successively more interactions to occur because each tree can contain several splits. In this particular example, cross entropy (CXE) and classification error improve as the number of nodes permitted in the trees increases, but ROC and lift are relatively unaffected.


SPM has over 50 different BATTERY options. We will describe some of the others in the coming weeks.
These commands will generate a series of eight models, presented below in a brief summary table that shows the accuracy of each model. Note that because the same learn/test sample split is used in all eight models, an honest comparison of their predictive accuracies can be made. Each model can be explored in detail by clicking on its line in the summary report, which will bring up a navigator with full tree detail. Two or more navigators can be viewed on screen at once.
A model battery is simply a series of predictive models that are built on your data using some systematic variation of a model parameter, or by a mechanism in which one model determines how a subsequent model is built. The underlying predictive model algorithm could be CART, TreeNet, RandomForests or MARS.
To begin, let me introduce by way of example one of SPM's simplest batteries, BATTERY SAMPLE. This battery repeatedly cuts the learn sample down while leaving the test sample unchanged, in an effort to illustrate the effect that dataset size has on the accuracy or size of the model. In this example, consider CART models and how they respond when the learn sample is altered.
Let's consider a binary (0/1) target in a dataset with 4601 records and 57 predictors. 20% of the data will be randomly selected and held aside as a test sample (N=943), while the remainder of the data will serve as the learn sample (N=3658). I prefer to use the ROC statistic (actually, the integrated area under the Receiver Operating Characteristic curve, also referred to as the AUC statistic) as measured on the test sample to determine how well the models perform, since this is a commonly-used measure in many of the industries in which the Salford Predictive Model Builder is used. Note that the ROC/AUC statistic is also provided for the learn sample, for those that are curious.
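For readers who want to see what the ROC/AUC statistic measures, here is a minimal Python sketch. It uses the rank interpretation of the area under the curve: the probability that a randomly chosen record with target 1 receives a higher score than a randomly chosen record with target 0, with ties counted as one half. The labels and scores are invented, and the pairwise loop is written for clarity rather than speed.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one record of each class")
    # A positive outscoring a negative counts 1, a tie counts 0.5.
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


test_labels = [0, 0, 1, 1, 0, 1]
test_scores = [0.10, 0.40, 0.35, 0.80, 0.20, 0.90]
print(roc_auc(test_labels, test_scores))
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect separation, which is why values above 0.9 indicate a strong model.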
BATTERY SAMPLE builds a series of five models, in this case five CART trees. The test sample remains the same in all five models, but the learn sample is repeatedly cut. Starting with 100% of the learn sample, models are then built with 3/4, 1/2, 1/4 and 1/8 of the learn sample. A summary of these five models is presented comparing the number of terminal nodes, the ROC/AUC statistic, and the size of the learn sample among all five models.
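The learn-sample schedule can be sketched in Python as follows. This is an illustration under stated assumptions, not SPM's procedure: the function name is hypothetical, fractional sizes are truncated to whole records, and each smaller sample is taken as a subset of the larger one. The text does not specify either the rounding or the nesting.

```python
import random


def sample_battery(learn_ids, fractions=(1.0, 0.75, 0.5, 0.25, 0.125), seed=0):
    """Yield (fraction, learn_subset) pairs; the test sample is never touched."""
    rng = random.Random(seed)
    shuffled = list(learn_ids)
    rng.shuffle(shuffled)
    for frac in fractions:
        n_keep = int(len(shuffled) * frac)  # truncation is an assumption
        yield frac, shuffled[:n_keep]       # nested subsets, also an assumption


learn_ids = range(3658)  # learn sample size from the example above
for frac, subset in sample_battery(learn_ids):
    print(f"{frac:>6.3f} of learn sample -> {len(subset)} records")
```

Each subset would be used to grow one model, with all five models scored on the same untouched test sample.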

What is notable in this example is that the ROC does not vary overly much among the five models in spite of the fact that the learn sample drops by almost 90%. In other words, while the first model built on all the learn data has 136 terminal nodes and an ROC/AUC test sample statistic of 0.9269, the smallest model built on only 1/8 of the learn data has many fewer nodes (10), yet its ROC/AUC statistic is not much less: 0.9147. It should be pointed out that these direct comparisons are possible because a single test sample is used for all five models.
These results suggest that there is a strong signal in the data and that where CART is applied to these data and this target, it is reasonably impervious to the amount of learn sample data. Indeed, much of the signal required to predict the target can be achieved with the smallest tree containing only ten terminal nodes. The largest and most complex tree in the first model may be significantly better on the test sample for some performance measures, but using ROC/AUC to judge the models the marginal improvement obtained by going from 1/8 of the learn sample to the entire learn sample is not great.
SPM batteries, of which there are over 50, are particularly useful for getting a good handle on the modeling properties inherent in your data. Every dataset has its own idiosyncrasies, and sometimes many models must be generated to get a sense of what a dataset's particular properties are. Whether your preferred analysis tool is CART, TreeNet, MARS or RandomForests, SPM batteries make investigating these matters quick and easy.
Naturally, when building a single model destined for deployment and prediction-making, all available data should be used. In this example, the first model based on all the data is a good candidate for this purpose. However, using BATTERY SAMPLE to generate four additional models served to illustrate how the amount of learn data affects the size and performance of the model, lending confidence that the model would not change in any profound way by the addition of more data. This fortunate property of robustness is not shared by all datasets, however, and it is wise to establish this when a new data mining project is begun. In our consulting work at Salford Systems, we routinely use BATTERY SAMPLE to quickly and easily assess this aspect of the data we analyze, often as one of the first analysis efforts we carry out for our clients.