Users of cross validation (CV) in CART, MARS, and TreeNet have become accustomed to simply requesting this testing method when setting up a predictive model and allowing the software to take care of the details. The Salford software prepares the data automatically and uses stratified sampling to randomly assign each record to a CV bin. In this default mode, the user has no control over how the bins are managed.
There will be times, however, when it is advantageous to construct these CV bins yourself. This can occur, for example, if you want to compare results across different software tools so as to be sure that any differences in results between methods are not due to the cross-validation process itself. By using the same CV bins in every modeling run, you can be sure that any differences in performance are due only to differences in modeling methods.
Analysts with repeated observations on subjects will want to assign subjects rather than individual data records to CV bins, keeping all data belonging to a given subject together at all times. In data with a temporal dimension, it may be desirable to break the data into bins along the time dimension (for example, assigning records from every calendar month to a distinct bin).
Although assigning data records to CV bins needs to be conducted with care to ensure that the right kind of balancing of the data is maintained, the process is quite simple and mechanical. In preparing the data for analysis you need to create a new column of data on which the bin assignments for every record in the training data will be recorded. If you prefer to work with a numeric bin variable, we recommend that you use the integers from 1 through K, where K is the number of bins you want. If you prefer to work with a text or character variable to record bin assignments, then you are free to come up with any unique labels you like for the bins.
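As a rough illustration of how such a bin column might be prepared outside the modeling software, here is a minimal Python sketch. The function names, the round-robin dealing of subjects, and the ISO date strings are assumptions made for this example; they are not part of any Salford tool.

```python
import random

def assign_bins_by_group(group_ids, k, seed=0):
    """Map each record to a bin 1..k so that all records sharing a
    group id (for example, a subject id) land in the same bin."""
    unique_groups = sorted(set(group_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_groups)
    # Deal the shuffled groups round-robin so that the bins hold
    # similar numbers of groups (not necessarily of records).
    group_to_bin = {g: (i % k) + 1 for i, g in enumerate(unique_groups)}
    return [group_to_bin[g] for g in group_ids]

def assign_bins_by_month(dates):
    """Give every distinct calendar month its own bin, numbered in
    time order. Dates are assumed to be ISO strings (YYYY-MM-DD)."""
    months = sorted({d[:7] for d in dates})
    month_to_bin = {m: i + 1 for i, m in enumerate(months)}
    return [month_to_bin[d[:7]] for d in dates]

# Hypothetical data: nine records belonging to six subjects.
subjects = ["s1", "s1", "s2", "s3", "s3", "s3", "s4", "s5", "s6"]
subject_bins = assign_bins_by_group(subjects, k=3, seed=42)

dates = ["2020-01-05", "2020-01-20", "2020-02-11", "2020-03-02"]
month_bins = assign_bins_by_month(dates)
```

The resulting list would be written back to the data as the new bin column. Because subjects rather than records are dealt into bins, the bins can differ in record count when subjects contribute different numbers of rows.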
Once the bin variables are created you can use the GUI to let CART, MARS, or TreeNet know that you intend to use your own CV bin variable on the MODEL setup dialog’s TEST tab. In the screen capture below you can see that we have selected “Variable determines cross-validation bins” as our test method.

Alternatively, you can issue a command of the form:
which is the command that would be generated by the GUI for you from the screen above.
If you are going to create a CVBIN variable, we recommend that you create several versions using different random number seeds. Results will vary across different partitionings of the data, and it is useful to know how sensitive your results are to the partitioning.
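One way to prepare several versions of the bin variable is sketched below in Python. It assigns bins at random within each target class so that every bin receives a near-equal share of each class. This is only an illustration of the idea: it is not claimed to reproduce the stratified sampling used inside the Salford software, and the column names CVBIN1 through CVBIN3 are hypothetical.

```python
import random

def stratified_bins(targets, k, seed):
    """Assign bins 1..k at random within each target class so that
    every bin receives a near-equal share of each class."""
    rng = random.Random(seed)
    bins = [0] * len(targets)
    for cls in sorted(set(targets)):
        idx = [i for i, t in enumerate(targets) if t == cls]
        rng.shuffle(idx)
        # Deal the shuffled records of this class round-robin into bins.
        for pos, i in enumerate(idx):
            bins[i] = (pos % k) + 1
    return bins

# Hypothetical binary target: 70 zeros and 30 ones.
targets = [0] * 70 + [1] * 30

# One bin column per seed; the names are illustrative only.
versions = {f"CVBIN{n}": stratified_bins(targets, k=10, seed=n)
            for n in (1, 2, 3)}
```

Running the same model once per bin column and comparing the test results gives a direct view of how much the reported performance depends on the particular partition.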
Did you know you can easily build a family of CART models with the BATTERY feature? It’s true! BATTERY is one of the most powerful aspects of the Salford Predictive Modeling Suite (SPM). For instance, suppose you wish to consider how the size of your CART tree affects the tree’s predictive accuracy. You might build a series of individual trees yourself, or you can let BATTERY do it for you. Four batteries -- ATOM, MINCHILD, DEPTH and NODES -- work in similar ways by varying the allowable size of the atom, minchild, tree depth and the number of nodes permitted in the maximal tree. These controls constrain how large your CART tree is permitted to grow. Because they are tree-oriented controls, they work with TreeNet and RandomForests models too. For example, by issuing just the following simple series of commands you will find yourself with eight CART trees, which you can easily compare against one another to find a tradeoff between predictive accuracy and tree complexity that works best for you:


The commands above, using BATTERY MINCHILD, will vary the "minchild" parameter in your models. This is a constraint on the minimum child node size allowed in the tree: no split is permitted that produces a child node smaller than the minchild. BATTERY ATOM works in a similar way, except that it controls the atom size: a node smaller than the atom will not be split at all. BATTERY NODES varies the number of nodes permitted in the maximal tree, while BATTERY DEPTH varies the maximum depth permitted for the tree. Note that all four of these batteries can be combined to produce a series of 28 models. The commands:
produce the following:

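The atom and minchild rules described above can be expressed as a simple check on a proposed split. The Python function below is a hypothetical illustration of those two rules only; it is not the SPM implementation, and its name and signature are invented for this sketch.

```python
def split_allowed(node_size, left_size, right_size, atom, minchild):
    """Return True if a proposed split respects both size constraints."""
    if left_size + right_size != node_size:
        raise ValueError("child sizes must add up to the parent node size")
    if node_size < atom:
        # A node smaller than the atom is not split at all.
        return False
    if min(left_size, right_size) < minchild:
        # No split may produce a child smaller than the minchild.
        return False
    return True

ok = split_allowed(100, 60, 40, atom=10, minchild=5)
too_small_parent = split_allowed(8, 4, 4, atom=10, minchild=1)
too_small_child = split_allowed(100, 97, 3, atom=10, minchild=5)
```

Raising either limit rules out more splits, which is why varying these parameters yields trees of different sizes.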
These batteries also work well with TreeNet and RandomForests models. For instance, you may wish to consider how the number of nodes affects the performance of your TreeNet model. Suppose you wish to try five tree sizes in your TreeNet modeling:
The first model will build a TreeNet model consisting of trees having one split only (structurally precluding any interactions), while the remaining models will allow successively more interactions to occur because each tree can contain several splits. In this particular example, cross entropy (CXE) and classification error improve as the number of nodes permitted in the trees increases, but ROC and lift are relatively unaffected.


SPM has over 50 different BATTERY options. We will describe some of the others in the coming weeks.
The BATTERY MINCHILD commands will generate a series of eight models, presented below in a brief summary table that shows the accuracy of each model. Note that because the same learn/test sample split is used in all eight models, an honest comparison of their predictive accuracies can be made. Each model can be explored in detail by clicking on its line in the summary report, which will bring up a navigator with full tree detail. Two or more navigators can be viewed on screen at once.
A model battery is simply a series of predictive models that are built on your data using some systematic variation of a model parameter, or by a mechanism in which one model determines how a subsequent model is built. The underlying predictive model algorithm could be CART, TreeNet, RandomForests or MARS.
To begin, let me introduce by way of example one of SPM's simplest batteries, BATTERY SAMPLE. This battery repeatedly cuts the learn sample down while leaving the test sample unchanged, in an effort to illustrate the effect that dataset size has on the accuracy or size of the model. In this example, consider CART models and how they respond when the learn sample is altered.
Let's consider a binary (0/1) target in a dataset with 4601 records and 57 predictors. Roughly 20% of the data will be randomly selected and held aside as a test sample (N=943), while the remainder of the data will serve as the learn sample (N=3658). I prefer to use the ROC statistic (strictly, the integrated area under the Receiver Operating Characteristic curve, also referred to as the AUC statistic) as measured on the test sample to determine how well the models perform, since this is a commonly used measure in many of the industries in which the Salford Predictive Model Builder is used. Note that the ROC/AUC statistic is also provided for the learn sample, for those who are curious.
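For readers unfamiliar with the statistic, the area under the ROC curve equals the probability that a randomly chosen positive record receives a higher score than a randomly chosen negative record, with ties counted as one half. The short Python sketch below computes it that way on a few made-up scores. It is a plain pairwise illustration, not the routine SPM uses, and the function name is hypothetical.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve for a 0/1 target, computed as the
    fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))

# Made-up example: three of the four positive/negative pairs are
# ordered correctly, so the AUC is 0.75.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.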
BATTERY SAMPLE builds a series of five models, in this case five CART trees. The test sample remains the same in all five models, but the learn sample is repeatedly cut. Starting with 100% of the learn sample, models are then built with 3/4, 1/2, 1/4 and 1/8 of the learn sample. A summary of these five models is presented comparing the number of terminal nodes, the ROC/AUC statistic, and the size of the learn sample among all five models.
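The sampling scheme just described can be mimicked in a few lines of Python. This is only a sketch under stated assumptions: the subsets are drawn as nested prefixes of one random shuffle and sizes are rounded down, neither of which is confirmed by the description above, and the function name is invented for this example.

```python
import random

# Learn sample fractions used by the five models described above.
FRACTIONS = [1.0, 0.75, 0.5, 0.25, 0.125]

def learn_subsamples(learn_ids, fractions=FRACTIONS, seed=0):
    """Return one list of record ids per fraction. Subsets are nested
    prefixes of a single shuffle (an assumption of this sketch)."""
    rng = random.Random(seed)
    shuffled = list(learn_ids)
    rng.shuffle(shuffled)
    # Sizes are rounded down (also an assumption of this sketch).
    return [shuffled[: int(len(shuffled) * f)] for f in fractions]

# The learn sample in this example has 3658 records; the test sample
# is left untouched and is therefore not represented here.
learn_ids = list(range(3658))
subsets = learn_subsamples(learn_ids)
sizes = [len(s) for s in subsets]
```

Each subset would be used to fit one model, and every model would then be scored on the same held-out test sample.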

What is notable in this example is that the ROC varies little among the five models even though the learn sample shrinks by almost 90%. The first model, built on all the learn data, has 136 terminal nodes and a test sample ROC/AUC statistic of 0.9269. The smallest model, built on only 1/8 of the learn data, has far fewer nodes (10), yet its ROC/AUC statistic is only slightly lower: 0.9147. These direct comparisons are possible because a single test sample is used for all five models.
These results suggest that there is a strong signal in the data and that, when CART is applied to these data and this target, it is reasonably insensitive to the amount of learn sample data. Indeed, much of the signal required to predict the target can be captured by the smallest tree, which contains only ten terminal nodes. The largest and most complex tree in the first model may be significantly better on the test sample for some performance measures, but judging the models by ROC/AUC, the marginal improvement obtained by going from 1/8 of the learn sample to the entire learn sample is not great.
SPM batteries, of which there are over 50, are particularly useful for getting a good handle on the modeling properties inherent in your data. Every dataset has its own idiosyncrasies, and sometimes many models must be generated to get a sense of what a dataset's particular properties are. Whether your preferred analysis tool is CART, TreeNet, MARS or RandomForests, SPM batteries make investigating these matters quick and easy.
Naturally, when building a single model destined for deployment and prediction-making, all available data should be used. In this example, the first model based on all the data is a good candidate for this purpose. However, using BATTERY SAMPLE to generate four additional models served to illustrate how the amount of learn data affects the size and performance of the model, lending confidence that the model would not change in any profound way by the addition of more data. This fortunate property of robustness is not shared by all datasets, however, and it is wise to establish this when a new data mining project is begun. In our consulting work at Salford Systems, we routinely use BATTERY SAMPLE to quickly and easily assess this aspect of the data we analyze, often as one of the first analysis efforts we carry out for our clients.