What is an SPM Battery?
A model battery is simply a series of predictive models that are built on your data using some systematic variation of a model parameter, or by a mechanism in which one model determines how a subsequent model is built. The underlying predictive model algorithm could be CART, TreeNet, RandomForests or MARS.
To begin, let me introduce, by way of example, one of SPM's simplest batteries: BATTERY SAMPLE. This battery repeatedly cuts the learn sample down while leaving the test sample unchanged, to illustrate the effect that the size of the learn sample has on the accuracy and size of the model. In this example, consider CART models and how they respond when the learn sample is altered.
Let's consider a binary (0/1) target in a dataset with 4601 records and 57 predictors. About 20% of the data will be randomly selected and held aside as a test sample (N=943), while the remainder of the data will serve as the learn sample (N=3658). I prefer to use the ROC statistic (strictly, the integrated area under the Receiver Operating Characteristic curve, also referred to as the AUC statistic) as measured on the test sample to determine how well the models perform, since this is a commonly used measure in many of the industries in which the Salford Predictive Model Builder is used. Note that the ROC/AUC statistic is also provided for the learn sample, for those who are curious.
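For readers who have not computed the ROC/AUC statistic by hand, here is a minimal sketch in Python using the rank-sum identity: the AUC equals the probability that a randomly chosen record with target 1 is scored higher than a randomly chosen record with target 0, with ties counted as one half. The function name roc_auc and the toy data are illustrative assumptions; this is not how SPM computes the statistic internally.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity.

    labels: sequence of 0/1 target values.
    scores: predicted scores, higher meaning "more likely to be 1".
    Records with tied scores receive the average of their ranks.
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    n_pos = sum(label for _, label in pairs)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one record of each class")

    rank_sum_pos = 0.0
    i = 0
    while i < n:
        # Locate the run of tied scores occupying positions i .. j-1.
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(label for _, label in pairs[i:j])
        i = j

    # Subtract the smallest possible rank sum for the positives, then
    # normalise by the number of (positive, negative) pairs.
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


# Toy example (made-up scores, not the dataset discussed in this article).
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)
```

A value of 1.0 means every target-1 record outranks every target-0 record, while 0.5 is what random scoring would give.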
BATTERY SAMPLE builds a series of five models, in this case five CART trees. The test sample remains the same in all five models, but the learn sample is repeatedly cut. Starting with 100% of the learn sample, models are then built with 3/4, 1/2, 1/4 and 1/8 of the learn sample. A summary of these five models is presented comparing the number of terminal nodes, the ROC/AUC statistic, and the size of the learn sample among all five models.
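The sampling scheme just described can be sketched in a few lines of Python. This is only an illustration of the idea, under two assumptions of mine: that the smaller learn samples are nested subsets of the larger ones, and that fractional sizes are truncated. The article does not say how SPM selects the records, and the names learn_sample_schedule and LEARN_FRACTIONS are hypothetical.

```python
import random

# Learn-sample fractions named in the text: all, 3/4, 1/2, 1/4 and 1/8.
LEARN_FRACTIONS = (1.0, 0.75, 0.5, 0.25, 0.125)


def learn_sample_schedule(learn_ids, fractions=LEARN_FRACTIONS, seed=0):
    """Return one list of learn-record ids per fraction.

    Assumption: each smaller sample is a prefix of a single shuffled
    ordering, so the samples are nested. Sizes are truncated to whole
    records, which is an arbitrary choice for this sketch.
    """
    shuffled = list(learn_ids)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[: int(len(shuffled) * f)] for f in fractions]


learn_ids = range(3658)        # learn sample size from the example
test_ids = range(3658, 4601)   # test sample: held fixed, never subsampled

subsets = learn_sample_schedule(learn_ids)
sizes = [len(s) for s in subsets]

# One model would be fit per entry in `subsets`, and every model would be
# scored against the same `test_ids`, which keeps the results comparable.
```

Holding test_ids fixed is the important design point: only the learn sample changes from model to model.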
What is notable in this example is how little the ROC varies among the five models, even though the learn sample shrinks by nearly 90% (from the full sample down to 1/8 of it). The first model, built on all the learn data, has 136 terminal nodes and a test sample ROC/AUC statistic of 0.9269. The smallest model, built on only 1/8 of the learn data, has far fewer nodes (10), yet its ROC/AUC statistic is only slightly lower: 0.9147. It should be pointed out that these direct comparisons are possible because a single test sample is used for all five models.
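To put numbers on that comparison, the short Python calculation below works only from the figures quoted above (the two ROC/AUC values, the two node counts, and the 1/8 fraction). The variable names are my own.

```python
# Figures quoted in the text for the largest and smallest learn samples.
full_auc, small_auc = 0.9269, 0.9147   # test-sample ROC/AUC
full_nodes, small_nodes = 136, 10      # terminal nodes
smallest_fraction = 1 / 8              # share of the learn sample retained

learn_reduction = 1 - smallest_fraction        # share of learn data removed
auc_drop = full_auc - small_auc                # absolute loss in ROC/AUC
auc_drop_relative = auc_drop / full_auc        # loss relative to the full model
node_reduction = 1 - small_nodes / full_nodes  # share of terminal nodes removed

print(f"learn data removed:     {learn_reduction:.1%}")
print(f"absolute ROC/AUC drop:  {auc_drop:.4f}")
print(f"relative ROC/AUC drop:  {auc_drop_relative:.1%}")
print(f"terminal nodes removed: {node_reduction:.1%}")
```

Removing 87.5% of the learn data costs about 0.012 in test ROC/AUC, which is the sense in which the models barely differ.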
These results suggest that there is a strong signal in the data and that, when CART is applied to these data and this target, it is reasonably insensitive to the amount of learn sample data. Indeed, much of the signal required to predict the target is captured by the smallest tree, which contains only ten terminal nodes. The largest and most complex tree in the first model may be significantly better on the test sample for some performance measures, but when ROC/AUC is used to judge the models, the marginal improvement obtained by going from 1/8 of the learn sample to the entire learn sample is not great.
SPM batteries, of which there are over 50, are particularly useful for getting a good handle on the modeling properties inherent in your data. Every dataset has its own idiosyncrasies, and sometimes many models must be generated to get a sense of what a dataset's particular properties are. Whether your preferred analysis tool is CART, TreeNet, MARS or RandomForests, SPM batteries make investigating these matters quick and easy.
Naturally, when building a single model destined for deployment and prediction-making, all available data should be used. In this example, the first model, built on the entire learn sample, is a good candidate for this purpose. However, using BATTERY SAMPLE to generate four additional models served to illustrate how the amount of learn data affects the size and performance of the model, lending confidence that the model would not change in any profound way with the addition of more data. This fortunate property of robustness is not shared by all datasets, however, and it is wise to check for it at the start of a new data mining project. In our consulting work at Salford Systems, we routinely use BATTERY SAMPLE to quickly and easily assess this aspect of the data we analyze, often as one of the first analysis efforts we carry out for our clients.