A model battery is simply a series of predictive models that are built on your data using some systematic variation of a model parameter, or by a mechanism in which one model determines how a subsequent model is built. The underlying predictive model algorithm could be CART, TreeNet, RandomForests or MARS.
To begin, let me introduce one of SPM's simplest batteries by way of example: BATTERY SAMPLE. This battery repeatedly cuts the learn sample down while leaving the test sample unchanged, in order to illustrate the effect that dataset size has on the accuracy and size of the model. In this example, we consider CART models and how they respond when the learn sample is altered.
Let's consider a binary (0/1) target in a dataset with 4601 records and 57 predictors. Roughly 20% of the data is randomly selected and held aside as a test sample (N=943), while the remainder serves as the learn sample (N=3658). I prefer to judge model performance with the ROC statistic as measured on the test sample (strictly, the integrated area under the Receiver Operating Characteristic curve, also called the AUC statistic), since this is a commonly used measure in many of the industries in which the Salford Predictive Modeler is used. The ROC/AUC statistic is also reported for the learn sample, for those who are curious.
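For readers who want to see what the ROC/AUC statistic measures, here is a minimal Python sketch. It is not SPM code; the function name `roc_auc` is my own, and it assumes that a higher score means the record is more likely to be class 1. It uses the rank-sum identity: the AUC equals the probability that a randomly chosen class 1 record scores higher than a randomly chosen class 0 record, with ties counted as one half.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity.

    labels: sequence of 0/1 target values.
    scores: predicted scores, higher meaning "more likely class 1"
            (an assumption of this sketch).
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n

    # Assign 1-based ranks, giving tied scores their average rank.
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1

    n_pos = sum(1 for _, label in pairs if label == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one record of each class")

    pos_rank_sum = sum(r for r, (_, label) in zip(ranks, pairs) if label == 1)
    # Subtract the smallest possible rank sum, then scale to [0, 1].
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

A value of 0.5 corresponds to random ordering and 1.0 to a perfect ranking of the test records.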
BATTERY SAMPLE builds a series of five models, in this case five CART trees. The test sample remains the same in all five models, but the learn sample is repeatedly cut. Starting with 100% of the learn sample, models are then built with 3/4, 1/2, 1/4 and 1/8 of the learn sample. A summary of these five models is presented comparing the number of terminal nodes, the ROC/AUC statistic, and the size of the learn sample among all five models.
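The sampling scheme just described can be sketched in a few lines of Python. This is only an illustration of the idea, not how SPM implements it: the function name `battery_sample_plan`, its parameters, and the choice to draw each reduced learn sample independently at random are all my assumptions. Note too that an exact 20% cut of 4601 records gives 920 test records, not the 943 obtained in the run described here.

```python
import random


def battery_sample_plan(n_records, test_fraction=0.2,
                        learn_fractions=(1.0, 0.75, 0.5, 0.25, 0.125),
                        seed=0):
    """Return (test_idx, plan) for a BATTERY SAMPLE-style experiment.

    plan is a list of (fraction, learn_subset) pairs. The test sample is
    drawn once and reused, so every model is scored on the same records.
    """
    rng = random.Random(seed)
    indices = list(range(n_records))
    rng.shuffle(indices)

    n_test = round(n_records * test_fraction)
    test_idx = indices[:n_test]    # fixed for all models
    learn_idx = indices[n_test:]   # full learn sample

    plan = []
    for frac in learn_fractions:
        n_keep = round(len(learn_idx) * frac)
        # Assumption: each reduced learn sample is a fresh random subset.
        plan.append((frac, rng.sample(learn_idx, n_keep)))
    return test_idx, plan


test_idx, plan = battery_sample_plan(4601)
for frac, subset in plan:
    print(f"learn fraction {frac}: {len(subset)} records, "
          f"test sample: {len(test_idx)} records")
```

Each learn subset would then be passed to the modeling algorithm of your choice, with every resulting model evaluated on `test_idx`.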

What is notable in this example is that the ROC varies little among the five models even though the learn sample shrinks by almost 90% (from all of the learn data to 1/8 of it). While the first model, built on all the learn data, has 136 terminal nodes and a test sample ROC/AUC statistic of 0.9269, the smallest model, built on only 1/8 of the learn data, has far fewer nodes (10), yet its ROC/AUC statistic is only slightly lower: 0.9147. These direct comparisons are possible because a single test sample is used for all five models.
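To put those figures side by side, here is a small Python calculation using only the numbers quoted above. The variable names are mine.

```python
# Figures quoted in the text for the first and last models in the battery.
full = {"learn_fraction": 1.0, "nodes": 136, "test_auc": 0.9269}
small = {"learn_fraction": 0.125, "nodes": 10, "test_auc": 0.9147}

# How much smaller the learn sample and the tree become.
learn_reduction = 1 - small["learn_fraction"] / full["learn_fraction"]
node_reduction = 1 - small["nodes"] / full["nodes"]

# How much test sample ROC/AUC is given up in exchange.
auc_drop = full["test_auc"] - small["test_auc"]
relative_auc_drop = auc_drop / full["test_auc"]

print(f"learn sample reduced by {learn_reduction:.1%}")
print(f"terminal nodes reduced by {node_reduction:.1%}")
print(f"test ROC/AUC lower by {auc_drop:.4f} ({relative_auc_drop:.1%} relative)")
```

The learn sample shrinks by 87.5% and the tree loses more than nine tenths of its terminal nodes, while the test ROC/AUC falls by only 0.0122.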
These results suggest that there is a strong signal in the data and that, when CART is applied to these data and this target, performance is reasonably insensitive to the amount of learn sample data. Indeed, much of the signal required to predict the target can be captured by the smallest tree, which contains only ten terminal nodes. The largest and most complex tree, from the first model, may be significantly better on the test sample for some performance measures, but judged by ROC/AUC the marginal improvement obtained by going from 1/8 of the learn sample to the entire learn sample is not great.
SPM batteries, of which there are over 50, are particularly useful for getting a good handle on the modeling properties inherent in your data. Every dataset has its own idiosyncrasies, and sometimes many models must be generated to get a sense of what a dataset's particular properties are. Whether your preferred analysis tool is CART, TreeNet, MARS or RandomForests, SPM batteries make investigating these matters quick and easy.
Naturally, when building a single model destined for deployment and prediction-making, all available data should be used. In this example, the first model based on all the data is a good candidate for this purpose. However, using BATTERY SAMPLE to generate four additional models served to illustrate how the amount of learn data affects the size and performance of the model, lending confidence that the model would not change in any profound way by the addition of more data. This fortunate property of robustness is not shared by all datasets, however, and it is wise to establish this when a new data mining project is begun. In our consulting work at Salford Systems, we routinely use BATTERY SAMPLE to quickly and easily assess this aspect of the data we analyze, often as one of the first analysis efforts we carry out for our clients.