How much memory is needed in order to complete a model building run in SPM?
A useful measure of training set size is the total number of cells used for modeling (N-cells), calculated as the number of modeling variables (predictors, target, weight) times the number of learning records.
Remember, it isn't the size of the dataset that really matters – it's how much of it you want to build your model with.
Below are rough estimates of the minimum total memory, in bytes, required by each of the main data mining engines available in SPM, expressed in terms of the total cell count (N-cells) described above:
CART: 8 X N-cells
MARS: 8 X N-cells
TreeNet: 12 X N-cells (default fast mode)
TreeNet: 8 X N-cells (slower, memory-efficient mode)
RandomForest: 8 X N-cells
There is a simple explanation for the pattern of 8 and 12. For memory efficiency, each training cell is stored internally as a 4-byte number (either a single-precision floating-point number or an integer), so 4 X N-cells bytes are needed to hold the training data itself. Most engines also require an array of sort indexes for all available predictors so that the split-search routines run as fast as possible (most of the modeling time is spent searching for the best split at each node, over and over again). This index array also takes 4 X N-cells bytes and has to be held in memory all at once. Together, these 8 X N-cells bytes are typically the dominant component of the application's memory footprint. Additional memory may be needed for the inner workings of the algorithms, but its share becomes less and less significant as datasets get larger.
Note that TreeNet in its default mode uses a second index array to support fast sampling operations, which adds another 4 X N-cells bytes to the footprint. This extra overhead can be avoided (MART LOWMEM=YES option, available in SPM 7.0) at the cost of a slight increase in overall run time.
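The per-engine estimates above can be turned into a small calculator. The following is only a sketch in Python, not part of SPM: the function name, the dictionary, and the engine labels are hypothetical, and the multipliers are simply the rough figures quoted in this article.

```python
# Rough minimum bytes per training cell, taken from the estimates in this article.
# The dictionary keys are illustrative labels, not SPM command names.
BYTES_PER_CELL = {
    "CART": 8,                  # 4-byte data + 4-byte sort index
    "MARS": 8,
    "RandomForest": 8,
    "TreeNet": 12,              # default fast mode: extra 4-byte sampling index
    "TreeNet (low-memory)": 8,  # memory-efficient mode
}


def min_memory_bytes(engine: str, n_variables: int, n_records: int) -> int:
    """Estimate the dominant part of the memory footprint, in bytes.

    n_variables counts every modeling variable (predictors, target, weight);
    n_records is the number of learning records.
    """
    n_cells = n_variables * n_records
    return BYTES_PER_CELL[engine] * n_cells


if __name__ == "__main__":
    for engine in BYTES_PER_CELL:
        estimate = min_memory_bytes(engine, n_variables=1_000, n_records=1_000_000)
        print(f"{engine}: {estimate / 1e9:.1f} GB")
```

The result covers only the data and index arrays; as noted above, the algorithms need some additional working memory, so treat the figure as a lower bound.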
Going back to our example with 1,000 predictors and 1,000,000 observations, about 8 GB of memory will be needed to build a CART, MARS, or RandomForest model, and about 12 GB to build a TreeNet model in its default fast mode. Note that, given the complexity of the underlying algorithms, overall run time may still become an issue on datasets of this size and larger, even though the entire dataset is loaded in memory and readily accessible. One way to improve the situation is to run pilot models on a smaller sample of observations, use these models to filter out unimportant predictors, and then build large-sample models on the reduced set of important predictors. Remember, it is the total count of training data cells that matters in the above calculations: reducing the number of cells will not only reduce the memory footprint of the application proportionally, but may also dramatically speed up your runs.
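To make the proportional effect concrete, the sketch below compares the full example with two reduced scenarios. The 50,000-record pilot sample and the 100 retained predictors are made-up numbers chosen purely for illustration, and the helper function is hypothetical rather than anything provided by SPM.

```python
def footprint_gb(n_variables: int, n_records: int, bytes_per_cell: int) -> float:
    """Dominant memory footprint in GB (10**9 bytes) for a given cell count."""
    return bytes_per_cell * n_variables * n_records / 1e9


# (label, number of variables, number of records); only the first row comes
# from the example in the text, the other two are illustrative assumptions.
scenarios = [
    ("full data", 1_000, 1_000_000),
    ("pilot sample (assumed 50,000 records)", 1_000, 50_000),
    ("reduced predictors (assumed 100 kept)", 100, 1_000_000),
]

for label, n_vars, n_recs in scenarios:
    at_8 = footprint_gb(n_vars, n_recs, bytes_per_cell=8)    # CART, MARS, RandomForest
    at_12 = footprint_gb(n_vars, n_recs, bytes_per_cell=12)  # TreeNet, default mode
    print(f"{label}: {at_8:.1f} GB at 8 bytes/cell, {at_12:.1f} GB at 12 bytes/cell")
```

Cutting either dimension by a given factor cuts the estimate by the same factor, which is why filtering predictors with a pilot model pays off directly in memory.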