Are there limitations on the learn sample size when using cross validation?
By default, CART will not allow cross-validation (CV) on any dataset with more than 3,000 observations. The n-fold cross-validation technique is designed to get the most out of datasets that are too small to accommodate a hold-out or test sample. Once you have 3,000 records or more, we recommend using a separate test set.
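To make the n-fold idea concrete, here is a minimal Python sketch of how observations can be partitioned so that every record is used for testing exactly once. It only illustrates the general technique and is not CART's internal implementation; the function name `kfold_indices`, the choice of 10 folds, and the 200-record example are assumptions made for this example.

```python
import random

def kfold_indices(n_obs, n_folds=10, seed=0):
    """Yield (learn, test) index lists for n-fold cross-validation.

    Hypothetical helper for illustration; not part of CART.
    """
    idx = list(range(n_obs))
    random.Random(seed).shuffle(idx)           # randomize record order
    folds = [idx[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]                        # one fold is held out
        learn = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield learn, test

# A small dataset: each of the 10 runs learns on 180 records and tests on 20.
splits = list(kfold_indices(n_obs=200, n_folds=10))
```

Because each record is held out exactly once, all of the data contributes to both learning and testing, which is why the technique suits small samples.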
For large datasets, use a separate test set instead, either by manually splitting the dataset into learn and test samples (ERROR TEST or ERROR SEPVAR) or by using a randomly selected test set (ERROR PROPORTION).
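As a rough illustration of a randomly selected test set, the Python sketch below sends each record to the test sample with a fixed probability. How CART performs the selection internally is not documented here, so treat the per-record random draw, the function name `holdout_split`, and the 0.2 proportion as assumptions.

```python
import random

def holdout_split(n_obs, test_proportion=0.2, seed=0):
    """Randomly divide record indices into learn and test samples.

    Hypothetical helper for illustration; not part of CART.
    """
    rng = random.Random(seed)
    learn, test = [], []
    for i in range(n_obs):
        # Each record lands in the test sample with probability test_proportion.
        (test if rng.random() < test_proportion else learn).append(i)
    return learn, test

# With 50,000 records, roughly 20% end up in the test sample.
learn_idx, test_idx = holdout_split(50000, test_proportion=0.2)
```

With this many records the test sample is large enough to give a stable error estimate from a single run, so the repeated runs of cross-validation are unnecessary.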
However, you can still use CV on a larger dataset by raising the limit with the command:
BOPTIONS CVLEARN = n
The default value for n is 3000, but it can be reset to a larger value. For example, if you have 50,000 observations and want to use the entire dataset in a cross-validation run, issue the command:
BOPTIONS CVLEARN = 50000
Steinberg, Dan and Phillip Colla. CART—Classification and Regression Trees. San Diego, CA: Salford Systems, 1997.