Using Your Own Cross-Validation Bins
Users of cross validation (CV) in CART, MARS, and TreeNet have become accustomed to simply requesting this testing method when setting up a predictive model and allowing the software to take care of the details. Behind the scenes, the Salford software prepares the data automatically and uses stratified sampling to randomly assign each record to a CV bin. In this default mode the user has no influence and no control over how the bins are managed.
There will be times, however, when it is advantageous to construct these CV bins yourself. One example is comparing results across different software tools: by using the same CV bins in every modeling run, you can be sure that any differences in performance are due only to differences in modeling methods and not to the cross-validation process itself. Analysts with repeated observations on subjects will want to assign subjects rather than individual data records to CV bins, keeping all data belonging to a given subject together at all times. In data with a temporal dimension, it may be desirable to break the data into bins along the time dimension (for example, assigning records from each calendar month to a distinct bin).
Although assigning data records to CV bins must be done with care to ensure that the right kind of balance in the data is maintained, the process is quite simple and mechanical. In preparing the data for analysis, you need to create a new column in which the bin assignment for every record in the training data is recorded. If you prefer to work with a numeric bin variable, we recommend that you use the integers from 1 through K, where K is the number of bins you want. If you prefer to work with a text or character variable to record bin assignments, then you are free to come up with any unique labels you like for the bins.
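As an illustration of how such a bin column might be built outside the Salford software, here is a minimal Python sketch. It is not the procedure the software uses internally. The function names stratified_bins and grouped_bins are hypothetical, as is the assumption that your target values and subject identifiers are available as plain Python lists. The first function spreads each target class evenly over bins 1 through K; the second assigns whole subjects to bins so that all records for a subject stay together.

```python
import random
from collections import defaultdict


def stratified_bins(targets, k, seed=0):
    """Return one bin label (1..k) per record, balanced within each target class.

    Hypothetical helper: `targets` is a list holding the target value of every record.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(targets):
        by_class[y].append(idx)

    bins = [0] * len(targets)
    offset = 0  # carried across classes so overall bin sizes stay even
    for y in sorted(by_class, key=str):
        idxs = by_class[y]
        rng.shuffle(idxs)  # random order within the class
        for pos, idx in enumerate(idxs):
            bins[idx] = (pos + offset) % k + 1  # deal records out round-robin
        offset += len(idxs)
    return bins


def grouped_bins(subject_ids, k, seed=0):
    """Return one bin label (1..k) per record, keeping each subject in a single bin.

    Hypothetical helper: `subject_ids` is a list holding the subject of every record.
    """
    rng = random.Random(seed)
    subjects = sorted(set(subject_ids), key=str)
    rng.shuffle(subjects)  # random order of subjects, not of records
    subject_bin = {s: i % k + 1 for i, s in enumerate(subjects)}
    return [subject_bin[s] for s in subject_ids]
```

The resulting list can be written out as an extra column alongside the training data. Note that grouped_bins balances the number of subjects per bin, not the number of records, so bins can differ in size when subjects contribute very different numbers of records.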
Once the bin variable has been created, you can tell CART, MARS, or TreeNet to use it on the TEST tab of the MODEL setup dialog. In the screen capture below you can see that we have selected "Variable determines cross-validation bins" as our test method.
Alternatively, you can issue a command of the form:
which is the command that would be generated by the GUI for you from the screen above.
If you are going to create a CVBIN variable, we recommend that you create several versions using different random number seeds. Results will vary across different partitionings of the data, and it is useful to know how sensitive your results are to the particular partitioning used.
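One way to produce several versions of the bin variable is sketched below in Python. This is only an illustration under stated assumptions: the column names CVBIN1, CVBIN2, and so on, the helper names random_bins and cvbin_columns, and the seed values are all hypothetical, and the assignment here is simple balanced random assignment without stratification.

```python
import random


def random_bins(n_records, k, seed):
    """Return a shuffled list of bin labels 1..k with near-equal bin sizes."""
    rng = random.Random(seed)
    labels = [i % k + 1 for i in range(n_records)]  # balanced labels before shuffling
    rng.shuffle(labels)
    return labels


def cvbin_columns(n_records, k, seeds):
    """Build one bin column per seed, named CVBIN1, CVBIN2, ... (hypothetical names)."""
    return {
        f"CVBIN{j}": random_bins(n_records, k, seed)
        for j, seed in enumerate(seeds, start=1)
    }


# Three alternative 10-bin partitions of a 100-record training set.
columns = cvbin_columns(n_records=100, k=10, seeds=[11, 22, 33])
```

Each column can then be used in a separate modeling run, and the spread in the cross-validated results across runs gives a rough sense of how much they depend on the partitioning.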