Rules of Thumb When Working With Small Data Samples

Written by  Dan Steinberg Friday, December 02 2011
Binary Classification

CART®

The original CART monograph discusses a study the authors performed working with 215 observations and 19 predictors, where 37 records were of class 1 and 178 of class 0. We think that this example, with 37 records in the smaller class, is close to the smallest sample size you can usefully work with in CART.

Recommendation: We suggest a minimum of 100 records for up to 30 predictors, with the target variable no more unbalanced than proportions of (1/3, 2/3). We recommend repeated cross-validation to estimate performance on out-of-sample (previously unseen) data.
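The repeated cross-validation recommended above can be sketched in plain Python. This is only an illustration, not Salford Systems' implementation: the function names, the 5-fold and 10-repeat defaults, and the `fit`/`predict` callables are assumptions made for the example. The class counts reuse the 37/178 split from the CART monograph study, the features are placeholders, and the majority-class baseline stands in for a real model.

```python
import random
from collections import Counter, defaultdict


def repeated_stratified_folds(labels, n_folds=5, n_repeats=10, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated stratified CV."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    for repeat in range(n_repeats):
        folds = [[] for _ in range(n_folds)]
        for indices in by_class.values():
            shuffled = indices[:]
            rng.shuffle(shuffled)  # a fresh shuffle on every repeat
            # Deal each class round-robin so every fold keeps the class mix.
            for pos, idx in enumerate(shuffled):
                folds[pos % n_folds].append(idx)
        for k in range(n_folds):
            test_idx = sorted(folds[k])
            train_idx = sorted(i for j, f in enumerate(folds) if j != k for i in f)
            yield repeat, k, train_idx, test_idx


def repeated_cv_accuracy(X, y, fit, predict, n_folds=5, n_repeats=10, seed=0):
    """Average held-out accuracy over all repeats and folds."""
    scores = []
    for _, _, train_idx, test_idx in repeated_stratified_folds(y, n_folds, n_repeats, seed):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        preds = predict(model, [X[i] for i in test_idx])
        hits = sum(p == y[i] for p, i in zip(preds, test_idx))
        scores.append(hits / len(test_idx))
    return sum(scores) / len(scores)


# Hypothetical baseline "model": always predict the most common training class.
def fit_majority(X_train, y_train):
    return Counter(y_train).most_common(1)[0][0]


def predict_majority(model, X_test):
    return [model] * len(X_test)


y = [1] * 37 + [0] * 178      # class counts from the monograph example
X = [[0.0]] * len(y)          # placeholder features, unused by the baseline
baseline = repeated_cv_accuracy(X, y, fit_majority, predict_majority)
print(f"majority-class baseline accuracy: {baseline:.3f}")
```

Any real model should beat this baseline on the same folds; with a sample this small, the spread of scores across repeats is as informative as the average.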

MARS®

Some of our clients have reported working with extremely small samples for regression, using about 30 records and 10 predictors to develop very compact models. We do not have much experience with small-sample binary response models.

Recommendation: A minimum of 30 records in the smaller of the two classes, and thus a sample size of 60 for a balanced sample, working with up to 15 predictors.
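As a quick sanity check, the thresholds in this recommendation can be written as a small helper. The function name and signature are hypothetical, and the numbers are the rules of thumb quoted above rather than hard limits enforced by MARS.

```python
def meets_mars_rule_of_thumb(class_counts, n_predictors,
                             min_minority=30, max_predictors=15):
    """Return True if a binary sample meets the rule-of-thumb thresholds.

    class_counts: the number of records in each of the two classes.
    """
    if len(class_counts) != 2:
        raise ValueError("expected counts for exactly two classes")
    # The smaller class is the binding constraint, not the total sample size.
    return min(class_counts) >= min_minority and n_predictors <= max_predictors


print(meets_mars_rule_of_thumb((30, 30), n_predictors=15))
print(meets_mars_rule_of_thumb((25, 200), n_predictors=10))
```

The second call illustrates why the rule is stated in terms of the smaller class: 225 records in total still leave only 25 examples of the rarer outcome.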

TreeNet®

TreeNet is probably our most effective tool for working with very small samples in the context of many predictors. We have seen successful results with as few as 30 records in the smaller class while working with several thousand predictors in genetics research.

Recommendation: We strongly advise lowering the minimum node size from the default of 10 to as low as 3. Repeated CV is required to determine the likely out-of-sample performance of the final model. Repeated CV and bootstrap resampling (via the BATTERY BOOTSTRAP) are best for honest performance assessments.
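The idea behind bootstrap resampling for performance assessment can be sketched in plain Python. This is a generic out-of-bag bootstrap and is not a reproduction of the BATTERY BOOTSTRAP command, whose exact procedure is not described here. The function names, the default of 200 resamples, and the `fit`/`predict` callables are assumptions made for the example.

```python
import random


def bootstrap_oob_splits(n_records, n_resamples=200, seed=0):
    """Yield (in_bag, out_of_bag) index lists for each bootstrap resample."""
    rng = random.Random(seed)
    for _ in range(n_resamples):
        # Draw n_records indices with replacement to form the training set.
        in_bag = [rng.randrange(n_records) for _ in range(n_records)]
        drawn = set(in_bag)
        # Records never drawn form the held-out (out-of-bag) test set.
        out_of_bag = [i for i in range(n_records) if i not in drawn]
        if out_of_bag:  # skip a resample that leaves nothing to test on
            yield in_bag, out_of_bag


def bootstrap_oob_accuracy(X, y, fit, predict, n_resamples=200, seed=0):
    """Return (mean, min, max) out-of-bag accuracy across resamples."""
    scores = []
    for in_bag, oob in bootstrap_oob_splits(len(y), n_resamples, seed):
        model = fit([X[i] for i in in_bag], [y[i] for i in in_bag])
        preds = predict(model, [X[i] for i in oob])
        scores.append(sum(p == y[i] for p, i in zip(preds, oob)) / len(oob))
    if not scores:
        raise ValueError("no resample produced an out-of-bag set")
    return sum(scores) / len(scores), min(scores), max(scores)
```

Here `fit(X_train, y_train)` is expected to return a model and `predict(model, X_test)` a list of predicted labels. Reporting the minimum and maximum alongside the mean shows how much the estimate moves from one resample to the next, which matters most when the smaller class has only a few dozen records.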

Dan Steinberg

Dan Steinberg, President and Founder of Salford Systems, is a well-respected member of the statistics and econometrics communities. In 1992, he developed the first PC-based implementation of the original CART procedure, working in concert with Leo Breiman, Richard Olshen, Charles Stone and Jerome Friedman. In addition, he has provided consulting services on a number of biomedical and market research projects, which have sparked further innovations in the CART program and methodology.

Dr. Steinberg received his Ph.D. in Economics from Harvard University and has given full-day presentations on data mining for the American Marketing Association, the Direct Marketing Association and the American Statistical Association. A book he co-authored on Classification and Regression Trees was awarded the 1999 Nikkei Quality Control Literature Prize in Japan for excellence in statistical literature promoting the improvement of industrial quality control and management.