Our recent set of posts on illegitimate predictors was prompted by a fascinating paper on the topic presented at the KDD2011 data mining conference. The paper, by Kaufman, Rosset, and Perlich, focused on public modeling competitions. Such competitions have been in vogue since at least 1997, when the first KDDCup was conducted, and were given a terrific boost when Netflix offered a $1 million prize for a movie recommendation system that could beat their own internally developed system.
The Salford Predictive Modeler™ suite (SPM) includes a number of automated tools to assist in the process of feature selection under the BATTERY mechanism. For example,
BATTERY KEEP
Selects a subset of features at random and builds a model from this random subset only. The GUI will guide you in how to use this option, but from the command line you would issue something like:
BATTERY KEEP=100, 15
This requests 100 models, each of which includes 15 randomly selected predictors. If we are sure that we want certain variables included in every such model, the command would look like:
BATTERY KEEP=100, 15 CORE= X1, X2, X3, X4, X5
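To make the idea concrete outside of SPM, here is a minimal Python sketch of the kind of predictor lists such a battery would work through. It is an illustration only, not SPM's implementation: the function name random_keep_lists, the predictor names X1 through X50, and the seed are all hypothetical, and we assume that the 15 random predictors are drawn from the non-core variables and that the core variables are added on top. SPM may count the core variables differently.

```python
import random

def random_keep_lists(predictors, n_models, n_keep, core=(), seed=0):
    """Return n_models predictor lists: the core variables plus n_keep
    predictors drawn at random (without replacement) from the rest.

    Assumption: core variables do not count toward n_keep.
    """
    rng = random.Random(seed)
    core = list(core)
    pool = [p for p in predictors if p not in core]
    if n_keep > len(pool):
        raise ValueError("n_keep exceeds the number of non-core predictors")
    return [core + rng.sample(pool, n_keep) for _ in range(n_models)]

# Hypothetical data set with 50 candidate predictors named X1..X50.
predictors = [f"X{i}" for i in range(1, 51)]
core_vars = ["X1", "X2", "X3", "X4", "X5"]

keep_lists = random_keep_lists(predictors, n_models=100, n_keep=15, core=core_vars)

print(len(keep_lists))     # number of models requested
print(len(keep_lists[0]))  # predictors per model: 5 core + 15 random
```

Each list would then be handed to whatever learner you are using, and the resulting models compared on test data to see which subsets perform best.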
The basic question is:
If CART is a great variable selector, why should I do any variable selection at all? Isn’t it better to let CART do everything automatically?
More technically:
If I have already built a CART tree using a given list of variables, why would rebuilding with fewer predictors sometimes yield a better-performing tree? Didn't CART already make the best possible decisions regarding which variables to use in any part of the tree?
A number of points should be made regarding these issues. The first, and the simplest to understand, is that CART is a myopic model builder that looks only at the split it is currently working on. This means that CART does not look ahead to the splits that will later be made on the children and grandchildren of the current node. Consider just the root node for the sake of argument. Suppose that we have five relatively strong splitters for the root. CART chooses the split generating the greatest reduction in Gini impurity (by default) and then goes on to build the entire tree using the same split selection criterion. Suppose instead that we were to split the root on the second-best root node splitter. It could happen that the overall tree now generated is a better performer on test data than the default tree.
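A toy example can make the myopia point tangible. The Python sketch below is not CART itself; it is a stripped-down greedy tree builder for binary predictors that picks each split by Gini reduction. The data are invented for illustration: the target is the XOR of x1 and x2, and x3 is a noisy copy of the target. We also assume a depth limit of two splits (standing in for pruning or limited data) and, to keep things short, score the trees on the same rows used to build them rather than on separate test data.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_gain(rows, feat):
    """Reduction in Gini impurity from splitting rows on binary feature feat."""
    left = [y for x, y in rows if x[feat] == 0]
    right = [y for x, y in rows if x[feat] == 1]
    n = len(rows)
    parent = gini([y for _, y in rows])
    return parent - len(left) / n * gini(left) - len(right) / n * gini(right)

def build(rows, depth, force_root=None):
    """Greedy tree: a leaf is a class label, a node is (feat, left, right).

    force_root overrides the greedy choice at the top node only.
    """
    labels = [y for _, y in rows]
    majority = Counter(labels).most_common(1)[0][0]
    if depth == 0 or len(set(labels)) == 1:
        return majority
    if force_root is not None:
        feat = force_root
    else:
        feat = max(range(len(rows[0][0])), key=lambda f: split_gain(rows, f))
    left = [(x, y) for x, y in rows if x[feat] == 0]
    right = [(x, y) for x, y in rows if x[feat] == 1]
    if not left or not right:
        return majority
    return (feat, build(left, depth - 1), build(right, depth - 1))

def predict(tree, x):
    while isinstance(tree, tuple):
        feat, left, right = tree
        tree = right if x[feat] else left
    return tree

def accuracy(tree, rows):
    return sum(predict(tree, x) == y for x, y in rows) / len(rows)

# Invented data: y = x1 XOR x2; x3 equals y in 3 of every 4 rows.
rows = []
for x1 in (0, 1):
    for x2 in (0, 1):
        y = x1 ^ x2
        rows += [((x1, x2, y), y)] * 3 + [((x1, x2, 1 - y), y)]

greedy_tree = build(rows, depth=2)                # greedy root choice
forced_tree = build(rows, depth=2, force_root=0)  # root forced to x1

print(greedy_tree[0], accuracy(greedy_tree, rows))  # 2 0.75
print(forced_tree[0], accuracy(forced_tree, rows))  # 0 1.0
```

At the root, x3 (index 2) offers the largest Gini reduction, while x1 and x2 each offer none on their own, so the greedy builder commits to x3 and never recovers within two levels. Forcing the root onto x1, a splitter that looks worse in isolation, lets x2 finish the job and gives a perfect fit. Real data are rarely this clean, but the mechanism is the same one described above.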
Salford Systems' 6th International Applied Data Mining Conference, a user-oriented data mining and predictive analytics conference, was held in San Diego from August 23rd through August 25th, 2009, hosting over 100 people and offering 32 presentations across multiple tracks. Topics included what went wrong in the financial markets, best practice analytics in banking and insurance underwriting, fraud detection, discovering unexploded ordnance in minefields, various topics in healthcare and bioinformatics, predictive analytics for optimal placement of web advertisements in an ad network, genetics research, and techniques for building better models.
We were also honored to have scientific thought leaders Jerome Friedman and Richard Olshen presenting summaries of their most recent research. Jerry Friedman spoke about his Generalized PathSeeker approach to regularized regression; this technology offers high speed LASSO-style regression for extreme data set configurations with upwards of 100,000 predictors and possibly very few rows. Such data sets are commonplace in gene research and text mining and the new technology is both supremely fast and efficient. (GPS is currently available in limited release versions of Salford predictive analytics software.)
The agenda for the conference can be viewed at: http://www.salforddatamining.com/agenda.php. If you are interested in attending an online replay of the conference, please contact us. We will be offering video recordings of the conference sometime in October.