Dan Steinberg, President and Founder of Salford Systems, is a well-respected member of the statistics and econometrics communities. In 1992, he developed the first PC-based implementation of the original CART procedure, working in concert with Leo Breiman, Richard Olshen, Charles Stone and Jerome Friedman. In addition, he has provided consulting services on a number of biomedical and market research projects, which have sparked further innovations in the CART program and methodology.
Dr. Steinberg received his Ph.D. in Economics from Harvard University, and has given full-day presentations on data mining for the American Marketing Association, the Direct Marketing Association and the American Statistical Association. A book he co-authored on Classification and Regression Trees was awarded the 1999 Nikkei Quality Control Literature Prize in Japan for excellence in statistical literature promoting the improvement of industrial quality control and management.
Using dates in any kind of predictive model can be tricky to get right. It is important to be clear about what you are trying to accomplish. Suppose, for example, we are trying to predict sales of a specific brand of beer in a given store and have daily sales data going back several years. One of the patterns we are going to want to track and capture is “seasonality,” which refers to changes in sales levels due to the season of the year. We might find that beer sales of all types are typically highest in the summer months, lowest in the winter, and intermediate in spring and fall. Of course, seasonality is only one factor among many, and good forecasts will require much more information than the date. To capture seasonality, statisticians and econometricians have long resorted to introducing variables to reflect the season of the year. This could be captured by a categorical variable coded, say, “fall,” “winter,” “spring,” “summer.” A modeler might instead prefer to introduce a variable for the month of the year or even the week or the day of the year. The point is that this variable would be extracted from the date, and we would leverage the fact that we can observe the seasonal pattern more than once to draw conclusions about something like a “summer effect.”
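To make the idea of extracting seasonal variables from a date concrete, here is a minimal Python sketch. The function name `seasonal_features` and the month-to-season mapping are assumptions made for illustration (Northern Hemisphere, with seasons defined by calendar month); a real project would choose season boundaries to suit the product and the market.

```python
from datetime import date

# Assumed mapping from calendar month to season (Northern Hemisphere).
SEASON_BY_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "fall", 10: "fall", 11: "fall",
}


def seasonal_features(d: date) -> dict:
    """Derive candidate seasonality predictors from a single calendar date."""
    return {
        "season": SEASON_BY_MONTH[d.month],      # categorical: fall/winter/spring/summer
        "month": d.month,                        # 1..12
        "week_of_year": d.isocalendar()[1],      # ISO week number
        "day_of_year": d.timetuple().tm_yday,    # 1..365 (366 in leap years)
    }


# Example: features for one day of sales data.
features = seasonal_features(date(2011, 7, 4))
```

Note that none of these variables carries the year, so the same "summer" value recurs across every year in the data. That recurrence is what lets a model estimate a seasonal effect rather than memorize individual dates.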
Our recent set of posts on the topic of illegitimate predictors was provoked by a fascinating recent paper on the topic presented at the KDD2011 technical data mining conference. The paper, by Kaufman, Rosset, and Perlich, focused on public modeling competitions. Such competitions have been in vogue since at least 1997 when the first KDDCup was conducted, and were given a terrific boost when Netflix offered a $1 million prize for a movie recommendation system that could beat their own internally developed system.
In the last post we briefly described a situation in which it was possible to inadvertently use illegitimate data. In this post we discuss some other situations in which this can occur. Here are a few examples:
At Salford Systems we take pride in pointing out that much of the work of modern analytics can be automated using our advanced technology. And indeed, our process of going from raw data to high quality predictive models is vastly faster than it was when we used classical statistical models some 20 years ago. But not everything that needs to be done in model construction is 100% automatable, and this is especially true when it comes to avoiding certain common blunders. In this article our focus is on the inadvertent use of information that should never have been used in building the model. Although we can provide some rules of thumb and some management advice to protect against this type of blunder, at present it appears that avoiding these errors requires specific knowledge of the details of all of the potential fields in the database. In other words, there are some errors that will probably always be avoided only through the good judgment and vigilance of human experts. This article is devoted to one such problem: the use of predictors which should never be used even though they appear in the database.
We can dig deeper than we did in our previous post into the reasons why more compact predictor lists can improve decision trees. Recall that a CART tree is grown by searching for splits across all predictors and all possible split points in a given partition of the learning data, and the split that looks best on the learn data is the one chosen. There is no guarantee that this same split will be as good on the previously unseen test data. Occasionally, the best split on the learn data will be a lucky draw, and the split will not be confirmed on test data. In the original CART monograph, large sample theory was intended to assure that in very large samples CART will always correct any unfortunate splits made as the tree evolves by making the correct splits lower down in the tree. With sufficiently large samples, enough data are always left to converge to the best model. In most real world situations, however, we will not want to rely on massive data sets to get to the best model, and we may not have enough data to assure the desired result.
The Salford Predictive Modeler™ suite (SPM) includes a number of automated tools to assist in the process of feature selection under the BATTERY mechanism. For example,
BATTERY KEEP
Selects a subset of features at random and builds a model from this random subset only. The GUI will guide you in how to use this option, but from the command line you would issue something like:
BATTERY KEEP=100, 15
This requests 100 models, each of which includes 15 randomly-selected predictors. If we are sure that we want certain variables included in every such model, the command would look like:
BATTERY KEEP=100, 15 CORE= X1, X2, X3, X4, X5
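For readers who want to see the idea behind random predictor subsets outside of SPM, the following Python sketch generates such lists. It is not SPM's implementation: the function `random_predictor_subsets` and its parameters are hypothetical, and it assumes the randomly chosen predictors are drawn in addition to the core variables, which may differ from how the BATTERY KEEP command counts them.

```python
import random


def random_predictor_subsets(predictors, n_models, n_random, core=(), seed=0):
    """Return n_models predictor lists: the core variables plus
    n_random other predictors drawn at random without replacement."""
    rng = random.Random(seed)  # fixed seed so the subsets are reproducible
    core = list(core)
    # Core variables are always included, so exclude them from the random pool.
    pool = [p for p in predictors if p not in core]
    if n_random > len(pool):
        raise ValueError("not enough non-core predictors to sample from")
    return [core + rng.sample(pool, n_random) for _ in range(n_models)]


# Example: 50 candidate predictors named X1..X50 (illustrative names only).
all_predictors = [f"X{i}" for i in range(1, 51)]
core_vars = ["X1", "X2", "X3", "X4", "X5"]
subsets = random_predictor_subsets(
    all_predictors, n_models=100, n_random=15, core=core_vars
)
```

Each list in `subsets` would then be used to build one model, and the models compared on test data to see which predictor lists perform best.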
The basic question is:
If CART is a great variable selector, why should I do any variable selection at all? Isn’t it better to let CART do everything automatically?
More technically:
If I have already built a CART tree using a given list of variables, why would rebuilding with fewer predictors sometimes yield a better-performing tree? Didn't CART already make the best possible decisions regarding which variables to use in any part of the tree?
A number of points should be made regarding these issues. The first, and the simplest to understand, is that CART is a myopic model builder that looks only at the split it is currently working on. This means that CART does not look ahead to future splits to be made on the children and grandchildren of the current split. Consider just the root node for the sake of argument. Suppose that we have five relatively strong splitters for the root. Of course, CART chooses the split generating the greatest reduction in Gini impurity (by default) and then goes on to build the entire tree using the same split selection criterion. Suppose we were to split the root on the second best root node splitter. It could happen that the overall tree now generated is a better performer on test data than the default tree.
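The one-step nature of this search can be illustrated with a short Python sketch. It is a simplified stand-in, not the CART code itself: the helper names (`gini`, `gini_reduction`, `rank_splits`) are made up for this example, only numeric predictors are handled, and candidate thresholds are taken as midpoints between adjacent distinct values.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_reduction(values, labels, threshold):
    """Impurity of the parent minus the size-weighted impurity of the two children."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted


def rank_splits(columns, labels):
    """Rank every (predictor, threshold) candidate by its one-step Gini reduction.
    Nothing here looks ahead to what the child nodes could achieve later."""
    candidates = []
    for name, values in columns.items():
        distinct = sorted(set(values))
        for lo, hi in zip(distinct, distinct[1:]):
            threshold = (lo + hi) / 2
            candidates.append((gini_reduction(values, labels, threshold), name, threshold))
    return sorted(candidates, reverse=True)


# Tiny illustrative data set with two predictors and a binary target.
columns = {"x1": [1, 2, 3, 4], "x2": [1, 2, 1, 2]}
labels = [0, 0, 1, 1]
ranked = rank_splits(columns, labels)
best_reduction, best_name, best_threshold = ranked[0]
```

A greedy builder always takes `ranked[0]`. The experiment described above amounts to forcing the root to use `ranked[1]` instead, growing the rest of the tree as usual, and comparing the two finished trees on test data.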
When we began work on our first release of CART (a 1993 command line version running very nicely on UNIX), I was startled by some (now long forgotten) articles claiming to describe a new technology that was more accurate, or faster, on some class of analytic problem. At the time, I assumed that such articles needed to be taken seriously because they represented peer-reviewed, solidly-researched scientific advances.
Dan Steinberg, CEO of Salford Systems, has initiated a blog principally devoted to technical matters pertaining to our core products CART, MARS, TreeNet, RandomForests, Generalized Path Seeker, and RULEFIT, among others. This new blog focuses on the fields of data mining, machine learning, predictive analytics, and business intelligence, but with a personal perspective. Entries here could well recount conversations with product developer Jerry Friedman, or some time ago with Leo Breiman, or could reflect his thoughts on the art and practice of advanced analytics and the development of new analytics methodology.