Our recent series of posts on illegitimate predictors was provoked by a fascinating paper presented at KDD 2011, the technical data mining conference. The paper, by Kaufman, Rosset, and Perlich, focuses on public modeling competitions. Such competitions have been in vogue since at least 1997, when the first KDD Cup was held, and they received a terrific boost when Netflix offered a $1 million prize for a movie recommendation system that could beat its own internally developed system.
In the last post we briefly described a situation in which it was possible to inadvertently use illegitimate data. In this post we discuss some other situations in which this can occur. Here are a few examples:
At Salford Systems we take pride in pointing out that much of the work of modern analytics can be automated using our advanced technology. Indeed, our process of going from raw data to high-quality predictive models is vastly faster than it was when we used classical statistical models some 20 years ago. But not everything in model construction can be fully automated, and this is especially true of avoiding certain common blunders. In this article our focus is on the inadvertent use of information that should never have been used in building the model. Although we can provide some rules of thumb and some management advice to protect against this type of blunder, at present it appears that avoiding these errors requires specific knowledge of every potential field in the database. In other words, some errors will probably always be avoided only through the good judgment and vigilance of human experts. This article is devoted to one such problem: the use of predictors that should never be used even though they appear in the database.