Modeling Competitions and Illegitimate Predictors
Our recent set of posts on illegitimate predictors was provoked by a fascinating paper on the topic presented at the KDD 2011 data mining conference. The paper, by Kaufman, Rosset, and Perlich, focused on public modeling competitions. Such competitions have been in vogue since at least 1997, when the first KDDCup was conducted, and were given a terrific boost when Netflix offered a $1 million prize for a movie recommendation system that could beat their own internally developed system.
A typical format for such competitions is that an interesting database is secured and divided into two parts. One or more target variables are selected, and a modeling challenge related to the target is posed to the public. All the data are released with the exception of the target variable in one of the data partitions (the "test" partition). The goal is to develop predictive models using the complete ("training") partition and use those models to make predictions on the incomplete test partition. The competition organizers use the actual values of the target in the test partition to evaluate the predictions. Competitions might look for models that rank data well (e.g. area under the ROC curve), classify accurately, minimize average squared error, minimize cost, or maximize some other objective function.
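The workflow above can be sketched in a few lines. This is a toy illustration, not any actual competition's scoring code: the data are synthetic, the "model" is a trivial threshold rule, and the AUC is computed with the standard rank-sum formula. The point is only the separation of roles: participants fit on the training partition, while organizers alone hold the test targets used for scoring.

```python
# Toy sketch of the competition format: train/test split, a trivial model,
# and organizer-side scoring by area under the ROC curve (stdlib only).

import random

def auc(labels, scores):
    """AUC via the rank-sum (Mann-Whitney) statistic: the fraction of
    positive/negative pairs the scores rank correctly, ties counting half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

random.seed(0)
# Organizers hold the full data; participants never see the test targets.
data = [(random.gauss(y, 1.0), y) for y in [0, 1] * 200]
random.shuffle(data)
train, test = data[:200], data[200:]

# Participant side: "fit" a model on the training partition only.
# Here the model is just each record's feature value as its score.
predictions = [x for x, _ in test]

# Organizer side: score the submission against the withheld test targets.
y_test = [y for _, y in test]
print(f"test AUC = {auc(y_test, predictions):.3f}")
```

Because AUC depends only on the ranking of the scores, any monotone transformation of a participant's predictions receives the same score.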
Over the years, data mining competitions have secured data from online retailers, nonprofit fundraisers, telecommunications companies, medical research, and computer networks subject to intrusion, and have required both simple and very complex models to be developed. The topic addressed by Kaufman, Rosset, and Perlich concerns illegitimacy in the data made available to the participants. Put bluntly, they discuss several high-profile competitions in which a clever participant could essentially cheat because the competition organizers had unwittingly included the true responses (or a close approximation) somewhere in the data.
Obviously, such mistakes spoil the competition if any participant discovers the hidden information and exploits it. Yet participants who made such discoveries have cheerfully leveraged them to win, something we regard as unfortunate because it diminishes the value of all the work of the competition organizers. In our own competition experience, we took it upon ourselves to report any significant problems we found in the data to the competition organizers immediately.
In one such competition, the organizers responded by issuing a new version of the data with problems corrected. Although in this example the problem was data incoherence rather than illegitimacy, our point is that it would be better if participants took it upon themselves to report potentially damaging information to the organizers, allowing the organizers to decide whether to publish that information, correct the problem (perhaps by issuing a completely new test partition), or do nothing. Perhaps competition organizers should publish a code of ethics to which participants must agree in order to enter. The code would prohibit the participant from silently exploiting what is obviously an unintended error in the data. Any winner who was subsequently found to have exploited such an error would be subject to being stripped of their title — much like an Olympic athlete found to be using banned performance-enhancing drugs.
But competitions aside, the Kaufman, Rosset, and Perlich paper discusses a variety of examples in which preparing the data for subsequent predictive modeling created one or more illegitimate predictors, and in some cases the preparers did not recognize this before releasing the data for modeling purposes. When this happens in a private data mining project, it creates the risk that a predictive model developed on the data will never function properly in the context in which it is intended to be deployed. The lesson is that modelers must remain attentive and vigilant, because such problems are far more common than many analysts realize.
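One simple precaution, sketched below, is to screen each candidate feature on its own before modeling: a single feature that nearly perfectly ranks the target by itself deserves scrutiny, since it may be the target (or a close proxy) smuggled in during data preparation. This is our own illustrative check, not a procedure from the paper; the feature names and the 0.95 cutoff are invented for the example.

```python
# Hedged sketch of a single-feature leakage screen: compute each feature's
# standalone AUC against the target and flag near-perfect rankers.
# The synthetic "account_flag" column deliberately leaks the target.

import random

def auc(labels, scores):
    """AUC via the rank-sum statistic (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

random.seed(1)
n = 500
target = [random.randint(0, 1) for _ in range(n)]
features = {
    # A weak but legitimate predictor: mildly shifted by the target.
    "tenure_months": [random.gauss(0.3 * y, 1.0) for y in target],
    # An illegitimate predictor: the target itself plus a little noise.
    "account_flag": [y + random.gauss(0, 0.05) for y in target],
}

for name, values in features.items():
    score = auc(target, values)
    flag = "  <-- suspicious, possible leakage" if score > 0.95 else ""
    print(f"{name}: AUC = {score:.3f}{flag}")
```

A screen like this is cheap insurance, though it only catches single-feature leaks; a leaked target spread across a combination of features would require holding out data and checking whether model performance is implausibly good.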