Unintended Use of Illegitimate Data in Predictive Modeling Part 1
At Salford Systems we take pride in pointing out that much of the work of modern analytics can be automated using our advanced technology. Indeed, our process of going from raw data to high quality predictive models is vastly faster than it was when we used classical statistical models some 20 years ago. But not everything that needs to be done in model construction is 100% automatable, and this is especially true when it comes to avoiding certain common blunders. In this article our focus is on the inadvertent use of information that should never have been used in model construction. Although we can provide some rules of thumb and some management advice to protect against this type of blunder, at present it appears that avoiding these errors requires specific knowledge of the details of every potential field in the database. In other words, some errors will probably always be avoided only through the good judgment and vigilance of human experts. This article is devoted to one such problem: the use of predictors that should never be used even though they appear in the database.
In the literature devoted to this topic since the late 1990s the discussion has usually focused on "leakage of information from the future". We encountered this in a project we undertook for a retail catalog company in 1996. Our large client had substantial volumes of data and was interested in identifying the households in its database most likely to order from its Christmas catalog. In 1996 the catalog was mailed to households in November, and the list of several million households who would receive the catalog was finalized in August of that year. Using information for the previous year we constructed a database with an indicator for households that ordered from the Christmas catalog between the date of receipt of the catalog and two weeks after Christmas. Our job seemed straightforward: we had access to quite a large number of household attributes and could use sophisticated data mining techniques to select predictors and build a high quality model. But our first models were so surprisingly accurate that we conducted a detailed investigation into our model and data. We discovered that we had made the mistake of using information about orders the household had made from an October mini-catalog to predict Christmas purchase patterns. In and of itself, there does not appear to be a problem with this, as October comes before Christmas. The problem was that the model was supposed to be deployed to select the mailing list in August, and in August we would never have had access to the subsequent October ordering.
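One way to guard against this kind of leak is to record, for every candidate field, the earliest date on which its value is actually known, and to compare that date with the date on which the model must be scored. The sketch below is illustrative only: the field names, the availability dates, and the August 1 deployment date are assumptions made for the example, not details of the original project.

```python
from datetime import date

# Hypothetical metadata: the earliest date on which each field's value is
# known for the campaign being scored. Names and dates are assumptions.
FIELD_AVAILABLE_ON = {
    "prior_year_christmas_order": date(1996, 1, 15),
    "orders_last_12_months": date(1996, 8, 1),
    "october_mini_catalog_order": date(1996, 10, 31),
}


def split_by_availability(field_available_on, deployment_date):
    """Return (usable, leaky) field names for a model scored on deployment_date."""
    usable, leaky = [], []
    for name, available_on in sorted(field_available_on.items()):
        if available_on <= deployment_date:
            usable.append(name)
        else:
            # The value would not exist yet when the mailing list is selected.
            leaky.append(name)
    return usable, leaky


# Assumed deployment date: the mailing list is finalized in August.
usable, leaky = split_by_availability(FIELD_AVAILABLE_ON, date(1996, 8, 1))
print("usable in August:", usable)
print("excluded as leakage:", leaky)
```

A check like this only works if someone who knows the database supplies honest availability dates for each field, which is exactly the human judgment that cannot be automated away.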
The problem with a mistake such as this is that the usual methods of model validation may fail to catch it. We had randomly divided our data into training, validation, and test partitions, and the models appeared to be superbly robust and not in the least overfit. Detection failed because all three data partitions contained the same illegitimate October information, so predictions made for the test sample were quite accurate and in line with the results seen in the train and validation samples. To reiterate: there was nothing wrong with the model as it was built, had the client wanted to use it after the October purchase behavior became available. But because the model was needed for prediction two months before that behavior could be observed, it was not useful for the purpose intended. The data used and the time at which the model was to be deployed were in conflict.
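A small simulation shows why random partitioning cannot expose the leak. Everything here is synthetic: the purchase rates are made up, and the one-line rule "flag the household if it ordered in October" is only a stand-in for a real model. The point is that the rule looks equally good on all three partitions, and only fails once the October field is blanked out, as it would be in August.

```python
import random


def make_households(n, rng):
    """Synthetic (october_order, christmas_order) pairs; all rates are made up."""
    rows = []
    for _ in range(n):
        christmas = rng.random() < 0.2
        # October buyers are far more common among eventual Christmas buyers.
        october = rng.random() < (0.8 if christmas else 0.1)
        rows.append((october, christmas))
    return rows


def recall(rows):
    """Share of Christmas buyers flagged by the rule 'mail if october_order'."""
    flags = [october for october, christmas in rows if christmas]
    return sum(flags) / len(flags)


rng = random.Random(0)
rows = make_households(30_000, rng)
rng.shuffle(rows)
train, validation, test = rows[:10_000], rows[10_000:20_000], rows[20_000:]

# In August the October orders have not happened yet, so the field is empty
# for every household, whatever the household goes on to do.
august_view = [(False, christmas) for _, christmas in test]

for name, part in [("train", train), ("validation", validation),
                   ("test", test), ("test as seen in August", august_view)]:
    print(f"{name}: recall = {recall(part):.2f}")
```

The three random partitions agree with each other because they share the same leaked field; agreement among them says nothing about whether the field will exist at deployment time.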
A consequence of such a mistake, if not detected in time, is that expectations for the model will be high on all reasonable evaluation criteria, but when the model is deployed in the real world it may deliver far less than promised. A further, more pernicious problem is that such mistakes interfere with the discovery of high quality models that will actually work. If the illegitimate October information had been excluded from the model, the analysts might have gone on to find another, legitimate variable that was almost as good, that would have been available for deployment the following August, and that would have yielded satisfactory results.