Unintended Use of Illegitimate Data in Predictive Modeling Part 2
In the last post we briefly described a situation in which it was possible to inadvertently use illegitimate data. In this post we discuss some other situations in which this can occur. Here are a few examples:
The account number, record ID, or merge key appears in a model and is actually predictive. This most commonly occurs when meaningfully different subsets of data are stored in different databases, such as when customers appear in one database and prospects who have not yet become customers appear in another. Certain identifiers might provide clues as to which database the record came from, and using this information would be akin to developing a model which says "If a record came from the customer database then it pertains to a customer, and otherwise not". Clearly such a model would have no practical use when considering a record for an individual who is new to us and is not yet stored in either database. Kaufman, Rosset, and Perlich (2011) discuss exactly this situation in the context of a medical database in which the sick patients received special IDs totally distinct from those of the healthy patients.
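One quick way to see whether an identifier carries this kind of signal is to ask how well a single cutoff on the ID separates the two groups. The sketch below is ours, not taken from any of the cited papers; the function name `best_split_accuracy` and the ID ranges are hypothetical and chosen only for illustration.

```python
def best_split_accuracy(values, labels):
    """Accuracy of the best single-cutoff rule on `values` for a 0/1 label."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    total_pos = sum(label for _, label in pairs)
    # Baseline: always predict the majority class.
    best = max(total_pos, n - total_pos) / n
    pos_left = 0
    for i, (value, label) in enumerate(pairs[:-1], start=1):
        pos_left += label
        if value == pairs[i][0]:
            continue  # no cutoff can fall between tied values
        # Rule: predict 1 at or below the cutoff, 0 above it.
        correct = pos_left + ((n - i) - (total_pos - pos_left))
        # Also consider the mirrored rule (0 below, 1 above).
        best = max(best, correct / n, (n - correct) / n)
    return best


# Hypothetical ID ranges: customers and prospects come from different
# databases, each of which issues IDs from its own block of numbers.
ids = list(range(100000, 100050)) + list(range(900000, 900050))
is_customer = [1] * 50 + [0] * 50
id_accuracy = best_split_accuracy(ids, is_customer)

# For contrast, a field whose ordering is unrelated to the label.
unrelated = list(range(100))
alternating = [i % 2 for i in range(100)]
noise_accuracy = best_split_accuracy(unrelated, alternating)

print(id_accuracy, noise_accuracy)
```

An ID that separates the classes almost perfectly on a single cutoff is a strong hint that it encodes where the record was stored rather than anything about the individual.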
An example pertaining to prediction of the amount a customer will spend on an e-commerce website was discussed by Kohavi et al. (2000), where future sales tax and shipping information was inadvertently available to the modeler. Obviously, if we know the sales tax and shipping cost pertaining to an order, we can accurately estimate the sales amount of the order. But the e-retailer wants to make this prediction before any such ancillary information exists. Making use of such information when building predictive models is entirely illegitimate.
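The arithmetic behind this kind of leak is trivial, which is exactly why a modeling engine finds it so easily. The sketch below assumes a single flat tax rate and a handful of made-up order amounts; neither comes from the KDD-Cup data, and real tax rules are of course more complicated.

```python
# Hypothetical flat tax rate and order amounts, chosen only for illustration.
TAX_RATE = 0.08
order_amounts = [19.99, 45.50, 120.00, 250.75]

# The leaked field: sales tax is computed from the order amount after the fact.
sales_tax = [round(amount * TAX_RATE, 2) for amount in order_amounts]

# "Predicting" the order amount from the leaked field is just a division.
recovered = [tax / TAX_RATE for tax in sales_tax]

# Largest gap between the recovered and the true amount, in currency units.
max_error = max(abs(r - a) for r, a in zip(recovered, order_amounts))
print(max_error)
```

Under these assumptions the only error left is the rounding of the tax to whole cents, so the recovered amounts differ from the true ones by no more than a few cents.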
Data preparation that leverages illegitimate information. In a consulting project we discovered that the data preparation code came in two forms: one for the "responders" to an offer, and another for the "non-responders". Although the two programs were largely identical, there were slight differences which were sufficient for a modern data mining technology to exploit. In this case the smoking gun came from the variable importance list generated by our predictive modeling engines. The most important variables were factors we had never previously seen to be important, and it was surprising to see high rankings for factors which we expected to be innocuous. Careful tracing back through client computer records allowed us to uncover the problem and repair it.
Missing value imputation. In general, our view at Salford is that missing value imputation is undesirable, a complicating nuisance that is best avoided. Fortunately, our principal data mining engines all have built-in methods for missing value handling that do not require imputation. However, there may be times when an organization elects to impute missing values to support traditional statistical modeling methods. Many modelers opt for the simplicity and robustness of median or mode imputation, and such imputation is not likely to bring leakage problems with it. But if the imputation methodology is based on some type of predictive or statistical modeling, then the door is opened to the possibility of contamination from illegitimate data sources. Often imputation models are more or less "kitchen sink" models in which every scrap of information available is thrown into the system in the hope of extracting at least a bit of predictive accuracy. But in predictive modeling data sets there will always be some information which must not be used, and this information must be avoided not just for the principal models, but for all supporting and auxiliary models as well. If illegitimate data is leveraged to impute missing values, then the imputed variables may subsequently turn out to be powerfully and illegitimately predictive.
In case this example seems a bit abstract, it is useful to keep in mind that exactly this mistake was made by the well-regarded author of a well-known data mining book in one of his principal example data sets. Fortunately, the author took the trouble to provide both raw and imputed versions of his sample data sets, so it was possible both to avoid the error and to trace it back to its root cause.
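To make the imputation problem concrete, here is a minimal sketch contrasting median imputation with an imputation that is allowed to see the target. The field names `income` and `responded`, the toy rows, and the helper functions are all hypothetical; this is not the data set or the method from the book mentioned above.

```python
from statistics import mean, median

# Toy records: income is missing (None) for the last four rows.
rows = [
    {"income": 60.0, "responded": 1},
    {"income": 70.0, "responded": 1},
    {"income": 80.0, "responded": 1},
    {"income": 30.0, "responded": 0},
    {"income": 40.0, "responded": 0},
    {"income": 50.0, "responded": 0},
    {"income": None, "responded": 1},
    {"income": None, "responded": 1},
    {"income": None, "responded": 0},
    {"income": None, "responded": 0},
]
was_missing = [r["income"] is None for r in rows]


def impute_median(records):
    """Fill every missing income with the overall median of the known incomes."""
    fill = median(r["income"] for r in records if r["income"] is not None)
    return [dict(r, income=fill if r["income"] is None else r["income"])
            for r in records]


def impute_with_target(records):
    """Leaky variant: fill with the mean income of rows sharing the same target."""
    known = {}
    for r in records:
        if r["income"] is not None:
            known.setdefault(r["responded"], []).append(r["income"])
    fills = {label: mean(values) for label, values in known.items()}
    return [dict(r, income=fills[r["responded"]] if r["income"] is None else r["income"])
            for r in records]


def imputed_pairs(filled):
    """Distinct (imputed income, target) pairs among the originally missing rows."""
    return {(r["income"], r["responded"])
            for r, missing in zip(filled, was_missing) if missing}


safe_pairs = imputed_pairs(impute_median(rows))
leaky_pairs = imputed_pairs(impute_with_target(rows))
print(safe_pairs)
print(leaky_pairs)
```

With median imputation every filled-in row receives the same value regardless of outcome. With the target-aware variant the filled-in value differs by outcome, so on those rows the "imputed" income is simply the target in disguise.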
In the next post we will discuss some ways to protect against the use of illegitimate data.
Kohavi, R., Brodley, C., Frasca, B., Mason, L., and Zheng, Z. 2000. KDD-Cup 2000 organizers' report: peeling the onion. ACM SIGKDD Explorations Newsletter. 2(2).
Kaufman, Rosset, and Perlich. 2011. Leakage in Data Mining: Formulation, Detection, and Avoidance. KDD'11, August 21-24, 2011, San Diego, California, USA.