Black Boxes and Data Mining Systems
The concept of a "black box" is used to describe a situation in which a scientist endeavors to learn as much as possible about an entity or physical system, but is limited in the type of information that can be obtained. Traditionally, only behavior is observed, with no way of knowing the probable mechanisms that determine the behavior. For most of its history, for example, psychology was limited to studying the brain as a black box because it was virtually impossible to peer inside to learn how the brain actually functions.
In the world of data mining and predictive modeling, the concept of the black box often comes up in the context of proprietary prediction systems in which the vendor does not disclose details of the algorithm by which the predictions are being made. In the 1990s, many financial institutions paid hefty fees to use a proprietary system for predicting interest rates; the vendor succeeded in persuading banks that the predictions were accurate enough to warrant subscribing to the service, even though the banks did not know how the predictions were generated.
Today, in the field of data mining and predictive modeling software, there are new black box vendors who prefer to offer the most minimal descriptions of their algorithms. Instead of describing their own algorithms in detail, they offer general discussions of data mining principles and pepper their white papers with formulas for well-known procedures such as logistic regression and ROC calculation.
The question addressed in this blog post is: Should you seriously consider such a black box system? In general, we think not, for the following reasons:
One plausible justification for using a black box predictive system is that it outperforms other non-black box systems. To the best of our knowledge, no black box system has succeeded in outperforming systems such as TreeNet or other Salford Systems technologies.
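A performance claim of this kind can be checked without seeing inside the black box, because only its predictions on holdout data are needed. The sketch below is a minimal illustration of that idea, not any vendor's procedure: it computes the area under the ROC curve by pairwise comparison (the Mann-Whitney formulation) for two sets of scores on the same holdout cases. The function name `roc_auc`, the variable names, and all the numbers are made up for illustration.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney U).

    labels: 0/1 outcomes; scores: predictions, where higher means "more likely 1".
    Returns the fraction of (positive, negative) pairs ranked correctly,
    counting ties as half a correct ranking.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical holdout outcomes and two sets of scores for the same cases.
# These values are invented purely to show the mechanics of the comparison.
holdout_labels = [1, 0, 1, 1, 0, 0, 1, 0]
black_box_scores = [0.9, 0.4, 0.6, 0.3, 0.5, 0.2, 0.8, 0.7]
open_model_scores = [0.8, 0.3, 0.7, 0.6, 0.4, 0.1, 0.9, 0.5]

print("black box AUC: ", roc_auc(holdout_labels, black_box_scores))
print("open model AUC:", roc_auc(holdout_labels, open_model_scores))
```

The pairwise loop is quadratic in the number of cases, which is fine for a small holdout sample; a rank-based version would be preferable for large ones.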
Another justification that has been offered for black box systems is that they offer a high degree of automation coupled with good, if not the best, performance. In other words, you may obtain an "easy button": just point, click, and wait for the models to appear automatically. We will offer a detailed discussion of this topic in another series of blog entries, but our current take is that such total "lights out" automation is largely marketing hype. In contrast, the Salford suite of tools offers considerable sensible automation and superior performance, and these tools also come with detailed explanations of their inner workings. So why go with mystery tools that offer far less in substance?
We have often suspected that black box systems for data mining are actually rather simple mechanisms. The vendors may endeavor to keep the details secret because they would find it impossible to obtain their high licensing fees from people who understood what the system was actually doing. By creating an aura of mystery around their simple mechanisms, these vendors hope to persuade wishful thinkers that a "silver bullet" solution to their modeling needs is at hand.
Many circumstances exist in which it is vital to be able to explain in detail how certain predictive models were developed from the training data. For example, regulators such as the FDA (Food and Drug Administration) are not going to accept the results of a data analysis if the method of analysis is not disclosed. Marketers are generally keenly interested in understanding how data is used to extract insight into customer behavior, and banking regulators insist on total transparency of any credit risk model. For such consumers of models, adequate explanations of the workings of the modeling mechanism must be provided.
Modeling systems frequently require tweaking as the nature of the data, as well as their quality, volume, or breadth, changes over time. Using a system that is both understood and understandable puts the user in a position to modify control parameters intelligently so as to obtain better results over time. With black box technology, on the other hand, the user is always dependent on the vendor to make these adjustments, if indeed they are even possible to make.
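To make the point about control parameters concrete, here is a small sketch using a deliberately simple and fully transparent method, k-nearest neighbors, which we have chosen only for illustration; it is not a Salford algorithm. The single control parameter is `k`, the number of neighbors that vote. The helper names and the tiny data set are hypothetical.

```python
def knn_predict(train, x, k):
    """Predict 0 or 1 for x by majority vote among the k nearest training points.

    train: list of (feature, label) pairs with one numeric feature.
    Ties in the vote are resolved in favor of class 0.
    """
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0


def accuracy(train, holdout, k):
    """Fraction of holdout cases that knn_predict classifies correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in holdout)
    return hits / len(holdout)


# Invented training data: the classes switch near x = 4.5, but the points
# at x = 4 and x = 5 carry "noisy" labels.
train = [(1, 0), (2, 0), (3, 0), (4, 1), (5, 0), (6, 1), (7, 1), (8, 1)]
holdout = [(2.5, 0), (3.8, 0), (5.2, 1), (6.5, 1)]

for k in (1, 3, 8):
    print("k =", k, "holdout accuracy =", accuracy(train, holdout, k))
```

Because the mechanism is known, the pattern is easy to reason about: with `k = 1` the model copies the noisy labels, with `k = 8` every prediction is the same overall vote, and an intermediate value smooths the noise without erasing the boundary. A user who understands this can retune `k` as the data change instead of waiting on a vendor.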
So, in conclusion, we believe consumers must come down on the side of knowing, at a minimum, the key concepts behind the modeling system, as well as sufficient technical detail to be able to understand how and why the control parameters will affect modeling results.
If you would like to take a closer look at Salford predictive modeling tools, evaluation versions are available at Salford Systems Products.