|
If you have managed to construct a data warehouse, you now have corporate and operational data organized to support informed decision making. The challenge is to find effective ways to analyze those data so as to extract valuable information.
Many people new to the field find it difficult to determine exactly what constitutes data mining. Unfortunately, many software vendors are equally unclear and label any summarizing or reporting activity "data mining." If you are truly to benefit from the revolutionary technology known as data mining, you need to distinguish the descriptive reporting provided by OLAP from the predictive, artificial intelligence models generated by the latest in machine learning technology.
Data Mining versus OLAP
In a nutshell: OLAP tells you what, but only data mining can also tell you why. Furthermore, OLAP answers only the specific, structured questions an analyst thinks to pose, whereas data mining can answer generic, unstructured questions.
We can illustrate the difference between OLAP and data mining with an example from our consulting experience. A major residential mortgage lender wanted to be able to predict which of its customers would refinance their loans at any point in time. Because refinancing normally means going to another lender, refinancers are often customers lost to competitors. With advance information, however, the bank could give each borrower a refinance probability score, classify each borrower as desirable or undesirable on other criteria, and then attempt to keep desirable customers by offering a reduced interest rate. The data warehouse we constructed contained monthly information on each borrower; data on current employment, real estate, and financial markets; the difference between the current interest rate and the rate of the original mortgage; and detailed credit bureau information. Using OLAP reporting alone, we could have examined refinance behavior from many perspectives, including by region, borrower credit score, trend in borrower credit score, local real estate conditions, trend in market interest rates, and so forth. To generate reports, the analyst would then have had to decide what was relevant and request reports based on those variables. Generally, such reports are useful and comprehensible only if they have a small number of dimensions.
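The OLAP style of analysis described above can be sketched in a few lines of Python. This is only an illustration: the records, the column names (region, credit_band, refinanced) and the rollup helper are hypothetical and are not taken from the lender's warehouse. The point it shows is that the analyst must name the dimensions before any report exists.

```python
from collections import defaultdict

# Hypothetical borrower-quarter records; every value here is invented for illustration.
records = [
    {"region": "West", "credit_band": "high", "refinanced": 1},
    {"region": "West", "credit_band": "high", "refinanced": 1},
    {"region": "West", "credit_band": "low", "refinanced": 0},
    {"region": "West", "credit_band": "low", "refinanced": 0},
    {"region": "East", "credit_band": "high", "refinanced": 0},
    {"region": "East", "credit_band": "high", "refinanced": 1},
    {"region": "East", "credit_band": "low", "refinanced": 0},
    {"region": "East", "credit_band": "low", "refinanced": 0},
]


def rollup(rows, dimensions, measure="refinanced"):
    """Return {dimension values: (row count, mean of measure)} for the chosen dimensions."""
    totals = defaultdict(lambda: [0, 0])  # key -> [row count, sum of measure]
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        totals[key][0] += 1
        totals[key][1] += row[measure]
    return {key: (n, s / n) for key, (n, s) in totals.items()}


# The analyst decides which dimensions matter and requests one report per choice.
by_region = rollup(records, ["region"])
by_region_and_band = rollup(records, ["region", "credit_band"])

for key, (n, rate) in sorted(by_region_and_band.items()):
    print(key, n, rate)
```

Each added dimension multiplies the number of cells in the report, which is one way to see why such reports stay comprehensible only with a small number of dimensions.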
Our mission was to develop highly accurate predictive models that would tell us who is likely to refinance in any given circumstance. Using our CART decision tree, we pointed the software at the data warehouse (actually a data mart constructed specifically for this project) and asked CART to do the rest. The target variable was an indicator of whether the customer refinanced at a specific point in time. One of our training data subsets contained historical information on about 300,000 customers; in any one calendar quarter about 1.8% prepaid. On its own, and without any guidance from us, the software determined which columns to use in the analysis and how the relevant causal factors interacted. For a model of 30-year fixed loans, more than 35 relevant factors were identified. The predictions that resulted were far more accurate than predictions developed by the in-house statistical staff. Although the columns used were all sensibly related to prepayment, there is no way a human being interacting with a reporting tool could have uncovered the 80+ segments identified by the CART tree.
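To make concrete how tree software can choose columns and segments without guidance, here is a minimal sketch of greedy binary splitting on Gini impurity, the general idea behind classification trees. It is not the CART product used in the project, and it omits pruning, surrogate splits, and everything else a production tool does. The toy rows, the column names (rate_gap, credit_score, refinanced), the sign convention for rate_gap, and the helper functions are all assumptions made for illustration.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(rows, features, target):
    """Return (feature, threshold, weighted impurity) of the best split, or None."""
    best = None
    for f in features:
        values = sorted({r[f] for r in rows})
        # Candidate thresholds are midpoints between adjacent distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [r[target] for r in rows if r[f] <= t]
            right = [r[target] for r in rows if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[2]:
                best = (f, t, score)
    return best


def grow(rows, features, target, depth=0, max_depth=2):
    """Split recursively; return leaf segments as (conditions, row count, target rate)."""
    labels = [r[target] for r in rows]
    split = best_split(rows, features, target) if depth < max_depth else None
    if split is None or split[2] >= gini(labels):
        return [((), len(rows), sum(labels) / len(rows))]  # no useful split: leaf
    f, t, _ = split
    sides = (("<=", [r for r in rows if r[f] <= t]), (">", [r for r in rows if r[f] > t]))
    segments = []
    for op, side in sides:
        for conds, n, rate in grow(side, features, target, depth + 1, max_depth):
            segments.append((((f, op, t),) + conds, n, rate))
    return segments


# Hypothetical data: rate_gap is assumed to be current market rate minus original rate.
raw = [(-2.0, 760, 1), (-1.5, 720, 1), (-1.0, 690, 1), (-1.5, 600, 0), (0.5, 780, 0),
       (0.0, 750, 0), (1.0, 700, 0), (0.5, 610, 0), (1.0, 740, 0), (0.0, 640, 0)]
rows = [{"rate_gap": g, "credit_score": s, "refinanced": y} for g, s, y in raw]

root = best_split(rows, ["rate_gap", "credit_score"], "refinanced")
segments = grow(rows, ["rate_gap", "credit_score"], "refinanced")
for conds, n, rate in segments:
    print(conds, n, rate)
```

Nobody tells the code which column matters: it scores every candidate threshold on every column and keeps the one that best separates refinancers from non-refinancers, then repeats within each resulting group. The leaves it returns play the role of the segments mentioned above, although a real tree on 300,000 customers is far larger.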
|