Step into the next generation of data mining and predictive modeling...

Financial Services data mining example: Identifying risky borrowers
To introduce you to data mining with the CART decision tree software, we walk through a real-world example drawn from the Financial Services industry. The database is an extract from a group of customers who selected a financial loan product, some of whom went "BAD" (stopped making payments). The information we use comes from standard credit reports provided by all the major credit bureaus, including variables such as:
- Number of credit reports requested for this person in last six months
- Number of credit cards with balances greater than 80% of available credit
- Number of new credit accounts opened in last 12 months
- How long ago was oldest account opened?
- How long ago was newest account opened?
Such measures are well known in the industry although the terminology used to refer to these measures is somewhat obscure.
Our goal is to see if we can predict who goes BAD using standard credit bureau information. We use this example to display the look and feel of CART and to show you how much you can learn quickly from a CART analysis.
First, let's read in our data from an Excel spreadsheet called DEFAULT.XLS. Click on the File Menu, select [Open] and see a display like the one below.
Note that CART will recognize a broad range of file formats, a few of which are visible in the selector box. Altogether there are about 85 formats to choose from, including all the major databases, statistical packages, and spreadsheets. ODBC connectivity is also supported, as is plain ASCII.
Because our data was provided to us as an Office 2000 document we select the [Excel 97/2000] format and then click on our data file.
After you open a data file, the Model Setup dialog automatically appears. This is the control center housing all model setup and refinement options in one convenient location.
To run an analysis all you have to do is:
- Select your target variable, in this case "BAD"
- Specify that this is a "Classification" problem
- State that BAD has two levels starting with 0
That's it! All you need to do now is click [Grow Tree]!
You don't have to select predictors if it is OK for CART to use all available columns. In this case we didn't want CART to use ID or similar information so we opted to list predictors explicitly. Note the running count kept of the number of predictors selected so far -- 15.
Eventually you will want to learn how to take advantage of the controls and options available on each of the Model Setup tabs. The good news is that all the options are set to intelligent defaults so you can get started without worrying about them.
The tree Navigator is the first report CART presents when the analysis is done. It displays a high-level picture of what has been accomplished but underneath it contains complete details on every aspect of the analysis. The tree diagram shows you how complex the best model is, how accurately it predicts, and where the interesting segments are located. Let's take a closer look at some of the specifics.
In the lower right-hand corner we see that our tree has 13 terminal nodes. The two bright red nodes are where the highest concentrations of BADs are to be found; the pink nodes are a little less concentrated, the white nodes are typical of the overall population, and the blue nodes contain very few BADs. Later we will drill down into our data by double-clicking on the nodes that interest us the most.
The curve under the tree diagram reports on the overall accuracy of the tree and tells us how we might trade off accuracy for a smaller, simpler tree. The most accurate tree, flagged by the horizontal green bar, has an error rate that is 63.7% of that experienced without a model.
By clicking on another box on the error curve you will bring up a tree of a different size, in this case ranging anywhere from two terminal nodes to over 70. We'll discuss why you might want to look at these other trees later.
Let's prepare for some drill down. Positioning your mouse anywhere in the gray area surrounding the tree diagram, click and hold down the right mouse button. The pop-up menu, shown below, illustrates your options regarding the amount of detail you will see when you hover the mouse pointer over a node. The choices are:
- Column used to split the node
- Column and split value
- Simple tabulation of target variable
- Complete tabulation of target variable
We will select the complete tabulation. You may opt for less if the target variable has a large number of values or if you just want a more succinct report.

Now hover the mouse pointer over the root node at the top of the tree. The table tells us that of 9,297 customers, 680 or 7.3% were bad. The first variable CART wants to use to separate good from bad accounts is "How long ago was the oldest account opened?" and the separating value is 48.5 months.
You can hover the mouse over any node and see this instant mini-report. There are many more detailed reports available to you, including diagrams of the entire tree, which we will get to in time. We find this quick review of some interesting parts of the tree to be a useful way to get started.
Let's take a quick peek inside one of the deep red nodes where we should be doing an above-average job of concentrating the bads. In terminal node 9 we have 151 cases, with 21.9% bad. Compared to the root node where the percentage bad was only 7.3%, this node clearly contains a high risk group of customers.
Right now we are using "0" and "1" to represent "good" and "bad." At any time you can elect to use text instead of numbers and color code the rows of your table. We will illustrate that later.
If you double-click on any node in the Navigator, a window containing node-specific information appears. Double-clicking on Terminal Node 9 and selecting [Rules] displays the list of rules that define this node. Here we see that:
- The oldest account was opened more than 48.5 months ago.
- There has been at least one credit inquiry for this customer within the last six months.
- The newest account was opened less than 19.5 months ago.
- The customer has fewer than 6.5 credit accounts.
The rules are primarily intended to help you understand exactly what segment of the database ends up in a terminal node. As a convenience, they are written in C code to facilitate deployment for those who prefer working with this programming language.
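As a rough illustration of what such a rule set amounts to, the four conditions for Terminal Node 9 can be written as a single predicate. CART itself emits C code; the sketch below is a Python transliteration, not CART output. MNOLDOPN is the column name used later in this walkthrough; the other three column names are hypothetical stand-ins.

```python
def in_terminal_node_9(record):
    """Return True when a record meets all four rules listed for Terminal Node 9."""
    return (
        record["MNOLDOPN"] > 48.5        # oldest account opened more than 48.5 months ago
        and record["NUM_INQ_6MO"] >= 1   # hypothetical name: inquiries in the last six months
        and record["MNNEWOPN"] < 19.5    # hypothetical name: months since newest account opened
        and record["NUM_ACCTS"] < 6.5    # hypothetical name: number of credit accounts
    )


# Made-up applicant used only to exercise the predicate.
applicant = {"MNOLDOPN": 60, "NUM_INQ_6MO": 2, "MNNEWOPN": 7, "NUM_ACCTS": 4}
print(in_terminal_node_9(applicant))
```

A record that fails any one of the four conditions lands in some other node of the tree.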
For a quick, succinct overview of the entire tree, click on the [Splitters...] button at the bottom of the Navigator window. This displays the variables used in every split in the tree. We like this view because it is an easy way to check for the main themes of the analysis and to make sure that no unwanted variables or major errors have crept into the analysis. This display also packs the entire tree into the smallest possible window.
As we will see later, any of these displays can be printed, sent to a formatted report window, saved as a Windows metafile (WMF), or copied to the clipboard for later processing of your choosing.
You can see the whole tree in full detail by clicking the [Tree Details...] button at the bottom of the Navigator. You have considerable control over exactly what is displayed in each node from the [View] [NodeDetail...] menu. The miniature Tree Map window helps you keep track of your current position by showing you where you are relative to the overall tree. You can also click on a location in the miniature window and the larger window will reposition to reflect the clicked-on location.
Right clicking and then selecting [Export...] allows you to save the whole tree in any of several graphics formats for inclusion in other documents or publication to the Web.
CART offers you a flexible way to print your trees. With the Tree Details window active, go to the [File] [Print] menu to see a print preview like the one below. Right now this tree will print on three pages. But if we click on [Page Setup] (see the next slide) we will have an opportunity to resize and reformat the tree.
More on Tree Printing
It is often worthwhile experimenting with the page orientation and changing the horizontal and vertical distances between nodes. Here we rescaled the image down to 60% of full size and managed to squeeze it all onto one page.
You can also specify headers and footers for the image and change the shape of internal and terminal nodes.
Now that we have the tree laid out the way we want we are ready to send the image to the printer or to a file.
We can review the overall performance of the tree by clicking the [SummaryReports...] button in the Navigator window. The tabs in this window all provide summary information pertaining to the entire tree.
The Gains Chart orders the terminal nodes by the observed lift. Terminal node 9 displays a bad rate 2.988 times that of the sample as a whole and has the highest concentration of bads in the tree. By common industry experience this tree is a reasonably strong performer. The top four nodes represent 24% of the total population but capture 61% of the bad accounts.
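The lift figure is simply the bad rate in a node divided by the bad rate in the whole sample. The sketch below reproduces the arithmetic for Terminal Node 9. The count of 33 bads is an assumption inferred from the reported 151 cases at 21.9% bad; the totals of 680 bads among 9,297 customers come from the root node.

```python
def lift(node_bads, node_cases, total_bads, total_cases):
    """Ratio of a node's bad rate to the bad rate of the whole sample."""
    node_rate = node_bads / node_cases
    overall_rate = total_bads / total_cases
    return node_rate / overall_rate


# 33 bads is inferred from 151 cases at 21.9% bad (an assumption, not a reported count).
node9_lift = lift(node_bads=33, node_cases=151, total_bads=680, total_cases=9297)
print(round(node9_lift, 3))  # 2.988
```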
More useful information is provided in the [VariableImportance] tab. This ranking tells us which variables do most of the work in separating good from bad accounts. Two variables related to the age of the customer's oldest account stand at the top of the list followed by the number of satisfactory accounts. Total credit limit from revolving accounts comes next along with an income-related variable.
This report is meant to highlight any outstandingly important variables and to allow you to separate the wheat from the chaff. Several different ways of computing importance are offered for the expert but usually the standard report is all you need to look at.
The [Prediction Success] tab provides a performance report. Of the 8,617 good accounts in the test data the CART tree classifies 70.5% correctly. Among the 680 bad accounts the tree gets 65.9% correct. Remember these results come from test data that were not used to grow the tree.
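Because goods outnumber bads by more than twelve to one, a single overall accuracy figure mostly reflects how well the goods are classified. The sketch below makes that visible by comparing the case-weighted accuracy with the plain average of the two class accuracies. It assumes the 70.5% figure applies to the 8,617 good accounts (9,297 customers minus 680 bads) and the 65.9% figure to the 680 bads.

```python
def overall_accuracy(class_counts, class_accuracy):
    """Case-weighted accuracy across classes."""
    correct = sum(class_counts[c] * class_accuracy[c] for c in class_counts)
    return correct / sum(class_counts.values())


counts = {"good": 8617, "bad": 680}        # 8,617 is derived as 9,297 - 680
accuracy = {"good": 0.705, "bad": 0.659}   # per-class percent correct from the report

overall = overall_accuracy(counts, accuracy)
average_of_classes = sum(accuracy.values()) / len(accuracy)
print(f"case-weighted: {overall:.3f}, class average: {average_of_classes:.3f}")
```

The two numbers differ because the case-weighted figure gives the small bad class very little influence, which is why the per-class rows of the report deserve separate attention.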
Can we do better? Almost certainly! The tree we are displaying here was arrived at in less than 30 minutes of data analysis. Digging deeper into the data, perhaps expanding the list of predictors, creating new measures, trying different control settings could all lead to substantially improved results. Remember, also, that the lift in our top nodes was quite satisfactory. We decided to use this set of results because they are typical of outcomes that are difficult to predict and are a refreshing change from artificial examples that permit 99% accurate classifications!
The [Misclassification] tab gives you another view of performance with one row per class. We usually see a slightly worse performance on the test sample but in most cases CART is able to deliver a tree with very similar error rates in the learn and test samples.
Most of what we have looked at so far pertains to the entire tree. Now it is time to drill down into the node detail to learn more about the specifics of the CART analysis. Going back to the Navigator window we double click on the root node to reveal a new report.
The node splitter is MNOLDOPN (How long ago was the oldest account opened?); this is the most effective variable we have in the root node for separating good from bad accounts. But how much better is this splitter than the other variables available? The competitor pane lists the other variables ranked by their splitting power (improvement score) and the graph displays this information by competitor rank. The top two competitors are almost as good as the main splitter; we may wish to examine trees with these variables as root node splitters in subsequent analyses.
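To give a feel for what an improvement score measures, here is a minimal sketch using the Gini impurity, one commonly used splitting rule. Treat it as an illustration under assumptions: the walkthrough does not say which rule produced the scores shown, CART may scale its reported values differently, and the child-node counts below are invented. Only the root counts (680 bads among 9,297 cases) come from the example.

```python
def gini(bads, cases):
    """Two-class Gini impurity, 2p(1-p), where p is the bad rate."""
    p = bads / cases
    return 2.0 * p * (1.0 - p)


def gini_improvement(parent, left, right):
    """Impurity decrease from a split; each argument is a (bads, cases) pair."""
    n = parent[1]
    weighted_children = (left[1] / n) * gini(*left) + (right[1] / n) * gini(*right)
    return gini(*parent) - weighted_children


root = (680, 9297)
# Hypothetical children: the split sends 2,000 cases left with a higher bad rate.
left_child, right_child = (300, 2000), (380, 7297)
print(gini_improvement(root, left_child, right_child))
```

Ranking every candidate variable by a score of this kind yields a list like the competitor pane: the main splitter has the largest improvement, and close competitors score nearly as high.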
The bottom right panel lists the surrogate splitters. Surrogates are splitters that partition the data in almost the same way as the main splitter, record by record. Surrogates play two roles in a CART analysis: they are used as stand-ins when the main splitter is blank or missing and they help us understand the split. Since surrogates are all similar to the primary splitter (the association measures how similar) we can think of the best surrogates as "synonyms" for the main splitter.
CART can apply our tree to new data; we call it "dropping data down the tree." First, you have to make sure that you have already saved a tree file. This is accomplished by clicking the [Save Tree Information...] button in the Model Setup dialog before your run. You select [File] [Open] from the main menus to open the dataset to be dropped down the tree and then select [Model] [Case...] from the main menus to open the window below.
Here you select the Tree to be used to do the scoring and specify where the results will be saved. The output file will contain the CART classification (good or bad) and the terminal node each record ends up in. You can carry along all the original model variables used to grow the tree (by clicking on [Include Model Information]) and up to 50 additional ID variables.
Several reports are available when you drop data down a tree. The window below gives node-by-node detail reporting how many cases ended up in each node, the CART class assignment for that node, the percent correct in each node, and the percent of the scored data falling into each node.
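The node-by-node report described above is a straightforward tabulation of the scored records. A minimal sketch follows, assuming each scored record is available as a (terminal node, predicted class, actual class) triple; that layout, the function name, and the four sample records are all hypothetical and do not reflect CART's actual output file format.

```python
from collections import defaultdict


def node_report(scored):
    """Tabulate cases, assigned class, percent correct, and share of data per node."""
    by_node = defaultdict(list)
    for node, predicted, actual in scored:
        by_node[node].append((predicted, actual))
    total = sum(len(rows) for rows in by_node.values())
    report = {}
    for node, rows in sorted(by_node.items()):
        correct = sum(1 for predicted, actual in rows if predicted == actual)
        report[node] = {
            "cases": len(rows),
            "class": rows[0][0],  # every record in a node receives the same class
            "pct_correct": 100.0 * correct / len(rows),
            "pct_of_data": 100.0 * len(rows) / total,
        }
    return report


# Made-up scored records: (terminal node, predicted class, actual class).
scored = [(9, 1, 1), (9, 1, 0), (3, 0, 0), (3, 0, 0)]
report = node_report(scored)
for node, row in report.items():
    print(node, row)
```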
The prediction success tab summarizes the overall performance in a table just like the one produced when growing a tree.
We have now covered all the essentials of CART tree growing and tree deployment. In the remainder of this Walkabout we will briefly look at some of the options and controls you can use to refine or modify an analysis.
Going back to the Model Setup dialog, notice the [Categorical] tab. This allows you to flag nominal data variables. If you have a variable like "State of residence" coded 1 for Alabama, 2 for Arizona, etc., this variable should be declared categorical as the ordering of the numerical values has no significance.
Failing to flag a categorical variable is not a fatal error but it will prevent CART from making full use of the information the variable contains. Mistakenly checking a continuous variable as categorical is also not fatal but it will make the analysis far less efficient.
The [Method] tab allows you to choose from several different splitting rules. It is important to have more than one splitting rule because no single method is effective in all situations. Favoring equal-sized splits (a feature unique to CART) is sometimes helpful in improving performance or generating more interpretable trees.
We recommend experimentation with the different methods. Often one of the methods will stand out as the best performer for a specific type of data.
Testing is a core stage in CART analyses. CART wants you to specify how trees will be tested before the analysis begins. Your options are listed on the [Testing] tab. Brief explanations appear below.

- Exploratory Tree - No testing is performed. CART grows a large tree and leaves it unpruned.
- V-fold Cross Validation - best for smaller datasets where it is not feasible to reserve a meaningful subsample for independent testing.
- Fraction of Cases - best for large datasets. A random fraction of your data is set aside for testing.
- Test Sample in a Separate File - useful when databases are large.
- Variable Separates Learn and Test - a flag indicates which records are to be set aside for testing.
The [Advanced] tab provides you with controls that can influence the size of the maximum tree. Here we specify that a node must have at least 100 records to be split and that CART should never create a node with fewer than 50 records.
Is our example data set large or small? We have over 9,000 records but only 680 records in the class of interest (the bads). Due to the small number of bads we elected to use cross-validation as our test method. To turn off warning messages suggesting that we use some other test method we raised the cross-validation sample size limit to 10,000 cases.
By default, if your training database contains more than 3,000 records, CART suggests that you use faster test methods. The suggestion is just to save time; there is nothing wrong with running cross-validation on a million records if you are willing to wait for the results.
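The idea behind V-fold cross-validation can be sketched in a few lines: the records are shuffled and dealt into V folds, and each fold in turn is held out for testing while a tree is grown on the rest. The sketch below only produces the index partitions. The choice of 10 folds, the seed, and the function name are assumptions for illustration; the walkthrough does not state how many folds were used or how CART assigns records to folds.

```python
import random


def v_fold_indices(n_records, v=10, seed=0):
    """Yield (learn, test) index lists for each of v folds."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::v] for i in range(v)]  # deal shuffled records into v folds
    for i, test in enumerate(folds):
        learn = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield learn, test


# 9,297 records as in the example; each record is tested exactly once.
splits = list(v_fold_indices(9297, v=10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Every record serves as test data once and as learning data in the other folds, which is why the method suits datasets with few cases in the class of interest.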
New controls on the [Penalty] tab allow you to fine tune your analyses:
- Penalty on Variable - the larger the penalty the more difficult it is for a variable to become a main splitter. Penalties are expressed as fractions; a penalty of .5 reduces a splitter's score by 50%. Use this penalty to reflect the cost of acquiring certain types of information.
- Missing Penalty - penalize a variable to the degree it has missing values in the data. A proportional penalty would reduce a variable's score by 10% if it was missing in 10% of the records in the training data.
- High Level Categorical Penalty - categorical variables with a very large number of levels enjoy an advantage over continuous variables. This penalty levels the playing field.
The formulae for the last two penalties can be seen by clicking the [Advanced] button.
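The two numeric examples in the list above can be restated as simple arithmetic. The sketch below mirrors only those examples: a variable penalty of .5 halves a splitter's score, and a proportional missing-value penalty scales the score down by the fraction of records missing. The function names and the starting score of 0.08 are hypothetical, and the actual formulae CART applies are the ones shown under the [Advanced] button, which are not reproduced here.

```python
def apply_variable_penalty(score, penalty):
    """Reduce a splitter's score by the given fraction (0.5 means a 50% cut)."""
    return score * (1.0 - penalty)


def apply_proportional_missing_penalty(score, fraction_missing):
    """Reduce a splitter's score in proportion to its share of missing values."""
    return score * (1.0 - fraction_missing)


raw_score = 0.08  # hypothetical improvement score
print(apply_variable_penalty(raw_score, 0.5))
print(apply_proportional_missing_penalty(raw_score, 0.10))
```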
Not all mistakes are equally serious. If you deny a loan to a good borrower you miss an opportunity and perhaps lose a customer. If you grant a loan to a bad borrower you could suffer a more severe loss. Such differential costs of misclassification can be incorporated into the CART analysis and help balance mistakes appropriately.
In the cost matrix below we set the loss suffered by misclassifying a bad as a good to be twice that of misclassifying a good as a bad. We don't want to make any mistakes, but if a perfect model cannot be developed we want to steer away from the more costly mistakes.
We grew a new CART tree using these costs and obtained the somewhat smaller tree below. The relative cost displayed in the curve is larger because the cost of each misclassified bad is now 2 instead of 1. The next display provides more detail.
On this new tree the misclassification rate for the bads has dropped to about 16%; it was 34% on our first tree. However, the new model has a much higher misclassification rate for the goods (55% versus 29.5%). If you think that this second tree makes too many mistakes on the goods you could try using a more moderate cost for misclassifying bads, such as 1.5 instead of 2. Just a few experiments should be enough to determine if varying costs can give you satisfactory performance.
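One way to compare the two trees is to weight each class's error rate by the cost of that kind of mistake. The sketch below does this for the rates quoted above, with a cost of 2 for a misclassified bad and 1 for a misclassified good. It assumes the two classes are given equal weight (equal priors); that weighting is an assumption made for this illustration, and with weights equal to the raw sample proportions the comparison comes out differently.

```python
def expected_cost(error_rates, costs, priors):
    """Sum over classes of prior x error rate x cost of misclassifying that class."""
    return sum(priors[c] * error_rates[c] * costs[c] for c in error_rates)


costs = {"good": 1.0, "bad": 2.0}    # a missed bad costs twice a missed good
priors = {"good": 0.5, "bad": 0.5}   # assumption: classes weighted equally

first_tree = {"good": 0.295, "bad": 0.34}   # error rates of the original tree
second_tree = {"good": 0.55, "bad": 0.16}   # error rates of the cost-sensitive tree

print(expected_cost(first_tree, costs, priors))
print(expected_cost(second_tree, costs, priors))
```

Changing the entry for "bad" in the cost table to 1.5 shows how a more moderate cost shifts the balance between the two kinds of mistakes.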
The CART Audit Trail and Scripting Language
CART keeps a record of every step you take in your analyses and makes it available for replay in another session or on another platform. The record serves as an audit trail documenting how you arrived at your conclusions. To access this record, click on [View] [Open Command Log...] or select the "L" icon from the toolbar.
The scripting language is quite simple and makes it easy to run multiple experiments in batch mode. We know of a CART user who needs to run thousands of trees almost daily; their work is fully automated with scripts (called command files) that look something like what you see below.
The command log is plain text; you can save its contents, edit it, and cut and paste to other applications.
Submitting Command Files and Running Batch Jobs
Command scripts are normally stored in plain text files and given a .CMD extension. You can launch them from the main menus by selecting [File] [Submit Command File...]. Another convenient way to submit repetitive commands is to paste or type them into the CART notepad window.
Selecting [File] [SubmitWindow] will run the script in the CART notepad.
If you intend to work with the command files we advise getting acquainted with the Command Reference section of the on-line help.
What's Next?
We recommend that you grow some CART trees on your own data and become more familiar with the CART style of analysis. Once you discover the power of CART for yourself you won't ever want to conduct data analysis without it.
If you have any questions feel free to contact us by phone, FAX, or e-mail. Our technical support and technical expertise are unmatched in the industry. We also offer public and private training world-wide and provide analytical consulting services for the most challenging data mining and web mining problems.