Train/Test Consistency in CART
Take a close look at CART to see the advantages of using train and test data when building your predictive models.
Welcome to another in the series of Salford Systems online training videos. This is Dan Steinberg. Please visit our website to download the appropriate software to work through the examples with us. Today we're going to be talking about train/test consistency checks for the CART decision tree. Unlike classical statistics, data mining models generally do not rely on the training data to assess model quality. In the SPM data mining suite we're always focused on test data model performance; this is the only way to reliably protect against overfitting. Every modeling method in SPM 7.0, including the classical statistical models, offers test data performance measures. Earlier versions offer test data performance measures for the data mining methods only. Generally, these measures are overall model performance indicators: if we're looking at a tree, we're talking about the tree overall. The measures say nothing about the internal model details. For our example of CART tree assessment we're going to be using the zipped GB2000.xls data set.
What is the advantage of test data?
CART uses the test data performance of every tree in the back-pruned sequence of progressively smaller trees to identify the overall best performer on classification accuracy. CART also notes which tree achieves the best test data area under the ROC curve. On the navigator, you can see that this tree is reported as a tree with 14 nodes. You can also see the results being reported for a 36-node tree, which is the best performer on classification accuracy. But what more can we do? CART performance measures have always been overall tree scores; no specific attention is paid to node-specific performance. But in the real world we often want to pay close attention to individual nodes. We might use the rank order of the nodes in important decisions, and we may prefer to rely on the nodes that are most accurate in their predictions of event rates. Therefore, we need an additional tool for assessing CART tree performance at the node level, and this is provided by the Pro EX feature we call TTC, or train/test consistency checks.
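As a rough illustration of the selection step described above, the following Python sketch picks the best tree from a pruning sequence using test data scores. It is not SPM code: the function name `best_tree` is hypothetical, and every score in `pruning_sequence` is invented for the example (only the tree sizes echo the ones mentioned in the video).

```python
# Hypothetical pruning sequence: (terminal nodes, test accuracy, test ROC area).
# All scores are invented for illustration; they are not SPM output.
pruning_sequence = [
    (96, 0.731, 0.770),
    (36, 0.765, 0.801),
    (14, 0.758, 0.812),
    (6, 0.742, 0.795),
    (2, 0.690, 0.690),
]


def best_tree(sequence, metric_index):
    """Return the tree with the highest test-data score for the chosen metric.

    Ties are broken in favor of the smaller tree (an assumption of this sketch).
    """
    return max(sequence, key=lambda tree: (tree[metric_index], -tree[0]))


best_by_accuracy = best_tree(pruning_sequence, 1)
best_by_roc = best_tree(pruning_sequence, 2)
print(best_by_accuracy[0], best_by_roc[0])  # 36 14
```

Note that both choices rest on a single overall score per tree, which is exactly the limitation TTC is meant to address.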
So here we are inside of SPM, and what I've done is use the File Open dialog to point to the zipped GB2000.xls data set. We see the variables over here; notice that there are 2000 records in the data set and 26 variables. If you've attended some other video sessions you may have seen this data set before. We're not paying very much attention to the details here, except to observe that there are 1000 goods and 1000 bads on an outcome related to a loan. So we'll click on the model dialog, and we will select TARGET as the dependent variable. We're going to go to the testing dialog and set one half of the data to be selected at random for testing. We do require genuine test data in order to get the TTC results; if we were to go with cross validation there would be no TTC report. Hit the start button, and here are our results. This is the display that we saw in the PowerPoint slide before. As we use the arrow keys, we go from the smallest to the largest tree. We start with the largest, of course, prune backwards, and get the performance curve here. Normally we would be tempted to go with the most accurate tree, or perhaps with the tree that has the highest ROC. However, at this point we have still not paid any attention to what is going on at the node level. If we click on summary reports, we can see that the prediction success table, or confusion matrix, gives us very good results: 77% and 76% accuracy in the two classes, averaging a little over 76%, which is quite good. But until we hit the TTC button we really don't know what's going on at the node level.
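The testing setup described above, one half of the records chosen at random and held out, can be sketched in a few lines of Python. This is only an illustration: the function name `split_half`, the fixed seed, and the exact 50/50 cut are assumptions of the sketch, and SPM's own sampling may work differently (for example, by assigning each record independently).

```python
import random


def split_half(records, seed=0):
    """Randomly assign half of the records to training and half to testing."""
    indices = list(range(len(records)))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    cut = len(indices) // 2
    train = [records[i] for i in indices[:cut]]
    test = [records[i] for i in indices[cut:]]
    return train, test


# 2000 placeholder record IDs standing in for the rows of the data set.
records = list(range(2000))
train, test = split_half(records)
print(len(train), len(test))  # 1000 1000
```

The point of holding out a genuine test partition is that every test record lands in exactly one terminal node, so node-level test statistics can be computed and compared with the training ones.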
Taking a closer look
Now let's consider what is going on in these node-specific reports. This is a table which starts with one row for the largest possible tree, which was the tree over here with 96 terminal nodes. Then we follow the back-pruning sequence to get progressively smaller and smaller trees. You can see the number of terminal nodes declining, and we go all the way back to trees that have only two terminal nodes; of course, after that we end up with no tree at all. Because we're interested in class 2, which is the class that represents bad in this particular problem, we'll select class 2 over here as well. And now let's consider what this particular report is telling us. The best way to understand what is going on is to actually look at the results that come from double-clicking on one of these rows. So let's click on the tree that has 14 terminal nodes, and be sure that you've selected target class equals 2, otherwise what you see will not look the same as what we're going to display right now. Let's make this a little bit wider in order to have a better look. We have a display here of the training data results, which are a red line, and the test data results, which are a blue line with the green triangles.
Now what we're representing over here is not a direct measure of the percent of goods or bads in a node. What we are looking at instead is the lift, which is the ratio of the bad rate in a particular node to the average bad rate. The average is always represented as a lift of 1 overall; we have some nodes in which the bad rate is above average, and other nodes in which the bad rate is below average. If we look at the reports for the individual nodes, what we can see is that the third-ranked node on the training data is given a somewhat higher performance measure on the test data. This may or may not bother us, but this is a discrepancy to which the report is calling attention. Again, over here we have training results which suggest that these particular nodes have a reasonably positive, above-average lift. But the test data simply don't confirm that, and we get results which are almost indistinguishable from average. Finally, when we go to the worst-performing node, what we see is that the test data suggest that this node actually has somewhat more bads than the node that is considered to have the fewest bads according to the training data. Whether you find this particular set of discrepancies bothersome or not is not important for this discussion. What is important is that if you consider these discrepancies unsatisfactory, then our reports show you ways to find other trees which will be satisfactory.
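The lift described above is simple arithmetic: a node's bad rate divided by the overall bad rate. Here is a minimal Python sketch; the function name `node_lifts` and the node counts are invented for illustration and do not come from the GB2000 data.

```python
def node_lifts(node_counts):
    """Compute lift per terminal node.

    node_counts is a list of (n_records, n_bads) pairs, one per node.
    Lift is the node's bad rate divided by the overall bad rate, so a
    lift of 1.0 means the node is exactly average.
    """
    total_records = sum(n for n, _ in node_counts)
    total_bads = sum(b for _, b in node_counts)
    overall_rate = total_bads / total_records
    return [(b / n) / overall_rate for n, b in node_counts]


# Invented counts for a three-node tree: 1000 records, 500 bads overall.
train_nodes = [(200, 160), (300, 180), (500, 160)]
lifts = node_lifts(train_nodes)
print([round(x, 2) for x in lifts])  # [1.6, 1.2, 0.64]
```

Running the same function on the training counts and on the test counts for each node gives the two curves that the TTC display compares.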
If you click on the home button over here, it will take us back to this display. We selected 14 to look at; now I'm going to suggest that we look at the tree with six terminal nodes. Why am I choosing six? Notice that the tree with six terminal nodes is the largest tree which is green all the way across this display. The other trees all have some yellow, or perhaps all yellow. Yellow is a caution light; green is a good-to-go light. So let's click on what the report is telling us is a very good result, and let's try to understand why. Notice that the alignment between the training curve and the test curve is excellent, and it's excellent in two dimensions, which is what we want to pay attention to. The first is direction agreement: in each case, if the training data result is above the blue line, which means above average, then the test data is also above the line. Therefore they agree in terms of direction, meaning they agree on whether we are above average or below average. But we're also seeing agreement in terms of the rank ordering of the nodes: which is the best node, which is the second-best node, which is the third-best node, and so on. Now one of the things to keep in mind is that we need to allow for a certain amount of statistical variation, which is due simply to the sampling and not to the underlying truth of this particular model. We do that by allowing for some flexibility in deciding whether we have agreement or disagreement on direction, or whether we have ranks that match up or ranks that do not, by running statistical tests based on a Z-score. These are both set to 1.0 by default. If I decided to be more forgiving, and to allow the discrepancies to get as large as 2.0 before we declared an alert, then what we would find is that we could go to larger trees without seeing an alert being flagged. Now in this case we didn't get very much more in the way of trees that were given the green light.
However, as you set this number larger, you normally become more and more forgiving. Our opinion is that 1.0 is a very good indicator of a degree of consistency between train and test which is going to be persuasive to business decision-makers.
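To make the two checks and the role of the Z-score threshold concrete, here is a Python sketch. It should be read as an assumption-laden illustration and not as SPM's actual procedure: the video does not spell out the exact statistical tests, so the standard-error formulas, the function names (`direction_alerts`, `rank_alerts`, `is_consistent`), and the node figures are all hypothetical.

```python
from math import sqrt


def direction_alerts(nodes, overall_rate, z_threshold=1.0):
    """Flag nodes whose test rate lands on the opposite side of the overall
    rate from the train rate by more than z_threshold standard errors.
    (Assumed test: a one-sample z-score against the overall rate.)"""
    alerts = []
    for node in nodes:
        se = sqrt(overall_rate * (1 - overall_rate) / node["test_n"])
        z = (node["test_rate"] - overall_rate) / se
        train_above = node["train_rate"] > overall_rate
        if (train_above and z < -z_threshold) or (not train_above and z > z_threshold):
            alerts.append(node["id"])
    return alerts


def rank_alerts(nodes, z_threshold=1.0):
    """Order nodes by train rate (richest first) and flag adjacent pairs whose
    test rates are reversed by more than z_threshold standard errors.
    (Assumed test: a two-sample z-score on the difference in test rates.)"""
    ordered = sorted(nodes, key=lambda n: n["train_rate"], reverse=True)
    alerts = []
    for hi, lo in zip(ordered, ordered[1:]):
        se = sqrt(hi["test_rate"] * (1 - hi["test_rate"]) / hi["test_n"]
                  + lo["test_rate"] * (1 - lo["test_rate"]) / lo["test_n"])
        if se > 0 and (lo["test_rate"] - hi["test_rate"]) / se > z_threshold:
            alerts.append((hi["id"], lo["id"]))
    return alerts


def is_consistent(nodes, overall_rate, z_threshold=1.0):
    """A tree gets the 'green light' only if neither check raises an alert."""
    return not direction_alerts(nodes, overall_rate, z_threshold) and \
        not rank_alerts(nodes, z_threshold)


# Invented node statistics for a four-node tree; overall bad rate is 0.5.
nodes = [
    {"id": 1, "train_rate": 0.80, "test_rate": 0.78, "test_n": 200},
    {"id": 2, "train_rate": 0.60, "test_rate": 0.46, "test_n": 300},
    {"id": 3, "train_rate": 0.30, "test_rate": 0.33, "test_n": 250},
    {"id": 4, "train_rate": 0.25, "test_rate": 0.40, "test_n": 250},
]
print(direction_alerts(nodes, 0.5, 1.0))  # [2]
print(rank_alerts(nodes, 1.0))            # [(3, 4)]
print(is_consistent(nodes, 0.5, 2.0))     # True
```

With the threshold at 1.0, node 2 fails the direction check and nodes 3 and 4 fail the rank check; raising the threshold to 2.0 forgives both discrepancies, which mirrors the behavior described in the video. Applying `is_consistent` to each tree in the pruning sequence and keeping the largest tree that passes would mimic the selection of the largest all-green tree.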
So in summary, TTC, which is a unique feature of SPM software and applies only to CART, focuses on two types of train/test disagreement. The first is direction: is this node a response node or not? By response we mean the focus class of interest, and whether the node is above average on it or not. We regard disagreement on this fundamental topic to be fatal, and therefore we're going to require agreement pretty strictly for all nodes. Second, we have a test on rank order: are the richest nodes, as identified by the training data, confirmed in the test data? Without this we cannot defend deployment of a tree, or at least of that part of the tree. TTC allows us to quickly identify which tree in the pruning sequence is the largest one satisfying train/test consistency at the level of strictness which we want to set. The TTC optimal tree is often rather close in size to Breiman's 1 SE rule tree. But the 1 SE rule does not look inside of the nodes at all, and 1 SE is not available for cross validation. Still, this similarity between Breiman's 1 SE rule and TTC is an interesting observation. In the end, what we feel you should be doing is looking at the TTC results, deciding on the level of agreement which you are going to require, and then making your decision on that basis as to which tree to go with, or which tree to present to the decision-makers. Again, thank you very much for attending this particular video session, and we look forward to seeing you again in the near future.