Building The Optimal CART Tree
Building The Optimal CART Tree
Hi, this is Dan Steinberg at Salford Systems, and welcome to this particular edition of Salford Systems' online training videos.
The CART Sequence
In this session we will be discussing the CART® tree sequence. CART follows a forward growing and a background pruning process to arrive at the optimal tree. In the process CART generates for us, not just one model, but a collection of progressively simpler models. This collection of models is known as the tree sequence. In this video segment, it will explain the forward backwards tree generation process. We'll also discuss how a modeler might use judgment to select near optimal tree that might be better for deployment than the so–called optimal tree. Let's go back to our example data set, the euro telco mini excel file and again let's set up our model. Make sure to select a dependent variable. Exclude the record ID, which we don't want in the model and indicate the city and the marital status are categorical variables now hit the start button.
Finding The Optimal Model
If you recall, the optimal model is revealed to have 40 terminal nodes, and in a previous video we discussed this particular tree which has 10 terminal nodes. Now let's go in a little more detail as to what actually goes on during the CART tree growing process. The first thing to notice is that we have a curve down below our tree topology graph, which is a downward sloping curve, usually, till it hits a minimum and then it starts to increase.
This curve represents the performance of different sizes of trees on a test sample. In this case, the test sample was obtained by cross validation, which we discuss in another video. For now, take it on faith, that we have an honest test base methodology for determining how well that given trees performs. The lower this curve gets the more accurate the model is because this is a measure of error. And if we use our left and right arrow keys we can move through a variety of different size CART trees. So the question is how do we get these trees? What does CART do?
Well move all the way to the left edge, of this curve here, to obtain the beginning of the CART tree, which always starts with the root node. In this case the 830 records divided between 126 yes outcomes and 704 no outcomes and CART proceeds to split this root node, which is all of our training data into two partitions. CART always split whatever node it is splitting into exactly two, and no more nodes.
Having split the root into two nodes what CART then does continue to split the nodes. However, the sequence that we see here is not the order in which the trees are grown. In fact, what happens, in order to get an idea of how the trees are grown, we have to go to the other end of this display, which is the largest tree and over here when we click on this enlarge button, whoops we have to click up here, this enlarge button. Now we get to see the largest tree CART developed was.
The actual order in which CART grows the tree is known as left first. That is, CART starts off at this root node here, then splits this left child here, then it splits the left child again, and again to the left, and again to the left, etc. In other words, CART favors making splits on the left side of the tree until it can't go any further and then it gradually works its way across to the right. It actually doesn't matter how CART orders the splitting because every node, no matter where it is, is analyzed in isolation from all the other nodes and that node is searched for the best possible split that CART can find.
So order of tree growing and split making is irrelevant and CART uses an ordering that is convenient for the programming of the software. And we end up with this large tree. So this large tree is our starting point. So if we make it a little bit smaller, it's very hard to see the details here but it gives you a rough idea of the general topology and pattern of the tree is. You can see it is quite deep on certain paths of the tree and relatively shallow on other parts of the tree. Why is that? Why does CART stop short on some paths and not so short on others? Well the answer is very simple, in some cases, CART simply runs out of data. What that means is that if CART arrives at a node in which there is only one record obviously the splitting has to stop. Also, since we're trying to separate zeros from ones, or goods from bads, if we ever arrive at a node, in which all the records are good, let's say, then it doesn't matter how many of them there are. If they are all the same class, there is no further splitting possible. So those two reasons are going to be the principle reasons why a tree may have branches that are not as long in some direction and path as others are. The important point here is that we've gotten to a large tree.
Pruning The CART Tree
What is this large tree? This large tree is our raw material. This is the starting point, from which we are going to try to discover a smaller tree, which lives inside of this larger tree and this smaller tree is going to have certain desirable characteristics among other features of this smaller tree, is that it is going to be the best performer that we can find of all the trees that live inside this bigger tree.
I'm going to click on this button over here again, in order to make this curve visible in its entire range.
So here we are at the bigger tree which has 81 terminal nodes. Now our question is how do I arrive at a smaller tree? And also arrive to a tree that is more desirable than this larger tree? Which is very likely too large, or known in literature, over trained. Well the CART methodology introduces a special concept called cost complexity pruning and the key idea here is that we want to remove from this tree something at the bottom and we want to remove one of these splits which is the split that hurts the performance of this curve the least on this training data. This is something done automatically by CART, CART checks every one of the bottom parts of the tree. The splits that lead to terminal nodes and it calculates what the damage of the performance on the test data would be if that particular split was removed. Finding the split that hurts the least, it pulls back to progressively smaller trees.
So this methodology of cost complexity pruning is very helpful because it reduces the problem of finding a smaller tree inside of this largest tree to something quite manageable. The actual number of sub trees that exist inside of this relatively large maxima tree is some huge number which something close to 80 factorial. It could be smaller, but not that much smaller. So we have the equivalent of perhaps trillion of trees that we would have to estimate and analyze if we wanted to study every possible tree that is extractable from this larger tree. But given cost complexity pruning, the worst case scenario is that if our largest tree has 81 terminal nodes in it then the largest number of trees that we have to access is 81, that is the slowest rate at which we can make the tree smaller, is by removing one terminal node at a time. Sometimes we eliminate more and we don't have to go into the details why, but essentially it has to do with ties available when we start removing nodes from the bottom of the tree.
Why is cost complexity pruning so appealing? Well, here is one reason. Notice here that I have a tree with 6 terminal nodes. What's special about this tree? Well from cost complexity pruning methods we can prove that this 6 terminal node tree is the best performer on the training data of any other 6 node tree. So what is another possibly 6 terminal node tree? Another possible 6 terminal node tree could be on the right hand side of the tree and forget the left hand side of the tree. So we can get that also has 6 terminal node but looks quite different from this tree here which is actually balanced that has a left hand and right hand side. Why don't we produce trees that only have splits on the right hand side after the root node split? The answer is cost complexity pruning works to eliminate that 6 node tree from consideration because it doesn't perform as well, as the tree we have here.
This should explain why the trees look like why they do when you use your arrow keys to look through them. The shapes that we see do not reflect the order in which the trees were grown they reflect the order in which the trees were pruned back. So we actually go to this sequence of trees by a backward process, but in order to start that backwards process, we first had to grow the largest tree possible.
Comparing Prediction Success
Ok, so having this large collection of trees, what we can now do is test them on some independent data known as test data and for each of these trees, what we can compute is the classification accuracy of those trees. This is what CART and the CART authors did in the original monograph describing CART. So if we go to the summary reports, click on prediction success and click on test, what we see here is the accuracy that we obtained for each of the two classes, the 0s and the 1s.
For the 1s, our accuracy on the test data is a little shy of 70% and the 0s is at 51.7% this is for a specific tree, and this tree has seven terminal nodes. If we were to go out to the optimal tree, we go again to the summary reports choose prediction success again and test and we see the performance of the optimal tree is a little higher, on test data is at 68% about the same we saw before but the 0s our accuracy is up to 73%, we can leave this up here and go back to a smaller tree. Let's go back to that 7 node tree again. And now we can compare these two, so you can see what happened as we got to the larger tree as we got to the larger tree, which is 40 terminal nodes versus 7 our accuracy on the responders are the same, but our accuracy on the non–responders got better.
How did we accomplish that? By making finer distinctions in the data and therefore when we allocated certain segments to the response category we didn–t carry along a bunch of non–responders with them to that same segment.
So let's closed these windows and go back to our navigator and observe some comments. First of all, if you're interested in the area under the ROC curve we're told here that the best we can get on the training data is with the 40 node tree and we're also given the score on test data for the current size tree of 7 nodes tree that's .60 which isn't so good. But when we go to the optimal size tree, we get a score of .73 which is a much better. So do we have to take the optimal tree? Answer is no, we have perfect freedom to decide to work with, to deploy, to score with, to report, we can deploy, score, or report any of the trees in this sequence. What would be our reason for choosing a tree which is different? Well under most circumstances, people prefer smaller models to larger models. This number over here .585 is a number that goes between 1, which is the worst possible outcome, and 0 which is the best because it represents the degree of error and we can use that particular score in order to compare different models. We can see here that we have an error score of .61 if we go to a model that has 35 nodes instead of 40 and as we get progressively smaller we can see that error score goes up, but we may be willing to make that tradeoff taking a little bit or a moderate amount of error in exchange for a smaller model.
When it comes to scoring, generally, it is not terribly relevant to consider the number of terminal nodes in a tree, unless we're talking about perhaps tens of thousands of nodes. But in smaller trees it isn't going to make that much difference if we're working with 25, 30, or 40 let's say. But we may want to use a smaller tree not for scoring but for interpretation and for decision making. That's what we did in our first training video with this data. We decided to work with a 10 node tree not because it is the most accurate tree but because it tells an interesting story, and it's a story that can be used for segmentation and marketing purposes. How do we choose this particular tree? Well one of the guide lines was whether we could present that tree on a single page. Not based on any scientific bases on choosing a tree but in real world situations, if you have to defend the tree, you may decide you'll have to defend a tree that is slightly smaller than the tree that will be used to be deployed on servers but which contains the essential predictive content of the main tree which is what we have over here.
So when we go to this particular tree and look at the tree details we see the story that we showed before and a story that is actually fairly compelling telling us what this CART tree is doing.
In general you are always going to have a choice. In general there will be reasons for choosing a tree that is slightly different than the so–called optimal tree. One of the ways to get a handle on the right size tree to explain, or to comment on, or to report is by clicking this button here. This button is a 3 state button, let's continue to click it. This is the original status, which shows this curve. When we click it once, part of this curve, consisting of blue little squares and a blue line are turned green and the green area is an area which we call the SE1 band or the standard error 1 band.
What does that mean? It means that the performances of all the trees that are colored green are within one standard error of the measured performance of the optimal tree which means statistically speaking it is not possible for us to be sure which of these trees would actually perform best on a new data set. Their performance is close enough that there is no compelling reason to choose one over the other, so we might as well choose the smallest tree, in circumstances in which the smallest tree is beautiful. We can then go further in that by including judgment.
So to summarize what we've accomplished in this particular video, we've explained that CART starts with all the data in the root node and it builds a large tree in an order we see different in this tree sequence and that tree becomes the raw material for ultimately finding an optimal tree and other trees that are interesting and use the cost complexity backward pruning algorithm that CART does for you automatically which progressively makes the tree smaller and smaller and defines an interesting sequence of smaller and smaller trees. That sequence never goes slower than removing one terminal node from the maximal tree at a time. So if you have a 300 terminal node tree the most alternative trees you'll see is 300. It then scores all those trees in that sequence using some form of test data, if it is available. In here we used cross validation to do it, and this allows us to measure the performance of each of these trees and then to finally make a decision based on a combination of performance and judgment as to which tree we'd prefer to work with.
Final observation, why is it that CART does not use a stopping rule? And instead use this make a very large tree and prune back to find the right tree? So why not use a stopping rule? Well it turns out the CART authors originally wanted to use a stopping rule as this would allow trees to be finalized more rapidly, but they discovered in their research that the stopping rule often stopped too soon. In other words, they missed important structure that is available in the data but only discoverable if you make a few more splits. The only way to protect against missing important structure in the data is to use CART's fully growing maxima tree followed by backward pruning.
So why do some other tree growing tools use stopping rules? Well here, you'll experience some advantages of working with true CART. CART makes use of some ingenious programming methods to accomplish massive amounts of computation rapidly. These high speed algorithms are available only in the proprietary code written by original CART co–author Jerome H. Friedman of Stanford University and this CART is the only CART that was authored by Jerry Friedman and it–s the CART that is only available through Salford Systems. Competing trees often use stopping rules because their algorithms are slow and they cannot complete the true CART process in reasonable time. These other trees often produce inferior results compared to true CART even if they claim to be offering something like or a good fact simile of the CART algorithm. This concludes our session on tree sequences in CART and we thank you for joining us and we hope to see you again in another one of our training videos. Until then, bye–bye.