Controlling Node Size in CART using ATOM and MINCHILD
Learn to control the size of the maximal CART tree in two ways: Telling CART to stop early and limiting CART's freedom to produce small nodes.
Welcome to another in the series of Salford Systems Online Training. This is Dan Steinberg. We're going to be talking about Salford Predictive Modeler. The SPM software is available from our website, if you don't already have your copy installed, please do so before joining us for the rest of the video. Today we're going to be talking about controlling node sizes and CART with ATOM and MINCHILD. Today's topic is on the technical side, but it's very easy to understand. The concepts we're going to talk about today are relevant for all Salford based tools which includes not only CART, but also TreeNet and Random Forests and there will be others in the future.
Why control node size?
Controlling the sizes of terminal nodes is a practical matter, if you're using CART, for example, to segment a database you might want to make it impossible to create segments that are too small. Altering terminal node size can also influence performance details of the optimal tree. We need to start with a little bit of background about obtaining optimal trees. CART theory teaches us that we cannot arrive at the optimal tree by a stopping rule. The CART authors devoted quite a bit of energy to researching this topic. For any stopping rule, is possible to construct data sets for which that stopping will not work. What that means, is we will end up stopping too early and we will miss important data structure. This result was discovered by both experimentation and via mathematical construction. So the CART methodology is to grow first, and then prune and ultimately arrive at the best tree. The CART methodology is thus, a sort of search engine for discovering possibly valuable trees. First we grow, then pruning is used to arrive at the optimal tree or very often a range of tree sizes that yield both acceptable predictive performance and simplicity. CART also insists that we have a test method to make our final tree selection, but that's a topic of another session. CART theory tells us the trees should be grown to their maximum size during the growing phase. Thus, trees should be grown until we either, run out of data, which means there's only one record left and thus there is nothing to split, the node is impossible to split, because it's pure, that is all the records belong to one class such as, for example, all of them are good or all of them are bad in a two class good–bad problem, or the node is impossible to split because all records have identical values for predictors. The last is pretty rare, but it is still possible. Experience tells us, that if you start with 1000 records in a typical binary classification problem, you should expect about 500 terminal nodes in the largest possible tree, but it could be many less. But our rule of thumb is, if you're trying to make calculations and allowing for how big a tree you're going to get, if you try to grow the biggest, however many records you have, you&rsquore going to get about half that number, or less, of terminal nodes. We are talking about a binary classification problem. This could be different if you're talking about regression or multiclass classification.
So, let's start with a data set that we discussed in other sessions, it's GB2000.xls-Zipped, and let's load that into CART and have a look at what we can do with it. If you're going to follow along, you'll want to have loaded the data set and set up your model. I've already done a few runs here and we'll be talking about them, but once loading the data you will see your Model Setup dialog that will look like this. You'll be selecting target as the dependent variable. You want to make sure that in testing, you set it to 'no independent testing — exploratory tree'.
The reason I'm doing that is because I want to get the biggest tree possible and I don't want to mess around with testing right now, and one other very important feature — you need to go to the advanced tab and here just follow along with what I do and we&rsquoll discuss what these controls mean in a little bit. But you're going to set the 'Parent node minimum cases' to 'two' and the 'Terminal node minimum cases&rsquo to 'one&rsquo, that's going to allow us to get the biggest possible tree, with really no limits on the growing of the tree. And this is what we're going to get, we're going to get a tree with 349 terminal nodes.
So on the navigator, we have a little control on the left here, if we click it, what we get is a display that represents the sample sizes in the terminal nodes. Now if we&rsquod had a test partition we could see the counts for either 'train&rsquo or 'test.&rsquo But since we only have a learn sample here, only one of these buttons is active.
Now, we said before, that if you grow a tree to its absolute largest limit, that technically, we're going to run out of data when is only one record at a terminal node, but we might stop for other reasons and that's exactly what's going on here. If we hover our mouse over some of the nodes, we might actually see here there's five cases in this particular node but they're all class ones - which are the goods. Look over here and we can see we have two records, the reason the tree stopped there with only two is because both of them are class ones again, so there's nothing more to split. We can go to a specific part of the tree, for example over here, and we can ask to look at it. And here we can see a little more detail about what's going on in the terminal nodes. Notice that on the left–hand side, we have a terminal node with 133 records in it, but they're all bads, so the tree has to stop. There's no more work for it to do. In this node over here, we have ten bads and one good, and we get a split here in which eight of the goods go to the right. At this point, we can no longer split the node so the tree stops. But over here, we actually have a mixture, and so therefore there is an opportunity to do yet another bit of splitting and over here we split out one record to the right and we split out two records to left.
How can we control node size?
Now in real–world practice, it simply may not be necessary to push the tree growth to the literal maximum. It is essential to grow a large tree, large enough to include and have gone beyond the optimal tree. We can control the size of the maximal CART tree in a number of ways. Some controls tell CART to stop early. Other controls limit CART's freedom to produce small nodes. So the key controls that we are focused on in this session have the names of ATOM and MINCHILD. ATOM terminates splitting along the branch of the tree when the node sample size is too small. This means that this node is too small to be split, therefore, it is not allowed to be a parent. If a node contains fewer than ATOM data records, then we stop. Ten is commonly used, but you might set this much larger, in our example that we've just run, we set it smaller. MINCHILD prevents the creation of child nodes that are too small. So, we're working with that parent node, and it has more records than are necessary, therefore, it has more data than ATOM says is the minimum, or equal, to what ATOM says is the minimum. And now, MINCHILD says, out of that data, we're going to produce two children and we're going to limit how small the smallest of those two children can be. The smallest possible value is one. And that's what we had in our example. Not necessarily a very interesting example, but at least you can see that it is possible.
What does this really mean?
What this means, is that in splitting a node, we would be permitted to send one solitary record to a child node, and all other records to the other child. Larger values are sensible and desirable. Values such as five, ten, twenty, thirty, fifty, could work, for you. We've used values as large as two hundred and I can't come up with any theoretical reason why you might not sometimes be using even larger numbers yet.
But go back to the SPM application, and look again at setting of these ATOM and MINCHILD controls. So we have to go to the Model Setup dialog first. Then we go to the advanced tab, and there we see the two controls where we use the terminology parent node, that is can equivalent to the ATOM control, and terminal node minimum cases that is the MINCHILD control.
Now we can set these to numbers that suit us. The default happens to be ten for terminal nodes — for the parent node minimums, and one for the terminal node minimum. There are a number of circumstances in which I like to set the terminal node minimum quite a bit higher. Here we provide you with a rule of thumb: we recommend that ATOM be set to three times MINCHILD, so you can decide on what you'd like ATOM to be, or you can decide on what you'd like MINCHILD to be. But once you've got that number in mind, then you can use this rule to help you guide how to set the other one. How do we come to this number? Well, ATOM must be at least twice MINCHILD to allow a split consistent with MINCHILD. If you're going to split a node into two children then you have to have enough data for each of the two children to be filled properly, and if the two children each have to have a minimum of MINCHILD records, than you need twice that inside of the parent which is going to be the minimum size that a parent can have, which is therefore, ATOM. If you set inconsistent values for ATOM and MINCHILD, they will be reset automatically, to be consistent. So be careful when you control these, if you want to be the one who sets those particular values. So as we said before, ATOM controls the right to be a parent, the parent must generate two children, the parent must contain enough data to be able to fill the two child nodes, so that parent must have at least two times MINCHILD records.
So look at the following example where we got 20 records in the data set. If we set ATOM equal to 20, and we set MINCHILD equal to 10, then we have a chance of splitting this node into two children. But if we're going to do that, then we're going to need to set our split point so that exactly 10 records go to the left child and exactly 10 records go to the right child. Any other division of the data is not going to succeed, and therefore we won't have a successful split. So in the diagram over here we're looking at the values of a particular splitter, and so you can see that we got the minimum value of the splitter on the left, we've got the maximum value of the splitter on the right — this is the variable that we're going to use to make the split and now I have indicated in the middle, a split point. That split point may not exist. We may not have the option to select a split in this variable that will allow the ten records to go to the left and the other ten records to go to the right. If we can't find that kind of split, perhaps because of clumping of values, then we're not going to be able to make a split.
So let's explain our reasoning for wanting to have ATOM three times MINCHILD. In the diagram below, we have a representation of one splitter that is available in the database, and so CART is going to try to use that splitter to see if it can find a successful split. We've also set our MINCHILD equal to 10, so assuming also, that this particular splitter has a distinct value for every record, so there's no record clumping — we don't have to worry about that. We can split the data any way we like using this particular splitter. We can choose some point along the ordered line of values and we can say that's a split point. And we will then get a certain number of records going to the left and another number going to the right, but if we set the MINCHILD equal to ten, then we cannot get closer to the min. end of the values than that asterisk that we are showing. That asterisk is being shown here. If we were to choose a value that is smaller than this asterisk, then one of the children would have fewer than 10 records and so it would not be allowed. Similarly, if we were to set the splitter to a value higher than the second asterisk, than we would leave fewer than 10 records on the right-hand side and this would also be inadmissible if the MINCHILD is set to 10, and therefore we couldn't split there either. But, given that we have 10 records and 10 different values therefore, between the two asterisks, we have some flexibility, and we can choose one of those to be the splitter. Which one would we choose? Well it would be the value that gives us the best Gini improvement. For example, so if we have in this case 30 records, which is three times the MINCHILD, we have a chance for some flexibility. If we had set ATOM to just twice MINCHILD, we would have exactly one place in which we could split using this particular splitter, and if that didn't work, then we simply would be out of luck. But with three times MINCHILD, we have a little bit of flexibility. Now you might decide that you would like more flexibility. Maybe you want to set that value to four times and you're not guaranteed that flexibility anyway, if there's significant clumping of values in your data, but this is at least a rule of thumb and a guideline.The best way to determine the ideal settings for ATOM and MINCHILD is to actually run some experiments. In the Model Setup dialog, go to the battery dialog and select ATOM and select MINCHILD. In this particular case, we might also want to use a testing method in this case I set testing method to .05. Other than that, we can just let things go their own merry way.
Let's hit the start button and see what happens. Now, we're going to be building a collection of models, and this may take a few minutes on your computer depending on the speed with which your computer runs. Once the battery completes, we get a battery summary, which shows us the results for all of the different combinations which were run. And notice that the y–axis again is relative error, so we want the number to be as small as possible and we can see there's quite a few settings that have close to the lowest value. And then there's a number of settings which simply don't work, that is, we set the limits on tree growing too high. Why don't we click on this column here, which is 'relative error', which will order the results from the best performing or lowest error, to the highest. So we can see here that the worst possibility is when we're using a MINCHILD of two hundred, and we happen to be using an ATOM of four hundred, and that combination really prevents us from digging into the data structure. Although the literal best has ATOM equals five and MINCHILD equals one, it turns out that we can go to here, which is the setting with ATOM equals 10 and MINCHILD equals five, and we're going to get a relative error of 0.4698, which is a little over 1% greater than the absolute minimum — so that's pretty good. But let's just give a little try here, if it looks like that MINCHILD, in fact, should be set to something along the order of five. Then let's go back to the advanced setting. Let's set the MINCHILD, or smallest possible terminal node, to five and let's follow our rules and set this to fifteen and see what happens. And this is only one tree, so it should run pretty well. We get a result of .470, which is a little bit worse than the recommended settings based on experimentation of ten and five. Not bad, certainly we can live with either one reasonably.