Controlling Node Sizes in a CART Tree
The theory behind the CART decision tree, as laid out in detail in the classic monograph Classification and Regression Trees by Breiman, Friedman, Olshen, and Stone, dictates that CART trees always be grown to their largest possible size before pruning. This means that the smallest allowed terminal node will have only one record in it! In theory, this is not a problem because large CART trees are just the raw material that the pruning engine starts with to arrive at the optimal tree. Thus, it is likely that all the small nodes will be pruned off anyway.
In practice, however, real-world data sets do not always work that way. In fact, with small-to-moderate-sized data sets it can pay to manipulate the controls that govern the sample sizes allowed in CART nodes. In this note we discuss this topic and provide some examples and instructions on how to use the controls.
Let's start with an example (use your own data to generate something similar).
This data set contains 2,000 records recording a GOOD/BAD outcome in the dependent variable called TARGET. TARGET=1 represents GOOD and TARGET=2 represents BAD. To run this model we selected all the available variables on the data set as potential predictors and allowed CART to select a 50% randomly selected test sample. See below for more information on the model setup if you need instruction in this area as well.
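The setup above can be sketched in open-source terms. The following Python snippet uses scikit-learn (an analogy, not the CART product itself) and synthetic data in place of the original 2,000-record file; it mimics the same design: a binary TARGET coded 1=GOOD / 2=BAD, all columns as predictors, and a 50% randomly selected test sample.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))                              # 2,000 synthetic records
y = (X[:, 0] + rng.normal(size=2000) > 0).astype(int) + 1    # TARGET: 1=GOOD, 2=BAD

# Hold out a 50% randomly selected test sample, as in the CART setup.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Area under the ROC curve on the test sample; column 1 of predict_proba
# corresponds to class 2 (BAD).
auc = roc_auc_score(y_test == 2, tree.predict_proba(X_test)[:, 1])
print(round(auc, 4))
```

The exact AUC will differ from the 0.7903 reported below, since the data here are synthetic.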
This looks to be a near-ideal CART model, with the test sample error rate (the curve in the lower panel) falling rapidly to a plateau and then rising quickly as the tree is allowed to grow too large (overfit). The area under the ROC curve for the test sample is a respectable 0.7903 (see the lower left area of the navigator diagram). But what you might find troubling is the size of the smallest node in the training data: just 2 records. (See the section titled "Model Statistics" on the right-hand side of the navigator display below.) Two records in a terminal node might be too small for comfort: every terminal node is responsible for a prediction for any data falling into it, and we might not want to put too much faith in a prediction generated by a two-record node.
Another way to view the tree is to click the small button, containing a curve with green dots, on the left edge of the navigator's lower pane. This control cycles the lower panel through three different displays. Click it twice to get the display below. (Look for the red arrow below.)
The bar chart above shows us the relative sample sizes in each terminal node. You can elect to view either the train or test sample distribution; in this example the two are virtually identical. In both cases we have a preponderance of small-sample nodes.
What to do about this? Of course the born-and-bred statistician might suggest that you go out and get more data, but this is usually not an option and might not even eliminate the problem. The next natural thing to do is to try pruning the tree using the arrow keys. We tried this and came up with the following pruned tree:
What we see here is rather typical: we pruned the tree back a fair way but did not succeed in eliminating all the really small terminal nodes. Above, even after pruning back from 36 to 17 terminal nodes, we still have a node with just three records in it. It often happens that as you prune back a large tree a few stubborn branches with very small terminal nodes remain. Looking at the node distribution shows that things are better but not really what we wanted.
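The pruning CART performs is cost-complexity pruning (as described in Breiman et al.), which scikit-learn exposes through the ccp_alpha parameter. The sketch below (synthetic data; the alpha chosen is arbitrary, picked from the middle of the pruning path) prunes a fully grown tree and reports how many leaves remain and how small the smallest one is; as the text notes, pruning alone offers no guarantee on leaf size.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=1000) > 0).astype(int)

# Grow the largest possible tree, then compute its pruning path.
full = DecisionTreeClassifier(random_state=0).fit(X, y)
path = full.cost_complexity_pruning_path(X, y)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]   # an arbitrary mid-path alpha

# Refit with pruning applied; the pruned tree is smaller, but its
# smallest leaf may still hold only a handful of records.
pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X, y)
is_leaf = pruned.tree_.children_left == -1
print(pruned.get_n_leaves(), pruned.tree_.n_node_samples[is_leaf].min())
```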
So we need a more direct way to limit the terminal size. The way we do this in CART is to prevent the creation of such small sample nodes in the first place.
The control we need is on the MODEL SETUP dialog's ADVANCED tab. There we have the option to disallow terminal nodes smaller than any value we care to specify. The default minimum for a terminal node is one record!
Now try changing the value of "Terminal node minimum cases" to five, as shown here:
Then press the "Start" button to regrow the tree to get:
Observe that the smallest terminal node now has seven cases in it. Although the overall relative error rate (a measure of misclassification rate) is slightly higher, the test sample area under the ROC curve is also slightly higher. So on balance we have essentially the same performance in the new tree but with more agreeable sample sizes in the terminal nodes (see below).
So how should we set this control? Before answering that question we need to clarify a few more concepts.
What exactly do we mean by "minimum size of terminal node"? In the command language we refer to this as MINCHILD and the control is written as

LIMIT MINCHILD = 5
This means that as we grow the CART tree by splitting parent nodes into two children, the sample size in the smallest child is not allowed to fall below five. Suppose we start with a training data set of 100 records. Technically we might discover a variable and split point that places four records on one side of the tree and 96 records on the other. Even if this split is highly accurate it will not be allowed because of the LIMIT MINCHILD command. The control operates by allowing some splits while disallowing others.
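In scikit-learn terms (an analogy, not CART's own implementation), MINCHILD corresponds to the min_samples_leaf parameter: any candidate split that would leave fewer than that many training records in either child is rejected. A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))        # 100 training records, as in the example
y = (X[:, 0] > 0).astype(int)

# min_samples_leaf=5 plays the role of LIMIT MINCHILD = 5: no split may
# create a child with fewer than 5 training records.
tree = DecisionTreeClassifier(min_samples_leaf=5, random_state=0).fit(X, y)

# n_node_samples holds the training count in every node; children_left == -1
# marks the terminal (leaf) nodes.
is_leaf = tree.tree_.children_left == -1
leaf_sizes = tree.tree_.n_node_samples[is_leaf]
print(leaf_sizes.min())              # never below 5
```

A hypothetical 4-vs-96 split of the 100 records, however accurate, would be rejected under this setting, exactly as described above.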
However, this is not the only sample-size-related control we have in CART. The other, closely related, control is ATOM, which operates by determining whether a node is allowed to become a parent. Below we see how to set this in the GUI as "Parent node minimum cases."
To better understand these two controls, consider the diagram below, which consists of a parent node (green) and two child nodes (red).
The "parent minimum" control determines whether the green node above is allowed to be split at all. If this node is too small to be split, the tree simply stops along this branch. Suppose instead that the green node is large enough for splitting. Splitting means breaking the node into smaller pieces, and the second control, on "terminal nodes," limits how small the smaller of the two children can become.
In our example, we set the parent minimum size to 10 meaning that any node with fewer than 10 cases is not eligible to be split. We could have set this value to a higher number, such as 15 or 20. This control effectively stops the growth of the tree along a branch once that minimum size has been reached. As long as a node is splittable (i.e., is eligible to be a parent), however, it is allowed to generate a child with as few as five records in it.
Observe that if ATOM=10 and MINCHILD=5 then a node with 10 records must be split evenly (5 records to the left and 5 to the right) or not split at all. For this reason, we recommend that the value for ATOM (minimum parent size) be set to at least three times the MINCHILD value (minimum terminal node size) to give CART a little more flexibility in how it can split a node.
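As an illustration of this pairing (again in scikit-learn terms, where min_samples_split and min_samples_leaf play the roles of ATOM and MINCHILD; the data and values are illustrative), the rule of thumb above might be coded as:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

MINCHILD = 5                 # smallest allowed terminal (child) node
ATOM = 3 * MINCHILD          # smallest node still eligible to be a parent

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 6))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

tree = DecisionTreeClassifier(
    min_samples_split=ATOM,      # plays the role of ATOM
    min_samples_leaf=MINCHILD,   # plays the role of MINCHILD
    random_state=0,
).fit(X, y)

t = tree.tree_
is_leaf = t.children_left == -1
print(t.n_node_samples[is_leaf].min(),    # every leaf holds >= MINCHILD cases
      t.n_node_samples[~is_leaf].min())   # every split node held >= ATOM cases
```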
Now we arrive at the question of optimal settings for this pair of controls. Here we recommend that you try the BATTERY command (if your version of CART includes it). On the Model Setup dialog select BATTERY and then add the ATOM and MINCHILD options to the batteries (experiments) you wish to run.
The BATTERY automatically varies the values of ATOM and MINCHILD for you. You can just go with the defaults or edit the Values field to specify the values you prefer. In this example, we just go with the defaults to get this summary:
To make the results easier to read and absorb click on the Rel. Error box to sort the results and get:
The table shows that several different combinations appear to yield the lowest error rate and the same 40-node tree size, but that the combination of ATOM=20 and MINCHILD=10 yields an only slightly worse tree with just 21 terminal nodes. Further inspection of the table might turn up other appealing combinations.
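If your toolkit lacks BATTERY, the same experiment can be approximated with a plain grid search. The sketch below uses scikit-learn's GridSearchCV over min_samples_split (ATOM-like) and min_samples_leaf (MINCHILD-like); the value grids and the synthetic data are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(2000, 8))
y = (X[:, 0] - X[:, 2] + rng.normal(size=2000) > 0).astype(int)

# Each grid entry mirrors one BATTERY experiment.
grid = {
    "min_samples_split": [10, 20, 50, 100],   # ATOM-like values
    "min_samples_leaf": [1, 5, 10, 25],       # MINCHILD-like values
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      grid, cv=5, scoring="roc_auc").fit(X, y)
print(search.best_params_)
```

As with the BATTERY table, the winning combination matters less than the comparison it enables: near-ties with larger minimum node sizes are often the more comfortable choice.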
Unless you start imposing rather large MINCHILD values, these controls should have no effect on the upper parts of the tree (generally the more important ones) and thus only shape how the tree arrives at its final refinements. This suggests that ATOM and MINCHILD have more aesthetic than substantive impact on a tree, which means you need not be too concerned about using them to polish your results.
Appendix: Setting Up the Model
After opening a dataset and opting to set up a model, indicate your target or dependent variable. This is enough to get started! We often override the default testing method by selecting the "Testing" tab of the Model Setup dialog as well. This note focused on options found on the "Advanced" tab.