Controlling Model Complexity with “Added Variable Penalties”
To clarify our point here: when grow a large CART tree using a huge number of potential predictors we develop an initially complex model. Subsequently, we prune the tree making it simpler, sometimes radically simpler. But practical experience has demonstrated time and again that if you can intelligently reduce the number of potential predictors that are fed into CART before you start you will probably obtain a better performing model after the pruning is complete. In other words, if the model building process is simplified then the pruning process will typically give even better results on new unseen data. This applies to our other core data mining engines as (MARS, TreeNet, RandomForests): starting with a well selected set of predictors will give better results. The pre-selection of predictors helps in part by ensuring greater model simplicity during the model construction phase. We have discussed the specific topic of “variable shaving” to accomplish this automatically in SPM (the Salford Predictive Modeling suite) in other posts and videos. Here we are focused on another technique: “Added Variable Penalties”.
To the best of our knowledge this idea was first introduced by CART co-author and algorithm author Jerome H. Friedman when he introduced MARS in 1991. If you are not familiar with MARS don’t worry, you can still follow along with us here. MARS is an automated regression machine that introduces new variables into the model step by step, simultaneously breaking predictors into regions to allow region specific slopes. Once a variable is in the model it can serve as the basis of an interaction with other variables also already in the model or with candidate predictors not yet in the model. In his testing of MARS Friedman observed that when a collection of highly correlated variables were available as predictors the MARS model construction process might include several of them in the model, either as main effects or interactions, resulting in a fairly complex and not easy to understand model. The MARS second stage model simplification process, which removes unneeded terms from the model, might not succeed in eliminating this complexity as much as desired. To address this issue Friedman introduced a user controllable penalty to restrain the introduction of new variables into the model. During the forward stepping model growing process MARS ranks all candidate ways to expand the model based on some training sample performance gains. If any such expansions involve the introduction of a variable previously not in the model they are automatically penalized (for example, by 20%) making it less likely that the model be expanded by the introduction of a new predictor. Although this control has been in MARS since its earliest versions it has rarely been used or discussed in the literature.
Years later Friedman revived the idea again for TreeNet in the context of interaction detection. Since trees with more than two terminal nodes are natural interaction detectors (and generators) Friedman asked whether a perfectly flexible TreeNet model might end up generating more interactions than are absolutely necessary to capture the true data generating process. Since we cannot guarantee that every interaction in every tree is “real” we might restrain interaction discovery by using a penalty such as the one introduced for MARS. Thus, in TreeNet as well we have essentially the same “Added Variable Penalty” but this time applied within a given branch of a tree. Suppose we are trying to predict target variable Y and we develop a tree with predictor X1 as the root node splitter. With the added variable penalty we will now make it more likely that X1 is used again immediately as the splitter of one or both of the child nodes produced by the root. But suppose that even with the penalty in place X2 splits the right child. Now when we look for a splitter for each of the two children of this second split we will give preference to using either X1 or X2 again since these variable are not new.
Here is an example from a CART tree using our GB2000.xls-Zipped data set.
Observe that on the left hand side of the tree we have three splits each of which involves a different variable. With an added variable penalty we could induce instead
Observe that the third level split is at a different location and that it reuses the M1 splitter, thus avoiding the introduction of a new variable. Also note that CART has many controls over splitters, including forced splits and constraints, but does not directly offer the added variable penalty. This specific penalty only appears in MARS and TreeNet and we manipulated the CART tree to simulate the penalty just as an illustration.
What is the effect of using such a penalty? In TreeNet each tree grown will tend to be constructed from a smaller set of variables. If the penalty is set high enough trees could be pushed back to being constructed from just one or two variables (mostly). In MARS, the forward stepping will also tend to involve fewer variables. In both cases, the initial model construction will result in simpler models at least in the sense that fewer variables play a role in the models. In both MARS and TreeNet we follow the initial model construction with a second stage model simplification that could yield even further simplification.
Now we come to the final question: does the added variable penalty help us obtain better models? We have to admit that our early experiments with both MARS and TreeNet were disappointing. In both cases we have found that the backwards stepwise predictor elimination built into BATTERY SHAVING is more reliable for producing robust simplified models. Further, when it comes to interaction restraints for TreeNet we have come to prefer our Interaction Control Language (ICL) as it gives absolute control over exactly which interactions are allowed. However, we also want to point out that our experiments are far from definitive and we hope to research the issue in detail in the near future.