What is the Variable Importance Measure?
In fielding support questions over the years, I have often been asked about CART's variable importance measure. Typical questions include "What is the definition of Variable Importance?" and "Why is a variable shown as important when it is never a splitter?"
Given that one of the goals of CART is to develop a simple tree structure for predicting data, relatively few variables may appear explicitly as splitters, which might be interpreted to mean that the other variables are not important in understanding or predicting the dependent variable. However, unlike in a linear regression model, a variable in CART can be considered highly important even if it never appears as a node splitter. Because CART keeps track of surrogate splits in the tree-growing process, the contribution a variable can make to prediction is not determined only by primary splits. (The primary splitter is the variable you see displayed in the tree structure. Behind the scenes, however, whenever that variable is missing, the surrogate splitters are used instead to move a record down the tree to its appropriate terminal node.)
One way to think about this is to consider pairs of variables that contain similar information, such as father's and mother's education. Although only one of these variables can appear in a particular primary split, because one will perform better than the other in a given context, to rank one of these variables as important and the other as unimportant would be a mistake. Suppose, for example, FED (father's education) was chosen as the primary splitter and MED (mother's education) turned out to be the best surrogate. Just how close these two variables are in predictive power would become evident if we either deleted the primary splitter FED or set all its values in the data set to missing when applying the tree to new data. In these circumstances, the surrogate variable could end up doing all the work of the primary splitter, and the predictive accuracy of the tree might not be any worse if MED had to be used everywhere instead of FED. The phenomenon of one variable obscuring the significance of another, known as masking, is addressed in CART's variable importance measure.
To calculate a variable importance score, CART looks at the improvement measure attributable to each variable in its role as either a primary or a surrogate splitter. All of these improvements, primary and surrogate alike, are summed for each variable across every node of the tree, and the totals are then scaled relative to the best-performing variable. The variable with the highest sum of improvements is scored 100, and all other variables receive lower scores ranging downward toward zero. A variable can obtain an importance score of zero in CART only if it never appears as either a primary or a surrogate splitter. Because such a variable plays no role anywhere in the tree, eliminating it from the data set should make no difference to the results. (Some rare circumstances occur in which this rule of thumb is violated, but these are not discussed here.)
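The sum-and-scale step described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not CART's own code: the function name importance_scores, the improvement values, and the extra variables INCOME and AGE are all made up for the example.

```python
from collections import defaultdict


def importance_scores(improvements, all_variables):
    """Sum each variable's improvements and scale so the best scores 100.

    improvements: (variable, improvement) pairs, one for every node where the
        variable acted as a primary or a surrogate splitter.
    all_variables: every candidate predictor; entries in `improvements` for
        variables not listed here are ignored.
    """
    totals = defaultdict(float)
    for variable, improvement in improvements:
        totals[variable] += improvement

    best = max(totals.values(), default=0.0)
    scores = {}
    for variable in all_variables:
        total = totals.get(variable, 0.0)
        # A variable that never splits has a total of 0 and so scores 0.
        scores[variable] = 100.0 * (total / best) if best > 0 else 0.0
    return scores


# Hypothetical improvements: FED is primary at two nodes, MED and INCOME
# appear once each, and AGE never appears at all.
improvements = [("FED", 0.30), ("MED", 0.27), ("FED", 0.10), ("INCOME", 0.05)]
scores = importance_scores(improvements, ["FED", "MED", "INCOME", "AGE"])
```

With these made-up numbers FED has the largest total and is scored 100, MED lands at roughly two-thirds of that, and AGE, which never appears as a splitter of either kind, receives zero.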
The importance score measures a variable's ability to perform in a specific tree of a specific size either as a primary splitter or as a surrogate splitter. It says nothing, however, about the value of the variable in the construction of other trees. For example, a variable that is very important in a 20-node tree might not be important at all in a two-node tree because it plays no role in the splitting of the root node (which is the only split in a two-node tree). As a tree is allowed to become bigger, variables have more opportunities to play a role in the tree and thus to receive non-zero importance scores. The relative importance rankings of variables can change dramatically as you compare trees of substantially different sizes. Thus, you should not take importance scores to indicate an absolute information value of a variable; the rankings are strictly relative to a given tree structure.
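A toy calculation shows how the rankings can shift with tree size. The sketch below assumes hypothetical variables A, B, and C with invented improvement values, and it stands in for a "smaller tree" simply by keeping only the root split; real CART trees of different sizes come from pruning, which is not modeled here.

```python
def scaled_importance(node_improvements):
    """Sum improvements per variable over the given nodes, then scale to 100."""
    totals = {}
    for improvements in node_improvements:
        for variable, gain in improvements.items():
            totals[variable] = totals.get(variable, 0.0) + gain
    best = max(totals.values(), default=0.0)
    return {v: (100.0 * (t / best) if best > 0 else 0.0) for v, t in totals.items()}


# Hypothetical splits, root first. Each dict maps a variable to the improvement
# credited to it at that node, whether as primary or as surrogate.
nodes = [
    {"A": 0.50, "B": 0.10},  # root split
    {"C": 0.40, "B": 0.30},
    {"C": 0.35, "B": 0.25},
]

two_node_tree = scaled_importance(nodes[:1])  # root split only
larger_tree = scaled_importance(nodes)        # all three splits
```

In the two-node tree A is the top variable and C does not appear at all; once the two extra splits are included, C becomes the top variable and A drops below B. Neither ranking is "the" importance of these variables; each is relative to its own tree.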
The scores reflect the contribution each variable makes in classifying or predicting the target variable, with the contribution stemming from both the variable's role as a primary splitter and its role as a surrogate to any of the primary splitters. In our example, ANYRAQT, the variable used to split the root node, is ranked as most important. PERSTRN received a zero score, indicating that this variable played no role in the analysis, either as a primary splitter or as a surrogate.
To see how the scores change if each variable's role as only a primary splitter is considered, click the Consider Only Primary Splitters check box; CART automatically recalculates the scores.
You can also discount surrogates by their association values if you check the Discount Surrogates check box and then select the By Association radio button. Alternatively, you can discount the improvement measure attributed to each variable in its role as a surrogate by clicking on the Geometric radio button and entering a value between 0 and 1. CART will use this value to geometrically decrease the weight of the contribution of surrogates in proportion to their surrogate ranking (first, second, third, etc.). Finally, you may click on the Use Only Top radio button and select the number of surrogates at each split that you want CART to consider in the calculation.
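To make the effect of these options concrete, here is a rough Python sketch of how each one could change the totals. It is an assumption-laden illustration, not CART's implementation: the Node and Surrogate classes, the mode names, and all numbers are hypothetical, and the geometric rule shown (the k-th ranked surrogate weighted by the entered value raised to the power k) is just one plausible reading of "geometrically decrease".

```python
from dataclasses import dataclass, field


@dataclass
class Surrogate:
    variable: str
    improvement: float
    association: float  # agreement with the primary split, between 0 and 1


@dataclass
class Node:
    primary: str
    improvement: float
    surrogates: list = field(default_factory=list)  # ordered best first


def importance(nodes, mode="all", value=None):
    """mode: 'all', 'primary_only', 'association', 'geometric', or 'top'."""
    totals = {}
    for node in nodes:  # make sure every variable seen gets at least a zero
        totals.setdefault(node.primary, 0.0)
        for s in node.surrogates:
            totals.setdefault(s.variable, 0.0)

    for node in nodes:
        totals[node.primary] += node.improvement
        if mode == "primary_only":
            continue
        kept = node.surrogates[:value] if mode == "top" else node.surrogates
        for rank, s in enumerate(kept, start=1):
            weight = 1.0
            if mode == "association":
                weight = s.association
            elif mode == "geometric":
                weight = value ** rank  # assumed rule: value, value^2, ...
            totals[s.variable] += weight * s.improvement

    best = max(totals.values(), default=0.0)
    return {v: (100.0 * (t / best) if best > 0 else 0.0) for v, t in totals.items()}


# Hypothetical two-split tree using the FED/MED example from earlier.
tree = [
    Node("FED", 0.30, [Surrogate("MED", 0.27, 0.9), Surrogate("INCOME", 0.12, 0.5)]),
    Node("INCOME", 0.10, [Surrogate("FED", 0.04, 0.4)]),
]

full = importance(tree)
primary_only = importance(tree, mode="primary_only")
by_association = importance(tree, mode="association")
geometric = importance(tree, mode="geometric", value=0.5)
top_one = importance(tree, mode="top", value=1)
```

In this toy tree MED never appears as a primary splitter, so its score falls to zero under the primary-only setting, while each discounting option lowers its score by a different amount relative to the undiscounted calculation.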