Variable Importance In CART®
Discover what variable importance is and how it will help you build accurate predictive models.
Welcome to another in our series of Salford Systems online training videos. This is Dan Steinberg, and today we're going to be talking about variable importance in CART. If you haven't already done so, please download the relevant software, the Salford Predictive Modeling Software, from our website. There you will find other information related to data sets, and slides that cover the material we're talking about here.
Okay, variable importance in CART. It's hard to imagine now, but in 1984 when the CART monograph was first published, data analysts did not generally rank variables. Some did, but most sophisticated analysts did not. Informally, however, researchers would pay attention to t-statistics or p-values associated with the coefficients of regressions. But the idea of actually taking these as a literal ranking of variables was frowned upon. Since the advent of modern data analytic methods, researchers expect to see a variable importance ranking for all models, and all these modern approaches really started with CART.
So, the CART concept of variable importance. Variable importance is intended to measure how much work a variable does in a particular tree. In other words, variable importance is tied to a specific model. A variable might be most important in one model and not important at all in a different but rather similar model built on the same data. So it's important to understand that this is not a statement so much about a variable as it is about the role of the variable in this model. The fact that a variable is important also does not mean that we need it. If we were deprived of the use of an important variable, it might be that other available variables could substitute for it and do the same predictive work. Occasionally, they might even do a better job. So variable importance describes the role of a variable in a specific tree when we're talking about CART.
What is the variable importance list?
Every tree in the CART sequence has its own variable importance list. Let's illustrate that now. We will open one of the data sets that we've been using in some of these tutorials. This is GB2000.xls-Zipped; there's nothing special about this data set except that it's convenient to use. Go to the modeling dialog, select TARGET as the dependent variable, let everything else be a predictor, and just go with CART in the default mode. Again, we don't particularly care about the details of how the CART tree is built. Once the tree has been found, we can go to the summary reports and variable importance.
What you can see here is that we have a list of the variables ranked from most important down to least important. In general, if a variable has a literal zero for importance, then it won't be shown on this list. That's why we have this option here that shows zero importance variables. This will probably occur most often when you have a large number of variables, many of which aren't used at all in the tree, and of course they have true zero scores for importance. Here we see a variable down at the bottom that is 0 to 2 decimal places, but it still has a nonzero score, which is why it is shown here. If we click on this option, does anything happen? Yes, it looks like we get a couple of extra variables down at the bottom; those were the true zeros. When we unclick, they disappear, because those two were literally not used at all in this particular tree.
Okay, so far so good, so we know that this variable M1 is most important. We want to understand how CART came to that conclusion. But first let's review the fact that we have a sequence of trees, each of which is bigger than its predecessor, and every one of these trees has its own variable importance list and ranking. So, if we go to this two node tree and we go to variable importance, the list is very small. Why's that? Because with the tree being so small, there has been really no opportunity for most of the variables to appear in the tree at all. As we get to bigger trees, let's say for example the optimal tree as we saw before, the variable importance list contains many more variables. So the first point to keep in mind is that variable importance as reported is for this tree, that is, a tree of a specific size. If you change the size of the tree you may very well change the number of variables that appear, and even their ranking. So usually if we're going to be talking about a tree, we have a specific tree in mind, and without making any further commentary we usually focus on the tree CART has identified as optimal. But this should not deter you from selecting another tree, usually a smaller tree, that is of particular interest to you for some reason.
Determining importance scores
So how do we get to importance scores? Well, we need to start by looking at splitter improvement scores. Recall that every splitter and every surrogate has an associated improvement score, which measures how good a splitter is. Recall that if we double-click on any of the nodes in this tree, we get to see the improvement score for the primary splitter and also for competitors. We see a list of surrogates and also improvement scores for those surrogates.
Those improvement scores are going to be critical for computing what the variable importances are. Why's that? The improvement score is a measure of how much work the variable is doing in that particular node. If it's a splitter, it is a measure of how well it separates the two classes from each other, at least if we're talking about a two class problem. So we get improvement scores here and then again there, and there and there, etc. Now one thing that's interesting is that we see that the M1 variable is a splitter in the root node, and notice that it's also a splitter a little further down. So a variable might appear in a tree more than one time.
And if we're going to create a measure of how much work that variable is doing, we're going to need to locate every node that it splits anywhere in the tree, and not just the first time we encounter it. Now, the improvement score for a splitter in a node is always scaled down by the percent of the data that actually passed through the node. 100% of all the data passed through the root node, so the root node splitter is always scaled by 100%, meaning that whatever quality of split it makes, it gets 100% credit for that. But once we make a split in the root node, then approximately half of the data flows to each child. Suppose one of the children gets only 30% of the data; then whatever improvement score we would compute for the split of that node is going to be multiplied by 0.3, meaning that the variable in that node is only going to get 30% of the raw improvement score that we would compute for that node. Why's this important? Well, the splits lower in the tree have progressively smaller and smaller fractions of the data passing through them. So the splits lower in the tree, even if they're very high quality splits, get multiplied by a relatively small number. This means that low down in the tree it's very difficult for a variable to add much to its overall set of improvement scores.
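The scaling just described can be written out as a small sketch. The function name and the raw improvement value of 0.20 are hypothetical, chosen only to illustrate the arithmetic; the 30% node share is the one from the example above.

```python
def weighted_improvement(raw_improvement, node_fraction):
    """Scale a node's raw improvement by the share of data passing through it."""
    if not 0.0 <= node_fraction <= 1.0:
        raise ValueError("node_fraction must be between 0 and 1")
    return raw_improvement * node_fraction


# Hypothetical raw improvement of 0.20 achieved at three depths of a tree.
root_credit = weighted_improvement(0.20, 1.00)   # root node sees 100% of the data
child_credit = weighted_improvement(0.20, 0.30)  # child node sees 30% of the data
deep_credit = weighted_improvement(0.20, 0.05)   # deep node sees 5% of the data

print(root_credit, child_credit, deep_credit)
```

The same quality of split earns less and less credit the further down the tree it occurs, which is why deep splits rarely move a variable's total very much.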
Construct a variable importance score
So the variable importance computation is now as follows. To construct a variable importance score for a variable, we start by locating every node that variable splits. We add up all the improvement scores generated by that variable in those nodes. Then we also go through every node where this variable acted as a surrogate, and add up all those improvement scores as well. The grand total is the raw importance score. We get a raw importance score for every variable, including of course the possibility that variables that never appear in the tree, either as a splitter or as a surrogate, will get a score of zero. After obtaining the raw importance scores for every variable, we rescale the results so that the best score is always 100. And that's what we see over here when we go to the summary reports. That top variable is always given this relative score of 100, and then everything else is scaled down proportionately. So the RES variable, whatever it got in terms of a raw score, had a total that was about 76% of that of the best variable, which was M1, and so forth. So you're probably thinking: we add up the improvement scores that the variable gets when it's a primary splitter, and we add to that the surrogate scores. Well, what about the competitors' scores? Why don't the competitors also get credit? Well, of course the top competitor gets credit; that's the winner, that's the primary splitter. But the second best splitter in a node gets zero credit for being second best. Nevertheless, if that variable was a surrogate, then it would get some credit.
What's the reasoning for this particular discrimination against non-winning competitors? Well, the creators of CART, Breiman, Friedman, Olshen and Stone, discuss this in their 1984 monograph. They say that they originally experimented with the idea of giving competitors credit. But what they discovered was that when they did that, the credit given to competitors was generally overstated, and there was a process of double and triple and quadruple counting. Why is that? Let's suppose that a variable comes in as the second best splitter in a node, but it doesn't actually get to split that node and a different variable splits it. Then there's an excellent chance that the same variable will be a strong competitor in the child nodes that follow. You can think of this variable as being a strong contender that keeps trying; it might try in the child nodes and fail, and then it will try in the grandchild nodes, where it may or may not succeed. But if we keep giving it credit every time it tries and fails, we could be giving it credit over and over for the same split that it's trying to make. So, realizing this, the CART authors decided that no credit should be given to a variable in its role as competitor, but only as splitter or surrogate. And this gives us a proper measure of variable performance in the trees.
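The bookkeeping just described can be sketched in a few lines. The node records below are invented for illustration (only the variable names M1, RES and LS come from the example), and the improvement values are assumed to be already scaled by the share of data in each node. This is a sketch of the idea, not the SPM implementation.

```python
from collections import defaultdict

# Hypothetical node records: (variable, role in the node, weighted improvement).
NODE_RECORDS = [
    ("M1", "splitter", 0.120),
    ("RES", "surrogate", 0.090),
    ("LS", "competitor", 0.110),  # competitors earn no credit
    ("M1", "splitter", 0.030),
    ("RES", "splitter", 0.024),
    ("LS", "surrogate", 0.015),
]


def variable_importance(records, credited_roles=("splitter", "surrogate")):
    """Sum improvements over credited roles, then rescale so the best is 100."""
    raw = defaultdict(float)
    for variable, role, improvement in records:
        # Uncredited roles add zero but still register the variable.
        raw[variable] += improvement if role in credited_roles else 0.0
    best = max(raw.values(), default=0.0)
    if best == 0.0:
        return {variable: 0.0 for variable in raw}
    return {variable: 100.0 * score / best for variable, score in raw.items()}


scores = variable_importance(NODE_RECORDS)
for variable, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{variable}: {score:.1f}")
```

With these made-up numbers M1 is scaled to 100 and RES lands at about 76% of M1, while LS receives credit only for its surrogate appearance and nothing for the node where it was merely a competitor.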
Examining the results in SPM
Let's go back to SPM now and have a look at some of the details that are available to us as we look at the results. So when you are reviewing the results in the GUI, and you have clicked on the summary reports, and then brought up the variable importance tab, you see here that we can consider only primary splitters. We'll notice what happens when we do that, many of the variables seem to have disappeared from the list, or their scores are much, much smaller, not surprisingly.
What's going on here? Well, in a typical node there is a primary splitter and there may be five surrogates, which means six variables are getting some credit in that node. As soon as I say consider only the primary splitters, only one variable is getting credit. So there is much less opportunity for some of the variables to appear on this list, or to have accumulated much of a total score. Look at this variable C2 over here: if we only look at the actual splitters of the nodes it gets a score of 12.8%, so it contributes about 13% as much as the principal variable M1. But if we take into account surrogates, then it's up at 66%. Well, if we can either take the surrogates into account or not, then why should we do one or the other? This all depends on your particular situation, and the pattern of data that you're likely to be working with in the future. If your data is never going to have any missing values, because you have a data collection methodology that is nearly perfect, then perhaps you don't need to be thinking about surrogates at all. But suppose your data set has a common pattern of missing values, not necessarily predictable as to when or where, or even in which variable, and you know that missing values are going to occur for reasons that are difficult to know in advance. Then the surrogates are playing a very important role, because the surrogate actually does real work, not just potential work. And that is something that you might want to take into account when you think about these models. So it's worthwhile looking at this list both ways.
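To see why a surrogate does real work when values are missing, here is a simplified sketch of how a single node might route a record. The variable names come from the example, but the thresholds are made up, and real CART surrogates can also send records in the reversed direction, which this sketch ignores.

```python
# Hypothetical node: primary splitter M1 with one surrogate, C2.
PRIMARY = ("M1", 5.0)
SURROGATES = [("C2", 2.0)]


def route(record, primary, surrogates, default="left"):
    """Send a record left or right using the first rule whose value is present."""
    for variable, threshold in [primary, *surrogates]:
        value = record.get(variable)
        if value is not None:
            return "left" if value <= threshold else "right"
    # Every candidate variable is missing, so fall back to a default direction.
    return default


print(route({"M1": 3.0, "C2": 10.0}, PRIMARY, SURROGATES))   # primary decides
print(route({"M1": None, "C2": 10.0}, PRIMARY, SURROGATES))  # surrogate steps in
```

If M1 is never missing in your data, the surrogate rule is never consulted, which is the case where counting only primary splitters may be the more relevant view.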
Top predictors in a model
Let's also look at another interesting feature of these variable importance lists. Once you are on this particular display you can decide to highlight, let's say, the top performers, and then say new keep and build. That means use this list rather than the original list of predictors to build a model, and then actually go and build it.
The alternative, new keep list, simply says produce the list and make it ready for future work, but don't actually build the tree. Let's go ahead and click new keep and build. What we've gotten here is a model that consists of nine predictors instead of the original 25. And let's look at its performance: on test data the larger model gets an ROC of 82.83, whereas the smaller list of variables comes up with an 82.54, extremely close.
So we're losing a tiny, tiny bit of ability to rank order the data, but we have saved almost two thirds of the variables. So that may be worth keeping in mind as a reasonable exchange: a tiny bit of accuracy for a lot of simplicity. Also, notice here that we end up with an optimal model of 18 nodes, whereas when we have more variables the tree expands a little more to 23. Again, that is another dimension of simplicity that may be desirable.
There are a couple of other approaches to variable importance. The built-in and default methodology of CART rates the importance of the variables that are actually used in the tree. These ratings are not statements about the essential nature of the variables, and they are not statements about their value apart from this particular tree; everything is relative to that tree. Another approach, which is based on the concept of deletion statistics, works as follows: we make use of something that we call battery leave one variable out, or LOVO, and we use that to rank variables in terms of importance. Here is how it works: the LOVO procedure tests how much our model deteriorates if we remove a given variable. Now sometimes when you remove a variable, not only does the model not deteriorate, it actually improves, so you have to allow for that as well. But the idea is to measure what happens if we leave a variable out, and it's sensible to say that a variable is very important if losing it damages the model substantially. Conversely, and this is where things get interesting, if losing a variable does no harm, then we can conclude that the variable is useless.
Now here's an example that might help this make a little bit more sense. Imagine you have a soccer team, and you've got a player who appears to be very important in the sense that the player is responsible for almost all the goals that the team makes. So we conclude, just by looking at the basic scoring statistics that this player has throughout a particular season, that this particular player is most important. Now that player is injured and can't play anymore for a good portion of the season, and what we notice is that the team keeps winning. But instead of one other player replacing the lost player in terms of scoring, there is a collection of players that together manage to make the goals that are necessary. So what's happened in this particular situation? We had a player that appeared to be most valuable, and therefore we ranked that player as most important. When we lost that variable, that player, other players managed to accomplish the same outcomes, that is, winning games and scoring goals, but they did it in a different way. So you have two different perspectives here. From the first perspective, when the player was available, they appeared to be most important. But because the team continues winning and maintains its same overall ranking, let's say, without that player, then by this second definition we would conclude that the player is not important at all. And that's exactly what's happening here with our variables. We are ranking the variables from two different perspectives: the role that the variable plays when it's in the model, and the damage that happens to the model if that variable is removed. Those two ways of ranking the variables can be very different, and you should not be confused or alarmed by this particular discrepancy; it simply teaches you a lot about your data.
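The leave-one-variable-out idea can be sketched generically. Here `fit_and_score` is an assumed callback that would build a model on the given variables and return its test performance; the toy scorer below is a made-up stand-in in which A and B can substitute for each other (like the soccer players), C is irreplaceable, and D slightly hurts the model.

```python
def lovo_ranking(variables, fit_and_score):
    """Drop each variable in turn and record the change in model performance."""
    baseline = fit_and_score(list(variables))
    results = []
    for dropped in variables:
        kept = [v for v in variables if v != dropped]
        results.append((dropped, fit_and_score(kept) - baseline))
    # Most negative change first: losing that variable hurts the most.
    results.sort(key=lambda item: item[1])
    return baseline, results


def toy_score(kept):
    """Hypothetical stand-in for fitting a tree and scoring it on test data."""
    score = 0.5
    if "A" in kept or "B" in kept:  # A and B do the same predictive work
        score += 0.2
    if "C" in kept:                 # nothing can replace C
        score += 0.1
    if "D" in kept:                 # D adds noise, so dropping it helps
        score -= 0.01
    return score


baseline, ranking = lovo_ranking(["A", "B", "C", "D"], toy_score)
for variable, change in ranking:
    print(f"drop {variable}: change in score {change:+.3f}")
```

In this toy setup removing A alone costs nothing because B takes over, so a variable that does a lot of work inside the model can still rank low on the deletion view.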
Let's see how to run LOVO; you're going to have to have SPM Pro EX in order to do this. What we do is go to the modeling dialog, go to the battery tab, and then look for LOVO and add that. That is all we have to do.
Then we simply hit the start button and wait for the process to complete. Now there is one important thing to keep in mind, and that is that if you have 10 variables in your model then you're going to run 10 test models, each of which drops just one variable. So if you have 10 variables there are going to be 10 of these LOVO runs, and if you have 100 variables there are going to be 100, so keep that in mind. Let's look over here at our particular results, and what you see here is a graph showing the performance of the model as each of these variables is removed.
This graph is based on removing the variables one at a time in the order they were encountered in the data, so there's no particular order here. So what we might want to do is first of all look at the ROC ranking, and then look at the sorted list here. We have a baseline model, and then we have these remaining models, and what we see here is that if you remove M1 then you end up with the lowest ROC, which means that losing M1 hurts the most.
Then the variable whose removal hurts second most is LS, then RES and so forth. And so here is the sorted list over here, and this is the ranking that comes from the LOVO procedure. It turns out in this case that the ranking of the variables via LOVO is pretty similar to the one that we got from this particular run over here. Let's go and look at that and find variable importance. And what you can see is that M1 is first, LS is second, RES is third, so we agree there. Next, LOVO reverses the order of 0C and C2, but it does agree that they're both next in line, and then CR. There is a little bit of disagreement about BU, in that it gets a slightly higher ranking when you do it the LOVO way. Again, there is no way of concluding that one of these methods is better than the other. It's nice in this particular example that the results are so consistent that we can be very confident in the rankings we've obtained.
If you have gotten this far and you feel you have a good understanding of what variable importance is in CART, then great. This may be a good time to stop, play with the software, and get a better feel for how these different measures compare and play out, and not worry about the advanced topic that we're going to cover next. However, if you still have patience for yet one more idea, then let's go ahead and cover that. It is the randomization test, and it was introduced by Leo Breiman. Leo introduced this idea when he began his work on tree ensembles, in particular Random Forests. But the idea is a general one and could just as easily be applied to CART.
So here's how it works. I'm explaining what we do behind the scenes; you're not going to have to do any of this, just request that the randomization test be done. We start with the test data and we score this data with the preferred model to obtain a baseline performance. So the model's already been run, we have a test partition and we score it in the natural way. Then we take the first predictor in the test data and we randomly shuffle its values within the column that it sits in. What's interesting about this particular approach to randomization is that the values are unchanged: the exact same numbers that existed in the data set before we shuffled it continue to exist there afterwards. But what has happened is that the values of that variable have been relocated to rows they don't belong on. So by doing that, naturally, we have messed up this data, we have damaged the data. And by putting the wrong data on a particular line in this table, we ought to have damaged the performance of the model. So now, having shuffled the data in this way, we score again. And given that we have essentially messed up the data, we would expect the performance to drop, because one of the predictors has been damaged. Now, just doing this once is not anything that we can draw conclusions from.
But suppose we do it 100 times and then we average the results. This gives a pretty good idea of what would happen if somehow the data that was available in this test sample was not reliable. We do this for every variable, and what we get is a performance degradation measure for every variable. The more a shuffled variable damages the performance of the model, the more important that variable is.
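The shuffle-and-rescore loop described above can be sketched as follows. This is not the SPM implementation: the function names, the toy model and the toy data are all hypothetical, and plain accuracy is used as the performance measure instead of ROC just to keep the sketch short.

```python
import random


def accuracy(labels, predictions):
    """Share of rows where the prediction matches the label."""
    return sum(a == b for a, b in zip(labels, predictions)) / len(labels)


def shuffle_importance(rows, labels, predict, variables, repeats=100, seed=0):
    """Average drop in accuracy after shuffling each variable's column."""
    rng = random.Random(seed)
    baseline = accuracy(labels, [predict(r) for r in rows])
    drops = {}
    for var in variables:
        total_drop = 0.0
        for _ in range(repeats):
            # Same values as before, but relocated to rows they don't belong on.
            column = [r[var] for r in rows]
            rng.shuffle(column)
            shuffled_rows = [{**r, var: v} for r, v in zip(rows, column)]
            score = accuracy(labels, [predict(r) for r in shuffled_rows])
            total_drop += baseline - score
        drops[var] = total_drop / repeats
    return baseline, drops


# Toy test data: the label depends on x1 only, and the model ignores x2.
ROWS = [{"x1": i / 20, "x2": ((i * 7) % 20) / 20} for i in range(20)]
LABELS = [1 if r["x1"] > 0.5 else 0 for r in ROWS]


def toy_model(row):
    """Hypothetical stand-in for a saved tree."""
    return 1 if row["x1"] > 0.5 else 0


baseline, drops = shuffle_importance(ROWS, LABELS, toy_model, ["x1", "x2"])
print(baseline, drops)
```

Shuffling x2 cannot change any prediction of this toy model, so its degradation is zero, while shuffling x1 damages the score; ranking the variables by these average drops gives the randomization-based importance list.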
And so here is an example of what the output would look like using one of the recent versions of SPM, which may not be available to you as of the date that you are listening to this particular video. But this particular methodology is available in the non-GUI version of the software today, and you should be able to gain access to it early in 2012. So let's look at the results here. The baseline ROC was .85320, and after we shuffled M1 randomly that .85 fell to .821. That is not a terrible result, but what we have accomplished here by doing this with every variable is to see how big the damage is. And it turns out that we lose the most when we shuffle M1, we lose second most when we shuffle RES, and we lose the third most when we randomly shuffle LS. Now this should look familiar, because these are the same three top variables that we got from the other two methods, with M1 again in first place. If you are using a version of the software for which this is available, version 6.8, then at this point this particular procedure is available only in the non-GUI version. Once you have a tree saved and you're looking at a data set that you want to score, you ask for the variable importance scoring procedure, so varying equals yes. You indicate the number of times that you want each variable to be shuffled randomly, and then reports like this one over here will be generated in the classic output.
Okay, this concludes our brief review of variable importance in CART. We hope you've enjoyed the session, and we look forward to having you join us on another one.