Using Surrogates to Improve Datasets with Missing Values
Learn: What is a surrogate? What makes a good surrogate? Why are surrogates important?
Greetings and welcome to another in the series of Salford Systems online training videos. This is Dan Steinberg and today's session is focused on the surrogates that we find in the CART decision tree. We will be covering a number of topics including: what is a surrogate? What makes a good surrogate? Why are surrogates important? We will also discuss, but in a subsequent video, surrogates in deployment and surrogates and variable importance. We're expecting that you've previously seen some of the other videos concerning the basics of CART and also CART splitters.
What are surrogates?
Okay, so what exactly are surrogate splits? Well, the first thing to be aware of is that surrogate splits are a novelty, they're an innovation, they're a technological advance, and they were first described in the classic CART Monograph (Classification and Regression Trees), which appeared in 1984. Prior to this, the concept of surrogates, to the best of our knowledge, never existed in any form in the scientific literature. A surrogate is a mimic or a substitute for the primary splitter of a node. We've already seen how a primary splitter is made, and we can't start the process of even thinking about surrogates until we have a primary splitter in hand. Once we have that splitter in hand, then we can go looking for surrogates. And what is a surrogate? The ideal surrogate splits the data in exactly the same way as the primary split. In other words, we are looking for clones, close approximations, something else in the data that can do the same work that the primary splitter accomplished.
Why are surrogates important?
Why do we want surrogates? Surrogates have two primary functions. The first is to split the data when the primary splitter is missing. Now, the primary splitter may never have been missing in the training data. However, when it comes time to make predictions on future data, we have no idea whether that particular splitter will always be available. When it is missing, then the surrogates will be able to take over and take on the work that the primary splitter accomplished during the initial building of the tree. In addition, surrogates reveal common patterns among the predictors in the data set. We'll say more about that in the next few minutes.
Now, CART searches for surrogate splitters in every node in the tree, as we said. First, the node is split using the usual methodology in order to find the primary splitter. Then, after the primary splitter is in hand, surrogates are searched for, even when there is no missing data, and this puts CART in a position to deal with missing values in the future. There is no guarantee that useful surrogates can be found. Surrogates have to meet certain conditions, certain mathematical performance standards. If we can't find a surrogate, then we're going to obtain a report which indicates that no surrogate was available for that particular node. In general, CART will find surrogates, and CART attempts to find at least five surrogates for every node. But first of all, there's no guarantee that you'll find any, let alone five, and also you can make changes in the control setup in order to request a larger or smaller number.
Example: Surrogate Splits in CART
Let's go back to one of our examples from a previous session. Let's open the EuroTelcoMini.xls. Go to the modeling window and set up the model as before. So, we have response as our dependent variable.
We want all variables in the model, except for the record ID, and we want to be sure that the two categorical variables, city and marital status, have been indicated as categorical. We don't need to indicate that response is categorical, because CART is assuming that, but it doesn't hurt to click that one as well. Other than that, we're going to go with the defaults as before. Hit the start button. Along comes our CART tree. We've actually grown 11 trees in total, in order to do the cross validation. Let's just run back to a modest size tree. Here's the tree that we looked at before, and let's pay attention to this right child of the root node. So that particular node has a primary splitter, which is the telephone bill at $50 a month, and if you recall from the previous session, in which we talked about this particular tree, we're trying to predict whether a household will accept the offer from the landline telephone company of a new mobile phone offer.
We have 15.2% of the population that said, 'Yes' overall. We go to a group that has been shown a relatively high price for the handset. But then we go to a division of that particular node into two groups: those that have telephone bills larger than $50 and those that have telephone bills less than $50.
We see there's quite a bit of difference in response, and so the 16% that we see here, in this node, even though it is not very much bigger than the 15%, is quite large relative to the 4.2% on the other side of the split. Okay, so this was just to remind ourselves of what this particular tree is all about. Let's go to this node over here and double-click on it. We can double-click on any, or single click on any node in the navigator and we will see this particular report. We looked at the left-hand side of this report in the previous session, in which we talked about competitors. Here we're looking at the right-hand side and we're talking about the surrogates. So, what do we see?
The primary splitter is a telephone bill being less than or more than $50 a month. So, what can we think about this particular splitter? Obviously, what we're looking at is the level of spend that a household has on a particular service — in this case, telecommunications. What if we don't know what the telephone bill is for a particular household and we want to make a prediction as to whether they would accept the offer of a new mobile phone or not? In order to move that individual down the tree, we need to know whether to let them go to the left— which was to a non-response node, or to the right— which was a modest response node, and that makes quite a big difference regarding how we're going to treat this individual. And here we are, not knowing what their telephone bill is.
Why might we not know? Well, perhaps it's a brand new customer and we don't have any experience with them. Perhaps it's an error in the database. Perhaps the person moved and because of the move, there was a disconnect between previous records and the current record. So what does CART tell us about how to handle this particular problem? What are we going to do about this particular record? So, CART says, well, there are actually five surrogates discovered and they're ranked in order from most useful to least useful. And the top surrogate is marital status and the split is at one. Don't forget, marital was categorical, so we have a standard split in which we say, if marital status equals one, which means unmarried, meaning never married, then we go to the left, and otherwise we go to the right. So this particular split is intending to mimic whether a person spends less than $50 or more than $50. Could marital status reasonably be a substitute for knowledge about the telephone bill? Well, you can make a case for that. Individuals who have never been married, especially when we're talking about a couple of decades ago in Europe, probably have less income. They are connected to only one family, which is presumably that of their birth parents, and they have a modest use of the telephone service. People who are married, on the other hand, make use of telephone services much more, perhaps in part because a married couple needs to stay in touch with at least two families. That may still be true for families that are divorced. In any case, whatever the reasons, marital status has been selected as the best possible stand-in after CART has examined all the other variables that were available to us. So this is a distinctive variable, although it's not the only one. If marital status were also missing, we would move on down the list. Actually, let's skip down here to row three.
The travel time that the person must undertake in order to get to work is also given a standard split, meaning low travel times to work are associated with lower expenditure on telephone bills, and higher travel times with more telephone expenditure. Perhaps, again, people who live close can spend more time talking to their friends in person. People who live far away are going to have to communicate more by phone than in person, and so forth, as we go down this particular list.
Let's go back to our slides here. We have a commentary here as to what is going on, and you're free to read this slide, which repeats what we've just been talking about. Now, surrogates have an important characteristic that is associated with each one of them, and that is a direction. A surrogate is intended to be a good substitute for the primary splitter in making similar left-right decisions for data. But some surrogates may work in the opposite direction from the primary splitter. This is similar to having a negative correlation with a variable. A negative correlation is useful; we just have to keep in mind that the direction of one variable is opposite to the direction of the other. So when the surrogate has a direction, a sense, which is the same as the primary, then we put a letter S after it, which represents 'standard', and when the surrogate splits in the opposite direction, meaning it reverses the right and the left directions, then the letter R occurs after the split. Categorical splitters are always organized in a way so that the cases that go to the left are selected, that is, the levels of the categorical variable are selected to match that of the primary splitter. The beauty of surrogates is that they mostly make sense.
Think about the example we just went through. Our primary splitter is the average monthly spend of a household on a fixed line telephone account. The surrogates include marital status, commute time to work, age, and the city of residence. Now, we don't know how the cities work, because that information has been suppressed from the data – but the rest of the story is very plausible. In general, surrogates help us understand what the primary splitter is all about. In this particular example, if you think about a high telephone bill being associated with a greater propensity to accept the new telephone technology, we have two possible ways of interpreting this. One of them is that people who spend more money on telephone bills spend more money on everything related to telephones, and therefore, when a new product comes along, they'll spend more money there also. It's more of an income explanation. But alternatively, we might have individuals who don't necessarily have such high incomes, but are willing to sacrifice other goods and other expenses in order to permit themselves to do more with telephony, and that seems to be the case in this particular example because the surrogates that we're finding have more to do with the need to communicate and opportunities to communicate than they do with things that relate to income.
How do we compute surrogates?
Well, fortunately we don't have to. CART looks after all the details for us. This is actually a technical question which involves quite a number of fine details, and so it's a question that we're not going to cover here. If you're interested in pushing through the mathematics, the CART Monograph does contain a wealth of technical information, although it is a challenging read. However, we do want to talk about the main ideas. First of all, the top surrogate, and in fact all surrogates, are always constructed from one variable. Now, this may not sound like news to you; however, a primary splitter, if you requested it, could involve more than one variable. We can have linear combinations of variables. But a surrogate, by definition, is always made up of a single variable. Furthermore, this variable is used to create a two-way split in the data, so the surrogate could be a categorical variable or it could be a continuous variable, but we will be using that variable to split the data into two parts, and the purpose of the surrogate is to mimic, as closely as possible, how the primary splitter divided the data. So in other words, our goal here is not directly to try to predict the dependent variable. Our goal here is to try to predict whether a record should go left or should go right. This is based on the historical data in the database, in which we look at which records did go left and which records did go right, and we try to find another rule in the data that captures that information as well as possible.
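To make the main idea concrete, here is a minimal Python sketch of searching one continuous variable for the two-way split that best mimics the primary splitter's left/right decisions. It is a simplification for illustration, not CART's actual computation (which involves the finer details mentioned above); the function name, the toy `bill` and `commute` values, and the plain agreement count used as the score are all assumptions.

```python
def best_surrogate_threshold(values, went_left):
    """Find the threshold on one variable that best mimics a primary split.

    values    : candidate surrogate variable, one number per record
    went_left : True where the primary splitter sent the record left
    Returns (threshold, reverse, agreements), where reverse=False means
    "value <= threshold goes left" (a standard, S-type split) and
    reverse=True means "value <= threshold goes right" (an R-type split).
    """
    n = len(values)
    candidates = sorted(set(values))
    best = None
    # Try the midpoint between each pair of consecutive distinct values.
    for lo, hi in zip(candidates, candidates[1:]):
        thr = (lo + hi) / 2
        agree_std = sum((v <= thr) == left for v, left in zip(values, went_left))
        # Reversing the direction turns every agreement into a disagreement.
        for reverse, agree in ((False, agree_std), (True, n - agree_std)):
            if best is None or agree > best[2]:
                best = (thr, reverse, agree)
    return best


# Hypothetical toy data: the primary split is "bill <= 50 goes left".
bill = [20, 35, 45, 48, 55, 60, 80, 95]
went_left = [b <= 50 for b in bill]
commute = [5, 10, 12, 30, 25, 40, 45, 50]

result = best_surrogate_threshold(commute, went_left)
print(result)
```

Note that the target of the search is the left/right decision of the primary splitter, not the dependent variable, which is the point made above.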
What is association? Association is a measure of the strength of the surrogate, and you will see an association reported for every surrogate that CART produces. The lowest possible recorded score is zero, and the highest possible score is one. One corresponds to a perfect clone, not of the whole variable, but of its ability to split left and right, meaning that on the training data, the surrogate will send exactly the same cases to the left and to the right as did the primary splitter. It doesn't mean the two variables are perfectly correlated, but normally they will be very highly correlated. But they are perfect with respect to this one decision. In order to come up with a measure of how good a surrogate is, CART goes through these stages: first, CART constructs the default rule. Suppose we have no information. We're sitting in a node, somewhere inside a CART tree, and that node gives rise to two children. We need to know whether a record that has arrived at that node is going to go to the left or going to go to the right, and we have no information to guide us. What should we do? CART's decision, and I'm simplifying a little bit here, is that if you don't know which way to go, then look at the historical record of the training data and follow the majority. If more than 50% of the cases went to the left, then, if you don't know what to do, go to the left, also. Otherwise, if the majority went to the right, then, whenever we encounter a case in which we are lacking information, we go to the right. Now, clearly this default rule is very crude and we can measure how often we will make a mistake. Because, if 60% of the cases go to the left, and then when we have a true problem case we always send these records to the left, then we're likely to be right 60% of the time and we're likely to be wrong 40% of the time. So, it's not a particularly good surrogate.
It would be a much better surrogate, of course, if 90% of the data went to the left and 10% went to the right. Then, absent any other information, we'll probably be right if we assign a record to the left when we have no further information. So you can think of the default rule as the surrogate of last resort. If we really have no other way of making a decision, we're going to use the majority rule, and further, if there is an exact 50% going to the left and to the right, then CART will resolve that ambiguity by moving to the left. A surrogate is a variable that by definition does better than the default rule. If nothing can do better than the default rule, then we're going to have no surrogates, and we measure the degree to which the surrogate does better than the default rule by the percent reduction in the number of mistakes that are made. So, if we had a split that was 60/40, and therefore we send everything to the left, we're going to make mistakes 40% of the time. That's the 40% of the time when we should've gone right, but of course we didn't know how to make that distinction. If a surrogate allows us to reduce that error rate from 40% to 30%, then we're going to see that as a 25% reduction in the error rate, and, therefore, we're going to get an association measure of 25%. So that's what we're seeing over here in these slides. If we look at the variable C2, which is the top surrogate, this one is improving over the default rule by 37%. CP2, the next variable, reduces that error by 30%. Both of those are, therefore, pretty good surrogates. Then we get a couple of 17%s and then finally a 12%. What should we think of as a boundary that would define a good surrogate? In our experience, we certainly feel that anything that reaches 20% is looking like a good surrogate. Numbers on the order of 1-2-3-4-5% are clearly pretty weak and they're not adding very much to the process of handling missing values.
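The arithmetic of the default rule and the association score can be sketched in a few lines of Python. This follows the simplified description above (plain counts, with no weighting by priors), and the function names are made up for illustration.

```python
def default_direction(n_left, n_right):
    """Majority rule: go with the larger child; exact ties go to the left."""
    return "left" if n_left >= n_right else "right"


def association(n_left, n_right, surrogate_errors):
    """Relative reduction in mistakes compared with the default rule.

    n_left, n_right  : training cases the primary splitter sent each way
    surrogate_errors : cases the surrogate sends the opposite way
    """
    default_errors = min(n_left, n_right)  # mistakes made by following the majority
    if default_errors == 0:
        return 0.0  # the default rule is already perfect, nothing to improve
    return (default_errors - surrogate_errors) / default_errors


# The 60/40 example: the default rule errs on 40 of 100 cases,
# and a surrogate that errs on only 30 cuts the errors by a quarter.
print(default_direction(60, 40))   # left
print(association(60, 40, 30))     # 0.25
```

A score of zero or below under this formula means the candidate does no better than the default rule, so by the definition above it would not qualify as a surrogate at all.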
We actually have some control over the number of surrogates that we will see in the report and also the number of competitors. To set those controls, just go to the Model Setup dialog and then visit the Best Tree tab. There you'll see a section called Surrogates and Competitors. For Number of surrogates to use for constructing tree, I've set that now to 10. The Number of competitors to track, I've also set that to 10. Remember, we actually compute as many competitors as we have variables that we're using, or that we enter into the keep list as we call it, or that we allow CART to use. We do track them all, but we don't report all the details for the simple reason that, mostly, we are not going to be that interested in the performance of all but the top few. However, you have control over that. When it comes to surrogates, it's slightly different in that even though we have to test every variable in order to find out if it's a potential surrogate, we do discard that information, and don't use it in the future scoring use that we might make of that tree. The default is to limit the number of surrogates that we keep track of and make use of to five, and I've set that to ten here as well.
Let's go now and look at the live software and see what happened there. Here I've already executed this particular run and we just saw the screenshot of it, but you can see over here there are only ten variables altogether being made use of. So the primary variable is up here, that's the splitter, the telephone bill, and there are another nine competitors. So all of them are listed here, regardless of how poor their performance might be. You can see over here in this particular node, reading this number as 913, which is the score for the best splitter. What we see over here is that the lowest scoring competitor has a score of 30, so that's dramatically less than the winner, and this thing does fall off reasonably rapidly. When it comes to the surrogates, if you look at this pane over here, you can see that we actually were successful in finding eight surrogates. Now, you have the primary splitter, so that's one of ten variables. That means there are nine left. There could have been nine surrogates. In fact, we were only able to find eight, and you can see the Association score at the bottom is in fact going to be slightly greater than zero, but we're only showing zero over here. So, in this particular node, which is actually on the other side of the tree, we're seeing a rather different pattern when it comes to the surrogates for this particular split, and it's quite interesting that we are looking at this side of the tree rather than the other side which we examined before. Whereas on the right-hand side of the tree it turned out that marital status was a very useful surrogate, on the other side of the tree, and for the same splitter, it's different variables that turn out to be the key. What we also see is that there are more variables being reported, as we would have expected.
Look at the Improvement scores, and what you notice is that although the primary splitter has a score of 913, the best surrogate has an improvement score of only 178, so the best surrogate actually performs quite a bit worse than the variable that it's standing in for. Even so, the Association score of .25 says that using the surrogate reduces the error in comparison to using the default rule of sending all records to the larger child node. That reduction in error is fairly good, but the overall performance of this variable is only so-so.
What are the controls?
For those of you that would prefer to control this by the command line, or who like to keep scripts which you constantly reuse, here is the syntax for controlling the competitors and surrogates. The main command is the BOPTIONS command, which controls a large number of options, COMPETITORS equals, and so in the example I just ran, I set that from the default of 5 to 10. CPRINT says, also, report on 10 competitors, in the classic output. SURROGATES equals 10, which says compute and keep track of up to 10 surrogates, and the PRINT equals 10, says, and also, print that many. These two numbers do not have to be the same. The print number is related just to reporting — it's the other number that refers to the actual computational side effects.
A very brief summary of this rather extensive discussion: the principal use of surrogates is as the stand-in for the main splitter when a record with missing data is encountered. If we have more than one surrogate, they are rank ordered and they are processed, in order, as follows: if the primary splitter is missing, we try to make use of the number one surrogate. If the number one surrogate is missing, then we try to use the number two surrogate, and so forth. The secondary use of the surrogate is interpretation of the primary splitter, as we discussed before. A variable may appear as both a competitor and a surrogate for a given node, and this sometimes causes some confusion. We have to keep in mind that the role the variable is playing is different in the two different sides of the report. When the variable is acting as a competitor, then it pays no attention to whatever the primary splitter did. But when we are trying to use the variable as a surrogate, it is not free to do whatever it wants; we need to try to make it as similar as possible to the primary splitter. Usually, if a variable appears as both a competitor and a surrogate, the split point for the competitor will be different from that for the surrogate. Surrogate quality is measured by the Association score, and the default rule, go with the majority, is a surrogate of last resort, meaning that if we have no surrogates because none were found, or if we have no usable surrogates because for this particular record all the surrogates that we found happen to be missing, then we go with the majority, where majority is a weighted majority adjusted by priors, a concept we'll discuss elsewhere.
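The processing order summarized above can be sketched in Python as follows. This is an illustrative sketch, not CART's scoring code: the `Split` class, the variable names `bill`, `commute` and `age`, the thresholds, and the convention that a value at or below the threshold goes left are all assumptions, and only numeric splitters are handled.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Split:
    variable: str
    threshold: float
    reverse: bool = False  # True for an "R" surrogate, False for "S" (standard)

    def goes_left(self, record: Dict[str, Optional[float]]) -> Optional[bool]:
        """Return True/False for left/right, or None if the variable is missing."""
        value = record.get(self.variable)
        if value is None:
            return None
        left = value <= self.threshold
        return (not left) if self.reverse else left


def route(record, primary: Split, surrogates: List[Split], majority_left: bool) -> str:
    """Try the primary splitter, then each surrogate in rank order,
    and fall back to the majority direction as the last resort."""
    for split in [primary, *surrogates]:
        decision = split.goes_left(record)
        if decision is not None:
            return "left" if decision else "right"
    return "left" if majority_left else "right"


# Hypothetical node: primary on the telephone bill, two ranked surrogates.
primary = Split("bill", 50.0)
surrogates = [Split("commute", 18.5), Split("age", 40.0, reverse=True)]

print(route({"bill": 30.0}, primary, surrogates, majority_left=True))
print(route({"bill": None, "commute": 25.0}, primary, surrogates, majority_left=True))
print(route({"age": 55.0}, primary, surrogates, majority_left=True))
print(route({}, primary, surrogates, majority_left=True))
```

The four calls exercise the four cases in turn: primary available, first surrogate used, second (reversed) surrogate used, and the majority fallback when everything is missing.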
At this point, if you feel like you understand enough about competitors and surrogates, great. If you'd like to hear a little bit more about this topic, then we have a few more observations to make. So, let's grow another tree on this GB2000.xls-Zipped data set. We want to work with this data set because it has no missing values, which makes working through the examples much easier. Don't forget that CART always computes surrogates for the CART tree whether there are missing values or not. What CART is doing is, among other things, preparing for the future, where future data may contain records with missing values. We're not going to try to make sense of this tree, and we'll look just at the mechanics.
Let's first look at the root node splitter and the top surrogate. So, here we are. We see that we have a root node splitter on the variable M1, and the split point is at .04645, and the top surrogate is listed as C2, which is at -.10835. We analyzed this data set once before; the dependent variable was TARGET. We left all the other variables available as predictors, and we just started a CART tree using the defaults, which meant that we also used tenfold cross validation. This is what the tree looks like when you're done, and we simply pruned back to the root node, and then went to tree details in order to show the contents of the slide before. Okay, so what further insights can we get by digging deeper into this? Let's look at how the dependent variable is distributed between the left and the right child when we split with the main splitter, and when we split with a surrogate. So, you can see we have a total of 2000 records altogether. When we use the main splitter, which is the M1 split, we get 672 class ones and 252 class twos on the left, and then we get a somewhat similar split with the preponderance of class twos on the right, and the two sides are not exactly equal, but not that different from equal, either. When we split with a surrogate, we get a couple more records on the left and a couple fewer records on the right, and we are a little bit richer in class twos on the left, and we're a bit poorer in class twos on the right. What that means, of course, is that the separation that we get among the class twos, between the two nodes, is less with the surrogate than it is with the main splitter. So, not surprisingly, the surrogate isn't as good as the primary splitter. It never could be unless it was an exact clone of the splitter, but no surprises here. The best is the best and the surrogate is typically not quite as good as the best.
Now, we're going to go ahead and try a little trickery here. What we're going to do is to create a new variable called ROOTSPLIT. So, how are we going to get this ROOTSPLIT variable? Well, let's go have a look here at the work we've already done. I'll go ahead and make this larger so that we can see it better. Go to edit fonts and let's make this a 16 font display here, and so now it's a little easier to see what we did. Let's move this a bit more to the center and resize this.
This is a CART notepad, and what I did is, I used the CART built-in BASIC programming language in order to create two variables. Notice here, LET ROOTSPLIT, that's a variable name that I constructed, and how is it defined? It's defined according to the actual split value that CART found, that is, if M1 is less than this value, then ROOTSPLIT takes on the value one and otherwise the value is zero. And I created another variable called SURROGATE, and that is defined exactly as CART defined the surrogate. So, now I have two variables, something called ROOTSPLIT and then something called SURROGATE, and if I want to, I can run a model of one on the other. So, here is my variable ROOTSPLIT, here is my variable SURROGATE, and I can now run a CART tree there. Let's hit continue over here. We may have missed one step here in order to execute these commands. What we need to do is to submit them to the command processor, so we need to go to FILE and SUBMIT WINDOW. I've already done this, so I won't do it again. As soon as you submit these commands, then these two variables, which don't exist on the original data set, automatically get added to the data set, and that's why they became available to us on the Model Setup dialog. So, what I did here was get ready to run a CART model with ROOTSPLIT as the dependent variable and SURROGATE as the splitter, but the other thing that I need to do is to take the original target, indicate that it's an auxiliary variable, and also indicate that it's a categorical variable, and when I do that, I get a new tree.
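For readers who want to reproduce the idea outside CART, here is a rough Python equivalent of constructing the two indicator variables and then tabulating one against the other, and against the original target. The split values .04645 and -.10835 are the ones reported for M1 and C2; the six rows of data are made up, and the direction of the SURROGATE indicator (values below the split point coded as one) is an assumption, since only the ROOTSPLIT definition is spelled out above.

```python
from collections import Counter

M1_SPLIT = 0.04645    # root node split point reported by CART
C2_SPLIT = -0.10835   # top surrogate split point reported by CART


def add_split_indicators(rows):
    """Add ROOTSPLIT and SURROGATE 0/1 variables to each record (a dict)."""
    for row in rows:
        row["ROOTSPLIT"] = 1 if row["M1"] < M1_SPLIT else 0
        # Assumed direction: low C2 values are coded 1, like the primary.
        row["SURROGATE"] = 1 if row["C2"] < C2_SPLIT else 0
    return rows


def crosstab(rows, a, b):
    """Count records for every observed (a, b) combination of values."""
    return Counter((row[a], row[b]) for row in rows)


# Hypothetical records, for illustration only.
rows = add_split_indicators([
    {"M1": -0.50, "C2": -0.90, "TARGET": 1},
    {"M1": -0.10, "C2": -0.20, "TARGET": 1},
    {"M1": 0.01, "C2": 0.30, "TARGET": 2},
    {"M1": 0.20, "C2": -0.40, "TARGET": 1},
    {"M1": 0.60, "C2": 0.10, "TARGET": 2},
    {"M1": 0.90, "C2": 0.70, "TARGET": 2},
])

agreement = crosstab(rows, "ROOTSPLIT", "SURROGATE")
by_target = crosstab(rows, "SURROGATE", "TARGET")
print(agreement)
print(by_target)
```

The second table plays the role of the auxiliary variable report: it shows how the original dependent variable is distributed across the two sides of the surrogate split.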
It's not surprising that the tree only has one split. After all, we have a 0/1 dependent variable and we have a single 0/1 variable. Once we use that variable, then we're done; there's no more work that variable can do. And if we go to the tree details, we can see here how the tree has been split, but this isn't that interesting to us, because this is giving us information about the primary splitter. Notice the 926 and 1074, that's an exact match for what we previously saw in the regular CART output. What we need to look at is not this, but what the original dependent variable is doing when we make the split. And that's the auxiliary variable report. It's a little bit big, so we want to compress it. And you can see here the 626 and 300, that's what we reported in our little table. That's for the left-hand child, and then, if we go back here and do the same, auxiliary variables. Again, make this thing a little smaller and we see that 374 and 700, and that is the other half of that display in the PowerPoint slide that we showed.
So, what you can see here is that if we make the split using the information that CART has given us, and then we also take the trouble of tracking the dependent variable, we'll see what that surrogate split actually accomplishes. This sounds a little bit complicated, and in fact it is, but if you spend a little time studying the notes that we have here and reviewing this particular video, we think it'll eventually make sense to most of you.
Okay, so one last thing, and we did say that this section over here that we're covering now is a little advanced. Finally, let's go to the modeling dialog again and, instead of using the surrogate that CART gave us, let CART choose whatever split point it wants using C2, with the dependent variable being ROOTSPLIT, again carrying along the target variable. So, there we go with a potentially larger tree, for the simple reason that now we have a continuous variable that can split more than once, but the best result occurs after just one split. Let's have a look and see what happens here. If we double-click on this, notice that the split is at -.372. The reason that's news is because the surrogate that CART reported to us was on the variable C2, but at a different split point. So what's going on here? When we try to find the best split of ROOTSPLIT using the surrogate variable, we don't get the same split point that CART found. Well, the reason for that is that CART doesn't run this particular secondary model in order to discover where the split point is. What CART also does is take into account what is going on with the dependent variable, and that is key here. So, let's look at the dependent variable distribution on the left-hand child, and again, do the same on the right. And we want to bring those two things side-by-side. Notice what the counts are here: 288, 598, 712, 402, and we've got that in a PowerPoint slide. Notice that my table now contains three panels: the primary splitter information, the surrogate information, and then this artificial model which we thought might be a good way to discover what the surrogate was, and notice what we see. If you look at the breakdown, you can see that the first two panels, the main splitter and the surrogate, actually have counts that are not that different from each other. They look reasonably similar.
But if you look at the model that came back from this artificial CART run, you can see that the differences between the main splitter and this handmade attempt to find the surrogate are actually larger. So, on a balanced average basis, the handmade surrogate does rather poorly in comparison. We see a 443 when we're looking to get 462, and we see a 557 when we're looking to get 538. So, the handmade model doesn't work as well, and the reason for that is that it is not able to take into account the distribution of the dependent variable, the real dependent variable, and that's what's going on in the computation of the surrogates behind the scenes. In any case, this concludes our discussion of surrogates and we look forward to seeing you next time.
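As a supplementary note on the three-panel comparison, the quoted figures can be checked with a short Python sketch. Several things here are assumptions rather than statements from the session: the right-child counts for the primary splitter are derived from class totals of 1000 each (implied by the surrogate counts), the assignment of 288, 598, 712 and 402 to particular classes and children is inferred, and the "balanced average" is interpreted as the mean of the two class counts in a child, an interpretation that reproduces the 462/538 and 443/557 figures.

```python
# (class one count, class two count) in each child node
panels = {
    "primary (M1)":   {"left": (672, 252), "right": (328, 748)},  # right derived
    "surrogate (C2)": {"left": (626, 300), "right": (374, 700)},
    "handmade (C2)":  {"left": (598, 288), "right": (402, 712)},  # order assumed
}


def balanced_average(child_counts):
    """Mean of the class counts in one child node."""
    return sum(child_counts) / len(child_counts)


averages = {
    name: (balanced_average(p["left"]), balanced_average(p["right"]))
    for name, p in panels.items()
}

target_left = averages["primary (M1)"][0]
for name, (left, right) in averages.items():
    gap = abs(left - target_left)
    print(f"{name:15s} left={left:.0f} right={right:.0f} gap from primary={gap:.0f}")
```

On these numbers the surrogate reported by CART lands within one record of the primary splitter's balanced average, while the handmade split is off by nineteen, which is the gap described above.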