Using Battery PRIORS in CART®
Understand the value of PRIORS EQUAL and PRIORS DATA in common classification problems in CART.
Welcome to another in the series of Salford Systems' online training videos, this is Dan Steinberg. Please visit our website in order to download the appropriate software and any supporting data sets. Today, we're going to be talking about CART and the PRIORS parameter. If you are a casual user of CART, you probably can get by without knowing anything about PRIORS. The default settings of CART handle PRIORS in a way that is well–suited for almost all classification problems. The casual user will probably not want to review or understand the more technical output, which is printed to the plain text or classic output window- but there are some effective uses of CART that require judicious manipulation of the PRIOR settings, and therefore, a basic understanding of priors may be helpful and worth the effort. So, we would rate this particular session as intermediate, at least, and going perhaps a little bit beyond that when we get to the final slides.
The CART Monograph
The classic reference for what we're talking about is the original CART Monograph, published in 1984, and it remains one of the great classics of machine learning. It's called Classification and Regression Trees by Breiman, Friedman, Olshen, and Stone. It's available in paperback, as an e–book from Amazon, I happen to have any copies on my iPad, and also on my MacBook. It makes it very handy and we certainly recommend that you obtain a copy. It is not the easiest reading if you treat is as a whole, but there are parts of the book which are very clear and accessible even to beginners and is really quite fascinating — especially the discussions regarding the decisions the authors made in crafting CART in the first place. They also have some very interesting comments on their whole approach to analyzing data, very worthwhile. Contains extensive discussions of PRIORS, as well as major concepts relevant to CART. Here is a picture of the cover of the book, along with the Library of Congress information. The slides of course, are available to you so you can study them in detail, if you'd like to have a look and investigate the possibility of getting your own copy, at least for reference purposes.
So, some observations for the casual user. We are now thinking about a binary 0/1 classification problem. These concepts apply to multiclass problems as well, when you have an ABC, or even larger number of classes, but let's keep things simple and just think about the most basic and most common classification problem, 0/1. So CART has two ways of evaluating a CART generated segment, meaning a node somewhere in the tree, including possibly the root node. So we can assign the segments to the majority class, that's a very straightforward concept. We're looking at 'yeses' and 'no's. Which wins? Well, the majority wins. That's one approach. The other approach is to consider things relative to a baseline and there we introduce the concept of lift. People sometimes think about relative risk instead. So we have a baseline event rate that's a fraction of the ones, usually we use one to refer to the rare event that's in the data and then we look at the ratio of the event rate in the node in question to the event rate in the whole sample. So the node in question is a segment, so we look at that segment and we say, 'do we have more of the events in that segment than we do in the population overall, or not?' Any segment with a better than baseline event rate is going to be labeled a '1,' or belonging to the rare class, and a '0' otherwise. CART uses this latter concept, the lift related concept, for making decisions. We call this, in the CART terminology, 'PRIORS EQUAL'. You can elect to use the first method, which is majority rule, instead, by the selection of PRIORS DATA.
So we're going to bring up this data set in a little bit, but for right now we'll just look at the split that was generated to discuss the main concepts. We have two classes, 0's and 1's, the 1's happen to be 'bads' and this data comes from a financial institution with individual consumers as customers, and they can either have a good or a bad outcome. In this particular data set the 'bads' are not particularly rare, but they're clearly well less than half. The 'bads' start off at 21.4% in the overall sample and CART goes ahead and finds the split, and after the split is made we get two children- the left child and the right child. The left child has a bad rate that's up at 41%, clearly larger than the 21.4% that we started with. You can see that CART is assigning that class to Class 1, the other class is being assigned to Class 0, where the lift concept's gone the opposite direction. So PRIORS EQUAL simply ensures that we think in these 'relative to what we started with' terms. So which should we use, PRIORS EQUAL, or PRIORS DATA, or something else? PRIORS EQUAL is almost always the right choice, we do not recommend that you start your experimentation with something else. Go with the default, see where it takes you. The default is almost always a good choice, and almost always yield useful results. It does not mean that it's best, but it does mean that it's most often sensible.
PRIORS DATA focuses on absolute majority and not relative counts of the data. PRIORS DATA will rarely work with highly unbalanced data, so if you've got ratios like 10 to 1, of 0's to 1's, so that the rare class is a relatively small fraction of the data, it's very unlikely that PRIORS DATA will give you useful results. However, it doesn't hurt to run an experiment if your data set is not so big that it takes a long time to run the example. PRIORS can be expressed as a ratio, so the default will be one to one. You can set the priors to whatever ratio you like, but you have to use positive numbers in expressing the ratio. So you could set a ratio, for example, of 1.2 to 1, and I ran an example for this particular video which uses that. You could set a number like 5 to 1, 10 to 1, whatever you like- again, so long as you have positive numbers and the first number don't have to be to the larger one, it could be the smaller one, in, when you're setting the ratios. Changing priors usually changes results, sometimes dramatically, so extreme priors often make getting any tree impossible. You might want to try that; go ahead and set priors at 100 to 1, see what happens. Probably, you'll get a report saying that no tree was created. It sets a bar; it sets a problem that is impossible for CART to solve.
When you set the PRIORS, you're going to go to the Model Setup dialog. You're going to go to the PRIORS tab and you'll see something that looks like this, possibly in the version that you have it looks slightly different, but it's not going to be very different. So the default is equal, so the selection would have been made on the first option there. You can see that you can select to follow whatever patterns we observed in the learned data, or in the test data, or in all of the data together. There is an option for MIX, which is something in between EQUAL and DATA. We've never found MIX to be terribly useful, but it's in there for historical reasons, and then finally, there's 'Specify', and that's where you get to set the 'PRIORS' ratio to whatever you want. It will start, by default, with the prior settings of 1 to 1, and then it's up to you to change it — if you would like to.
So if PRIORS can change results, then what is right? Well, the results CART gives you are intended to reflect what you consider important, and won't make sense given your objectives. PRIORS EQUAL usually reflects what most people want. If tweaking the priors and changing them gives you better results given your objectives, then use the tweaked priors. So our advice on priors, just to repeat what we've already said, you can start with the default and you don't ever have to get beyond this. If you're a beginner and you're just curious as to what PRIORS are all about, we don't encourage you to leave now- but you can because now you know that you're safe to live with PRIORS EQUAL.
What is BATTERY PRIORS?
There is a battery that is called BATTERY PRIORS, it is available in the CART Pro EX and in the SPM Pro EX series. What it does is, it runs an automatic sweep across dozens, if you want you can set it to hundreds, of different settings of the priors- to display the consequences of tweaking the priors. So this way, you don't have to go through and do this experimentation in a tedious manner, and also the results are then summarized in very convenient tables and charts- which make it very easy to decide what a large number of runs are trying to tell you. This is going to be useful when you're looking to achieve a specific balance of accuracy, for example, across the dependent variable classes- and I don't mean that the balance means equal accuracy, but if you want, for example, 70% accuracy for the 0's and 90% accuracy for the 1's, and PRIORS EQUAL doesn't give you that, then maybe manipulating the priors can get you closer to what you're looking for. You can always experiment manually to measure the impact of a change in the PRIORS.
So, in order to delve into how PRIORS actually work, we have to start by discussing a particular splitting rule, and we're going to start with the Gini. The Gini happens to have a characteristic like PRIORS EQUAL and that is, that even though there are options- the Gini is very often your best choice. That's why it's been set as the default. You have other options, but it's always worthwhile starting with the Gini. The Gini has a very simple formula for the two class dependent variable. So let's label the two classes that we have as class 0 and class 1, and in a specific node of a tree, we represent the shares of the data for the two classes as Pₒ and P1, and these have to sum to 1, because the two fractions are fractions of each of the two groups of the total. The measure of diversity, or impurity, these words are synonymous in CART speak, the measure of diversity in a given subset of data is given by simple formula: the impurity equals one minus the P0 squared minus the P1 squared. Now keep this in mind, if you didn't square those P's, you would get zero, because one minus the fraction that are zeros minus the fraction that are ones has to be zero. But if you take a number that is between 0 and 1, greater than zero and less than one, when you square it, it becomes smaller, so each of the two terms, the P0 term and the P1 term become smaller than what starts, than what you started with, and therefore there'll be something left over, meaning one minus P not squared minus P1 squared, if those two numbers are both between 0 and 1, you will get a positive remainder- and that is the measure of the impurity. If either of the P0 or the P1 terms is equal to one, then of course the impurity will equal zero. When the P0 and P1 terms are both equal to .5, then the calculation gets you to an impurity value of .5. You get 1 minus .5 squared minus .5 squared again, which is 1 minus .25-.25 again. The result is .5. That is the largest impurity that you can get in the two class problem.
Using the Gini Measure
So, the Gini measure's just a sensible way to represent how diverse the data is in a node and it's useful, especially useful, when you have many classes because the formula just expands naturally. Every class has a share, you square the shares and subtract them all from one. So we use the Gini measure as a way of ranking competing splits. A split is a way of dividing the data, based on one of the variables that we have, into two parts. So let's think of two competing splits now: split A and split B. Split A will be considered better if it produces child nodes with less diversity, on average, than split B. We measure the goodness of split by looking at the reduction in impurity in the children relative to the parent. So here's a little diagram to help us think about it. Let's start with the parent node with an impurity of .5, that means each of the two shares for the 0's and the 1's is .5, as we saw before. Let's suppose, for the sake of argument, that 20% of the data go left and arrive at a child with an impurity of .3. 80% of the data go right and there we get a child with an impurity of .2. So, how do we calculate the improvement? Well, the average impurity of the two children is something between .3, which is what the left child has, and .2, which is what the right child has. We're not going to take a simple average because most of the data goes to the right, so we weight the right impurity, which is the lower one, heavier. We weight it by .8, and do a little simple algebra. We get the weighted average of the two children has a, an impurity of .22. Subtract that from .5 and we get improvement of .28, so very simple mathematics here. So here's a draft of the Gini impurity function, the formula that we gave you before of one minus P0 squared minus P1 squared, can be re-expressed as 2*P(1- P).
Doesn't really matter whether we use P0 or P1 here, we're going to get the same curve. So, what you can see is, the curve reaches its highest point, which is the highest impurity, when the P and 1- P are both equal to .5, and then, as we move away from the .5 value, the impurity drops and of course it reaches zero, if either share reaches zero or one. Now the thing to observe about the way the Gini rule works and the way the formula works, is that small changes from .5 don't buy you very much. If instead of a balance of .5 and .5, you had a balance of .6 and .4, think about what those two numbers multiply out to be. The .5 x.5 equals .25. The 6 and the 4 equal 24. .25 and .24 are not very different, so moving the needle requires moving the balance a lot further than from .5 to.6, for example. Even moving it to .7, that only gets you .7 x.3, which is .21, so that's still not dramatically lower than the .5, but the curve does accelerate as you approach the edges and so you can see that there's a lot to be gained by getting a node to be almost pure, if not literally pure.
Okay, let's go to a slightly more elaborate example. We are working in a situation in which there are no missing values, and that's very important. We don't have time to explain that in this particular video session, but just keep that in mind- that there are no missing values here for the splitter in question. So we start, in this case, with a perfect balance of goods and bads, so the percent of the goods in the parent node, which happens to be the root node, is 50%, same for the bad and now we have a split. Never mind how that split was generated, but we somehow got it, and you can see that we have 800 goods and 300 bads in the left child, for a total of 1100 records. 200 goods and 700 bads in the right child, for a total of 900. So you can see what the raw percentages are in each of the two children. So we have almost 73% good in the left child, and we have almost 78% bad in the right child, so it's pretty obvious how to classify those two segments. Let's go and look at the calculations for the impurities. So the parent, as before, has that baseline impurity of .5 that we get almost .4 for the left child. We get almost .35 for the right child. 55% of the data went left, 45% of the data went right, so we do our weighted average calculations and now we get the overall improvement of .1262. If you're really interested in making sure you understand this, I would recommend it with a spreadsheet or with a calculator, just verify that you can get the same results that we see here. This is still a very simple computation.
Here is what the Monograph looks like. We actually copied a formula right out of the Monograph and you can see what they do here. It tends to be a little heavy on the notation. The Delta is the symbol for the impurity. The t is the symbol for a node and the s is a split, and what this says is that the improvement for a given node, at a given split, is equal to the impurity in that node, minus a probability weighted average of the impurities of the two children, and that's just what we've been going through. Now, what if you have unbalanced data? The calculations for all key quantities become weighted when we use a CART default and the original data is unbalanced. Weighting is used to calculate the fraction of the data belonging to each class, the fraction of the data in the left and the right child nodes ,and therefore, the Gini impurity in each node, and therefore, the resulting improvement of the split- the reduction in impurity. So we can no longer use the simple ratios that we used in the previous slides. The good news is that the mechanism for weighting is very simple and easy to remember. So we start with this basic notion: all counts are expressed as the count in the node, for a particular class, divided by the corresponding count in the root node. So instead of looking at raw counts, we're always looking at this relative count.
So you can see in the middle of the slide here, we have a formula. We're looking at a particular node, t(N naught), or 0, stands for the number of records belonging to class 0 in that node, t, and we divide that by the number that we find in the root node, which is N0, without any reference to a particular node- so that is always the count for the whole data set or for what we see in the root. And you can see, that instead of simply taking the ratio of the number of zeros to the total, and the total of course, is the number of zeros plus the number of ones, we first transform each count into this ratio and then we do the calculation that you would want to do to get an average. So the math is the same as before, but we substitute actual counts for these relative counts instead. Now, if we use priors that are other than equal, then we get one additional complication and I hope this is not scaring you here, but the formula that we see on the bottom has those same relative counts, but we multiply each of the relative counts by its appropriate prior. If the priors are equal, so I want you to imagine that πo and π1 are both equal to one half. Well then, every number here is being multiplied by the same π, which is ½, so they just cancel out. So therefore, the formula you see here is exactly the same, with the formula that we saw in the slides before, because the equal priors just disappear. But, if the priors are not equal, for any reason, then all the computations are going to be done in this way, which is a form of weighting. These priors are now incorporated into this splitting rule, because the splitting rule, the Gini measure of impurity, is one minus the sum of all these shares, but the shares are now calculated as those relative counts we spoke about before, weighted by the relevant pi. So this slide here just shows you how those computations would look and they're there as a reference for you, in case you want to start using the formulas to try to verify some of the results that you can see from the CART runs.
Using a Real–World Example
So, let's now go to a real world example. In this example, we have approximately 75% class 0's, which are the goods, approximately 21% class 1, which are the bads, and we set this example up with just one predictor. So the data set which will be available to you, is bad_rare_X.xls. We're going to use the one predictor, X 15, we don't have to ask ourselves 'why are we doing this', this is simply an illustration, this is a schoolroom exercise, so, using real-world data though, in order to see how this mechanism works- so we're not going to discuss the rationale for this. Except for one point, X 15 is the one predictor in this data set that has no missings, and I really did want that for this particular example. The CART tree that comes up, pretty interesting- we have a single predictor which we're using to try to separate the goods from the bads, and we actually get a fairly elaborate tree here. We have a total of 27 terminal nodes, which means that this X, which has got a lot of different values, is being used to chop up the data into 27 different segments. We almost certainly could get away with fewer, but we're not going to talk about how to make CART trees simpler in this session. Look at the area under the ROC curve for test in the right-hand, little display under model statistics, so it's very near the bottom. The ROC on training data and on test data are almost identical and they're pretty good at .72. Doesn't mean that we can't do better, but .72 is quite respectable using just one predictor and because train and test are matching up so well, this actually looks like it's a fairly credible model here, even though it's remarkably simple. But we don't want to look at the whole model here; we only want to look at the root node split, so let's go ahead and do that.
The way we would look at the root node split in working with real CART is we'd use the arrow keys to prune this tree back to a single split, and then request the tree details, which is the button at the bottom, near the left hand edge of 'displays' and 'reports'. Bring that up and what do we see? We see here a split in which we start with about well over 7000 cases, about 2100 go to the left, almost 5000 go to the right, and you can see that the rare class, with a 21.4% incidence in the root node, now as a share of 41% in the left-hand segment, and only 13% in the right-hand segment. Okay, so quite convincing that the left-hand segment, if we were to stop there, we would want to declare that a class 1, or a bad class- much higher risk of bad in the left segment, than in the overall sample. And therefore, and also, much higher risk of bad in the left segment than in the right segment. Now, here we're looking at Classic Output and again, as I said, most people don't look at the Classic Output. We give you enough in the GUI, but we show a little more detail and sometimes people are quite interested to know, what is this detail all about and how do you understand it. That's what we're trying to accomplish here. Let's look at what it says here.
We start, and by the way, when CART was originally created, the first version actually came online; it was a simpler version than what we have now, but it was the first credible antecedent and had a lot of the characteristics of the real CART, and it was written by Jerry Friedman and used in his research at the Stanford Linear Accelerator Center, starting in 1975. So we're looking at almost 40 years ago. It was then improved dramatically, with the joint work that Jerry did with his co-authors and another version that was quite a bit more elaborate was ready by 1984, and that early version, which the real world, outside of Jerry and Leo Brightman's labs, became available around 1987 and there were no GUI displays at that time and when we started working with CART, we had no GUI either, we just had this kind of classic text output, so all of our early work with CART, we were seeing just these kinds of reports and no pretty pictures.
Reading the Output
So how do you read this here? Well let's go start at the top, node one- that's the root. That, that node is going to be split on variable X15. It tells us how many records there are, and then how many records went to the left, and then how many records went to the right. Well, we can see, and remember, that that matches exactly what that little GUI showed us before. We're told again that the node was split on X15. It says that a case goes left if the value of X15 is less than or equal to 2750 and then we get an improvement score, .062. Let me point out that mostly, unless you have an extraordinarily powerful splitter, mostly, improvement scores are numbers small like this. Numbers like .06 are not unusual, they're actually quite normal. Okay, let's hone in now, on another part of this report.
These are the weighted counts. On this case, this does not mean weighted by the priors, it means weighted by if you actually you yourself introduced any weights, and we don't have any here. So these are raw counts and here you see we are being told, what are the number of goods and the number of bads in the left child and in the right child, and again, this is just a non-GUI representation of the exact same thing that we see here and those numbers match up- as we would hope. Now we go to the third panel of the Classic Output and this is called Within Node Probabilities and we see that we have, as usual, a top which is a parent, and a left and then a right column. These numbers are not the raw probabilities or fractions. These are priors adjusted, and so this is where it takes a little bit of computation in order to understand what has happened. In the top node, which is the parent, the priors are equal and that means it does not matter how unbalanced the actual data is. After adjusting for the priors, the probability of a class 0 is .5 and the probability of a class 1 is .5, and that is independent of the actual fraction of data that are ones. PRIORS EQUAL always makes it work out that way and PRIORS EQUAL will always report this exact same split, this exact same pair of probabilities in the top node, .5 and .5, independent of the data- that is what PRIORS EQUAL does. Now, what happens in the left and the right children? That depends on the nature of the split. So, we have to calculate looking at the counts, and then also taking the priors into account, in order to getting the calculations right, in order to see what the left and the right adjusted probabilities are for each class- these will not match your raw probabilities. So over here, we've got a spreadsheet and we're also making the spreadsheet available in order for you to be able to verify these particular computations, should you be interested, and again- this is for the advanced user. if you're a professional statistician and you're new to CART and you're trying to understand how this works, you're probably going to be pretty interested. If you're a grad student, probably similar. If you're a casual user, it's just good to know that there are formulas, and this is very well worked out, and that you could apply spreadsheet to verify this- if you wanted to. You may not actually want to try to work through the details, but if you're still with us, then let's look at the spreadsheet and see what's going on.
We start in column C and you can see there, we have the raw counts for the goods and the bads, and then the total for the parent and then for the left child and for the right child; and this is exactly the same as what we saw in the previous reports and diagrams. Column D shows raw percents, and as we said before, the raw percents are not what we see reported in the Classic Output, because we compute adjusted percentages in order to get the results that we're looking for. In column H you can see what the priors are, so we're setting the prior on class 0 to .5 and the prior to class 1 on .5. If you take this as a spreadsheet and you change those priors, then everything, in, everything in the spreadsheet that relies on the priors will change- and it will change in the right way. Okay, now let's look at column F and what you can see is that we have an adjusted node probability and a raw node probability, both for the left and the right, and there you can see that the adjusted in the raw probabilities are different because they're reflecting the balance in goods and bads that occur in each of the two children. The raw probabilities would just count the number of records, but the adjusted probability takes into account the balance between the goods and the bads, because the bads have a higher weight- and so therefore they have a heavier contribution to each of the adjusted counts. The adjusted probabilities that you see in column G, these are exactly what we saw in the Classic Output. Notice that we see, for the left child, that the adjusted probabilities for the two classes are .28 and .718. Let's go back here, and you can see under the left column, that that is what we get. Notice for the right, we see .64 and .35. Look again in column G, in the second block of yellow highlighted cells, and you see we get the same results there. Finally, all of this feeds into our measure of the impurity; in the left child- we get a .40. In the right child, we get a point, almost .46. So we're averaging two numbers that are about .41 and .46. It'll be somewhere in between; when we do the final calculations, we get that the improvement was about .06, which means that those two numbers were averaging at a little under .44. Again, if you go back to the raw, to the Classic Output, you can see the improvement is being printed there as .0626; and we don't see it to quite that level and we see the same thing, actually, in the spreadsheet: .0626 for the final improvement.
Okay, this concludes this particular first session on priors in CART trees. There is quite a bit more to say about this, we can't cover it in the length of time that we want to devote to any one video, so we're going to have to continue this discussion in subsequent videos- but we're glad you came with us this far. So understand- priors are an advanced control that the casual user need not worry about. The default setting is almost always reasonable and almost always yields valuable results. Tweaking the priors can change the details of the tree, alter results- sometimes considerably, and it can be worth running some experiments. Thanks for hanging in there with us during this relatively long video and we hope to have you visit with us again.