Using CART For Beginners

Learn how to grow CART trees, view tree details, understand CART's color-coding mechanism, and print your results.

Hi, this is Dan Steinberg at Salford Systems, and welcome to this edition of Salford Systems' online training videos. Today we're going to be talking about CART, and this is a super-fast walk-through of some of the basics. If you want to understand more about the basics, we also invite you to visit the longer and slower version, where we go through the details.

CART Model Setup

So what I'm going to do over here is open a file, go to the modeling dialogue, and choose a dependent variable, which is going to be the subject of my analysis. Then I'm going to use a previously prepared analysis to set up my model. Notice that over here I have 10 variables that have been selected to run. I've got a dependent variable, just one, which is all we can have, and 10 predictors, so I'm ready to use the CART engine. I just click Start and away we go.

Tree Details

What CART does is start with the data in the root node. Here I've got 830 records, which happen to be organized as 704 zeros, which are non-responses, and 126 responses. I'm using the available data to find partitions, that is, database rules that can separate the data into two parts such that one part has a substantially different response rate than the other, which is what I see over here. CART conducts a brute-force search in order to find these splitters; once it's found one, it repeats the process. For any node we search for the best splitter, again trying to get that maximum separation in terms of response between one of the children and the other, and we continue this process. We eventually end up with an optimal tree.
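The brute-force search described above can be sketched in a few lines of Python. This is only an illustration, not the Salford Systems implementation: it assumes numeric predictors, a 0/1 response, and the Gini impurity as the measure of separation, and the names `gini` and `best_split` are made up for this sketch.

```python
from typing import List, Optional, Tuple


def gini(labels: List[int]) -> float:
    """Gini impurity of a list of 0/1 responses (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n  # response rate in this node
    return 2 * p * (1 - p)


def best_split(
    rows: List[List[float]], labels: List[int]
) -> Optional[Tuple[float, int, float]]:
    """Try every variable and every midpoint between adjacent distinct values.

    Returns (weighted child impurity, variable index, threshold) for the
    split that best separates responders from non-responders, or None if
    no split is possible.
    """
    n = len(rows)
    best = None
    for j in range(len(rows[0])):
        values = sorted(set(r[j] for r in rows))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            # Low values go to the left child, high values to the right.
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, j, t)
    return best


# Made-up toy data: two predictors, four records.
rows = [[1, 10], [2, 20], [3, 10], [4, 20]]
labels = [0, 0, 1, 1]
print(best_split(rows, labels))
```

To grow a tree, the same search would be repeated on the left and right children of each split it finds, which is the "repeats the process" step in the description.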

In this rapid walk-through we don't have time to describe how we determine what optimal is, but I just want to assure you that it is based on testing. We are not relying on the training data; we are relying on data that has not been used to build the tree in order to determine what is optimal.

Color-Coded Nodes

I'm going to look over here at a tree which is smaller than optimal but is interesting. The key in a CART display here is the color coding. Red means interesting, high response. Blue means low response, and if we're looking for the high response nodes, then of course we want to focus on the red nodes.

So how do we get to the red nodes? Well we've got two different ways of looking at this tree. We can of course hover our mouse over the nodes and follow our way down. We can look at a conventional picture of the tree, which gives us all the details of what happens in each node. Or we can look at a very streamlined version.

In this streamlined version we can see which variables are operating. It turns out that in a CART tree, the way the splits work is that the high values of the variable go to the right and the low values go to the left. So whatever variable was used to split this node, the high values go to the right. Whatever variable was used to split this node here, the high values go to the right as well.
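As a rough illustration of that left/right convention, here is a minimal Python sketch of how a record travels down a tree. The tree, the variable names `bill` and `pager`, and the thresholds are all invented for the example; they are not taken from the model shown in the video.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class Leaf:
    name: str


@dataclass
class Split:
    variable: str
    threshold: float
    left: "Node"   # receives the low values (value <= threshold)
    right: "Node"  # receives the high values (value > threshold)


Node = Union[Leaf, Split]


def route(node: Node, record: Dict[str, float]) -> str:
    """Follow a record down the tree and return the terminal node's name."""
    while isinstance(node, Split):
        if record[node.variable] <= node.threshold:
            node = node.left
        else:
            node = node.right
    return node.name


# Hypothetical two-level tree, for illustration only.
tree = Split(
    "bill", 50.0,
    left=Leaf("low bill"),
    right=Split("pager", 0.5,
                left=Leaf("high bill, no pager"),
                right=Leaf("high bill, has pager")),
)

print(route(tree, {"bill": 80.0, "pager": 1.0}))
```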

So we end up with two interesting terminal nodes, and those are the ones we are going to want to focus on first: the extreme right and the extreme left. It doesn't have to be that way; it just turns out to be that way this time. So let's look at this graph over here and tell the story.

This particular data set came from a marketing initiative by a European telephone company that was trying to introduce the mobile phone into its market of landline customers about 20 years ago. They ran an experiment in which they made an offer to approximately 1,000 different households. In that set of experiments they offered everyone the exact same phone and the exact same service, but everyone was shown a different price. So the idea was to learn how price affected people's responses, and also to learn which segments were most interested in the product. Remember, this node over here was a high response node and this node over here is a high response node. So this is what happened over here.

Even though we showed some people a high price, what we found was that if they had a large telephone bill and if they also owned a pager, then we got a really good response to our offer. Not surprising: the pager is nothing but a defective cell phone. When we come along with a superior technology, even though this was 20 years ago, those people who are already paying for the inferior but unusual technology jump on the opportunity to upgrade, and that's what happened over here.

What about the people who didn't have a pager and also didn't have a large telephone bill? Well, that's looking at this side of the tree. Here we led with a low price, but that wasn't necessarily enough to get a lot of response. We are also looking at individuals who had low telephone bills, lived in particular cities, and happened to be quite young, under 25 years old. So what happens over here? We notice that the people who are high responders are the people who have an extremely low telephone bill.

What's the story here? Well, very interesting. The story here is that we have individuals who don't make phone calls, probably because they end up going home very late. These are young individuals who probably hang out with their friends after work, perhaps go to clubs, perhaps just hang out at the park, but the thing is they show up at home very late, and it's too late to make any phone calls. Now, when you show them a new technology that will allow them to interact with their friends, they are extremely positive on that new technology, but with one proviso: the new technology has to be cheap.

So this is the kind of insight we get from the model, which is to repeat that there are some subtleties here. Normally, spending a lot on your home phone bill is a good thing, and it is a good thing on this side of the tree. But we have a segment over here where things are different. We also see here that in certain cities, being older than 25 is good for this particular market, probably because the people over 25 have higher incomes, but there is a group of younger people that we also want to target. Those are identified over here.

This kind of subtlety would be extremely difficult to detect using conventional statistical models, and we know this for a fact because we originally conducted this analysis 20 years ago using classical statistical methods and we missed some of these subtleties at that time. Later we reanalyzed the data with the CART decision tree so we could learn more about it.

Printing CART Results

Once we've completed an analysis, there are a number of things that we can look at to complete our understanding of the data. These come in the summary report. We have over here the conventional gains chart, which we can open in a new window in order to see it much bigger. We can review the variable importance list, which tells us how the model ranks its particular drivers. We can also go to tree details here and get a nice print of this particular tree. And of course we aren't going to print it on paper; we are going to print it to an image file or a PDF. But let's organize this so we can get it onto one page, since it can clearly fit there reasonably, and spread it out a little bit. If we want, we can add headers and footers. We can send this to a PDF, so let's click OK here. One of our options will be a PDF writer, and we're OK.
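For readers who have not met a gains chart before, the following Python sketch shows the kind of calculation that typically sits behind one: terminal nodes are ranked from highest to lowest response rate, and the cumulative share of responders is compared with the cumulative share of records. The node counts are invented, the function name `gains_table` is hypothetical, and this is not a description of how CART itself builds its report.

```python
from typing import List, Tuple


def gains_table(nodes: List[Tuple[int, int]]) -> List[Tuple[float, float]]:
    """nodes: (number of records, number of responders) per terminal node.

    Returns one row per node, richest node first, as
    (cumulative share of records, cumulative share of responders).
    """
    total_n = sum(n for n, _ in nodes)
    total_r = sum(r for _, r in nodes)
    # Rank terminal nodes by response rate, highest first.
    ordered = sorted(nodes, key=lambda nr: nr[1] / nr[0], reverse=True)
    rows = []
    cum_n = cum_r = 0
    for n, r in ordered:
        cum_n += n
        cum_r += r
        rows.append((cum_n / total_n, cum_r / total_r))
    return rows


# Made-up terminal nodes: (records, responders).
example_nodes = [(300, 30), (100, 50), (600, 20)]
for share_of_records, share_of_responders in gains_table(example_nodes):
    print(f"{share_of_records:.0%} of records -> {share_of_responders:.0%} of responders")
```

In this made-up example the richest node holds 10% of the records but half of the responders, which is the kind of lift a gains chart is meant to make visible.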

Once we're done with this, we can also go to scoring. We can decide which data set we want to create scores for and where we want to save those scores, and perhaps choose an ID variable in order to keep track of which record is which. Then we just click OK and away we go, and we have our score results. Well, that was a very quick, ultra-rapid overview of some of the things that happen when you run a CART tree.

We invite you to come back for a longer discussion, which is probably four times as long as this one, in which we go through a few more details. Just to give you a heads-up, covering everything that is relevant to advanced data analysis in CART is going to take a number of hours. In our in-person training we normally cover this in two full-day sessions, so as you go through our videos, expect to visit a number of them if you're hoping to become a true CART expert. Thank you very much for your interest in CART and in Salford Systems, and we hope to see you again soon.

(Video Transcript)
