TreeNet for Beginners
Work through large databases quickly and accurately with TreeNet Stochastic Gradient Boosting.
What is TreeNet?
This is a short presentation only; longer ones will allow you to dig into the details. TreeNet can be summarized in three words: speed, power, and accuracy. TreeNet is Salford Systems' most powerful data mining technology. What's special about it? First, it's fast: it will allow you to work through large databases and handle many predictors in record time. That speed wouldn't be worth much if you didn't also get accuracy, and the performance that can be obtained from TreeNet models is phenomenal. These models can be far more accurate than you might have thought possible with modern machine learning methods. Further, it's automatic. A great deal of what TreeNet does, it does without user intervention or with very little guidance from the modeler. That doesn't mean the modeler can't help; the modeler has an important role. But even with nothing done to guide it, TreeNet is often able to get you very good results.

TreeNet also handles problematic data effortlessly. You can have data with errors in it, mistakes in the actual dependent variable, outliers, or missing values. All of those things are challenges for any data mining technology, but they are less of a challenge for TreeNet. TreeNet performs extremely well in classification problems, in which you have a yes/no or good/bad outcome. It's also extremely good for regression, when you are trying to predict something like how many days a person will end up staying in a hospital or how much a person will spend on a website. Finally, TreeNet has been shown to be extremely good in the analysis of textual data, but that is the subject of another video and a set of tutorials related to text mining.
Why are we excited about TreeNet? First of all, it has been responsible for numerous wins in data mining competitions, going all the way back to 2002. In other words, we have a decade's worth of track record of excellent performance from TreeNet. This technology is indisputably state of the art.
So what exactly is TreeNet? TreeNet was invented in 1999 by data mining visionary Jerome H. Friedman. As many of you know, we have worked closely with Jerry Friedman ever since 1990. Friedman is the principal author, in fact the only author, of the actual CART software; he is the co-author of the book on CART; and he is the sole author of MARS and other groundbreaking predictive modeling technologies. All of the tools that Salford Systems offers are built on Friedman's proprietary source code, which is available only through Salford Systems. You may find other vendors claiming to offer something similar, but the only real versions are available from Salford Systems. TreeNet version 1 was released by Salford Systems in 2000 and is installed in many large-scale, mission-critical business intelligence systems. TreeNet was originally described as stochastic gradient boosting; this is a very technical description, and there are many articles discussing it today. The topic was invented by Jerry Friedman, and we were the first to bring this technology to market. It is easily the most powerful off-the-shelf technology in existence.
The TreeNet technology can be described simply at a high level, although the details are complex and hidden from view. As a user, you don't need to understand those technical details at the start, but you can gradually gain mastery of them. TreeNet is a learning machine that uncovers structure in data through many stages of discovery. Each stage is intended to learn only a little, and hundreds or thousands, sometimes tens of thousands, of stages may be needed to arrive at a final model. However, all of this can happen very quickly, as we'll see when we run the software. TreeNet typically builds very small CART trees and combines many such small trees into a powerful model, far more powerful than you can obtain from any single-tree model.
TreeNet Starts Simple
The TreeNet strategy is to start simple. What do we do? We grow a simple CART tree with only a few nodes. The default in TreeNet, and the recommended number to start with, is 6. You can always change that to another number, either smaller or larger, and you can vary the size quite a bit, but extensive experience suggests that 6 terminal nodes is a very good default and a very good place to start, and it will most likely turn out to be the best for your problem as well.
This tree can be either a classification tree or a regression tree. Now, in most circumstances a 6-terminal-node tree would be too simple to be very accurate, hence it makes mistakes. What is the nature of a mistake in a decision tree? Suppose this is a yes/no outcome; the example we are about to run shortly has a good/bad outcome, and suppose we are trying to predict the bads. In that case, we can color-code the terminal nodes so that the red terminal nodes are intended to flag the bads. In a model that's too simple, these terminal nodes are not going to be pure. What we mean by this is that a red node may be very nice in that it has concentrated relatively more bads than goods compared to the overall population, but the node may still contain a reasonable number of goods. Similarly, blue, which here indicates goods, may not be pure: the node may also contain some bads. So a simple model like this, even though it may perform moderately well, is going to be making mistakes at the point where it issues predictions.
So what can we do about that? We can focus on the mistakes that the simple tree makes. Those mistakes could be mistakes in classification, in the assignment of a probability, or in a regression prediction. We're going to measure the mistakes by some form of "residual", without getting into the details of how that is done here. In the case of regression it's clear that a residual is the difference between the real outcome and what the model predicts. What we're going to do is take those errors, or residuals, make them a new dependent variable, and build a completely new model to predict the errors.
Notice we have a different tree with a different pattern and a different shape. This is the tree that is intended to correct the mistakes of the first tree. Here is what is interesting: the second tree is grown using all of the training data. If we had tried to elaborate our original tree using conventional tree methodology, the tree would have just become bigger, which means the data inside the nodes would have become diminished as the tree grew. But because we are growing a second tree on the errors of the first tree, we actually get to start with all the data again. This is one of the key secrets to the success of this technology. Every stage of learning starts with all the data, and we don't suffer the principal challenge of decision trees, namely working with progressively less and less data as we work our way down the tree.
So what do we have now? We have a revised model with two trees: the first one and the second one. The way in which we combine them is to treat each tree as if it were generating a score. Here is a tree that generates some scores, and here is a second tree that generates different scores; we take the score for any record that comes from the first tree and add to it the score that comes from the second tree. The second tree can therefore be thought of as correcting or modifying the predictions of the first tree. So the model at the end of a TreeNet process is simply the sum of the scores of all the trees that were built.
Here is an example of a third stage in which we continue the same process, and there is no limit to the number of stages we can take this process through.
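The stagewise process just described can be sketched in a few lines of code. TreeNet itself is proprietary, so this is only a minimal illustration of the general scheme, written here with scikit-learn's regression trees standing in for the small CART trees; the function names and the 0.1 learning rate are our own choices for the sketch, not anything from the TreeNet software.

```python
# Minimal sketch of stagewise boosting: grow a small tree, compute the
# residuals it leaves behind, grow the next tree on those residuals, and
# keep a running sum of the scores. Not TreeNet's implementation.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(X, y, n_stages=100, learn_rate=0.1, max_leaf_nodes=6):
    """Fit a sum of small regression trees, each trained on the
    residuals left over by the trees built before it."""
    base = float(np.mean(y))              # stage 0: a constant model
    prediction = np.full(len(y), base)
    trees = []
    for _ in range(n_stages):
        residuals = y - prediction        # the current model's mistakes
        tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes)
        tree.fit(X, residuals)            # every stage sees ALL the data
        prediction = prediction + learn_rate * tree.predict(X)
        trees.append(tree)
    return base, trees

def boosted_predict(base, trees, X, learn_rate=0.1):
    """The final model is simply the sum of the scores of all the trees."""
    return base + learn_rate * sum(t.predict(X) for t in trees)
```

Note how each stage fits the full data set again; only the target changes, from the original outcome to the residuals of the model so far.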
What I've done here is open the predictive modeler; you should do the same if you want to follow along with me in these slides, or if you want to follow along when you listen to these training videos a second time. In addition, I've opened a file called goodbadx_10k.csv, which is available to you from our website. Please visit the website, go to the area you've been pointed to for the training data files, and make sure you have this one available. It has 9,874 records and 20 variables: 5 are character and 15 are numeric. The data pertains to a good/bad outcome, which is why we gave the data set that name.
Setting Up a TreeNet Model
Let's go directly to modeling, even though that's not what we would normally do if we were trying to get an appreciation of a data set for the very first time. Let's click on the modeling button over here and observe what I'm going to do. The first thing I'm going to do, as always, is indicate my dependent variable, and I know that already: it's going to be the good/bad outcome, which is coded as a text variable, as the word "good" or the word "bad".
Here we have 19 available predictors. It turns out we don't want to use all of them, because there is a variable called ID, which is there simply to allow us to track records as we manipulate them in future models and perhaps make predictions. So let's remove that from the list; we don't want an ID variable in there. Other than that, the remaining variables are in fact legitimate predictors. There is only one other little detail I want to take care of here: the variable occupation is coded 1, 2, 3, etc. However, it represents occupations; it just so happens that this data was transformed from the original descriptions into numeric codes in order to disguise the identity of the data, and there is no other reason for the numeric coding. So we'll just go ahead and mark it as categorical, and we are ready to go.
I will want to do a couple of other things here. Because we're talking about TreeNet, let's use TreeNet as the data mining engine we want to work with, and I am going to set the number of trees we're going to grow to 500.
The default is 200; we generally think that is not a bad place to start, but it usually isn't the place you're going to end, so we want to allow for more trees here. The only other change I'm going to make concerns testing: instead of using cross validation, with almost 10,000 records we can well afford to reserve a good chunk of the data for testing, and I'm going to reserve 50%. Other than that, there is nothing more to worry about. Going back to the Model tab, we want to make sure we are focused on the right type of model. This is a yes/no outcome, or two-class outcome, for which the recommended option is logistic binary. So let's go ahead and click the Start button and see what happens.
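For readers who want to try an analogous setup outside the GUI, the same choices (500 small trees, a logistic loss for a two-class outcome, 50% of the data held out for testing) can be sketched with scikit-learn's gradient boosting. This is not TreeNet itself, and the target column name "GOODBAD" below is an assumption about the file layout; only the ID variable is mentioned explicitly in the session above.

```python
# Sketch of the session's model setup using open-source gradient
# boosting; column names are assumptions, not verified against the file.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

def fit_goodbad_model(df, target="GOODBAD"):
    """500 six-terminal-node trees with the logistic loss (scikit-learn's
    default for classification) and a 50% test holdout."""
    y = (df[target].astype(str) == "good").astype(int)
    X = pd.get_dummies(df.drop(columns=[target, "ID"], errors="ignore"))
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.5, random_state=0)   # reserve 50% for testing
    model = GradientBoostingClassifier(
        n_estimators=500,      # 500 trees rather than the default
        max_leaf_nodes=6)      # small six-node trees, as recommended
    model.fit(X_tr, y_tr)
    return model, model.score(X_te, y_te)      # test-set accuracy

# Usage, once the training file has been downloaded:
# model, test_acc = fit_goodbad_model(pd.read_csv("goodbadx_10k.csv"))
```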
So you can see that the process of going through 500 trees on this dataset of 10,000 records is in fact fairly rapid, so what I told you before about TreeNet's speed is, I hope, now evident to you. I'll also point out that I am running this particular example on a Windows XP virtual machine hosted on a MacBook Pro. In other words, it's very likely that in your own offices you'll be working with a machine that is even more powerful than this one.
So we have come back with a very high-level description of what has happened here, and we have curves representing the performance of the model as we go through the training data, which is the blue line, and the test data, which is the red line. We would like these two curves to be very close; that is an indication that we are not overfitting. These curves are a little bit further apart than I would like, but not so far apart as to be seriously concerning. The particular curve we are showing here is related to log likelihood, which is not something everybody prefers to look at, so we can also look at, for example, the area underneath the ROC curve, and this model is performing extremely well. If I make this bigger over here, we can see that even from the very first trees we have very high performance, which then gets driven up to an extremely high level. That is not necessarily something we should expect from all such models, but it did happen this time around.
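Curves like these, performance as a function of the number of trees, can be reproduced in open-source tooling by scoring the model after each stage. The sketch below uses synthetic data and scikit-learn's staged predictions; it illustrates the idea of the display, not TreeNet's own output.

```python
# Performance-by-stage curve: ROC area on test data after each tree is
# added. The stage with the highest test score is the "optimal" size.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5,
                                          random_state=0)

model = GradientBoostingClassifier(n_estimators=200, max_leaf_nodes=6)
model.fit(X_tr, y_tr)

# Test-set ROC area after each stage, i.e. after each additional tree;
# plotting this against the stage number gives the test curve.
test_auc = [roc_auc_score(y_te, proba[:, 1])
            for proba in model.staged_predict_proba(X_te)]
best_n_trees = int(np.argmax(test_auc)) + 1   # optimal number of trees
```

The same loop run on the training data gives the training curve, and a growing gap between the two curves is the overfitting signal described above.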
Let's get it back to a size I'm more familiar with, and now let's look at some of the results. We can go to the summary display first and look over here at prediction success, which is the confusion matrix, also called the classification matrix.
Looking at test data, what we see is that the bads are predicted correctly 93% of the time and the goods more than 95% of the time. This is actually very high performance. If we are more concerned about getting the bads right, well, we have 2,481 bads and this model only made mistakes on 198 of them. So what else do we see from the summary table? Variable importance. The variable importance list tells us which variables are driving the performance of the model, and what we see at the top is excess over blue book, which is a measure of the gap between what the borrower intends to pay for the car and the blue book value of the car.
Now, if it's a new car there will be some other measure for this, but if it's a used car there will be an established market and a typical value for that make and model, adjusted for the particular condition of the car. Next we have the age of the vehicle; following that, bureau searches in the most recent month, which is evidence in the credit bureau records that this borrower was seeking to obtain credit somewhere else. Then we have the occupation, some information related to any bad outcomes for the borrower, the residential status, more about searches in the last 6 months, disposable income, and time at the present employer. We actually made use of every variable in this list of predictors, but the relative importance drops very quickly as we move past the top four predictors, so we may want to take that variable importance ranking quite seriously and think about a simpler model.
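Both displays just discussed, the prediction-success table and the variable importance ranking, have direct analogues in open-source gradient boosting. The sketch below uses synthetic data to show where each piece comes from; it is an illustration of the concepts, not of TreeNet's internal computation.

```python
# Confusion matrix on test data plus a variable importance ranking,
# the open-source counterparts of the two summary displays above.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5,
                                          random_state=0)
model = GradientBoostingClassifier(max_leaf_nodes=6).fit(X_tr, y_tr)

# Prediction success on test data: rows are true classes, columns are
# predicted classes, so the diagonal holds the correct predictions.
cm = confusion_matrix(y_te, model.predict(X_te))

# Relative importance of each predictor (normalized to sum to 1),
# sorted from most to least important.
ranking = sorted(enumerate(model.feature_importances_),
                 key=lambda t: -t[1])
```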
But that's not the most interesting part. Here is the part that is more fun when it comes to this model. Let's try to get a good understanding of what goes on here. Click on Create Plots. I want to see just one-way plots here, that is, I don't want to look at interactions, and I'm going to select all the variables. We don't have that many graphs here, so there is no reason to try to save time by generating only some of them. Now let's go ahead and select plots. Here are all the plots that have been created; they are for the 320-tree optimal model, and I can click on any one of these to see the graph, or I can click on Show All and we'll see the entire bunch.
Scrolling down here, we see small pictures displaying each variable. Let's double click on one of these to get a better view. So what we see here is the age of the vehicle measured in days.
We can see here that anyone looking to buy a car that is about one year old is a worse risk than someone who is buying a much newer car. The description over here says we're predicting for the good/bad status being good, and therefore moving upwards on the y-axis means an increasing probability of being good. So older vehicles are associated with being less good. But there is a flattening out of this curve over here, so really the action is among the relatively new cars, with cars one year and older on the flat segment of the graph.
Let's look at a categorical variable which has good descriptions of its levels. What we see here is residential status.
Quite interesting: if someone owns their home, this pushes them in the direction of being a good borrower. If they're buying, this is also good when it comes to borrowing. Rooming with other people is the worst status we see in this data set. Living with parents is slightly negative, and mobile home is fairly close to living with parents. Then, of course, we don't know what "other" stands for in this particular population; it's a grab bag, and it is also in the negative direction. Scanning through a few more of the graphs, here is another one where the results won't be surprising, but let's still take a look at it. What we see over here is that for this particular population of borrowers, the status of being married is best, and single, meaning never married, is second best.
Separated is probably the worst when it comes to the probability of successfully repaying this loan. Divorced, which comes after separated, is a little bit better, and people living together but not married have a status worse than being married or single when it comes to the probability of repaying this loan.
Now let's look at deposit to cash price. The larger the fraction of the total cost of the vehicle that is paid as a down payment, the more likely the loan is to stay good. That is not a surprise. The curve is a little bit wiggly; we'll say more about that in other videos about TreeNet. But the overall picture is interesting, and what we see is a flattening out towards the right end of the curve, which is important to know if you are using this investigation of the data to try to understand the potential non-linearity of any predictor's effect on the outcome.
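One-way curves of this kind can be approximated for any boosted model by sweeping a single predictor over a grid while holding the other predictors at their observed values, and averaging the model's predictions. TreeNet's plots are its own construction, so the sketch below is only the general partial-dependence idea, shown on synthetic data; the function name is ours.

```python
# Hand-rolled one-way dependence curve: vary one predictor, average the
# predicted probability of the positive class over all records.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

def one_way_curve(model, X, feature, grid):
    """Average predicted probability of class 1 as `feature` is swept
    over `grid`, all other predictors held at their observed values."""
    curve = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value        # override just this predictor
        curve.append(model.predict_proba(X_mod)[:, 1].mean())
    return np.array(curve)

# Example on synthetic data:
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
model = GradientBoostingClassifier(max_leaf_nodes=6).fit(X, y)
grid = np.linspace(X[:, 0].min(), X[:, 0].max(), 25)
curve = one_way_curve(model, X, 0, grid)
```

Plotting `grid` against `curve` gives a picture analogous to the displays above: wiggles, flattenings, and all.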
Another unsurprising indicator: if someone has a mortgage, then of course they are in a better status than if they don't. But that is very similar to the other variable we saw before, residential status.
Here is something that others have observed in similar studies, but it is nevertheless interesting. When a person applies for an automobile loan, they also have to indicate how long they want the term of that loan to be; in other words, they select how many months they will take to pay off the loan. It's not surprising that someone who needs 5 years to pay off a car loan is going to be a greater risk than someone who needs less time.
In this example, the sweet spot seems to be at about 3.5 years. What is not so obvious, and what we don't have a really good explanation for, is why a term of approximately 24 months should be slightly worse than a slightly longer time period.
What Has TreeNet Shown Us?
Ok, so one of the things you can see from this is that we get very nice intuition, and either confirmation or disconfirmation of patterns we expect to see in the data. We also get to see the specific relationship, in fairly detailed form, of how a particular predictor affects the outcome. There is much more we could say about TreeNet at this point, but we want to go back to our slides in order to make a final observation, which is this.
TreeNet can quickly deliver world class models. What we mean by that is that the predictive performance on new data will be world class. The methodology itself is state-of-the-art machine learning, and at this point many independent researchers in the field will agree with us that, in terms of off-the-shelf, general software not custom built for any particular problem, this is the state of the art. TreeNet will make a modeler better, and it will help a less experienced modeler reach more impressive results, and quickly. What can we do with a TreeNet model in terms of explaining it to non-technical individuals and managers? We can display interpretive diagrams in order to explain model details, and we can use variable importance rankings to reveal the key model drivers. The model can be genuinely complex, comprising hundreds or thousands of trees, but that is something for the scoring server to manage: it's an IT problem and not necessarily a modeler's problem. There is more to say about how to work with IT so that the models you develop do not cause problems for that part of your organization, and that is also something we will talk about in another session. Again, thank you very much for joining us, and we hope to have you as a visitor again to the Salford Trainings.