Improve Model Quality with Battery SHAVE
Use Battery SHAVE in the Salford Predictive Modeler® software suite to improve model performance, increase model simplicity, and decrease the number of predictors needed for an accurate model. Using this battery will help streamline and automate your modeling for optimal results.
Greetings, and welcome to the Salford Systems' online series of training videos. This series covers the basics of data mining, machine learning, and predictive analytics, and focuses in particular on how to make the best use of Salford Predictive Modeler technology.
We will be working with SPM™, the Salford Predictive Modeler™. This is a complete family of Salford Systems' classic data mining engines, including CART®, MARS®, TreeNet®, and RandomForest®. You can work along with us using SPM; a no-cost evaluation version is freely downloadable from our website. If you are working with one of the earlier versions of the stand-alone data mining engines such as CART, MARS, TreeNet, or RandomForest, you can still follow along, although you may not have all of the features that we are going to be discussing.
In addition, look for other videos in 2012 with detailed descriptions of what will be new in SPM version 7.0, in particular a new generation of data mining engines and model post-processors.
Introduction to Batteries
Today, we are going to be focusing on the SPM batteries. Batteries are a key part of the automation built into SPM: they automatically generate collections of related models. They allow you to review not just one model, but a series of models laid out in the form of an experiment. This permits rapid refinement of model control parameters, and it also serves as a guide to model development, because the list of batteries is itself a list of recommended experiments that you might want to consider as you perfect your modeling.
You can use batteries to run hundreds or thousands of related models as you look for the best performance, or the best balance of performance and simplicity. Today we are going to focus in particular on Battery SHAVE. Shortly we are going to start up the software and show you how we actually accomplish these things using the menus in the graphical user interface.
Right now I'd just like to say a few words about the thinking behind what Battery SHAVE is all about. It's one of our favorite automatic model refinement tools. We start with a baseline model that might have many predictors, and when I say many, I mean perhaps several hundred; this technology may not be the best to use if you have tens of thousands. From the initial model we can get the variable importance ranking and delete the least important variable. Keep in mind that all of our data mining engines rank the variables in order of their importance for the role that they play in generating that model's predictions. That list of variables is ranked with an importance score applied to each variable, and at the bottom of that list we will discover one or possibly several that are the lowest ranked. This ranking will include variables that were never used at all in the model and therefore have zero importance. What we do with this list is use it to remove the variables at the bottom that appear to have zero or close to zero value. Having removed those variables, we then start with the new, reduced list of predictors and develop a new and possibly slightly simpler model. This new model may have only one variable fewer than its predecessor. We build that model and we get a new variable importance ranking. Again we can take this new variable importance list and remove the least important predictor from it. We then repeat this process as many times as we like. There is nothing wrong with starting with 300 predictors and peeling off one at a time until nothing is left. The only thing this will cost you is the time it takes to run such a backward stepwise elimination of variables, and perhaps you might have to set one of these up and run it overnight.
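The backward shaving loop described above can be sketched in ordinary Python, using scikit-learn's decision tree as a stand-in for SPM's CART engine. This is only an illustration of the idea: the synthetic data set, the importance measure, and all parameter choices here are assumptions, not what SPM itself runs internally.

```python
# Hedged sketch of backward stepwise elimination ("shaving"):
# repeatedly fit a tree, score it on held-out data, and drop the
# least important remaining predictor. Illustrative only.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
features = list(range(X.shape[1]))  # indices of predictors still in play
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)  # 50% held back for testing

history = []  # (number of predictors, test accuracy) at each shaving step
while len(features) > 1:
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X_train[:, features], y_train)
    history.append((len(features), tree.score(X_test[:, features], y_test)))
    # Find and remove the least important remaining predictor
    least = min(range(len(features)),
                key=lambda i: tree.feature_importances_[i])
    features.pop(least)

best_n, best_acc = max(history, key=lambda h: h[1])
print(best_n, round(best_acc, 3))
```

Scanning `history` afterward for the best test score is exactly the "review a whole series of models" idea that the battery automates.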
Building Our First Model
Ok, let's get started then with the Salford Predictive Modeler, which you'll be able to access from the Salford website. We begin by going to the open a file dialogue; this is the dialogue that we are always going to start with in order to begin our analysis. Here I am selecting a particular file called MJ2000. It happens to be a SAS format data file, but it could have been one of about 50 other file formats, including ASCII, for example, or Excel, or any of the other statistical analysis package formats. Having opened this data set, we go directly to modeling, and in order to get started the first thing we have to do, in fact the only thing we have to do, is indicate the dependent variable that is going to be the subject of this analysis.
I'm not saying very much about the nature of this data set right now, except to tell you that it has a two-class outcome. There were goods and there were bads: the goods were recorded as 1s and the bads were recorded as 2s. We have, in addition, 25 predictors available to us, and we're not going to say anything in detail about those predictors either; we're focusing here on just our methodology. By default, all of these predictors are in the model, but let's go ahead and make that explicit over here. The only other thing we are going to do to get started is on the testing tab of the model setup dialogue, where I want to select a train/test methodology with 50% of the data being held back for testing. I'm using the CART data mining engine; there are a bunch of others that I could have used instead, but let's stick with CART for right now and hit the start button.
What Does Our Model Tell Us?
Ok, so our analysis is complete, and what do we see here? On the test set, which is a genuine set of data that was not used to build the model, approximately half the data, we have an ROC of .79. If you're not familiar with that measure, we'll look at some other performance measures, but that performance would be considered excellent for this type of problem. If we look at the summary reports, what do we see? Going to the prediction success, or classification accuracy, report on the test data, we can see that on the class 1s, which are the goods, we are 75% correct, and on the class 2s, which are the bads, we're 77% correct, so the overall percent correct is a little over 76%. Not so bad, as we said before. If we were doing the shaving process manually, we would look at the variable importance ranking, focus on the least important variables, and remove one or more of them. Let's see if there are any with 0 importance. There was one variable with 0 importance, and that will come off automatically in this process. But we don't want to do this manually, so let's go back to the modeling dialogue and go to the battery tab, which is where we are going to set up our modeling process.
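The overall percent correct quoted above is just the class-size-weighted average of the two per-class rates. A minimal worked sketch, assuming equal class counts in the test half (the actual MJ2000 class counts are not given in the video):

```python
# Illustrative arithmetic: overall percent correct is the weighted
# average of the per-class rates. The class counts below are assumed,
# not taken from the MJ2000 data set.
n_good, n_bad = 500, 500          # hypothetical test-set class sizes
acc_good, acc_bad = 0.75, 0.77    # per-class percent correct from the report
correct = n_good * acc_good + n_bad * acc_bad
overall = correct / (n_good + n_bad)
print(round(overall, 4))  # → 0.76
```

With unequal class counts the overall figure shifts toward the rate of the larger class, which is why it can come out "a little over 76%" rather than exactly the midpoint.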
Okay, so we can see here in the battery window that we have a large variety of battery types, and eventually we are going to cover all of these in videos. Today let's just select shaving, add that to the batteries we are going to run, and set up some of the controls we have available. The first item is how many variables to shave in each cycle; if we are going to follow the musical chairs paradigm, then we are going to lose one variable in every cycle. The next option, shave from, controls which end of the importance ranking we remove from: in this case we are going to remove the least important variable, though we could instead remove the most important variable, but that's a topic for another video session. And how many steps shall we take? Over here I want to set this to 20; we could set it as high as 25 if we wanted to shave off all the variables. Let's just hit the start button and see what happens.
Building Multiple Models
The first thing to point out is that we are going to have to build 20 different models in order to do these 20 steps of backward shaving. If you had 300 variables, therefore, you would have to allow enough time to build all of those models. How long the process takes will depend on the number of records in your data set, the number of variables you want to process through, and the speed of the processor you are working with. In this particular example I happen to know it will execute relatively quickly, so I haven't bothered to pause the recording and come back when it's done, because we're already done.
Ok, so once we complete a battery, we always get a battery summary that looks like the one we have over here. What it shows by default is performance on test data: this graph shows the performance of all the models that we developed, with relative error on the y-axis, so the higher the curve, the worse the performance of the model. On the x-axis we list, in order, the variables as they were removed during the backward shaving process. If you remember, when we developed one model manually, this variable here, GEN, was in fact the least important variable, so this is exactly what we should expect: it is the first one to be shaved off. The initial model removes nothing, so it is actually labeled x_none, meaning leave the model alone. Then we get x_gen, which removes the GEN variable; once a variable is gone, it is gone forever. Next we remove the variable E, and so forth.
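The labeling convention on the x-axis is simple enough to spell out: the baseline model gets the label x_none, and each subsequent step is labeled by the variable it removed. A tiny sketch of that convention, with the shaving order assumed from the example (only GEN and E are named in the video; CR is borrowed from the discussion below):

```python
# Sketch of the battery's x-axis labeling convention. The shaving
# order here is assumed for illustration, not read from SPM output.
shaved_order = ["gen", "e", "cr"]  # order in which variables were removed
labels = ["x_none"] + [f"x_{v}" for v in shaved_order]
print(labels)  # → ['x_none', 'x_gen', 'x_e', 'x_cr']
```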
So what is interesting about this particular curve? Well, what's remarkable is that as I start removing variables, initially it costs us nothing: the model performance doesn't get any worse at all, at least within the accuracy we can see from this graph. Then it appears to get very slightly worse, and then there are a number of steps in which the general trend is that the model is improving, with a little bump over here. Until we get to the removal of this variable, CR, we are in a downward trend in terms of predictive error, and therefore an improving trend in terms of the performance of the model. The pattern that we see here, that a model actually gets better as we remove some of the available predictors, including predictors that were actually used, is a pattern well known to be relevant especially to decision trees. If we can somehow get rid of the variables that really don't add much to the performance of the model, then we are very likely to get an overall better model. This is a topic that we will discuss more in another video.
Ok, so now that we have observed the downward sloping trend, what happens after that? What we see is that the error curve starts to rise, and not just a little bit: it increases at a rapid pace. What does that tell us? It tells us that at this point we are shaving too far; that is, we have removed not only the variables we could live without but also variables that are important and critical to model performance, so we don't want to go that far.
If we click on the show minimum error button over here, we get model number 19: out of the models that we have built, the 19th is the one that shows the best performance, and it has six predictors in it. Let's go ahead and double click on that line and see what happens. The model that corresponds to that particular shaved set of predictors comes up, and we can compare it directly with the initial model, the one we started with. Let's position these on the screen, side by side, or sort of side by side, so they are easier to see, and see what comes out of them.
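The show minimum error button is simply picking the battery row with the lowest relative error on test data. A minimal sketch of that selection, with made-up step labels and error values shaped like the curve described above (these are not the actual MJ2000 results):

```python
# Sketch of what "show minimum error" does: scan the battery summary
# and pick the model with the lowest relative error on test data.
# The labels and error values below are invented for illustration.
battery = [
    ("x_none", 0.62), ("x_gen", 0.62), ("x_e", 0.61),
    ("x_cr", 0.58), ("x_q", 0.55), ("x_m", 0.71),
]
best_label, best_err = min(battery, key=lambda row: row[1])
print(best_label, best_err)  # → x_q 0.55
```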
How Has Battery SHAVE Improved Our Model?
This was the first model we ran. Its measure of performance, ROC on test data, is .7903, which is a little worse than the model we get after the shaving process, which has an ROC of approximately .81. That is a fairly big difference for this type of model, and so it's worth paying attention to. But more important than anything else is that we are getting equal or better performance on test data with many fewer predictors, only six instead of 25. Only six variables are necessary to get us this level of performance. Also interesting is the fact that the optimal CART tree in the shaved version has 18 terminal nodes, whereas the model we started with, which had too many variables, has 36 terminal nodes. Again, this is not a surprise: if variables that, in retrospect, contribute very little to the whole tree happen to get involved in the analysis along the way, then the tree has to do more work, because some of the splits were not as good as they could have been with a streamlined set of predictors. So the new model is more streamlined and also more accurate; it benefits along every dimension.