An Introduction To Cross-Validation
Learn how to prepare for and utilize cross-validation to test the accuracy of your results
Welcome to another in the series of Salford Systems online training videos, this is Dan Steinberg. Today we're going to be talking about cross validation, and this will be a little different from some of our other videos. We will have less in the way of live use of the software, as we're trying to explain some concepts here. Cross validation can be a very technical topic; we are going to treat it in this first session as a non-technical topic and try to give you the principal ideas. You should master at least as much as what we're trying to convey here, and it is optional if you want to go further. Okay, so let's get started. Cross validation is a built-in automatic method of self-testing a model for reliability. It gives us an honest assessment of the performance characteristics of a model. It's trying to tell us whether the model will perform as expected on previously unseen (that is, new) data, which is what models are usually designed to make forecasts for. Cross validation is available for all principal Salford data mining engines, and this all started with the first of our data mining engines, namely CART. The CART monograph, which was published in 1984, was decisive in introducing cross validation into data mining; it has spread throughout current practice and is now a dominant methodology. Many important details relevant to decision trees and sequences of models were developed in that original monograph for the first time. That monograph is a landmark for many reasons, and one of the reasons is its treatment of cross validation. Although it's a difficult book, we do recommend that our readers and our users get a copy and refer to it from time to time to see if they can gather some useful insights from it.
What is cross-validation?
Cross validation is a testing method, not a model development method. But why go through the special trouble to construct a sophisticated testing method when we can just hold back some test data? Well, when working with plentiful data it makes perfect sense to reserve a good portion for testing. We work with quite a few large clients, both as software providers and as consultants. And it is not unusual to hear of people building credit risk models with 150,000 or more training records, and having 100,000 or more records for testing. Similarly, in the direct marketing world there are people working with 300,000 and more records for training, and using 50,000 and more records for testing. In those circumstances, if one of your models performs as well as or better than expected on a test sample of that size, most people will feel pretty confident in the results. However, not all analytical projects have access to large volumes of data. So the principal reason, but not the only reason, for cross validation is data scarcity. When relevant data is scarce we face a data allocation dilemma. If we reserve sufficient data to conduct a reliable test, we find ourselves lacking training data. If we insist on having enough training data to build a good model, we will have little or nothing left for testing. In this little diagram below you can see there's an orange and blue part, and you can imagine that we have a little slider and we can make the division into train and test however we like, on a percentage basis. A common division of data is 80% train and 20% test. So suppose we had a really small data set, with say 300 data records in total. If you use the 80/20 split you'd have 240 to train and 60 to test. So we have a tough decision here, which is how much to allocate to test, and I'm showing that slider being set to different positions here.
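The arithmetic of that slider can be sketched in a few lines of Python. This is purely illustrative; the helper name is our own and is not part of SPM:

```python
# A small sketch of the train/test allocation trade-off described above.
def split_counts(n_records, train_fraction):
    """Return (n_train, n_test) for a simple percentage split."""
    n_train = int(n_records * train_fraction)
    return n_train, n_records - n_train

# With only 300 records, every slider position shortchanges one side:
for frac in (0.2, 0.5, 0.8):
    n_train, n_test = split_counts(300, frac)
    print(f"{n_train} to train, {n_test} to test")
```

At the 80/20 position this reproduces the 240/60 division from the text; at 20/80 the shortage simply moves to the training side.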
You can see that in the top panel we are favoring the testing part of our work, but we're making it very difficult for us to learn a good model in the first place, because we've reserved so little data for training. At the bottom we've done just the opposite, we've made sure that we have adequate data for training, but we've left so little for test. It'll be hard for us to put much faith in the results of the test.
Now how do we measure the size of a data set? Well, in most of our real-world applications we see unbalanced target data. In most classification studies the target (dependent) variable is distributed in a very unbalanced fashion. We usually have one large data segment, the non-event, which is actually the event that we're not terribly interested in. And a smaller data segment, which is the event, which is the subject of the analysis. For example, who purchases on an e-commerce website? Who clicks on a banner ad? Who benefits from a given medical treatment? What conditions lead to a manufacturing flaw? When the data is substantially unbalanced the sample size problem is magnified dramatically. So you'd have to think of your sample size as being equal to the size of the smaller class. If you only have 100 clicks, it's useful to think of your problem as being limited to a data set of that size. It does not matter much that you also have 1 million non-clicks. The strategy of cross validation is sample reuse; any one train/test partition of the data that leaves enough data for training will yield weak test results, because the test is based on just a fragment of the available data. But what if we were to repeat this process many times using different test partitions? Imagine the following: we divide the data into many 90/10 train/test partitions, and repeat the modeling and testing. Suppose that in every trial we get at least 75% of the test data events classified correctly. This would increase our confidence dramatically in the reliability of the model performance, because we have multiple tests, which are at least slightly different.
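The repeated-partition idea can be simulated in plain Python. This is a sketch under our own assumptions: the function names are ours, and a trivial majority-class predictor stands in for a real model such as a CART tree:

```python
import random

def majority_class(labels):
    """A trivial stand-in for a real model: predict the most common label."""
    return max(set(labels), key=labels.count)

def repeated_holdout_accuracy(labels, n_trials=20, test_fraction=0.1, seed=0):
    """Repeat a random 90/10 train/test partition many times, 'train' the
    stand-in model on each training slice, and collect test accuracies."""
    rng = random.Random(seed)
    n = len(labels)
    n_test = max(1, int(n * test_fraction))
    accuracies = []
    for _ in range(n_trials):
        idx = list(range(n))
        rng.shuffle(idx)
        test_idx, train_idx = idx[:n_test], idx[n_test:]
        pred = majority_class([labels[i] for i in train_idx])
        hits = sum(1 for i in test_idx if labels[i] == pred)
        accuracies.append(hits / n_test)
    return accuracies

# 90 non-events and 10 events: one accuracy estimate per random partition
accs = repeated_holdout_accuracy([0] * 90 + [1] * 10)
```

Seeing consistently good scores across all the trials, rather than one score from one lucky split, is exactly the increase in confidence the paragraph above describes.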
Preparing the data for cross-validation
So, some technical details: cross validation requires a specialized preparation of the data, somewhat different from our example of repeated random train/test partitioning. We start by dividing the data into K partitions. In the original CART monograph, Breiman, Friedman, Olshen and Stone set K equal to 10. K equal to 10 has become an industry standard, due both to Breiman and to other studies that followed; we mention an important one at the end of this set of slides. These can't be simple random divisions of the data into 10 parts; the K partitions should all have the same distribution of the target variable. So in this case that means the same fraction of events, and if possible we want the sizes of the partitions to be as equal as possible as well. And we have a little bit more to say about that later, in case you are interested enough to try to create your own cross validation bins or partitions. But it does take care to get it right. The important point for the typical user is that all of this is done automatically for you in SPM software; the Salford Systems tools will generate the proper cross validation partitions for you, and these slides, this video, are intended just to give you an idea of what's going on behind the scenes. But the more ambitious students can of course go on to try to do some of this in a manual way to get a better understanding of what this is really all about.
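To make the stratification idea concrete, here is a minimal sketch of assigning records to K folds so that every fold gets nearly the same share of each class. This is our own illustrative code, not what SPM does internally:

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=0):
    """Assign each record index to one of k folds so that every fold
    receives (as nearly as possible) the same share of each target class."""
    rng = random.Random(seed)
    fold_of = [None] * len(labels)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    for indices in by_class.values():
        rng.shuffle(indices)
        # deal each class's records out round-robin across the k folds
        for position, i in enumerate(indices):
            fold_of[i] = position % k
    return fold_of
```

Because each class is dealt out round-robin, fold sizes can differ by one record per class, which is exactly the "bins a little bit uneven" point made later in the video.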
So we have over here a display which shows the data having been divided into these 10 parts in the right way. And what we see is that we have highlighted different numbers depending on the row that we're looking at. So now I will explain what that is all about. Once we've divided the data into our 10 parts we're now going to build an equal number of models. So if the data has been partitioned into 10 parts we're going to build 10 models. Each model is constructed by reserving one part for test and the remaining parts for training. In this case there will be 9 parts for training and 1 part for test. If K equals 5, then each model will be based on an 80/20 split of the data. If K equals 10, then each model will be built on approximately a 90/10 split. Don't forget that when you do cross validation these numbers don't turn out to be exactly equal, because in order to get the target variable distribution right you might have to make the bins a little bit uneven. There is nothing wrong with considering K equals 15 or K equals 20, but it can be a mistake to use too many. In this strategy, it is important to observe that each of the K blocks of data is used as a test sample exactly once.
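The train-on-K−1, test-on-1 loop can be sketched like this. Again this is our own illustrative code, with a trivial majority-class predictor in place of a real tree:

```python
def cross_validate(labels, fold_of, k):
    """For each fold j: 'train' on every record not in fold j, then test
    on fold j, so each block is used as a test sample exactly once.
    The majority-class predictor here is a stand-in for a real model."""
    accuracies = []
    for j in range(k):
        train = [y for i, y in enumerate(labels) if fold_of[i] != j]
        test = [y for i, y in enumerate(labels) if fold_of[i] == j]
        pred = max(set(train), key=train.count)   # the "model"
        accuracies.append(sum(1 for y in test if y == pred) / len(test))
    return accuracies
```

With K = 10 each pass trains on roughly 90% of the data, and the loop produces one test result per fold, which is the raw material for the cumulation step discussed later.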
Let's look back one slide, and recall what it is that we've just been talking about. In the first row, block number 1 has been highlighted, meaning it's been singled out for test. So the training, or the learning, is done on the remaining parts. We go through the whole process of building and testing a model even though the testing is not going to be particularly persuasive, because it's based on a small fraction of the data. We set all those results aside for later review. We now go on to the second row; in the second row we set aside a different portion of the data for testing. Notice that there is no overlap between block 1 and block 2. There is no overlap between the test portions; however, there is considerable overlap of the training portions. In the first row we have partitions 2, 3, 4, and so on through 10 in the training partition; in the second row we have partitions 3, 4, 5, and so on through 10, plus partition 1, in the training partition, and only partition 2 in the test part. And we march through each of these blocks of data until we finally get to the last row, in which block 10 has been set aside as the test part and partitions 1 through 9 have been used for training. So what do we have now? We now have 10 models. In addition, we're going to build an 11th model. The 11th model doesn't have any test data at all; it is built using the training data alone, and because we have no test data we have no way of immediately evaluating that main model.
Running a little example now, using the GB2000_CVBIN.xls data (zipped), which is made available to you on our website.
And running our SPM version 6.8 we're going to use this little risk data set, 2000 records. We're just going to set this up in the way we always have, allowing for all the variables as predictors, except this data set is a little bit different from the others. There are 3 extra variables: a row ID, a uniform random number, and something called bin. We are going to take those out; we're not going to use them in this particular portion of the video. Now let's go to testing, and notice that the default is V-fold cross validation, which is 10. You can change that to any number you'd like. So let's go ahead and run this. The testing has been done automatically, and because the testing has been done via cross validation we don't actually see any test results on the main model display. But we do see test results in reports like the prediction success table, and in the gains and ROC chart, where we can get the test results for the model that we have chosen over here. There is a different set of test results for every size tree, so if we go to this size tree, for example, and go to summary reports, we will get a different set of test results for the confusion matrix. And again we will see a different report for the ROC curve and for the test ROC results.
So how do we get those results? Well here I'm showing you some of the breakdown on actually a different data set, but the concepts are exactly the same.
We have a table which you will not get reported by the software; we use a special technique to generate these results. What we have here are 2 classes, 0 and 1, and the cross validation cycles are numbered 1, 2, 3 and so on down to 10. You can see that there are some class zeros in the training data, some class ones in the training data, and there will be class zeros and class ones in the test data for every one of these CV cycles. If you add up these two numbers, 634 and 113, you get 747, and you will see when you go to the next row, 633 and 114, then 634 and 113, that we're always ending up with the exact same total number of records in the training data, and likewise in the test data. The test counts are always going to be either 70 and 13 or 71 and 12; it doesn't have to be that way, but we try to make it work out that way.
So this is an example of how the data might be divided up into the 10 folds. So what do we do with these 10 different models when we try to put all of the results together in order to get our final results? You have to think about things this way: for a specific size tree, and let's just think first about the two-node tree.
We can generate a confusion matrix for each of the 10 cross validation models; each of those confusion matrices will be based on approximately 1/10 of the data, the tenth that has been held back for testing. What we're going to do is cumulate, that is, add up, confusion matrices until we end up with one aggregate total confusion matrix. That is the confusion matrix that we are going to assign to, and attribute to, the main model, the model that was based on all the data. Then we're going to do it again for different size trees, and that's how we're going to end up constructing the curve that we see in the SPM software. So this curve over here, which is displaying a normalized version of the error rate and therefore the misclassifications, is derived from the summation of those confusion matrices from the 10 models that are not actually shown here; they've been run behind the scenes. So interestingly enough this curve is indirectly derived, and we are assigning it to the tree that has been built on all the data. If you ask whether that could be a legitimate methodology, the answer is yes; if you ask whether it yields reliable results, the answer is yes. It may seem a little bit surprising because the methodology is indirect, but it is spectacularly reliable and successful.
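The cumulation step is just matrix addition. Here is a sketch; the [[TN, FP], [FN, TP]] layout and the function names are our own conventions for illustration:

```python
def sum_confusion_matrices(matrices):
    """Add the per-fold 2x2 confusion matrices [[TN, FP], [FN, TP]]
    into one aggregate matrix, which is attributed to the main model."""
    total = [[0, 0], [0, 0]]
    for m in matrices:
        for r in range(2):
            for c in range(2):
                total[r][c] += m[r][c]
    return total

def misclassification_rate(cm):
    """Off-diagonal (wrong) counts over the grand total of records."""
    wrong = cm[0][1] + cm[1][0]
    return wrong / (sum(cm[0]) + sum(cm[1]))
```

Applying `misclassification_rate` to the aggregate matrix for each tree size is, in spirit, how one point of the error curve for that size is obtained.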
I have a slide here just for the advanced users; this slide explains how we align the results from the main tree with each of the CV trees. This alignment is done via something called cost complexity pruning, and it will be discussed in a later, more technical video. We have already talked about the summing of the confusion matrices, and I just want to point out again that because we're summing correct and incorrect counts across all 10 test partitions, and the 10 test partitions together are equal in size to the original training data set, we end up with a hypothetical test set that is actually equal to the size of the original training data. So here are some important points: cross validation is not a method for building a model; it is a method for indirectly testing a model that on its own has no test performance results. In classic cross validation we throw away the K models built on the parts of the data; we keep only the test results. Modern options for using these K different models exist, and you can save the K models in SPM software if you want. What can you do with these different models? Well, you could use them in a committee or an ensemble of models, a modern methodology in which averaging the predictions of different models is often better than using any one model. It might also turn out that one of the CV models is more interesting than the main model, but this is a rather advanced topic and is not commonly used; it would probably be used only in a situation where you have the time and are willing to examine, in very close detail, every little detail of the tree that you're finally going to publish and use.
Reviewing again the topic: does cross validation really work? We have done our own testing of cross validation by starting with huge data sets, extracting a small training data portion, and pretending that was all the data that we had. We then used the cross validation methodology to obtain a simulated test performance, but then we ran what would be a conventional test, making use of some of the large volume of data that we had available. What we found was that the error rate that was being predicted by cross validation was usually very close to the error rate that we got when we actually had some real test data to run a conventional test set test on that same model. These tests convinced us that cross validation is reliable. The CART monograph also discusses similar experiments conducted by Breiman, Friedman, Olshen and Stone; they come to the same conclusion, while observing that 5-fold cross validation tends to understate model performance, and that 20-fold may be slightly more accurate than 10-fold. So how many folds should you actually use? We already said that the industry standard is 10, but not many people understand why, and not many people understand that you actually have some choices. Well, think about 2-fold cross validation first: you divide the data into two parts, you first train on part one and test on part two, then you reverse the roles, and finally you assemble the results. The problem with 2-fold cross validation is that when we train for the purpose of testing, we're only using half the available data. So the models that actually get tested, the cross validation models, are nowhere near as good as models that would be built on much larger fractions of the data. This is a severe disadvantage to the learning process, and what it usually means is that 2-fold cross validation and similar small numbers of folds, 3, 4, 5, tend to be pessimistic.
That is, they tend to suggest that your model is not as good as it really is, so the spirit of cross validation is actually to use as much training data as possible. Breiman, Friedman, Olshen and Stone actually find that 20-fold is better, meaning it gives you more precise estimates of the actual performance of your model than 10-fold, but only a little bit better, and writing in the early 1980s they felt that 10-fold was good enough given the speed of available computers. Today, if you're working with similar sized data sets, we are working with computers that could be as much as 1000 times faster than the computers of that day, perhaps more. And so the idea of a painful wait for the small data sets just doesn't exist anymore. In other words, we don't see any reason why you might not lean toward numbers that are a little bit higher than 10. But you can go too high: the so-called leave-one-out methodology, which sets the number of folds equal to the number of training records, is an extremely bad idea. It does not work well for trees, and we don't have time to explain why that is so in this video. We advise you not to use those very large numbers when you're talking about classification trees. The results that we reported from Breiman et al. were essentially reconfirmed, with the exact same conclusions, in a paper written by Ronny Kohavi in 1995 called "A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection". He also comes down on the side of 10-fold, and also observes that results could be a little bit better for 20-fold.
So suppose you want to create your own folds. We want to remind you again that this needs to be done with great care with smaller samples. Take an example: suppose you have 100 records, divided so that 92 records are zeros and 8 records are ones. The first thing to observe is that you cannot have 10-fold cross validation here, because every fold must have at least one example of an event and you've only got 8 of them. So the maximum that you can push cross validation to is 8-fold; let's try for 8 folds then. If we try to make 8 folds, then we're not going to be able to have every fold contain exactly the same number of zeros and ones. So our recommendation here would be to have four parts with 11 records of y=0, each of course also containing one of the y=1's, and then four parts with 12 records of y=0, where the remaining record in each of those parts is a single y=1. What happens there is that you get the response rate, which is the distribution of the Y variable, as similar as possible across the folds. This particular division is much better than the one above, where we have seven parts with 11 zeros in each part and one part with 15 records, because there you're going to have quite a big difference in the distribution. My slide over here says y=1 and it should really say y=0.
Okay, so points to remember: the main model in CV is always built on all the training data; nothing is held back for testing. And that means that when we have built that model alone, with no supporting other models, we have no idea what the test performance of that model is going to be. Let's look again at a dataset that has 100 records divided as 92 zeros and 8 ones. The first thing to observe is that it is impossible to do 10-fold cross validation here. Why? Because each fold must have at least one example of the target equal to one, and we've only got 8 of them, so the most that we can do is 8-fold cross validation. There are a few different ways in which we could construct that. One way would be to have 8 folds, seven of them with 11 of the zeros, each of those seven also having one of the ones, and then the last part with the remainder, which would be 15 zeros and the final one. But that's not such a good idea, because while the response rates in the different bins are all the same for the seven parts, they're quite a bit different, relatively speaking, for that last eighth part: 1 to 15 versus 1 to 11. It is better to divide the data in order to get each of the bins as similar to all the others in response rate as possible. If we divide with four of the partitions having 11 zeros and four of the partitions having 12 zeros, things are going to be more similar to each other across the bins, which is going to give us less variation due to the construction of the bins. This is a topic which of course we can't go into in further detail here; we just need to point out that this is something you will have to pay attention to when things don't naturally divide perfectly according to the number of folds that you want to run.
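The arithmetic of the 92/8 example can be checked directly with `divmod`, which gives the as-even-as-possible split of the majority class. The helper name is our own:

```python
def majority_counts_per_fold(n_majority, k):
    """Spread n_majority records over k folds as evenly as possible:
    'extra' folds get one more record than the rest."""
    base, extra = divmod(n_majority, k)
    return [base + 1] * extra + [base] * (k - extra)

# 92 zeros over 8 folds: four folds of 12 zeros and four of 11,
# with each fold also receiving exactly one of the 8 ones.
counts = majority_counts_per_fold(92, 8)
```

This reproduces the recommended division: four bins at a 1-to-12 response ratio and four at 1-to-11, far closer to each other than the 1-to-15 versus 1-to-11 alternative.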
So, points to remember: the main model in CV is always built on all the training data; there is nothing held back for testing. When we build that model, which is before we build the CV models, we have no idea what the test performance of that main tree or main model is going to be. Now if you were to run CV in several different ways, for example varying the number of folds, or varying the construction of the CV folds by varying the random number seeds, you would always get the exact same main model, because the main model is always built on all the data and it's always built before we do any of this partitioning of the data into parts. But what will differ from one variation of cross validation to another is the estimate of the test performance, so it's useful to know if the results are actually going to be sensitive to these parameters. If you run battery CV, you rerun the analysis with different numbers of folds; as the number of folds gets larger, things should look stable. The battery CVR uses the same number of folds in each replication, but creates the K partitions based on different random number seeds. This is expected to yield reasonably stable results, but you never know. Unstable results suggest considerable uncertainty regarding your model, and if the results are sufficiently unstable you might consider trying to find ways to revise and refine your model to see if you can get those results more stable. If you're going to try the batteries, let me just point out where you can find them: they're located on the model setup dialog, and there is a battery tab for most modeling methods. Over here you can scroll down; here is CV. If we add this, let's have a look and see: it's going to automatically build 5-, 10-, 20- and 50-fold cross validation results.
If you try CVR instead, then you're going to get a replication of the whole process of dividing the data into your partitions and then running the models. But each time you do the preparation of the data you're going to start with a different random number seed, which means that the records that end up in each of the folds could be slightly different, and this can make a difference in the results. In terms of the CVR repeat count, I don't recommend going as high as 200 normally; a number like 30 would be a perfectly reasonable number to get an idea of how stable your cross validation results really are. Okay, that brings us to the conclusion of this particular video. We hope that you have found it useful, and we look forward to meeting with you again on another topic.