Well, welcome everyone and good morning. My name is Mikhail. I'll be talking about data mining and sports analytics for the next 50 or 60 minutes or so. I'm very thankful that you found your time to come to my session. I understand the competition is fierce out there, but hopefully all of this stuff will be recorded and you can review it later. But I'm still very thankful because I like it when I have people to talk to, sometimes I do training sessions over the Internet and that's the worst kind of activity that you can find- talking to yourself in front of the computer, not even knowing if anyone's listening. As you may have guessed, I'm originally from the former Soviet Union space; I was raised and born in Ukraine. I was initially trained as, believe it or not, a rocket scientist, but when the whole thing collapsed, I kind of had to find a different area of occupation. Eventually landed in data mining, second degree- this time in America, and working in the field at this point, 12 years or so. I do a lot of data mining and we are located in San Diego. We have lots of clients all over the world, and a lot of over the United States. I travel to a lot of different places, different parts of the country, different parts of the world, and I talk about data mining and how it can be done. We also do a lot of consulting, we've seen all sorts of data, and we've seen many different situations, so we have a lot of expertise in how to work with the data- not necessarily knowing exactly what the data represents. So, and by the way, I can promise you it looks like there will be 45 to 50 minutes of solo play right here, and there will be no commercial breaks. All right, let's talk about data mining. The general structure is, there is a lot of confusion going on in data mining out there, so I wanted to clarify a few things to set the stage, so that we know what it is we are talking about. 
Then we'll take, just as an example, some of the baseball data sets available out there and I'll quickly show the kind of things that the data mining can do for you. Like the real data mining, not just descriptive statistics that a lot of people kind of think about.
Let's start with the definition of data mining first. What is it? It's the search for patterns in data using modern, highly automated, computer-intensive methods. End of story. The key is the search, and automating it. Now, people can do a lot of searching on their own. You can look into data as hard and deep as you want to. But the key is that computers were designed specifically to enhance our searching capabilities. And just like computers complement our minds in doing certain tasks, the same thing happens in data mining. There are limits to what our mind can do; there are limits to what a computer can do. And contrary to the popular belief that computers will someday replace our minds, I believe it's just the opposite. It's a perfect complement: the things that computers can do, we'll never be able to do. We'll never be able to add seven-digit numbers in a fraction of a second. And yet we can recognize faces in an instant, something that is notoriously difficult for a computer. So that's why data mining is a perfect complement to what we can actually do ourselves. The real problem that we are focusing on here is finding hidden information in data. That's how data mining was born; that's its ultimate purpose.

As for the uses of data mining, if you read trade journals, all that kind of hype out there, you'll find these fancy terms: predictive analytics, machine learning, pattern recognition, artificial intelligence, business intelligence, etc. Those are fancy marketing terms, all grouped under the banner of data mining. They won't help you understand what's going on; that's why I'm here throwing in a few things, so that you get the general picture. As for where data mining is used, well, the real data mining is deeply rooted in the field of statistics, and statistics itself is rooted in science, and we have a very fancy name for this session: the evolution of sport.

I really like it, because data mining is a kind of evolution of science. First, you had the general big science, where laws are discovered, and people talk about laws as simple, generic equations that hold at all times. Then they invented statistics: they started looking at data, gathering data, and realized that sometimes you can actually find laws by looking into the data itself. And then as the databases became larger, and deeper, and more complex, people started sensing the limits of conventional statistics, which traditionally requires an analyst to impose a certain picture on top of the data, and sometimes it holds, sometimes it doesn't. So eventually the evolution progressed in such a way that there was a complete detachment from what the analyst thinks, and a focus on the data itself. So data mining eventually found its way as a merger between statistics and computer science, and then it propagated into all other areas of life: insurance, finance, marketing, robotics, biotech, and now sports analytics. I welcome this trend. In the end, data mining, the real data mining, can do enormous things for you as analysts and also as consumers, users of data mining, and I could even stretch it as far as saying the future does belong to data mining. We will talk about it more and more during this session.
There is another interesting thing. Data mining is being extensively used in finance, marketing, insurance, all those real hard-core fields. Based on what I've seen, I can't really say that data mining has yet been used in sports. For some reason, there's a lot of research going on in sports that focuses on constructing specific statistics, or performance measures, or doing some kind of general descriptive stuff, which I will show in a moment. But they still don't really benefit from the real uses of data mining, which can actually show and prove, or disprove, the utility of what you are doing with all of these performance measures. So we'll talk about it more.

The next slide is interesting: 'Long live the King'. And in data mining, the King is our data. The reason that I put it here is that in data mining, it is your data that guides the analysis. It's the alpha and omega of everything you do as far as when the models are constructed, and then you serve as the final judge in deciding whether what you see in the data really has some practical implications or whether it could just be an error in the data that you have. We've seen it many times: our clients approach us and they say, 'Yeah, we've got this great data set, let's build a data mining model on it.' We build the model and say, 'Okay, well, here's what the data tells you, but it obviously makes no sense.' And they say, 'Oh yeah, well, you know, that variable is not exactly what we told you, it has a different meaning.' Or it's like, 'Oh, we forgot that our database guy screwed up and merged two databases, or there was a misalignment in the variable names.' Now, believe it or not, things like that really do happen, and data mining quickly pinpoints those. So the data is your King. What's your role? Your role is to ask the right questions. And that's probably the biggest amount of work, believe it or not, in any data mining application.

You have the data set, but you may not have the right questions. There are some questions that can be easily answered by looking through the data set; there are other questions that are sometimes just impossible to answer, given the data that you have.
So that's where most of your time is going to be spent: looking into the data set and figuring out how to ask the right question and what can be solved with the data that you have. The conclusion here: the success of data mining depends solely on the quality of the available data. In data mining, and in statistics in general, we have this famous principle: garbage in, garbage out. So if you have a data set that's garbage, you can study it all you want, and in the end your models are not going to be useful at all. That's a well-known principle that I want you to have in your head from the very beginning.

Now, why am I talking about the importance of your data? The interesting phenomenon is that whenever we think about a problem of some sort, and especially when you become an expert in a field, at some point your mind starts playing tricks on you. Your intuition might get it really wrong. When you think that you know what you're dealing with, you may think that this is how things should be, and you draw all sorts of conclusions, but when you look at the data, all of a sudden you realize that the data points in the opposite direction. My boss and our company's president, Dan Steinberg, used to tell me a wonderful story about his econometrics teacher at Harvard, and the story goes like this. There is a well-known professor of econometrics standing in front of the students with a transparency that shows a graph of some sort. He spends fifteen minutes looking at the graph; there is a kind of rising curve on it. And he talks about why the supply goes up because that ratio goes down; he spends all of the time explaining the intuition for why we have a rising trend. And then a student in the audience raises a hand and says, "I'm sorry, Professor, but it looks like you put the slide upside down."

But guess what: he turns the slide over, the graph goes down, he looks at it for 15 or 20 seconds, and then says, "Ah, that also makes a lot of sense." Then he spends another fifteen minutes explaining why the curve goes down. So never forget that your mind and intuition can sometimes play very funny tricks on you. Sometimes it works for you and sometimes it doesn't.

What is the essence of data mining? Sometimes it's called machine learning and all these other terms. We already know the importance of the data; I emphasized it on the previous slide. But ultimately, what we really want to study is some kind of phenomenon, what we call in statistical terms a population. There is an analyst here and a population there. There's a big gap between the two, and it's kind of funny: in sports, I can tell you, you could probably run a simple predictive model to predict who here in the audience is an athlete and who is an analyst. Just look at the weight and things like that. So there is a gap, and it is bridged by what we call historical data. If you don't have data, there is no data mining. You can do all sorts of science, apply all sorts of intuition, but we have to have data in order to study the population in data mining. You feed your data into a data mining engine; you will see examples of such engines soon. The engine produces a model. The model could be as simple as saying Y equals X, or as complex as saying here is a black box: you feed all the inputs in, you get the predicted response out. So there are many different varieties out there. Now, if the model allows some kind of decomposition, then you can get what we call the insights. So you can study the past: here's what happened, here is the model that attempted to explain it, and you know the degree to which it explains it. And then you look at the model structure, and you see the insights.
On the other hand, at some point you can get new data, and through the process that we call scoring, or applying a model to new data, you can get the predictions. We have some clients who utilize this model in horse racing. And some of them, they actually say, 'Screw the insights- I'm not interested about that. I'm really interested in predictions' but on the other hand, sometimes you will find a lot of added knowledge if you focus on the insights. And I will show, in a moment, a few runs and my primary goal here is to deliver a message of data mining in terms of insights. But at the background you should always realize that there is this predictions part, that's relatively straightforward. Once we have a model, we can always run your prediction of course, in any prediction game, just like elsewhere- you're trying to predict future based on what you learned in the past. Sometimes it works, sometimes it doesn't; there's nothing you can do about it. So when someone shows up and gives you a sales pitch, 'Oh, you can do all of that and you'll get great models, you go ahead and win everything.' That's just not true. Things change, people change, everything evolves, and sometimes pattern changes. It can be quite heartbreaking, but that's the general model that we are trying to work with in data mining. So in a nutshell, use historical data to essentially gain insights and/or make predictions on the data. That's the essence of data mining.
And now let's look at the possible implications of this in sports analytics. You know that to run data mining, you have to have data. In sports, data is already available, either implicitly or explicitly. Why? Because any game is a data trove. Furthermore, it's an unambiguous data trove. It's all recorded and traceable, and if there is some kind of question, you can always review it; you can always come up with, essentially, a very clean record of what has actually happened. Now, I know some of you might think otherwise and say, oh, it's difficult to get data, or it's incomplete, and so on. I can tell you that if you think you have bad data in sports, go ahead and do a quick career change into marketing, finance, or retail. That will quickly cure your misconceptions of what bad data really is. In retail, we have to deal with data sets that are so bad because sometimes you just can't force store managers or warehouse managers out there to keep a complete record of everything. Sometimes they'll just run it from the back of their heads. In marketing, you have all these surveys that are distributed, sometimes in the worst possible way, which is web-based surveys, and you have people filling in answers at random just to get a five or ten buck rebate or something. There's a whole group of people specializing in that kind of activity. They make their living out of it. So talking about bad data, well, there is bad data out there, but not in sports.

Now, what you do have in sports that is a problem is how to describe the data record. That's where, it appears to me, a lot of effort goes. Say, 'Okay, here's a baseball game.' We know what happened. We know all these different core descriptors and measures of the game itself. And now you want to produce more variables and more fields to describe what really happened. Well, that's a legitimate question, that's a legitimate activity, but trust me, it's not related to the quality of the data itself.

It's more related to what we call data prep, or data preparation. Once you get the core information in place, then you start working on possible ways to enhance it and expand it. And once it's enhanced, you're kind of helping the data mining engine do its job later on. Again, this is something that you'll see. But that's essentially what we have here, and that's what I wanted to point out on this specific slide in terms of the availability of data in sports analytics, and how it works, and what you have.
Talking about data. Here is an extract from the Internet. You just go on the web anywhere; you can pull it off right away, within a fraction of a second. So as I said, there is good quality data available. There are lots of different measures you can observe and there are lots of sources where you can get it. Whenever you address a data mining problem, or a modeling problem in general, there are two key questions that you need to be absolutely clear on. Number one is, what constitutes a data record? Like here, the data record represents a player's performance for the season of 2010, or a number of seasons, as a matter of fact. But the key is that the record is a player, and the record summarizes overall performance over, say, up to 160 games in baseball, in this case. You could, alternatively, try to model things at the team level, or you could model things at the game level: instead of having one record per season, you could describe the game itself, what happened within the game, and now you'll have a lot of records available for analysis. But before you start doing anything, that's the important thing. You have to be clear: what is it that you have at the level of your data records?

Question number two, at least in the type of data mining that we are involved with, which is what we call predictive modeling, is to know what it is you are trying to predict. And the outcome needs to be available in the historical records. Again, in sports, it's very straightforward, because you can always talk about, say, win versus loss. That's a perfect target; we call that a binary target, a yes/no type of event. It's clear, unambiguous, and available in our historical records. But you can go to other levels: you can say, okay, let's try to predict team score, or rank, or whatever. You can be as fancy as you'd like, but this win/loss is a very good starting point, and that's what I will focus on now and for the remainder of the presentation.

Now, the data set is available like that. I mean, you could try to extract these records on your own, or you can buy them from other authorities that specialize in that, but the data is out there, it's available. Now what I will do here, I will work just for the sake of illustration, because after all, we're just trying to show the potential of data mining. So I'll get some information from this baseball database that's publicly available. It's for research use only, non-commercial, so anyone is welcome to download it or contribute to the guys that are behind it, but it has a wealth of information. Essentially, it takes these tables here and summarizes them into a player-level database, so that you can extract any season you'd like, any time span you'd like. It has information on batting, pitching, fielding, post-season data, all sorts of data, all in one nice ten-megabyte file once you download it. Now in our field, of real, hard-core data mining, sometimes we deal with databases that are gigabytes or terabytes in size. So that database is complex, but it's nothing unheard of. It's actually a joy to work with: it has clean records, there's no missing data, no imprecise data, and so on. So I'll work with this, and let's define a data mining problem that we will try to address using data mining approaches.
I want to be modern, so we'll focus on 2010 regular season performance in both the American and National League. I guess, as most of you know, there are two leagues in American baseball: American and National. One has 14 teams, the other 16 teams. They all divide each league into divisions. There are three divisions: East, West, and Central. And at the end of the season, each team plays about 160 games or so. At the end of the season there are six winners, so each division gets the winning team. So I can get all of the information from baseball database, and again, it's just one simple example. Say you're saying, okay, here's a record of players performance in the season of 2010 and I also know which teams of the set of thirty teams actually won the division. So I'll get six winners and 24 losers at the team level and each player's assigned to the corresponding team so I can easily propagate winning and losing information into the players’ space. So, and again, the reality's a bit more complex, we also have the wildcard business and so on, so that you can actually have four teams from each league entering into the Division series, and Championship series, and World Series, I mean you can easily propagate that information into this example. So, just consider this one possibility, and it's very easy to pose alternate targets in this case. So what we're trying to do here, we want to see which of the players’ stats are associated with the team winning the division. It's as simple as that, and very straightforward question. Like, we know that each player performs one way or another. We have, like, the batting average, we have number of home runs, right? We have all these other things that each player is described and we always have some kind of conception in our head, okay these are the things that we are looking for, this is a great player, this is a mediocre player, and so on. 
So let's get all of those characteristics and mine the results of the 2010 season to actually see what the data mining is going to tell us.

I focus on the batting part of the game first. Now when I looked at this database, ideally, of course, you can merge a lot of pieces together into fancy records to see how all the different parts of the game jointly influence each other. But I simply ran out of time, and I didn't want to be very fancy, so at some point I decided to keep it simple. Let's focus on one type of activity at a time, so we'll pick batting first and see how it all works in terms of having a team win the division or lose the division. There are a lot of different performance measures that are known and introduced: number of at-bats, number of runs, number of hits, doubles, triples, home runs, runs batted in, stolen bases, caught stealing, base on balls, strikeouts, all of these things and more. Now I call them core stats. Why? Because they're kind of independent entities. You're looking at the game, you're observing what happened, and baseball is kind of a sophisticated game in terms of the number of things that can happen, and each event is measured by the corresponding stat. Now we're talking about the season overall, so it's not the individual game level; all of these are essentially accumulated sums of everything that happened over the entire span of up to 160 games, for some players. On top of these, people invest a lot of time and effort into constructing so-called derived stats. Now, derived stats are represented here. Some of them are simple averages: batting average is the number of hits divided by the number of times at bat. In baseball, each inning is essentially a battle between two teams, but it's more like a battle between David and Goliath. We have a pitcher and a batter, and something happens, so there is an outcome of that.
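The labeling step described above, copying each team's division result onto every one of its players, can be sketched in a few lines of Python. This is only an illustration: the field names (`team`, `div_win`, `hr`) and the team codes are made up for the example and are not the actual columns of the baseball database.

```python
def label_players(players, division_winners):
    """Copy the team-level outcome onto every player record.

    players: list of dicts, each with a "team" key (hypothetical layout).
    division_winners: set of team codes that won their division.
    Returns new records with a binary "div_win" target (1 = yes, 0 = no).
    """
    return [
        {**p, "div_win": 1 if p["team"] in division_winners else 0}
        for p in players
    ]


# Toy records with invented team codes, not real 2010 data.
players = [
    {"player": "p1", "team": "AAA", "hr": 12},
    {"player": "p2", "team": "BBB", "hr": 25},
    {"player": "p3", "team": "AAA", "hr": 3},
]
winners = {"AAA"}

labeled = label_players(players, winners)
for rec in labeled:
    print(rec["player"], rec["team"], rec["div_win"])
```

Swapping in a different set of winners (wild card, league, or World Series) gives the alternate targets mentioned in the talk without touching the player records.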
So each time a batter faces the pitcher, and we eliminate all sorts of errors and other things, and as it gets really fancy there, but ultimately, your batting average is how many times you actually hit the ball out of the total times you faced the situation in that battle. Then you have other things like people calculate total bases, because when you hit, sometimes you run to the first, or to the second, or to the third, or home-home run. And you can easily calculate total number by that. There is also what they call slugging ratio, that's total bases divided by number of times at bat. There's on-base percentage, there's also all these kinds of other fancier things that were introduced, believe it or not, there's a whole field...
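The derived stats named above follow directly from the core counts, and a small Python sketch makes the arithmetic concrete. The function name, the argument names, and the sample numbers are assumptions made for illustration; only batting average, total bases, and slugging are computed, since those are the ones the talk defines.

```python
def derived_batting_stats(ab, h, doubles, triples, hr):
    """Compute batting average, total bases, and slugging from core counts.

    ab: at-bats, h: hits, doubles/triples/hr: extra-base hit counts.
    Singles are the hits that are not doubles, triples, or home runs.
    """
    singles = h - doubles - triples - hr
    total_bases = singles + 2 * doubles + 3 * triples + 4 * hr
    if ab == 0:
        # No at-bats: the ratios are undefined, so report them as None.
        return {"avg": None, "tb": total_bases, "slg": None}
    return {
        "avg": h / ab,            # hits divided by times at bat
        "tb": total_bases,
        "slg": total_bases / ab,  # total bases divided by times at bat
    }


# Invented season line, for illustration only.
stats = derived_batting_stats(ab=500, h=150, doubles=30, triples=5, hr=20)
print(stats)
```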
That's what we always do. When you go to financial company, for instance, they give you the data set, are you going to have hundreds or thousands of fields? You look at it and say, okay, now let's brainstorm and see what other derived things we can come up with. But it's just the beginning of the story and you should not confuse it with the end of the story, because ultimately, you look at those derived fields and then you data mine the data set and see whether those derived fields actually work for you, or they're doing the job that you thought they would be doing. So this is basically what we deal here and I'll focus on the core stats first, and that's a good exercise whenever we have a set of core measurements- you want to run data mining off the bat to see what it actually shows to you and there will be some interesting insights, as you'll see in a moment. But first, before I go data mining route, let me quickly show what happens with classical statistics. Now we call this scatter matrix, when each performance measure is plotted against everyone else in the list. And the dots here, the blue ones are losses, and the red ones are wins. If you look at all of these, it's a mess. You can't really draw any conclusions from it. Now, some people look at it for a long time, they jump inside of the data, run flight simulators, fly around those data points. They can go mad, but still can't really see what's going on. That's where data mining comes in play. There are a host of different data mining techniques out there and there's no time for me to discuss them all, or to introduce them all. That alone would've taken hours and hours. What I will introduce instead is the, essentially, a set of data mining techniques generally known as tree-based techniques. And the specifics of the terminology are irrelevant for the purposes of this presentation. 
They were introduced by four brilliant minds in statistics and machine learning at Stanford and Berkeley, back in the early 80s. Breiman, Friedman, Olshen, and Stone were brought together by a stroke of destiny to work on the same problem. They revolutionized statistics and machine learning by inventing a very powerful tool, known as the classification and regression tree. Each of them worked on different parts. There is heavy, solid statistics underneath. There are a lot of practical implications, and there is the computer implementation side of the story. They constructed this interesting thing called CART in 1984, published the monograph, and went their separate ways. They remained friends for all the years after, but they never worked together on the same problem again. So we have a kind of monument to the big four being together, and then later each of them introduced alternative technologies like bagging trees, boosting, stochastic gradient boosting, all sorts of linear combinations, and so on. I don't want to bore you with that, but the interesting part is that in 1984 they introduced a technique that allows you to look beyond the conventional scatter plot and the simple faces of the data set, a technique that can actually dig deep inside.

Now, I will show examples of TreeNet analysis, which is ultimately the evolution of the tree-based technique from Jerry Friedman's mind. You can also try the other techniques, but as I said, it might take a long time to run them all. Anyway, here's how it works. At this point, I will be switching back and forth between the presentation and some of the software. I have the software running here, so I'll just close all of these windows, and hopefully the screen resolution will be enough; if not, I'll switch back to the presentation. You start by opening a data set, like this one, the batting data set. If I open it in Excel, it's a simple set of records, one per player, with performance stats. It's a flat table, there's nothing fancy about it, and you can download it or you can construct it on your own. This one has about 1500 records; that's 2010 batting performance for the whole season. So that's what I have as the input, and now I want to run this through the data mining engine, and it's as simple as saying, 'Okay'. Open my data set, and it could be an Excel spreadsheet or a conventional stats format. Say, open the data set. It has 1,245 records, it has 44 descriptors. I will skip the descriptive stats and go straight into the model. And here I say, okay, what I'm really interested in is whether the player is on the team that won the division, which is the division win variable. It's a yes/no variable: yes, the team won the division; no, it didn't win. Alternatively, I could have picked wild card wins, or league wins, or World Series wins. So you decide on the question to focus on, and you solve it. And prior to this, I ran a lot of different combinations of variables, I ran some variable selection techniques; in the interest of time I'll skip all that and simply focus on some of the core measures here.
So suppose you want to run the predictive model based on the number of at bats, number of hits, home runs, then you have runs batted in, the strikeouts. See, I had to learn all of these different terms in here: at bats, home runs, the number of hits, the runs batted in, and strikeouts. Well, and we could stop at that and see, well, I believe I ran more variables though in the other, and a quick peek. There was runs batted in, and home runs, strikeouts, at bat, hits- oh yeah, I forgot about runs, of course. So, let's introduce that variable, and they are all in here, so now we have six. I could go all the way to 15, if I want to, and there's a little bit of trial and error involved. Then what I work with is the TreeNet engine and there are alternative data mining engines there, there is no time to discuss them. Whenever you do data mining you need to be careful, because you can explain a lot of things you see in the data, but you also want to justify what you see based on independent testing.
So I'll pick 20% as my test sample; you'll see what it means. And I go straight into TreeNet, where I say learn with a learn rate of one, construct three hundred 3-node trees, and that's pretty much it. Now this is the part that we can train you on, you don't have to worry about it, it's relatively straightforward. It's just like setting simple modeling parameters: how hard and how deep you want to dig into your data set. Individual trees are used for that, and that's pretty much where it starts and ends. Then I hit 'Start'. The process goes very quickly in this case; after all, it's a small database. We get some performance measures, and again, I don't want to bore you with that. If you know what area under the ROC curve means, you'll realize that .61 is not a great modeling performance, but it's clearly better than random. Random is .5. So you're getting about 20% above random. So there is some predictability associated with the six core measures. But in my case, predictability is not entirely what I'm interested in. I actually want to see what's going on and how the model can be explained. To do that, you first go under 'Summary', to see that all of these measures entered significantly, so they're all kind of explaining what's going on in the target. If I had introduced more measures, you would have seen some of the measures dropping out, showing as unimportant. The information is already extracted from these core six. But what's more interesting, now I can look at the model decomposition in terms of plots, and when I click on the plots it shows these curves over here, and I placed those in the presentation. This is an example of how a data mining engine sees the data. It has no mind of its own, it's very powerful, and it wants to explain everything. Now, you as an analyst say, okay, I don't really care about these fine variations. When I look at this plot as is, I can actually approximate it by some kind of simple curve that goes like that.
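For readers who want the two numbers in this passage made concrete, here is a minimal Python sketch of a 20% holdout split and of the area under the ROC curve, computed as the share of (winner, loser) pairs that the model ranks correctly. It is a generic illustration, not how TreeNet computes its report; the function names, the seed, and the toy scores are all assumptions.

```python
import random


def holdout_split(records, test_fraction=0.2, seed=0):
    """Shuffle the records and set aside a fraction as an independent test sample."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]  # (learn sample, test sample)


def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Counts how often a positive case gets a higher score than a negative
    case (ties count as half). 0.5 is random ranking, 1.0 is perfect.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


learn, test = holdout_split(list(range(100)))
example_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # toy scores

# The "about 20% above random" remark, as relative improvement over 0.5.
lift_over_random = (0.61 - 0.5) / 0.5
print(len(learn), len(test), example_auc, round(lift_over_random, 2))
```

Read this way, an area of .61 means that a randomly chosen player from a division winner is scored above a randomly chosen player from a non-winner about 61% of the time.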
For the mathematically inclined, it can all be extracted at the formula level. But we are looking at this thing conceptually: what does it tell us? It says, okay, you increase the number of runs batted in, and you have a direct increase in the odds of your team winning the division. No surprise here, very clear intuition, isn't it? So now, look at the strikeouts: same story, but this time it's reversed. These are batters; if they have more strikeouts, the team tends to lose. The number of at bats is interesting, but the most interesting part, as you may have noticed, is the home runs. And that's the phenomenon I want to talk about a little bit more. Look at the curve here: initially you have the intuitive expectation, the more home runs, the better the odds of winning. But somewhere around 20 there is a collapse, and it says that if the batter had more than 20 home runs in the season, the team associated with that player actually lost the division. This is a typical example of data telling you something that you did not necessarily expect. And that's just the first glance, what we call one variable at a time.
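One common way to produce "one variable at a time" curves like these is a partial dependence calculation: fix the variable of interest at a value for every record, average the model's predictions, and repeat over a grid of values. The Python sketch below shows that idea under stated assumptions. That TreeNet builds its plots exactly this way is an assumption, not something the talk states, and `toy_model` is an invented stand-in for a fitted model, not the model from the demo.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction as one feature is swept over a grid of values.

    predict: function taking one record (dict) and returning a score.
    rows: list of records; all other features keep their observed values.
    Returns a list of (value, average_prediction) points for plotting.
    """
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = dict(row)       # copy so the original data is untouched
            modified[feature] = value  # force the feature to the grid value
            total += predict(modified)
        curve.append((value, total / len(rows)))
    return curve


def toy_model(row):
    # Hypothetical scoring rule: more runs batted in helps, strikeouts hurt.
    return 0.01 * row["rbi"] - 0.005 * row["so"]


# Invented player records, for illustration only.
rows = [
    {"rbi": 40, "so": 60},
    {"rbi": 90, "so": 100},
    {"rbi": 10, "so": 20},
]

curve = partial_dependence(toy_model, rows, "rbi", [0, 50, 100])
for value, avg_score in curve:
    print(value, round(avg_score, 3))
```

With a real tree ensemble in place of `toy_model`, the same loop would trace out wiggly curves of the kind shown on the slides, including a non-monotone shape like the home run curve.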
So let's move on and look at what we call bivariate plots, where we start looking at two variables at a time. And for instance, if you take the strikeout ratio and runs batted in, again, the plot might look somewhat fancy to many of you, but it really is very simple. It's no harder than looking at the CNN results of United States presidential elections, when you have red and blue and you show which candidate won which states, right? So you have the strikeouts here and you have the runs batted in here. Now this is the area where you actually have players observed in the season. And the players who ended up on the winning teams are color-coded in red, the players on the losing side are color-coded in blue, and green and yellow are somewhere in the middle. So what the plot tells us is that, okay, you need to have a small number of strikeouts and a large number of runs batted in to positively contribute to the team winning. And it kind of makes perfect sense, and of course you can focus on some of the fine parts. But you look at these plots and you're trying to see whether they tell you something interesting and something useful. Look at the other one here. We already noticed this thing about home runs, that there is potentially something bad going on there. So let's look at the next plot, which is the strikeouts versus the home runs. Again, when I present it as a colored map, you can see what's going on. The home runs are here; 20 is that magic number we saw on the other plot. The number of strikeouts: now, it's a well-known fact that the batters who work on the home runs tend to have a higher strikeout rate. So the cloud of points in general stretches that way. What is really interesting, though, is the color coding itself, which shows that in this case the players that were hitting the home runs were actually associated with the teams that lost the division. And again, I'm not including wild card and all of that other stuff. 
But this, in and of itself, is interesting, because you would normally want to see a hot spot here and not there, and I guarantee that if we were looking at the games where, say, Babe Ruth was involved, or one of those big, big ones, the big guns, you would definitely see the red spot out there, and that's why you're getting that kind of intuition in the back of your head. But this is not what happened in the 2010 season. So you had players that might have had the nice, camera hot spot moments, and the crowd's cheering, and their salary's going up, but as far as the team performance goes, it was a losing strategy. At least, this is what we can see at the very first glance at the batting performance in data mining.
Now let me quickly switch to the presentation here, just to highlight another point. I placed these plots over here, and in particular strikeouts versus home runs. So, take this plot that was extracted by the data mining engine by having a deep look at your data. If you are tempted to look at the same combination, strikeouts versus home runs, in a conventional stats approach, in other words I take my Excel spreadsheet and I plot those data against each other, this is what I see. And, as you can see, the red ones are wins, the blue ones are losses, as far as the division is concerned. But you cannot really see the pattern here. Why? Because of multi-dimensional shadows, or gloss projections, as we call it. I don't want to sound too mysterious here, but statisticians and data miners learned about this phenomenon a long time ago. There is no mystery there. If you are trying to project a globe onto a flat map, and if you don't do it right, you can get the North Pole projected onto the South Pole as one dot, right there on the map. And that's only 3D projecting onto 2D. Here I have 6D projected onto 2D. There are a lot of effects outside of these two that stretch in all these kind of weird dimensions. Once projected, they totally confuse what's really going on. And if you were to take one important point from this entire presentation, it is that the real power of real data mining lies in the fact that it looks deeper than just simple two-dimensional plots. It really has the power to construct something that extracts the importance of the given pair, or even a single signal, once the importance of everything else has been eliminated. So keep that in mind. But before I go further, at the presentation level here, what I did then: I took the derived stats and did something very similar to what you had in the previous example. You just basically say, here's my target, here are my derived stats as inputs, build a model. 
What I obtained was a model with essentially the same performance, in terms of predictive power, as the previous model. And that's not surprising, because derived stats essentially carry the same original information. The amount of predictability in the data set is essentially a feature of the phenomenon that you are studying. You can derive as many variables as you like, but usually you just change the way things are focused on. As you can see, once you start working with fancy ratios, like some of these slugging ratios, etc., etc., some of these get very complex and convoluted. But you can actually study and gain a lot of insights by just looking at those plots. And again, that's the reason that I picked TreeNet. I could have worked with CART to work with trees, but I think as far as modeler's intuition goes, these heat maps are great for identifying some of the interesting patterns, because they kind of explain what happened in that specific season, and I'm not an expert on baseball, per se. I know how the engine works and how to interpret all of that. I'm pretty sure the experts out there, when they look at these, might find a lot of useful, additional insights. And again, it's a very simple map. Each colored area represents some sort of dot, and the color coding is associated with either increased odds of winning or losing in terms of the 2010 season. It would be great to look at those for all of the seasons, say 2009, 2008, all the way back, maybe even combined. As I said, we have a wonderful source of data and a whole lot of things you can do.
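For anyone unfamiliar with the area under the ROC curve used as the performance measure for these models, here is a minimal sketch of what the number means: it is the chance that a randomly chosen winner gets a higher score than a randomly chosen loser, with ties counted as half. The function name `auc` and the toy scores are illustrative assumptions, not TreeNet output.

```python
def auc(scores, labels):
    """Fraction of (positive, negative) pairs where the positive is scored higher."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as a coin flip
    return wins / (len(pos) * len(neg))


# Toy examples (invented scores): perfect ranking, no information, reversed ranking.
perfect = auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
uninformative = auc([0.5, 0.5, 0.5, 0.5], [1, 1, 0, 0])
reversed_ranking = auc([0.1, 0.2, 0.8, 0.9], [1, 1, 0, 0])

# Relative improvement of the .61 quoted in the talk over the random baseline of .5.
relative_gain = (0.61 - 0.5) / 0.5
print(perfect, uninformative, reversed_ranking, round(relative_gain, 2))
```

A model that scores everyone the same lands at .5, which is why .5 is the "random" reference point, and .61 works out to a bit over 20% above it in relative terms.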
Now, to finalize this. Just to give you an extra added flavor, let's focus on pitching stats, which is essentially just a different data table in the database. The pitching stats, same here, can all be extracted and organized. We'll get a somewhat smaller table, not 1245, but something around 6 or 7 hundred. But again, I'm looking at 2010. I'm looking at what happened in the season, not so long ago. Pitching stats, again, have core stats and derived stats. And talking about how fancy people can get with deriving things, there is a ratio that is essentially a 10-line algorithm. I mean, you can Google it or Wikipedia it. You'll see there is a whole page describing how the ratio is calculated. There is a ratio that has a long, convoluted equation. But the beauty of data mining is that you can throw them all in and see which one really does a good prediction job, which could be one in 2010 and a different one in 2009. If I model pitching data, and again, the same target, won versus lost, I get even better performance in terms of overall level of predictability. And look at the variables that it suggests as the top important drivers. I have included the weight of players there and so on, I mean, just for the sake of curiosity. And again, to make a long story short, eight variables were identified; those are 2D plots. I want to focus your attention on this part here, these two. Because the others, you kind of see, go along with our intuition, but these two, we're talking about pitchers. The home run technically goes against the pitcher's record, right? That's the number of home runs the pitcher allowed, and you have the rising contribution to your team winning. And if you look at the number of wild pitches, again, technically it goes against the pitcher, sort of. But the pitchers who had a large number of wild pitches also contributed to their team winning. That's an interesting fact, that's an interesting trend that we find in the data. 
And again, let's look at the two-variable contributions, and a surprise is coming in the next slide. And again, here, you can see, let's say, the home runs versus wild pitches, that pair. There is this area over here that's hot, and it is associated with a large number of home runs allowed, again that magic 20 number, and the number of wild pitches exceeding 10. So, I'm working with a different data set right now and I'm finding the same set of conclusions, well, because batting and pitching go hand in hand, so you kind of accumulate it from one side and you accumulate it from the other side. But then you see this, and you can also look at all these other things, and in particular, what I did on the next slide: I looked at the strikeout ratio versus the number of home runs. I think on this display, it's this plot over here. Again, a large number of home runs is associated with large strikeout rates. And this time, in terms of pitching, this produces better odds for your team to win. If you look at a stats plot, you can't really see it, again, because of the phenomenon of multivariate projections. If you look at the conventional regression on that same data set, as I know a lot of you may have learned about that, you'll see that the conventional regression identifies this home run trend as, indeed, rising, for pitching in 2010. But it fails to give you all the details that we saw on the previous slide. And with that, I kind of rest my case, in terms of arguing why data mining is a lot more informative than conventional stats.
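The projection effect mentioned above can be reproduced with a small toy example. This is a sketch under invented assumptions: the data are synthetic, with one hidden variable `z` standing in for "everything else" outside the plotted pair, and the helper name `slope` is made up for illustration. Within each slice of `z` the outcome rises steadily with `x`, yet the flat 2D view of `x` against the outcome shows no trend at all.

```python
from statistics import mean


def slope(points):
    """Least-squares slope of y on x for a list of (x, y) pairs."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in points)
    den = sum((x - mx) ** 2 for x in xs)
    return num / den


# Synthetic rows of (x, z, y). Inside each z slice, y goes up by 1 per unit of x,
# but the z = 1 slice sits at higher x and is shifted down in y.
rows = [(x, 0, x) for x in range(0, 5)] + [(x, 1, x - 6) for x in range(4, 9)]

# Conventional 2D view: ignore z and relate y to x directly.
flat_slope = slope([(x, y) for x, _, y in rows])

# Looking inside each slice of the hidden variable instead.
slice_slopes = {
    z: slope([(x, y) for x, zz, y in rows if zz == z]) for z in (0, 1)
}

print("slope ignoring z:", flat_slope)
print("slope within each z slice:", slice_slopes)
```

In this toy data the flat slope comes out as zero while both within-slice slopes are one, so the simple scatter plot hides a relationship that is plain once the third variable is accounted for. With six inputs instead of three, there are many more ways for this to happen.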
What have we learned? In the 2010 regular season, it almost feels like people who were betting on the home run, but were not good enough at it, got their share of the spotlight, but in the end it resulted in the whole team losing the division. That's my initial conclusion. Of course, the real conclusions could be different; I mean, those are for the experts in the field. I point out the things that the data points out. You decide. We report, you decide. And to finish, a few words on data mining mythology, because when people learn about it, they all of a sudden think that now they have absolute powers. That they can become rich, or that they'll find an algorithm that'll explain everything, or they'll have something that they can do from start to finish. Never fall for these. Data mining is very powerful, but it will never replace your mind, your understanding, your expertise. It's a wonderful help, but not a replacement for you. And hopefully, I scored a home run, even though a home run does not necessarily mean that the team won; at least in 2010.


