INTERVIEWS WITH FOUNDING FATHERS OF DATA SCIENCE.
In March 2004 Salford Systems hosted a data science conference in honor of the four co-authors of the 1984 CART monograph (Classification and Regression Trees) Leo Breiman, Jerome Friedman, Richard Olshen, Charles Stone. All four authors spoke extensively about their careers.
As part of our preparations for the conference we interviewed each of these pioneers separately, recording and transcribing the conversations. Never before released, we now begin with a considerably shortened version of our interview with Richard Olshen. Fortunately, shortly after our interview, Olshen wrote a detailed technical overview of his and related research into the mathematical foundations of CART, published in 2007 (see references at the end).
That published paper should serve as the official exposition of Olshen’s technical thinking. The interview was intended to provide us with more of the back story including how the four CART authors came to know each other and work together and how their thinking evolved.
Edited Interview with Richard Olshen (2004):
Q. As someone who was trained in conventional statistics, how did you come up with the different approach embodied in CART?
A. I first got interested in binary tree structure for classification in 1974. At that time I was working half-time at the Stanford Linear Accelerator Center (SLAC) and half- time at the Stanford Department of Statistics. Jerry Friedman was the head of the computation research group at SLAC, so in a sense I was employed by him. I had known about Jerry for many years. My first wife, Vivian, was related to him by marriage, but I had never met him prior to joining SLAC. My first wife Vivian died very young, of illness, in 1973. We had been living in Ann Arbor, where I was at the University of Michigan, but I came back to the Bay Area because I had two small children and needed the help of my mother and other relatives to care for them. When I met Jerry at Stanford, we hit it off pretty well from the beginning. At the time an explicit problem related to the bubble chamber and the linear accelerator motivated Jerry's and my conversation. My memory of the problem was this: Particles came flying down the hill and hit some kind of target and were shattered. In the particular interaction of interest to us, two kinds of particles flew out from the collision, electrons and positrons. They went through an electric and plastic sandwich that had something like nine layers. The problem was to do something automated so that in the future when particles came through this sandwich, you could say this was the vectorial voltage of a positron. Several groups were working on binary tree structure algorithms at this time. The conventional wisdom was that you can partition a feature space with binary splits, do something to develop a base rule of the terminal regions of a tree that are associated with the partitioning and come up with decision rules. That worked fairly well for the learning sample, but the conventional wisdom was that it wasn't much good for subsequent samples if these rules weren't validated. Work by Cover and Hart seemed to shore up this conventional wisdom because at the end of the day, even if you vastly overbuilt your tree, you were going to get a nearest neighbor rule, but if the samples were large enough, that wasn't so bad. Jerry and I wondered if we could do better. So I called my colleague Lou Gordon at Ann Arbor to talk about the work being done there by the University of Michigan's Institute for Social Research.
Q. So other important tree-related research being done at this time?
A. Around 1972, Mervyn Stone and Dennis Lindley developed the idea that we ought to be able to learn how a rule behaves by some kind of internal validation scheme, perhaps by dividing the sample into K pieces and trying to make inferences for the one piece left out. That is now known as the process of cross validation. Although I was trained as a Bayesian and that is not a particularly Bayesian idea, it resonated with me. Even Jimmy Savage, who was my main teacher in graduate school and a Bayesian if ever there was one, admitted he didn't want to give up the inferential powers of permutation tests because there was something right about them. Even if they couldn't be justified by Bayesian arguments, they smelled too good to be a bad idea. Better to give up philosophy than to give up something that's patently a good idea. At the same time, there were two other threads that led to my being persuaded to the tree-structured way of thinking. I had met Chuck Stone in the late 1960s, when he worked with me on a problem in probability that had to do with random walks. Chuck was at UCLA, which was a plus when I considered taking a job at UC-San Diego. I liked Del Mar and there were several people in the math department at UCSD whose work I had studied. The idea of being on a medical school faculty also appealed to me on several levels, not least because it was a way of paying homage to Vivian's memory as she had a Ph.D. in biometry. At a meeting in Los Angeles, I got into nearest neighbor rules, which are fairly close to CART rules. In addition to the nearest neighbor connection with Chuck, I had the connection with Jerry Friedman and working on the data analytic problems at SLAC, which, after all, was my job. I also had a very deep connection in a personal way to the University of Michigan Institute for Social Research, so nothing about tree-based analysis seemed foreign to me. I must say that most of all I thoroughly enjoyed my conversations with Jerry. Even more than all that, there was math to be done. CART is basically an empirical decision rule, that is to say, you think of a response and think of carving up the space of the predictors in terms of making the response as homogeneous as possible within the regions. You want to partition the space of the X's, but you want to partition it so that the Y is as homogeneous as possible within the regions of the DX's that you find. That's exactly what a Lebesgue interval consists of. So I called my friend and former graduate student Lou Gordon at Stanford. I told him I was talking to Jerry about these rules from engineering and physics points of view, but that I wanted to work out the math. We used to get together regularly and just talk. At the time we never had a plan to write a paper.
Q: What year was this?
A. It started in 1974 and proceeded through 1975. That's what led to Lou's and my paper in the Annals in Statistics that actually didn't appear until 1978. (See references). The paper basically showed that under the right circumstances, if the terminal nodes were big enough, a binary tree-structured rule could yield Bayes-risk efficient classification for any problem provided the features were Euclidian and the outcome was discreet and finite. That was novel in its day and grew out of Chuck Stone's paper on nearest neighbors that appeared in 1977. (See references). Chuck's paper was something Chuck and I had talked about a lot. I read his paper and told him it was dynamite, and that it wasn't too far from our tree stuff because each of our sets of rules were local. It's just a question of: if I've got a point in feature space, what do I take as a near neighbor in order to agglomerate a rule for that point in the feature space? Do I take something that comes as a region after finally splitting? Do I take the nearest neighbor? Do I take some other way of partitioning? We can think generally of taking a point and defining a neighborhood that's salutary for making an inference about the conditional expectation of Y given an X at that point. I found Chuck's paper absolutely brilliant. It was very closely related to some things in ergodic theory. Chuck called and told me he had sent the paper to Annals of Statistics and they had summarily rejected it! After some further communication with the editor, the paper ended up not only being published as a lead article, but being published as a discussion paper, which is pretty unusual.
Q. And you were one of the discussants?
A. Yes, along with a number of others in the field, including Leo Breiman.
Q. Had you met Leo by this time, or know of him?
A. No. I first became aware of Leo due to his and Chuck's Technology Services Corporation (TSC) Technical Report, probably about 1978. (See references below). Sometime around 1981, Chuck Stone's son had a bar mitzvah and Chuck invited me and Leo and made certain that Leo and I were seated next to each other at the reception. I remember as if it were yesterday, meeting Leo for the first time and having a lengthy conversation about trees. We had both been thinking about them, though not in the same way, so we tried to do a verbal file merge of what we knew. The union was better than either one of us had been able to accomplish alone. I had known Chuck Stone since the late 60Õs and had known and worked for Jerry for a long time during the 1970Õs, so finally meeting Leo in 1981 completed the square. When the four of us met to discuss the writing of what was to become the CART monograph the others mostly wanted empirical medical studies from me. I also wrote the first draft of the theoretical foundations of CART in Chapter 12 but Chuck didn't like it and completely rewrote it.
Q. Chaper 12 is one of the asymptotic theory chapters, right?
A. Yeah. I never really critiqued much of what Leo wrote, but I critiqued what Chuck wrote very carefully. We talked especially about Chapter 10, and also about the inconsistency of the bootstrap for nearest neighbor rules, when X and Y are independent, which is in Chapter 11. There are two pieces to asymptotic CART. The first is whether, when you partition a feature space, you would get a rule from that partition that is close to the rule you would get for a point inside that partition. So, if you think of having a terminal node and a point inside that terminal node, you can say the rule is in some sense "mushed" over the box. Then you can ask yourself how close that is to the rule at a particular X inside that box. It's probably close if the box is small. That's one principle that has a bearing on CART rules, which leads you to the theory of the differentiation of integrals as it bears on CART properties. The other thing you can say is that given you've got a little box and given that you really have data and that the box doesn't just come from an oracle (because you don't know the oracle rule), all you have are X's and Y's, so you take the Y's, throw them in the box and try to agglomerate them. That's then your decision rule for what goes on in the box. So that gets you into the whole issue of empirical process. Lou and I used a crude, at the time brilliant, but in retrospect crude, result from a paper by Jack Kiefer that appeared about 1961. Basically, there's a certain combinatorial property called the Vapnik-Chervonenkis property that has to do with what's called "shattering of sets." That condition is sufficient to get a large deviation result for the difference between the empirical and the true probabilities of a box or, alternatively, a region. If the class of sets about which you're asking has a certain combinatorial property and if the property comes down to that, it shouldn't be too complex because it has to do with subdivisions with finite numbers of points and Euclidean spaces or other spaces.
Q. When you say "too complex," do you mean that the box shouldn't have too many conditions, too many dimensions?
A. Yeah. For example, if it's a polygon, it shouldn't have too many sides. If it's the set of positivity of a polynomial in the variables of Euclidean space, that polynomial should be a bounding degree. The point is that Chuck learned from David Pollard at Yale, and perhaps from others, that the Vapnik-Chervonenkis stuff really had something to say about CART trees and about the empirical half of CART consistency results. Chuck taught that to me and I taught it to Lou. Lou and I used it in our third 1984 paper and it also played a prominent role in the CART monograph’s Chapter 12.
Q. But you related it to a Vapnik result, did you say?
A. Yes. Basically, if you look at a terminal note of a CART tree, then as long as you split parallel to the coordinate axes and the D dimensions, there are at most two D sides, which is well within the bounds of fitting it to the Vapnik-Chervonenkis class, and so the large deviation results from Vapnik-Chervonenkis fails. Now it gets a little trickier because, and with apologies for the mathematics, if you take any finite number of points and a Cartesian product of intervals, then that makes it compact with respect to the product intervals. What that tells you is that there is a binary tree you can think of as being like a product of finite sets because you have a root node and each of those has two daughter nodes, which in turn may or may not have daughter nodes. In any given level of a tree, you can have only a certain number of sub-nodes of all the nodes that are that level. So, even if you allow for linear combination splits, which would mean the sides, that is, the number of sides of a box could be unbounded. If you had only linear combination splits you could get a polytrope that has one side for every node you are down the tree, but you still have a tree. So there should be some compactness lying around, for the same reason that the product of finite sets is compact. And there is, and so the Vapnik-Chervonenkis result really shouldn't worry about cuts being parallel to the coordinate axes. It turns out if you look at things the right way, you can get a large deviation result that will get you around that problem. The was shown by Andrew Nobel in 1996. But what makes that hard, and I'm going to get away from the mathematics so this will be the end of this particular diversion, the thing that makes the subject interesting to me from a mathematical point of view is that you need a kind of niceness to the set to do the differentiation of integrals part and you need a kind of set to do the empirical process part of studying CART and asymptotics, but those two conditions are not the same. That is, what's simple for Vapnik-Chervonenkis is not necessarily simple for the differentiation of integrals because the latter is inevitably tied up with topology and the emprical process stuff is not as purely combinatorial. That's why the logicians are interested in it.
Q. So the mathematical challenge was actually a part of what intrigued you.
A. Oh, yeah, I loved it. It was great.
Q. But at the same time, I think you've always been the one of the four CART co-authors who has been most interested in CART's use of real world data.
A. A lot of that had to do with my job, because at both Stanford and UC-San Diego, I was in the math department, but my billet came from the medical school, so I was also a professor in the medical school. I've always been interested in both medicine and mathematics. Each side always informed the other. I never found any conflict between them.
Q. At the time you started your early work on applying CART to the medical data, the cross validation hadn't been worked out yet, is that right?
A. Well, for example, in the medical stuff that had to do with predicting whether a person had had a heart attack when he or she came to the Yale New Haven emergency room with a chief complaint of acute chest pain, there was an important validation sample from Brigham and Women's Hospital in Boston. That's what made the subject of real interest because even if you have the world's best internal validations, the one problem with epidemiology is that the prevalence of various joint distributions of things changes. That is, if you make an inference on left-handed, red-haired women in San Francisco and want to extrapolate that to all women in Italy, it doesn't work.
A. Because they're not the same. Lee Goldman argued that an emergency room is an emergency room, a patient with acute chest pain is a patient with acute chest pain and if you can tell me something important about a patient in New Haven, you'd better be able to tell me something about a patient in Boston. So Goldman recruited 12,000 to 15,000 patients from a dozen or more hospitals in various places. If CART-like rules were going to be any good, they had better validate across the places.
Q. So Goldman was working with you in the applied work. Is that correct?
A. Yes. He worked with me because I happened to meet him when we were both sitting on the floor in the corridor waiting for our printouts on the 12th floor of the Health Sciences computing facility in the Harvard School of Public Health. He was trying to do a classification problem -- the heart attack problem. And I said, hey, I think I know something that can help you. By that time, Lou Gordon and I had written one paper and a better part of another one. Chuck's nearest neighbor paper was in print, Lou's and my first paper was in print, and Jerry and I had had conversations off and on for years. Then I gave a short course at the Harvard School of Public Health about recursive partitioning. I don't remember all the players, but a group of us met once or twice a week for months and talked a lot about Lee's data.
Q. At this point this was an empirical investigation?
A. Yes, but the notion of validation wasn't lost on us. At that point I think we used the bootstrap for validation. I'd learned about it in 1978 or 1979 from Brad Efron. Admittedly, it's not so great for single nearest neighbor rules when X and Y are independent, but the truth is that a suitable bootstrap is not a bad way to validate a lot of trees. A lot of people have used it for trees, and now we have the .632 plus rule of Ephron and Tibshirani, but that didn't exist then, so we had to fool around with the plain vanilla bootstrap.
Q. Could you help me a little by giving me a chronology of where you were when? First, where did you study?
A. My A.B. was from UC Berkeley, 1963 and my Ph.D. was from Yale in 1966. both in statistics. For a year I was a research statistician and lecturer at Yale, teaching in the math and stat departments. That was 1966-67.
Q. Where did you go after Yale?
A. To Stanford in 1967, where I met Chuck Stone. In 1970, I was at Columbia for a year, came back to Stanford for a year and then went to Ann Arbor, for what I thought would be a permanent position. It was my first tenure job. I was in the Stat and Math departments at the University of Michigan. However, tragedy befell me and my family, and I needed help with my two small children, so I came back to Stanford in 1973-75. I went to UC-San Diego in 1975 except for a couple of sabbaticals, including one in Boston in 1979-80.
Q. So in 1979 you were back at the Harvard School of Public Health?
A. And in the MIT math department, where I learned a lot from Dick Dudley about empirical processes. So I got a good dose of the medicine from Harvard as well as empirical processes from MIT.
Q. When did the four of you start writing the CART monograph?
A. I almost don't know how to answer that. The original version of Chapter 12 grew out a variety of sources, including my 1978, 1980 and 1984 papers. The pruning stuff owes a lot to the Breiman and Stone TSC 1978 Technical Report. So the monograph was a furthering of ideas from a lot of different places, definitely not starting from scratch. Many of the ideas were pretty mature by that time. But when it came time to put things together, every word was written fresh for the monograph. It wasn't as if we just cut and pasted. Jerry Friedman had written the original CART software which is now distributed by Salford. It matured because the rest of us badgered him about this and that and we should have this part of it and we should have this splitting rule and we shouldn't have that. It all matured over a period that I would say extended from 1974 to 1983.
Q. So when do you think the first real CART appeared? Or a proto-CART, but close enough to deserve the name? Was it as early as 1974?
A. Well, certainly when the idea of overgrowing trees and pruning them back on the basis of cross validation was developed. I think you would have to say that really appeared in print for the first time in the 1978 Breiman and Stone Tech Report.
- Leo Breiman, Jerome Friedman, Richard Olshen, Charles Stone. (1984) Classification and Regression Trees. Wadsworth.
- Leo Breiman, Charles Stone. (1977) Parsimonious Binary Classification Trees. Technology Service Corporation, Santa Monica, Ca., Technical Report. http://media.salfordsystems.com/pdf/201601_Parsimonious_Binary_Classification_Trees.pdf
- Louis Gordon and Richard A. Olshen, (1978) Asymptotically Efficient Solutions to the Classification Problem, The Annals of Statistics Vol. 6, No. 3 (May, 1978), pp. 515-533)
- Richard Olshen (2007) Tree Structured regression and the Differentiation of Integrals, Annals of Statistics,Vol. 35, No. 1, 1-12, and available online as http://projecteuclid.org/download/pdfview_1/euclid.aos/1181100178
- Charles J. Stone, (1977) Consistent Nonparametric Regression. The Annals of Statistics, Vol. 5, No. 4 (Jul., 1977), pp. 595-620.
- Jack C. Kiefer, (1961) On Large Deviations Of The Empiric D. F. Of Vector Chance Variables And A Law Of The Iterated Logarithm. Pacific Journal of Mathematics, Vol 11, No. 2, 649-660.
- Lou Gordon, and Richard Olshen, (1984). Almost surely consistent nonparametric regression from recursive partitioning schemes. J. Multivariate Anal. 15 147-163.
- Nobel, A.,1996, Histogram regression estimation using data-dependent partitions. Ann. Statist. 24 1084-1105.
- Lugosi, G. and Nobel, A.,1996, Consistency of data-driven histogram methods for density estimation and classification. Ann. Statist. 24 687-706.
- Lee Goldman, et. Al. (1982) A Computer-Derived Protocol to Aid in the Diagnosis of Emergency Room Patients with Acute Chest Pain N Engl J Med 1982; 307:588-596