Survival Analysis with CART, MARS, and TreeNet
CART, MARS, and TreeNet were originally developed to analyze cross-sectional data, where each observation or record in the data is independent of all other records and no explicit accommodation is made for either time or censoring. Fortunately, research in statistics has shown how to adapt these tools, as well as classical statistical tools such as logistic regression, to the analysis of time series cross-sectional data and survival data. This brief note outlines the approach, sometimes known as "discrete time survival analysis," and shows you how to set up your data to estimate survival or failure time models. The methods discussed here also apply to the analysis of web logs and other sequentially structured data. A collection of useful references is provided below.
The Nature of the Problem
In censored survival analysis, we normally want to understand the factors affecting the length of time one must wait before an event occurs. By far the best known examples of survival analysis come from biomedical research in which the length of time a patient survives after diagnosis of a cancer is studied, but many other uses are possible. In sociological studies, for example, the event might be "first marriage," so the research would focus on age at first marriage, and also, given a first marriage, time until divorce or separation, if any. Education scholars have studied the number of years required to complete an undergraduate (nominally four-year) degree. Engineers might analyze the factors affecting the time until a component (such as a hard drive) fails.
Survival analysis is increasingly being used in market research and customer relationship management (CRM), where topics of interest include churn (how long before a customer switches to a competitor), the length of time between an inquiry and a sale, or the length of time before a product or service is upgraded. It should be clear that survival analysis can be applied to a broad range of topics that have in common the need to understand or predict a continuous non-negative quantity.
Survival analysis becomes complicated only when the data are "censored," meaning that for some sizable fraction of the data the time to the event is not known. Most often, the time is not known because when the data were collected the event had not yet happened for select individuals, and indeed might never happen. For example, if we are following the cohort of residents of Paris born in 1980 and studying their age at first marriage, we might find that more than 20% have never married by 2007. Assuming that we must conduct the analysis now (in 2007), for 20% of the data we know only that the age at first marriage will be greater than 27, but we cannot know how much greater without waiting (perhaps for decades) until the person either marries for the first time or dies. For those who die before marrying, we can never know how long we would have had to wait for an answer had they continued living.
Hundreds of scholarly textbooks and thousands of research articles have been published on survival analysis, and specialized software for survival analysis models has been developed.
The question addressed here, however, is: how can we use modern data mining methods to analyze censored survival analysis data?
Two Approaches to Data Mining Survival Data
The first approach develops entirely new methods tailored explicitly for survival analysis. Richard Olshen, one of the four creators of CART, developed a system known as SURVANAL, a specialized variant of CART. Leo Breiman also developed an experimental survival-oriented version of Random Forests. Salford plans to release these products (the Olshen survival tree and the Breiman survival forest) at some future date as part of a broader set of survival tools. Other survival-oriented trees have been proposed by statisticians, but none of these methods have yet met with general acceptance.
The second approach, and the one we take here, is to adapt already existing tools such as CART, MARS, and TreeNet so that they yield correct results for censored data. This approach requires that you have the right kind of data and that you prepare the data correctly. If the data preparation and management are done correctly, then effective survival models can be developed using CART, MARS, and TreeNet.
Data Requirements and Data Management
For the approach we are suggesting here, data collected at roughly regular intervals for the subjects being studied are ideal. For example, in biomedical studies, patients might be observed once a month and various measurements such as weight, blood pressure, lab results, and treatments recorded. The outcome of interest, such as disease status (cured/not cured or alive/dead), would also be recorded. In CRM studies such as telecommunications churn, data are usually available monthly, including bill amount, call volume, types of services, interaction with a call center, timeliness of payments, etc.
In some cases, however, data are available only at irregular intervals. For example, in e-commerce we may know when a registered user visits a website, but long and irregular spaces may occur between visits. This type of data is also amenable to the analysis we suggest.
Setting Up Your Data
We begin with a description of how to set up your data. Assuming you have regularly spaced data (such as daily, monthly, quarterly, or even yearly), you will want to organize your data with one record per observation period. Thus, with monthly data availability, you should prepare an analysis file containing one record per month. The analysis will need to be a flat file, so unchanging data will simply be repeated on every relevant row. As an example, we have created a fictitious file pertaining to mobile phone customers.
ID    MonthYear  Age  Gender  TimeOnFile  Mins  Calls  Churn  Churn_in_2
1001  200403     22   M       6           132   54     0      0
1001  200404     22   M       7           89    22     0      1
1001  200405     22   M       8           35    10     0      .
1001  200406     22   M       9           18    2      1      .
1335  200401     35   F       0           407   85     0      0
1335  200402     35   F       1           338   91     0      0
1335  200403     35   F       2           667   126    0      0
1335  200404     35   F       3           125   20     0      0
1335  200405     35   F       4           644   107    0      .
1335  200406     35   F       5           688   87     0      .
2877                                            14     0      0
2877                                            7      0      0
2877                                            1      0      0
2877                                            27     0      0
2877                                            32     0      .
2877                                            43     0      .
In the above file we have partial histories of three customers, with ID=1001, ID=1335 and ID=2877. Customer 1001 has six previous months of history at the point this file is started and eventually cancels his account in June 2004. Customer 1335 is a new customer in January 2004 and is still a customer when we last have data for her in June 2004.
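The flat layout described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two input structures (`customers` and `monthly_usage`) and the helper `build_person_period` are hypothetical names, and your own source tables will almost certainly be organized differently.

```python
# Hypothetical inputs: one static record per customer and one usage record
# per customer-month. Field names follow the example file above.
customers = {
    1001: {"Age": 22, "Gender": "M"},
    1335: {"Age": 35, "Gender": "F"},
}
monthly_usage = [
    {"ID": 1335, "MonthYear": 200401, "Mins": 407, "Calls": 85},
    {"ID": 1001, "MonthYear": 200403, "Mins": 132, "Calls": 54},
    {"ID": 1001, "MonthYear": 200404, "Mins": 89, "Calls": 22},
]


def build_person_period(customers, monthly_usage):
    """Return one flat row per customer-month, repeating the static fields."""
    rows = []
    ordered = sorted(monthly_usage, key=lambda r: (r["ID"], r["MonthYear"]))
    for usage in ordered:
        static = customers[usage["ID"]]
        # Unchanging data (Age, Gender) is copied onto every monthly row.
        rows.append({**usage, **static})
    return rows


flat_file = build_person_period(customers, monthly_usage)
for row in flat_file:
    print(row)
```

Sorting by ID and period keeps each customer's history contiguous, which makes the later steps (targets, lagged predictors, partition flags) easier to compute one history at a time.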
These data have been set up to properly estimate a variety of survival or hazard models. The simplest model would estimate the probability of churn in the month in question. For most records, the churn indicator is 0, but for those who do churn, it is 1 in the last record of data we have for them. In many cases the model we want to estimate is not "did the event occur this month?" but "will the event occur two months from now?" or perhaps six months from now. The latter questions can be addressed by first creating a new column corresponding to the number of months until a churn is observed. A new target can then be defined as Churn_in_2 = 1 if the customer churns two months from now, Churn_in_2 = 0 if churn does not occur two months from now, and Churn_in_2 = missing if the churn is less than two months away. (The target is set to missing for time periods too close to the churn to ensure that these records are excluded from the analysis when we are trying to predict Churn_in_2. In the example file it is also missing for the last two records of a customer who is still active, because the outcome two months ahead has not yet been observed.)
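A minimal sketch of this target construction follows. It assumes each customer's rows are sorted by period and that periods are consecutive with no gaps; the function name `add_churn_target` and the column name `MonthsToChurn` are illustrative choices, not part of any Salford tool. Missing values are represented by `None`.

```python
def add_churn_target(history, lead=2):
    """Add 'MonthsToChurn' and 'Churn_in_<lead>' to one customer's history.

    history: list of row dicts sorted by period, each with a 0/1 'Churn' flag.
    Assumes consecutive periods (no gaps). None stands for "missing".
    """
    target = f"Churn_in_{lead}"
    # Position of the churn record, or None if the history is censored.
    churn_pos = next((i for i, r in enumerate(history) if r["Churn"] == 1), None)
    last = len(history) - 1
    for i, row in enumerate(history):
        if churn_pos is not None:
            gap = churn_pos - i
            row["MonthsToChurn"] = gap
            if gap == lead:
                row[target] = 1          # churn occurs exactly `lead` months ahead
            elif gap > lead:
                row[target] = 0          # still a customer `lead` months ahead
            else:
                row[target] = None       # too close to the churn: exclude
        else:
            row["MonthsToChurn"] = None
            # Censored history: the outcome is known only if we have data
            # for the period `lead` months ahead.
            row[target] = 0 if i + lead <= last else None
    return history


customer_1001 = [{"Churn": c} for c in (0, 0, 0, 1)]
add_churn_target(customer_1001)
print([r["Churn_in_2"] for r in customer_1001])
```

Applied to a four-month history ending in churn, the target is 0, 1, missing, missing, which matches the pattern shown for customer 1001 in the example file.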
In most studies of this type you will want to create backward-looking predictors such as "calls last month" and "calls two months ago," summary predictors such as "average calls last three months," and durations such as "time since last handset upgrade." Such predictors appear in most survival analyses and good examples appear in many published papers. (See the bibliographies in the referenced papers below for more.)
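As a rough illustration, backward-looking predictors of this kind can be derived one history at a time. The sketch below assumes rows sorted by period with no gaps; the function name and the generated column names are hypothetical. It also assumes that "average calls last three months" means the three months before the current one, which is a choice you may want to make differently.

```python
def add_lagged_predictors(history, field="Calls", window=3):
    """Add lag-1, lag-2, and trailing-average predictors for one history.

    history: list of row dicts sorted by period (consecutive periods assumed).
    Values that would require data before the start of the history are None.
    """
    values = [row[field] for row in history]
    for i, row in enumerate(history):
        row[f"{field}_lag1"] = values[i - 1] if i >= 1 else None
        row[f"{field}_lag2"] = values[i - 2] if i >= 2 else None
        # Average over the `window` periods strictly before the current one.
        prior = values[max(0, i - window):i]
        row[f"{field}_avg_prev{window}"] = (
            sum(prior) / window if len(prior) == window else None
        )
    return history


calls_history = [{"Calls": c} for c in (54, 22, 10, 2)]
add_lagged_predictors(calls_history)
print(calls_history[3])
```

Computing lags within a single history, rather than across the whole file, prevents one customer's last record from leaking into the next customer's first record.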
Comments on the Target Variable
Once your core data have been set up in the recommended format you will have considerable flexibility in defining your target (dependent) variable. As we noted above, you might be forecasting two months ahead for a customer relationship program (customer retention). However, your forecasts could be less specific and simply predict whether an event occurs at any time in the next 12 months (a common question for insurers wanting to predict claim frequency). Note that your predictive accuracy is likely to be much higher for the latter problem.
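The less specific "any time in the next N periods" target can be sketched as follows. The handling of censored records here is an assumption on our part rather than a rule stated in this note: a record is scored 0 only when the full window of future periods has been observed without an event, and is otherwise left missing (`None`). The function and column names are hypothetical.

```python
def add_window_target(history, horizon=12, event_field="Churn"):
    """Add 'Event_within_<horizon>' to one subject's history.

    1    = the event occurs in one of the next `horizon` periods
    0    = all `horizon` future periods were observed and none had the event
    None = the window is not fully observed (or the event has already occurred)
    history: list of row dicts sorted by period, consecutive periods assumed.
    """
    name = f"Event_within_{horizon}"
    flags = [row[event_field] for row in history]
    for i, row in enumerate(history):
        future = flags[i + 1 : i + 1 + horizon]
        if any(future):
            row[name] = 1
        elif len(future) == horizon:
            row[name] = 0
        else:
            row[name] = None
    return history


example = [{"Churn": c} for c in (0, 0, 0, 1)]
add_window_target(example, horizon=2)
print([r["Event_within_2"] for r in example])
```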
Managing Train, Test and Validate Data Partitions
In discrete time survival analysis, the unit of observation is not a row of data but a history, which is the collection of all records pertaining to a given subject. In the example above, customer 1001 is a unit of observation and the rows of data belonging to ID=1001 must be assigned as a block to either the training, test, or validation samples. To accomplish this you will need to create your train/test/validate flags as part of your core data preparation. You can also assign an entire block of records to a cross-validation bin if you prefer to use cross validation as your model performance testing method. The most recent versions of Salford Systems tools allow you to define your own CV partitions in an indicator variable. For 10-fold cross validation, for example, this variable would take the integer values 1, 2, 3, ..., 10. If you create your own partitions you will be responsible for appropriate balancing of the event (death, churn, etc.) frequencies across the CV partitions.
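One way to build such a CV indicator is sketched below. It is a minimal illustration, not the procedure used inside any Salford tool: subjects are split into those who ever experience the event and those who do not, and each group is dealt across the bins in turn so that event frequencies stay roughly balanced. The names `assign_cv_folds` and `CVBin` are hypothetical.

```python
import random
from collections import defaultdict


def assign_cv_folds(rows, n_folds=10, seed=0, event_field="Churn"):
    """Return {ID: fold} with folds numbered 1..n_folds.

    Every row of a subject shares one fold. Subjects with and without the
    event are shuffled and dealt out separately to balance event counts.
    """
    had_event = defaultdict(bool)
    for row in rows:
        had_event[row["ID"]] = had_event[row["ID"]] or row[event_field] == 1

    rng = random.Random(seed)
    folds = {}
    offset = 0
    for stratum in (True, False):
        ids = sorted(i for i, e in had_event.items() if e == stratum)
        rng.shuffle(ids)
        for k, subject in enumerate(ids):
            # Continue dealing where the previous stratum stopped so that
            # overall fold sizes stay as even as possible.
            folds[subject] = (k + offset) % n_folds + 1
        offset += len(ids)
    return folds


def flag_cv_bins(rows, folds):
    """Write the subject-level fold onto every row as 'CVBin'."""
    for row in rows:
        row["CVBin"] = folds[row["ID"]]
    return rows
```

The resulting `CVBin` column can then serve as the user-defined partition variable described above.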
We create train/test partitions for event history data in our own work by first randomly assigning the person IDs to a partition and then joining the partition assignment table with the main data table, thus ensuring that every row of data is flagged with the appropriate partition. You may prefer to use a more complex assignment method based in part on the length of history available and other predictor variables (for stratified sampling purposes).
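The assign-then-join procedure just described might look like this in Python. The function name, the `Partition` column, and the 30% test fraction are illustrative assumptions; the essential point is that the random draw happens once per ID, never once per row.

```python
import random


def flag_partitions(rows, test_fraction=0.3, seed=0):
    """Assign each ID to 'train' or 'test', then flag every row accordingly."""
    # Step 1: random assignment at the subject (ID) level.
    ids = sorted({row["ID"] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_test = round(len(ids) * test_fraction)
    partition = {
        subject: ("test" if k < n_test else "train")
        for k, subject in enumerate(ids)
    }
    # Step 2: join the assignment table back onto the main data table.
    for row in rows:
        row["Partition"] = partition[row["ID"]]
    return rows


demo_rows = [{"ID": i, "Month": m} for i in (1001, 1335, 2877) for m in range(4)]
flag_partitions(demo_rows, test_fraction=0.34, seed=7)
```

Fixing the seed makes the partition reproducible, which is convenient when the same histories must be held out across several modeling runs.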
Note that if you fail to do the above and use simple random assignment at the ROW level, you are likely to obtain wildly optimistic results with considerable overstatement of the predictive power of your models. You must never use the built-in random assignment to train or test or random assignment across cross-validation bins for discrete time survival analysis data!
Justification of the Approach and Literature
A survivor function is often derived from an underlying hazard function, which is the probability that an event occurs in the next small unit of time, given that the event has not occurred until now. Once you know the hazard function you can derive any other survival-related function. It should be plain that if the data are collected sufficiently frequently and we model the data in the discrete time format discussed above, we are fitting a hazard model. This method is not only acceptable, it is often considered ideal if the data are sufficient to support such a model. Although the requirement that the data be collected frequently enough is important, it is often glossed over in practice, so even annual data could be analyzed as discrete time hazard data.
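To make the link between hazard and survivor functions concrete: in discrete time, the probability of surviving through period t is the product of (1 - hazard) over periods 1 through t. The short sketch below applies that identity; the hazard values are made-up numbers standing in for the per-period event probabilities a fitted model would predict, and the function name is hypothetical.

```python
def survival_curve(hazards):
    """Convert per-period hazards h_1, h_2, ... into survivor probabilities.

    S(t) = (1 - h_1) * (1 - h_2) * ... * (1 - h_t)
    """
    curve = []
    alive = 1.0
    for h in hazards:
        alive *= 1.0 - h   # must survive this period given survival so far
        curve.append(alive)
    return curve


# Illustrative hazards only; in practice these would be model predictions
# for one subject's successive periods.
print(survival_curve([0.1, 0.2, 0.5]))
```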
The articles and books listed below are fairly accessible and comprehensible and are a good place to start learning about discrete time survival analysis. These references in turn cite other more mathematically rigorous treatments for the more technical researcher.
Allison, P. D. (1984). Event History Analysis. Beverly Hills, CA: Sage.
Brown, C. C. (1975). On the use of indicator variables for studying the time dependence of parameters in a response-time model. Biometrics, 31, 863-872.
Cox, D. R. (1972). Regression Models and Life Tables. Journal of the Royal Statistical Society, B34, 187-220.
Singer, J. D. & Willett, J. B. (1993). It's about time: Using discrete time survival analysis to study duration and the timing of events. Journal of Educational Statistics, 18(2), 155-195.
Yamaguchi, K. (1991). Event History Analysis. Beverly Hills, CA: Sage.