Teradata Center for CRM at Duke Competition
Predicting Customer Churn with TreeNet®
Customer churn presents a particularly vexing problem for the wireless telecommunications industry, with 20-40% of customers leaving their provider in a given year. As once-explosive subscriber growth rates slow down, retaining existing customers becomes increasingly important to a company's overall profitability. If the customers who are likely to churn can be identified, the company can target them with retention campaigns, giving them an incentive to stay and preventing loss of revenue.
The Teradata Center for CRM at Duke University set out to discover the best methods for determining which customers are most likely to churn. They posted an open challenge to data analysts and modelers: using customer records from a major wireless provider, predict which subscribers would leave the company in the next two months.
Entrants were free to use whatever analysis methods they wished. When the competition ended, the submissions were compared against the actual churn outcomes over two different time periods. Two accuracy measures were then used to judge the submissions, for a total of four categories. The competition officials also conducted a "meta-analysis" to see which methods generally produced the most accurate results.
Salford Systems was declared the winner in all four categories. Salford's models were created with their TreeNet® software, an innovative form of boosted decision trees known for building extremely accurate models. Across all the entries, the judges found that decision trees and logistic regression methods were generally the best at predicting churn, though they acknowledged that not all methodologies were adequately represented in the competition.
Salford's TreeNet models captured the most churners across the board and discovered which of the 171 possible variables were most important for predicting churn. In the top 10% of customers, TreeNet found 35-45% more churners than the competition average and three times more than would be found in a random sample. For companies with large subscriber bases, this could translate to the identification of thousands more potential churners each month. Targeting these customers with an appropriate retention campaign could save a company millions of dollars each year.
The data were provided by a major wireless telecommunications company from its own customer records for the second half of 2001. Account summary data were provided for 100,000 customers who had been with the company for at least six months. To assist in the modeling process, the churners were oversampled so that one half of the sample consisted of churners (those who left the company by the end of the following 60 days) and the other half consisted of customers who remained with the company for at least another 60 days. A broad range of 171 potential predictors was made available, spanning all the types of data a typical service provider would routinely have on hand. Predictor data included:
- Demographics: Age, Location, Number and ages of children, etc.
- Financial: Credit score, Credit card ownership
- Product details: Handset price, Handset capabilities, etc.
- Phone usage: Number and duration of various categories of calls, etc.
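The 50/50 oversampling of churners described above can be illustrated with a small sketch. This is not the data provider's actual procedure, which the text does not describe; the function name `balanced_sample`, the record layout, and the toy population are all assumptions made for illustration.

```python
import random

def balanced_sample(records, is_churner, n_total, seed=0):
    """Draw n_total records so that half are churners and half are not.

    Hypothetical helper: samples without replacement and raises
    ValueError if either group has fewer than n_total // 2 records.
    """
    rng = random.Random(seed)
    churners = [r for r in records if is_churner(r)]
    stayers = [r for r in records if not is_churner(r)]
    half = n_total // 2
    sample = rng.sample(churners, half) + rng.sample(stayers, half)
    rng.shuffle(sample)  # avoid having all churners listed first
    return sample

# Made-up population in which 2% of customers churn.
population = [{"id": i, "churn": i % 50 == 0} for i in range(10_000)]

sample = balanced_sample(population, lambda r: r["churn"], n_total=200)
churn_share = sum(r["churn"] for r in sample) / len(sample)
print(len(sample), churn_share)
```

Because a balanced sample overstates the true churn rate, scores from a model trained on it are best read as rankings rather than as calibrated probabilities unless they are adjusted afterwards.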
The "training" or "calibration" data described above were provided to support predictive model development. Participants were asked to use their best models to predict the probability of churn for two different groups of customers to be scored: a "current" sample of 51,306 drawn from the latter half of 2001 and a "future" sample of 100,462 customers drawn from the first quarter of 2002. Predicting "future" data is generally considered more difficult because external factors and behavioral patterns may change over time. Of course, in real-world settings predictive models are always applied to future data, and the tournament organizers wanted to reproduce that context.
Each contestant in the tournament was asked to rank the current and future score samples in descending order by probability of churn. Using the actual churn status available to the tournament organizers, two performance measures were calculated for each predictive model: the overall Gini measure and the lift in the top decile. The two measures were calculated for the two samples, current and future, so that four performance scores were available for every contestant. (The evaluation criteria are described in detail in a number of locations including the tournament web site.) The top-decile lift is the easiest to explain non-technically: it compares the number of actual churners found among the 10% of customers a model ranks most likely to churn with the number that a random 10% of customers would be expected to contain.
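The two measures can be sketched in a few lines of code. The top-decile lift below follows the description above. For the Gini measure, the tournament's exact formula is defined on its web site and is not reproduced in this text, so the sketch assumes one common definition, Gini = 2 × AUC − 1; the function names and the toy scores are illustrative only.

```python
def top_decile_lift(scores, churned):
    """Churn rate among the top-scored 10% divided by the overall churn rate."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    top = order[: max(1, n // 10)]
    top_rate = sum(churned[i] for i in top) / len(top)
    overall_rate = sum(churned) / n
    return top_rate / overall_rate

def gini(scores, churned):
    """Gini as 2*AUC - 1 (an assumed definition, see the text above).

    AUC is the share of (churner, non-churner) pairs in which the churner
    has the higher score, counting ties as half. The pairwise loop is
    quadratic, which is fine for a small illustration only.
    """
    pos = [s for s, y in zip(scores, churned) if y]
    neg = [s for s, y in zip(scores, churned) if not y]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    auc = wins / (len(pos) * len(neg))
    return 2 * auc - 1

# Made-up example: 20 customers scored from 1.00 down to 0.05,
# with churners in positions 1, 2 and 4 of the ranking.
scores = [i / 20 for i in range(20, 0, -1)]
churned = [1, 1, 0, 1] + [0] * 16

lift = top_decile_lift(scores, churned)
g = gini(scores, churned)
print(round(lift, 2), round(g, 3))
```

A model that ranks customers at random would be expected to score a lift of about 1 and a Gini of about 0 under these definitions.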
Contestants were free to develop a separate model for each measure if they wished to try to optimize their models to either the time period or the evaluation criterion, or both. Salford Systems submitted two models: a straightforward out-of-the-box TreeNet model, and a more complicated model averaging the predictions of several different TreeNet models. The contest results are summarized below along with an explanation of their meaning and significance.
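The text does not say how the predictions of the several TreeNet models were combined. As a minimal sketch, assuming a simple unweighted mean of each customer's predicted churn probability, the averaging step could look like this (the function name and the three prediction lists are hypothetical):

```python
def average_predictions(model_predictions):
    """Average per-customer churn probabilities across several models.

    model_predictions is a list of equal-length lists, one per model.
    An unweighted mean is assumed; other weighting schemes are possible.
    """
    n_models = len(model_predictions)
    return [sum(p) / n_models for p in zip(*model_predictions)]

# Made-up predictions from three models for the same three customers.
preds_a = [0.80, 0.10, 0.40]
preds_b = [0.70, 0.20, 0.60]
preds_c = [0.90, 0.30, 0.50]

ensemble = average_predictions([preds_a, preds_b, preds_c])

# Rank customers from most to least likely to churn, as the tournament required.
ranking = sorted(range(len(ensemble)), key=lambda i: ensemble[i], reverse=True)
print(ensemble, ranking)
```

Since the tournament measures depend only on the ordering of customers, what matters in the ensemble is the ranking it produces rather than the absolute probability values.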
|Data Set|Measure|TreeNet Ensemble|Single TreeNet|2nd Best|Avg. (Std)|
|---|---|---|---|---|---|
|Current|Top Decile Lift|2.90|2.88|2.80|2.14 (.536)|
|Future|Top Decile Lift|3.01|2.99|2.74|2.09 (.585)|
In the "Current" data set, the contestants were provided with account data for 100,462 customers, of which 1,808 churned in the following month, a rate of approximately 1.8% per month. Of course, the tournament contestants did not know which accounts churned. If 10% of all accounts were chosen at random, we would expect to capture 10% of the churners, or about 181. The TreeNet ensemble method captured 525 and a single routine TreeNet model captured 521. By contrast, the best competing model captured 507 and the average model captured 387. For a mobile telecommunications service provider with 1 million accounts, the Salford ensemble model would capture roughly 1,380 more churners per month than the average competing model.
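The lift figures reported for the "Current" data set can be reproduced from the counts in this paragraph. The sketch below simply divides the churners each model captured in its top 10% by the number a random 10% would be expected to contain; the function and variable names are illustrative.

```python
def lift_from_counts(captured, total_churners, fraction=0.10):
    """Lift = churners captured in the top fraction / churners expected at random."""
    expected_at_random = fraction * total_churners
    return captured / expected_at_random

total_churners = 1808  # churners in the "Current" data set, from the text

# Churners captured in the top decile, from the text.
captured = {
    "TreeNet ensemble": 525,
    "Single TreeNet": 521,
    "2nd best": 507,
    "Average": 387,
}

lifts = {
    name: round(lift_from_counts(count, total_churners), 2)
    for name, count in captured.items()
}
print(lifts)
```

The rounded values agree with the top-decile lift figures reported for this data set.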
In the "Future" data set there were 51,036 accounts, of which 924 churned, also at a rate of about 1.8% per month. A 10% portion of these data selected at random should capture about 92 churners. For these data the ensemble TreeNet model captured 278 and the single TreeNet model captured 276. The best alternative model captured 253 churners and the average model captured 193 churners. In a one-million-account portfolio, the Salford Systems model would capture about 500 more churners per month than the best alternative model and about 1,700 more churners per month than the average model.
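The per-million figures quoted in this and the preceding paragraph come from scaling the difference in captured churners up to a portfolio of one million accounts. A minimal sketch of that arithmetic, using counts taken from the text and a hypothetical helper name:

```python
def extra_per_million(captured_a, captured_b, n_accounts, portfolio=1_000_000):
    """Extra churners model A captures over model B, scaled to the portfolio size."""
    return (captured_a - captured_b) * portfolio / n_accounts

# "Future" data set: ensemble (278) versus best alternative model (253).
future_vs_best = extra_per_million(278, 253, n_accounts=51_036)

# "Current" data set: ensemble (525) versus average model (387).
current_vs_avg = extra_per_million(525, 387, n_accounts=100_462)

print(round(future_vs_best), round(current_vs_avg))
```

Scaling by the exact sample sizes gives roughly 490 and 1,374, which the text rounds to about 500 and 1,380.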
The benefits of the TreeNet model in the top decile have two components. First, by selecting the right accounts to target, marketing resources are not wasted on the wrong customers. Second, the TreeNet model would allow the company to achieve a higher retention rate. There is more to the model than the top decile, however. Given that the TreeNet model is superior across the board, as reflected in the Gini coefficients, using it in a targeted retention plan should identify more than 20,000 additional churners per year.
Further advantages stem from the ease with which the TreeNet model is constructed. Because the process is fully automatic, TreeNet models can be built in less time and with less preparatory effort than competing methods.