A Simple Explanation Of TreeNet Models For Regulators
Data mining and machine learning are having a substantial impact on how data is analyzed and how predictive models are built in virtually all industries. Expert data analysts are adopting the new methods rapidly because of their extraordinary power, but consumers of the models and results have lagged in their understanding of what the new methods actually do and how they work. To the extent that the new methods appear to be "black boxes" that mysteriously produce results, those outside the data mining field are sometimes hesitant to trust the results. This brief note addresses the matter by focusing on TreeNet, introduced into the literature as "stochastic gradient boosting" and "multiple additive regression trees."
TreeNet is the brainchild of celebrated Stanford University Professor Jerome H. Friedman (Statistics Department), and was announced in a paper in the prestigious Annals of Statistics in 1999. Friedman explains the methodology in a series of highly technical papers, and Salford Systems has followed with a variety of tutorial materials presented several times at the Joint Statistical Meetings of the American Statistical Association. We will not attempt to summarize that material here, other than to provide references. Instead, we will offer a somewhat different and succinct explanation of the underlying logic of this innovation.
In March of 2000 Stanford University statistics professors Jerome Friedman, Trevor Hastie, and Rob Tibshirani published "Additive Logistic Regression: a Statistical View of Boosting" as a special invited paper in the Annals of Statistics. In this now classic paper the authors rigorously prove that boosting regression trees for binary (yes/no) dependent variables generates a logistic regression! This logistic regression differs from the classical statistical version in the following ways:
The new logistic regression chooses predictors to be included in the model dynamically (something like forward stepwise regression, but see below).
The model is built up in a very gradual way, quite unlike traditional stepwise regression. If a new variable is introduced into the model, it always enters with a very small coefficient (e.g., 0.001). Further learning steps are free to alter that coefficient, but only in very small increments. In this respect the procedure is similar to modern regularized regression methods such as the lasso and least angle regression.
Predictors never enter the model in a simple linear way. Instead a predictor is always binned prior to entry into the model and the bin boundaries are searched for using an optimal binning algorithm.
Typically, a variable used in the model is introduced repeatedly in slightly different forms, meaning that the variable is binned with different bin boundaries every time it is entered.
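The points above can be illustrated with a minimal sketch in plain Python. It is not TreeNet's implementation: it assumes two-bin splits (one cut point per step), a least-squares search for the cut point, a Newton-style bin coefficient, and a learning rate of 0.1 chosen only to keep the example fast. The function names (find_split, boost, log_odds) and the synthetic data are illustrative assumptions.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def find_split(X, residuals):
    """Search every predictor and every cut point for the two-bin split
    that best explains the current residuals (least-squares criterion)."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            cut = (lo + hi) / 2.0
            left = [r for row, r in zip(X, residuals) if row[j] <= cut]
            right = [r for row, r in zip(X, residuals) if row[j] > cut]
            # Maximizing this quantity minimizes the squared error.
            gain = sum(left) ** 2 / len(left) + sum(right) ** 2 / len(right)
            if best is None or gain > best[0]:
                best = (gain, j, cut)
    return best[1], best[2]

def boost(X, y, rounds=150, learning_rate=0.1):
    """Build a logistic model as a sum of many small two-bin step functions."""
    event_rate = sum(y) / len(y)
    intercept = math.log(event_rate / (1.0 - event_rate))
    scores = [intercept] * len(X)   # current log-odds for each row
    steps = []                      # (predictor, cut, left value, right value)
    for _ in range(rounds):
        probs = [sigmoid(s) for s in scores]
        residuals = [yi - pi for yi, pi in zip(y, probs)]
        j, cut = find_split(X, residuals)
        sides = [row[j] <= cut for row in X]
        update = {}
        for side in (True, False):
            num = sum(r for r, s in zip(residuals, sides) if s == side)
            den = sum(p * (1.0 - p) for p, s in zip(probs, sides) if s == side)
            # Newton-style bin coefficient, shrunk by the learning rate.
            update[side] = learning_rate * num / max(den, 1e-12)
        steps.append((j, cut, update[True], update[False]))
        scores = [s + update[side] for s, side in zip(scores, sides)]
    return intercept, steps

def log_odds(intercept, steps, row):
    """Score one row: intercept plus every small step contribution."""
    return intercept + sum(left if row[j] <= cut else right
                           for j, cut, left, right in steps)

# Synthetic data (an assumption): only predictor 0 matters, predictor 1 is noise.
random.seed(0)
X = [[random.randint(0, 9), random.randint(0, 9)] for _ in range(300)]
y = [1 if random.random() < sigmoid(0.6 * (row[0] - 4.5)) else 0 for row in X]

intercept, steps = boost(X, y)
times_used = [sum(1 for j, _, _, _ in steps if j == k) for k in range(2)]
print("steps using each predictor:", times_used)
print("log-odds at x0=0:", round(log_odds(intercept, steps, [0, 5]), 2))
print("log-odds at x0=9:", round(log_odds(intercept, steps, [9, 5]), 2))
```

Each entry in steps is one small binned term. The same predictor can appear many times with different cut points, which is how the repeated binnings accumulate into a flexible curve on the log-odds scale.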
A classical scorecard construction method is based on linear logistic regression in which each variable is binned prior to modeling by studying the log-odds of events within bins. The variable is binned just once and then later considered as a potential predictor (in binned form, or sometimes recoded as a continuous predictor using weights of evidence coding). In TreeNet, the multiple versions of binning are a mechanism for essentially tracing out a smooth curve for the predictor, because the different binnings can be used to construct averaged predictions for the value of the outcome variable conditional on a specific value of the predictor. These bins are constructed, and their coefficients estimated, during the stepwise construction of the model. Unlike classical scorecard construction, which must commit to bin boundaries prior to modeling, the TreeNet model has the opportunity to estimate the bins and their coefficients in the presence of other conditioning variables. The result is that instead of imposing a "shape" on the relationship between the dependent variable Y and a predictor X where the variables are studied in isolation, the TreeNet model is able to take all other relevant variables into account during the discovery of the shape of this relationship.
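For contrast, the classical one-time binning with weights of evidence coding might be sketched as follows. The sign convention (log of event share over non-event share), the 0.5 smoothing added to each count, the bin boundaries, and the toy data are all assumptions made for illustration; practitioners use several variants.

```python
import bisect
import math

def woe_table(values, events, boundaries):
    """Bin a predictor once, using fixed boundaries, and return one
    weight of evidence per bin: ln(event share / non-event share)."""
    n_bins = len(boundaries) + 1
    event_counts = [0] * n_bins
    nonevent_counts = [0] * n_bins
    for v, e in zip(values, events):
        b = bisect.bisect_right(boundaries, v)   # index of the bin holding v
        if e:
            event_counts[b] += 1
        else:
            nonevent_counts[b] += 1
    total_events = sum(event_counts)
    total_nonevents = sum(nonevent_counts)
    table = []
    for b in range(n_bins):
        # 0.5 is added to each count so that an empty cell cannot give log(0).
        event_share = (event_counts[b] + 0.5) / (total_events + 0.5 * n_bins)
        nonevent_share = (nonevent_counts[b] + 0.5) / (total_nonevents + 0.5 * n_bins)
        table.append(math.log(event_share / nonevent_share))
    return table

# Toy data: events become rarer as the predictor increases.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
events = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
boundaries = [4.5, 8.5]          # fixed before any modeling takes place

table = woe_table(values, events, boundaries)
# Recode the predictor as a continuous variable, one WoE value per row.
woe_coded = [table[bisect.bisect_right(boundaries, v)] for v in values]
print([round(w, 3) for w in table])
```

The boundaries here are chosen by looking at this one predictor in isolation, which is exactly the commitment that the stepwise approach avoids.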
The end of this process is a logistic regression in which variables have been selected and transformed dynamically during the stepwise modeling process. Because the transforms are based on a potentially large number of alternative binnings, they are very flexible and cover shapes beyond those available to the classical statistician. More important, every transform is discovered while controlling for other relevant variables. The closest precedent to this type of modeling is the Generalized Additive Model (GAM), introduced by Hastie and Tibshirani in Statistical Science in 1986. A GAM is an automatic smoothing machine that does take all relevant variables into account as it tries to optimally transform a predictor, but it requires the modeler to decide in advance which variables to use, can be sensitive to outliers, and cannot accept missing values.
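In symbols, the additive form of such a model can be written roughly as below. The notation is ours and is only a sketch: M is the number of learning steps, \nu a small learning rate, j(m) the predictor chosen at step m, B_{mk} its bins at that step, and c_{mk} the bin coefficients.

```latex
\log\frac{P(Y=1 \mid x)}{1 - P(Y=1 \mid x)}
  = \beta_0 + \sum_{m=1}^{M} \nu \sum_{k=1}^{K_m} c_{mk}\,
    \mathbf{1}\{x_{j(m)} \in B_{mk}\}
  = \beta_0 + \sum_{j} g_j(x_j),
\qquad
g_j(x_j) = \sum_{m\,:\,j(m)=j} \nu \sum_{k=1}^{K_m} c_{mk}\,
    \mathbf{1}\{x_j \in B_{mk}\}.
```

Collecting the steps by predictor gives one transform g_j per variable, which is the same overall shape as a GAM; predictors never selected at any step simply have g_j = 0.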
TreeNet can be thought of as a modernized version of the GAM with the following enhancements:
Automatic predictor selection
Resistance to outliers (because the core method of learning is a decision tree)
Ability to handle missing values (missing values always get their own bin)
Resistance to overfitting (the very slow update process restrains overfitting)
Use of holdout samples to select optimal (not overfit) models
The relationships that the TreeNet model discovers can be displayed in graphs that are based on simulations generated by the software to trace out the shape relating Y to any X.
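One common way to produce such a graph is a partial dependence calculation: fix X at a grid value for every record, score the modified records, and average. Whether this matches the software's simulation in every detail is an assumption on our part; the function partial_dependence and the stand-in scoring function toy_model below are hypothetical.

```python
def partial_dependence(predict, rows, column, grid):
    """For each grid value, overwrite one predictor in every row,
    score the modified rows, and average the predictions."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)       # copy so the data is left untouched
            modified[column] = value
            total += predict(modified)
        curve.append((value, total / len(rows)))
    return curve

def toy_model(row):
    """Hypothetical fitted model standing in for a real scoring function."""
    x0, x1 = row
    step = 1.0 if x0 > 5 else 0.0
    return 0.2 + 0.5 * step + 0.01 * x1

rows = [[1, 10], [4, 20], [7, 30], [9, 40]]
curve = partial_dependence(toy_model, rows, column=0, grid=[2, 4, 6, 8])
for value, average in curve:
    print(value, round(average, 3))
```

Plotting the averaged predictions against the grid values traces out the shape relating Y to the chosen X, with the other predictors held at their observed values.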
A Comment About Interactions
TreeNet can be run in two modes: a mode that generates strictly additive models and a mode that permits the discovery of interactions. It is straightforward to limit TreeNet to additive models, and such models are excellent baseline predictors, which we can then try to improve with the addition of interactions. TreeNet interactions are more subtle than the interactions familiar to the statistician. Rather than "whole variable" interactions of the form Xi*Xj, a TreeNet interaction would be of the form "if ICO600 and debt_to_income_ratio .10 then adjust score by +10". In other words, specific regions of the data are identified whose behavior differs from that captured by the main effects. Such interactions, which can be extracted in the form of rules, can improve model performance noticeably. In conventional scorecard construction, interactions are rarely used because they are so difficult to discover using conventional methods. Instead, scorecard developers settle for "segmenting" a database and generating segment-specific scorecards. This does serve to capture some dominant interactions, but it is a limited vehicle for interaction detection and cannot be employed when working with modest sample sizes.
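The difference between main effects and a region-specific rule can be made concrete with a small sketch. The predictors x1 and x2, the thresholds, and the point values are hypothetical and are not taken from the rule quoted above; the sketch only shows the form such an adjustment takes.

```python
def additive_score(x1, x2):
    """Main effects only: each predictor contributes through its own bins."""
    part1 = 20 if x1 < 5 else 0
    part2 = 15 if x2 > 0.5 else 0
    return part1 + part2

def score_with_rule(x1, x2):
    """Main effects plus one adjustment that applies in a single region."""
    score = additive_score(x1, x2)
    if x1 < 5 and x2 > 0.5:        # hypothetical region of the data
        score += 10
    return score

# In the additive model the effect of x1 is the same at every level of x2.
print(additive_score(1, 0.9) - additive_score(9, 0.9))    # 20
print(additive_score(1, 0.1) - additive_score(9, 0.1))    # 20

# With the rule, the effect of x1 depends on where x2 falls.
print(score_with_rule(1, 0.9) - score_with_rule(9, 0.9))  # 30
print(score_with_rule(1, 0.1) - score_with_rule(9, 0.1))  # 20
```

A product term Xi*Xj would change the score everywhere, whereas the rule leaves the additive model untouched outside the one region it names.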