Modeling Tricks with TreeNet: Treating Categorical Variables as Continuous
Every experienced modeler knows that it is important to differentiate between ordered and unordered variables. If a variable X happens to be coded as 1, 2, or 3 but is unordered, then the three possible values are arbitrary labels not intended to convey any sense of order. In other words, a record with X=1 does not have a larger or smaller value than a record with X=2 or X=3; it is simply different.
Therefore, were we to run a regression that treated X as continuous, any slope we discovered would be an illusion. Further, treating X as continuous in a regression would embed the notion that a value of "3" for X is not just larger than a value of "1" but exactly three times as large, and that the step from 1 to 2 has the same effect on the prediction as the step from 2 to 3.
When it comes to CART, a single tree is somewhat more flexible because CART can decide, for example, that the predicted value of the target Y is only slightly larger for X=3 than for X=1 or X=2. However, if X is treated as continuous, then a CART split must respect the order of the values: records with X=3 go to one side of the split and records with X=1 go to the other. In other words, CART can split a node either by sending values "1" and "2" to the left and "3" to the right, or by sending "1" only to the left and "2" and "3" to the right. If X is declared categorical, then CART is free to send "1" and "3" to the same side of the split because it is not bound by the order of the values.
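The difference between the two declarations can be made concrete by listing the candidate splits for the three-level example above. The following is a minimal sketch in Python, not CART's own code; the function names `ordered_splits` and `unordered_splits` are hypothetical and chosen only for illustration.

```python
from itertools import combinations

def ordered_splits(levels):
    """Splits available when X is continuous: cut between adjacent sorted values."""
    s = sorted(levels)
    return [(set(s[:i]), set(s[i:])) for i in range(1, len(s))]

def unordered_splits(levels):
    """Splits available when X is categorical: any two-way partition into non-empty sides."""
    s = sorted(levels)
    first, rest = s[0], s[1:]
    splits = []
    # Pin the first level to the left side so each partition is listed only once.
    for r in range(len(rest)):
        for combo in combinations(rest, r):
            left = {first, *combo}
            splits.append((left, set(s) - left))
    return splits

for left, right in ordered_splits([1, 2, 3]):
    print("continuous: ", sorted(left), "|", sorted(right))
for left, right in unordered_splits([1, 2, 3]):
    print("categorical:", sorted(left), "|", sorted(right))
```

The categorical list contains the split that puts "1" and "3" together and "2" alone, which the continuous list cannot produce.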
The surprising situation with TreeNet is that it does not matter how you decide to treat variable X; you will essentially end up with the same predictions, record by record, either way. It is important, however, that you grow TreeNets with a relatively large number of trees for this to work correctly. Here is why: the TreeNet model is highly nonparametric and is not governed by any type of functional form. With sufficient data and sufficient trees, TreeNet can get close to estimating the relationship
y = F(X)
for continuous X with a separate estimate for every possible distinct value of X. You can observe this in the TreeNet (TN) dependency plots, where the shape of the curve can be quite "wiggly" with dozens of slope reversals and substantial movement in amplitude. The normal reason for needing to declare a predictor as unordered (categorical in SPM) is to allow a level-specific prediction y=F(X|X=k) for each possible value of k. Because TN can do this anyway, even for continuous variables, the requirement that we keep continuous and categorical variables declared differently disappears.
Does this mean that you should simply declare all variables as continuous? Not necessarily. If a categorical has a relatively small number of levels (e.g., fewer than 15 or 20), the results generated by TN may be much easier to interpret from the dependency plots because TN will display a bar graph for categorical predictors. Declaring variables as categorical also permits TN to do its work with fewer trees. When the categorical predictor has many levels, however, there is a substantial advantage to treating it as continuous.
In all decision trees, it is well understood that the splitting power of a categorical predictor grows dramatically with the number of levels in the variable. A node can be split in up to 2^(K-1) - 1 different ways, where K is the number of levels of the predictor. When K=17, this is 65,535 (roughly 64 thousand) different ways. When K=33, this is roughly 4.3 billion different ways. By contrast, if the predictor is continuous with K distinct values, then we have a total of only K-1 different ways of splitting the node. If X is truly continuous with a unique value for every record in the data set, then the number of possible splits is limited by the number of records in the training data. The power of categorical splitters with a large number of levels is well understood by decision tree practitioners, and CART builds in the option to penalize such splitters. If high-level categoricals (HLCs) are not penalized, you will find that the trees are dominated by splits on these variables. Further, the splits found in the training data will often not be confirmed in the test data and the trees will underperform.
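The gap between the two counts is easy to check by direct arithmetic. The short Python sketch below evaluates both formulas from the paragraph above; the function names are hypothetical and used only for illustration.

```python
def categorical_split_count(k):
    """Number of two-way splits of a node on an unordered predictor with k levels."""
    return 2 ** (k - 1) - 1

def ordered_split_count(k):
    """Number of two-way splits of a node on an ordered predictor with k distinct values."""
    return k - 1

for k in (3, 17, 33):
    print(f"K={k}: categorical={categorical_split_count(k):,}  ordered={ordered_split_count(k)}")
```

For K=17 the categorical count is 65,535 against 16 ordered splits, and for K=33 it is 4,294,967,295 against 32.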
The same vulnerability to HLCs is found in TreeNet: HLCs can easily dominate the models by dominating most trees. If the HLC has four billion ways of splitting the data, there is an excellent chance that one of those splits will be near perfect and thus outperform (on training data only) all other possible splits. The ideal way to handle this, in TreeNet at least, is to treat the predictor as continuous. This automatically reduces the number of possible splits based on X to K-1 alternatives and eliminates the outsized HLC advantage. Naturally, if we grow only a few trees, we will find that the variable X is coarsely binned into a modest number of intervals, and all records falling into a bin are treated as identical so far as the value of X is concerned. If X is something like, say, an ID variable, then all values of the variable close to each other will be treated as the same. However, if we grow a very large number of trees, and also see that the predictor X is involved in a good number of these trees, then the coarse binning will progressively become finer, allowing close to unique predictions for every distinct value of X.
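The progression from coarse bins to level-specific predictions can be illustrated with a toy boosting loop. This is a minimal sketch and not TreeNet itself: it assumes squared-error loss, two-node trees (stumps), a learning rate of 0.1, and a made-up data set with one noiseless record per level. All names and numbers here are hypothetical.

```python
def fit_stump(x, residual):
    """Return (cut, left_mean, right_mean) for the best threshold split on x."""
    values = sorted(set(x))
    best = None
    for lo, hi in zip(values, values[1:]):  # only K-1 candidate cuts
        cut = (lo + hi) / 2
        left = [r for xi, r in zip(x, residual) if xi < cut]
        right = [r for xi, r in zip(x, residual) if xi >= cut]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, cut, left_mean, right_mean)
    return best[1:]

def boosted_predictions(x, y, n_trees, learn_rate=0.1):
    """Fitted values after n_trees shrunken stumps fit to successive residuals."""
    mean_y = sum(y) / len(y)
    pred = [mean_y] * len(y)
    for _ in range(n_trees):
        residual = [yi - pi for yi, pi in zip(y, pred)]
        cut, left_mean, right_mean = fit_stump(x, residual)
        pred = [pi + learn_rate * (left_mean if xi < cut else right_mean)
                for xi, pi in zip(x, pred)]
    return pred

def max_abs_error(pred, y):
    return max(abs(p - yi) for p, yi in zip(pred, y))

# Toy data: eight unordered levels coded 1..8, with level effects that follow no order.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [5.0, -3.0, 8.0, 0.0, -6.0, 2.0, 7.0, -1.0]

few = boosted_predictions(x, y, n_trees=3)
many = boosted_predictions(x, y, n_trees=5000)
print("distinct predictions after 3 trees:", len(set(few)))
print("max error after 3 trees:   ", max_abs_error(few, y))
print("max error after 5000 trees:", max_abs_error(many, y))
```

Three stumps can introduce at most three cut points, so the eight levels fall into at most four bins. With thousands of stumps the fitted values track each level's own target value closely, even though every individual split respected the numeric order.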
This technique is very easy to implement if X is already coded as a number. However, if X is coded as text, then it will be necessary to map each level of X to a number and use the numerical version of the predictor in the model. For example, in the United States, we have 51 geographical regions corresponding to the individual states and the District of Columbia, which are typically coded as "AL," "AK," etc. It will be necessary to create a new predictor, for example, coded with the integers 1, 2, ..., 51, to represent this variable. Our R&D scientists have suggested that it could be beneficial to code several differently ordered versions of this numerical representation, based on different random orderings of the original variable. For example, the first version could map "AL" to "1," whereas the second could map it instead to any other number, say, 17. While this will make it easier for TN to arrive at an optimal model with fewer trees, it will complicate interpretation and model deployment.
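The recoding step might look like the following Python sketch. It is an illustration under stated assumptions, not an SPM feature: the helper `integer_coding`, the six-state subset, and the seeds are all hypothetical, and a real table would list all 51 codes.

```python
import random

# Hypothetical subset of the 51 two-letter codes, in the listing order used above.
states = ["AL", "AK", "AZ", "AR", "CA", "CO"]

def integer_coding(levels, seed=None):
    """Map each text level to an integer 1..K; shuffle the order first if a seed is given."""
    order = list(levels)
    if seed is not None:
        random.Random(seed).shuffle(order)  # seeded so the mapping is reproducible
    return {level: i for i, level in enumerate(order, start=1)}

base_coding = integer_coding(states)                    # "AL" -> 1, "AK" -> 2, ...
random_codings = [integer_coding(states, seed=s) for s in (1, 2, 3)]

# Build one numeric column per coding for a few example records.
records = ["CA", "AL", "CO"]
x_base = [base_coding[r] for r in records]
x_random = [[coding[r] for r in records] for coding in random_codings]
print(x_base)
print(x_random)
```

Whichever codings are used in training have to be stored and applied unchanged when scoring new data, which is part of the deployment cost noted above.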