How do I define penalties to make it harder for a predictor to become the primary splitter in a node?
CART supports three "improvement penalties." The "natural" improvement for a splitter is always computed according to the CART methodology. A penalty may be imposed, however, that lessens the improvement, affecting the penalized splitter's relative ranking among competitor splits. If the penalty is large enough to cause the primary splitter to be replaced by a competitor, the tree is changed.
The variable-specific penalty penalizes a given predictor (perhaps because it is expensive to collect and we do not want it serving as a splitter unless it is a really powerful predictor). If the user-defined variable-specific penalty is in the range [0,1] inclusive, then the natural improvement is adjusted as:
improve-adj = improve * (1 - variable_specific_penalty)
If the user-specified penalty falls outside of [0,1] then no penalty is imposed.
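As a minimal sketch of the rule above (the function name and argument names are hypothetical and are not part of CART), the variable-specific adjustment can be written in Python as:

```python
def apply_variable_penalty(improve: float, penalty: float) -> float:
    """Adjust a natural improvement by a variable-specific penalty.

    Hypothetical helper for illustration only; it mirrors the rule
    improve-adj = improve * (1 - variable_specific_penalty).
    """
    if 0.0 <= penalty <= 1.0:
        return improve * (1.0 - penalty)
    # A penalty outside [0, 1] means no penalty is imposed.
    return improve
```

With this sketch, a penalty of 1 reduces the improvement to zero, and an out-of-range value such as 1.5 leaves the natural improvement unchanged.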
The missing-value penalty penalizes the improvement of a competitor based on the proportion of missing values for the competitor in the node in question. This makes it difficult, but not impossible, for a competitor with many missing values in a node to rise to the top of the competitor list and assume the role of primary splitter. If there are missing values, the improvement is adjusted as:
improve-adj = improve * SW1 * [ (Ngood/N) ^ SW2 ]
in which SW1 and SW2 are controlled on the PENALTY command, N is the size of the node, and Ngood is the number of records with nonmissing values for the variable in question. If there are no missing values (Ngood = N), no adjustment is made.
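The missing-value adjustment can be sketched the same way. The function below is a hypothetical illustration, not CART's own code; the argument names simply follow the symbols in the formula, and the input check is an added assumption.

```python
def apply_missing_penalty(improve: float, n: int, ngood: int,
                          sw1: float, sw2: float) -> float:
    """Adjust an improvement for missing values in the node.

    Hypothetical helper mirroring
    improve-adj = improve * SW1 * (Ngood / N) ^ SW2.
    """
    if n <= 0 or not 0 <= ngood <= n:
        # Input validation is an assumption added for the sketch.
        raise ValueError("need N > 0 and 0 <= Ngood <= N")
    if ngood == n:
        # No missing values: no adjustment is made (SW1 is not applied).
        return improve
    return improve * sw1 * (ngood / n) ** sw2
```

For example, with SW1 = 1 and SW2 = 1, a competitor that is nonmissing for half of the records in the node keeps half of its natural improvement.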
High level Categorical Penalty
This penalizes a categorical variable that has many levels relative to the size (unweighted N) of the node in question. For a categorical variable:
ratio = log_base_2(N) / (Nlevels - 1)
in which Nlevels is the number of levels for the categorical predictor and N is the number of learn sample records in the node. The improvement is then adjusted as:
improve-adj = improve * [ 1 - SW3 + SW3 * (ratio ^ SW4) ]
in which SW3 and SW4 are controlled on the PENALTY command.
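A sketch of the high-level categorical adjustment follows. The function is hypothetical. Two behaviors in it are assumptions that the text does not state: the ratio is capped at 1 (so that the adjustment can only decrease the improvement), and a predictor with fewer than two levels is left unpenalized to avoid dividing by zero. The cap can be switched off with the cap_ratio argument.

```python
import math


def apply_high_level_penalty(improve: float, n: int, nlevels: int,
                             sw3: float, sw4: float,
                             cap_ratio: bool = True) -> float:
    """Adjust an improvement for a categorical predictor with many levels.

    Hypothetical helper mirroring
    ratio = log2(N) / (Nlevels - 1)
    improve-adj = improve * (1 - SW3 + SW3 * ratio ^ SW4).
    """
    if nlevels < 2:
        # Assumption: nothing to penalize, and avoids division by zero.
        return improve
    ratio = math.log2(n) / (nlevels - 1)
    if cap_ratio:
        # Assumption: cap at 1 so the penalty never inflates the improvement.
        ratio = min(1.0, ratio)
    return improve * (1.0 - sw3 + sw3 * ratio ** sw4)
```

For a node of 16 records and a 9-level predictor, the ratio is 4 / 8 = 0.5, so with SW3 = 1 and SW4 = 1 the improvement is halved. Setting SW3 = 0 disables the penalty.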
Note that all three penalties can be in effect, in which case they all serve to decrease the "freely computed" improvement, resulting in an "adjusted" improvement, which is what appears in the competitor table and is used to rank the competitors.
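To illustrate how the penalties can reorder competitors, here is a self-contained sketch. It assumes the three adjustments are applied multiplicatively, one after another, which the text does not confirm. The competitor names, improvements, and switch values are made up for the example, and the default argument values are not CART's defaults.

```python
import math


def adjusted_improvement(improve, n, ngood, nlevels=None, var_penalty=0.0,
                         sw1=1.0, sw2=0.0, sw3=0.0, sw4=1.0):
    """Apply all three penalties in sequence (assumed to multiply)."""
    adj = improve
    if 0.0 <= var_penalty <= 1.0:          # variable-specific penalty
        adj *= 1.0 - var_penalty
    if ngood < n:                          # missing-value penalty
        adj *= sw1 * (ngood / n) ** sw2
    if nlevels is not None and nlevels > 1:  # high-level categorical penalty
        # Capping the ratio at 1 is an assumption, not stated in the text.
        ratio = min(1.0, math.log2(n) / (nlevels - 1))
        adj *= 1.0 - sw3 + sw3 * ratio ** sw4
    return adj


N = 200  # hypothetical node size
competitors = [
    {"name": "COST_TEST", "improve": 0.30, "ngood": 200, "var_penalty": 0.5},
    {"name": "AGE", "improve": 0.25, "ngood": 200},
    {"name": "INCOME", "improve": 0.28, "ngood": 100},
]

# Rank competitors by adjusted improvement, largest first.
ranked = sorted(
    competitors,
    key=lambda c: adjusted_improvement(
        c["improve"], N, c["ngood"],
        var_penalty=c.get("var_penalty", 0.0), sw1=1.0, sw2=1.0),
    reverse=True,
)
ranked_names = [c["name"] for c in ranked]
```

In this made-up node, COST_TEST has the largest natural improvement, but its variable-specific penalty of 0.5 and the missing-value penalty on INCOME leave AGE at the top of the adjusted ranking.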
These penalties are first used in adjusting the improvements evaluated for the competitors in a node. When generating surrogates, the penalties will affect the improvements computed for the surrogates in the same way — unless PENALTY SURROGATES=NO is specified, in which case improvements are not adjusted for surrogates even if missing values or high level categoricals are involved.
Note that the associations for surrogates are not penalized, so these penalties will not change the ordering of surrogates for a given primary splitter. They will only affect the improvement listed for a surrogate.