# How do I define penalties to make it harder for a predictor to become the primary splitter in a node?

CART supports three "improvement penalties." The "natural" improvement for a splitter is always computed according to the CART methodology. A penalty may be imposed, however, that reduces this improvement and thereby lowers the penalized splitter's relative ranking among competitor splits. If the penalty is large enough to cause the primary splitter to be replaced by a competitor, the tree changes.

#### Improvement Penalties

Variable-Specific Penalty

This penalizes a given predictor (perhaps because it is expensive to collect and we do not want it serving as a splitter unless it is a really powerful predictor). If the user-defined variable-specific penalty is in the range [0,1] inclusive, then the natural improvement is adjusted as:

improve-adj = improve * (1 - variable_specific_penalty)

If the user-specified penalty falls outside of [0,1], then no penalty is imposed.
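
The adjustment above can be sketched in a few lines of Python. This is only an illustration of the formula, not CART's own code; the function name `variable_penalty_adjust` and the sample numbers are made up for the example.

```python
def variable_penalty_adjust(improve: float, penalty: float) -> float:
    """Return the improvement after the variable-specific penalty.

    Hypothetical helper: mirrors the formula in the text, nothing more.
    """
    if 0.0 <= penalty <= 1.0:
        # Penalty inside [0, 1]: scale the natural improvement down.
        return improve * (1.0 - penalty)
    # Penalty outside [0, 1]: no penalty is imposed.
    return improve


# A penalty of 0.5 halves the improvement; 1.5 is out of range and is ignored.
print(variable_penalty_adjust(0.08, 0.5))
print(variable_penalty_adjust(0.08, 1.5))
```

A penalty of 1 drives the adjusted improvement to zero, and a penalty of 0 leaves it unchanged.
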

Missing-Value Penalty

This penalizes the improvement of a competitor based on the proportion of missing values for the competitor in the node in question. This makes it difficult, but not impossible, for a competitor with many missing values in a node to rise to the top of the competitor list and assume the role of primary splitter. If there are missing values, the improvement is adjusted as:

improve-adj = improve * SW1 * [ (Ngood/N) ^ SW2 ]

in which SW1 and SW2 are set with the PENALTY command, N is the size of the node, and Ngood is the number of records with nonmissing values for the variable in question. If there are no missing values (Ngood = N), no adjustment is made.
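
As a rough sketch of the missing-value adjustment, the Python below applies the formula exactly as written. The function name `missing_value_adjust`, the argument names, and the input checks are assumptions made for illustration; they are not part of CART.

```python
def missing_value_adjust(improve: float, n_good: int, n: int,
                         sw1: float, sw2: float) -> float:
    """Return the improvement after the missing-value penalty.

    Hypothetical helper: n is the node size, n_good the number of
    records with nonmissing values for the variable in question.
    """
    if n <= 0 or not 0 <= n_good <= n:
        raise ValueError("need n > 0 and 0 <= n_good <= n")
    if n_good == n:
        # No missing values in the node: no adjustment is made.
        return improve
    return improve * sw1 * (n_good / n) ** sw2


# 80 of 100 records are nonmissing; with SW1 = SW2 = 1 the
# improvement is scaled by 80/100.
print(missing_value_adjust(0.10, 80, 100, 1.0, 1.0))
```

With SW1 = 1, raising SW2 makes the penalty steeper for the same proportion of missing values, since Ngood/N is below 1 whenever any value is missing.
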

High-Level Categorical Penalty

This penalizes a categorical variable that has many levels relative to the size (unweighted N) of the node in question. For a categorical variable:

ratio = log_base_2(N) / (Nlevels - 1)

in which Nlevels is the number of levels for the categorical predictor and N is the number of learn sample records in the node. The improvement is then adjusted as:

improve-adj = improve * [ 1 - SW3 + SW3 * (ratio ^ SW4) ]

in which SW3 and SW4 are set with the PENALTY command.
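
The two formulas above can be combined into one small Python function. This is a sketch under stated assumptions: the name `hlc_adjust` is hypothetical, and because the text does not say how a ratio greater than 1 is handled, the formula is applied as written with no cap.

```python
import math


def hlc_adjust(improve: float, n: int, n_levels: int,
               sw3: float, sw4: float) -> float:
    """Return the improvement after the high-level categorical penalty.

    Hypothetical helper: n is the (unweighted) number of learn sample
    records in the node, n_levels the number of levels of the predictor.
    """
    if n < 1 or n_levels < 2:
        raise ValueError("need n >= 1 and n_levels >= 2")
    ratio = math.log2(n) / (n_levels - 1)
    # Applied as written in the text; no cap on ratio is assumed.
    return improve * (1.0 - sw3 + sw3 * ratio ** sw4)


# A node of 16 records and a 9-level categorical: ratio = 4 / 8 = 0.5.
print(hlc_adjust(0.2, 16, 9, 1.0, 1.0))
```

Setting SW3 to 0 turns the penalty off, because the bracketed factor reduces to 1 regardless of the ratio.
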

Note that all three penalties can be in effect at once, in which case they all serve to decrease the "freely computed" improvement, resulting in an "adjusted" improvement, which is what appears in the competitor table and is used to rank the competitors.
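
To see how an adjusted improvement can change which competitor becomes the primary splitter, here is a minimal Python sketch. It assumes the three penalties combine by multiplying their factors, which the text does not state explicitly. The `Competitor` class, the function names, and the variables INCOME and AGE are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Competitor:
    name: str
    improve: float            # natural ("freely computed") improvement
    n_good: int               # nonmissing records in the node
    n_levels: int = 0         # 0 means the predictor is not categorical
    var_penalty: float = 0.0  # variable-specific penalty


def adjusted_improvement(c, n, sw1, sw2, sw3, sw4):
    """Apply all three penalties, assuming they combine multiplicatively."""
    adj = c.improve
    if 0.0 <= c.var_penalty <= 1.0:
        adj *= 1.0 - c.var_penalty
    if c.n_good < n:
        adj *= sw1 * (c.n_good / n) ** sw2
    if c.n_levels >= 2:
        ratio = math.log2(n) / (c.n_levels - 1)
        adj *= 1.0 - sw3 + sw3 * ratio ** sw4
    return adj


def rank_competitors(competitors, n, sw1, sw2, sw3, sw4):
    """Sort competitors by adjusted improvement, best first."""
    return sorted(
        competitors,
        key=lambda c: adjusted_improvement(c, n, sw1, sw2, sw3, sw4),
        reverse=True,
    )


node_n = 100
competitors = [
    Competitor("INCOME", 0.10, n_good=60),   # 40% missing in this node
    Competitor("AGE", 0.08, n_good=100),     # no missing values
]
ranked = rank_competitors(competitors, node_n, 1.0, 1.0, 1.0, 1.0)
for c in ranked:
    print(c.name, round(adjusted_improvement(c, node_n, 1.0, 1.0, 1.0, 1.0), 4))
```

In this made-up node, INCOME has the larger natural improvement, but the missing-value penalty pushes its adjusted improvement below that of AGE, so AGE moves to the top of the ranking.
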

These penalties are first used in adjusting the improvements evaluated for the competitors in a node. When generating surrogates, the penalties affect the improvements computed for the surrogates in the same way, unless PENALTY SURROGATES=NO is specified, in which case improvements are not adjusted for surrogates even if missing values or high-level categoricals are involved.

Note that the associations for surrogates are not penalized, so these penalties will not change the ordering of surrogates for a given primary splitter. They will only affect the improvement listed for a surrogate.

