What are "intelligent surrogates for missing values"?
CART handles missing values in the database by substituting "surrogate splitters," which are back-up rules that closely mimic the action of primary splitting rules. Suppose that, in a given model, CART splits data according to household income. If a value for income is not available, CART might substitute education level as a good surrogate.
The surrogate splitter contains information that is typically similar to what would be found in the primary splitter. Other products' approaches treat all records with missing values as if the records all had the same unknown value; with that approach all such "missings" are assigned to the same bin. In CART, each record is processed using data specific to that record. This allows records with different data patterns to be handled differently, which results in a better characterization of the data.
By using surrogates to stand in for missing values, CART generates robust and reliable predictive models, even when applied to very large databases with hundreds of variables and many missing values. CART's identification of surrogate predictor variables also provides an effective way to discover low-cost predictive mechanisms. If the best splitting criterion in a tree involves an expensive or difficult-to-obtain measure, a less-expensive surrogate can be considered instead.