Several important points should be kept in mind if you are serious about analyzing a high cardinality categorical variable (a target with many levels). First, you need to review the distribution of your data (training and test data) across all the target levels you plan to include in your analysis. In our auto choice analysis we worked with a data set in which there were many buyers for ultra–popular cars, and very few for rarely–chosen alternatives. Specific car models may be encountered rarely because they are expensive, impractical, offer untested new technology, or are released in low numbers. Whatever the reason, these car models, represented by levels of the target, may not appear with sufficient frequency in the data to support reasonable analysis. In our real world data set, designed to support real world decisions, we encountered some models with fewer than 10 purchases. Clearly, such small samples cannot support reliable analysis. However, if you have sufficient data for every level of your target, then moving forward with a CART analysis can be very productive. We have a few further comments to make about such analysis below.
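The review of level frequencies described above can be sketched in a few lines of Python. This is only an illustration outside of SPM: the function name rare_levels, the toy car labels, and the default cutoff of 10 records (borrowed from the example of models with fewer than 10 purchases) are all assumptions, and the right cutoff depends on your data.

```python
from collections import Counter

def rare_levels(train_targets, test_targets, min_count=10):
    """Return {level: (train_count, test_count)} for every level that has
    fewer than min_count records in either the training or the test data."""
    train_counts = Counter(train_targets)
    test_counts = Counter(test_targets)
    levels = set(train_counts) | set(test_counts)
    return {
        level: (train_counts[level], test_counts[level])  # missing levels count as 0
        for level in sorted(levels)
        if train_counts[level] < min_count or test_counts[level] < min_count
    }

# Hypothetical target columns: two popular models and one rarely chosen one.
train = ["sedan"] * 500 + ["suv"] * 300 + ["roadster"] * 7
test = ["sedan"] * 120 + ["suv"] * 80 + ["roadster"] * 2

flagged = rare_levels(train, test)
print(flagged)
```

Levels that show up in the result are candidates for exclusion or for merging with related levels before modeling.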
Other SPM data mining engines
Technically any data mining engine capable of handling a binary target variable can be adapted to handle any number of target values. You accomplish this by building a binary model for each level of the target, contrasting the level in question against all other levels. Thus, a three–level target with values of, say, “A,” “B” and “C” would be tackled with three binary targets {“A” vs “not–A”}, {“B” vs “not–B”}, {“C” vs “not–C”}. This approach has two problems. The first is that you have to build a separate model for each level, whereas CART handles all levels simultaneously and thus builds one efficient model. The multi–model approach requires a complete new analysis for every level, so if you have 50 levels you will have to wait for 50 models to complete. Of course, such an approach would benefit dramatically from parallel processing. The second problem is the assembly of the separate models into a coherent single model. Having made these introductory comments we now review the engines available in SPM.
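As a minimal sketch of the one–level–versus–the–rest recoding described above, the following Python builds the binary targets for the “A,” “B,” “C” example and shows one simple way to combine the resulting models. The function names, the scores, and the rule of picking the highest–scoring level are assumptions made for illustration; they are not a description of how any SPM engine assembles its models.

```python
def one_vs_rest_targets(labels):
    """Map each level to a 0/1 target: 1 for that level, 0 for all others."""
    levels = sorted(set(labels))
    return {level: [1 if y == level else 0 for y in labels] for level in levels}

def combine(scores_by_level):
    """Assumed combination rule: pick the level whose binary model scores highest."""
    return max(scores_by_level, key=scores_by_level.get)

labels = ["A", "B", "C", "A", "C"]
binary_targets = one_vs_rest_targets(labels)
for level, target in binary_targets.items():
    print(level, "vs not-" + level, target)

# Hypothetical scores for one record from three separately built binary models.
# They need not sum to 1, which is part of the assembly problem noted above.
scores = {"A": 0.20, "B": 0.65, "C": 0.40}
predicted = combine(scores)
print(predicted)
```

Each binary target would be handed to a separate model run, which is why the number of runs grows with the number of levels.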
MARS
MARS was designed originally as a regression tool to capture the partial linearity and smoothness of responses that can be expected in most successful regression models. It was never a surprise that MARS could also be used to model a binary (two–valued) target (Yes/No or 1 vs 0) as a form of logistic regression. This, however, is as far as MARS can be expected to go when it comes to modeling multi–class problems out of the box. MARS could be used to develop a series of binary response models, one for each level of a multi–class target, but at the moment SPM provides no additional support for refining or combining the separate models into a coherent whole.
TreeNet
TreeNet was designed to handle the multi–class target automatically. TreeNet offers some very useful reports in such models, chiefly the level–specific variable importance list and the level–specific partial dependency plots. TreeNet accomplishes this by using the strategy of one model per target level and then automatically combining the separate submodels into a single coherent whole. In general, this strategy works well for a relatively small number of levels. Because one model must be built for each level, care should be taken when working with, for example, 50 levels, as TreeNet will need to build 50 distinct submodels. With today’s multi–core processors the 2012 edition of SPM can leverage parallel processing to accelerate the modeling, but the runs can be expected to take longer than, for example, a CART run.
RandomForests
Like CART, RF is inherently designed to handle the multi–class target, and some of RF’s notable successes and interesting visual displays are seen in three–class problems. RF models are not easy to explain and are not as robust as the original CART engine. However, it is always worth experimenting with RF at a minimum to benchmark the potential predictability existing within the data.
Conclusion
We recommend starting with CART for multi–class problems; further, the larger the number of target levels, the greater the strengths of CART become. The 2012 edition of SPM includes a variety of multi–tree options as well, including bagging and RF–style forest ensemble construction, that offer both the ultra–robustness of CART and the potential predictive accuracy advantages of randomized tree ensembles.


