Use of CART and TreeNet on Microarray Datasets
CART (Classification and Regression Trees) was originally developed by Breiman, Friedman, Olshen, and Stone to construct data-driven solutions to predictive modeling problems. The essence of the technology is recursive partitioning, in which the original dataset is progressively split into mutually exclusive regions by a series of binary splits. The resulting solution is presented as a binary tree, with the splitting variable shown at each node. While this approach works extremely well in traditional data mining applications, where datasets have a large number of observations and a reasonable number of variables, it may not work well with microarrays. The key underlying problem is "data fragmentation" and the associated limited capacity of a single tree to accommodate many predictors at once. For example, a symmetric split of a dataset with only 1000 observations produces two child segments of 500 records each, splitting each of those again produces four nodes of 250 records each, and so on. The data are fragmented at an exponential rate, so only a few variables ultimately enter the model structure. Given that microarrays usually have only a few hundred observations, it is practically impossible to include many genes/variables within a single-tree structure, which diminishes the utility of CART in such applications.
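The fragmentation arithmetic can be made concrete with a small sketch. It assumes perfectly symmetric splits, as in the example above, and a hypothetical minimum node size of 10 records. The function names and the threshold are illustrative assumptions, not CART settings.

```python
def balanced_split_sizes(n_obs, depth):
    """Records per node at each level, assuming every split is perfectly symmetric."""
    return [n_obs // (2 ** level) for level in range(depth + 1)]


def max_balanced_depth(n_obs, min_node_size):
    """Deepest level at which every node still holds at least min_node_size records."""
    depth = 0
    while n_obs // (2 ** (depth + 1)) >= min_node_size:
        depth += 1
    return depth


# The 1000-observation example from the text: 1000 -> 500 -> 250 -> 125.
sizes_1000 = balanced_split_sizes(1000, 3)
print(sizes_1000)  # [1000, 500, 250, 125]

# A microarray-sized dataset: 200 samples, hypothetical minimum of 10 per node.
depth_200 = max_balanced_depth(200, 10)
# A full binary tree of depth d has 2**d - 1 internal (splitting) nodes,
# which is an upper bound on the number of distinct variables it can use.
max_split_nodes = 2 ** depth_200 - 1
print(depth_200, max_split_nodes)  # 4 15
```

Under these assumptions a 200-sample dataset supports at most four levels of splits, so a single tree can use at most 15 distinct variables, and any one terminal node is described by only four of them.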
The right approach, therefore, is to construct multiple trees: where a single tree cannot accommodate many genes, multiple trees can. This is exactly how the TreeNet methodology (Stochastic Gradient Boosting, by Jerome Friedman) operates. First, a single small CART tree is grown to extract the top-level modeling signal. The errors of that tree are then analyzed, and a second-stage tree is grown to correct for them. The process continues through many stages, each stage learning a bit more from the errors left by the previous ones. With careful use of a test set or cross-validation to guard against overfitting, it is possible to extract a useful signal that draws on all relevant variables. This is why we nominate TreeNet as a primary tool for model building on microarrays.
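The stage-wise idea can be illustrated with a minimal sketch. This is not the TreeNet implementation. It is a simplified stand-in that assumes squared-error loss, uses one-split "stumps" in place of small CART trees, and runs on synthetic data. The learning rate, subsample fraction, stage count, and all function names are illustrative assumptions. Choosing the number of stages with a test set or cross-validation is omitted for brevity.

```python
import random


def fit_stump(X, residuals, rows):
    """Best single (feature, threshold) split of the residuals on the sampled rows."""
    best = None
    for j in range(len(X[0])):
        values = sorted({X[i][j] for i in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [residuals[i] for i in rows if X[i][j] <= t]
            right = [residuals[i] for i in rows if X[i][j] > t]
            lmean = sum(left) / len(left)
            rmean = sum(right) / len(right)
            sse = (sum((r - lmean) ** 2 for r in left)
                   + sum((r - rmean) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, t, lmean, rmean)
    return None if best is None else best[1:]


def predict_stump(stump, x):
    j, t, left_value, right_value = stump
    return left_value if x[j] <= t else right_value


def boost(X, y, n_stages=50, learn_rate=0.1, subsample=0.5, seed=0):
    """Each stage fits a stump to the current errors on a random subsample."""
    rng = random.Random(seed)
    n = len(y)
    base = sum(y) / n
    pred = [base] * n
    stumps = []
    for _ in range(n_stages):
        residuals = [y[i] - pred[i] for i in range(n)]  # errors of the stages so far
        rows = rng.sample(range(n), max(2, int(subsample * n)))
        stump = fit_stump(X, residuals, rows)
        if stump is None:  # no valid split on this subsample
            break
        stumps.append(stump)
        for i in range(n):
            pred[i] += learn_rate * predict_stump(stump, X[i])
    return base, stumps, pred


# Synthetic stand-in for a microarray: 60 samples, 20 "genes",
# with the response driven additively by genes 0, 3, and 7.
data_rng = random.Random(1)
X = [[data_rng.random() for _ in range(20)] for _ in range(60)]
y = [float((x[0] > 0.5) + (x[3] > 0.5) + (x[7] > 0.5)) for x in X]

base, stumps, pred = boost(X, y)
mse_base = sum((v - base) ** 2 for v in y) / len(y)
mse_model = sum((v - p) ** 2 for v, p in zip(y, pred)) / len(y)
used_features = sorted({s[0] for s in stumps})
print(mse_base, mse_model, used_features)
```

Each stump uses only one variable, yet the ensemble as a whole draws on several, because every stage is fit on the full sample (or a random subsample of it) rather than on an ever-shrinking fragment.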