Unsupervised Learning and Cluster Analysis with CART
CART in its classification role is an excellent example of "supervised" learning: you cannot start a CART classification analysis without first selecting a target or dependent variable. All partitioning of the data into homogeneous segments is guided by the primary objective of separating the target classes. If the terminal nodes are sufficiently pure in a single target class the analysis will be considered successful even if two or more terminal nodes are very similar on most predictor variables.
Unsupervised learning, by contrast, does not begin with a target variable. Instead the objective is to find groups of similar records in the data. One can think of unsupervised learning as a form of data compression: we search for a moderate number of representative records to summarize or stand in for the original database. Consider a mobile telecommunications company with 20 million customers. The company database will likely contain various categories of information including customer characteristics such as age and postal code, product information describing the customer's mobile handset, features of the plans the subscriber has selected, details of the subscribers use of plan features, and billing and payment information. Although it is almost certain that no two subscribers will be identical on every detail in their customer records, we would expect to find groups of customers that are very similar in their overall pattern of demographics, selected equipment, plan use, and spending and payment behavior. If we could find say 30 representative customer types such that the bulk of customers are well described as belonging to their "type", this information could be very useful for marketing, planning, and new product development.
We cannot promise that we can find clusters or groupings in data that you will find useful. But we include a method quite distinct from that found in other statistical or data mining software. CART and other Salford data mining modules now include an approach to cluster analysis, density estimation and unsupervised learning using ideas that we trace to Leo Breiman, but which may have been known informally in among statisticians at Stanford and elsewhere for some time. The method detects structure in data by contrasting original data with randomized variants of that data. Analysts use this method implicitly when viewing data graphically to identify clusters or other structure in data visually. Take for example customer ages and handsets owned. If there is a pattern in the data then we expect to see certain handsets owned by people in their early 20's, and rather different handsets owned by customers in their early 30's. If every handset is just as likely to be owned in every age group then there is no structure relating these two data dimensions. The method we use generalizes this everyday detection idea to high dimensions.
The method consists of these steps:
- Make a copy of the original data, and then randomly scramble each column of data separately. As an example, starting with data typical of a mobile phone company, suppose we randomly exchanged date of birth information at random in our copy of the database. Each customer record would likely contain age information belonging to another customer. We now repeat this process in every column of the data. Breiman uses a variant in which each column of original data is replaced with a bootstrap resample of the column and you can use either method in Salford software.
Note that all we have done is moved information about in the database, but other than moving data we not changed anything. So aggregates such as averages and totals will not have changed. Any one customer record is now a "Frankenstein" record, with item of information having been obtained from a different customer. Thus, date of birth might be from customer 101135, the service plan taken from customer 456779 and the spend data from 987001.
- Now append the scrambled data set to the original data. We therefore now have the same number of columns as before but twice as many rows. The top portion of the data is the original data and the bottom portion will be the scrambled copy. Add a new column to the data to label records by their data source ("Original" vs. "Copy").
- Generate a predictive model to attempt to discriminate between the Original and Copy data sets. If it is impossible to tell, after the fact, which records are original and which are random artifacts then there is no structure in the data. If it is easy to tell the difference then there is strong structure in the data.
- In the CART model separating the Original from the Copy records, nodes with a high fraction of Original records define regions of high density and qualify as potential "clusters". Such nodes reveal patterns of data values, which appear frequently in the real data but not in the randomized artifact.
We don not expect the optimal sized tree for cluster detection to be the most accurate separator of Original from Copy records. We recommend that you prune back to a tree size that reveals interesting data groupings.
If we simply scramble the data without resampling then the summary statistics for the Original and Copy data sets must be identical. The scrambling destroys any correlation structure in the data (linear or nonlinear). Hence, when using all the data for training no variable can split the data productively in the root node (which is at it should be). If the data sets can be separated at all it will require the use of a combination of at least two variables. Thus in the telecommunications example, the average customer age is of course identical in the original and the copy. But the average age of customers having iPhones may very well not be equal across Original and Copy datasets.
If it is not possible to develop a good model to separate Original and Copy data this means that there is little structure in the Original data and there are no distinctive patterns of interest.
This approach to unsupervised learning represents an important advance in clustering technology because:
- Variable selection is not necessary and different clusters may be defined on different groups of variable
- Preprocessing or rescaling of the data is unnecessary as these clustering methods are not influenced by how data is scaled
- Missing values present no challenges as the methods automatically manage missing data
- The CART-based clustering gives easy control over the number of clusters and helps select the optimal number
In a Salford client's paper, Genentech scientists used a method for cluster explanation that I introduced in 1992 (to our knowledge the first time CART was used in this way). The idea is to create a clustering solution using conventional statistical methods; any conventional clustering software could be used in this first phase. The cluster solutions are generally not so easy to understand, especially if there are more than a handful of variables being used. Further, it can be cumbersome to assign new records to their proper cluster.
To deal with these issues we more on to a second stage CART analysis in which the target variable is the cluster number of each record in the data set and the predictors are the variables used to arrive at the cluster solutions. The model is a multi-class problem with CART's goal being to assign records to the proper cluster. Generally, the CART solution will not agree perfectly with the results that we would obtain by computing distances to cluster centroids, or example, to make cluster assignments. But the CART tree usually does a very effective job while dramatically simplifying the process of cluster understanding and cluster assignment. Whereas in the original cluster methodology, if 30 variables were used for clustering then all 30 variables need to be known and enter in to the cluster assignment mathematics. With the CART approach, we often need to refer to only a very small number of variables to make an assignment and can use CART's built-in missing value handling via surrogates to make good cluster assignments even when much of the required data is missing.