Reading TreeNet Partial Dependency Plots
If you have ever used TreeNet, you have probably moved quickly to the partial dependency plots to gain insight into how the principal drivers of your outcome variable operate. In this post we explain how TreeNet constructs these graphs. For example, here is a partial dependency plot generated for a TreeNet regression predicting the typical home value in a Boston neighborhood as a function of the official crime rate:
We generated the model from which this graph was extracted by using the BOSTON.CSV data set and running this little command script:
MART TREES=1000 NODES=2 LOSS=LS
You can set all of this up in the GUI if you prefer by visiting the Model Setup dialogs and the Edit > Options dialogs. For now, we just want to point out that we have elected to work with two-node trees and we have asked TreeNet to generate its plots using 600 values along the X-axis (if possible).
The first thing to understand is that the dependency plots are essentially model-based simulations. The graph is intended to show how our target variable MV is predicted to change in response to changes in the predictor variable CRIM. In other words, a graph such as the one above is not based directly on the raw data, and it is not a smoothed plot of actual MV against actual CRIM values. The target values we see plotted are predictions generated by the model. A good model should yield reliable plots and teach us much about the true underlying relationship between MV and CRIM, but remember: the plots produced by TreeNet display the relationship as seen by the model.
Here is a simple scatter plot of the two variables in question:
So how do we generate the simulations required to extract the partial dependency plot? It is pretty simple: if the TreeNet model uses predictors X1-X10 and we want to trace out the dependency curve for X1 then we need to do two things:
a) Select values for each of the other predictors X2-X10 that are not explicitly part of this plot (note we are starting with X2 in this list)
b) Generate a series of model-based predictions, setting X1 first to a low value, then to a slightly higher one, and so on in small increments until we reach perhaps the largest value of X1 seen in the data.
Step (b) above generates a new predicted Y for each value of X1 selected. The final output consists of two columns of data, which are the basis of our dependency plot. You could plot this data in Excel or any other plotting software; TreeNet will optionally print this raw plotting data for you if you prefer to make your own plots. The reason for step (a) is that the model requires values for all the other variables in the model to make predictions. (We set aside the topic of missing values for the other predictors).
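The two steps above can be sketched in a few lines of Python. Everything here is a placeholder for illustration: the model function is a made-up stand-in for a trained model's scoring function, not TreeNet's actual code.

```python
import numpy as np

def model_predict(X):
    """Stand-in for a trained model's scoring function.
    Here: an arbitrary function of two predictors X1 and X2."""
    return 3.0 * np.log1p(X[:, 0]) - 0.5 * X[:, 1]

# Step (a): fix the predictor(s) not in the plot at chosen values.
x2_fixed = 4.0

# Step (b): sweep X1 over a grid of values and predict at each one.
x1_grid = np.linspace(0.0, 10.0, 11)
X = np.column_stack([x1_grid, np.full_like(x1_grid, x2_fixed)])
predicted_y = model_predict(X)

# The two columns that form the dependency plot:
for x1, y in zip(x1_grid, predicted_y):
    print(f"{x1:5.1f}  {y:8.3f}")
```

The two arrays `x1_grid` and `predicted_y` are exactly the two columns of plotting data described above; they could be pasted into Excel or any plotting package.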
So how should we manage step (a)? How should we set the values of the variables that are not part of the plot? Many analysts would suggest "plugging in the means" for the continuous predictors and the mode (most common value) for the categorical predictors. While this might work, you can imagine why it could be misleading: a collection of mean values might not be a good representation of the other predictors, and it completely ignores how they vary and co-vary. What we do in TreeNet instead is trace out a separate dependency plot for every record in the training data (behind the scenes)! If we have 100,000 records in the training data we produce 100,000 curves! The partial dependency plot displayed by TreeNet is the average of all of these curves, and it is the most effective way of representing the dependency in question.
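The one-curve-per-record idea can be written out directly. This is a sketch of the averaging procedure just described, again with an invented placeholder model and synthetic data rather than TreeNet's internals:

```python
import numpy as np

rng = np.random.default_rng(0)

def model_predict(X):
    # Placeholder model: depends on X1 plus an X2*X3 term,
    # so the other predictors genuinely matter.
    return 2.0 * np.sqrt(X[:, 0]) + 0.3 * X[:, 1] * X[:, 2]

n_records = 100
data = rng.uniform(0.0, 5.0, size=(n_records, 3))  # columns: X1, X2, X3

x1_grid = np.linspace(0.0, 5.0, 21)

# One dependency curve per training record: hold that record's
# X2 and X3 fixed, and sweep X1 over the grid.
curves = np.empty((n_records, x1_grid.size))
for i, row in enumerate(data):
    X = np.tile(row, (x1_grid.size, 1))
    X[:, 0] = x1_grid
    curves[i] = model_predict(X)

# The displayed partial dependency plot is the average of all curves.
partial_dependence = curves.mean(axis=0)
```

With 100 records we build 100 curves and average them pointwise, exactly as described, just on a smaller scale than the 100,000-record example.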
To duplicate what TreeNet already does internally, quickly, and automatically, you could run a script: read in your data, replace every value of CRIM in the data set with a specific value (leaving all other variables as they are), use the saved TreeNet model to score the data, and then repeat for an entire series of different values of CRIM. Here is the top of such a script written for SPM 7.0:
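If you want to experiment with the same loop outside SPM, it can be sketched in Python. The `score` function and the data below are placeholders standing in for scoring with a saved model; this is not SPM syntax.

```python
import numpy as np

def score(X):
    # Stand-in for scoring the data with a saved model.
    return -1.5 * np.log1p(X[:, 0]) + 0.1 * X[:, 1]

rng = np.random.default_rng(1)
data = rng.uniform(0.0, 20.0, size=(50, 2))  # col 0: CRIM, col 1: another predictor

crim_values = np.linspace(0.0, 20.0, 5)
columns = {}
for v in crim_values:
    modified = data.copy()
    modified[:, 0] = v             # replace every CRIM value with v
    columns[v] = score(modified)   # one column of predicted MV per CRIM value

# Averaging each column gives one plotted point per CRIM value.
column_means = {v: col.mean() for v, col in columns.items()}
```

Each pass through the loop produces one column of predictions; the column means are the points of the final plot.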
We would then need to combine all of the columns containing the predicted values for MV into a single table, compute the average of every column, and finally plot the results. We went through that exercise and combined all of these columns into an Excel spreadsheet to obtain this graph:
Fortunately, TreeNet does all of this work for you automatically. Your general impression should be that the two graphs are very similar; note that the TreeNet-generated version plots the Y-axis values as deviations from the mean of the Y variable.
What is special about this method and why should we want to use it? The method allows us to control for all other variables in the model or "hold all the other variables constant". Working directly from the scatterplot of the raw data would not be nearly as effective and could easily mislead us.
We can extend these ideas to subsets of the data and, in fact, to a single record. Since every record can generate sufficient data for our plot, we can produce record-specific or subset-specific plots. Below are two plots for individual records: one for a neighborhood with a relatively low MV and another with a relatively high MV:
Observe that, apart from level, these two graphs appear to be identical (and they are). This is because we developed an additive TreeNet model, allowing only two-node trees. All that changing the values of the other variables in the model can do is shift the graph up or down in parallel. That means that, in this particular and special case, we did not really need to generate the record-specific plot data. But this is a special case: most TreeNet models allow larger trees and interactions, which means every record can generate a potentially differently shaped graph.
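The parallel-shift behavior is easy to verify numerically: for any additive model f(X1, X2) = g(X1) + h(X2), two record-specific curves can differ only by a constant. The toy function below is an assumption for illustration, not the Boston model:

```python
import numpy as np

def additive_model(X):
    # f(X1, X2) = g(X1) + h(X2): the additive structure that a
    # boosting model restricted to two-node trees must have.
    return np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2

x1_grid = np.linspace(0.0, 3.0, 31)

def record_curve(x2_value):
    X = np.column_stack([x1_grid, np.full_like(x1_grid, x2_value)])
    return additive_model(X)

low = record_curve(1.0)   # a record with a low X2 value
high = record_curve(3.0)  # a record with a high X2 value

# The two curves differ by a constant vertical shift (up to rounding),
# so their shapes are identical.
shift = high - low
```

Once the model contains interactions (trees with more than two nodes), h would depend on X1 as well, and the record-specific curves could change shape, not just level.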
We will be addressing the topic of graphs in the presence of interactions in a subsequent post.