There are several stages to interaction detection using TreeNet models. The first stage is to run a simple comparison of test sample performance for TreeNet models run with trees of different sizes. The baseline model is the TreeNet using 2-node trees (sometimes known as "stumps"). The core idea is that a tree grown with a single split cannot reflect any kind of interaction: the entire story for the tree involves a single variable, and by definition an interaction requires at a minimum two different variables. The 2-node tree baseline model thus represents the best possible model TreeNet can grow when interactions are prevented. We then grow at least one more TreeNet model allowing trees with more than 2 nodes, which thus allows interactions. The simple story is that if the 2-node TreeNet is as good, or almost as good, as the larger tree model, then we have compelling data-based evidence that interactions are irrelevant to the data generation process (how the real world actually operates to produce this data).
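The claim that stumps cannot express an interaction can be checked numerically. The sketch below is not TreeNet code; it builds a hypothetical ensemble of random 2-node trees (the helper names `make_stump` and `ensemble_predict` are ours) and evaluates it at the four corners of a two-variable square. For any sum of single-variable trees, the "interaction contrast" f(0,0) + f(1,1) - f(0,1) - f(1,0) must be zero, whatever the splits and leaf values are.

```python
import random

def make_stump(var, threshold, left_value, right_value):
    """Return a 2-node tree: one split on one variable, two leaf values."""
    def predict(x):
        return left_value if x[var] <= threshold else right_value
    return predict

def ensemble_predict(stumps, x):
    """Ensemble prediction as the sum of the individual tree outputs."""
    return sum(tree(x) for tree in stumps)

# 50 random stumps, each splitting on variable 0 or variable 1 (illustrative only).
rng = random.Random(0)
stumps = [
    make_stump(rng.randrange(2), rng.random(), rng.uniform(-1, 1), rng.uniform(-1, 1))
    for _ in range(50)
]

# Evaluate the ensemble at the four corners of the unit square.
f = {(a, b): ensemble_predict(stumps, (a, b)) for a in (0.0, 1.0) for b in (0.0, 1.0)}

# For an additive model this contrast is zero up to floating-point rounding.
interaction_contrast = f[(0.0, 0.0)] + f[(1.0, 1.0)] - f[(0.0, 1.0)] - f[(1.0, 0.0)]
print(abs(interaction_contrast) < 1e-9)
```

A target with a genuine interaction, such as XOR of the two variables, has a contrast of -2 on the same corners, so no collection of stumps can reproduce it. Trees with more than 2 nodes can split on one variable and then on another, which is what lets the larger model pick up such structure.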
Our tech support department receives a steady stream of interesting questions regarding how to use our products, with questions about specific features or how to accomplish a given task. We also receive questions about data mining (and predictive analytics generally), modeling strategy and a variety of other topics. One type of query that comes up periodically is what to do with missing values. We have spoken before about missing values in a variety of contexts, but usually at a fairly technical and advanced level. Today’s post is actually quite basic in nature and is in response to a user’s question about what to do with special values for variables that are intended to represent missing values. Data input practice stemming from at least the 1970s has used ‘missing value codes’ for unknown data fields; favorite values have included a string of 9’s such as 9999 or -9999. There are a number of variations on this theme. For example, survey research firms have wanted to distinguish between different reasons for a missing value using, for example, 9999 to represent values missing for no known reason, 9998 for ‘unknown’ and 9997 for ‘refused.’ Data input clerks have been known to fill in missing birthdays with values such as January 1, 1960.
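As a small illustration of the idea, here is a sketch in plain Python of recoding such special values before modeling. The code-to-reason mapping follows the survey example above; the function name `recode_missing` and the sample income values are made up for illustration and are not part of any Salford product.

```python
# Missing value codes from the survey example in the text.
MISSING_CODES = {9999: "no known reason", 9998: "unknown", 9997: "refused"}

def recode_missing(values, codes=MISSING_CODES):
    """Replace special codes with None and record why each value was missing."""
    cleaned, reasons = [], []
    for value in values:
        if value in codes:
            cleaned.append(None)          # treat as a true missing value
            reasons.append(codes[value])  # keep the reason as a separate field
        else:
            cleaned.append(value)
            reasons.append(None)
    return cleaned, reasons

# Hypothetical income field containing three special codes.
incomes = [52000, 9999, 61000, 9997, 9998]
cleaned, reasons = recode_missing(incomes)
print(cleaned)
print(reasons)
```

Keeping the reason in its own column means the distinction between, say, ‘refused’ and ‘unknown’ is not lost when the numeric field is set to missing.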
Autodiscovery leverages the stability advantages of multiple trees to rank variables for importance and thus select a subset of predictors for modeling. In SPM 7 and earlier, Autodiscovery runs a very simple TreeNet model on the training data only, growing out to 200 trees. The variable importance ranking generated from this background model is then used to reduce the list of all available predictors down to its top-ranked predictors. Autodiscovery is fast and easy, as there are no control parameters to set, but it is just a mechanism for quickly testing whether a substantial reduction in the number of predictors can improve model performance.
In most serious modeling projects we would supplement Autodiscovery with more intensive variable selection mechanisms such as we have built into BATTERY SHAVING, where the model, rank, select, and model again cycle is repeated possibly a very large number of times.
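The "model, rank, select, and model again" cycle can be sketched generically as follows. This is not how BATTERY SHAVING is implemented inside SPM; it is a minimal illustration in which `fit_and_rank` is a hypothetical callable standing in for "build a model and return its test score and variable importances", and `toy_fit_and_rank` is a made-up stand-in so the loop can be run.

```python
def shave(predictors, fit_and_rank, drop_per_step=1, min_predictors=1):
    """Repeat model -> rank -> select, dropping the weakest predictors each cycle.

    fit_and_rank(predictors) must return (score, {name: importance}),
    where a higher score is better. Returns the best (score, predictors) seen.
    """
    history = []
    current = list(predictors)
    while True:
        score, importance = fit_and_rank(current)      # model
        history.append((score, list(current)))
        if len(current) <= min_predictors:
            break
        ranked = sorted(current, key=lambda name: importance[name], reverse=True)  # rank
        keep = max(min_predictors, len(ranked) - drop_per_step)
        current = ranked[:keep]                         # select, then model again
    return max(history, key=lambda entry: entry[0])

# Toy stand-in for a real model: 'a' and 'b' carry signal, everything else is noise.
SIGNAL = {"a", "b"}

def toy_fit_and_rank(predictors):
    n_signal = sum(1 for p in predictors if p in SIGNAL)
    n_noise = len(predictors) - n_signal
    score = n_signal - 0.1 * n_noise                    # noise variables hurt slightly
    importance = {p: (1.0 if p in SIGNAL else 0.1) for p in predictors}
    return score, importance

best_score, best_predictors = shave(["a", "n1", "b", "n2"], toy_fit_and_rank)
print(best_score, best_predictors)
```

In a real project the score would come from test or cross-validated performance, and the number of variables dropped per cycle would be a modeling choice.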
In their 1984 monograph, Classification and Regression Trees, Breiman, Friedman, Olshen and Stone discussed at length the need to obtain “honest” estimates of the predictive accuracy of a tree-based model. At the time the monograph was written, many data sets were small, so the authors took great pains to work out an effective way to use cross-validation with CART trees. The result was a major advance for data mining, introducing ideas that at the time were radically new. The main point of the discussion was that the only way to avoid overfitting was to rely on test data. With plentiful data we can always reserve a portion for testing, but with fewer data we might have to rely on cross-validation. In either case, however, only the test or cross-validated results should be trusted. In contrast, earlier approaches tended to rely on the training data performance results and did not check them against test data.
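For readers who have not seen it spelled out, the mechanics of cross-validation are simple: split the records into k folds, and let each fold serve once as test data for a model built on the remaining folds. The sketch below shows only the index bookkeeping in plain Python; the function name `k_fold_indices` is ours, and the details of how CART itself uses cross-validation (for example in pruning) are not reproduced here.

```python
def k_fold_indices(n_records, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Record i is assigned to fold i % k, so fold sizes differ by at most one.
    """
    folds = [list(range(start, n_records, k)) for start in range(k)]
    for test_fold in range(k):
        test = folds[test_fold]
        train = [i for f, fold in enumerate(folds) if f != test_fold for i in fold]
        yield train, test

# Example: 10 records, 5 folds. Every record is used for testing exactly once.
splits = list(k_fold_indices(10, 5))
for train, test in splits:
    print("train:", train, "test:", test)
```

Each record is scored only by a model that never saw it during training, which is what makes the resulting performance estimate "honest" even when there are too few data to set aside a separate test sample.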
Salford Systems Predictive Modeler, including CART®, MARS®, TreeNet®, and RandomForests®, can handle any number of variables you care to work with. By default your software will launch prepared to work with up to 32,768 variables, which is sufficient for many users. However, if you need to work with a larger number, you just need to let the software know at the time the application is launched.
If you are working with the non-GUI version, you use command line arguments to inform the application of your preferences. For example, the command line syntax is:
SPM.EXE -v<N>    Specifies max N variables for the session.
With the GUI version you do essentially the same, adding the command line arguments by modifying the properties of the application.
For example, follow these steps to inform SPM that you expect to work with up to 50,000 variables:
The value used for this parameter sets the maximum number of variables that can be used in the application. For example, if you need to use 75,000 variables, then you would need to set this parameter to -V75000.
Recent advances in machine learning technology make it possible to determine definitively whether or not interactions of any degree need to be included in a predictive model.
We can thus establish conclusively, for example, for a given set of predictors, that an additive model (one with no interactions) cannot be improved upon with interactions. Or alternatively, one might prove that a model with interactions will outperform a model without them.
Further, we can now identify precisely which interactions are supported by the data, and also the degree of interaction, even in very high dimensional data. The tools we use to achieve these results are extensions of Stanford University Professor Jerome Friedman's TreeNet, developed by the authors and embedded in the Salford Systems TreeNet 2.0 Pro Ex product.
Boosting is a machine learning strategy that came into being shortly after researchers discovered the value of “ensembles.” Ensembles are collections of models which are used as a group to make predictions (and classifications) that are often considerably more accurate than individual models. The models are combined either by averaging predictions or using a voting scheme (for classification). Thus, if we built 101 classification models where the output of each model is a prediction of “YES” or “NO” then the ensemble prediction might follow a majority vote rule: predict “YES” for any record that obtains at least 51 YES votes, and predict “NO” otherwise. Some ensemble methods use weighted voting where the weights reflect the predictive accuracy of the individual models. In this post we want to focus on a few key ideas related to Salford products rather than the scientific field (we will do that in another post or paper).
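The two voting rules just described can be written down in a few lines. The sketch below is a generic illustration in plain Python, not code from any Salford product; the function names and the example weights are made up, and the weighted rule shown (YES wins if it holds more than half of the total weight) is one simple choice among several.

```python
def majority_vote(votes):
    """Predict "YES" if more than half of the models vote "YES", else "NO"."""
    yes_votes = sum(1 for vote in votes if vote == "YES")
    return "YES" if yes_votes > len(votes) / 2 else "NO"

def weighted_vote(votes, weights):
    """Predict "YES" if the YES votes carry more than half of the total weight."""
    yes_weight = sum(w for vote, w in zip(votes, weights) if vote == "YES")
    return "YES" if yes_weight > sum(weights) / 2 else "NO"

# 101 models: 51 YES votes is the smallest majority.
print(majority_vote(["YES"] * 51 + ["NO"] * 50))
print(majority_vote(["YES"] * 50 + ["NO"] * 51))

# One accurate model can outvote two weaker ones under weighted voting.
print(weighted_vote(["YES", "NO", "NO"], [0.9, 0.3, 0.3]))
```

Using an odd number of models, such as the 101 in the example, guarantees that a simple majority vote on a YES/NO outcome can never end in a tie.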