David Tolliver

Tuesday, April 06 2010 14:02

Regression Tree Ensembles

Many have asked if RandomForests (RF) supports regression analysis.

The short answer is: not with the current implementation. Salford Systems plans to support RF regression in our next release.

That said, if you have been thinking about RF regression we urge you to consider using TreeNet regression instead. Some reasons follow:

  1. TreeNet was originally designed for regression, not classification. Friedman's original name for the TreeNet technology was Multiple Additive Regression Trees.
  2. TreeNet is a superb performer for the regression problem; we have used it in a number of demanding real world applications.
  3. TreeNet develops multiple tree models but the trees are generally quite small and remain small regardless of the size of the training data file. By contrast, RF trees grow with the size of the training data and can become unmanageable, particularly in deployment.
  4. RF was originally designed for the classification problem, and much of the post-processing of the RF trees focuses on the class membership of the records (both in-bag and out-of-bag). None of this elaborate machinery is useful for regression.
  5. Leo Breiman left regression out of the original RF stream of his work. Only after years of focusing on the classification problem did he address regression, and this work was never completed. As a result, we do not know where Leo was going with RF regression, although we do know that he wanted to use a completely new code base for it. His co-author and collaborator, Adele Cutler, has also remained focused on classification; thus, RF regression has languished. (TreeNet regression thrives and is being enhanced on an ongoing basis.)
  6. TreeNet delivers useful "partial dependency plots" that reveal the true conditional relationship between the target Y and any predictor X and can also be used to definitively identify key interactions. We know of no other technology that can offer this kind of insight into the data-generating process. (TreeNet Pro Ex can do this automatically.)
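
For readers unfamiliar with the partial dependency plots mentioned in point 6, the general idea can be sketched in a few lines of Python. This is not TreeNet's implementation; it is a minimal illustration of the standard averaging technique, and the names `partial_dependence`, `toy_model`, `x1` and `x2` are hypothetical. The predictor of interest is forced to each value on a grid, all other predictors are left as observed, and the model's predictions are averaged.

```python
from statistics import mean

def partial_dependence(predict, rows, feature, grid):
    """Average prediction over the data with `feature` forced to each grid value."""
    curve = []
    for value in grid:
        # Overwrite the chosen feature in every row; keep other predictors as observed.
        modified = [{**row, feature: value} for row in rows]
        curve.append((value, mean(predict(r) for r in modified)))
    return curve

# Hypothetical stand-in for a fitted ensemble: y = 2*x1 + x1*x2
def toy_model(row):
    return 2.0 * row["x1"] + row["x1"] * row["x2"]

rows = [
    {"x1": 0.0, "x2": 1.0},
    {"x1": 1.0, "x2": 3.0},
    {"x1": 2.0, "x2": 5.0},
]
pd_x1 = partial_dependence(toy_model, rows, "x1", [0.0, 1.0, 2.0])
```

Plotting the resulting (value, average prediction) pairs gives the dependency curve for that predictor.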
Monday, October 19 2009 16:00

What is the Variable Importance Measure?

When fielding support questions over the years, I have often been asked about CART’s variable importance measure. Typical questions include “What is the definition of Variable Importance?” and “Why is a variable shown as important, but is never a splitter?”

Given that one of the goals of CART is to develop a simple tree structure for predicting data, relatively few variables may appear explicitly as splitters, which might be interpreted to mean that the other variables are not important in understanding or predicting the dependent variable. However, unlike a linear regression model, a variable in CART can be considered highly important even if it never appears as a node splitter. Because CART keeps track of surrogate splits in the tree-growing process, the contribution a variable can make in prediction is not determined only by primary splits. (The primary splitter is the variable you see exhibited in the tree structure. Behind the scenes, however, whenever that variable is missing, the surrogate splitters will be used instead to move a record down the tree to its appropriate terminal node.)
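
The routing behaviour described above can be sketched as follows. This is a simplified illustration, not CART's internal code: the node layout, the `route` function, the variable names `X1` and `X2`, and the fallback `default` direction are all assumptions, and real surrogates may also send records in the reverse direction of the primary split.

```python
def route(record, node):
    """Return 'left' or 'right' for one record at one node.

    The primary splitter is tried first, then each surrogate in rank order.
    If every splitter is missing, the node's default direction is used.
    """
    for variable, threshold in [node["primary"]] + node["surrogates"]:
        value = record.get(variable)
        if value is not None:
            return "left" if value <= threshold else "right"
    return node["default"]

# Hypothetical node: primary split on X1, one surrogate on X2.
node = {
    "primary": ("X1", 12),
    "surrogates": [("X2", 15)],
    "default": "left",
}

with_primary = route({"X1": 10, "X2": 20}, node)      # X1 present, so X1 decides
without_primary = route({"X1": None, "X2": 20}, node)  # X1 missing, so X2 decides
nothing_known = route({}, node)                        # both missing, default used
```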

One way to think about this is to consider pairs of variables that contain similar information, such as father's and mother's education. Although only one of these variables can appear in a particular primary split, because one will perform better than the other in a given context, to rank one of these variables as important and the other as unimportant would be a mistake. Suppose, for example, FED (father’s education) was chosen as the primary splitter and MED (mother’s education) turned out to be the best surrogate. Just how close these two variables are in predictive power would become evident if we either deleted the primary splitter FED or set all its values in the data set to missing when applying the tree to new data. In these circumstances, the surrogate variable could end up doing all the work of the primary splitter, and the predictive accuracy of the tree might not be any worse if MED had to be used everywhere instead of FED. The phenomenon of one variable obscuring the significance of another, known as masking, is addressed in CART's variable importance measure.

To calculate a variable importance score, CART looks at the improvement measure attributable to each variable in its role as either a primary or a surrogate splitter. The values of ALL these improvements are summed over each node and totaled, and are then scaled relative to the best performing variable. The variable with the highest sum of improvements is scored 100, and all other variables will have lower scores ranging downwards toward zero. A variable can obtain an importance score of zero in CART only if it never appears as either a primary or a surrogate splitter. Because such a variable plays no role anywhere in the tree, eliminating it from the data set should make no difference to the results. (Some rare circumstances occur in which this rule of thumb is violated, but these are not discussed here.)
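
The arithmetic can be illustrated with a short Python sketch. The node representation, the function name `importance_scores`, the variable INCOME, and the improvement values are made up for illustration; only the sum-then-scale logic follows the description above.

```python
from collections import defaultdict

def importance_scores(nodes):
    """Sum each variable's improvements over all nodes, then scale the best to 100."""
    totals = defaultdict(float)
    for node in nodes:
        variable, improvement = node["primary"]
        totals[variable] += improvement
        for variable, improvement in node["surrogates"]:
            totals[variable] += improvement
    if not totals:
        return {}
    best = max(totals.values())
    if best == 0:
        return {v: 0.0 for v in totals}
    return {v: 100.0 * t / best for v, t in totals.items()}

# Hypothetical two-split tree. MED is never a primary splitter,
# yet it accumulates improvement as a surrogate at both nodes.
nodes = [
    {"primary": ("FED", 0.5), "surrogates": [("MED", 0.375)]},
    {"primary": ("INCOME", 0.25), "surrogates": [("MED", 0.125)]},
]
scores = importance_scores(nodes)
```

In this made-up tree, MED ties with FED at 100 even though it never appears in the displayed tree, which is exactly the situation users ask about.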

The importance score measures a variable's ability to perform in a specific tree of a specific size either as a primary splitter or as a surrogate splitter. It says nothing, however, about the value of the variable in the construction of other trees. For example, a variable that is very important in a 20-node tree might not be important at all in a two-node tree because it plays no role in the splitting of the root node (which is the only split in a two-node tree). As a tree is allowed to become bigger, variables have more opportunities to play a role in the tree and thus to receive non-zero importance scores. The relative importance rankings of variables can change dramatically as you compare trees of substantially different sizes. Thus, you should not take importance scores to indicate an absolute information value of a variable; the rankings are strictly relative to a given tree structure.

The scores reflect the contribution each variable makes in classifying or predicting the target variable, with the contribution stemming from both the variable’s role as a primary splitter and its role as a surrogate to any of the primary splitters. In our example, ANYRAQT, the variable used to split the root node, is ranked as most important. PERSTRN received a zero score, indicating that this variable played no role in the analysis, either as a primary splitter or as a surrogate.

To see how the scores change if each variable’s role as only a primary splitter is considered, click the Consider Only Primary Splitters check box; CART automatically recalculates the scores.

You can also discount surrogates by their association values if you check the Discount Surrogates check box and then select the By Association radio button. Alternatively, you can discount the improvement measure attributed to each variable in its role as a surrogate by clicking on the Geometric radio button and entering a value between 0 and 1. CART will use this value to geometrically decrease the weight of the contribution of surrogates in proportion to their surrogate ranking (first, second, third, etc.). Finally, you may click on the Use Only Top radio button and select the number of surrogates at each split that you want CART to consider in the calculation.
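
To make the effect of these options concrete, here is a hedged Python sketch. It is not CART's code. The function and parameter names are hypothetical, the geometric weight is assumed to be the entered value raised to the surrogate's rank, and the association discount is assumed to be a simple multiplication by the association value.

```python
from collections import defaultdict

def discounted_importance(nodes, primary_only=False, by_association=False,
                          geometric=None, top_k=None):
    """Importance scores with optional surrogate discounting (illustrative only)."""
    totals = defaultdict(float)
    for node in nodes:
        variable, improvement = node["primary"]
        totals[variable] += improvement
        if primary_only:
            continue  # ignore surrogates entirely
        surrogates = node["surrogates"]  # assumed sorted best surrogate first
        if top_k is not None:
            surrogates = surrogates[:top_k]
        for rank, (variable, improvement, association) in enumerate(surrogates, start=1):
            weight = 1.0
            if by_association:
                weight *= association       # assumption: multiply by association
            if geometric is not None:
                weight *= geometric ** rank  # assumption: p, p^2, p^3, ...
            totals[variable] += weight * improvement
    best = max(totals.values())
    return {v: 100.0 * t / best for v, t in totals.items()}

# Hypothetical single split: (variable, improvement, association) per surrogate.
nodes = [{
    "primary": ("FED", 0.5),
    "surrogates": [("MED", 0.5, 0.5), ("INCOME", 0.25, 0.25)],
}]

full = discounted_importance(nodes)
primary = discounted_importance(nodes, primary_only=True)
geo = discounted_importance(nodes, geometric=0.5)
top1 = discounted_importance(nodes, top_k=1)
assoc = discounted_importance(nodes, by_association=True)
```

Comparing `full` with the other results shows how quickly a strong surrogate such as MED drops in the ranking once its contribution is discounted.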

NOTE: This update is necessary only for users experiencing problems opening SAS 9 databases.

All Salford Systems predictive modeling engines come with a set of database drivers that allow you to read and write a large number of different file formats, including various versions of SAS files. Periodically, some drivers need to be updated to accommodate new file format versions. In this note we discuss SAS 9 files.

If you are running some of our older software you may find that you cannot successfully read SAS 9 files. This older software uses a legacy database conversion system distributed with older versions of the Salford product line.

The process for updating the driver is as follows:

  1. Download the replacement driver from:

    http://www.salford-systems.com/dist/Support/p7sas7.zip

    and unzip it to obtain the file named p7sas7.dll

  2. Use the new p7sas7.dll to replace the existing file of the same name. This file is located in the application’s \...\…\dbmscopy\mip7\ directory. The full pathname may vary depending on how the product was installed, but the directory is easy to locate.
  3. Once updated, this driver will allow you to open SAS 9+ files without a problem.

      NOTES:

      • The "File of type:" selected in the Open Data File dialog will still remain "SAS for Windows 7/8 (*sas7bdat)."
      • Newer versions of the Salford product line do not require this update. This is only applicable to older releases of our software utilizing the older driver set.
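
If you prefer to script the file replacement, the steps above can be sketched in Python as shown below. This is an unofficial convenience sketch, not a Salford Systems tool. The function name `replace_driver` and the `.bak` backup are our own additions, and you must supply the actual driver directory for your installation, since its full path varies.

```python
import shutil
from pathlib import Path

def replace_driver(new_dll: Path, driver_dir: Path) -> Path:
    """Back up the existing p7sas7.dll in driver_dir, then copy the new one in.

    Returns the path of the backup copy.
    """
    target = driver_dir / "p7sas7.dll"
    if not new_dll.is_file():
        raise FileNotFoundError(f"Replacement driver not found: {new_dll}")
    if not target.is_file():
        raise FileNotFoundError(f"No existing driver at {target}")
    backup = target.with_name(target.name + ".bak")
    shutil.copy2(target, backup)   # keep the old driver in case a rollback is needed
    shutil.copy2(new_dll, target)  # overwrite with the downloaded driver
    return backup
```

Call it with the path to the unzipped p7sas7.dll and the path to your installation's mip7 directory. You may need administrator rights if the product is installed under a protected folder.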