Friday, February 03 2012 12:23

AutoDiscovery of Predictors in SPM

Autodiscovery leverages the stability advantages of multiple trees to rank variables by importance and thus select a subset of predictors for modeling. In SPM 7 and earlier, Autodiscovery runs a very simple TreeNet model on the training data only, growing it out to 200 trees. The variable importance ranking generated by this background model is then used to reduce the list of all available predictors to the top performers. Autodiscovery is fast and easy because there are no control parameters to set, but it is only a mechanism for quickly testing whether a substantial reduction in the number of predictors can improve model performance.

In most serious modeling projects we would supplement Autodiscovery with more intensive variable selection mechanisms, such as those built into BATTERY SHAVING, where the cycle of model, rank, select, and model again is repeated, possibly a very large number of times.
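
The model, rank, select, and model again cycle can be sketched in a few lines of Python. This is only an illustration of the general idea, not the BATTERY SHAVING implementation: the function names, the `fit_and_rank` callback, and the toy scoring rule are all assumptions made for the example, and a real run would fit an actual model and score it on test data.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical callback: fit a model on the given predictors and return
# (test_score, {predictor: importance}); a higher score is assumed better.
FitAndRank = Callable[[List[str]], Tuple[float, Dict[str, float]]]


def shave_predictors(
    predictors: List[str],
    fit_and_rank: FitAndRank,
    drop_per_step: int = 1,
    min_keep: int = 1,
) -> Tuple[List[str], float]:
    """Repeat model -> rank -> select, keeping the best-scoring predictor set."""
    if drop_per_step < 1 or min_keep < 1:
        raise ValueError("drop_per_step and min_keep must be at least 1")
    current = list(predictors)
    best_set, best_score = list(current), float("-inf")
    while current:
        score, importance = fit_and_rank(current)  # model
        if score > best_score:
            best_set, best_score = list(current), score
        if len(current) - drop_per_step < min_keep:
            break
        # rank: most important first (sorted() is stable, so ties keep their order)
        ranked = sorted(current, key=lambda p: importance.get(p, 0.0), reverse=True)
        # select: drop the least important predictors, then model again
        current = ranked[: len(ranked) - drop_per_step]
    return best_set, best_score


def toy_fit_and_rank(preds: List[str]) -> Tuple[float, Dict[str, float]]:
    """Stand-in for a real model: x* predictors carry signal, n* are noise."""
    signal = [p for p in preds if p.startswith("x")]
    noise = [p for p in preds if p.startswith("n")]
    score = len(signal) - 0.1 * len(noise)
    importance = {p: (1.0 if p in signal else 0.1) for p in preds}
    return score, importance


best_set, best_score = shave_predictors(
    ["x1", "n1", "x2", "n2", "n3"], toy_fit_and_rank
)
print(best_set, best_score)
```

In the toy run the noise predictors are shaved off one at a time, and the loop keeps whichever predictor set scored best along the way.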

Friday, January 13 2012 09:54

TreeNet for Beginners

Work through large databases quickly and accurately.
Published in Tutorials

In their 1984 monograph, Classification and Regression Trees, Breiman, Friedman, Olshen and Stone discussed at length the need to obtain “honest” estimates of the predictive accuracy of a tree-based model. At the time the monograph was written, many data sets were small, so the authors took great pains to work out an effective way to use cross-validation with CART trees. The result was a major advance for data mining, introducing ideas that at the time were radically new. The main point of the discussion was that the only way to avoid overfitting was to rely on test data. With plentiful data we can always reserve a portion for testing, but with fewer data we might have to rely on cross-validation. In either case, however, only the test or cross-validated results should be trusted. In contrast, earlier approaches tended to ignore test data and focus only on training data performance.
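
As a rough illustration of the cross-validation idea, the sketch below splits the data into k folds, fits on k-1 of them, scores on the held-out fold, and averages the held-out scores. It is a generic k-fold outline in plain Python, not the procedure used inside CART; the `fit` and `score` callbacks and the toy data are assumptions made for the example.

```python
import random
from typing import Any, Callable, List, Sequence


def k_fold_indices(n: int, k: int, seed: int = 0) -> List[List[int]]:
    """Shuffle the row indices 0..n-1 and deal them into k nearly equal folds."""
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and the number of rows")
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(
    rows: Sequence[Any],
    fit: Callable[[List[Any]], Any],
    score: Callable[[Any, List[Any]], float],
    k: int = 10,
    seed: int = 0,
) -> float:
    """Average the score obtained on each held-out fold."""
    scores = []
    for fold in k_fold_indices(len(rows), k, seed):
        held_out = set(fold)
        train = [r for i, r in enumerate(rows) if i not in held_out]
        test = [rows[i] for i in fold]
        model = fit(train)                 # the model never sees the test fold
        scores.append(score(model, test))  # "honest" estimate for this fold
    return sum(scores) / len(scores)


# Toy usage: the "model" is just the mean of y, scored by mean absolute error.
rows = [(float(i), 3.0) for i in range(23)]
mean_y = lambda train: sum(y for _, y in train) / len(train)
mae = lambda model, test: sum(abs(y - model) for _, y in test) / len(test)
cv_error = cross_validate(rows, mean_y, mae, k=5)
print(cv_error)
```

Every row is used for testing exactly once, which is what makes the approach attractive when there are too few data to set aside a separate test sample.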

Published in Dan Steinberg
Thursday, December 29 2011 10:44

Working With A Large Number of Variables In SPM

Salford Systems Predictive Modeler, including CART®, MARS®, TreeNet®, and RandomForests®, can handle any number of variables you care to work with. By default, the software launches prepared to work with up to 32,768 variables, which is sufficient for many users. If you need to work with a larger number, you just need to let the software know at the time the application is launched.

If you are working with the non-GUI version, you use command line arguments to inform the application of your preferences. The command line syntax is:

     SPM.EXE    -v<N>      Specifies a maximum of N variables for the session.

With the GUI version you do essentially the same thing, adding the command line arguments by modifying the properties of the application shortcut.

For example, to inform SPM that you expect to work with up to 50,000 variables, follow these steps:

  1. Right click on the program group icon and select “Properties.”
  2. From the Properties dialog, be sure to select the “Shortcut” tab.
  3. From the Shortcut tab, add the parameter “-V50000” to the “Target” path.

    The value used for this parameter sets the number of variables the application will allow. For example, if you need to use 75,000 variables, set this parameter to -V75000.

  4. Click the [Apply] button.
  5. Click [OK] to close the shortcut properties dialog.
  6. Use your program group icon to start SPM or any other individual Salford Systems product.
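
If you set this up for several machines or scripts, a small helper can decide whether the parameter is needed at all and build the text to put in the “Target” field. This is only a sketch: the function names and the install path in the usage line are hypothetical, and the only facts taken from the text above are the 32,768 default and the -V<N> form of the parameter.

```python
DEFAULT_MAX_VARIABLES = 32_768  # default limit stated in the text above


def variable_limit_flag(n_variables: int) -> str:
    """Return the -V<N> argument, or "" when the default limit is enough."""
    if n_variables <= 0:
        raise ValueError("n_variables must be positive")
    if n_variables <= DEFAULT_MAX_VARIABLES:
        return ""
    return f"-V{n_variables}"


def shortcut_target(exe_path: str, n_variables: int) -> str:
    """Build a shortcut "Target" string: the quoted program path plus the flag."""
    flag = variable_limit_flag(n_variables)
    return f'"{exe_path}" {flag}'.strip()


# Hypothetical install path, used only to show the resulting string.
print(shortcut_target(r"C:\SPM\SPM.EXE", 75000))
```

Quoting the program path keeps the parameter outside the path even when the install directory contains spaces.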


Recent advances in machine learning technology make it possible to determine definitively whether or not interactions of any degree need to be included in a predictive model.

We can thus establish conclusively, for example, for a given set of predictors, that an additive model (one with no interactions) cannot be improved upon with interactions. Or alternatively, one might prove that a model with interactions will outperform a model without them.

Further, we can now identify precisely which interactions are supported by the data, and also the degree of interaction, even in very high dimensional data. The tools we use to achieve these results are extensions of Stanford University Professor Jerome Friedman's TreeNet, developed by the authors and embedded in the Salford Systems TreeNet 2.0 Pro Ex product.

Published in News
Friday, December 09 2011 08:34

A Few Comments On Boosting Decision Trees

Boosting is a machine learning strategy that came into being shortly after researchers discovered the value of “ensembles.” Ensembles are collections of models that are used as a group to make predictions (and classifications) that are often considerably more accurate than those of individual models. The models are combined either by averaging predictions or by using a voting scheme (for classification). Thus, if we built 101 classification models where the output of each model is a prediction of “YES” or “NO,” then the ensemble prediction might follow a majority vote rule: predict “YES” for any record that obtains at least 51 “YES” votes, and predict “NO” otherwise. Some ensemble methods use weighted voting, where the weights reflect the predictive accuracy of the individual models. In this post we want to focus on a few key ideas related to Salford products rather than on the scientific field (we will do that in another post or paper).
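
The two voting rules described above can be written down directly. The sketch below is a generic illustration in plain Python, not code from any Salford product; the function names and the example weights are assumptions, and the weighted rule shown (predict “YES” when the YES votes carry more than half of the total weight) is just one simple way to do weighted voting.

```python
from typing import Sequence


def majority_vote(votes: Sequence[str], positive: str = "YES", negative: str = "NO") -> str:
    """Predict `positive` when more than half of the models vote for it."""
    yes_votes = sum(1 for v in votes if v == positive)
    return positive if yes_votes > len(votes) / 2 else negative


def weighted_vote(
    votes: Sequence[str],
    weights: Sequence[float],
    positive: str = "YES",
    negative: str = "NO",
) -> str:
    """Predict `positive` when its votes carry more than half of the total weight."""
    if len(votes) != len(weights):
        raise ValueError("votes and weights must have the same length")
    yes_weight = sum(w for v, w in zip(votes, weights) if v == positive)
    return positive if yes_weight > sum(weights) / 2 else negative


# 101 models: 51 YES votes are enough, 50 are not.
print(majority_vote(["YES"] * 51 + ["NO"] * 50))
print(majority_vote(["YES"] * 50 + ["NO"] * 51))

# One accurate model can outvote two weaker ones under weighted voting.
print(weighted_vote(["YES", "NO", "NO"], [0.9, 0.3, 0.3]))
```

Using an odd number of models, as in the 101-model example, means an unweighted vote can never end in a tie.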

Published in Dan Steinberg