Thursday, March 22 2012 11:49

Are Interactions Relevant To Your Data?

Interaction detection with TreeNet models proceeds in several stages. The first stage is a simple comparison of test sample performance for TreeNet models built with trees of different sizes. The baseline model is a TreeNet using 2-node trees (sometimes known as "stumps"). The core idea is that a tree grown with a single split cannot reflect any kind of interaction: the entire tree involves a single variable, and by definition an interaction requires at least two different variables. The 2-node baseline model thus represents the best possible model TreeNet can grow when interactions are prevented. We then grow at least one more model using trees with more than 2 nodes, which allows interactions. If the 2-node TreeNet is as good, or almost as good, as the larger-tree model, then we have compelling data-based evidence that interactions are irrelevant to the data generation process (how the real world actually operates to produce this data).
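The reasoning behind the baseline can be illustrated with a toy calculation. The sketch below is not TreeNet: it assumes a balanced 2x2 grid of two binary predictors and uses a closed-form additive fit (grand mean plus one effect per variable) as a stand-in for a model that cannot represent interactions. The function names `additive_fit` and `mse` are hypothetical.

```python
from itertools import product
from statistics import mean


def additive_fit(rows):
    """Least-squares additive fit on a balanced grid: grand mean plus main effects.

    rows is a list of (x1, x2, y) tuples. The returned predictor uses each
    variable separately, so it cannot represent an x1-by-x2 interaction.
    """
    grand = mean(y for _, _, y in rows)
    effect1 = {v: mean(y for a, _, y in rows if a == v) - grand
               for v in {a for a, _, _ in rows}}
    effect2 = {v: mean(y for _, b, y in rows if b == v) - grand
               for v in {b for _, b, _ in rows}}
    return lambda a, b: grand + effect1[a] + effect2[b]


def mse(rows, predict):
    """Mean squared error of a predictor over the rows."""
    return mean((y - predict(a, b)) ** 2 for a, b, y in rows)


grid = list(product([0, 1], repeat=2))

# Target with no interaction: each variable contributes on its own.
additive_data = [(a, b, 2.0 * a + 3.0 * b) for a, b in grid]
# Target that is pure interaction: exclusive-or of the two variables.
interaction_data = [(a, b, float(a ^ b)) for a, b in grid]

additive_error = mse(additive_data, additive_fit(additive_data))
interaction_error = mse(interaction_data, additive_fit(interaction_data))

print("additive target, additive model error:   ", additive_error)
print("interaction target, additive model error:", interaction_error)
```

On the additive target the interaction-free fit is exact, while on the exclusive-or target it can do no better than predicting the overall mean. A gap of that kind between the interaction-free model and a richer one is what the comparison of tree sizes is designed to detect.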

Published in Dan Steinberg
Monday, February 20 2012 15:02

A Reminder About Missing Values

Our tech support department receives a steady stream of interesting questions regarding how to use our products, with questions about specific features or how to accomplish a given task. We also receive questions about data mining (and predictive analytics generally), modeling strategy and a variety of other topics. One type of query that comes up periodically is what to do with missing values. We have spoken before about missing values in a variety of contexts, but usually at a fairly technical and advanced level. Today’s post is quite basic in nature and responds to a user’s question about what to do with special values for variables that are intended to represent missing values. Data input practice stemming from at least the 1970s has used ‘missing value codes’ for unknown data fields; favorite values have included a string of 9’s such as 9999 or -9999. There are a number of variations on this theme. For example, survey research firms have wanted to distinguish between different reasons for a missing value, using 9999 to represent values missing for no known reason, 9998 for ‘unknown,’ and 9997 for ‘refused.’ Data input clerks have been known to fill in missing birthdays with values such as January 1, 1960.
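As a minimal sketch of why such codes matter, the snippet below recodes the sentinel values mentioned above to a true missing marker before computing a summary. It assumes the data are plain Python lists and that recoding to `None` is the desired treatment; the function name `recode_missing` and the sample ages are hypothetical and do not come from any Salford product.

```python
from statistics import mean

# Sentinel codes taken from the survey example above; -9999 is given no
# specific meaning there, so its label here is an assumption.
MISSING_CODES = {
    9999: "no known reason",
    9998: "unknown",
    9997: "refused",
    -9999: "unspecified",
}


def recode_missing(values, codes=MISSING_CODES):
    """Replace sentinel codes with None and record why each value is missing."""
    cleaned, reasons = [], []
    for value in values:
        if value in codes:
            cleaned.append(None)
            reasons.append(codes[value])
        else:
            cleaned.append(value)
            reasons.append(None)
    return cleaned, reasons


ages = [34, 9999, 51, 9997, 28]  # hypothetical survey column
cleaned, reasons = recode_missing(ages)

naive_mean = mean(ages)  # treats the codes as real ages
clean_mean = mean(v for v in cleaned if v is not None)

print("naive mean:", naive_mean)
print("mean after recoding:", round(clean_mean, 2))
```

Keeping the reasons in a parallel list preserves the distinction between codes such as ‘refused’ and ‘unknown’ in case it turns out to be predictive.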

Published in Dan Steinberg
Learn to address the challenge of testing small training data sets and improve the reliability of results using Battery Cross-Validation (CVR).
Published in Tutorials
Learn to control the size of the maximal CART tree in two ways: Telling CART to stop early and limiting CART's freedom to produce small nodes.
Published in Tutorials
Friday, February 03 2012 12:23

AutoDiscovery of Predictors in SPM

Autodiscovery leverages the stability advantages of multiple trees to rank variables by importance and thus select a subset of predictors for modeling. In SPM 7 and earlier, Autodiscovery runs a very simple TreeNet model on the training data only, growing out to 200 trees. The variable importance ranking generated from this background model is then used to reduce the list of all available predictors to the top-performing ones. Autodiscovery is fast and easy, as there are no control parameters to set, but it is just a mechanism for quickly testing whether a substantial reduction in the number of predictors can improve model performance.

In most serious modeling projects we would supplement Autodiscovery with more intensive variable selection mechanisms such as those built into BATTERY SHAVING, where the “model, rank, select, and model again” cycle is repeated, possibly a very large number of times.
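The “model, rank, select, and model again” cycle can be sketched generically as follows. This is an illustration of the idea only, not the BATTERY SHAVING implementation: the function `shave`, its parameters, and the toy scoring function are all hypothetical, and a real project would replace `toy_fit_and_rank` with an actual model fit scored on test data.

```python
def shave(predictors, fit_and_rank, n_drop=1, min_predictors=1):
    """Repeat model -> rank -> select -> model again.

    fit_and_rank(columns) must return (score, ranking), where ranking lists
    the columns from most to least important. Returns the history of
    (columns, score) pairs so the best subset can be picked afterwards.
    """
    current = list(predictors)
    history = []
    while True:
        score, ranking = fit_and_rank(current)      # model and rank
        history.append((list(current), score))
        if len(current) - n_drop < min_predictors:
            break
        current = ranking[:len(current) - n_drop]   # select, then loop to model again
    return history


# Toy stand-in for a real model: two informative columns and two noise columns.
TOY_IMPORTANCE = {"x1": 5.0, "x2": 3.0, "noise1": 0.0, "noise2": 0.0}


def toy_fit_and_rank(columns):
    """Hypothetical scorer: reward signal columns, penalize each noise column."""
    ranking = sorted(columns, key=lambda c: TOY_IMPORTANCE[c], reverse=True)
    n_noise = sum(1 for c in columns if TOY_IMPORTANCE[c] == 0.0)
    score = sum(TOY_IMPORTANCE[c] for c in columns) - 0.5 * n_noise
    return score, ranking


history = shave(list(TOY_IMPORTANCE), toy_fit_and_rank)
best_predictors, best_score = max(history, key=lambda item: item[1])
print(best_predictors, best_score)
```

Recording the whole history, rather than stopping at the first drop in score, makes it possible to compare every subset size after the fact.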

Friday, January 13 2012 09:54

TreeNet for Beginners

Work through large databases quickly and accurately.
Published in Tutorials

In their 1984 monograph, Classification and Regression Trees, Breiman, Friedman, Olshen and Stone discussed at length the need to obtain “honest” estimates of the predictive accuracy of a tree–based model. At the time the monograph was written, many data sets were small, so the authors took great pains to work out an effective way to use cross–validation with CART trees. The result was a major advance for data mining, introducing ideas that at the time were radically new. The main point of the discussion was that the only way to avoid overfitting was to rely on test data. With plentiful data we can always reserve a portion for testing, but with fewer data we might have to rely on cross–validation. In either case, however, only the test or cross–validated results should be trusted. In contrast, earlier approaches tended to focus on training data performance and ignore test data.
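For readers new to the idea, here is a minimal sketch of how cross–validation partitions a small data set so that every record is used for testing exactly once. It is a generic illustration, not CART's internal procedure; the function name `k_fold_indices` and the choice of 10 records and 5 folds are assumptions made for the example.

```python
def k_fold_indices(n_records, n_folds):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Records are dealt into folds round-robin; each fold serves once as the
    test set while the remaining folds form the training set.
    """
    folds = [list(range(start, n_records, n_folds)) for start in range(n_folds)]
    for test in folds:
        held_out = set(test)
        train = [i for i in range(n_records) if i not in held_out]
        yield train, test


splits = list(k_fold_indices(n_records=10, n_folds=5))
for train, test in splits:
    # A model would be fit on `train` and scored on `test` here; only the
    # test scores, averaged over folds, estimate predictive accuracy.
    print("train:", train, "test:", test)
```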

Published in Dan Steinberg
Thursday, December 29 2011 10:44

Working With A Large Number of Variables In SPM

Salford Systems Predictive Modeler, including CART®, MARS®, TreeNet®, and RandomForests®, can handle any number of variables you care to work with. By default, the software launches prepared to work with up to 32,768 variables, which is sufficient for many users. If you need to work with more, you just need to tell the software when the application is launched.

If you are working with the non–GUI version, you use command line arguments to inform the application of your preferences. The command line syntax is:

     SPM.EXE    -v< N >      Specifies max N variables for the session.

With the GUI version you do essentially the same thing, adding the command line argument by modifying the properties of the application shortcut.

For example, to inform SPM that you expect to work with up to 50,000 variables, follow these steps:

  1. Right-click the program group icon and select “Properties.”
  2. In the Properties dialog, select the “Shortcut” tab.
  3. On the Shortcut tab, add the parameter “-V50000” to the “Target” path.

    The value used for this parameter sets the number of variables the application allows. For example, if you need to use 75,000 variables, set this parameter to -V75000.

  4. Click the [Apply] button.
  5. Click [OK] to close the shortcut properties dialog.
  6. Use your program group icon to start SPM or any other individual Salford Systems product.


Recent advances in machine learning technology make it possible to determine definitively whether or not interactions of any degree need to be included in a predictive model.

We can thus establish conclusively, for example, for a given set of predictors, that an additive model (one with no interactions) cannot be improved upon with interactions. Or alternatively, one might prove that a model with interactions will outperform a model without them.

Further, we can now identify precisely which interactions are supported by the data, and also the degree of interaction, even in very high dimensional data. The tools we use to achieve these results are extensions of Stanford University Professor Jerome Friedman's TreeNet, developed by the authors and embedded in the Salford Systems TreeNet 2.0 Pro Ex product.

Published in News
Friday, December 09 2011 08:34

A Few Comments On Boosting Decision Trees

Boosting is a machine learning strategy that came into being shortly after researchers discovered the value of “ensembles.” Ensembles are collections of models that are used as a group to make predictions (and classifications) that are often considerably more accurate than those of individual models. The models are combined either by averaging predictions or by using a voting scheme (for classification). Thus, if we built 101 classification models where the output of each model is a prediction of “YES” or “NO,” then the ensemble prediction might follow a majority vote rule: predict “YES” for any record that obtains at least 51 “YES” votes, and predict “NO” otherwise. Some ensemble methods use weighted voting, where the weights reflect the predictive accuracy of the individual models. In this post we want to focus on a few key ideas related to Salford products rather than the scientific field (we will do that in another post or paper).
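The voting rules described above can be sketched in a few lines. This is a generic illustration of majority and weighted voting, not the combination rule of any particular Salford product; the function names and the example weights are hypothetical.

```python
def majority_vote(votes, positive="YES", negative="NO"):
    """Predict `positive` when more than half of the models vote for it."""
    yes_count = sum(1 for vote in votes if vote == positive)
    return positive if yes_count > len(votes) / 2 else negative


def weighted_vote(votes, weights, positive="YES", negative="NO"):
    """Predict `positive` when it holds more than half of the total weight.

    The weights are assumed to reflect each model's predictive accuracy.
    """
    yes_weight = sum(w for vote, w in zip(votes, weights) if vote == positive)
    return positive if yes_weight > sum(weights) / 2 else negative


# 101 models: 51 "YES" votes carry the ensemble, 50 do not.
print(majority_vote(["YES"] * 51 + ["NO"] * 50))
print(majority_vote(["YES"] * 50 + ["NO"] * 51))

# One accurate model can outvote two weaker ones under weighted voting.
print(weighted_vote(["YES", "NO", "NO"], [0.9, 0.3, 0.3]))
```

Using an odd number of models, as in the 101-model example, avoids ties under the simple majority rule.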

Published in Dan Steinberg