Dan Steinberg

Dan Steinberg, President and Founder of Salford Systems, is a well-respected member of the statistics and econometrics communities. In 1992, he developed the first PC-based implementation of the original CART procedure, working in concert with Leo Breiman, Richard Olshen, Charles Stone and Jerome Friedman. In addition, he has provided consulting services on a number of biomedical and market research projects, which have sparked further innovations in the CART program and methodology.

Dr. Steinberg received his Ph.D. in Economics from Harvard University, and has given full day presentations on data mining for the American Marketing Association, the Direct Marketing Association and the American Statistical Association. A book he co-authored on Classification and Regression Trees was awarded the 1999 Nikkei Quality Control Literature Prize in Japan for excellence in statistical literature promoting the improvement of industrial quality control and management.

A Reminder About Missing Values

Written by Dan Steinberg Monday, February 20 2012

Our tech support department receives a steady stream of interesting questions about how to use our products, covering specific features or how to accomplish a given task. We also receive questions about data mining (and predictive analytics generally), modeling strategy and a variety of other topics. One type of query that comes up periodically is what to do with missing values. We have spoken before about missing values in a variety of contexts, but usually at a fairly technical and advanced level. Today’s post is quite basic in nature and responds to a user’s question about what to do with special values that are intended to represent missing data. Data input practice dating back to at least the 1970s has relied on ‘missing value codes’ for unknown data fields; favorite values have included a string of 9’s such as 9999 or -9999. There are a number of variations on this theme. For example, survey research firms have wanted to distinguish between different reasons for a missing value, using 9999 to represent values missing for no known reason, 9998 for ‘unknown’ and 9997 for ‘refused.’ Data input clerks have been known to fill in missing birthdays with values such as January 1, 1960.
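As a rough illustration of the point about special codes, the sketch below recodes sentinel values to a true missing value before modeling, while keeping the stated reason in a separate list. It is plain Python rather than SPM syntax; the code table simply mirrors the examples in the paragraph above, and the function name `recode_missing` and the sample values are hypothetical.

```python
import math

# Code table assumed from the examples in the text; real surveys define their own.
MISSING_CODES = {
    9999: "no known reason",
    9998: "unknown",
    9997: "refused",
    -9999: "unspecified",  # assumption: the text gives no reason for this code
}

def recode_missing(values, codes=MISSING_CODES):
    """Return (cleaned, reasons): sentinel codes become NaN, reasons are kept."""
    cleaned, reasons = [], []
    for v in values:
        if v in codes:
            cleaned.append(math.nan)
            reasons.append(codes[v])
        else:
            cleaned.append(float(v))
            reasons.append(None)
    return cleaned, reasons

# Hypothetical income field containing three sentinel codes.
incomes = [52000, 9999, 61000, 9997, -9999]
cleaned, reasons = recode_missing(incomes)
print(cleaned)
print(reasons)
```

Keeping the reasons alongside the cleaned values means the distinction between, say, "refused" and "unknown" is not lost once the numeric codes are gone.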

Why Use Cross-Validation?

Written by Dan Steinberg Friday, January 27 2012

Salford Predictive Modeler™ and its component data mining engines CART®, MARS®, TreeNet®, and RandomForests® contain a variety of tools to help modelers work quickly and efficiently. One of the most effective tools for rapid model development is found in the BATTERY tab of the MODEL Set Up dialog. Because there are so many tools embedded in that dialog we are going to start a series of posts going through the principal BATTERY choices, one at a time.

Let’s start with the idea of the BATTERY. The BATTERY mechanism is an automated system for running experiments and trying out different modeling ideas. Instead of you having to think about how to tweak your model to make it better, the BATTERY does it for you. Each BATTERY is a planned experiment in which we take some useful modeling control and run a series of models in which we systematically change that control. The best part of this is the SUMMARY, which provides you with an executive summary of the results and points you to the best performing model. We recommend that you use the BATTERY often; some modelers don’t do anything without setting up pre-packaged or user-customized batteries.
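The idea of a planned experiment over one control can be sketched in a few lines. This is not SPM's BATTERY syntax, only a generic illustration: `run_battery` is a hypothetical helper, and `toy_score` stands in for "build a model with this setting and return its test score".

```python
def run_battery(control_values, build_and_score):
    """Run one model per control value; return all results plus the best one."""
    results = [(value, build_and_score(value)) for value in control_values]
    best = max(results, key=lambda pair: pair[1])
    return results, best

# Stand-in scoring function (an assumption for illustration only).
# A real run would train and evaluate a model; this toy peaks at a setting of 10.
def toy_score(min_node_size):
    return 1.0 - abs(min_node_size - 10) / 100.0

results, best = run_battery([1, 2, 5, 10, 20, 50], toy_score)
for value, score in results:
    print(f"control={value:>3}  score={score:.3f}")
print("best setting:", best[0])
```

The list of results plays the role of the summary table, and `best` is the pointer to the top-performing model.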

Most users of Salford Systems’ data mining tools (CART®, MARS®, TreeNet®, RandomForests® or the more recent integrated SPM™ package) rely on the GUI (Graphical User Interface) to do their work. The GUI makes life easy, as you do not need to remember any command syntax, and of course the GUI has many useful visual displays of important results. But there are some good reasons to learn how to work with command scripts, which is the topic of the current posting. We will refer to our software as SPM (Salford Predictive Modeler), which includes all of our individual data mining engines.

It is useful to remember that almost everything you do during a GUI session using SPM has a “command equivalent.” That means that you could accomplish the identical model and results simply by submitting a set of commands to SPM instead of pointing and clicking. Even more useful to remember is that SPM automatically creates the equivalent set of commands for you as you work, saving the results to a text file. We will return to how to locate that text file a bit later.

Controlling Node Sizes in a CART Tree

Written by Dan Steinberg Friday, January 13 2012

The theory behind the CART decision tree, as laid out in detail in the classic monograph Classification and Regression Trees by Breiman, Friedman, Olshen, and Stone, dictates that CART trees always be grown to their largest possible size before pruning. This means that the smallest allowed terminal node will have only one record in it! In theory, this is not a problem because large CART trees are just the raw material that the pruning engine starts with to arrive at the optimal tree. Thus, it is likely that all the small nodes will be pruned off anyway.

In practice, however, real world data sets do not always work that way. In fact, in small to moderate sized data sets it can pay to manipulate the controls that govern the sample sizes allowed in CART nodes. In this note we discuss this topic and provide some examples and instruction on how to use the controls.

In their 1984 monograph, Classification and Regression Trees, Breiman, Friedman, Olshen and Stone discussed at length the need to obtain “honest” estimates of the predictive accuracy of a tree-based model. At the time the monograph was written, many data sets were small, so the authors took great pains to work out an effective way to use cross-validation with CART trees. The result was a major advance for data mining, introducing ideas that at the time were radically new. The main point of the discussion was that the only way to avoid overfitting was to rely on test data. With plentiful data we can always reserve a portion for testing, but with fewer data we might have to rely on cross-validation. In either case, however, only the test or cross-validated results should be trusted. In contrast, earlier approaches tended to ignore test data and focus only on training data performance results.
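The mechanics of cross-validation can be sketched as follows. This is a generic k-fold illustration in plain Python, not the CART implementation; the helper names, the toy majority-class "model" and the toy labels are all assumptions made for the example.

```python
import random

def k_fold_indices(n_records, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; each record is tested exactly once."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

def cross_validated_score(records, fit, score, k=10):
    """Average the held-out score over k folds."""
    fold_scores = []
    for train_idx, test_idx in k_fold_indices(len(records), k):
        model = fit([records[i] for i in train_idx])
        fold_scores.append(score(model, [records[i] for i in test_idx]))
    return sum(fold_scores) / len(fold_scores)

# Toy usage: the "model" is just the majority class of the training labels.
def fit_majority(train_labels):
    return max(set(train_labels), key=train_labels.count)

def accuracy(model, test_labels):
    return sum(y == model for y in test_labels) / len(test_labels)

labels = [0] * 70 + [1] * 30
print(cross_validated_score(labels, fit_majority, accuracy, k=10))
```

Every score that enters the average comes from records the model did not see during fitting, which is what makes the estimate "honest" in the sense described above.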

Salford predictive modeling engines use high precision algorithms to compute essential results but printed reports and the GUI may display results with relatively less precision, for convenience of the display. There may well be circumstances when you need to pay careful attention to this however, and insist that the data mining tool print, display, and save results in the highest useful precision.
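The gap between computed and displayed precision is easy to demonstrate in general terms. The snippet below is plain Python and says nothing about how SPM formats its own reports; it only shows that a value saved at display precision is no longer the value that was computed.

```python
value = 2 / 3

print(f"{value:.3f}")   # rounded for display: 0.667
print(repr(value))      # full-precision text that round-trips to the same float

# A result saved at display precision no longer equals the computed value,
# while the full-precision text converts back to it exactly.
assert float(f"{value:.3f}") != value
assert float(repr(value)) == value
```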

Does CART allow multiple targets?

Written by Dan Steinberg Thursday, December 22 2011

There are two ways to interpret your question:

  • Does CART® allow multi-class targets (e.g., a class label with values 1, 2, 3, etc.)?

    CART has been used in real world classification problems with more than 400 classes.

    In one project our goal was to predict which specific model of new car a given person actually bought. In the project there were more than 400 different car models available and the predictors were drawn from a lengthy set of attitude and interest questions.

    For such models to be useful you need to have a decent sample size for each level of the target. In the car purchase study some models had been bought by more than 2000 people (a good sample size) while some exotic and expensive cars had been bought by fewer than 10 people (the total sample size was over 50,000 records). Naturally, we could not place much faith in the predictions concerning the least frequently bought cars. However, overall, the models built were both quite accurate and generated considerable insight into the factors influencing consumer choice in car purchases.
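A quick way to spot target levels with too few records is to count them before modeling. The sketch below is a minimal illustration; the threshold of 10 echoes the "fewer than 10 people" remark above but is otherwise an assumption, and the class labels and counts are made up.

```python
from collections import Counter

def thin_classes(target_values, min_count=10):
    """Return (label, count) for classes below min_count, rarest first."""
    counts = Counter(target_values)
    return sorted(
        ((label, n) for label, n in counts.items() if n < min_count),
        key=lambda pair: pair[1],
    )

# Hypothetical purchase records: two well-populated models and one rare one.
purchases = ["sedan_a"] * 2000 + ["hatch_b"] * 150 + ["exotic_c"] * 7
print(thin_classes(purchases))  # [('exotic_c', 7)]
```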

CART® vs. The Clones

Written by Dan Steinberg Friday, December 16 2011

At Salford Systems we are frequently asked what the difference is between the trademarked decision tree CART® and the various clones that have been created by other companies, or that have been contributed as user-written packages to community-oriented systems. Our website contains a variety of essays and FAQs on this matter, and we link to them below. But here is a very brief summary of the details:

The original and true CART was written entirely by Stanford University Professor Jerome H. Friedman, and has always been proprietary source code available only to Salford Systems. Friedman is one of the inventors of CART and widely regarded as one of the most influential and important researchers in data mining. He is also considered one of the world's best algorithm writers and scientific programmers. In other words, we offer the only true CART written by a creator of this revolutionary technology. It contains everything discussed in the original CART monograph and much more that was not touched upon in the book.

Boosting is a machine learning strategy that came into being shortly after researchers discovered the value of “ensembles.” Ensembles are collections of models which are used as a group to make predictions (and classifications) that are often considerably more accurate than individual models. The models are combined either by averaging predictions or using a voting scheme (for classification). Thus, if we built 101 classification models where the output of each model is a prediction of “YES” or “NO” then the ensemble prediction might follow a majority vote rule: predict YES for any record that obtains at least 51 YES votes, and predict “NO” otherwise. Some ensemble methods use weighted voting where the weights reflect the predictive accuracy of the individual models. In this post we want to focus on a few key ideas related to Salford products rather than the scientific field (we will do that in another post or paper).
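The voting rule described above can be written down directly. This is a minimal sketch of simple and weighted majority voting, not code from any Salford product; the function names are hypothetical.

```python
def majority_vote(votes):
    """Predict "YES" when strictly more than half of the models vote "YES"."""
    yes_votes = sum(1 for v in votes if v == "YES")
    return "YES" if yes_votes > len(votes) / 2 else "NO"

def weighted_vote(votes, weights):
    """Weighted variant: each model's vote counts in proportion to its weight."""
    yes_weight = sum(w for v, w in zip(votes, weights) if v == "YES")
    return "YES" if yes_weight > sum(weights) / 2 else "NO"

# 101 models: 51 YES votes win, 50 YES votes do not.
print(majority_vote(["YES"] * 51 + ["NO"] * 50))  # YES
print(majority_vote(["YES"] * 50 + ["NO"] * 51))  # NO

# One accurate model can outvote two weaker ones under weighting.
print(weighted_vote(["YES", "NO", "NO"], [3, 1, 1]))  # YES
```

Using an odd number of models, as in the 101-model example, avoids ties in the unweighted case.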

Binary Classification

CART®

The original CART monograph discusses a study the authors performed working with 215 observations and 19 predictors, where 37 records were of class 1 and 178 of class 0. We think that this example, with 37 records in the smaller class, is close to the smallest sample size you can usefully work with in CART.

Recommendation: We suggest using a minimum of 100 records, with the target variable no more unbalanced than the proportions (1/3, 2/3), for up to 30 predictors. We recommend repeated cross-validation to estimate the out-of-sample (previously unseen data) performance.
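The rule of thumb can be turned into a simple check. The sketch below is one reading of the recommendation (at least 100 records, a smaller class share of at least 1/3, at most 30 predictors); the function name and the returned keys are hypothetical.

```python
def meets_recommendation(n_minority, n_majority, n_predictors):
    """Check a binary-target data set against the rule of thumb in the text."""
    n_total = n_minority + n_majority
    minority_share = min(n_minority, n_majority) / n_total
    return {
        "enough_records": n_total >= 100,
        "balanced_enough": minority_share >= 1 / 3,
        "few_enough_predictors": n_predictors <= 30,
    }

# The monograph example: 37 records of class 1, 178 of class 0, 19 predictors.
print(meets_recommendation(37, 178, 19))
```

Under this reading, the monograph example passes the record and predictor checks but not the balance check, since 37 of 215 records is about 17% in the smaller class.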
