Monday, February 20 2012 15:02

A Reminder About Missing Values

Our tech support department receives a steady stream of interesting questions about how to use our products, whether about specific features or how to accomplish a given task. We also receive questions about data mining (and predictive analytics generally), modeling strategy, and a variety of other topics. One type of query that comes up periodically is what to do with missing values. We have spoken before about missing values in a variety of contexts, but usually at a fairly technical and advanced level. Today’s post is quite basic in nature and responds to a user’s question about how to handle special values that are intended to represent missing data. Data entry practice dating back to at least the 1970s has used ‘missing value codes’ for unknown data fields; favorite values have included a string of 9’s such as 9999 or -9999. There are a number of variations on this theme. For example, survey research firms have wanted to distinguish between different reasons for a missing value, using 9999 to represent values missing for no known reason, 9998 for ‘unknown,’ and 9997 for ‘refused.’ Data input clerks have been known to fill in missing birthdays with values such as January 1, 1960.
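
To make the issue concrete, here is a minimal Python sketch (not SPM syntax) of recoding such sentinel values before modeling. The three codes follow the survey example above; the `income` data, the function name, and the choice of NaN as the missing marker are assumptions made for illustration only.

```python
import math

# Sentinel codes taken from the survey example in the post.
MISSING_CODES = {
    9999: "missing for no known reason",
    9998: "unknown",
    9997: "refused",
}

def recode_missing(values, codes=MISSING_CODES):
    """Replace sentinel codes with NaN and record the reason for each one."""
    cleaned, reasons = [], []
    for v in values:
        if v in codes:
            cleaned.append(math.nan)   # treat as truly missing
            reasons.append(codes[v])   # keep the reason as a separate field
        else:
            cleaned.append(float(v))
            reasons.append(None)
    return cleaned, reasons

# Hypothetical data column containing three sentinel values.
income = [52000, 9999, 61000, 9997, 9998]
cleaned, reasons = recode_missing(income)

# Summary statistics should use only the observed values.
observed = [v for v in cleaned if not math.isnan(v)]
observed_mean = sum(observed) / len(observed)
print(observed_mean)
```

Left as they are, the three codes would be read as ordinary incomes near 10,000 and would distort any average or split point computed from the column.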

Published in Dan Steinberg
Thursday, December 29 2011 10:44

Working With A Large Number of Variables In SPM

Salford Systems Predictive Modeler, including CART®, MARS®, TreeNet®, and RandomForests®, can handle any number of variables you care to work with. By default your software will launch prepared to work with up to 32,768 variables which is sufficient for many users. However, if you need to work with a larger number you just need to let the software know at the time the application is launched.

If you are working with the non-GUI version, you use command line arguments to inform the application of your preferences. The command line syntax is:

     SPM.EXE -v<N>      Specifies the maximum number of variables (N) for the session.

With the GUI version you do essentially the same, adding the command line arguments by modifying the properties of the application.

For example, follow these steps to inform SPM that you expect to work with up to 50,000 variables:

  1. Right click on the program group icon and select “Properties.”
  2. From the Properties dialog, select the “Shortcut” tab.
  3. From the Shortcut tab, append the parameter “-V50000” to the “Target” path.

    The value used for this parameter sets the number of variables the application will allow. For example, if you need to use 75,000 variables, set this parameter to -V75000.

  4. Click the [Apply] button.
  5. Click [OK] to close the shortcut properties dialog.
  6. Use your program group icon to start SPM or any other individual Salford Systems product.
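
The rule described above can be sketched in a few lines of Python: the flag is only needed when the variable count exceeds the 32,768 default. The helper name and the install path below are hypothetical; only the `-V<N>` form and the default come from the post.

```python
DEFAULT_MAX_VARIABLES = 32768  # default capacity stated in the post

def spm_target(executable, n_variables):
    """Build a shortcut Target string, appending -V<N> only when it is needed."""
    quoted = f'"{executable}"'  # quote the path in case it contains spaces
    if n_variables <= DEFAULT_MAX_VARIABLES:
        return quoted
    return f"{quoted} -V{n_variables}"

# Hypothetical install location, used only for illustration.
print(spm_target(r"C:\Program Files\Salford Systems\SPM.EXE", 50000))
```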

Binary Classification

CART®

The original CART monograph discusses a study the authors performed with 215 observations and 19 predictors, where 37 records were of class 1 and 178 of class 0. We think that this example, with 37 records in the smaller class, is close to the smallest sample size you can usefully work with in CART.

Recommendation: We suggest using a minimum of 100 records, with the target variable no more unbalanced than proportions of (1/3, 2/3), for up to 30 predictors. We recommend repeated cross-validation to estimate the out-of-sample (previously unseen data) performance.
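
As a rough illustration of the recommendation, the Python sketch below checks a data set against the suggested minimums and generates the test folds for repeated, stratified cross-validation. It is not how SPM implements cross-validation; the function names, the choice of 10 folds and 5 repeats, and the reading of the guideline as three simultaneous conditions are all assumptions.

```python
import random

def meets_guideline(n_class0, n_class1, n_predictors):
    """Assumed reading: at least 100 records, minority share >= 1/3, at most 30 predictors."""
    n = n_class0 + n_class1
    minority = min(n_class0, n_class1)
    return n >= 100 and 3 * minority >= n and n_predictors <= 30

def repeated_stratified_folds(labels, k=10, repeats=5, seed=0):
    """Yield (repeat, fold, test_indices); every repeat reshuffles within each class."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    for r in range(repeats):
        folds = [[] for _ in range(k)]
        for indices in by_class.values():
            shuffled = indices[:]
            rng.shuffle(shuffled)
            # Deal records round-robin so each fold keeps the class balance.
            for pos, i in enumerate(shuffled):
                folds[pos % k].append(i)
        for f, test in enumerate(folds):
            yield r, f, sorted(test)

# Class counts from the monograph study: 37 of class 1, 178 of class 0.
labels = [1] * 37 + [0] * 178
splits = list(repeated_stratified_folds(labels))
print(len(splits), meets_guideline(178, 37, 19))
```

Each test fold is scored by a model trained on the remaining records, and the scores are averaged over all folds and repeats. Under this reading, the monograph's 37/178 split falls short of the suggested balance, which fits the post's description of it as close to the smallest workable sample.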

Published in Dan Steinberg

A Salford Systems client has recently published a book chapter focused on MARS® in “Statistical Models of Characteristics of Metal Vapor Lasers.”

Intro: In this Chapter, MARS models have been obtained based on all available data for examined CuBr lasers and not only on random samples from the conducted experiments. For this reason, the data are not random, since they have been selected by the researcher. On the other hand, in this way, the fullest possible information about the investigated dependences is utilized. Unlike classic parametric techniques, the models in this chapter are entirely data driven. The data are taken from 274 observations of all 12 variables, described in Table 2.2. Observations where some measurements are missing have not been included. All models are the best MARS models of the respective type. Although the presented models are complex in form, they require few computations when calculating a predicted value. What is more, these are relatively easy to interpret for each specific experiment case.

The full book chapter can be accessed free of charge online at: https://www.novapublishers.com/catalog/product_info.php?products_id=31157
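
The excerpt's remark that MARS models need few computations per prediction can be illustrated with a small Python sketch. A MARS model is a weighted sum of hinge functions of the form max(0, x - t) or max(0, t - x), so scoring a record takes only a handful of comparisons and multiplications. The intercept, coefficients, and knots below are made up for illustration and have nothing to do with the chapter's laser models; products of hinges (interaction terms) are omitted for brevity.

```python
def hinge(x):
    """The basic MARS building block: max(0, x)."""
    return max(0.0, x)

# Hypothetical additive model. Each term is (coefficient, variable index, knot, direction);
# direction +1 means max(0, x - knot), direction -1 means max(0, knot - x).
INTERCEPT = 2.0
TERMS = [
    (1.5, 0, 10.0, +1),
    (-0.5, 0, 10.0, -1),
    (0.25, 1, 3.0, +1),
]

def predict(x):
    """Score one record: one comparison and one multiply-add per term."""
    total = INTERCEPT
    for coef, var, knot, direction in TERMS:
        total += coef * hinge(direction * (x[var] - knot))
    return total

print(predict([12.0, 5.0]))
print(predict([8.0, 1.0]))
```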

Published in News

The Salford CART decision tree is exceptional in supporting an essentially unlimited number of target levels. Of course, the vast majority of classification problems tackled by analysts have two classes, or are reformulated to have two classes. There is no reason, however, to confine yourself to just two levels if you are working with CART. In our training materials we discuss three-level, five-level, and ten-level examples in detail. The ten-level example concerns the reverse engineering of a clustering solution, in which a market researcher was looking to extract a simple set of rules that could be used to assign new records to a previously constructed clustering solution based on a very large number of variables. Ten levels is a rather small number when considering how far you might be able to stretch the CART machinery. In our work with a car manufacturer, our goal was to predict the specific car model chosen by a new car buyer from a set of more than 400 alternatives. The analysis was based on survey responses to several hundred attitude and preference questions administered to more than 20,000 new car buyers, and the results yielded extraordinary insight into the needs and wants driving ultimate car model selection. In our own internal testing of CART classification based on synthetic data, we have successfully run CART models on targets with 1,000 levels.

Published in Dan Steinberg

By: Edouard Philippe Martin

Ground-level ozone (O3) and fine particulate matter (PM2.5) are two air pollutants known to reduce visibility, damage building materials, and adversely affect human health. O3 is the result of a series of complex chemical reactions between nitrogen oxides (NOx) and volatile organic compounds (VOCs) in the presence of solar radiation. PM is a class of airborne contaminants composed of sulphate, nitrate, ammonium, crustal components and trace amounts of microorganisms. PM2.5 is the respirable subgroup of PM having an aerodynamic diameter of less than 2.5 μm. Development of effective forecasting models for ground-level O3 and PM2.5 is important to warn the public about potentially harmful or unhealthy concentration levels.

Published in News

CHARLOTTE, N.C. - Salford Systems returns to the annual Institute for Operations Research and the Management Sciences (INFORMS) conference, where it will present data mining technology in a user-friendly way to expert and novice users at the conference’s technology workshop. The workshop will be held on Saturday, Nov. 12, 9:00 a.m. – 11:30 a.m. at the Charlotte Convention Center.

Published in News

SAN DIEGO - Academic and student license fees for Salford Systems’ predictive modeling software have been dramatically reduced for the 2011 academic year. Through Studica, a source for software and technology products for academics, Salford Systems has been able to reduce its license fees for these groups by up to nearly 95 percent.

Salford Systems’ four core data mining products (CART, MARS, TreeNet and RandomForests), as well as the Salford Predictive Modeler™ (SPM) suite, will be offered with a one-year, single-user license. Students interested in using SPM, which includes all four products, may purchase it for as little as $120 per license. Professors will also be eligible for discounted software licenses. In addition, professors may receive free 90-day student licenses with the purchase of a license upon submitting their current faculty ID and course syllabus.

Published in News
Wednesday, August 31 2011 08:02

Introduction

Dan Steinberg, CEO of Salford Systems, has initiated a blog principally devoted to technical matters pertaining to our core products CART, MARS, TreeNet, RandomForests, Generalized Path Seeker, and RULEFIT, among others. This new blog focuses on the fields of data mining, machine learning, predictive analytics, and business intelligence, but with a personal perspective. Entries here could well recount conversations with product developer Jerry Friedman, or some time ago with Leo Breiman, or could reflect his thoughts on the art and practice of advanced analytics and the development of new analytics methodology.

Published in Dan Steinberg

Part 5
Louise Francis of the Casualty Actuarial Society and Francis Analytics presents "Data and Disaster: The Role of Data in the Financial Crisis" at the Salford Data Mining Conference, 2009.

Published in Conference