SAN DIEGO — RandomForests® Co-Developer Dr. Adele Cutler is presenting a case study of archetypal analysis of dietary patterns related to memory and aging at the Salford Analytics and Data Mining Conference. The conference will take place on May 24–25, 2012, at the Courtyard San Diego Old Town hotel in San Diego, Calif.
Dr. Cutler's work with the late Dr. Leo Breiman of the University of California, Berkeley on RandomForests has helped data mining programs and consulting firms accomplish key project objectives, thanks to its ability to work with large datasets and deliver very high predictive accuracy.
“RandomForests and Archetypal Analysis of Dietary Patterns in the Cache County Study on Memory and Aging” is Dr. Cutler’s joint work with Heidi Wengreen, Ron Munger, Chris Corcoran and Anna Quach at the University of Utah. This and other real-world data mining case studies to be presented at ADMC are a true testament to the power that algorithms such as RandomForests have in the modern world when the aim is to turn a dataset into knowledge.
Our tech support department receives a steady stream of interesting questions about how to use our products, including questions about specific features or how to accomplish a given task. We also receive questions about data mining (and predictive analytics generally), modeling strategy and a variety of other topics. One type of query that comes up periodically is what to do with missing values. We have spoken before about missing values in a variety of contexts, but usually at a fairly technical and advanced level. Today’s post is quite basic in nature and responds to a user’s question about what to do with special values that are intended to represent missing values.
Data input practice dating back to at least the 1970s has used ‘missing value codes’ for unknown data fields; favorite values have included a string of 9’s such as 9999 or -9999. There are a number of variations on this theme. Survey research firms, for example, have wanted to distinguish between different reasons for a missing value, using 9999 to represent values missing for no known reason, 9998 for ‘unknown,’ and 9997 for ‘refused.’ Data input clerks have been known to fill in missing birthdays with values such as January 1, 1960.
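As a minimal sketch of the recoding such data usually calls for, the Python snippet below turns missing value codes into a true missing value (None) and keeps the reason in a parallel list. The codes 9999, 9998 and 9997 follow the survey example above; the function names and the sample column are hypothetical, and this is generic Python rather than Salford syntax.

```python
# Hypothetical mapping of sentinel codes to the reason a value is missing,
# following the survey example in the text.
MISSING_CODES = {
    9999: "no known reason",
    9998: "unknown",
    9997: "refused",
}


def recode_missing(value, codes=MISSING_CODES):
    """Return (clean_value, reason); clean_value is None for a missing value code."""
    if value in codes:
        return None, codes[value]
    return value, None


def recode_column(values, codes=MISSING_CODES):
    """Split a column into cleaned values and a parallel list of missing reasons."""
    cleaned, reasons = [], []
    for v in values:
        clean, reason = recode_missing(v, codes)
        cleaned.append(clean)
        reasons.append(reason)
    return cleaned, reasons


incomes = [52000, 9999, 9997, 61000, 9998]  # made-up illustration data
cleaned, reasons = recode_column(incomes)
print(cleaned)   # [52000, None, None, 61000, None]
print(reasons)   # [None, 'no known reason', 'refused', None, 'unknown']
```

Keeping the reason separate means the numeric field no longer contains fake values such as 9999, while the distinction between ‘unknown’ and ‘refused’ is still available as a predictor if it turns out to matter.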
Salford Systems Predictive Modeler, including CART®, MARS®, TreeNet®, and RandomForests®, can handle any number of variables you care to work with. By default your software will launch prepared to work with up to 32,768 variables, which is sufficient for many users. However, if you need to work with a larger number, you just need to let the software know at the time the application is launched.
If you are working with the non-GUI version, you use command line arguments to inform the application of your preferences. The command line syntax is:
SPM.EXE -v<N>    Specifies max N variables for the session.
With the GUI version you do essentially the same, adding the command line arguments by modifying the properties of the application.
Follow these steps, for example, to inform SPM that you expect to work with up to 50,000 variables:
The value used for this parameter sets the number of variables the application will allow. For example, if you need to use 75,000 variables, you would set this parameter to -V75000.
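To make the rule concrete, here is a small Python sketch that builds the launch command for a given number of variables. Only the -V<N> argument and the 32,768 default come from the text above; the helper name is hypothetical, and the code only assembles the command line without launching anything.

```python
DEFAULT_MAX_VARIABLES = 32768  # default limit stated in the text


def spm_command(n_variables, executable="SPM.EXE"):
    """Build the argument list for launching SPM with room for n_variables."""
    if n_variables < 1:
        raise ValueError("n_variables must be positive")
    args = [executable]
    # Only request a larger limit when the default is not enough.
    if n_variables > DEFAULT_MAX_VARIABLES:
        args.append(f"-V{n_variables}")
    return args


print(" ".join(spm_command(20000)))   # SPM.EXE
print(" ".join(spm_command(75000)))   # SPM.EXE -V75000
```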
SAN DIEGO — The co-developers of CART® and RandomForests® include two of the prominent speakers at Salford Systems’ Analytics and Data Mining Conference, which will be held in San Diego, CA, May 24–25, 2012.
CART co-developer Dr. Richard Olshen’s research interests are in statistics and mathematics and their applications to medicine and biology. Many of his efforts have concerned binary tree-structured algorithms for classification, regression, survival analysis, and clustering. Those for classification and survival analysis have been used with success in computer-aided diagnosis and prognosis, especially in cardiology, oncology, and toxicology.
The Salford CART decision tree is exceptional in supporting an essentially unlimited number of target levels. Of course the vast majority of classification problems tackled by analysts have two classes, or are reformulated to have two classes. There is no reason, however, to confine yourself to just two levels if you are working with CART. In our training materials we discuss three-level, five-level, and ten-level examples in detail. The ten-level example concerns the reverse engineering of a clustering solution, in which a market researcher was looking to extract a simple set of rules that could be used to assign new records to a previously constructed clustering solution based on a very large number of variables. Ten levels is a rather small number when considering how far you might be able to stretch the CART machinery. In our work with a car manufacturer our goal was to predict the specific car model chosen by a new car buyer from a set of more than 400 alternatives. The analysis was based on survey responses to several hundred attitude and preference questions administered to more than 20,000 new car buyers, and the results yielded extraordinary insight into the needs and wants driving ultimate car model selection. In our own internal testing of CART classification based on synthetic data, we have successfully run CART models on targets with 1,000 levels.
In 1995 Leo Breiman was actively experimenting with his first version of the bagger, and at that time I was in constant contact with him via email. In some cases at Salford Systems we implemented ideas of Leo's as we were discussing them with him. At other times we debated certain details and exchanged ideas in a lively give and take. Leo's initial ideas always took as a given that the bagged trees needed to be pruned, and he was using 10-fold cross validation to do so. Because this added a substantial computational burden to the process, I suggested that he use the OOB (out-of-bag) data to test and prune each bagged tree. In response, Leo began experimenting with this idea and eventually concluded that the entire training sample (both in-bag and out-of-bag) should be used to prune each bagged tree. Of course, subsequent research showed that unpruned trees were in fact ideal, and thus the topic of using OOB data for pruning trees fell by the wayside. OOB data became very important in Leo's subsequent work on RandomForests four years later.
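For readers unfamiliar with the terminology, the following Python sketch shows how in-bag and out-of-bag records arise when a bootstrap sample is drawn for each bagged tree. It is a generic illustration using only the standard library, not Leo's code or the Salford implementation; the function name, sample size and seed are arbitrary choices.

```python
import random


def bootstrap_split(n_records, rng):
    """Draw one bootstrap sample of row indices and return (in_bag, out_of_bag)."""
    # Sample n_records indices with replacement: some rows repeat, others are left out.
    in_bag = [rng.randrange(n_records) for _ in range(n_records)]
    drawn = set(in_bag)
    # Rows never drawn form the out-of-bag set for this tree.
    out_of_bag = [i for i in range(n_records) if i not in drawn]
    return in_bag, out_of_bag


rng = random.Random(1995)  # fixed seed so the sketch is repeatable
n = 1000
for tree in range(3):
    in_bag, oob = bootstrap_split(n, rng)
    # On average roughly 36.8% of rows (about 1/e) end up out of bag.
    print(f"tree {tree}: {len(set(in_bag))} distinct in-bag rows, {len(oob)} OOB rows")
```

Because each tree never sees its own OOB rows during growing, those rows can serve as a free test sample for that tree, which is what made them attractive as a substitute for 10-fold cross validation.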
The emails here are a selection of messages I received from Leo in mid-1995 on the topic. Unfortunately, we do not appear to have any copies of my side of the conversation. We hope to post other messages from Leo here from time to time as his remarks covered a very broad range of topics pertaining to trees and data mining.
CHARLOTTE, N.C. — Salford Systems returns to the annual Institute for Operations Research and the Management Sciences (INFORMS) conference, where it will present data mining technology in a user-friendly way to expert and novice users at the conference’s technology workshop. The workshop will be held on Saturday, Nov. 12, 9:00 a.m. – 11:30 a.m. at the Charlotte Convention Center.
SAN DIEGO — Academic and student license fees for Salford Systems’ predictive modeling software have been dramatically reduced for the 2011 academic year. Through Studica, a source for software and technology products for academics, Salford Systems has been able to reduce its license fees for these groups by up to nearly 95 percent.
Salford Systems’ four core data mining products (CART, MARS, TreeNet and RandomForests), as well as the Salford Predictive Modeler™ (SPM) suite, will each carry a one-year license for a single user. Students interested in using SPM, which includes all four products, may purchase it for as little as $120 per license. Professors will also be eligible for discounted software licenses. In addition, professors may receive free 90-day student licenses with the purchase of a license upon submitting their current faculty ID and course syllabus.
Dan Steinberg, CEO of Salford Systems, has launched a blog principally devoted to technical matters pertaining to our core products, including CART, MARS, TreeNet, RandomForests, Generalized Path Seeker, and RULEFIT. This new blog focuses on the fields of data mining, machine learning, predictive analytics, and business intelligence, but with a personal perspective. Entries could well recount conversations with product developer Jerry Friedman, or earlier ones with Leo Breiman, or could reflect his thoughts on the art and practice of advanced analytics and the development of new analytics methodology.