Dan Steinberg, President and Founder of Salford Systems, is a well-respected member of the statistics and econometrics communities. In 1992, he developed the first PC-based implementation of the original CART procedure, working in concert with Leo Breiman, Richard Olshen, Charles Stone and Jerome Friedman. In addition, he has provided consulting services on a number of biomedical and market research projects, which have sparked further innovations in the CART program and methodology.
Dr. Steinberg received his Ph.D. in Economics from Harvard University and has given full-day presentations on data mining for the American Marketing Association, the Direct Marketing Association and the American Statistical Association. A book he co-authored on Classification and Regression Trees was awarded the 1999 Nikkei Quality Control Literature Prize in Japan for excellence in statistical literature promoting the improvement of industrial quality control and management.
You can use CART itself to do this via the built-in SCORE facility. If you use the GUI, you can access the SCORE dialog via the toolbar icon at the extreme right, or from the Model menu.
Scoring any data set will produce one output record for each input record along with the CART prediction (RESPONSE) and the node number of the terminal node for that record (NODE). You can then SELECT the relevant records from the saved data set in subsequent analyses. The built-in BASIC can be used to delete data for NODE values you are not interested in, but this requires that you first SAVE the scored data set.
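If you prefer to filter the saved scored data outside of CART, the idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the RESPONSE and NODE column names come from the description above, while the ID column, the CSV layout and every value shown are invented for the example.

```python
import csv
import io

# Hypothetical scored output. Only RESPONSE and NODE are named in the text;
# the ID column and all values here are invented for illustration.
scored_csv = """ID,RESPONSE,NODE
1,0,3
2,1,5
3,1,5
4,0,7
"""


def select_nodes(rows, keep_nodes):
    """Return only the rows whose NODE value is in keep_nodes."""
    return [row for row in rows if int(row["NODE"]) in keep_nodes]


# In practice you would open the saved scored file instead of a string.
rows = list(csv.DictReader(io.StringIO(scored_csv)))
selected = select_nodes(rows, keep_nodes={5})

for row in selected:
    print(row["ID"], row["RESPONSE"], row["NODE"])
```

Keeping the node numbers of interest in a set makes it easy to retain several terminal nodes at once.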
There are several stages to interaction detection using TreeNet models. The first stage is to run a simple comparison of test sample performance for TreeNet models run with trees of different sizes. The baseline model would be the TreeNet using 2-node trees (sometimes known as "stumps"). The core idea is that a tree grown with a single split cannot reflect any kind of interaction: the entire story for the tree involves a single variable, and by definition an interaction requires at a minimum two different variables. The 2-node tree baseline model thus represents the best possible model TreeNet can grow when interactions are prevented. We then grow at least one more model whose trees are allowed more than 2 nodes, which thus allows interactions. The simple story is that if the 2-node TreeNet is as good, or almost as good, as the larger tree model, then we have compelling data-based evidence that interactions are irrelevant to the data generation process (how the real world actually operates to produce this data).
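The reason a model built from 2-node trees cannot capture interactions can be written out explicitly. The notation below is introduced here purely for illustration and is not taken from the TreeNet documentation: tree m splits on a single variable j(m) at threshold c_m, any learning rate is absorbed into the constants a_m and b_m, and splits are written as if on continuous variables.

```latex
% Each 2-node tree ("stump") depends on exactly one variable, x_{j(m)}:
T_m(x) = a_m + b_m \,\mathbf{1}\!\left[x_{j(m)} \le c_m\right]

% Summing M stumps and grouping the terms by split variable gives an additive model:
F(x) = \sum_{m=1}^{M} T_m(x) = a + \sum_{j} f_j(x_j),
\qquad a = \sum_{m=1}^{M} a_m,
\qquad f_j(x_j) = \sum_{m:\, j(m) = j} b_m \,\mathbf{1}\!\left[x_j \le c_m\right]
```

Every term depends on one variable only, so no term of the form g(x_j, x_k) can appear. Once a tree has three or more terminal nodes, a single path can involve two different variables, and such joint terms become possible.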
Once you have built an SPM model (CART, MARS, TreeNet, RandomForests) and have saved the grove (.GRV) file, you are in a position to make predictions for any other data set containing relevant predictors. Thus, if you trained your model on file A using variables X1, X2,...,X50, for example, you can now make predictions for file B, provided that file B contains at least some of the same variables (and preferably all of the variables actually used in the model).
This process of prediction generation is called SCORING in our software and most models are built specifically so that they can be put into production to generate predictions. The process can also be used for SIMULATION. In this case you prepare a data set which will also contain the columns X1, X2, ...,X50 but the values appearing may not necessarily be real data. Instead the file could contain hypothesized or imagined values, or forecasted values, as in the case when you want to make predictions for certain possible future scenarios.
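One way to prepare such a simulation file is to enumerate every combination of a few hypothesized values per predictor. The sketch below does this in Python; it is an illustration only, and the three variables and their values are invented (a real file would need the columns your model actually uses, such as X1 through X50).

```python
import csv
import io
import itertools

# Hypothetical scenario values for three predictors; all values are invented.
scenarios = {
    "X1": [10, 20, 30],
    "X2": [0.5, 1.5],
    "X3": ["low", "high"],
}

# One row per combination of the hypothesized values.
names = list(scenarios)
grid = [dict(zip(names, combo)) for combo in itertools.product(*scenarios.values())]

# Write the grid as CSV text; in practice you would write to a file
# and then score that file with the saved grove.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=names)
writer.writeheader()
writer.writerows(grid)
simulation_csv = buffer.getvalue()

print(simulation_csv)
```

A full grid grows quickly as variables are added, so with many predictors it is usually more practical to vary a handful of them and hold the rest at typical values.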
If you open a saved grove for any Salford Systems data mining engine (CART, MARS, TreeNet, RandomForests) you will notice a “Commands” button among a row of controls along the bottom of the display. The Commands button will open a plain text window displaying all the commands entered in your session up until the run that generated the grove.
Our tech support department receives a steady stream of interesting questions regarding how to use our products, including questions about specific features or how to accomplish a given task. We also receive questions about data mining (and predictive analytics generally), modeling strategy and a variety of other topics. One type of query that comes up periodically is what to do with missing values. We have spoken before about missing values in a variety of contexts, but usually at a fairly technical and advanced level. Today’s post is actually quite basic in nature and is in response to a user’s question about what to do with special values for variables that are intended to represent missing values. Data entry practice dating back to at least the 1970s has relied on ‘missing value codes’ for unknown data fields; favorite values have included a string of 9’s such as 9999 or -9999. There are a number of variations on this theme. For example, survey research firms have wanted to distinguish between different reasons for a missing value, using 9999 to represent values missing for no known reason, 9998 for ‘unknown’ and 9997 for ‘refused.’ Data input clerks have been known to fill in missing birthdays with values such as January 1, 1960.
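To see why such codes matter, here is a minimal Python sketch that recodes them to a true missing value while keeping the stated reason. It is a generic illustration, not the mechanism our software uses: the three codes follow the survey example above, and the variable and its values are invented.

```python
# Missing value codes taken from the survey example above.
MISSING_CODES = {
    9999: "no known reason",
    9998: "unknown",
    9997: "refused",
}


def recode(value, codes=MISSING_CODES):
    """Return (clean_value, reason); clean_value is None for a missing code."""
    if value in codes:
        return None, codes[value]
    return value, None


# Invented values for a hypothetical numeric variable.
incomes = [52000, 9999, 31000, 9997, 9998]

recoded = [recode(v) for v in incomes]
clean = [value for value, _ in recoded]
reasons = [reason for _, reason in recoded]

# Treating the codes as real numbers distorts even a simple average.
valid = [v for v in clean if v is not None]
naive_mean = sum(incomes) / len(incomes)
clean_mean = sum(valid) / len(valid)

print(clean)
print(reasons)
print(naive_mean, clean_mean)
```

Keeping the reason in a separate list preserves the distinction between ‘unknown’ and ‘refused’ even after the value itself has been set to missing.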