Autodiscovery leverages the stability advantages of multiple trees to rank variables by importance and thus select a subset of predictors for modeling. In SPM® v8.2 and earlier, Autodiscovery runs a very simple training-data-only TreeNet model grown out to 200 trees. The variable importance ranking generated by this model is then used to reduce the list of all available predictors to the top performers in this background model. Autodiscovery is fast and easy, as there are no control parameters to set, but it is just a mechanism for quickly testing whether a substantial reduction in the number of predictors can improve model performance.
The short answer is YES, such plots can be generated. Historically, we concluded that such graphs would normally not be very interesting: they would frequently be single step functions, reflecting the fact that an individual variable often appears only once or twice in a tree. Such graphs also would not properly reflect the effect of a variable across most of its range of values. Thus, as of SPM 7.0, CART does not offer these plots. However, you can see what they would look like by using TreeNet to grow a one-tree model. To do this, just set up a normal model, choose the TreeNet analysis method, and set the number of trees to be grown to 1.
CART is capable of determining the number of records in your data sets, and uses this information to predict the memory and workspace requirements for trees that you build. Also, CART will read your entire data set each time a tree is built. At times these actions may be problematic, especially if you have enormous data sets.
SPM 6.6 (TreeNet TN 6.4) or greater supports data access to Microsoft SQL Server, Oracle, MySQL, and other RDBMSs via the ODBC interface.
Since SQL queries cannot be entered via the standard Windows ODBC data source selection dialog, you must use the command line to open data directly from SQL Server.
CART supports three "improvement penalties." The "natural" improvement for a splitter is always computed according to the CART methodology. A penalty may be imposed, however, that lessens the improvement, affecting the penalized splitter's relative ranking among competing splits. If the penalty is large enough to cause the top splitter to be displaced by a competitor, the tree changes.
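The effect of such a penalty on the competitor ranking can be sketched as follows. This is an illustrative model only, not SPM's exact formula: it assumes a multiplicative penalty in [0, 1] applied to each splitter's natural improvement, with hypothetical variable names and values.

```python
# Illustrative sketch (not SPM's exact formula): each candidate splitter has a
# "natural" CART improvement; a multiplicative penalty in [0, 1] shrinks it,
# which can reorder the competitor list and change the chosen splitter.
def rank_splitters(candidates, penalties):
    """candidates: {name: natural improvement}; penalties: {name: penalty in [0, 1]}."""
    penalized = {
        name: imp * (1.0 - penalties.get(name, 0.0))
        for name, imp in candidates.items()
    }
    return sorted(penalized.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical improvements: AGE would win unpenalized, but a 10% penalty
# drops its score to 0.42 * 0.9 = 0.378, so INCOME takes the split.
ranked = rank_splitters(
    {"AGE": 0.42, "INCOME": 0.40, "REGION": 0.35},
    {"AGE": 0.10},
)
print(ranked[0][0])
```

Note that only the ranking at each node changes; the natural improvements themselves are always computed and reported unpenalized.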
The SPM® software suite must be downloaded with Administrator rights, and read/write and modify permissions MUST be applied to the /bin directory BEFORE proceeding. If you need help with SPM installation (Administrator rights and ensuring proper permissions), please contact us or email Support (at) salford-systems (dot) com.
Once the above instructions have been completed, you can request your Unlock Key: contact us or e-mail the following information to Unlock (at) salford-systems (dot) com.
A user's license sets a limit on the amount of learn sample data that can be analyzed. The learn sample is the data used to build the model; there is no limit on the amount of test sample data. The limit is measured as rows by columns: the observations times the variables actually used to build the model. For example, 10,000 learn records modeled with 20 predictors count as 200,000 data values against the limit. Variables not used in the model do not count, and neither do observations reserved for testing or excluded for other reasons.
SPM for Windows has long had the ability to read tables in relational databases through the ODBC interface. This capability was also recently added to the command-line version on Windows and is planned for UNIX platforms (including Mac OS X). The purpose of this article is to describe how to access MySQL databases specifically, but the same principles apply to accessing data stored in other relational database systems; likely the only difference will be the driver used.
At Salford Systems we believe that expert and timely technical support is a critical part of our business and therefore have organized our technical team to provide our customers with some of the finest support in the industry. Here is how we help you get the best from your data mining software:
You wish to apply your results to new data, but CASE will not accept the data.
CART searches over all possible subsets of a categorical predictor only up to a limited number of levels. Beyond a threshold set by computational feasibility, CART will simply reject the problem. You can control this limit with the BOPTION NCLASSES = m command, but be aware that computation times increase dramatically for m larger than 15.
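The reason computation times explode is combinatorial: a categorical predictor with k levels admits 2^(k-1) - 1 distinct binary splits (each level goes left or right; dividing by two for symmetry and discarding the empty split). A quick sketch of the count:

```python
# Number of distinct binary splits of a categorical predictor with k levels.
# Each level is assigned to the left or right child (2**k assignments);
# halving for left/right symmetry and dropping the empty split leaves
# 2**(k - 1) - 1 candidate splits for CART to evaluate.
def n_binary_splits(k: int) -> int:
    if k < 2:
        return 0
    return 2 ** (k - 1) - 1

for k in (4, 8, 16, 32):
    print(k, n_binary_splits(k))
```

At 16 levels there are already 32,767 candidate splits per node, and at 32 levels over two billion, which is why the default threshold is kept modest.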
CART is an acronym for Classification and Regression Trees, a decision-tree procedure introduced in 1984 by world-renowned UC Berkeley and Stanford statisticians, Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone. Their landmark work created the modern field of sophisticated, mathematically- and theoretically-founded decision trees. The CART methodology solves a number of performance, accuracy, and operational problems that still plague many other current decision-tree methods. CART's innovations include:
Cross-validation is a method for estimating what the error rate of a sub-tree (of the maximal tree) would be if you had test data. Regardless of what value you set for V-fold cross validation, CART grows the same maximal tree. The monograph provides evidence that using a V of 10-20 gives better results than using a smaller number, but each number could result in a slightly different error estimate. The optimal tree — which is derived from the maximal tree by pruning — could differ from one V to another because each cross-validation run will come up with slightly different estimates of the error rates of sub-trees and thus might differ in which tree was actually best.
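The mechanics described above can be sketched in a few lines. This is a generic illustration of V-fold error estimation, not SPM's implementation; the `fit` and `error` callables stand in for growing/pruning a tree and scoring it on held-out data.

```python
# Minimal sketch of V-fold cross-validation error estimation: each record is
# held out exactly once, a model is built on the remaining V-1 folds, and the
# V held-out error rates are averaged into a single estimate.
def v_fold_error(records, v, fit, error):
    folds = [records[i::v] for i in range(v)]
    rates = []
    for i in range(v):
        test = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        model = fit(train)            # stand-in for growing/pruning a tree
        rates.append(error(model, test))
    return sum(rates) / v

# Toy usage: the "model" is just the mean of the training values, and the
# error is the mean absolute deviation on the held-out fold.
est = v_fold_error(
    list(range(10)), v=5,
    fit=lambda train: sum(train) / len(train),
    error=lambda m, test: sum(abs(x - m) for x in test) / len(test),
)
print(est)
```

The point of the sketch is that the final number is an average of V different held-out estimates, which is why different values of V (or different fold assignments) can yield slightly different error estimates and hence slightly different optimal trees.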
CART and MARS continue to read data stored in the legacy SYSTAT format, a binary (i.e., not human-readable) format widely used by statisticians and researchers using the SYSTAT statistical programs. Relative to comma-separated-text and some other binary formats, the legacy SYSTAT format is quite restrictive (limited variable name lengths, limited lengths of character data). We do not recommend that you use it. However, for our clients that do need to work with this format, we provide the following C and Fortran programs that illustrate how legacy SYSTAT datasets are structured. Originally, legacy SYSTAT format was written and read with Fortran code. Thus, because the format must accommodate the record segmentation and padding typical of Fortran I/O, the C version handles these issues explicitly.
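The Fortran record framing the article refers to can be illustrated briefly. This is a generic sketch, not the SYSTAT-specific payload layout (for that, see the C and Fortran programs the article provides): most Fortran compilers bracket each unformatted sequential record with a 4-byte byte count before and after the payload, and little-endian markers are assumed here.

```python
import io
import struct

# Read one Fortran unformatted sequential record: a 4-byte length marker,
# the payload, then the same 4-byte length marker repeated. The trailing
# marker lets Fortran skip backwards through a file; it also serves as a
# cheap integrity check when reading from another language.
def read_fortran_record(f):
    head = f.read(4)
    if not head:
        return None  # end of file
    (nbytes,) = struct.unpack("<i", head)      # assumes little-endian markers
    payload = f.read(nbytes)
    (tail,) = struct.unpack("<i", f.read(4))
    if tail != nbytes:
        raise ValueError("corrupt record: length markers disagree")
    return payload

# Toy record: 5-byte payload framed by matching length markers.
buf = io.BytesIO(struct.pack("<i", 5) + b"hello" + struct.pack("<i", 5))
print(read_fortran_record(buf))  # b'hello'
```

A C reader of legacy SYSTAT files must strip these markers explicitly, which is exactly the "record segmentation and padding" issue the article mentions.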
CART automatically produces a predictor ranking (also known as variable importance) based on the contribution predictors make to the construction of the tree. Predictor rankings are strictly relative to a specific tree; change the tree and you might get very different rankings. Importance is determined by the role a variable plays in the tree, either as a main splitter or as a surrogate. CART users have the option of fine-tuning the variable importance algorithm.
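The scoring idea can be sketched as follows. This is an illustrative simplification, not SPM's exact algorithm (which offers fine-tuning options, e.g. discounting surrogate credit): each variable accumulates the improvement at every node where it acts as the primary splitter or a surrogate, and scores are scaled so the top variable gets 100. The node records and variable names are hypothetical.

```python
# Sketch of CART-style variable importance: sum each variable's improvement
# over all nodes where it serves as primary splitter or surrogate, then
# rescale so the most important variable scores 100.
def variable_importance(nodes):
    """nodes: list of dicts like
    {"primary": ("AGE", 0.40), "surrogates": [("INCOME", 0.25)]}."""
    raw = {}
    for node in nodes:
        for var, imp in [node["primary"]] + node.get("surrogates", []):
            raw[var] = raw.get(var, 0.0) + imp
    top = max(raw.values())
    return {var: round(100 * v / top, 1) for var, v in raw.items()}

scores = variable_importance([
    {"primary": ("AGE", 0.40), "surrogates": [("INCOME", 0.25)]},
    {"primary": ("INCOME", 0.30), "surrogates": [("REGION", 0.10)]},
])
print(scores)
```

Note how INCOME outranks AGE here despite never winning the first split: surrogate credit is what lets a variable masked by a correlated competitor still surface in the ranking.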
Salford Systems' CART is the only decision tree based on the original code of Breiman, Friedman, Olshen, and Stone. Because the code is proprietary, CART is the only true implementation of this classification-and-regression-tree methodology. In addition, the procedure has been substantially enhanced with new features and capabilities in exclusive collaboration with CART's creators. While some other decision-tree products claim to implement selected features of this technology, they are unable to reproduce genuine CART trees and lack key performance and accuracy components. Further, CART's creators continue to collaborate with Salford Systems to refine CART and to develop the next generation of data-mining tools.
If a variable does not enter the tree as a primary node splitter, it may still play an important role in the tree as a surrogate splitter. If you have turned the display of surrogate splitters off, you will not see how these variables affect the tree, but they will still be used internally by CART when applying the tree to data. The Variable Importance Table produced by CART ranks the variables in the tree by their importance, a statistic measuring how strongly a variable acts as a primary or surrogate splitter.