The last thing most data scientists want is a machine that replaces them! The idea that we can build a machine to conduct sophisticated analyses from start to finish has been around for some time now, and new attempts surface every few years. The fully automated data scientist is going to be attractive to some organizations with no analytics experience whatsoever, but for more sophisticated organizations the promise of such automation is bound to be met with skepticism and worry. Can you imagine visiting a machine-learning-driven medical service, accepting a diagnosis and prescription, and even undergoing surgery with no human oversight involved?
Dan Steinberg's Blog
If you open a saved grove for any Salford Systems data mining engine (CART, MARS, TreeNet, RandomForests) you will notice a "Commands" button among a row of controls along the bottom of the display. The Commands button will open a plain text window displaying all the commands entered in your session up until the run that generated the grove.
The Salford Predictive Modeler® software suite's component data mining engines CART®, MARS®, TreeNet®, and RandomForests® contain a variety of tools to help modelers work quickly and efficiently. One of the most effective tools for rapid model development is found in the BATTERY tab of the MODEL Set Up dialog. Because there are so many tools embedded in that dialog, we are going to start a series of posts going through the principal BATTERY choices, one at a time.
The theory behind the CART decision tree, as laid out in detail in the classic monograph Classification and Regression Trees by Breiman, Friedman, Olshen, and Stone, dictates that CART trees always be grown to their largest possible size before pruning. This means that the smallest allowed terminal node will have only one record in it! In theory, this is not a problem because large CART trees are just the raw material that the pruning engine starts with to arrive at the optimal tree. Thus, it is likely that all the small nodes will be pruned off anyway.
In practice, however, real world data sets do not always work that way. In fact, in small- to moderate-sized data sets it can pay to manipulate the controls that govern the sample sizes allowed in CART nodes. In this note we discuss this topic and provide some examples and instruction on how to use the controls.
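To give a concrete flavor of what such node-size controls do, the sketch below uses scikit-learn's decision tree as a generic stand-in (this is not SPM syntax, and the data set is just a stock example): one tree is grown to maximal size, the other with minimum node sizes raised.

```python
# Illustrative sketch, not SPM itself: scikit-learn exposes analogous
# minimum-node-size controls on its CART-style decision tree.
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Grow a maximal tree: any node with 2+ records may split, leaves may hold 1.
maximal = DecisionTreeClassifier(
    min_samples_split=2, min_samples_leaf=1, random_state=0
).fit(X, y)

# Restrict node sizes: no splitting of nodes under 20 records,
# and no leaf smaller than 10 records.
restricted = DecisionTreeClassifier(
    min_samples_split=20, min_samples_leaf=10, random_state=0
).fit(X, y)

print(maximal.get_n_leaves(), restricted.get_n_leaves())
```

Raising the minimums simply yields a smaller tree before any pruning takes place, which is the effect the node-size controls in CART have as well.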
The gradient boosting machine has recently become one of the most popular learning machines in widespread use by data scientists at all levels of expertise. Much of the rise in popularity has been driven by the consistently good results Kaggle competitors have reported over several years of competition. Many users of gradient boosting machines remain a bit hazy regarding the specifics of how such machines are actually constructed and where the core ideas behind them come from. Here we want to discuss some details of the shape and size of the trees in gradient boosting machines.
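The tree-size point can be made concrete with scikit-learn's open-source gradient boosting implementation (used here as an analogous machine, not Salford's TreeNet): the model is an additive series of deliberately small trees, and a single size control applies to every tree in the sequence.

```python
# Illustrative sketch using scikit-learn's gradient boosting machine.
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_friedman1(n_samples=500, random_state=0)

# 200 boosted trees, each limited to depth 2 (at most 4 leaves).
gbm = GradientBoostingRegressor(
    n_estimators=200, max_depth=2, learning_rate=0.1, random_state=0
).fit(X, y)

# Inspect the individual trees: every one respects the size limit.
depths = [stage[0].get_depth() for stage in gbm.estimators_]
print(len(depths), max(depths))
```

The striking design choice, relative to a single CART tree, is that no individual tree is meant to be a good model on its own; accuracy comes from the sum of many small trees.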
Our tech support department receives a steady stream of interesting questions regarding how to use our products, ranging from questions about specific features to how to accomplish a given task. We also receive questions about data mining (and predictive analytics generally), modeling strategy, and a variety of other topics. One type of query that comes up periodically is what to do with missing values. We have spoken before about missing values in a variety of contexts, but usually at a fairly technical and advanced level.
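CART itself handles missing values internally via surrogate splits, as described in the monograph. For readers preparing data outside SPM, one simple and widely used generic tactic, sketched below on an invented toy column, is to record missingness in an indicator variable and then impute the original field.

```python
# Generic missing-value sketch (toy data invented for illustration).
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [52_000, np.nan, 61_000, np.nan, 48_000]})

# Keep the fact of missingness as its own predictor ...
df["income_missing"] = df["income"].isna().astype(int)

# ... then fill the gaps, here with the column median.
df["income"] = df["income"].fillna(df["income"].median())

print(df)
```

The indicator preserves any signal carried by the missingness itself, which a plain imputation would otherwise silently discard.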
Most users of Salford Systems' data mining tools (CART®, MARS®, TreeNet®, RandomForests® or the more recent integrated SPM™ package) rely on the GUI (Graphical User Interface) to do their work. The GUI makes life easy as you do not need to remember any command syntax, and of course the GUI has many useful visual displays of important results. But there are some good reasons to learn how to work with command scripts, which is the topic of the current posting. We will refer to our software as SPM (Salford Predictive Modeler), which includes all of our individual data mining engines.
It is useful to remember that almost everything you do during a GUI session using SPM has a "command equivalent." That means that you could accomplish the identical model and results simply by submitting a set of commands to SPM instead of pointing and clicking. Even more useful to remember is that SPM automatically creates the equivalent set of commands for you as you work, saving the results to a text file. We will return to how to locate that text file a bit later.
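To give a flavor of what such a recorded command file looks like, here is a short hypothetical script. The data set and variable names are invented for illustration, and the exact syntax of any given command should be checked against the SPM command reference.

```
REM Hypothetical SPM command script (names invented for illustration)
USE "mydata.csv"
MODEL RESPONSE
CATEGORY RESPONSE
KEEP AGE, INCOME, REGION
CART GO
```

Submitting a script like this reproduces the same model you would get by making the equivalent selections in the GUI, which is what makes recorded sessions so useful for documenting and rerunning analyses.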
In their 1984 monograph, Classification and Regression Trees, Breiman, Friedman, Olshen and Stone discussed at length the need to obtain "honest" estimates of the predictive accuracy of a tree-based model. At the time the monograph was written, many data sets were small, so the authors took great pains to work out an effective way to use cross-validation with CART trees. The result was a major advance for data mining, introducing ideas that at the time were radically new. The main point of the discussion was that the only way to avoid overfitting was to rely on test data. With plentiful data we can always reserve a portion for testing, but with scarcer data we might have to rely on cross-validation. In either case, however, only the test or cross-validated results should be trusted. In contrast, earlier approaches tended to focus only on the training data performance results and ignore how the model fared on unseen data.
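The gap between training performance and an honest estimate is easy to demonstrate. The sketch below uses scikit-learn as a stand-in, with a stock data set, to compare a fully grown tree's resubstitution accuracy against its 10-fold cross-validated accuracy.

```python
# Illustrative sketch: training accuracy vs. an "honest" cross-validated estimate.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
tree = DecisionTreeClassifier(random_state=0)

# Resubstitution accuracy of a fully grown tree is essentially perfect ...
train_acc = tree.fit(X, y).score(X, y)

# ... but 10-fold cross-validation gives the honest estimate.
cv_acc = cross_val_score(tree, X, y, cv=10).mean()

print(f"training accuracy: {train_acc:.3f}, cross-validated: {cv_acc:.3f}")
```

The fully grown tree scores nearly perfectly on the data it was fit to, while cross-validation reveals a noticeably lower, and far more trustworthy, figure.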