Getting TreeNet Results More Rapidly
Interested in getting TreeNet results more rapidly? What would you say to a 50% time savings? Depending on how your defaults are set right now, following the tips in this note could cut your normal run times by more than 50%. There is a cost, as you might expect, but it is often small relative to the benefits. To explain how the tips work, we first have to discuss a few light technical matters.
Generating TreeNet models generally involves three separate processes:
Reading and Preparing the Data
Building the Sequence of Trees
Generating Graphical Displays
There are a variety of ways to speed up each of the processes; during the course of a project, speeding up each of them could significantly reduce the amount of time you spend getting to your preferred model.
Reading and Preparing the Data
In this phase all the Salford data mining and predictive modeling engines go through the same steps, which involve, among other matters, deleting records as required by any BASIC programming statements, selecting or ignoring records according to any SELECT statements, and mapping text variables to integers (behind the scenes). If you are analyzing a rather small fraction of your data, you will be better off extracting just the records you want to model and saving them to a separate table or file. In the Salford engines you can do this using a set of commands such as the following:
REM lines beginning with REM are comments
REM Basic programming statements here (optional)
REM EXAMPLE CODE; NOTE the % sign at the start of the BASIC code
REM %if variable1> c1 or variable2
After this step, instead of analyzing "myfile," you would just work with "newfile," saving processing time for every run.
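The same extract-once idea can be sketched outside the Salford tools. The snippet below is a minimal Python illustration, not Salford command syntax: the function name is ours, and it assumes a comma-separated file with a header row.

```python
import csv

def extract_subset(src_path, dst_path, keep):
    """Copy the rows for which keep(row) is true from src_path to dst_path.

    Each row is passed to keep() as a dict keyed by column name.
    Returns the number of data rows written (the header is not counted).
    """
    written = 0
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if keep(row):
                writer.writerow(row)
                written += 1
    return written
```

With hypothetical file and column names, a single call such as `extract_subset("myfile.csv", "newfile.csv", lambda row: float(row["variable1"]) > 3.0)` pays the filtering cost once, so that later runs read only the smaller file.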
Building the Trees
The only effective option here to speed up model development is to build fewer trees. This comes at a cost, however, of not allowing the TreeNet learning process to work itself out fully. Still, for rapid exploration of alternative modeling parameter settings, this can be of some use. We would urge caution with this trick.
How many trees should you allow for? The TreeNet default of 200 trees is what we normally consider the minimal number to allow TreeNet to gain traction with the data. We often develop models for our clients using 1,000 to 3,000 trees. In data mining competitions, where every fraction of a point of accuracy matters, we may go to 20,000 trees. You will have to experiment with your data to get a feel for how many trees will be sufficient to get you to the level of performance you require for your current needs. In exploratory work, you could easily live with many fewer trees.
One of our better-known clients regularly uses models with fewer than 100 trees because their models must be deployed in real time and hence must be small. In other words, you have to make a decision based on your own needs and circumstances.
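One way to organize that experimentation is to record a validation score at several tree counts and then pick the smallest count that comes within a chosen tolerance of the best score. The sketch below assumes you have collected such scores yourself; the function name and the tolerance parameter are hypothetical and are not part of TreeNet.

```python
def smallest_sufficient_trees(scores, tolerance):
    """Return the smallest tree count whose score is within `tolerance`
    of the best score observed.

    scores: dict mapping number of trees -> validation score, where
            higher is better (for example, area under the ROC curve).
    tolerance: how far below the best score you are willing to go.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    best = max(scores.values())
    # Walk the tree counts from smallest to largest and stop at the
    # first one that is good enough.
    for n_trees in sorted(scores):
        if scores[n_trees] >= best - tolerance:
            return n_trees
```

A tolerance of zero simply returns the tree count with the best score; a looser tolerance trades a little accuracy for a smaller, faster model.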
Generating Graphical Displays
The TreeNet graphical displays are essential to understanding the workings of a model. The graphs reveal not only the nature of the relationship between the target and a predictor, but also potential oddities in the data and possible shortcomings of the model. However, when you are building a collection of relatively similar models, you rarely need to review the dependency plots for the key variables every time, and there may be no need to review more than a handful of graphs for the most important variables. This leads us to the possibility of switching off all or part of the automatic graph generation that occurs after every TreeNet model is built. Surprisingly, graph generation can often take as much time to complete as the original model building, so switching it off can save you as much as 50% of your model development time.
To get to the TreeNet options controls, first click on the options icon on your toolbar. In the screen shot below the red arrow points to what you are looking for:
You can also reach the options controls via the EDIT menu item and selecting Options. The dialog you want will look something like the one shown below:
The screen shot above is taken from the Salford Predictive Modeling Suite, which combines TreeNet with other analytical engines. If you are running a standalone TreeNet, you will not see tabs for CART, MARS, or RandomForests; instead, you will see the GENERAL, TREENET, and DIRECTORIES tabs.
In the "Plot Creation" section of this dialog you have the option of checking or unchecking four items; here we are concerned with just the first two. Unchecking ALL the options will turn off all post-model graphs displaying relationships between the target and the predictors. This will save the most time and is recommended when you are evaluating runs strictly on the basis of some summary statistic like R-Squared for regression or Area Under the ROC curve for classification.
Another, less radical shortcut is to turn off just the "Bivariate Dependence" plots; these are 3D plots of the target against every pair of important predictors. With ten important predictors there would be as many as 45 such 3D plots (one for each distinct pair), and each takes time to produce. You can also cut back the number of plots that will be produced. In the display above we have entered the value "10" for the number of important predictor variables to plot. This means that plots will be produced for the ten highest-ranked predictors. You can cut this number down to, say, five if you need to examine the plots but can live with an examination of just the topmost predictors.
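The number of pairwise plots grows quickly with the number of predictors plotted, which is why trimming that number helps. Here is a quick check of the arithmetic, assuming one 3D plot per unordered pair of distinct predictors (the function name is ours, for illustration only):

```python
from math import comb

def pairwise_plot_count(n_predictors):
    """Number of unordered pairs of distinct predictors: n * (n - 1) / 2."""
    return comb(n_predictors, 2)

# Compare the plot count for five versus ten plotted predictors.
for n in (5, 10):
    print(n, "predictors ->", pairwise_plot_count(n), "pairwise plots")
```

Under that assumption, ten predictors give 45 pairs and five give only 10, so halving the number of predictors cuts the pairwise plotting work by more than three quarters.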
If you turn off the graphs in TreeNet, what are your options if you decide you do want to browse the available graphical displays? You have two options. First, you can always rerun the model with the graphs turned on: just set the check marks on the TreeNet tab of the Options dialog. You can do this at any time in the future, and it will be especially easy if you have saved the commands that generated the model in question. Your other option is available immediately after running a TreeNet model. The TreeNet RESULTS display is your gateway to all the available reports for the model and will look something like this:
The "Create Plots" button will allow you to generate plots of the target variable against any single predictor used in the model, whether that variable is important or not. It will also let you produce 3D plots of the target against any pair of predictors used in the model. The "Create Plots" button remains active only while you are still connected to your training data and is only guaranteed to be active immediately after building your model.
The great thing about this option is that you need to use it only once you have decided that your model is sufficiently interesting to merit further exploration. Otherwise, with plot creation turned off, you save quite a bit of compute time.
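How much time you save depends on how graph generation compares with tree building in your own runs. As a rough sketch, the fraction of a run saved by skipping the graphs is the graph time divided by the total run time; the function and parameter names below are hypothetical, and the timings you pass in would have to come from your own runs.

```python
def fraction_saved(build_seconds, graph_seconds):
    """Fraction of total run time saved by skipping graph generation."""
    total = build_seconds + graph_seconds
    if total <= 0:
        raise ValueError("total run time must be positive")
    return graph_seconds / total
```

When graph generation takes as long as model building, `fraction_saved(60, 60)` returns 0.5, which is the 50% saving mentioned above; if the graphs take a third as long as the trees, the saving drops to 25%.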