How to Control Salford Console Applications From an External Process

Written by  John Ries Thursday, December 15 2011

Warning: The following article is ‘geeky.’ It has to be, since it discusses programming techniques (but non–geeks are welcome to continue).

One of the questions our more technically adept users sometimes ask is how to run our applications from an external program written in a language such as Perl, Python, or Microsoft Visual Basic. While our standard GUI programs do allow this, it is much more conveniently done with the console (command–line or non–GUI) versions, which are standard on UNIX and Linux but are also available for MS Windows. The reasons for this are as follows:

  1. Windows GUI applications disable standard input and output, which are the most convenient means of communicating with an external program.
  2. Our GUI applications necessarily use more memory and are slower than their non–GUI counterparts (after all, they produce a lot of graphical displays and reports that our console applications do not), and they take much longer to start up.
  3. At the present time, our GUI applications do not allow the display to be suppressed, which makes it inconvenient to run them from Windows services.

Standard Input and Output

These terms refer to what used to be the most common means of interacting with a computer program: typing messages at a terminal (or a virtual terminal like the Windows Command Prompt), and receiving responses. In the days before CRT terminals or personal computers were widespread, the input messages were usually written on punch cards and the responses normally came back over the printer. More recently, users would send messages to the program from their keyboards and the results would return to their monitors. The stream of messages to the program is called ‘standard input,’ while the stream of messages from the program is called ‘standard output.’ By default, standard input comes from your keyboard, and standard output goes to a terminal display, but all modern operating systems allow standard input and output to be redirected in many different ways.
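As a small illustration of redirection, the sketch below uses the standard tr utility as a stand-in for a console program (it is not a Salford application), and the file names are arbitrary. It shows standard input taken from a file, standard input taken from another process through a pipe, and standard output discarded.

```shell
#!/bin/sh
# Stand-in console program: tr reads standard input and writes standard output.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# 1. Standard input from a file, standard output to a file.
printf 'use boston\nquit\n' > "$workdir/commands.txt"
tr 'a-z' 'A-Z' < "$workdir/commands.txt" > "$workdir/responses.txt"

# 2. Standard input from another process through a pipe,
#    with standard output captured in a shell variable.
piped=$(printf 'use boston\n' | tr 'a-z' 'A-Z')

# 3. Standard output discarded by sending it to the null device.
printf 'use boston\n' | tr 'a-z' 'A-Z' > /dev/null

cat "$workdir/responses.txt"
echo "$piped"
```

The same three forms apply to any program that reads standard input and writes standard output.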

For example, if you launch console CART interactively, you will get something like this:

CART ProEX version 6.2.0.162
Copyright, 1991–2006, Salford Systems, San Diego, California, USA
Launched on 9/8/2009 Licensed until 12/31/2010.
This launch supports up to 32768 variables.
256 MB RAM allocated at launch, partitioned as:
Real : 65109998 cells
Integer : 1114112 cells
Character: 3539016 cells
The license supports up to 9999999 MB of learn sample data.
StatTransfer enabled.
>

CART then waits for you to type commands. At this point, if you opened a dataset with the USE command, CART would report its success or failure to do so and, if successful, would list the variables in the dataset thus:

>use boston
Opening text file as dataset:
/test/datasets_csv/BOSTON.CSV
/test/datasets_csv/BOSTON.CSV uses, as delimiter.
VARIABLES IN RECT FILE ARE:
CRIM ZN INDUS CHAS NOX
RM AGE DIS RAD TAX
PT B LSTAT MV
/test/datasets_csv/BOSTON.CSV: 506 records.
>

From there you could perform the desired analyses by typing additional commands. When finished, you could exit the session with the QUIT command, at which point you would be returned to your friendly command prompt (or perhaps, depending upon how you launched CART, the window would simply close).

While this is not a very pleasant way for humans to interact with CART (the GUI is much better for interactive use), it does provide an easy way to write programs that interact with CART (or whichever Salford console application is desired). This is accomplished by re–directing standard input and output in such a way that the application is communicating with an external process, rather than with the user directly.
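A minimal sketch of the idea, assuming nothing about the application beyond the USE and QUIT commands shown above: a script supplies the commands through a here-document and captures whatever the application writes to standard output. The APP variable is hypothetical; it defaults to cat (which merely echoes the commands back) so that the sketch runs on any UNIX-like system, and it should be set to the name of the console executable actually installed.

```shell
#!/bin/sh
# APP is a placeholder for the console executable (an assumption, not a real name).
# It defaults to cat, which simply echoes the commands it is given.
APP=${APP:-cat}

# The here-document becomes the application's standard input;
# command substitution captures its standard output.
reply=$("$APP" <<'EOF'
use boston
quit
EOF
)

printf '%s\n' "$reply"
```

With a real console application in place of cat, the reply variable would hold the banner and the responses to each command rather than the commands themselves.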

Korn Shell Example

In the course of a data mining competition in which Salford recently took part, I wrote a script that ran a battery of TreeNet models, all specified in the same manner but each using a different division between the learning and test samples. I wrote the script in Korn Shell 93 (the most recent specification of the Korn Shell language, as published by AT&T; see http://www.kornshell.org) and ran it under CentOS 5.1 (CentOS is a Linux distribution built from the sources of Red Hat Enterprise Linux). Since the script uses ksh93–specific constructs, it will not run under versions of the Korn Shell such as pdksh, which are based upon older specifications. The script is as follows:

#!/bin/ksh
SPM=spm640205
ND=26
{
print submit fpath
print use \"../Data/cacmps5x3.sas7bdat\"
print submit class2
print submit keep66
print category mps CAC_GENDER DWELL_TYPE cred mail donor resp,
print LIFESTAGE_GRP2 LIFESTYLE_DIM1 LIFESTYLE_DIM2 SILHOUETTE,
print activeyr CNTY_SIZEN CAC_MARSTAT2 OCCUP_CD2 agecatd CRED_ANY,
print dimn DIM_ANY DONOR_ANY RESP_ANY kids Lifestage_Bin Lifestage_Grp2_Bin,
print Lifestyle_Dim1_Bin Lifestyle_Dim2_Bin, Silhouette_Bin binarydpv
print loptions timing pred gains roc
print mart trees=2000 nodes=6 learnrate=.01 fullreport=yes
print model binarydpv
print keep copy keep66
print cw unit

for ((i=1; i<=$ND; i++)); do
model=bt66tlun6d$i

print output $model
print grove $model
print save $model.csv /model
print note \"TreeNet model on BinaryDPV \(KEEP66\)\"
print note \"Logistic/CW UNIT\"
print note \"ERROR SEPVAR=D$i\"
print id respid d$i
print error sepvar=d$i
print mart go
done

}|$SPM >/dev/null

spm640205 is Salford Predictive Miner (SPM) 6.4.0.205, a developmental version of our data mining suite, which I used in this analysis. I had previously created a series of variables, D1 through D26, which specified 26 different divisions between the learning and test samples. The print statements in the code between ‘{’ and ‘}’ write commands to SPM, which in turn executes them. The base model is specified first, and then the 26 models are built and scored with a for loop. In this example, no attempt is made to read the responses to the commands. Instead, standard output is redirected to /dev/null, which is the standard null device on UNIX and Linux systems (NUL: is the standard null device under DOS/Windows), and the text (classic) output for each model is saved to individual files with the OUTPUT command. No QUIT command is required because SPM interprets the end of the input stream as QUIT. In my next post, I will demonstrate how to extract useful data from standard output and use them to create a report.
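For readers without ksh93, the same pattern can be sketched in portable POSIX shell. The arithmetic for loop is presumably the ksh93–specific construct that older shells reject, so the sketch below replaces it with a while loop and uses printf instead of print. The command file name basemodel is hypothetical, only a few of the per-model commands are shown, and SPM defaults to cat (an assumption made purely so the sketch can be run without the software), which displays the generated commands instead of executing them.

```shell
#!/bin/sh
# Portable sketch: a function writes the commands, a pipe delivers them.
# SPM defaults to cat so the command stream is simply displayed; set it to a
# real console executable (and add >/dev/null) to run the models.
SPM=${SPM:-cat}
ND=${ND:-3}

gen_commands() {
    printf 'submit basemodel\n'      # hypothetical file holding the base model
    i=1
    while [ "$i" -le "$ND" ]; do
        model=bt66tlun6d$i
        printf 'output %s\n' "$model"
        printf 'error sepvar=d%s\n' "$i"
        printf 'mart go\n'
        i=$((i + 1))
    done
}

# No QUIT is written: the end of the input stream ends the session.
gen_commands | "$SPM"
```

Keeping the command generation in a function makes it easy to inspect the stream before handing it to the application.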

Perl Example

When I wrote the above shell script, I took the ‘quick and dirty’ approach, rather than trying to write something generic. The ‘right way’ would be to write a parameterized script that could be used on a variety of datasets. I recently wrote such a script in Perl, as shown below:

#!/usr/bin/perl
use Getopt::Std;
#Set Command-line flags
getopts("cdgi:sx:",\%opt);
$WRTCMD=$opt{"c"};
$DEBUG=$opt{"d"};
$SAVEGRV=$opt{"g"};
$ID=$opt{"i"};
$SCORE=$opt{"s"};
if (defined $opt{"x"}) {$EXEC=$opt{"x"}}
else {$EXEC="treenet"}
#Set constants
$ARGC=@ARGV;
$NULL="/dev/null";
#Help message
if ($ARGC<4) {
print "Run battery of TN models, each on a different learn/test division\n";
print "Usage:\n";
print "$0 [-cdgs] [-i <idvars>] [-x <exec>] <cmd> <basename> <basefold> <nfolds>\n";
print "cmd: Name of the command file specifying the base model\n";
print "basename: Name to use as base for output, grove file names, and scores\n";
print "basefold: Base name of variables defining learn/test divisions\n";
print "nfolds: Number of learn/test divisions\n";
print "\n";
print "Options:\n";
print " -c: Write commands generated to standard output and then exit\n";
print " -d: Run in debug mode. Text output not suppressed\n";
print " -g: Save grove files (name=basename+index.GRV)\n";
print " -i <idvars>: Specify one or more ID variables\n";
print " -s: Save scores to CSV (name=basename+index.CSV)\n";
print " -x <exec>: Use TN executable named <exec>\n";
exit}
#Set command line parameters
$CMD=$ARGV[0];
$BASENAME=$ARGV[1];
$BASEFOLD=$ARGV[2];
$NFOLD=$ARGV[3];
#Run the battery
if ($WRTCMD) {$XPROC="cat"}
elsif ($DEBUG) {$XPROC=$EXEC}
else {$XPROC="$EXEC>$NULL"}
open(TN,"|-",$XPROC) or die "Failure to run TreeNet executable $EXEC\n";
print TN "submit '$CMD'\n";
for ($index=1; $index<=$NFOLD; $index++) {
$model=$BASENAME.$index;
$fold=$BASEFOLD.$index;
print TN "output $model\n";
if ($SAVEGRV) {print TN "grove $model\n"}
if ($SCORE) {
print TN "save '$model\.csv' /model\n";
print TN "idvar $ID $fold\n"}
print TN "error sepvar=$fold\n";
print TN "mart go\n"}
print TN "quit\n";
close(TN) or die "Failure to close TreeNet process\n";

In this example the base model specification is contained in an external file, the name of which is passed to the script as a parameter, together with a base model name, a base name for the learn/test indicator variables, and the number of learn/test divisions. There are also some options which may be set, such as the name of the TreeNet executable and any ID variables. For our purposes, however, the most important section of the script is the one labeled “Run the battery.” Here we open the TreeNet process as if it were a file and then write the appropriate commands to it. We save the classic output for each model in a separate file; optionally we save the groves and scores as well. When processing is finished, we write the QUIT command to TreeNet and close the process. As in the previous example, no attempt is made to parse the responses written to standard output; by default they are sent to the null device. When generating the commands without executing them for debugging purposes (the -c flag), the script sends them to the UNIX cat utility, which will not work under Windows unless an appropriate set of UNIX utilities is installed.
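The choice of where the commands go can be sketched in shell form as well. The function below mirrors the three cases in the “Run the battery” section; the variable names follow the Perl script, while the function name run_sink and the command file name base.cmd are invented for illustration. WRTCMD is set here so that the sketch only displays the commands and never tries to launch an executable.

```shell
#!/bin/sh
# Sketch of the sink selection: the commands arrive on standard input and are
# routed to cat, to the executable, or to the executable with output discarded.
WRTCMD=1            # like -c; set to empty to actually run EXEC
DEBUG=              # like -d; non-empty keeps the text output visible
EXEC=${EXEC:-treenet}

run_sink() {
    if [ -n "$WRTCMD" ]; then
        cat                      # show the generated commands and stop there
    elif [ -n "$DEBUG" ]; then
        "$EXEC"                  # run, leaving standard output on the terminal
    else
        "$EXEC" > /dev/null      # run, discarding standard output
    fi
}

# base.cmd is a hypothetical command file name.
printf "submit '%s'\n" "base.cmd" | run_sink
```

Because the generating code never needs to know which sink was chosen, the same command stream can be inspected first and executed afterwards.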

I have found that the easiest way to pass model specifications to a generic script is in the form of a command file. The script executes the command file with a SUBMIT command, followed by the specific commands required to run the battery. For example, the commands specifying the base model used in the Korn Shell example would look like this:

submit fpath
use "../Data/cacmps5x3.sas7bdat"
submit class2
submit keep66
category mps CAC_GENDER DWELL_TYPE cred mail donor resp,
LIFESTAGE_GRP2 LIFESTYLE_DIM1 LIFESTYLE_DIM2 SILHOUETTE,
activeyr CNTY_SIZEN CAC_MARSTAT2 OCCUP_CD2 agecatd CRED_ANY,
dimn DIM_ANY DONOR_ANY RESP_ANY kids Lifestage_Bin Lifestage_Grp2_Bin,
Lifestyle_Dim1_Bin Lifestyle_Dim2_Bin, Silhouette_Bin binarydpv
loptions timing pred gains roc
mart trees=2000 nodes=6 learnrate=.01 fullreport=yes
model binarydpv
keep copy keep66
cw unit

Note that the MART GO command, which would actually build the model, is absent. The purpose of the file is not to build a model, but merely to specify one. The script itself will generate the commands necessary to build the appropriate models.
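A driver script could guard against a command file that accidentally builds a model by itself. The following is only a sketch of such a check: the function name specifies_only and the sample file contents are made up for illustration, and the match is case-insensitive on the assumption (suggested by the mixed upper- and lower-case usage in this article) that commands may be typed in either case.

```shell
#!/bin/sh
# Succeeds (status 0) when the command file contains no MART GO line,
# i.e. when it only specifies a model and does not build one.
specifies_only() {
    ! grep -Eiq '^[[:space:]]*mart[[:space:]]+go[[:space:]]*$' "$1"
}

# Build a small sample command file; data.csv is a hypothetical dataset name.
cmdfile=$(mktemp)
trap 'rm -f "$cmdfile"' EXIT
printf 'use "data.csv"\nmodel binarydpv\nmart trees=2000 nodes=6\n' > "$cmdfile"

if specifies_only "$cmdfile"; then
    echo "ok: base model only"
else
    echo "warning: command file already contains MART GO"
fi
```

Note that the MART line setting trees and nodes is not flagged; only a line consisting of MART GO is treated as a build request.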