A Reminder About Missing Values
Our tech support department receives a steady stream of interesting questions about how to use our products, whether about specific features or about how to accomplish a given task. We also receive questions about data mining (and predictive analytics generally), modeling strategy, and a variety of other topics. One type of query that comes up periodically is what to do with missing values. We have spoken before about missing values in a variety of contexts, but usually at a fairly technical and advanced level. Today's post is quite basic and responds to a user's question about what to do with special values that are intended to represent missing values. Data entry practice going back to at least the 1970s has used 'missing value codes' for unknown data fields; favorite values have included a string of 9s such as 9999 or -9999. There are a number of variations on this theme. For example, survey research firms have wanted to distinguish between different reasons for a missing value, using 9999 to represent values missing for no known reason, 9998 for 'unknown,' and 9997 for 'refused.' Data input clerks have been known to fill in missing birthdays with values such as January 1, 1960.
This brings us to two important points. First, if you are working with data that was created and managed by someone else, you may be vulnerable to undocumented missing value codes. Some of the values in your data may not be genuine data values but artificial constructions, and it will be up to you to discover them. Usually, variable-by-variable descriptive statistics will help you locate candidates for undocumented missing value codes. Frequency tables are a big help here, as missing value codes often show a much higher frequency than would otherwise be expected (favorite fill-ins for missing birth dates were first discovered this way). Within Salford software you can always request a DATAINFO from the View menu item to obtain detailed summary stats and frequency tables.
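If you want to run the same kind of frequency check outside Salford software, the idea can be sketched in a few lines of Python. This is only an illustration: the function name `flag_frequent_values`, the 5% threshold, and the sample birth years are assumptions made up for this example, and a sensible threshold will depend on the variable.

```python
from collections import Counter

def flag_frequent_values(values, min_share=0.05):
    """Return (value, count, share) for every value that makes up at least
    min_share of the records, most frequent first."""
    counts = Counter(values)
    total = len(values)
    return [
        (value, count, count / total)
        for value, count in counts.most_common()
        if count / total >= min_share
    ]

# Hypothetical column: one record per birth year from 1940 to 1999,
# plus 15 records where 1960 was typed in as an undocumented fill-in.
birth_years = list(range(1940, 2000)) + [1960] * 15

for value, count, share in flag_frequent_values(birth_years):
    print(f"{value}: {count} records ({share:.1%})")
```

In this made-up column only 1960 is flagged, because every other year appears once. A flagged value is a candidate, not a verdict; you still need to decide whether it is a genuine spike or a missing value code.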
The second point is that, in general, you do not want to feed such data directly into your predictive modeling mechanism. Feeding data with unmodified -9999 values to an ordinary regression routine could generate disastrously bad models. More capable modeling methods such as CART, MARS, or TreeNet are less vulnerable to distortion but are still disadvantaged by being fed dirty data. Given that there is a very simple and straightforward way to deal with this type of data coding, we just want to remind you to take the time and effort to find this type of missing value, and then take the necessary next step: recode the missing value codes to explicitly missing values.
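To make the regression warning concrete, here is a small Python sketch with invented numbers. The helper `ols_slope` and the data are assumptions for illustration only. The coded records are simply set aside in the second fit to show the contrast; this is not a recommendation to delete records instead of recoding them.

```python
def ols_slope(xs, ys):
    """Slope b of the simple least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical data: y is exactly 2*x, but two x values were entered as -9999.
x_true = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2 * x for x in x_true]
x_coded = [1, 2, -9999, 4, 5, -9999, 7, 8]

# Fit on the raw data, missing value codes included.
slope_raw = ols_slope(x_coded, y)

# Fit again using only the records whose x is not the missing value code.
kept = [(x, yy) for x, yy in zip(x_coded, y) if x != -9999]
slope_clean = ols_slope([x for x, _ in kept], [yy for _, yy in kept])

print(f"slope with -9999 left in: {slope_raw:.6f}")
print(f"slope with coded records set aside: {slope_clean:.6f}")
```

With the codes left in, the two -9999 records dominate the fit and the estimated slope collapses to nearly zero, even though the true slope in this toy data is 2.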
In SPM you accomplish this from the command line using the built-in BASIC programming language. All versions of SPM, and versions of CART, MARS, TreeNet, and RandomForests, include this capability. In the GUI, the most convenient way to explicitly code missing values is to go to File > New > Notepad, which will open a plain text window. There you will need to type commands like:
%if X=999 then let X=.
Here the initial % is needed to let SPM know this is a BASIC programming statement. Then the simple 'IF THEN LET' structure does the recoding. You might use a more complex structure like:
%if X=999 then let X=.
%else if X=998 then let X=.
%else if X=997 then let X=.
Or you could write:
%if X=999 or X=998 or X=997 then let X=.
The important point is that you need to replace the missing value codes, which look like genuine data values, with a missing value code that all Salford data mining engines recognize as missing. If it is important to keep track of the reason a given variable is missing, you might want to create reason indicators: 0/1 variables coded with a value of 1 when the variable is missing for a specific reason. For example,
%let AGE_REFUSED = (age =997)
Here a respondent's age is unknown because he or she refused to divulge it in an interview. Create the indicator before recoding age to missing, since the 997 code is no longer there to test afterward. The indicator may then be used as a predictor in any model you attempt to construct.
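For readers who prepare data outside SPM, the same two steps (build reason indicators, then recode to an explicit missing value) might look like the Python sketch below. Everything here is an assumption for illustration: `None` stands in for an explicit missing value, the function `recode_age` is hypothetical, and the meanings attached to 998 and 999 follow the survey pattern described earlier rather than any particular data set.

```python
MISSING = None  # stand-in for an explicit missing value

# Hypothetical survey codes; only 997 = refused comes from the example above.
REASON_CODES = {997: "REFUSED", 998: "UNKNOWN", 999: "NO_REASON"}

def recode_age(record):
    """Return a copy of record with reason indicators added and AGE recoded."""
    out = dict(record)
    age = record["AGE"]
    # Build the 0/1 indicators first, while the original code is still visible.
    for code, reason in REASON_CODES.items():
        out[f"AGE_{reason}"] = 1 if age == code else 0
    # Then replace any missing value code with an explicit missing value.
    if age in REASON_CODES:
        out["AGE"] = MISSING
    return out

records = [{"AGE": 34}, {"AGE": 997}, {"AGE": 999}]
recoded = [recode_age(r) for r in records]
for row in recoded:
    print(row)
```

The order of the two steps matters: once AGE has been set to missing, there is no longer a 997 to test against.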
How does SPM process properly coded missing values? This is a complex topic that we have touched on in other posts, and one upon which we will continue to elaborate. Here is a short table providing hints about what the different engines do:
CART: Surrogate or substitute splitters are used in place of the missing splitter; each node in the tree is supplied with node-specific surrogates, making the procedure very flexible.
MARS: Special missing value indicators are automatically created and included in the model; the MARS model will develop submodels that are specially adapted to handle each type of missing variable encountered.
TreeNet: Uses missing value indicators to essentially generate 3-way splits for any predictor with missings; all records with missing values for that variable go to their own node, and the remainder go to a left or right child node, as appropriate.
RandomForests: Offers two approaches: the first uses simple mean imputation for continuous variables and mode imputation for categorical predictors; the second method starts with the first but then iteratively updates the imputed values by taking means or modes just from a record's nearest neighbors as defined by the forest.
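To give a feel for the simplest of these ideas, here is a Python sketch of plain mean and mode imputation, the starting point of the first approach described for RandomForests. It is not Salford's implementation: the function `impute_column` and the sample columns are hypothetical, `None` stands in for a missing value, and the iterative nearest-neighbor refinement is not shown.

```python
from collections import Counter
from statistics import mean

def impute_column(values, categorical=False):
    """Fill None entries with the mode (categorical) or the mean (continuous)
    of the observed values in the same column."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing to learn a fill value from
    if categorical:
        fill = Counter(observed).most_common(1)[0][0]
    else:
        fill = mean(observed)
    return [fill if v is None else v for v in values]

# Hypothetical columns with one missing entry each.
income = [40.0, None, 60.0, 50.0]
region = ["N", "S", None, "N"]

print(impute_column(income))
print(impute_column(region, categorical=True))
```

Here the missing income is filled with 50.0, the mean of the three observed values, and the missing region with "N", the most common observed category.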
Pluses and minuses of these approaches will be discussed later!