by John Sall, Co-Founder and Executive Vice President, SAS
What if I told you that by adding a very simple feature, you could fit many models more accurately than before? I could also show you that your previous answers were biased, while the new ones are much less so. Furthermore, rather than fitting only the nonmissing data, you could use all of the data, making predictions even when regressors are missing.
This big opportunity comes when the missingness of the data is predictive. Consider the JMP sample data table Equity.jmp, which contains home mortgage default data. The response is whether the customer defaulted on the loan. One of the regressors is DEBTINC, which is the ratio of debt to income. DEBTINC is missing for much of the data. Are the missing values predictive of whether the customer defaulted on the loan?
If I fit the usual model with all 12 of the other variables as regressors, the RSquare is .23, but this is on only 3,364 of the 5,960 rows. I can’t predict the response when any of the regressors is missing.
Suppose instead I fit the model using only the missingness of DEBTINC as an indicator variable. The RSquare of this one-regressor model is better (.25), and it uses all 5,960 rows, predicting regardless of whether the data is missing. So missingness can be predictive and informative, and it can even outperform regular regressors.
The new idea is not only to create missing value indicator variables, but also to make the original regressors usable without discarding rows that contain missing data.
In JMP Pro 11, the Fit Model launch window includes a new red triangle menu item called “Informative Missing.” Although the feature is only in JMP Pro, you can, with some effort, achieve the same goal by adding formula columns to the data table, as described in this article.
A traditional way to handle missing data is imputation: you estimate the missing values. One standard imputation method is to predict each missing value from the other variables in the model that are nonmissing. When there are missing values in the data, the Multivariate platform supports imputation with the "Impute Missing Data" option, and there are other imputation approaches as well.
Informative Missing regression coding is much simpler than imputation and also more powerful. For every continuous variable in your model that has missing values, you substitute two variables. The first variable substitutes a mean value for missing data in the column. The other variable is a missing value indicator – 1 if the original column is missing, 0 otherwise.
For example, in Equity.jmp, I create two new formula columns for the continuous regressor DEBTINC. Here is the JSL script:
New Column( "DEBTINC Or Mean if Missing", Numeric, Continuous,
    Formula( If( Is Missing( :DEBTINC ), 33.78, :DEBTINC ) )
);
New Column( "DEBTINC Is Missing", Numeric, Continuous,
    Formula( If( Is Missing( :DEBTINC ), 1, 0 ) )
);
The value 33.78 in the first formula is the mean of DEBTINC, though it can be any value you like.
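Outside of JMP, the same coding is easy to reproduce. Here is a minimal Python sketch (not JMP's implementation; the function name and sample values are hypothetical) that builds the two never-missing substitute variables from a column containing missing entries:

```python
import math

def informative_missing_code(values, fill=None):
    """Code one continuous variable with missing entries (NaN) as two
    never-missing variables: the values with a fill-in for missing cells,
    and a 0/1 missing value indicator."""
    nonmissing = [v for v in values if not math.isnan(v)]
    if fill is None:
        fill = sum(nonmissing) / len(nonmissing)  # default plug-in: the mean
    filled = [fill if math.isnan(v) else v for v in values]
    indicator = [1.0 if math.isnan(v) else 0.0 for v in values]
    return filled, indicator

# Hypothetical DEBTINC-like column with two missing entries
debtinc = [30.0, float("nan"), 35.0, float("nan"), 40.0]
filled, is_missing = informative_missing_code(debtinc)
# filled     -> [30.0, 35.0, 35.0, 35.0, 40.0]  (mean of 30, 35, 40 is 35)
# is_missing -> [0.0, 1.0, 0.0, 1.0, 0.0]
```

Both returned columns are complete, so every row of the original data can be used in the fit.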
Now instead of using DEBTINC in the model, you use the two new formula variables DEBTINC Or Mean if Missing and DEBTINC Is Missing. The extra predictor – the indicator variable – can be strongly predictive of the response. In this example, the act of leaving the field missing is a strong clue about the risk of a loan to the applicant.
These two new predictors are never missing, so they can be used on all of the data. And the missing value indicator may itself be important in the fit, giving you a chance to improve it.
What about the other new regressor, DEBTINC Or Mean if Missing? It looks like a primitive way to impute DEBTINC using just the mean, but the value plugged into the missing cells does not affect the slope estimate for that variable. Rather, the parameter for the missing value indicator estimates the difference between the prediction for missing rows and the prediction at the mean value of that regressor.
You could substitute zero instead of the mean for missing DEBTINC. The indicator parameter would then estimate the difference between the prediction for missing rows and the prediction at zero for that covariate. The plug-in value affects only the interpretation of the indicator parameter estimate.
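This invariance is easy to verify numerically. The sketch below (a hedged illustration on synthetic data, using NumPy least squares rather than JMP) fits the same coded model twice, once with the mean as the plug-in value and once with zero: the predictions are identical, and only the indicator coefficient shifts by slope times mean.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(35.0, 5.0, n)
miss = rng.random(n) < 0.3                      # 30% of x goes missing
y = 1.0 + 0.2 * x + 2.0 * miss + rng.normal(0.0, 0.5, n)
x_obs = np.where(miss, np.nan, x)               # the observed column

ind = np.isnan(x_obs).astype(float)             # missing value indicator
m = np.nanmean(x_obs)                           # mean of the nonmissing values

def fit(fill):
    """Least-squares fit of y on [1, x-with-plug-in, indicator]."""
    xf = np.where(np.isnan(x_obs), fill, x_obs)
    X = np.column_stack([np.ones(n), xf, ind])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, X @ beta

beta_mean, pred_mean = fit(m)    # plug in the mean
beta_zero, pred_zero = fit(0.0)  # plug in zero

# Same fitted values, same slope; the indicator coefficient absorbs slope * mean
assert np.allclose(pred_mean, pred_zero)
assert np.isclose(beta_mean[1], beta_zero[1])
assert np.isclose(beta_zero[2], beta_mean[2] + beta_mean[1] * m)
```

The two design matrices span the same column space (the zero-filled column equals the mean-filled column minus mean times indicator), which is why the choice of plug-in value cannot change the fit itself.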
This technique is not imputation: no predicted values are substituted for the missing ones. Instead, you construct a coding system that recovers information when regressors are missing, and you use the missingness itself for its predictive value. The assumption is that a missing value is not a random event but occurs for some reason that may itself be predictive. That is why we call this coding technique "Informative Missing" regression coding.
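The benefit shows up end to end. Here is a hedged sketch (synthetic data and NumPy least squares, not the Equity.jmp data or JMP's fitting engine) comparing row-wise exclusion of missing data against Informative Missing coding when the missingness carries its own effect on the response:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
x = rng.normal(35.0, 5.0, n)
# Missingness is informative: it shifts the response on its own
miss = rng.random(n) < 0.4
y = 1.0 + 0.2 * x + 3.0 * miss + rng.normal(0.0, 1.0, n)
x_obs = np.where(miss, np.nan, x)

def r_squared(y_true, y_hat):
    ss_res = np.sum((y_true - y_hat) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# 1) Row-wise exclusion: fit and score only the complete cases
ok = ~np.isnan(x_obs)
X1 = np.column_stack([np.ones(ok.sum()), x_obs[ok]])
b1, *_ = np.linalg.lstsq(X1, y[ok], rcond=None)
r2_excl = r_squared(y[ok], X1 @ b1)

# 2) Informative Missing coding: every row is used
ind = np.isnan(x_obs).astype(float)
xf = np.where(np.isnan(x_obs), np.nanmean(x_obs), x_obs)
X2 = np.column_stack([np.ones(n), xf, ind])
b2, *_ = np.linalg.lstsq(X2, y, rcond=None)
r2_inf = r_squared(y, X2 @ b2)
# r2_inf exceeds r2_excl here because the indicator captures the
# missingness effect that row-wise exclusion throws away
```

On this simulated data the coded model both scores all 1,000 rows and fits better, mirroring the pattern in Table 1.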
Table 1 shows the dramatic differences between row-wise exclusion of missing data and Informative Missing for the Equity.jmp data.
Table 1. Comparing Row-Wise Missing Exclusion to Informative Missing
Method | Observations Used | RSquare
Old full model with row-wise missing exclusion | 3,364 | 23.2%
One-variable model with DEBTINC Is Missing | 5,960 | 25.6%
New full model with Informative Missing | 5,960 | 45.6%
Figure 1 shows that missing value indicator variables are very significant, with DEBTINC Is Missing being more significant than any other regressor in the model.
Figure 1. Parameter estimates sorted by significance
As demonstrated here, you can create formula columns (or transforms) to regress with. With JMP Pro 11, you just have to select the “Informative Missing” red triangle menu option in Fit Model. Another option is downloading the “Informative Missing Coding” add-in from the JMP File Exchange.
Missingness can be your friend if you treat it as being valuable.