It’s not just what you say, but what you don’t say: Informative missing values
Sometimes emptiness is meaningful. If a loan applicant leaves his debt and salary fields empty, don’t you think that emptiness is meaningful? If a job applicant leaves his previous job field empty, don’t you think that emptiness is meaningful? If a political candidate fills out a form that has an empty field for last convicted felony, that emptiness is hopefully positively informative. Missing values are values too; they are just harder to accommodate in our statistical methods. You can’t do arithmetic on them to find means. But you can still put them to use in practical ways.
Some data sets are full of missing values. The standard methods have to drop a whole observation if any of the X’s are missing. If you model with lots of variables and one or another of the variables is missing in each row, you may end up with no data to estimate with. Furthermore, even if you do end up with enough data, the rows that remain may be a biased sample of the full data whenever the mechanism that produces the missing values is related to the response. The results are then biased.
One approach to fixing the problem is imputation – you estimate the missing values, or in some cases you provide a distribution of the missing values, producing a distribution of answers. Imputation is a rich area of statistics with a large literature. Some kinds of imputation are easy — like estimating the missing values by predicting them from all the X variables that are not missing. Other kinds are hard and involve getting into the mechanism of why the values are missing.
But suppose we are just data miners – we just want something simple to throw into large predictive models to recover all those observations we would otherwise have to throw away.
Also, suppose that missing values are informative. When we are studying the data to predict if a loan is bad and the loan applicant leaves his income or his current debts missing, we are probably better off assuming that the applicant is being evasive for a reason. The act of leaving the field missing is a strong clue about the risk of a loan to the applicant.
Instead of trying to predict the missing field, we will code for it. If the field is categorical, we just treat missing as another category. If the field is continuous, we have to do something else. For that variable, we make two columns in the design matrix for each continuous regressor X:
X Continuous = if(IsMissing(x),xbar,x)
X Missing = IsMissing(x)
The first column is just X except when X is missing, in which case we substitute the mean of that X. The second column is the indicator of missing.
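As a rough illustration outside JMP, the two design columns can be built in a few lines of Python with NumPy. This is only a sketch of the coding described here, not how JMP implements it; the function name `informative_missing_columns` and the sample values are made up, and missing values are assumed to be stored as NaN.

```python
import numpy as np

def informative_missing_columns(x):
    """Return (filled, is_missing) for a continuous regressor.

    Hypothetical helper: assumes missing values are NaN and that
    at least one value is observed.
    """
    x = np.asarray(x, dtype=float)
    is_missing = np.isnan(x)
    xbar = x[~is_missing].mean()            # mean of the observed values only
    filled = np.where(is_missing, xbar, x)  # X, or the mean where X is missing
    return filled, is_missing.astype(float)

# Made-up values, for illustration only
debtinc = [30.0, np.nan, 40.0, 50.0, np.nan]
filled, flag = informative_missing_columns(debtinc)
print(filled)  # [30. 40. 40. 50. 40.]
print(flag)    # [0. 1. 0. 0. 1.]
```

Both columns then go into the model side by side in place of the original X.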
Each of these design columns has a parameter estimated for it. The parameter for the first column is the estimate that applies to all the non-missing Xs. The parameter for the second is how much the predicted Y has to change if X is missing, where the change is measured from what an average X would predict for Y.
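To see that interpretation in numbers, here is a small sketch using ordinary least squares in NumPy. The data are invented and noise-free so the arithmetic is easy to follow: the observed rows follow Y = 1 + 2X, and the two rows with missing X have Y = 10. It stands in for any linear fit and is not the JMP implementation.

```python
import numpy as np

# Invented data: observed rows follow y = 1 + 2*x; missing-x rows have y = 10
x = np.array([1.0, 2.0, 3.0, 4.0, np.nan, np.nan])
y = np.array([3.0, 5.0, 7.0, 9.0, 10.0, 10.0])

is_missing = np.isnan(x)
xbar = x[~is_missing].mean()              # 2.5, mean of the observed x
x_filled = np.where(is_missing, xbar, x)  # "Or Mean if Missing" column

# Design matrix: intercept, filled x, missing indicator
design = np.column_stack(
    [np.ones_like(x_filled), x_filled, is_missing.astype(float)]
)
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
intercept, slope, missing_shift = coef

# What an average x would predict, before the missing adjustment
pred_at_mean = intercept + slope * xbar
print(np.round(coef, 6))
print(pred_at_mean, pred_at_mean + missing_shift)
```

In this toy case the intercept and slope come out at about 1 and 2, matching the observed rows alone. An average X predicts 6, and the indicator parameter of about 4 is the extra change needed to reach the Y of 10 seen when X is missing.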
This missing-value coding has been in use as a practical method for a long time. In JMP, it was introduced with the new Neural platform in JMP 10. Now with JMP 11, we are introducing it for most of the other Fit Model platforms.
In the past, we supported data mining fits mainly with Neural and Partition platforms, which had missing value handling. The regular modeling platforms suffered when there were many missing values in the data. Now we have an option “Informative Missing” that specifies using the missing value coding system whenever a continuous covariate has missing values, and also uses a missing value level for categorical missing.
The sample directory has a data set called Equity that is a standard predictive modeling example with lots of missing values. For example, the column DEBTINC has 22% of its values missing.
Without Informative Missing, the logistic fit uses only 3,364 out of the 5,960 observations. However, with Informative Missing, all the data is fit. Without Informative Missing, the Rsquare was .23, but with Informative Missing, the Rsquare is .45. We are fitting a lot more observations, and also fitting a lot better.
Notice in the reports that many of the continuous regressors have two parameters: for example, MORTDUE Or Mean if Missing, and MORTDUE Is Missing. If the parameter estimate on the indicator “Is Missing” term is not significant, then using imputation by the mean would have worked. But if that estimate is significant, then the missing values are saying something different from what the mean value would say.
You could do this technique before with lots of effort making formula columns to regress with. With JMP 11, you just have to click a menu item called “Informative Missing.”
Now the Fit Model platforms are ready to join the traditional predictive platforms to make predictive models. By the way, Informative Missing is in the Fit Model platform in JMP 11 for most fitting personalities.
Note: This is part of a Big Statistics series of blog posts by John Sall. Read all of his Big Statistics posts.