john_sall
It’s not just what you say, but what you don’t say: Informative missing values

Sometimes emptiness is meaningful. If a loan applicant leaves his debt and salary fields empty, don’t you think that emptiness is meaningful? If a job applicant leaves the previous-job field empty, don’t you think that emptiness is meaningful? If a political candidate fills out a form and leaves the last-convicted-felony field empty, that emptiness is hopefully positively informative. Missing values are values too — they are just harder to accommodate in our statistical methods. You can’t do arithmetic on them to find means. But you can still put them to use in practical ways.

Some data sets are full of missing values. The standard methods have to drop a whole observation if any of the X’s are missing. If you model with lots of variables and one or another of them is missing in each row, you may end up with no data to estimate with. Furthermore, even if you do end up with enough data, the remaining rows may be a biased sample whenever the mechanism that produces the missing values is related to the response, so the results are biased.

One approach to fixing the problem is imputation – you estimate the missing values, or in some cases you provide a distribution of the missing values, producing a distribution of answers. Imputation is a rich area of statistics with a large literature. Some kinds of imputation are easy — like estimating the missing values by predicting them from all the X variables that are not missing. Other kinds are hard and involve getting into the mechanism of why the values are missing.

But suppose we are just data miners – we just want something simple to throw into large predictive models to recover all those observations we would otherwise have to throw away.

Also, suppose that missing values are informative. When you are studying data to predict whether a loan will go bad and the applicant leaves his income or his current debts blank, it is probably safer to assume the applicant is being evasive for a reason. The act of leaving the field blank is a strong clue about the risk of lending to that applicant.

Instead of trying to predict the missing field, we will code for it. If the field is categorical, we just treat missing as another category. If the field is continuous, we have to do something else. For that variable, we make two columns in the design matrix for each continuous regressor X:

X Continuous = if(IsMissing(x), xbar, x)

X Missing = IsMissing(x)

The first column is just X except when X is missing, in which case we substitute the mean of that X. The second column is the indicator of missing.
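In code, this coding can be sketched like the following (Python with NumPy standing in for JMP's formula columns; NaN marks a missing value):

```python
import numpy as np

def informative_missing_coding(x):
    """Split a continuous column with NaNs into two design columns:
    the value with the column mean substituted where missing, and a
    0/1 indicator of missingness."""
    x = np.asarray(x, dtype=float)
    miss = np.isnan(x)
    xbar = x[~miss].mean()                  # mean of the non-missing values
    x_continuous = np.where(miss, xbar, x)  # X, or xbar where X is missing
    x_missing = miss.astype(int)            # 1 where X is missing, else 0
    return x_continuous, x_missing

xc, xm = informative_missing_coding([3.0, np.nan, 5.0, 7.0])
print(xc)  # [3. 5. 5. 7.]  (the mean of the non-missing values is 5.0)
print(xm)  # [0 1 0 0]
```

Both columns then enter the design matrix in place of the original X.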

Each of these design columns has a parameter estimated for it. The first column’s parameter is the slope estimate for all the non-missing Xs. The second’s is how much the predicted Y changes when X is missing, where the change is measured from what an average X would predict for Y.
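The two estimates act like this (a minimal sketch with hypothetical coefficients, not values from any real fit):

```python
# Hypothetical estimates for illustration only:
b1, b2 = 0.8, 1.5   # slopes on the X-continuous and X-missing columns
xbar = 5.0          # mean of the non-missing X values

def linear_term(x):
    """Contribution of this regressor to the linear predictor.
    Pass None to represent a missing X."""
    if x is None:
        # Missing: the average-X prediction shifted by the indicator slope.
        return b1 * xbar + b2
    return b1 * x

print(linear_term(6.0))   # 4.8 : the usual slope applies
print(linear_term(None))  # 5.5 : 0.8*5.0 + 1.5, i.e. b2 above the mean prediction
```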

This missing-value coding has been in use as a practical method for a long time. In JMP, it was introduced with the new Neural platform in JMP 10. Now with JMP 11, we are introducing it for most of the other Fit Model platforms.

In the past, we supported data mining fits mainly with the Neural and Partition platforms, which had missing value handling. The regular modeling platforms suffered when there were many missing values in the data. Now we have an option, “Informative Missing”, that applies the missing-value coding whenever a continuous covariate has missing values, and also treats missing as a separate level for categorical variables.

The sample directory has a data set called Equity that is a standard predictive modeling example with lots of missing values. For example, the column DEBTINC has 22% of its values missing.

Without Informative Missing, the logistic fit uses only 3,364 out of the 5,960 observations. However, with Informative Missing, all the data is fit. Without Informative Missing, the Rsquare was .23, but with Informative Missing, the Rsquare is .45. We are fitting a lot more observations, and also fitting a lot better.

Notice in the reports that many of the continuous regressors have two parameters: for example, MORTDUE Or Mean if Missing, and MORTDUE Is Missing. If the parameter estimate on the indicator “Is Missing” term is not significant, then imputation by the mean would have worked just as well. But if it is significant, then the missing values are saying something different from what the mean value would predict.
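To see why the indicator term matters, consider made-up data where missingness shifts the response: the indicator coefficient picks up that shift, while mean imputation alone would miss it. A small synthetic sketch (plain least squares rather than JMP's logistic fit, with fabricated data, not the Equity example):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x = rng.normal(50, 10, n)          # the true covariate
miss = rng.random(n) < 0.2         # about 20% of rows have x missing
# Missingness itself shifts the response upward by 8:
y = 2.0 * x + 8.0 * miss + rng.normal(0, 1, n)

# Informative-missing coding: mean-substituted value plus 0/1 indicator.
xbar = x[~miss].mean()
x_cont = np.where(miss, xbar, x)
X = np.column_stack([np.ones(n), x_cont, miss.astype(float)])

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
# Slope on x_cont is about 2, and the indicator coefficient is clearly
# non-zero, so the missing values are informative here.
print(np.round(beta, 1))
```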

You could do this technique before with lots of effort making formula columns to regress with. With JMP 11, you just have to click a menu item called “Informative Missing.”

Now the Fit Model platforms are ready to join the traditional predictive platforms to make predictive models. By the way, Informative Missing is in the Fit Model platform in JMP 11 for most fitting personalities.

Note: This is part of a Big Statistics series of blog posts by John Sall. Read all of his Big Statistics posts.

7 Comments
New Contributor

Very helpful.

I want to understand how Informative Missing works, but I am constrained by having JMP7/8 as my licenced software.

 

I am trying to run the Robert Anderson Partition model of Bands_Data.jmp using only JMP7 to replicate the approach.

 

This would have been very helpful when dealing with Regression of Insurance data and Manufacturing variables with many missing data values hindering Stepwise regression.

 

David Ashling

0783 400 6111

Hi David,

 

as John described in the middle part, you can create new formula columns for each variable where missingness occurs. For a continuous variable, you just have to add two new formula columns (right-click on your new column and select Formula):

 

X Continuous = if(IsMissing(x),xbar,x)  

and second column just the indicator:

X Missing = IsMissing(x)

 

Please be aware that the formula as originally posted in John's blog had the arguments swapped (x as the consequence of missingness and xbar otherwise); it should read as in the formula I have above, since you want to substitute the column mean when the value is missing. Btw. you can replace the column mean with any imputation formula you find appropriate, as the mean is not always the best imputation (but that is a separate topic in itself).

 

Now you have two columns and you can add these two instead of the messy original column to your regression as factors. 

As this can be quite an effort when you have lots of predictors, this option was included as of JMP 11.

 

Is there any specific reason why you cannot update your license to a newer JMP version (currently JMP 13.1)? Much more has been added that helps with modeling messy data; you may want to take a look.

Regards,

Martin

New Contributor
Dear Martin, I retired from active consultancy in Lean Six Sigma before I could convert clients from Minitab to JMP, so I never realized the benefits of JMP's capability to handle large numbers of predictors and missing values converted to informative missing.
So many of the analyses I ran on historical data were clumsy, as rows with missing data had to be massaged to uncover possible effects. The only advantage this gave me was as a prompt to project leaders and analysts to use DoE to discover which were true effects, interactions, and response-surface terms, and which were attempts to fit noise.
We had a lot of measurement systems with artificially induced cut-offs that again would have been useful to treat differently while at Kodak up until 2003.
Now I have retired my finances don't stretch to a JMP Pro licence, which is what I would buy if I were still consulting.
Occasionally I am tempted to look for work in this field, which is why I keep up with the seminars on topics that illustrate JMP 13 (Pro's) capability.
In JMP7 it was a bit laborious to convert missing values in Bands_Data.jmp to informative, but I had a go last night.
Here is my first draft of a JMP7 informative missing file from the sample data file.

New Contributor

Hi Martin,

In JMP7 I create the continuous X with column mean (x) values using ColMean(x) in:

 

If( IsMissing(x), ColMean(x), x )

 

And X Missing = IsMissing(x)

 

Laborious, but it works to reveal the effect of the missing data values on Bands? (Y)

 

David

Hi David,

the sample file you mentioned has not been attached. However, the formula should work out, and now you need to add both new columns to your regression instead of the original one. If the missingness turns out to be informative, the indicator column should turn out to be significant. With the Informative Missing option in newer JMP versions, the report labels these terms "Col 1 Or Mean if Missing" and "Col 1 Is Missing", as you can see in John's screenshot of the final report.

Hope that helps,

Martin

New Contributor

Dear Martin,

I went over the Bands Data.jmp file again and used the two column formulas to convert missing continuous variables into the column mean of the recorded data, added the indicator column of 0/1 to flag Is Missing/Not Missing, re-ran the regressions with the two new columns, and compared to the regression of the original data with missing values.

 

It's still a long-winded way of doing it compared to JMP Pro, but at least it works for me.

 

Best regards,

David

Hi David,

happy it works out.

Best, Martin
