Is all lost because of missing data?

Peter_Bartell · Jun 18, 2018 10:57 AM

Recently a JMP user called me with the following question, “When I try to fit a nominal logistic regression model I get a JMP Alert message telling me that a ‘Response must have at least two levels’. And the platform won’t run. Yet when I select a model with a subset of my predictor variables the model runs. What’s going on? Is all lost?”

Is all lost? Perhaps not.

Figure 1. A JMP Alert message.

The reason for the above JMP Alert message is too much missing data in the data set. By default, JMP excludes from the analysis any row that has a missing value in the response or in any predictor included in the model. If so many rows are excluded that the remaining rows all share a single response level, the platform will not run. That is also why a model with a subset of the predictors can still run: fewer columns means fewer excluded rows. Sometimes even missing data has information we can use to solve practical problems. How can we explore this possibility if JMP, by default, excludes rows with missing data? The answer: Use JMP’s missing data detection, imputation and informative missing modeling capabilities.

Statistical modeling best practices suggest that we should spend some time and energy evaluating the quality of a data set with an eye towards issues such as outliers, nonsense values, missing values and any other characteristics that make analysis problematic. This article provides an example of how one can use JMP’s capabilities around data ‘missingness’ to still make sound decisions and take appropriate actions.

Every modeling problem should start with a business problem worth our time to solve. I’ll work through a hypothetical but illustrative problem. A sport fishing tournament organizing committee wants to understand the key drivers of participant satisfaction with the tournament in order to drive repeat participation. They ask the past year’s participants for information, data and feedback about how likely they are to participate in the same tournament next year. The committee collects this information via a survey. However, one characteristic of people who go fishing is…well…they may not always be the most truthful or forthcoming. How many people would give away their favorite, most productive fishing spots for landing the BIG ONE to a total stranger? Or the type of bait they used?

The committee asks participants whether they are likely to return next year, their age, and the type of bait they used (artificial or live). Age and bait type are the predictor variables used in modeling. The complete data set is shown below:

Figure 2. The Original Data

Note that there are several missing values in both the Age of Participant and Bait Type? columns. Because JMP excludes rows containing missing data, if we attempt to model the data ‘as is’ we end up with the JMP Alert message shown in figure 1. In fact, there are only two rows with no missing values (highlighted in figure 2)! No wonder JMP is alerting us!
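For readers who like to see the mechanics, here is a minimal Python sketch of row exclusion. It is not JSL and not how JMP is implemented. The rows are made-up values, not the tournament data in figure 2, and the helper names `complete_cases` and `response_levels` are hypothetical. It shows how dropping incomplete rows can leave a response with a single level, while a model with fewer predictors keeps enough rows to see both levels.

```python
# Hypothetical toy rows (NOT the article's data); None marks a missing value.
rows = [
    {"come_back": "Yes", "age": 34,   "bait": "Live"},
    {"come_back": "Yes", "age": None, "bait": "Artificial"},
    {"come_back": "No",  "age": 51,   "bait": None},
    {"come_back": "Yes", "age": 45,   "bait": "Artificial"},
    {"come_back": "No",  "age": None, "bait": "Live"},
    {"come_back": "Yes", "age": 29,   "bait": None},
]


def complete_cases(rows, columns):
    """Keep only the rows that have no missing value in the given columns."""
    return [r for r in rows if all(r[c] is not None for c in columns)]


def response_levels(rows, response="come_back"):
    """Return the sorted distinct values of the response column."""
    return sorted({r[response] for r in rows})


# Model with both predictors: every row with any missing value is dropped.
full_model = complete_cases(rows, ["come_back", "age", "bait"])
# Model with age only: missing bait values no longer cost us rows.
age_only = complete_cases(rows, ["come_back", "age"])

print(len(full_model), response_levels(full_model))  # 2 ['Yes']
print(len(age_only), response_levels(age_only))      # 4 ['No', 'Yes']
```

In this toy case the two complete rows are both "Yes", so a logistic model has nothing to discriminate between, which is the situation the alert describes.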

How should we proceed? Find the respondents and ask ‘again’? Fabricate data? Give up?

Well, let’s back up a minute. With a quick visual inspection of the data table shown in figure 2, it’s easy to locate the missing data. What if we had a much larger data set? Millions of rows? Hundreds of columns? Visual inspection for ‘missingness’ is an untenable solution. JMP’s Missing Data Pattern and Explore Missing Values platforms come to the rescue. When we find rows with missing values…what do we do about the fact that they are missing? From here the Explore Missing Values and Informative Missing capabilities in JMP come to our aid.

Let’s use the Missing Data Pattern platform to find missing values. The platform produces the Missing Data Pattern table shown in figure 3 (lower right). The Patterns column is the key. Each digit in a pattern is a binary indicator for one column in the original data table: 0 means the value is present, and 1 means it is missing. A pattern of all zeros (first row) indicates no missing values, and its Count column entry shows that there are 2 such rows. The second row is a pattern where values are missing exclusively from the Bait Type? column. The third row is a pattern where values are missing from the Age of Participant column. Thanks to JMP’s signature dynamic linking, selecting rows 2 and 3 in the Missing Data Pattern data table selects the corresponding rows in the original data.

Figure 3. Missing Data Pattern Data Table and Original Data
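The idea behind a missing data pattern table can be sketched in a few lines of Python. This is only an illustration of the concept, not JMP's implementation. The rows are hypothetical toy values rather than the data in figure 2, and the names `missing_pattern` and `rows_with_pattern` are made up for this example.

```python
from collections import Counter

# Hypothetical toy rows (NOT the article's data); None marks a missing value.
columns = ["come_back", "age", "bait"]
rows = [
    ("Yes", 34, "Live"),
    ("Yes", None, "Artificial"),
    ("No", 51, None),
    ("Yes", 45, "Artificial"),
    ("No", None, "Live"),
    ("Yes", 29, None),
]


def missing_pattern(row):
    """One digit per column: '1' if the value is missing, '0' otherwise."""
    return "".join("1" if value is None else "0" for value in row)


# Count how many rows share each pattern.
pattern_counts = Counter(missing_pattern(row) for row in rows)


def rows_with_pattern(pattern):
    """Row indices matching a pattern, similar in spirit to linked selection."""
    return [i for i, row in enumerate(rows) if missing_pattern(row) == pattern]


for pattern, count in sorted(pattern_counts.items()):
    print(pattern, count)
# 000 2
# 001 2
# 010 2

print(rows_with_pattern("001"))  # [2, 5]
```

The digits follow the column order in `columns`, so "001" here means only the bait value is missing and "010" means only the age is missing.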

Now that we’ve found the missing values, how can we answer the fundamental question…What factors influence participant likelihood to return? From here let’s turn to JMP’s Explore Missing Values platform to impute values for the Age of Participant column. Isn’t this fabricating data? Isn’t this dishonest? Well…not exactly…especially if we disclose that we’re creating surrogate values so that, from a modeling perspective, we don’t have to throw the baby out with the bath water. It turns out…with a bit of modeling magic…that if we choose our surrogate values wisely, parameter estimates for the effects will still be meaningful. Within the Explore Missing Values platform there are two imputation algorithms. For this example I’ve chosen the Multivariate Normal Imputation option shown in figure 4 below. Note that surrogate values equal to the mean of Age of Participant now populate the rows that had missing values in the original data table.

Figure 4. Multivariate Normal Imputation
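Since the surrogate values here end up equal to the column mean, the result can be mimicked with plain mean imputation. The Python sketch below is a simplified stand-in and not JMP's Multivariate Normal Imputation algorithm. The ages are hypothetical toy values, not the data in the figures. It also keeps a flag for which values were imputed, in the spirit of disclosing surrogate values.

```python
from statistics import mean

# Hypothetical toy ages (NOT the article's data); None marks a missing value.
ages = [34, None, 51, 45, None, 29]

# Mean of the observed values only.
observed = [a for a in ages if a is not None]
age_mean = mean(observed)  # (34 + 51 + 45 + 29) / 4 = 39.75

# Fill each gap with the mean and remember which entries are surrogates.
imputed_ages = [a if a is not None else age_mean for a in ages]
was_imputed = [a is None for a in ages]

print(imputed_ages)  # [34, 39.75, 51, 45, 39.75, 29]
print(was_imputed)   # [False, True, False, False, True, False]
```

With more than one continuous column, a multivariate method can use the relationships between columns to produce different surrogate values for different rows, which plain mean imputation cannot do.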

Now, how do we deal with the missing values in the Bait Type? column? One way to handle missing values in a categorical predictor variable is to invoke JMP’s Informative Missing capability. The underlying principle of an informative missing construct is that we are willing to presume there is some theme, commonality, or lurking variable behind the fact that the value is missing. Perhaps, for example, a tournament contestant is reluctant to disclose the kind of bait they used, for some undisclosed reason?

From the Fit Model specification window we invoke Informative Missing by selecting Informative Missing from the Model Specification red triangle drop-down menu shown in figure 5 below.

Figure 5. Model Specification and Informative Missing

By selecting Informative Missing, JMP will assign a new level of ‘Missing’ for the Bait Type? categorical predictor and proceed with the model fitting process using all the values in all the predictor variable columns. Figure 6 below shows the Effect Summary portion of the Nominal Logistic regression model report. Note that Bait Type? does indeed seem to influence the response.

Figure 6. Effect Summary Table
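Treating ‘Missing’ as its own level can be sketched in Python as follows. This is a conceptual illustration under my own assumptions, not JMP's internal coding of nominal effects: it uses simple 0/1 indicator columns, the bait values are hypothetical toy data, and the helper names `recode_missing` and `indicator_columns` are made up for this example.

```python
# Hypothetical toy bait types (NOT the article's data); None marks missing.
bait = ["Live", "Artificial", None, "Artificial", "Live", None]


def recode_missing(values, label="Missing"):
    """Turn missing entries into an explicit category level."""
    return [label if v is None else v for v in values]


def indicator_columns(values):
    """Build one 0/1 indicator column per distinct level."""
    levels = sorted(set(values))
    return {lvl: [1 if v == lvl else 0 for v in values] for lvl in levels}


recoded = recode_missing(bait)
design = indicator_columns(recoded)

print(sorted(design))     # ['Artificial', 'Live', 'Missing']
print(design["Missing"])  # [0, 0, 1, 0, 0, 1]
```

Because every row now has a valid level, no row has to be excluded on account of the bait column, and the ‘Missing’ level gets its own estimated effect in the model.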

Further examination of the JMP Prediction Profiler, shown in figure 7, indicates that the missing value level of Bait Type? is really driving the Yes response!

Figure 7. Prediction Profiler

This is where domain expertise comes into play. It sure stands to reason that a tournament contestant who experiences LOTS of success catching fish might be inclined NOT to disclose the bait type they used. Trade secrets? But they’d sure be inclined to return to a tournament where they had success.

Post analysis, a deeper dive into the tournament records revealed additional data that is certainly consistent with the notion of secretive participants. The Fit Y by X logistic regression plot in figure 8 below shows the probability of Come Back Next Year? as a function of Total weight of fish caught…and the participants who caught them sure aren’t telling us much about their secrets.

Figure 8. Logistic Regression Plot

So next time you’ve got missing values in a data set, remember that using JMP you can perhaps learn something about the system EVEN when data is in fact missing! All is indeed not lost.