What is ghost data?

arati_mejdal · Oct 10, 2017 10:22 AM

John Sall explains why we need not be afraid of ghost data.

Quantitative and qualitative data. Numeric and character data. Expression and unstructured text data. These are all types of data you may already know about and know how to deal with.

But there's another category of data you may not have heard of yet: ghost data. And no, it isn't something I made up for Halloween.

John Sall, co-founder and Executive Vice President of SAS, has been thinking about ghost data, which is data that isn't actually there. He says we ought to understand and appreciate ghost data.

In his role as the head of the JMP division of SAS, he talks to customers and helps develop the software. He also reads and thinks a lot about data, statistics and technology. You may have read his blog posts about why statistics is essential and why the desktop computer is not dead. He is a Fellow of the American Statistical Association and the American Association for the Advancement of Science. And he is a keynote speaker at Discovery Summit 2017 in St. Louis, where he will talk about ghost data. Read on for his answers to my questions about ghost data (including one silly question).

Tell me about ghost data.

Ghost data is any data that is not there, and there are many types. For example, there’s virtual data, which isn’t there until you look at it. It looks real enough on the surface, but it is only materialized as needed. There’s sparse data, whose absence implies a zero. Sparsity is a key computational enabler in several kinds of analysis. There’s also missing data. This data has a slot to hold a value, but that value is unknown, so the slot is empty. Another type of ghost data is pretend data, which is data that is made up. This kind of data is important for simulation, to answer “What if?” questions. There are more types of ghost data, but these four are important for data analysis in JMP.
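To make the four types concrete, here is a minimal Python sketch. It is an illustration only, not how JMP stores or handles data; every name and value in it (`virtual_column`, `sparse_row`, `incomes`, `simulated`) is a made-up assumption.

```python
import random

# Virtual data: nothing is stored; each value is computed when requested.
def virtual_column(n):
    """Lazily yield the squares 0, 1, 4, ... (hypothetical example column)."""
    for i in range(n):
        yield i * i

# Sparse data: only nonzero entries are stored; an absent entry implies zero.
sparse_row = {2: 5.0, 7: 1.5}  # column index -> value (made-up numbers)

def sparse_get(row, j):
    """Return the stored value, or 0.0 when the entry is absent."""
    return row.get(j, 0.0)

# Missing data: the slot exists, but its value is unknown (None here).
incomes = [52000.0, None, 61000.0]  # made-up numbers
n_missing = sum(v is None for v in incomes)

# Pretend data: values made up by simulation to explore "What if?" questions.
rng = random.Random(0)  # fixed seed so the run is repeatable
simulated = [rng.gauss(0.0, 1.0) for _ in range(5)]
```

Note the difference between the sparse row and the `incomes` list: an absent sparse entry has a known value (zero), while a missing value is unknown and must not be read as zero.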

Should we be scared of ghost data?

There’s no need to be scared of ghost data. It’s just data that isn’t there. It is as natural as real data. Just as we appreciate the data we have, we also need an appreciation of data that isn’t there. We need to know how to handle it, know how to model with it, and put it to work. Handling ghost data properly is very important.

What’s one way to handle ghost data?

Let’s look at missing data. In the past, we used to skip the rows where one of the analysis variables was missing. The problem is that this method introduces biases into the results if the values are not missing completely at random. Now, we incorporate missing values directly into the model, assuming that the missingness is predictive. This is called “informative missing.” Mortgage data for the years before the financial crisis of 2008 is a good example of informative missing. When modeling the probability of a mortgage default, the results are biased if all the rows that have missing values for variables like “income” and “debt-to-income ratio” have been dropped. Often the loan applicant knew that providing those values would result in failure to qualify for the loan, so the values were omitted. In the data table we supply with our sample data, the missingness of debt-to-income ratio is the most important predictor of mortgage default.
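One common way to let a model use missingness as a predictor is to keep every row, add a 0/1 indicator column that flags the missing values, and fill each gap with the mean of the observed values. The Python sketch below shows that coding for a continuous variable. It is a minimal illustration under those assumptions, not a description of JMP's internals; the function name `informative_missing` and the debt-to-income values are made up.

```python
def informative_missing(values):
    """Encode a continuous column for an 'informative missing' style model.

    Returns (filled, indicator): missing entries (None) are replaced by the
    mean of the observed entries, and indicator is 1 where a value was missing.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("column has no observed values to average")
    mean = sum(observed) / len(observed)
    filled = [mean if v is None else v for v in values]
    indicator = [1 if v is None else 0 for v in values]
    return filled, indicator

# Made-up debt-to-income ratios; None marks an applicant who left it blank.
dti = [0.25, None, 0.5, None]
dti_filled, dti_is_missing = informative_missing(dti)
```

Both columns then go into the model as predictors. No rows are dropped, and the coefficient on the indicator column measures how much the missingness itself shifts the predicted outcome.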

If you wanted to go as ghost data for Halloween, what would your costume look like?

Either the Invisible Man, or Harry Potter using the cloak of invisibility, or a metamaterials version of it.

jon_stallings · 10-11-2017

This is why every data analyst needs training in design of experiments. The whole point of randomization is to mitigate the impact of ghost data.

arati_mejdal · 10-11-2017

Thanks for your comment, Jon!

Peter_Bartell · 10-13-2017

@jon_stallings While there are many strategies in DOE to deal with ghost data...I think of blocking as a means to deal with one source of a lurking factor within the conduct of an experiment...missingness is also seen AFTER the execution of designed experiments...how many times do we see missing response values due to all manner of causes in DOE?

Ressel · 02-28-2022

Great - love the reference to the '08 crisis!