JMP Blog

Daniel_Valente · Oct 18, 2017 11:28 AM

Virtual, sparse, missing or pretend data -- it's ghost data. Don't be spooked by it, says John Sall. Learn how to deal with it.

Data is everywhere, and its velocity, volume and variety present increasing challenges to the scientist or engineer who is trying to make statistical discoveries.

But there is a whole other type of data that John Sall, executive vice president and co-founder of SAS, thinks data explorers should be aware of: ghost data – data that isn’t there.

Ghost data was the topic of John’s keynote at Discovery Summit 2017 in St. Louis, MO. John leads the JMP development team and remains the principal architect of the desktop statistical discovery software.

He says scientists and engineers often deal with four types of ghost data. Here's what they are and what to do with them.

1. Virtual Data

Virtual data is data that isn’t there until you look for it or need it. This type of ghost data is useful because it doesn’t clutter up your data table. Two examples are transform columns and virtual joins in JMP. Transform columns let you materialize a transformed version of a column (say, taking a log transform or a ratio of two columns) just in time to produce a graph without actually having to produce the new column. This keeps the data table clean and also saves memory.
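To make the idea concrete outside of JMP, here is a minimal Python sketch of a "compute it only when asked" column. It is not how JMP implements transform columns; the `table` dictionary and the `transform` and `ratio` helpers are hypothetical names made up for this illustration.

```python
import math

# Hypothetical stand-in for a data table: column name -> list of values.
table = {"dose": [1.0, 10.0, 100.0], "weight": [2.0, 4.0, 5.0]}

def transform(column, fn):
    """Return a generator that applies fn to a stored column only when iterated."""
    return (fn(value) for value in table[column])

def ratio(numerator, denominator):
    """Lazily yield the row-wise ratio of two stored columns."""
    return (n / d for n, d in zip(table[numerator], table[denominator]))

# The transformed values exist only at the moment a consumer (say, a plot) needs them.
log_dose = list(transform("dose", math.log10))
dose_per_weight = list(ratio("dose", "weight"))

# The table itself still holds only the two original columns.
print(sorted(table))
```

Nothing is written back to `table`, which is the point: the derived values live only as long as whatever consumed them.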

Virtual joins let you link tables by keys, so you can access columns from an auxiliary table in a main table without actually performing the join. This helps with table maintainability and memory usage, especially when the physical join would combine a wide table with a tall one.
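The same idea can be sketched in a few lines of Python. This is only an analogy, not JMP's mechanism: the `products` and `orders` tables and the `linked` helper are hypothetical, and the lookup simply follows a key into the auxiliary table each time a value is needed.

```python
# Hypothetical auxiliary table, keyed by product ID.
products = {
    "A1": {"name": "widget", "price": 2.5},
    "B2": {"name": "gadget", "price": 1.0},
}

# Hypothetical main table: each row stores only the key, not the product columns.
orders = [
    {"order_id": 1, "product_id": "A1", "qty": 4},
    {"order_id": 2, "product_id": "B2", "qty": 3},
    {"order_id": 3, "product_id": "A1", "qty": 2},
]

def linked(row, column):
    """Fetch a column from the auxiliary table through the row's key."""
    return products[row["product_id"]][column]

# Use the auxiliary column as if it were part of the main table.
totals = [row["qty"] * linked(row, "price") for row in orders]
```

The main table never grows: the product name and price are stored once, no matter how many order rows refer to them.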

2. Sparse Data

Sparse data is data in which the absence of a data point implies a zero. Algorithms that take advantage of this can skip over the absent values, which greatly speeds up calculations. One example of sparse data is the document term matrix (DTM) in text analysis. The DTM is sparse because any individual word appears in only a small fraction of the documents in a corpus. (I feel like there is a joke here about a corpus of ghost data.)
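A tiny Python sketch shows why skipping the zeros pays off. The three documents are made up for illustration, and the `sparse_dot` helper is a hypothetical name; each document stores only the words it actually contains, and a dot product between two documents touches only those stored entries.

```python
from collections import Counter

# Made-up corpus for illustration.
docs = ["ghost data is data", "sparse data skips zeros", "ghost story"]

# One Counter per document: only words that occur are stored; absence means zero.
dtm = [Counter(doc.split()) for doc in docs]
vocabulary = sorted(set().union(*dtm))

stored_cells = sum(len(row) for row in dtm)
dense_cells = len(docs) * len(vocabulary)

def sparse_dot(a, b):
    """Dot product of two sparse rows, visiting only the shorter row's entries."""
    if len(a) > len(b):
        a, b = b, a
    return sum(count * b[word] for word, count in a.items() if word in b)

print(stored_cells, "stored cells instead of", dense_cells)
print(sparse_dot(dtm[0], dtm[1]))
```

With a real corpus the vocabulary runs to thousands of words while each document uses only a handful, so the gap between stored and dense cells becomes far larger than in this toy case.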

Multivariate methods such as Principal Components Analysis (PCA) can take advantage of this sparsity. When the sparse method is chosen within the PCA platform in JMP, running the analysis on a DTM of 2,000 words goes from 10 minutes to about 1 second. Sparse methods are available throughout JMP, including in PCA, discriminant analysis, Text Explorer and Mixed Models.

3. Missing Data

Many scientists and engineers report dealing with missing data on a regular basis. This type of ghost data is data that we don't know or that doesn't exist. There are many reasons why data may be missing: sensors fail, networks go down or subjects decline to answer survey questions. But just because data is missing doesn't mean it can't be useful. Many of the modeling platforms in JMP can take advantage of this "missingness" to build better models instead of simply throwing out the observation. The Partition platform, Fit Model and Neural, for example, can all use missingness as a predictor by taking advantage of something known as informative missing.

In addition to using informative missing in your modeling, you can also take advantage of two modeling utilities: Explore Outliers and Explore Missing Values. With these, you can recode values that are in the data but ought to be considered missing (like machine error codes) as missing so they don't affect model predictions. You can also impute missing values, that is, fill them in with estimates of what the values might be. By triaging and dealing with missing values appropriately, you don't end up missing out on key insights.
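Here is a small Python sketch of those three steps: recoding error codes as missing, building a missingness indicator that a model could use as a predictor, and imputing. It shows one common way to code these ideas, not necessarily what JMP does internally; the error codes, the sensor readings and the mean-imputation choice are all assumptions made for illustration.

```python
from statistics import mean

# Hypothetical machine error codes that should be treated as missing.
ERROR_CODES = {-999.0, 9999.0}

# Made-up sensor readings; None marks a value that was never recorded.
readings = [20.5, -999.0, 21.5, None, 9999.0, 21.0]

def to_missing(value):
    """Recode error codes as missing (None); pass real values through."""
    return None if value is None or value in ERROR_CODES else value

cleaned = [to_missing(v) for v in readings]

# Indicator column: 1 where the value is missing, so a model can use missingness itself.
is_missing = [int(v is None) for v in cleaned]

# Simple mean imputation (an assumption here; other imputation methods exist).
fill = mean(v for v in cleaned if v is not None)
imputed = [fill if v is None else v for v in cleaned]
```

Keeping `is_missing` alongside `imputed` means the filled-in rows stay in the analysis while the model can still learn whether the fact of being missing carries information.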

4. Pretend Data

The last type of ghost data is data that we make up. Pretend or simulated data can be used to test hypotheses, determine whether an experimental design has enough power to detect an effect, or help make a process robust to factor variation. Simulation tools that create data to answer key questions are found in a variety of places in JMP. The Profiler has Monte Carlo simulation built in. You can use the Simulate Responses option in the DOE platforms of JMP, along with the expected effect size and an estimate of error, to see if you have enough runs in your experiment to detect an effect. You can use the right-click simulation functions in JMP Pro to dig deeper into this power analysis, or use MCMC in Hierarchical Bayes to look at each individual subject and how they differ from other individuals based on their preference structure.
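The power question can be sketched with a few lines of simulation in Python. This is a simplified stand-in, not JMP's procedure: it assumes a two-group comparison with normally distributed errors and a known error standard deviation, so a z-test can be used, and the function name `simulated_power` and its parameters are hypothetical.

```python
import random
from statistics import NormalDist

def simulated_power(effect, sigma, n_per_group, n_sims=2000, alpha=0.05, seed=1):
    """Estimate power as the fraction of simulated experiments that detect the effect.

    Assumes two groups, normal errors and a known sigma (two-sided z-test).
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    std_err = sigma * (2 / n_per_group) ** 0.5
    detections = 0
    for _ in range(n_sims):
        # Pretend data: one simulated experiment per iteration.
        control = [rng.gauss(0.0, sigma) for _ in range(n_per_group)]
        treated = [rng.gauss(effect, sigma) for _ in range(n_per_group)]
        diff = sum(treated) / n_per_group - sum(control) / n_per_group
        if abs(diff / std_err) > z_crit:
            detections += 1
    return detections / n_sims

power_small = simulated_power(effect=1.0, sigma=1.0, n_per_group=5)
power_large = simulated_power(effect=1.0, sigma=1.0, n_per_group=20)
print(power_small, power_large)
```

Rerunning with different values of `n_per_group` shows how many runs are needed before the simulated experiments detect the expected effect most of the time.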

For scientists and engineers exploring data, John has this guidance: “We are detectives in the world of data. We need to learn to work well with that that isn’t there, ghost data. If we don’t, we may be inefficient, we may come to the wrong conclusions, or we may miss discoveries.”

You can watch a recording of John's speech here in the JMP User Community.