Shining a light on dark data

Renowned statistician, educator and author David Hand has a gift for making the abstract concrete and the invisible visible. In his latest book, Dark Data: Why What You Don’t Know Matters, he categorizes the different kinds of “data we don’t see” and the consequences of ignoring them.

I greatly enjoyed this book and highly recommend it. The chapter “Science and Dark Data: The Nature of Discovery” was of special interest, and I will explain why, but first a bit of background...

In 2014, I interviewed David for Analytically Speaking, shortly after he was awarded the Order of the British Empire for research and innovation and released his very popular book, The Improbability Principle: Why Coincidences, Miracles, and Rare Events Happen Every Day (which I also greatly enjoyed and recommend). When I asked him what led him to statistics, he said that when he was young, he was fascinated by science and wanted to be a scientist. As a youngster, he thought being a scientist meant you might go on an archeological expedition in the morning, find a cure for cancer in the afternoon, and discover a new star in the evening. Eventually, David realized that if he wanted to be a scientist, he had to specialize, but he didn’t want to narrow his focus.

Thankfully, David “accidentally” discovered statistics, which he views as an infrastructure – to other sciences, to government, and essentially to “everyone’s backyard.” Through statistics, David could be the scientist he wanted to be, working in collaboration with other scientists by bringing his statistical expertise to bear on a wide variety of things. 


Because JMP users are primarily scientists and engineers – many of whom are doing a phenomenal job working on timely issues related to the pandemic (where there are a lot of dark data challenges!) – JMP is happy to offer the excerpt “Dark Data and the Big Picture” from the chapter “Science and Dark Data: The Nature of Discovery.”

In this excerpt, David explores the reproducibility issue, emphasizing that science is a process, and despite the challenges, the process is not broken (though we would all certainly benefit “if fewer incorrect conclusions were initially drawn”). Of course, many factors contribute to irreproducibility, and David covers them well before discussing some tools for avoiding incorrect conclusions. Among them are better study design and more recent methods, such as Yoav Benjamini and Yosef Hochberg’s work on the false discovery rate.
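For readers who have not met the false discovery rate before, here is a minimal Python sketch of the Benjamini-Hochberg step-up procedure. It is an illustration only, not taken from David’s book or from JMP: the function name `benjamini_hochberg`, the parameter `q` (the target false discovery rate) and the sample p-values are all assumptions made for the example.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return one boolean per p-value: True where the hypothesis is rejected.

    Hypothetical helper for illustration. `q` is the target false
    discovery rate (an assumed default of 0.05).
    """
    m = len(p_values)
    # Positions of the p-values, ordered from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Step-up rule: find the largest rank k (1-based) such that
    # the k-th smallest p-value is at most (k / m) * q.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank

    # Reject every hypothesis whose rank is at or below that cutoff,
    # even if its own p-value missed its own threshold.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected


# Made-up p-values from ten hypothetical tests.
example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
discoveries = benjamini_hochberg(example, q=0.05)
```

With these made-up values, only the two smallest p-values are flagged as discoveries, even though five of them fall below the conventional 0.05 cutoff. That is the point of the procedure: when many tests are run at once, some small p-values are expected by coincidence alone.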

(By the way, John Sall has a nice blog post on this topic, “Not just filtering coincidences: False discovery rate,” and a short video (see the 2:30–8:50 segment). John used the term “ghost” data in his 2017 Discovery Summit talk and did a blog interview about four higher-level categories of data that aren’t there.)

David goes further, thoroughly categorizing 11 more granular types of data that aren’t there. It’s important to know which kind of dark data you might have, so we hope you get the Dark Data chapter excerpt and the book to help you address your dark data challenges. After all, as David says, “Dark data are everywhere.” For more examples, see David’s Dark Data blog post, “How Dark Data Lead Us Astray.”