Oct 9, 2017 1:00 AM
| Last Modified: Aug 1, 2018 6:51 AM
Jorge Camoes has a passion for data visualisation. Following a successful career in business analytics at Merck, Camoes founded WiseVis, a data visualisation consulting business based in Oeiras, Portugal. Camoes is widely recognised as the author of the popular blog Excel Charts, in which he shares strategies for getting the most out of data with common tools like Excel.
He just published his first book, Data at Work, where he explores these themes further, presenting readers with a thoughtful discussion of the potential of data visualisation and strategies for implementing executive dashboards. He recently presented at the JMP Explorers seminar in Marlow, UK. You can read a summary of his presentation on his blog.
I asked him a few questions on exploring and visualising data.
Step 1: Data preparation. How good or complete must my data be in order to be usable?
If you can easily connect to the organisation’s information infrastructure and get exactly the data you want and the way you want it, to answer a question truthfully, thoroughly and in a timely manner, then congratulations, you probably have good data to work with. And you’re a beautiful unicorn.
The real world is uglier: In each organisation, there is a secret dark basement (unknown to top management), home to a few creatures who lurk there and whose job description includes “data preparation,” and other tasks of similarly unsavoury nature. OK, I can’t prove these basements are real, but data preparation is indeed a resource-intensive, time-consuming, unsexy, underappreciated but fundamental task in any data analysis project.
While we are getting more and better ETL tools, we often have to deal with problems that we shouldn’t have in the first place, like converting PDF files. Here is an example of a good practice: The UN Population Division publishes population projections and estimates in regular Excel spreadsheets that take some work to convert. It also makes the data available in nicely formatted CSV files. This dual format addresses the needs of most users and should be a standard practice. If you need to know why well-structured data is a cornerstone of data analysis, Hadley Wickham's Tidy Data (PDF) is essential reading.
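To make “well-structured” concrete, here is a minimal sketch of the kind of reshaping that a spreadsheet-style table often needs: one column per year is turned into one row per country and year. The table, the column names and the `to_long` helper are hypothetical and exist only for illustration; the figures are placeholders, not real population data.

```python
# Hypothetical "wide" table: one row per country, one column per year.
# The values are placeholders for illustration, not real population figures.
wide_rows = [
    {"country": "Country A", "2015": 100, "2020": 110},
    {"country": "Country B", "2015": 50, "2020": 55},
]


def to_long(rows, id_column, value_name):
    """Turn every (row, year column) cell into its own record."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the identifier is repeated on each record instead
            long_rows.append(
                {id_column: row[id_column], "year": int(column), value_name: value}
            )
    return long_rows


tidy = to_long(wide_rows, "country", "population")
for record in tidy:
    print(record)
```

Each record now holds one observation (a country in a given year), which is the shape most charting and analysis tools expect.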
I’m assuming that there will be no problems with concepts when using the data, but that’s not always the case. For example, the USDA publishes data on food availability. According to the USDA, this is a good proxy for actual food consumption, making it a great source for analysing changing patterns in the American diet. Your project should explain explicitly that availability is not the same as consumption and make sure the audience knows the difference. Often you'll have to manage less benign assumptions.
Final tip: I wouldn’t see data preparation and data usage as two distinct steps. You can learn a lot about/from the data during the preparation stage.
Step 2: Which graph should I choose for my data? Which graphs should I not choose for my data?
There are two things that you should consider: first, the graph itself, and second, the graph in a specific situation. Think of a line chart: Spotting a trend is very easy. A table with the same data will tell you a few interesting things, but becoming aware of that trend will be much harder. Data visualisation is, to some degree, like a cooking robot: It will not make you a better cook, but it’ll take the table and optimize some of the more time-consuming steps, allowing you to focus on the result.
In other words, data visualisation preprocesses the data, offloading some cognitive tasks and allowing the brain to focus on higher-level tasks. If you want to design an effective graph, you should select the graph type and the formatting options you believe will maximise this. Never forget that this is situation-specific. People tend to think of “effectiveness” in purely rational terms, but you may want to factor in emotions or a more attention-grabbing design. You should expect some trade-offs.
No chart should be removed from your data visualisation toolkit. A bar chart is often seen as a safe bet, but it can be a terrible choice. Pie charts are maligned, but they are great for displaying aggregate proportions.
A graph shouldn’t be used if its ability to preprocess the data is low, but there is more to consider. When using data visualisation to communicate, a graph is part of that conversation, so it shouldn’t feel like it doesn’t belong there. If you think this graph is the right one to communicate this message, just use it. Explain it, if your audience is unfamiliar with it, just like you would do with a less familiar word. The opposite (a silly graph and a complex subject) can undermine your message: I will not take you seriously if you use a 3D pie graph when discussing big data. There is a cognitive dissonance that I can’t untangle.
“Pie graphs” are the default and often expected answer to this question. Here is what I believe is a balanced approach: Less-effective graphs like pies and gauges can be used as an entry point, and always followed by a more complex view of the same data. However, if pies or gauges are used extensively as standalone graphs, I will suspect that the organizational numeric and visual literacy is low (and something should be done about it).
Step 3: What common pitfalls should I look out for during data exploration?
Let’s assume you have the right data and you turned your data exploration mode on. First, don’t expect to follow a linear path whereby you locate the answers to predefined questions, one after the other. That’s a simple table update, not exploration. And the more you know about a subject, the more inclined you’ll be to entertain the idea of a single path. Resist that: Leave room for some random exploration, even if it eventually leads nowhere.
This helps fight the confirmation bias (we tend to select data that confirm our preexisting beliefs), one of the many personal biases that plague our exploration. It’s impossible to be aware of all of them, so the sooner our results are exposed to a wide audience, the better. Often some things become obvious only when you use a different technique or change your point of view. Anscombe’s Quartet is a brilliant example of how different exploratory techniques can complement each other: Judged by their summary statistics, the four data sets are apparently identical, but only the graphs reveal their very different profiles. But, now that you know those profiles, you can choose better statistical indicators (the median instead of the mean, for example).
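The Anscombe example is easy to check numerically. The sketch below uses the published values of the four data sets and only Python's standard library; the `pearson` helper and the variable names are my own, chosen for illustration. The means of y (about 7.5) and the correlations (about 0.82) agree across all four sets, while the medians do not.

```python
from statistics import mean, median

# Anscombe's quartet: sets I-III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient, written out by hand."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


summary = {
    name: {"mean_y": mean(ys), "median_y": median(ys), "r": pearson(xs, ys)}
    for name, (xs, ys) in QUARTET.items()
}

for name, stats in summary.items():
    print(f"{name:>3}: mean={stats['mean_y']:.2f} "
          f"median={stats['median_y']:.2f} r={stats['r']:.3f}")
```

The numbers alone would not tell you that one set is curved and another is driven by a single outlier; a scatter plot of each set does that immediately.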
Exploring a distribution is a necessary first step because its center provides a mental anchor point, and the outliers are often meaningful. But when you start exploring the relationships between variables, things get much more interesting. I love the linked data functionality in JMP, for example. But there is a reason why a site like Spurious Correlations is so funny: We are wired to find patterns and relationships. Truth is, we often don’t understand the nature of those relationships and are willing to go from an apparent association to a strong causality.
Finally, and this should go without saying, don’t force the data to say what is not there: Don’t bend the truth, don’t cherry-pick, don’t use formatting options to send a wrong message.
Step 4: Do you see any underlying or unifying principles in data visualisation to guide best practice?
Think of a poem and an essay: They can share the same language, but they don’t have much more in common, and even that can be broken in the name of “artistic licence.” The same happens with data visualisation, which, to some extent, is also a language. The way I see it (and following Jacques Bertin), “data visualisation” means simply the visual transcription of an underlying table. In that sense, there is no unifying purpose or shared principles. You’ll have to segment it and try to identify some clusters, like data art, visual statistics, media infographics or business visualisation.
Trying to find underlying principles valid across all those clusters is a wild goose chase. Even if you focus on the common features of the eye-brain system (eye physiology, Gestalt laws, color processing), each group will use them differently. Some of these clusters will prioritise communication effectiveness, others will prefer the aesthetic experience, and some will be found somewhere in between. JMP users will probably value effectiveness, but note that this is not a licence to make ugly graphs.
If you’re like me and don’t have a single graphic designer bone in your body, do your best to understand the impact and justify each design option. Take a complex issue like color: You can approach it from a functional perspective (managing stimuli intensity or symbolic meanings) and then apply a professionally designed palette (from Colorbrewer, for example).
Step 5: Do you have tips on presenting data to those who are not data-minded (e.g., management)?
It all comes down to minimising the cost of good decision making. Data visualisation shouldn’t be sold as a panacea. Many managers don’t take data visualisation seriously because all they see is saturated colors and shiny 3D effects, empty calories that can be very addictive but have zero positive effects on decision making. To make things worse, good data visualisation, like good design, is invisible and obvious. To recognise its merits, the manager must be aware of them. Call it A/B tests, before/after, guerrilla (make a point of designing a better graph for each ineffective one you find), but showing concrete improvement is key to getting people to buy into your ideas, get rid of application bells and whistles and graph defaults, and fight inertia.
Some managers tend to prefer hard data in tables. Make sure they can easily get access to well-formatted tables, but at the same time, design graphs that address pain points and insights that are nearly impossible to get from tables (useful relationships, groupings, quick outlier identification). Don't assume top managers are interested in the data visualisation tools or that they are eager to spend one hour exploring the data using your carefully crafted dashboard. Again, the key idea is that your visualisation should minimise the cost of acquiring insights.