Every once in a while, I run across a bar chart on a log scale, and it always feels wrong. At first glance, I start comparing the bar lengths. But eventually, I notice the log scale on the axis and try to convince my brain to forget everything it just saw and compare only the tops of the bars against the axis scale. In that sense, bars on a log scale are a special case of bars without a meaningful baseline.
Here’s a recent example I saw, comparing speeds for reading CSV files (comma-separated value text files).
The source of the comparison is a white paper from the vendor for the coral-colored tool, ParaText from wise.io, showing how fast it is. The company can hardly be accused of deception in the visualization since using a log scale only makes the competitor speeds look closer to its own speeds. ParaText is about 10x faster than R readr, but its bar looks only about 2x as long. The only advantage ParaText gets from the log scale is that its speed looks very close to the black I/O bandwidth bar (the upper limit) when, in fact, its speeds are about half the I/O bandwidth.
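To see how a 10x gap can shrink to 2x on the page, it helps to work out bar lengths on a log axis. The sketch below uses made-up speeds and a hypothetical axis baseline (the white paper's actual values are not reproduced here), and `bar_length` and `apparent_ratio` are illustrative helper names.

```python
import math

def bar_length(value, baseline):
    """Drawn length of a bar on a log axis, in decades above the axis baseline."""
    return math.log10(value / baseline)

def apparent_ratio(fast, slow, baseline):
    """Ratio of drawn bar lengths, which is what the eye compares."""
    return bar_length(fast, baseline) / bar_length(slow, baseline)

# Hypothetical numbers: axis starts at 1 unit, slow tool at 10, fast tool at 100.
baseline, slow, fast = 1.0, 10.0, 100.0
true_ratio = fast / slow                             # ratio in the data
drawn_ratio = apparent_ratio(fast, slow, baseline)   # ratio of bar lengths

print(f"true ratio: {true_ratio:.0f}x, drawn ratio: {drawn_ratio:.1f}x")

# Moving the baseline changes the drawn ratio even though the data is unchanged.
print(f"drawn ratio with baseline 0.1: {apparent_ratio(fast, slow, 0.1):.2f}x")
```

The drawn ratio depends entirely on where the axis happens to start, which is another way of saying the bars have no meaningful baseline.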
Like any other non-trivial endeavor, data visualization often involves conflicting constraints that must be balanced. Yes, using bars on a log scale certainly interferes with gaining insight from the graph, but it’s possible that all the alternatives are worse. That’s why I always look at alternatives when making assessments of data visualizations.
Log scales are most useful when the underlying data is very skewed or varies by many orders of magnitude. This speed data is both skewed and varied, but not terribly so. The maximum variation is about 200:1, which is only a little over two orders of magnitude. Immediately, we can try two variations on this chart:
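As a quick check on the orders-of-magnitude arithmetic, a range's span in decades is the base-10 log of its ratio. This is a minimal sketch; `orders_of_magnitude` is an illustrative helper name.

```python
import math

def orders_of_magnitude(largest, smallest):
    """Number of decades (powers of ten) spanned by a range of positive values."""
    return math.log10(largest / smallest)

# The roughly 200:1 spread between the fastest and slowest results.
span = orders_of_magnitude(200, 1)
print(f"{span:.1f} decades")
```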
Keep the bars and change the scale to linear.
Keep the log scale and change the bars.
Here’s a straightforward conversion to a linear scale. Using JMP, I’ve scaled all the values to be relative to the I/O bandwidth, so the black bars are not shown since they would all be at 100%.
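The rescaling itself is simple. I did it in JMP, but here is a rough Python equivalent, assuming one bandwidth figure per test. The speeds are hypothetical placeholders chosen only to match the rough ratios mentioned earlier, and `relative_to_bandwidth` is an illustrative name.

```python
def relative_to_bandwidth(speeds, bandwidth):
    """Express each speed as a percentage of the I/O bandwidth for the same test."""
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    return {tool: 100.0 * speed / bandwidth for tool, speed in speeds.items()}

# Hypothetical throughput for one test, in MB/s; not the white paper's numbers.
bandwidth_mb_s = 2000.0
speeds_mb_s = {"ParaText": 1000.0, "R readr": 100.0}

percent = relative_to_bandwidth(speeds_mb_s, bandwidth_mb_s)
for tool, pct in percent.items():
    print(f"{tool}: {pct:.0f}% of I/O bandwidth")
```

Because every test is divided by its own bandwidth, the bandwidth bar would always sit at 100% and can be dropped from the chart.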
I haven’t labeled the bars with values. I don’t think all the bars need labeling with exact values (I'd rather have a supporting table for that). But if I were sharing this in a report, I would try labeling the highest bar or two in each category for some grounding. I find that all the rotated labels in the original detract from the visual representation of the data and take too much effort to read.
The linear scale is not bad, and I already like it better than the original in that it portrays the speed differences among the products directly. One weakness of both charts is that the product labels are separated from their bars. Rotating the bars at least puts the bars and the legend labels into the same arrangement.
A different grouping hierarchy lets us label the product bars directly.
Comparing across tests is now less direct, but I’m thinking that’s a less important comparison.
Now let’s go back to the beginning and try keeping the log scale and changing the data elements. Here’s a view using points and lines instead of bars.
The points themselves are enough to carry the position information, but the lines add connection information, which helps simplify the labeling. In general, line segments carry three connotations:
Connection
Interpolation
Pattern recognition (continuous or categorical)
Interpolation doesn’t make much sense here since our x-axis is categorical, so that connotation works against us. But connection is very valuable, and pattern recognition is informative, too. For instance, we notice that a couple of products have the same up-down-up pattern.
With the lines labeled in place, the color is not as necessary. While the color does help distinguish intersecting lines and help the data lines stand out from the grid lines, there is enough separation that we can try using color for technology group (R, Python or specialty) rather than individual labels.
That makes the chart less busy but keeps the advantages of color.
The chart looks nice, but does it work? We still have a log scale, which still requires more thinking. But at least now the data elements are not in such conflict with the scale, and we have more room to show grid lines that reinforce the non-linearity. The log scale makes it easier to understand the differences between values across the entire range. In particular, we can see how the low values differ from each other better than we can on a linear scale.
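To put a rough number on how much better the log scale separates low values, we can compare where two slow results land on each kind of axis. This is a sketch with hypothetical speeds spanning about 200:1, and the helper names are illustrative.

```python
import math

def linear_position(value, lo, hi):
    """Fraction of the axis length at which value is drawn on a linear axis."""
    return (value - lo) / (hi - lo)

def log_position(value, lo, hi):
    """Fraction of the axis length at which value is drawn on a log axis."""
    return math.log10(value / lo) / math.log10(hi / lo)

# Hypothetical axis range: log axis from 1 to 200, linear axis from 0 to 200.
lo, hi = 1.0, 200.0
a, b = 2.0, 4.0   # two slow results that differ by 2x

linear_gap = linear_position(b, 0.0, hi) - linear_position(a, 0.0, hi)
log_gap = log_position(b, lo, hi) - log_position(a, lo, hi)

print(f"linear axis: {linear_gap:.1%} of axis length apart")
print(f"log axis:    {log_gap:.1%} of axis length apart")
```

On the linear axis the two slow results are nearly on top of each other, while on the log axis any 2x difference takes up the same share of the axis no matter where it falls in the range.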
It’s interesting to me that the data itself makes such a big difference in the usefulness of each chart option. The linear scale is at its limit of usefulness with differences around 10x. If the differences were more like 1000x, a linear scale would be useless. And if the values were too similar across products, the points would be obscuring each other and less useful.
Having seen a few possibilities, which is most effective for understanding the performance? Or would something else entirely be better?