Our World Statistics Day conversations have been a great reminder of how much statistics can inform our lives. Do you have an example of how statistics has made a difference in your life? Share your story with the Community!
May 24, 2016 2:59 PM
| Last Modified: Jan 24, 2017 6:16 AM
Recreating graphs is a hobby of mine. It both helps me test the limits of JMP and sharpens my own data handling and visualization skills. This time, there was a third benefit: finding a significant data error in the published chart.
I recently saw this interesting mosquito trends chart as part of an article, “When the mosquitoes will be biting in your state,” on the Washington Post’s Wonk Blog. Its shows Google search trends for the word “mosquito” by state, with each state on a different scale.
It’s not a typical analytical graph, but I thought the layout would be a good test of Graph Builder’s small multiples grouping, and I was intrigued by the overall lack of geographic pattern. For instance, the article mentioned that Tennessee is so unlike its neighbors, and the same can be said for other states. The main point of using a map is show geographic patterns, but the connections are pretty weak here.
Getting the Data
The article nicely includes a link to the Google Trends page for mosquito trends for the whole United States, and that page nicely has a Download as CSV menu item.
However, that’s where the niceness ends for getting the data. Each state trend is on a separate page, there’s not a separate URL for the download, and the CSV file is not really a pure CSV file. At least the state URLs followed a pattern, so I wrote a script to open each state in a separate tab in a web browser. Then I had to manually click the Download as CSV menu item for each tab. Each “CSV” followed a regular pattern which included some descriptive text and multiple embedded tables. An example snippet:
Web Search interest: Mosquito
Wisconsin (United States) 2014-2015
Interest over time
2014-01-05 - 2014-01-11,4
2014-01-12 - 2014-01-18,3
2014-01-19 - 2014-01-25,4
2015-12-27 - 2016-01-02,4
Top metros for Mosquito
Duluth MN-Superior WI,100
Fortunately, JMP’s text import wizard let me tell it how many lines to skip and how many to read. After doing it once in the wizard, I looked at the generated script and was able to put it in a loop to read the other files the same way:
After doing all that and splitting the first field into separate start and end dates, I could concatenate all the table into one big table containing all the state data, which is two years of weekly data.
First Look at the Data
We can see all the data fairly well in this small multiples line chart overlaid by year.
We don’t have the geographic arrangement, but we can already see enough to compare against the original, and two features stood out to me:
Sometimes the years were quite different within a state (especially Arizona and Hawaii).
Tennessee does not have the spikiness of the original.
The first observation suggests we probably need more years of data to establish what the article calls a “typical” year, and I haven’t pursued that, yet. The second item was more of a mystery. I initially thought it was an artifact of the binning since the original chart is showing data by month instead of by week, but that didn’t hold up as I looked at the data closer. I double-checked my graph against the Google Trends page for Tennessee, and they agreed. I continued on, hoping to discover the source of the discrepancy.
Arranging the States
The above chart uses the Group Wrap role in Graph Builder to lay out the states in a grid alphabetically. For complete control, I assigned each state a row and column value and used those values in the Group Y and Group X roles .
In an effort to approximate the original chart, I switched from the overlaid lines to smoothed area charts. (I didn’t do bars because I still haven’t tried converting weeks to months – or maybe there's a way to get monthly data from Google Trends.) It was enough to notice some of my states looked like other states in the original. The original Tennessee and Ohio look a lot like my South Dakota and North Dakota; the original Pennsylvania looks a lot like my Oregon. Luckily, a pattern occurred to me. The original chart’s states are off by one alphabetically!
Almost, anyway. After further study, I realized only the states after District of Columbia were off by one. Our charts agreed on the other states. Coincidentally, I also had a similar error in my data where I had downloaded the DC data twice, and my initial charts were off by one before DC. Weird. After some Twitter messages, the author confirmed my findings and quickly updated the Wonk Blog post graph and commentary.
Scaling the Data
In the data from Google Trends, each state’s interest levels are scaled so the maximum value is 100. That means the magnitudes are not comparable from state to state as you would expect with a small multiples chart. Just the patterns (e.g., spikiness) can be compared in this case.
How could I convert the state data to be on the same scale? The Google Trends page for the whole US includes a list of summary levels for each state. If those summaries represent each state’s average value, we can make the adjustment with a scale factor. Here’s the result with all the states on a common scale.
Notice I also used a different state layout of my own design, trying to give my home state of North Carolina a truer positioning. Within the constraints of equal-sized rectangles, it's impossible to preserve all the geographic properties. The layout in the original chart is from Jon Schwabish at PolicyViz and has the nice property that the overall shape resembles the overall shape of the country.
Given the thinness of many of the curves, I added a background color based on the state's average value, which besides allowing a big picture view of the yearly pattern, helps anchor the areas in their cells and make the labeling clear.
Though the Wonk Blog article makes it clear that the Google Trends data may not be representative of any real-world trends, by redoing the analysis I did gain a few insights I couldn't get from the original article and graph:
Some states have significant year-to-year variation.
Monthly versus weekly aggregation may be an issue.
The state data was normalized by the maximum value.
The Dakota spikes are really high (if my re-scaling is right)
I realize it’s not always practical to do our own remakes of graphs and analyses, but it’s a great way to really understand the nuances of the data.