Visualizing COVID-19 Global Cases: Getting the time series data and building charts
Feb 12, 2020 7:38 AM
| Last Modified: May 19, 2020 2:30 PM
Visualizing Coronavirus COVID-19 Global Cases
Using a Time Series Animated Bubble Plot
I've gathered resources here to help you with the visualizations. View the video from the March 26, 2020 live webinar, read why I chose these analyses, access the data or the JMP Live reports, try my attached scripts, and/or use the data to do your own analyses.
There are a lot of updates in this continuing effort to communicate data about the spread of coronavirus COVID-19. In the latest version I have included some Graph Builder plots looking at the global summary as well as the local summaries. The bubble plots are also augmented with color to represent day-to-day changes in counts. There is a tab box to toggle between log and linear scales in the graphs, and in the local counts figure, the linear scale plot features a split axis graph to show both the lower and higher count regions together.
The coronavirus and all the issues caused by its spread are serious and affecting a lot of people, some in horrible ways. It is not my intent to trivialize the seriousness of this by creating this figure. Neither is it my goal to promote hysteria or xenophobia.
In my script I recode location names that the original authors use. I don't have a political agenda. When I use the Google Maps API to geocode, I need it to give me country locations. "Taiwan, Taiwan" results in a restaurant location, while "Taiwan, China" results in a geocode near the center of Taiwan. Again, this isn't political; it's functional.
All the data surrounding the reporting and detection of coronavirus is massively contentious. I'm not vouching for the accuracy of this data; it is, however, the best aggregation of publicly available data (in English) that I know of. If you have something better, please let me know.
It's not my data. The folks at Johns Hopkins University, Whiting School of Engineering, in the Center for Systems Science and Engineering published a sweet-looking dashboard along with the data. Huge thanks and kudos to this team for making their data available. Read their Blog and definitely check out their visualization.
Why Use a Bubble Plot with a Time Series Animation?
Bubble plots have a high degree of utility in that a large number of dimensions can be represented in a single plot. With the addition of a time series animation, the changes in relationships between these dimensions can be visualized and communicated without a whole lot of complex pre-attentive processing by the viewer. (aka: They look cool and are easy to understand)
Note that I said a bubble plot is useful for looking at the relationships of multiple dimensions animated over time, not just one dimension (variable) animated over time. I've seen people make a run/control chart-like plot with a single dimension on the X-axis and time on the Y-axis, and then animate by time. This is called a tedious run/control chart. Please don't do this.
For the figure I'm going to construct here, the X and Y axes are on a geodesic scale, so that the points will line up with a map. We can see the spatial relationships of points on the graph, which are given context by the background map shape. Next I add a size dimension for the markers, which is the cumulative count of cases by time period for each location. Last, I use the time period sequence to animate the plot, which results in bubbles that grow or appear over time. Each of the points is labeled by Country and Region. In this figure, it is possible to collapse all the individual points into their Country ID or split them out to their region ID within the Country ID.
The script to generate the graph is chasing a moving target. The authors of the data keep making changes, so this might only work for a couple of days if there are changes to the source data. As of the morning of 2/12/2020, it works. At the end of the script is an option to publish or refresh the figure in JMP Live. This won't work for you because you don't have my top secret API Key for accessing the package in JMP Live. It might be useful as an example of how to make publish and refresh work if you have JMP Live.
How the script works.
Sets up all the variables I need later in the script
Downloads the data and formats it
There are some weird things in this part. At several points I save the data locally, close the table, and then re-open it. I was having trouble forcing the script to run sequentially, and this is a way to make it happen. Sometimes a table wasn’t finished updating before the next step started and I got wonky tables. This brute force method matches my character and style.
Generates the bubble plot
Figures out whether it needs to publish or refresh, and then publishes or refreshes the figure
In looking at the script, you may note that I wrapped sections in expressions. This is a handy way of de-cluttering a long script when it's combined with the code folding option (check preferences for the script editor).
Earlier this week the data set I was using was deprecated, and this morning the deprecated data were moved to another location. Needless to say, this project is truly a moving target! Johns Hopkins CSSE has another GitHub data repository that is updated with data daily. The new script below, "Compile COVID-19 Time Series Data", will check to see which files are in the repository and then download and assemble them. This data set has some interesting "features". I used recode to resolve many of them, and some creative formulas to resolve the rest.
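For readers who don't use JMP, the check-then-assemble idea can be sketched in a few lines of Python. This is not the attached JSL script, just a rough illustration. It assumes the daily report files are named by date in MM-DD-YYYY.csv form, and the function names and the Source_File column are made up for the example. The download step is left out; the sketch starts from file contents that have already been fetched.

```python
import csv
import io
from datetime import datetime


def daily_files(listing):
    """Keep names that look like MM-DD-YYYY.csv and sort them by date.

    The naming pattern is an assumption about the repository layout.
    """
    dated = []
    for name in listing:
        stem, _, ext = name.rpartition(".")
        if ext != "csv":
            continue
        try:
            dated.append((datetime.strptime(stem, "%m-%d-%Y"), name))
        except ValueError:
            continue  # skip files that are not daily reports, e.g. README.md
    return [name for _, name in sorted(dated)]


def assemble(files):
    """Stack the rows of every daily report into one list of dicts.

    files: mapping of file name -> CSV text (already downloaded).
    Each row is tagged with a hypothetical Source_File column so the
    report date can be recovered later.
    """
    rows = []
    for name in daily_files(files):
        for row in csv.DictReader(io.StringIO(files[name])):
            row["Source_File"] = name
            rows.append(row)
    return rows
```

Sorting by parsed date rather than by file name matters here, because names in month-day-year order do not sort chronologically across a year boundary.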
Some of the issues and the resolution
-Locations for each site are not consistent, so I used the most current location and replaced that everywhere else.
-Cumulative counts should either increase or stay the same. When the cumulative count decreases, the previous day's value is used. This prevents negative numbers in the daily counts that are derived from the cumulative counts.
-The column names in the tables aren't consistent over time. The labels are unified to the 3/24/20 table column names.
-There are some typos, inconsistencies, and trailing/leading spaces in some location names. Recode resolves these.
-The locations are Value Ordered and Color Ordered by the Sum of Confirmed Counts at each location.
-The final table contains no formulas. The table can be sorted without affecting sort dependent calculations.
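A few of the fixes in the list above are easy to show outside of JMP. The Python sketch below is only an illustration of the logic, not the JSL that the attached script uses. The entries in COLUMN_ALIASES and RECODE are example values I am assuming for the sketch, and the function names are hypothetical.

```python
# Illustrative lookup tables; the real script's recode lists are longer
# and these specific entries are assumptions.
COLUMN_ALIASES = {
    "Province/State": "Province_State",
    "Country/Region": "Country_Region",
    "Last Update": "Last_Update",
}
RECODE = {
    "Mainland China": "China",  # hypothetical recode entry
}


def unify_columns(row):
    """Rename older column labels to one consistent set of names."""
    return {COLUMN_ALIASES.get(key, key): value for key, value in row.items()}


def clean_location(name):
    """Trim leading/trailing spaces, then apply the recode table."""
    trimmed = name.strip()
    return RECODE.get(trimmed, trimmed)


def enforce_non_decreasing(cumulative):
    """Replace any drop in a cumulative series with the previous day's value."""
    fixed = []
    for value in cumulative:
        if fixed and value < fixed[-1]:
            value = fixed[-1]  # carry the previous (already corrected) value forward
        fixed.append(value)
    return fixed


def daily_counts(cumulative):
    """Day-over-day differences; the first day keeps its cumulative value."""
    previous = [0] + list(cumulative[:-1])
    return [cur - prev for prev, cur in zip(previous, cumulative)]
```

For example, a cumulative series of 1, 5, 3, 4, 6 becomes 1, 5, 5, 5, 6 after the correction, and the derived daily counts are 1, 4, 0, 0, 1 with no negative values.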
I updated the script today. Please note that it's going to take about a minute or a little more to run. A lot of stuff is happening in the cleanup script and some of the formulas take a little bit of time to finish.
This version includes a section at the end that removes duplicate rows, along with a couple of other tweaks.
New column formulas for making the Cum Counts vs. Counts by day bubble plots. These are moving average by location columns, rather than taking the mean of 5 or 7 days. The moving average gives me a few more points to work with.
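As a rough picture of what a moving average by location does, here is a small Python sketch. It is not the column formula from the script. The 7-day default window, the function name, and the choice to average over a shorter window for the first few days of each location are all assumptions made for the illustration.

```python
from collections import defaultdict, deque


def moving_average_by_location(rows, window=7):
    """Trailing moving average of daily counts, computed per location.

    rows: iterable of (location, day, count), in day order within each
    location. Early days average over however many points exist so far,
    which is an assumption of this sketch.
    """
    recent = defaultdict(lambda: deque(maxlen=window))
    averaged = []
    for location, day, count in rows:
        buffer = recent[location]
        buffer.append(count)  # deque drops the oldest value once full
        averaged.append((location, day, sum(buffer) / len(buffer)))
    return averaged
```

Because every day gets a value, a trailing average like this keeps one point per day, whereas averaging fixed 5- or 7-day blocks would leave only one point per block.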
Useful articles that have informed the graphs:
•Notes on what to plot in a time series, especially a bubble plot. Note that cumulative counts make ever-increasing bubbles, but counts by day show hot spots. (Joel Selanikio)