Our World Statistics Day conversations have been a great reminder of how much statistics can inform our lives. Do you have an example of how statistics has made a difference in your life? Share your story with the Community!
The opening ceremony of the London 2012 Summer Games begins today and while looking for news about the games, events and the schedule, I found an interesting set of blog posts on The Guardian website, which made available several data sets relating to the games.
The author of this article, who provided the Name, Sex, Sport and Country for every athlete participating in the Summer Games also ended the post with a challenge: “We’ve just scratched the surface of this dataset – you can download it below. What can you do with it?”
So I decided to bring these data into JMP and see what I could find. Getting the data table into JMP was easy. The table was hosted on Google Docs, so I exported the table as Comma Separated Values (.csv) by clicking File -> Download as -> Comma Seperated Values (.csv, current sheet).
After opening the data set in JMP, I can see that there are 10,960 records of athletes in the games. The first thing I like to do when starting my exploration is to bring all my columns of interest into the Distribution platform.
In this set of distributions, we can see that 44.4% of the participants are female and 55.6% are male. Most of the participants fall in the Athletics (Track and Field) sport name. When dealing with categorical variables as in this data set, one way to help us understand the counts within this distribution is to set the order to sort ascending and to display the percentage of values that fall within a given category.
By clicking on the hotspot (red triangle) in the Sport name distribution, I can change the graph to Order By -> Count Ascending and also Show Percents in the Histogram Options.
By doing this, I can see that the top five sport names for number of athletes are Athletics, Swimming, Football, Rowing and Hockey, with Athletics accounting for 11% more athletes than the next Sport name, Swimming.
Now, let's have some fun. I am interested in focusing on the athletes that are swimmers. A great way to easily subset my data is to double-click a histogram bar. In my case, I’ll double-click the Swimming bar, and JMP will automatically create a new data table with just the 948 swimmers in the data set.
To explore these data, I’ll use another one of my favorite tools in JMP: Graph Builder. One of the things that Graph Builder lets you do is visualize data using built-in map regions, so I think that would be a great way to look at the numbers of swimmers coming from different countries around the world. To summarize these data into counts, I am going to use Tabulate, which lets you drag and drop columns to create summary tables. In this case, I drag Country Name into the Drop zone for rows, and the number of athletes for each country will be counted.
I can then make this into a data table, which I will use to build my map in Graph Builder. All I have to do to get the map shapes to show up is drag the country name column into the drop zone for Map Shape. And then to color by number of athletes, I can drag the N column into the Color drop zone. Now I can visually see where in the world all of the swimmers are coming from.
But let us take this one step further. I might want to explore the the number of swimmers from each country related to the population of that country. Unfortunately, the data set that we have here only has the country name. A quick Google search brought up this website, which has world statistics on populations. I used the Internet Open feature in JMP to bring these data into a new data table. And then I used the Tables -> Join feature, which joins rows from several sources by matching values.
Now I have a new table, which has the Country Name, number of swimmers and population of that country. To determine the number of swimmers as a function of every 10,000 people in the population, I can create a new column and by using the Formula in the Column Properties, I can calculate this column by using N/(Population/10000). To visualize if there are any countries that have a high number of swimmers based on their population, I can again go to the Distribution platform. Looking at the graph, I can see that there are several countries that have a high proportion of swimmers in the Games based on their populations: The Cook Islands, Palau and the Cayman Islands.
I encourage you to download this data set, bring it into JMP and see what else you can find. If you are not yet a JMP user, you can download a fully featured 30-day trial for free and experiment with several of the tools I have outlined in this post.