When we last left off, I had (hopefully) enticed you with an exploration of “golden cities” and how they told the story of Olympic history. I had laid out a goal of searching for geographic correlations in athletic performance, with one last hurdle of geocoding left to surmount. Well, I’m happy to report that the geocoding has been completed, and all Olympic athletes (or at least those with birthplace information) can now have their time in the sun.
A World Full of Stars
First, let’s start with a map of all the athletes in the data set. After all, best practice is to visualize the data first.
A few things jumped out at me when looking at this map.
While the Olympics prides itself on being global, the density of athlete birthplaces is rather Eurocentric. One good way to explain this is by coloring birthplaces by year of participation. Note that there is more global participation in later years, which accurately reflects the fact that the Olympics were predominantly a Euro-American exercise for the early part of the 20th century.
The distribution of athletes tends to follow pretty close to the distribution of the world population, as one might expect.
If the above bullet point holds true, for much of the Southern hemisphere, most of the population is on the fringes of the continents, as opposed to the Northern hemisphere, which tends to be more “filled in.” I’m guessing this has some relationship to the history of colonization, but I’ve already started to digress….
At this point, I found myself in a bit of a conundrum. How could I best explore my question of geographic correlation in athlete performance?
How Big of a Shovel Do You Need?
I first tried using predictive analytics on winning a medal or gold medal: basic regression trees, boosted trees, bagged trees, neural networks, support vector machines, naïve Bayes, k-nearest neighbors … basically all of the wonderful predictive analytics tools JMP had to offer. My goal, crazy as it may sound, was to use geographical coordinates and year of competition to predict recent performance, holding out the games after the year 2000 for validation purposes.
While it was good to see some consistent results among all of these methods, I noticed that they all tended to just repeat what I could see in the data itself. Plus, the resulting models wouldn’t necessarily be winning any data contests for best predictive ability (I was lucky to get an R^2 above 0.2). But I really, really, really wanted these to work, and so I spent several weeks trying and retrying each method to see if it yielded any new insights.
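For readers who want to try the same validation idea outside of JMP, here is a minimal sketch in Python. It is not the workflow I used; the rows, the field names, and the baseline "model" are all made up for illustration. It only shows the two ingredients described above: holding out the games after 2000, and scoring predictions with R^2 on that holdout.

```python
from statistics import mean

def split_by_year(rows, cutoff=2000):
    """Train on Games up to and including `cutoff`; validate on later Games."""
    train = [r for r in rows if r["year"] <= cutoff]
    valid = [r for r in rows if r["year"] > cutoff]
    return train, valid

def r_squared(actual, predicted):
    """R^2 = 1 - SS_residual / SS_total (undefined if all actual values are equal)."""
    avg = mean(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - avg) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Hypothetical rows: birthplace coordinates, year, and a 0/1 medal indicator.
rows = [
    {"lat": 52.5, "lon": 13.4, "year": 1996, "medal": 1},
    {"lat": 35.7, "lon": 139.7, "year": 2000, "medal": 0},
    {"lat": -33.9, "lon": 151.2, "year": 2004, "medal": 1},
    {"lat": 40.7, "lon": -74.0, "year": 2008, "medal": 0},
]
train, valid = split_by_year(rows)

# Placeholder model (an assumption, not a real fit): predict the
# training-set medal rate for every validation row.
baseline = mean(r["medal"] for r in train)
actual = [r["medal"] for r in valid]
predicted = [baseline] * len(valid)
score = r_squared(actual, predicted)
```

Any real model would replace the baseline prediction. A score near zero on the holdout means the model does little better than guessing the average.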
In the end, I finally came to a stunning realization: Sometimes simple is better!! Don’t get me wrong; predictive analytics is a powerful tool set, and the fact that each method I tried gave nearly the same results, none of which contradicted what I saw in the raw data, is a reassuring sign that they are valid. It just happened that, for my data, I was trying to use a backhoe to dig for insight when all I needed was a simple trowel.
The Thrill of Exploration!
So at this point, I could spend the next several paragraphs explaining in riveting detail the insights I found about which athletes from which places did well in which sports. I could … but where would be the fun in that? Instead, I thought I’d wield the power of JMP dashboards and let you, fearless reader, explore for yourself! You'll find it in the Medalists_Summary table in the attached zip file. Along with the dashboard, there is another interactive map that uses the new density feature in JMP 15 to explore the density of medalists by sport and event. Feel free to explore or even start your own investigations!
Now that’s not to say I won’t provide some summary information. There were some sports where the origin of top athletes was as expected (e.g., Ice Hockey, Baseball and Alpine Skiing), some where the origin was a bit of a surprise (e.g., Weightlifting and Water Polo), and a few more where there seemed to be no real definitive origin (e.g., Triathlon and Football). But don't take my word for it. Go explore for yourself!
High Scores, High Praise
In a previous blog post, I wrote about the dangers of using a raw rate statistic because it is very sensitive to the denominator (there represented by population). I derived a score metric that tried to buffer against unnecessarily inflated rate statistics by multiplying the raw rate by a probability score, essentially weighting the rate so that small populations don’t bias the results. I thought I would try something similar here, with the raw rate representing the proportion of either medals or gold medals. The score I computed here consisted of that raw rate multiplied by the probability score of the number of athletes per event per year.
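To make the idea of a weighted rate concrete, here is a small Python sketch. The exact probability score comes from the earlier post and is not reproduced here, so the version below assumes, purely for illustration, an empirical CDF: the share of groups whose athlete count is at or below the one in question. The place names and counts are invented.

```python
from bisect import bisect_right

def probability_score(count, all_counts):
    """ASSUMPTION: empirical CDF of athlete counts, used as a stand-in.
    The actual probability score may be defined differently."""
    ordered = sorted(all_counts)
    return bisect_right(ordered, count) / len(ordered)

def weighted_score(medals, athletes, all_counts):
    """Raw medal rate multiplied by the probability score of the denominator."""
    raw_rate = medals / athletes
    return raw_rate * probability_score(athletes, all_counts)

# Hypothetical birthplaces mapped to (medals, athletes).
places = {"Smalltown": (1, 1), "Bigcity": (40, 100), "Midville": (4, 20)}
counts = [athletes for _, athletes in places.values()]
scores = {
    name: weighted_score(medals, athletes, counts)
    for name, (medals, athletes) in places.items()
}
```

Smalltown has a perfect raw rate (one medal from one athlete), but its tiny denominator earns a low probability score, so after weighting it ranks below Bigcity. That is the buffering effect the score is meant to provide.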
In another table in the attached zip file, labeled Birthplace_Medalists_Summary, you'll see the medal counts and percentages summarized for each birthplace city in the Medalists_Summary table. With that table, there's yet another interactive plot that allows you to explore medal and gold medal scores over various sports and events. I encourage you to try it out and compare your findings with what you see in the Medalists Dashboard (hopefully they're not far off...).
As you can see, there's a big theme here of interactive plots and maps. That's the beauty of JMP's graphical structure, one which I've always enjoyed and am always learning more about. And now, with the instant graphical gratification of JMP 15, it just keeps getting better!
The Dream I Never Knew I Had
You can see from my discussion regarding predictive analytics above that there isn’t always a strong correlation between geographical location and athletic performance. Obviously, there’s a lot more that’s involved. One key aspect that I hope to explore in a follow-up post is the role of athletic clubs in athletic performance. There’s an amazing story behind that which I’d like to share with you now!
As you can probably tell, this topic excites me, and I was happy to share it with my manager, to whom I’m very grateful for listening and encouraging me on it. As it so happens, she was in a meeting with another manager who knew of a SAS employee by the name of Mirko Mueller-Goulsby. What’s so special about this guy, you might ask? Well, just the simple fact that he is a former Olympic athlete! That’s right, SAS has its very own Olympic athlete working within its walls. Mirko was a figure skater who competed for Germany in the 1998 Nagano Winter Olympic Games. And the best part of all this? Through my manager’s connection, I was able to sit down and have lunch with him! How awesome is that?! As I told my followers on Twitter (@ckingstats, pardon my shameless plug), I got to fulfill a dream I never knew I had. It was his suggestion to look into athletic clubs, so I look forward to sharing my results on that front with him and you in the near future!