In consulting with companies about building models with their data, I always talk to them about how their data may differentiate itself over time. For instance, are there seasons in which you might expect a rise in flu cases per day, or is there an economic environment in which you might expect more loan defaults than normal? These are examples of key pieces of information that come with a challenge: How do you identify these periods in your data where change occurs? And, can you explain the change?
This topic is always at the top of mind when I work with customers. This week, the game of baseball is also on my mind, as the season is now underway.
Recent discussions with some of my friends who are baseball fans have centered on the history of the game, and how various rules and events affected the game over time. Or had the game been affected? Some thought the rule changes and events could not have had a significant impact, while others were noncommittal.
As a statistician with data and tools to analyze it, I decided to do a bit of research. It occurred to me that this was a nice opportunity to illustrate how we might discern the periods of time that have been affected by events, policy or rules. We could have fun with baseball data while keeping in mind that the same approach could apply to businesses in other industries.
Major League Baseball (MLB) makes its data publicly available through Sean Lahman (SeanLahman.com). From his robust set of files, I built a comprehensive database of SAS data sets featuring baseball data.
The approach to discerning different periods of time (or in the case of baseball, eras) was twofold: First, I would rely on the expert opinion of … myself. And second, I would explore an analytical technique to see if the result would agree and support expert opinion – and also, would it surface more periods of interest? You'll see this second part in a follow-up post.
I like to develop my data using the SAS Data Step, and did so from within SAS Enterprise Guide. In doing so, I developed a simple metric representing the Runs Per Game (RPG), believing that would be the metric that I could use to represent rule changes over time. It’s been said that “runs are the currency of baseball,” and if a rule or event disrupted the normal production of runs over time, then we should discuss it! I built the data set and seamlessly sent it to JMP.
A graph spurs discussion
Using Graph Builder in JMP, I quickly created one of my favorite means of analytical communication: the scatterplot. This one featured the mean RPG versus Year. And, as soon as I built the graph (and shared it), the questions and observations from my friends started to flow:
The graph evolved a bit as we discussed these questions. Here’s the scatterplot of Mean Runs Per Game Through the History of Baseball that triggered these questions and many more.
I added the colors and references lines as the eras of the game were differentiated in our discussions. The majority of the questions directly related to eras as identified by baseball historians.
Some of the questions (and answers) were as follows:
Why were there so many runs scored before 1900?
Why were there so few runs scored from 1900 to 1920?
What happened to increase run production after 1920?
What happened in the early 1940s? It appears runs fell off again.
Questions continued to bubble up…
The discussion continued in many interesting directions, for example:
What’s cool about all this “discovery” is that it happened from the initial scatterplot, and the identification of what appears to be clusters of years with similar RPG. As we identified clusters of years with similar run production, we either explained the reason behind the cluster, or noted it as a period of time having a change due an unknown cause (and looked forward to researching it further!).
Next week, we'll use analytics to try to confirm these eras of the game and possibly uncover more periods worth investigating.
Interested in seeing more? This step-by-step video shows how I created the graph in Graph Builder:
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.