Using data visualization to explore the eras of baseball
Apr 21, 2016 9:55 AM
In consulting with companies about building models with their data, I always talk to them about how their data may behave differently over time. For instance, are there seasons in which you might expect a rise in flu cases per day, or is there an economic environment in which you might expect more loan defaults than normal? These are examples of key pieces of information that come with a challenge: How do you identify the periods in your data where change occurs? And can you explain the change?
This topic is always top of mind when I work with customers. This week, the game of baseball is also on my mind, as the season is now underway.
Recent discussions with some of my friends who are baseball fans have centered on the history of the game, and how various rules and events affected the game over time. Or had the game been affected? Some thought the rule changes and events could not have had a significant impact, while others were noncommittal.
As a statistician with data and tools to analyze it, I decided to do a bit of research. It occurred to me that this was a nice opportunity to illustrate how we might discern the periods of time that have been affected by events, policy or rules. We could have fun with baseball data while keeping in mind that the same approach could apply to businesses in other industries.
Sean Lahman makes a large archive of Major League Baseball (MLB) statistics publicly available at SeanLahman.com. From his robust set of files, I built a comprehensive database of SAS data sets featuring baseball data.
The approach to discerning different periods of time (or in the case of baseball, eras) was twofold: First, I would rely on the expert opinion of … myself. Second, I would explore an analytical technique to see whether its result would support expert opinion, and whether it would surface more periods of interest. You'll see this second part in a follow-up post.
I like to develop my data using the SAS Data Step, and did so from within SAS Enterprise Guide. In doing so, I developed a simple metric, Runs Per Game (RPG), believing it would reflect the effect of rule changes over time. It’s been said that “runs are the currency of baseball,” and if a rule or event disrupted the normal production of runs over time, then we should discuss it! I built the data set and seamlessly sent it to JMP.
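For readers who want to try something similar outside SAS, here is a minimal sketch in Python of one way to build a yearly RPG metric. It is not the Data Step used for this post. It assumes one row per team-season holding a year, total runs and games played, and it averages each team's runs-per-game within a year, which is only one of several reasonable definitions. The function name and the sample rows are hypothetical, and the numbers are made up for illustration.

```python
from collections import defaultdict


def mean_rpg_by_year(team_seasons):
    """Return {year: mean runs per game across teams} for that year.

    Each input row is a (year, runs, games) tuple for one team-season.
    """
    per_year = defaultdict(list)
    for year, runs, games in team_seasons:
        if games > 0:  # skip rows with no games played to avoid dividing by zero
            per_year[year].append(runs / games)
    # Average the per-team RPG values within each year, ordered by year
    return {year: sum(vals) / len(vals) for year, vals in sorted(per_year.items())}


# Made-up rows for illustration only (not real statistics)
toy_rows = [
    (1901, 600, 150),
    (1901, 750, 150),
    (1902, 450, 150),
]

rpg = mean_rpg_by_year(toy_rows)
print(rpg)  # {1901: 4.5, 1902: 3.0}
```

The resulting year-to-RPG mapping is the shape of data a scatterplot of mean RPG versus year needs: one point per season.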
A graph spurs discussion
Using Graph Builder in JMP, I quickly created one of my favorite means of analytical communication: the scatterplot. This one featured the mean RPG versus Year. And, as soon as I built the graph (and shared it), the questions and observations from my friends started to flow:
Why were there so many runs before 1900?
Why were there so few runs between 1900 and 1920?
Why did runs fall off in the early 1940s?
Runs didn’t rise as much as I had expected in the 2000s…
What era are we in now?
The graph evolved a bit as we discussed these questions. Here’s the scatterplot of Mean Runs Per Game Through the History of Baseball that triggered these questions and many more.
I added the colors and reference lines as the eras of the game were differentiated in our discussions. Most of the questions related directly to eras identified by baseball historians.
Some of the questions (and answers) were as follows:
Why were there so many runs scored before 1900?
Until 1887, the batter could essentially call the pitch (i.e., “high or low”), and the pitcher was obligated to comply.
Until 1885, “flat” bats were used.
Until 1883, pitches were launched below the waist and had less velocity.
Until 1877, there were “fair foul hits”: balls that landed in fair territory and then “kicked out” into foul territory before first or third base were considered hits (today they are called foul balls).
This era was known as the “19th Century Era.”
Why were there so few runs scored from 1900 to 1920?
Many manufacturers produced baseballs with poor and inconsistent specifications.
Teams used the same ball literally until the cover came off – it became dirty and difficult to see.
This era was known as the “Dead Ball Era.”
What happened to increase run production after 1920?
After Ray Chapman was hit and killed by a pitch, baseball began using clean balls. Witnesses stated that Chapman didn’t even flinch, which led most to believe that he hadn’t seen the ball approaching.
Home run hitters like Babe Ruth emerged.
Consistent manufacturing (with consistent rubber cores) made the baseballs come off the bats more readily.
This era was known as the “Live Ball Era.”
What happened in the early 1940s? It appears runs fell off again.
Replacement players played the games as “the regulars” joined the military during World War II.
This period is not always called an era, but referred to as the “War Years.”
Questions continued to bubble up…
The discussion continued in many interesting directions, for example:
The era from 1947 to 1962 is referred to as the “Integration Years,” as Jackie Robinson joined the Dodgers on April 15, 1947.
Baseball historians still disagree on what to call the era that begins in 1963, or even on how many eras might exist from 1963 to the present. Here are some of the events and rule changes that have affected the game since 1963:
The league expanded from 16 to 30 teams, effectively diluting the talent among teams and prompting many to refer to this entire period from 1963 to present as the “Expansion Era.”
The American League introduced the designated hitter in 1973, leading some to refer to the period from 1973 to the present as the “Designated Hitter Era.”
Rumors of players using performance-enhancing drugs surfaced in the mid-1990s, resulting in some calling 1995 to 2009 the “Steroid Era.”
What’s cool about all this “discovery” is that it happened from the initial scatterplot, and the identification of what appear to be clusters of years with similar RPG. As we identified clusters of years with similar run production, we either explained the reason behind the cluster, or noted it as a period of time having a change due to an unknown cause (and looked forward to researching it further!).
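The era boundaries discussed above can be written down as a small lookup, which is handy for coloring points or summarizing RPG by era. This is a minimal Python sketch, not part of the JMP workflow. The exact cut points are assumptions: the post only says runs fell in the “early 1940s,” so the 1942 to 1946 range for the War Years is a guess, as is starting the Live Ball Era in 1920. Everything from 1963 on is lumped into the Expansion Era for simplicity. The names `ERA_STARTS`, `ERA_NAMES` and `era_for_year` are hypothetical.

```python
from bisect import bisect_right

# First year of each era after the 19th century (assumed cut points)
ERA_STARTS = [1900, 1920, 1942, 1947, 1963]

# ERA_NAMES[i] covers the years before ERA_STARTS[i];
# the last name covers everything from 1963 onward
ERA_NAMES = [
    "19th Century Era",
    "Dead Ball Era",
    "Live Ball Era",
    "War Years",
    "Integration Years",
    "Expansion Era",
]


def era_for_year(year):
    """Return the era label for a season, using the assumed boundaries above."""
    # bisect_right counts how many era start years are <= year
    return ERA_NAMES[bisect_right(ERA_STARTS, year)]


for season in (1895, 1910, 1927, 1944, 1955, 2015):
    print(season, era_for_year(season))
```

Keeping the boundaries in one list makes it easy to shift a cut point and see whether the clusters in the scatterplot still line up with the labels.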
Next week, we'll use analytics to try to confirm these eras of the game and possibly uncover more periods worth investigating.
Interested in seeing more? This step-by-step video shows how I created the graph in Graph Builder: