In my previous post, I showed how we explored the eras of baseball using a simple scatterplot that helped us generate questions and analytical direction. The next phase was figuring out how I might use analytics to aid the “subject matter knowledge” that had been applied to the data. Could I confirm what had been surmised regarding the eras of the game? And, might we use analytics to surface more periods of time in the game that we should investigate, or at least, attempt to explain?
Generalized Regression is a method that we might use to build a predictive model, but in this case I wasn’t seeking to build a predictive model. I wanted to take advantage of the variable selection capabilities that are available in the Lasso Option of this method. I wanted to see which variables were selected as “significant and important,” and in what order. I thought that what’s selected will ultimately be the years in which I’m interested – that is, I thought I would confirm my assessment of the eras of the game and probably find some interesting information that my “eyeball” method had overlooked.
Longtime SAS associate and friend Tony Cooper is one of the people who talks baseball with me. Besides being a sports fan in his own right, Tony is an expert JSL (JMP Scripting Language) programmer. In mere minutes, he created a script to help me develop a set of dummy variables that enabled better differentiation of run production through the years. It was a key to the results!
In Generalized Regression, I was looking for a method that would help me surface “change,” and I had seen this tool in action in other areas. My expectations were high, and this method did not disappoint. In fact, the results were exciting, indicating that this method could uncover “change” not only in baseball data but also in other areas!
Utilizing the JMP platform for Generalized Regression enabled me to walk through the years of the game and confirm suspicions about the eras of the game. And it also found what might not necessarily be an “era” but a demonstration of the effects of expansion in specific years, or (for example) the act of raising the mound in 1968 and then lowering it in 1969.
The graph and table shown below tells a story: Solution Path shows the order at which variables enter the model. And, in the accompanying table, the estimates tell us if RPG (Runs Per Game) went up or down in the associated year.
Just like the scatterplot, this visualization led to questions as each year was presented. For example:
So, while having some fun playing with data from a favorite game, we were also able to demonstrate the effectiveness of generating questions and analytical direction using the simple scatterplot. We then showed how we could surface information at even greater levels using Generalized Regression. Potential applications to areas outside the arena of sports are endless. How might you use Generalized Regression?
Interested in seeing more?
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.