Mar 6, 2017 6:16 AM | Last Modified: May 8, 2017 12:10 PM
With spring training under way and the baseball season almost here, it is time to do some baseball math. Like many of you, much of my work is building predictive models that other people have to understand and use. I want both simplicity and accuracy – usually competing desires. Often, I know the relationship has to be monotonic (either strictly increasing or strictly decreasing). For instance, to take an example from the baseball world, more wins are good, and any additional win has to improve a team's chances of making the playoffs. But I don't know whether the relationship is linear or nonlinear, and if it's the latter, which functional form best captures it. Keeping the model linear is attractive for simplicity, but I also want accuracy, and I don't want to miss a nonlinear aspect if one is there. This can be tough.
JMP comes to the rescue with its Knotted Spline capabilities. To stick with a baseball example, let’s say we want to estimate a team’s chances of winning its division. The skill of the team and the skill of the other teams in the division obviously play a big part in this problem. To get at a team’s skill, instead of arguing whether the Red Sox are better than the Yankees, we can turn to an unusual source – Las Vegas. Every year, Vegas comes out with an estimate of the number of wins each team will have in the upcoming year. This may not be perfect, but it is a pretty good way of evaluating the skill of a team before any games are played. Fortunately, I have kept track of these estimates from past years.
I looked at every team’s expected win total for each year from 2004 on. That was the measure of each team’s skill. Again, we know that expecting more wins increases a team’s chances and vice versa, but we don’t know what that relationship is. Are there diminishing returns for additional wins? Maybe there’s a “floor effect” for teams that are really bad. But what we do know is that 90 wins is greater than 89, which is greater than 88 and so on. We know this has to be monotonic. The number of wins a team is expected to get certainly must be a good predictor of winning the division. But whether a team shares that division with the best team in baseball (the Cubs this year) or not also makes a big difference. For simplicity, I used the highest win total for all division opponents as a measure of the division strength. Of course, you can get fancier, but for this illustration, this is good enough.
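The division-strength measure described above is just the maximum expected win total among a team's division rivals. A minimal Python sketch, using hypothetical win totals (only Houston's 87.5 and Texas's 86.5 appear later in this post; the other numbers are made up for illustration):

```python
# Hypothetical 2017 AL West Vegas expected-win totals. Only the Astros' 87.5
# and the Rangers' 86.5 come from this post; the rest are illustrative.
al_west = {
    "Astros": 87.5,
    "Rangers": 86.5,
    "Mariners": 84.0,
    "Angels": 79.0,
    "Athletics": 75.0,
}

def division_strength(team, division):
    """Highest expected win total among a team's division opponents."""
    return max(wins for opponent, wins in division.items() if opponent != team)

print(division_strength("Astros", al_west))  # 86.5
```

For the division favorite, the strength measure is its toughest rival's total; for everyone else, it is usually the favorite's total.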
So, for every team since 2004, I have its expected number of wins, the expected number of wins of its toughest division opponent, and, as a target, whether it won the division or not. We know the effect of each predictor is monotonic, but we don't know exactly what those curves will look like. We can bring these two predictors into the regular regression Fit Model in JMP. I select the predictors and then, under Attributes, choose Knotted Spline Effect.
Instead of fitting the target with a line, JMP fits it with a Knotted Spline. A spline is a piecewise polynomial whose pieces join smoothly at points called knots. It gives us an unconstrained glimpse at the relationship between wins and division chances, with all the freedom for the curve to bend as it pleases. From that, we can decide whether to use a line, a logistic curve, or something else. The Knotted Spline Effect pulldown asks how many knots I want to use. The default of 5 works fine for me. I don't think this curve is going to be bending all sorts of ways – maybe just some flattening and diminishing returns here and there.
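To show what a knotted spline is under the hood, here is a hedged sketch of a truncated-power cubic spline basis. This is not JMP's exact construction (JMP's knotted spline is additionally constrained to be linear beyond the outer knots), but it illustrates how knot terms let a regression fit bend locally while staying smooth:

```python
def knotted_spline_basis(x, knots):
    """Cubic spline basis evaluated at x: global terms 1, x, x^2, x^3 plus
    one truncated-power term (x - k)^3 per knot, which is zero below the
    knot. A linear regression on these columns yields a smooth,
    piecewise-cubic fit that can bend at each knot."""
    return [1.0, x, x**2, x**3] + [max(x - k, 0.0) ** 3 for k in knots]

# Five knots spread over a plausible expected-wins range (hypothetical values)
knots = [70, 77, 84, 91, 98]
print(knotted_spline_basis(87.5, knots))
```

With five knots, each observation becomes nine basis values, and ordinary least squares on those columns does the rest.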
I run the regression with my predictors as Knotted Spline effects and save the prediction formula to a column.
The saved formula is complex, and it is not immediately clear what the individual relationships look like. We can see them easily by using Copy Column Properties to paste the formula into two new columns, then editing the formula in each column to remove everything but the spline of interest. When we do that with Wins_expected, we see the following:
There is little sign of the diminishing returns we had hypothesized. The more Wins_expected, the better a team's chances are – with seemingly no end in sight. At the other end, however, you can see a bit of a floor effect. One could model this linearly, but after seeing this, I think many people would not be satisfied with that choice. I'm not, so with JMP's Nonlinear modeling, I know I can choose a monotonic function that fits this fairly well. I choose a four-parameter (4P) logistic, and you can see that it describes the effect well. Looking at the effect of Wins_expect_division_oppo, we get:
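For reference, one common parameterization of the four-parameter logistic has a lower asymptote, an upper asymptote, a growth rate, and an inflection point (parameterizations vary, and JMP's 4P logistic may differ in details). A minimal sketch:

```python
import math

def logistic_4p(x, lower, upper, rate, inflection):
    """Four-parameter logistic curve: runs from `lower` to `upper`,
    crossing the midpoint at x = inflection. Monotonically increasing
    whenever rate > 0, which is what makes it a safe choice here."""
    return lower + (upper - lower) / (1.0 + math.exp(-rate * (x - inflection)))
```

The floor effect seen in the plot corresponds to the lower asymptote, and the curve sits exactly halfway between the asymptotes at the inflection point.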
Again, you could model this linearly and move on, or you could capture the flattening with a sigmoidal function or, in this case, a cubic, giving yourself confidence that you are describing more of the relationship than a line would. I save both of these nonlinear fits as their own columns. I think of these fits as smart transformations of my predictor data: I didn't arbitrarily square them or take their logs; instead, I let the data tell me how they should be bent. With those fits in hand, I can use them as inputs to a simple linear regression (or a logistic regression, which keeps predicted probabilities from going below zero) to predict the chances of winning the division. Keeping it simple with a linear regression, I get the following:
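Putting the pieces together: the final model is just a linear combination of the two transformed predictors. A minimal sketch, with all coefficients and transforms hypothetical (the real values come from the nonlinear fits and the regression described above):

```python
def predict_division_chance(wins_exp, opp_wins_exp, coefs, f_wins, f_opp):
    """Linear regression on the two nonlinearly transformed predictors.
    coefs = (intercept, slope_wins, slope_opp); f_wins and f_opp stand in
    for the saved nonlinear fits (a 4P logistic and a cubic in the post)."""
    b0, b1, b2 = coefs
    return b0 + b1 * f_wins(wins_exp) + b2 * f_opp(opp_wins_exp)

# With identity transforms and made-up coefficients, just to show the shape:
chance = predict_division_chance(87.5, 86.5, (0.5, 0.02, -0.02),
                                 lambda w: w, lambda w: w)
```

The sign pattern is the point: a team's own expected wins push its chances up, and a tougher top rival pushes them down.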
I have a special interest in the Astros, as that team is my employer, and there is much excitement about the team this year. Vegas has the Astros as the favorite in the division with 87.5 wins, in a division that doesn't have an unusually high expected win total (Texas at 86.5 is the highest division rival). That gives the Astros a 32% chance of winning the division according to this method. Not bad, and apparently the excitement is warranted. All the teams, listed using the Tabulate capability in JMP, are shown below along with their expected wins and their division chances. What are your team's chances?