arati_mejdal

Soccer Analytics Using JMP

NOTE: This entry comes to the JMP Blog from our colleague Jerome Bryssinck of SAS Belgium. Jerome had seen Jeff Perkinson's examples of basketball analytics using JMP and created his own example using football (or soccer) data. In response to comments from readers, Jerome updated his model on May 26, and this blog post now reflects those changes.


THE QUESTION: Has the game been decided yet? (HTGBD)


This is the question that most people constantly ask themselves when they are watching a football game. The question can take different forms depending on the circumstances. If you're lucky enough to support the winning team, you might ask yourself: "How secure is the lead?" And for the less fortunate among us: "Is there still a chance for my team to win?"


THE ANSWER: Analytics


Interactive JMP Profiler of Has the Game Been Decided vs. Time and Goal Difference

Graph1: Probability of the game having been decided as a function of the elapsed time and the goal difference.


Graph1 shows the probability that the game has been decided as a function of the elapsed time and the goal difference. You can change the elapsed time and the goal difference on the graph by clicking on a different value.


Some interpretation examples:


If Time=45 and Goal Difference=0: The game has been going on for 45 minutes, and the goal difference is 0. There is a 23% probability that the outcome of the game won't change. Because the teams are level (goal difference of 0), this means there is a 23% probability that the game will end in a tie.


If Time=45 and Goal Difference=1: The game has been going on for 45 minutes, and one of the teams is leading by 1 goal. There is a 60% probability that the outcome of the game won't change. Here, this means that the leading team has a 60% probability of winning.


More Details about the Answer


The model used above has been built using data from the UK Premier League from 2002 to 2006. The type of model used is a regression model.
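To make the target variable concrete, here is a minimal Python sketch of how training rows might be derived from a match's goal timeline. The labeling rule is an assumption based on the interpretation examples above: a game counts as "decided" at a given minute when the outcome at that minute (home lead, tie or away lead) matches the final outcome. The function name `decided_rows`, the tuple layout and the example match are hypothetical, not taken from the original JMP data table.

```python
def sign(x):
    """Return -1, 0 or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def decided_rows(goals, minutes=90):
    """Build one (minute, abs_goal_difference, decided) row per elapsed minute.

    goals: list of (minute, team) tuples, with team being "home" or "away".
    decided is 1 when the outcome at that minute (home lead, tie, away lead)
    equals the final outcome, else 0. This labeling rule is an assumption.
    """
    # Final goal difference (home minus away) over the whole match.
    final_diff = sum(1 if team == "home" else -1 for _, team in goals)
    rows = []
    for t in range(1, minutes + 1):
        # Goal difference counting only the goals scored up to minute t.
        diff = sum(1 if team == "home" else -1 for m, team in goals if m <= t)
        rows.append((t, abs(diff), int(sign(diff) == sign(final_diff))))
    return rows


# Hypothetical match: home scores at 10', away equalizes at 50',
# home scores the winner at 80'.
example = decided_rows([(10, "home"), (50, "away"), (80, "home")])
```

In this made-up match the rows between minutes 10 and 49 are labeled as decided (the home team leads and eventually wins), while the rows between minutes 50 and 79 are not, because the score is level at that point.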


The following representations are useful to understand the underlying data.


Graph in JMP of Has the Game Been Decided vs. Time


Graph2: Has the Game Been Decided vs. Time


Graph2 shows the percentage of games that have been decided as a function of the Elapsed Time. I must say that I wasn't surprised by this graph, which basically states that HTGBD (Has The Game Been Decided) increases with the Elapsed Time.


Graph in JMP of Has the Game Been Decided vs. Time By Goal Difference


Graph3: Has the Game Been Decided vs. Time By Goal Difference


Graph3 shows the percentage of games that have been decided as a function of the Elapsed Time, broken down by goal difference. According to this graph, the goal difference is an excellent predictor of HTGBD.
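The percentages plotted in Graph2 and Graph3 are essentially group averages. As a rough illustration only, the Python sketch below computes the share of decided observations for each combination of elapsed time and goal difference. The row layout `(minute, goal_difference, decided)` and the sample rows are invented for the example; they are not the Premier League data behind the graphs.

```python
from collections import defaultdict


def decided_rate(rows):
    """Return {(minute, goal_difference): share of rows with decided == 1}."""
    totals = defaultdict(int)
    decided = defaultdict(int)
    for minute, diff, flag in rows:
        totals[(minute, diff)] += 1
        decided[(minute, diff)] += flag
    return {key: decided[key] / totals[key] for key in totals}


# Made-up observations at minute 45: (minute, goal_difference, decided).
rows = [
    (45, 0, 1), (45, 0, 0), (45, 0, 0), (45, 0, 0),
    (45, 1, 1), (45, 1, 1), (45, 1, 0),
]
rates = decided_rate(rows)
```

Grouping by minute alone would give the kind of curve shown in Graph2; grouping by both fields, as here, gives one curve per goal difference, as in Graph3.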


Additional readings:


Similar models are available for basketball. Check out Bill James and Jeff Perkinson if you want to learn more.


This entry was first published in Jerome Bryssinck's blog, Brisink. It is republished here with his permission.

2 Comments
Community Member

Jerome Bryssinck wrote:

>Something does not seem right with your model when the goal difference is large because the predictions begin to exceed 1.

The model used by the Profiler is a Linear Regression. I initially built a Logistic Regression Model but couldn't export it as a Flash profiler. So I reverted to a linear regression, which is of course not totally correct. This is why the predictions can exceed 1 for some values.

In the meantime, I was able to figure out how to export a Profiler for the Logistic Regression, so I will update the post shortly.
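The difference between the two model types can be sketched in a few lines of Python. The coefficients below are invented purely for illustration and are not the fitted values from either model; reusing the same coefficients for both functions is also a simplification. The point is only that a linear prediction is unbounded, while a logistic prediction always stays between 0 and 1.

```python
import math

# Hypothetical coefficients, chosen only for illustration (not fitted values).
INTERCEPT, B_TIME, B_DIFF = -0.1, 0.005, 0.3


def linear_prediction(minute, goal_diff):
    """Linear model: the output is not constrained to the range [0, 1]."""
    return INTERCEPT + B_TIME * minute + B_DIFF * goal_diff


def logistic_prediction(minute, goal_diff):
    """Logistic model: the same linear score passed through the sigmoid."""
    score = INTERCEPT + B_TIME * minute + B_DIFF * goal_diff
    return 1.0 / (1.0 + math.exp(-score))


# With a 4-goal lead at minute 80, the linear score is about 1.5,
# which is not a valid probability; the logistic value is about 0.82.
linear_value = linear_prediction(80, 4)
logistic_value = logistic_prediction(80, 4)
```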

>At the higher goal differences, it also seems counterintuitive to say the game has been decided when the match is only a few minutes old. My guess here is that there is minimal data on high spread games after just a few minutes and that the predictions here are extrapolated.

You're correct that the model is not valid for games where we don't have much input data (e.g. high spread games after just a few minutes). The good news is that we won't see many such games in practice, so the model will still be accurate when scoring actual games.

However, if we want to be perfectionists, we could:

1. Define for which games (combinations of input variables) the model isn't valid.

2. Find an alternative way of scoring the games for which the model isn't valid (it will be possible to assign a minimal probability based on games for which we have data).

OR

just state that the model isn't valid for those games
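The first option above could be sketched as follows in Python. This is only one possible approach: it treats a combination of elapsed time and goal difference as valid when the training data contains at least a minimum number of observations for it. The threshold `MIN_ROWS`, the function names and the sample rows are all hypothetical.

```python
from collections import Counter

MIN_ROWS = 30  # Hypothetical threshold; choose it to suit the data.


def supported_cells(rows, min_rows=MIN_ROWS):
    """Return the set of (minute, goal_difference) cells with enough data."""
    counts = Counter((minute, diff) for minute, diff, _ in rows)
    return {cell for cell, n in counts.items() if n >= min_rows}


def score(minute, goal_diff, model, supported):
    """Return the model's probability, or None where the model is not valid."""
    if (minute, goal_diff) not in supported:
        return None
    return model(minute, goal_diff)


# Made-up training rows: many 1-goal games at minute 45,
# only two 4-goal games at minute 5.
rows = [(45, 1, 1)] * 40 + [(5, 4, 1)] * 2
supported = supported_cells(rows)


def toy_model(minute, goal_diff):
    """Stand-in for a fitted model; always returns 0.9."""
    return 0.9


common_case = score(45, 1, toy_model, supported)  # enough data: scored
rare_case = score(5, 4, toy_model, supported)     # too little data: None
```

Returning `None` corresponds to simply stating that the model isn't valid for those games; a fallback probability could be substituted at that point instead.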

Community Member

Michael Joner wrote:

Something does not seem right with your model when the goal difference is large because the predictions begin to exceed 1. My guess is that this is either 1) a problem with the Flash implementation of the Profiler, or 2) a result of using something other than Logistic regression. At the higher goal differences, it also seems counterintuitive to say the game has been decided when the match is only a few minutes old. My guess here is that there is minimal data on high spread games after just a few minutes and that the predictions here are extrapolated.