Can a Country's Wealth Explain Its Olympic Medal Count? Building a Regression Model to Explain Medal Counts

Masukawa_Nao · Aug 13, 2024 02:59 AM

The Paris Olympics, where so many heated contests unfolded, have now come to a close. With the scorching summer heat, many people may have stayed indoors and enjoyed watching the Games rather than going outside.

Over roughly two weeks of competition, Japan won a total of 45 medals, the most it has ever won at an Olympics held overseas. At the previous Games in Tokyo, however, Japan won 58 medals.

Was the number of medals won this time, 45, an appropriate number? Let's consider this from an economic perspective.

Quite some time ago, when I was studying economics using the famous book "Mankiw's Introduction to Economics," I came across a column in the book titled "Who will win the Olympics?" which made a strong impression on me.

This column states that the number of Olympic medals a country wins is explained by the country's GDP (gross domestic product) and population. Economically wealthy countries are able to develop the potential of many talented athletes, and the larger the population, the more likely it is that highly skilled athletes will emerge.

In addition, being the host country may also be a factor, as it gives the host the advantage of competing at home. As mentioned above, Japan won a record 58 medals at the Tokyo Olympics three years ago, which is likely due in part to the benefits of hosting the Games.

In this post, we will create a regression model that explains the number of medals each country won at the Paris Olympics based on the country's population, GDP, and host-country status (whether or not the country is France). We will evaluate the explanatory power of the model and look at countries that won more medals than the model predicts.

Model to predict medal count

In the regression model, the response variable and explanatory variables are set as follows:

Response variable: Total number of medals for each country (gold, silver, and bronze combined)

Explanatory variables: GDP (dollars), population, and host country (a dummy variable that is set to 1 if the country is France and 0 otherwise).

The GDP (dollars) and population data used as explanatory variables were mainly obtained from World Bank open data. Note that data for North Korea and the Refugee Olympic Team were not available, so they were excluded from the analysis.

*Ideally, the analysis would also include countries that did not win any medals, but here we focus on countries that won at least one medal.

I created a data table in JMP in which the orange column is the response variable and the yellow columns are the explanatory variables.

Population and GDP have been transformed into common logarithms. "Host Country" is a continuous column that takes the value 1 for France and 0 for other countries.
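As a rough illustration of the preprocessing described above, the following Python sketch builds the three explanatory variables for a single country. The function name `build_features`, the dictionary keys, and the round input figures are assumptions made for this example; they are not the JMP column names or the World Bank values used in the analysis.

```python
import math

def build_features(population, gdp_usd, country, host="France"):
    """Return the three explanatory variables described in the text."""
    return {
        # Common (base-10) logarithms of population and GDP
        "log10_population": math.log10(population),
        "log10_gdp": math.log10(gdp_usd),
        # Dummy variable: 1 for the host country, 0 for every other country
        "host_country": 1 if country == host else 0,
    }

# Round illustrative inputs (hypothetical, NOT actual World Bank figures)
row = build_features(population=1_000_000, gdp_usd=1e12, country="France")
print(row)
```

Taking logarithms compresses the very wide range of population and GDP values, so that a handful of very large countries do not dominate the fit.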

A histogram of the response variable, the total number of medals, shows a right-skewed distribution with a long right tail. Therefore, we treated the total number of medals as count data, fitted a Poisson distribution and a negative binomial distribution (gamma-Poisson distribution), and compared their goodness of fit.

The fitted curve and AICc show that the negative binomial distribution fits the data better than the Poisson distribution. The negative binomial distribution is good for modeling overdispersion (data with variance larger than the mean), and it fits our data well.
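A quick way to see why overdispersion matters is to compare the sample mean and variance of the counts: a Poisson distribution forces them to be equal, while the negative binomial allows the variance to exceed the mean. The sketch below is only an illustration of that check. The counts in `toy_counts` are made up for the example and are not the actual medal data, and the moment-based estimate of the overdispersion parameter is a simple stand-in, not the maximum likelihood estimate JMP reports.

```python
from statistics import mean, variance

def overdispersion_summary(counts):
    """Compare the sample mean and variance of count data."""
    m = mean(counts)
    v = variance(counts)  # sample variance (n - 1 denominator)
    # Method-of-moments estimate of sigma in Var(Y) = mu + sigma * mu^2,
    # floored at 0 because sigma cannot be negative
    sigma_mom = max((v - m) / m ** 2, 0.0)
    return {"mean": m, "variance": v, "ratio": v / m, "sigma_mom": sigma_mom}

# Made-up right-skewed counts (hypothetical, NOT the real medal table)
toy_counts = [1, 1, 2, 2, 3, 5, 8, 13, 40, 90]
summary = overdispersion_summary(toy_counts)
print(summary)
```

A variance-to-mean ratio well above 1 is a sign that a Poisson model would understate the spread of the data.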

Fitting a regression model assuming a negative binomial distribution for Y

Assuming that Y follows a negative binomial distribution, we fit a regression model in which the expected value E(Y) = μ and the variance Var(Y) are given as follows:

log(μ) = β0 + β1 * log10(Population) + β2 * log10(GDP($)) + β3 * (Host Country)

Var(Y) = μ + σ * μ^2 (σ is the overdispersion parameter)
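To make the two formulas above concrete, here is a minimal Python sketch that evaluates them for one country. It assumes that log(μ) on the left-hand side is the natural logarithm, which is the usual log link but is not stated explicitly in this post. The function names and all coefficient values are hypothetical placeholders, not the estimates fitted in JMP.

```python
import math

def predicted_mean(b0, b1, b2, b3, population, gdp_usd, is_host):
    """Expected medal count mu under the log-link model in the text."""
    eta = (b0
           + b1 * math.log10(population)
           + b2 * math.log10(gdp_usd)
           + b3 * is_host)
    return math.exp(eta)  # assumes log(mu) is the natural logarithm

def nb_variance(mu, sigma):
    """Negative binomial variance: Var(Y) = mu + sigma * mu^2."""
    return mu + sigma * mu ** 2

# Placeholder coefficients (hypothetical, NOT the fitted JMP estimates)
coef = dict(b0=-10.0, b1=-0.2, b2=1.2, b3=0.5)
mu_home = predicted_mean(**coef, population=5e7, gdp_usd=1e12, is_host=1)
mu_away = predicted_mean(**coef, population=5e7, gdp_usd=1e12, is_host=0)

# With a log link, the host dummy multiplies the mean by exp(b3)
print(mu_home / mu_away)
print(nb_variance(mu_away, sigma=0.3))
```

Because of the log link, each coefficient acts multiplicatively on the expected medal count rather than additively.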

This model can be fitted using the "Generalized Regression" function in JMP Pro. In "Fit Model," set the method in the upper right to "Generalized Regression," the distribution to "Negative Binomial," and specify Y and the model effects.

The coefficient for "log10(Population)" is negative, at -0.263487. The "Prediction Profiler" report likewise shows that, under this model, the larger the population, the lower the predicted total number of medals. This result contradicts the idea that "the larger the population, the higher the chance of producing high-ability athletes."

① Model explained by population, GDP, and host country

What about variable selection when building the model? In Generalized Regression, you can select Lasso (a method that selects variables while guarding against overfitting) as the estimation method, so we will try fitting the model this way.

As a result of the fit, the coefficient for log10(Population) was shrunk to 0, so population dropped out of the model. This suggests that population adds little explanatory power once GDP and host country are included. After population was removed, the host country term was highly significant (p-value = 0.0009).
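For intuition about how Lasso can set a coefficient to exactly 0, the sketch below shows the soft-thresholding operator that appears in many Lasso solvers. This is a textbook building block shown only as an illustration; it is not a description of how JMP's Generalized Regression estimates the model, and the penalty value `lam` and the input values are hypothetical.

```python
def soft_threshold(z, lam):
    """Shrink z toward 0 by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

# Hypothetical unpenalized values and penalty, for illustration only
lam = 0.5
for z in (-0.26, 2.0, -1.5):
    print(z, "->", soft_threshold(z, lam))
```

A small coefficient is wiped out entirely, while larger coefficients are merely shrunk, which is why Lasso performs variable selection as well as regularization.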

② Model explained by GDP and host country

We compared a model that includes population (①) and a model that excludes population (②). The AICc, BIC, and generalized R-squared statistics are almost the same, so it is difficult to conclude which model is better.
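For reference, AICc and BIC can be computed from a model's log-likelihood, its number of estimated parameters k, and the sample size n using the standard formulas sketched below. The log-likelihood, k, and n in the usage lines are hypothetical placeholders, not the values from the two fitted models, and k is assumed to count every estimated parameter, including the overdispersion parameter.

```python
import math

def aicc(log_likelihood, k, n):
    """Corrected AIC: -2*logL + 2k + 2k(k + 1) / (n - k - 1)."""
    return -2.0 * log_likelihood + 2 * k + (2 * k * (k + 1)) / (n - k - 1)

def bic(log_likelihood, k, n):
    """Bayesian information criterion: -2*logL + k * ln(n)."""
    return -2.0 * log_likelihood + k * math.log(n)

# Hypothetical inputs (NOT taken from the JMP reports)
print(aicc(-100.0, k=3, n=50))
print(bic(-100.0, k=3, n=50))
```

Both criteria reward a higher likelihood and penalize extra parameters, and smaller values indicate a better trade-off, so two models with nearly equal values are hard to rank on these statistics alone.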

In such cases, one approach is to choose the model that is easier to interpret in light of reality. Based on the generalized R-squared value, model ②, which excludes population, explains approximately 65% of the variation in medal counts with just two variables, GDP and host country.

There must be many other factors that determine the number of medals, such as a culture that encourages sports, the presence of outstanding athletes, and whether events in which a country can easily win medals are included in the program. Nevertheless, it is interesting that the result can be explained to some extent by only two variables, GDP and host country.

Next, we entered France's GDP into the Prediction Profiler for model ②.

From this, the predicted total number of medals for France is 51.5. In fact, France won 64 medals, so it can be said that they performed better than expected, even taking into account their economic situation and the advantage of being the host country.

For model ②, we also looked at the "Plot of Actual Values and Predicted Values" report. The vertical axis shows the actual values and the horizontal axis shows the predicted values, so countries above the line on the graph (the line where the actual and predicted values match) exceeded the prediction, and countries below it fell short. (Both the Y and X axes are logarithmic.)

Japan won 45 medals, far exceeding the predicted 27.2 medals. However, many of the top medal-winning countries also won significantly more medals than predicted, suggesting that there may be factors that cannot be explained by an economic indicator such as GDP alone.

The United States once again topped the medal table. How many medals will it win at the Los Angeles Olympics four years from now, when it can take advantage of being the host nation?
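The two comparisons quoted in this post (France and Japan) can be summarized with a small Python sketch. The actual and predicted values are the ones stated in the text; the function name `compare` and the choice to report both a difference and a ratio are assumptions made for this example.

```python
def compare(actual, predicted):
    """Return (actual - predicted, actual / predicted)."""
    return actual - predicted, actual / predicted

# (actual medals, medals predicted by model 2), as quoted in the text
results = {"France": (64, 51.5), "Japan": (45, 27.2)}

for country, (actual, predicted) in results.items():
    diff, ratio = compare(actual, predicted)
    print(f"{country}: {diff:+.1f} medals vs. prediction, ratio {ratio:.2f}")
```

Because the actual-versus-predicted plot uses logarithmic axes, the ratio is the quantity that corresponds to the vertical distance from the line.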

by Naohiro Masukawa (JMP Japan)

Naohiro Masukawa - JMP User Community