Hello, everyone. My name is Kevin O'Donnell.
I'm a JMP Global Customer Reference Intern,
and today I'll be presenting on a personal analytics project
that I've done, the NBA Player Matchup Model.
So just to get started,
I would like to go over my idea and the motivations for building the model.
I've always been incredibly interested in basketball.
I've been a huge basketball fan for my whole life,
and I was looking for a new analytics project to dive into,
so I ended up asking this question here:
Could we model a player's offensive success
based on their averages and the strength of their opponents as well?
So as most of us know from watching any sport,
player performance varies greatly from game to game.
It's based on a variety of factors,
one of these being a player's average points.
That will be a very strong predictor for points in an NBA game,
but it's far from the only influence on a player's offensive performance.
So going into this project,
I wanted to build a model that predicted a player's points in a game
based on the points per game,
but also a lot of other variables which we will look into on the next slide.
So ideally, this will be helpful for coaches and team analysts to,
for example, determine which of their offensive players are likely
to overperform, or at least perform really well,
based on their matchups and, vice versa, which of their players they might want
to avoid on offense because they're being guarded
by some of the best defensive players on the other team.
This could inform play calling and game planning,
so if a particular defender is weaker, coaches may look to target that match up
or look to force a switch onto their best scorer.
In the entertainment realm,
fantasy basketball players could use this data to figure out
who to start or who to pick up off the Waiver Wire,
so it has a broad range of applications.
So here's a little bit of the overview of the data.
I'll go through this quickly, and we'll see this as I go
into the JMP demo a little bit more in detail,
but we're going to predict points per game based on seven main categories of data.
So we're going to use player offensive averages for the season,
this includes things like points per game,
three- and two-point percentages, and advanced metrics such as usage rate,
which records how often an offensive player is used on their team.
Player attributes such as height, wing span, vertical leap
will also be included for both the offensive and defensive players
because these physical attributes, I would assume, also contribute
to how many points a player will score.
Finally, in terms of the offensive players,
we'll use career averages.
The same averages that we're using for the season,
we'll also use for the career to add a little more robustness to the model,
and overall team pace and offensive rebounding percentage
is also going to be a predictor because that will determine
how many possessions there will be.
If there are more possessions in the game, it's more likely that any given player
will have more points.
And so, by that same token, defensive team pace
and defensive rebounding percentage will also be used.
In terms of individual defensive stats, we have things like steals, blocks, fouls,
and advanced metrics such as defensive win shares, which measures
an individual player's contribution to the team's wins on defense.
Those are also going to be predictors that
probably reduce the amount of points that are scored by an offensive player.
And then finally, the defender attributes will be used as I mentioned earlier.
So before we get right into the data and the modeling in JMP,
I would like to go over the matchup data that I was using
for the bulk of the model.
So the NBA recently implemented some new personal match up data collection
and it's based on detailed player tracking.
So it tracks the closest defender at every point,
not just the primary defender on a play.
It only tracks front court time and it will track partial possessions.
So what this means is that a player could be guarded by
as many as five different players on a single play,
and each defender would be awarded the respective amount
of matchup minutes for that possession.
So here we have an example of Terry Rozier for the Hornets
being guarded by four different players on different teams.
In the first row, Steph Curry is guarding him
in only one game for 2.7 minutes and allowed two points.
So if we think of a hypothetical where the Hornets are playing
Steph Curry's Golden State Warriors and Curry guarded Rozier for 10 seconds,
and then Klay Thompson switched on to Rozier for another 10 seconds,
both would be awarded those 10 seconds of matchup time.
However, if Rozier scored a two pointer at the end of this possession,
the points would be marked for Klay Thompson.
This is really cool tracking data and I love how specific it is,
but it did cause some problems when I tried to model the points per minute
for each player and defender combination, which was my original plan.
So I was originally going to use the offensive player stats
and defensive player stats that I mentioned in the previous slide
in every individual matchup.
But since many defenders, like you'll see here,
logged very small amounts of time,
like Stanley Johnson only guarded Rozier for an average of under one minute a game,
that's not enough time
for some of these points per minute measurements to be normal.
So going back to that Klay Thompson hypothetical,
if Klay only guarded Rozier for 10 seconds the whole game
and Rozier happened to score two points in that possession,
then Rozier's points per minute against Klay in that row would be 12.
So obviously, extrapolating that to a full game,
even excluding the back court time as this data does, it's very unrealistic.
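The arithmetic behind that distorted rate can be sketched in a few lines. This is a minimal illustration in Python rather than JMP, and the function name `points_per_minute` is a hypothetical helper, not something from the original project.

```python
def points_per_minute(points, matchup_seconds):
    """Points scored per minute of matchup time."""
    if matchup_seconds <= 0:
        raise ValueError("matchup_seconds must be positive")
    # Multiply before dividing so the seconds-to-minutes conversion stays exact here.
    return points * 60.0 / matchup_seconds

# The hypothetical from the talk: a two-pointer in a 10-second matchup.
short_matchup_rate = points_per_minute(2, 10)

# The Curry row from the example table: 2 points allowed in 2.7 matchup minutes.
longer_matchup_rate = points_per_minute(2, 2.7 * 60)

print(short_matchup_rate, round(longer_matchup_rate, 2))
```

The same two points give a wildly different rate depending on how little time sits in the denominator, which is the small-sample problem described above.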
So I had to go with a slightly different, less ideal approach.
Instead of using the individual defensive stats,
I actually averaged them out
for each combination of player and defensive team.
So instead of using Rozier versus Curry and Rozier versus Thompson,
I would use Rozier versus the entire Golden State Warriors average based on
the amount of matchup minutes that each player defended Rozier.
So for example,
if Steph Curry and Klay Thompson both guarded him for half the amount
of possible match up minutes, then the team average,
let's say steals per game, would just be the arithmetic mean
of Curry steals and Klay steals per game.
The same would go for every defensive variable
and that's obviously a simple example,
but the same goes for every player versus every team.
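A minimal sketch of that aggregation, assuming each defender row carries a matchup-minutes value and a per-game stat. The function name, the dictionary keys, and all numbers are hypothetical illustrations, not real player data or the original JMP workflow.

```python
def team_weighted_average(defenders, stat):
    """Matchup-minute-weighted average of one defensive stat.

    defenders: list of dicts, each with a 'matchup_min' key and the stat key.
    """
    total_min = sum(d["matchup_min"] for d in defenders)
    if total_min <= 0:
        raise ValueError("no matchup minutes recorded")
    return sum(d["matchup_min"] * d[stat] for d in defenders) / total_min

# Equal split: the weighted average reduces to the arithmetic mean.
equal_split = team_weighted_average(
    [
        {"name": "Defender A", "matchup_min": 5.0, "steals_pg": 1.0},
        {"name": "Defender B", "matchup_min": 5.0, "steals_pg": 2.0},
    ],
    "steals_pg",
)

# Unequal split: the defender with more matchup minutes pulls the average toward them.
unequal_split = team_weighted_average(
    [
        {"name": "Defender A", "matchup_min": 7.5, "steals_pg": 1.0},
        {"name": "Defender B", "matchup_min": 2.5, "steals_pg": 2.0},
    ],
    "steals_pg",
)

print(equal_split, unequal_split)
```

With equal minutes the result is the plain mean of the two defenders, as in the Curry and Thompson example; with unequal minutes it shifts toward whoever guarded the player longer.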
So obviously this is not ideal because it minimizes
the individuality of the matchup.
But I had to abandon that points per minute approach
because the samples were too small and the response was heavily distorted.
So this model that I've created with the aggregated data is more accurate
than my initial attempts, even though it sacrifices some individuality.
So at this point, I'm going to switch into JMP here
and do a little bit of the demo on how I built the model
and some exploratory data analysis to begin.
So to begin, we're just going to look at some marginal relationships
with points in the graph builder,
and I'm able to choose any I want.
But for now we're going to look specifically at three variables,
the first being points per game.
So we see here, we have a moderate relationship
with points and points per game, and it is positive as we would expect.
The average points that someone puts up in a game,
or over the course of the season rather, is going to obviously influence
how many points they score in a game.
Similarly, something like usage rate,
the advanced statistic that I was discussing earlier,
also has a positive relationship,
just slightly weaker than the relationship between points and points per game.
Finally, if we look at some of the career stats,
if we look at career points per game, then again we see a positive relationship,
but that is a little bit weaker because it's averaged over the course
of a player's career rather than the season that we're currently in.
But hopefully this could help adjust for some major differences
in the points per game totals or averages rather.
So now that we have a sort of an idea of the data itself,
we can move into the simple linear regression
with points by points per game as a benchmark for this model.
So as I mentioned,
you can predict based on just points per game
and it will give you a decent prediction.
But we're looking to improve that prediction by adding
some of these different variables, offensive and defensive.
So if we run this script here,
this is just comparing some runs of a simple linear regression
on the training and validation set using KFold validation throughout.
And if I run here using the Hidden Validation 2 column
that I'll be using for the remainder of these models,
then we can see the regression here.
So here's a regression plot. It looks pretty scattered.
There's not a clear linear relationship, but the RSquare is moderate,
meaning that 46.5 percent of the variation in points is accounted for
by this average points per game, which is pretty good.
And the Root Mean Square Error is about 5.27.
The Root Mean Square Error is the standard deviation of the residuals,
so it's essentially the variation in how well our model is predicting.
Its standard deviation is around five.
And we can also look at some other measures,
such as the AIC, which is used to compare models for predictability.
So AIC is another important measure,
and it measures how well the model will predict
relative to the number of predictors put in the model,
just to make sure you're not adding way too many.
So in this case, the AIC is very high, and we can come back to this number
as we look at the comparison between this and the multiple linear regression
that I will show later.
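As a rough sketch of how these three summary measures relate to a fitted line, the following Python computes them for a simple linear regression. The data are made up for illustration, the function names are hypothetical, and the AIC form used here (n ln(SSE/n) + 2k) is one common Gaussian version that drops constants, so its values would likely not match the AICc that JMP reports.

```python
import math

def fit_simple_ols(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def fit_metrics(y, y_hat, n_coefs):
    """R-squared, RMSE, and a Gaussian AIC for a fitted model.

    n_coefs counts regression coefficients, including the intercept.
    """
    n = len(y)
    mean_y = sum(y) / n
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    r_squared = 1.0 - sse / sst
    rmse = math.sqrt(sse / (n - n_coefs))  # residual standard deviation
    # Assumed AIC form: k counts the coefficients plus the error variance.
    aic = n * math.log(sse / n) + 2 * (n_coefs + 1)
    return r_squared, rmse, aic

# Made-up data: season points per game vs. points scored in one game.
ppg = [10.0, 15.0, 20.0, 25.0, 30.0]
points = [8.0, 18.0, 17.0, 29.0, 28.0]

intercept, slope = fit_simple_ols(ppg, points)
predicted = [intercept + slope * x for x in ppg]
r_squared, rmse, aic = fit_metrics(points, predicted, n_coefs=2)
print(f"slope={slope:.2f} R^2={r_squared:.3f} RMSE={rmse:.2f} AIC={aic:.2f}")
```

Because the AIC adds a penalty of 2 per estimated parameter, an extra predictor only lowers it when the drop in the error term outweighs that penalty.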
So while I was trying to pick the model that I was going to choose,
I decided to use the model screening feature in JMP,
which allows you to select your response variables and all the factor variables.
I ended up putting in all of the numeric variables that I had for a full model
just to see initially which models perform the best.
And I was able to choose from a variety of these different methods,
including XGBoost and Generalized Regression,
and then of course, just the normal Least Squares.
In the interest of time, this would take forever to run
because it's using some of these machine learning algorithms.
So I'm instead just going to pull up a quick screenshot of the model screening.
So here we have the output from this model screening.
It shows the RSquare.
Again, I was using KFold validation with ten folds here.
So this shows the RSquare
and the Root Mean Square Error for each run.
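For readers unfamiliar with k-fold validation, here is a minimal sketch of how rows can be split into ten folds so that each row is held out exactly once. This is an illustrative Python helper with a hypothetical name; it is not how JMP builds its validation column.

```python
import random

def kfold_indices(n_rows, k=10, seed=0):
    """Yield (train, valid) row-index lists for k-fold cross validation."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)          # shuffle once, reproducibly
    folds = [idx[i::k] for i in range(k)]     # k roughly equal-sized folds
    for i in range(k):
        valid = folds[i]
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        yield train, valid

# Example: 25 rows split into 10 folds.
splits = list(kfold_indices(25, k=10, seed=1))
for train, valid in splits[:2]:
    print(len(train), len(valid))
```

A model would be fit on each `train` list and scored on the matching `valid` list, giving one RSquare and one Root Mean Square Error per fold.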
So we can see that the Least Square Fit actually had
a very strong fit compared to some of the machine learning algorithms,
which I was surprised by. But it actually helps for interpretability
because some of these machine learning algorithms,
they're more of a black box. And if I were to use those,
it wouldn't be as easy to interpret the coefficients or see
which variables are truly significant
or which are just being used for prediction.
So because the fit of Least Square and the Lasso Regression,
which I'll get into a little bit later, were so high,
it's actually a good sign that I'm able to use that for better interpretability.
So as we can see, the most accurate model is this multiple linear regression
with Lasso Regularization.
Lasso Regularization is a statistical technique that penalizes the size of the coefficients,
shrinking some of them to zero, which selects features and helps reduce multicollinearity.
Multicollinearity is the correlation between predictors
and that can negatively affect the model.
So using this technique,
we're able to take the full linear regression model shown here
and remove some of the variables to minimize the multicollinearity
and maybe satisfy some of the linear assumptions better.
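To make the variable-removal idea concrete, here is a small coordinate-descent sketch of the Lasso in Python. It assumes centered predictors and a centered response (so no intercept), uses the objective (1/(2n)) * SSE + penalty * sum(|beta|), and runs on made-up data with two highly correlated predictors. The function names and the penalty scaling are assumptions for illustration; this is not JMP's Generalized Regression implementation.

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty, returning 0 inside the band."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0

def lasso_coordinate_descent(X, y, penalty, n_iter=200):
    """Lasso coefficients for centered X (list of rows) and centered y."""
    n, p = len(y), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # sum of squares of feature j
            for i in range(n):
                others = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - others)
                z += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, penalty) / (z / n)
    return beta

# Made-up, centered data: the two predictor columns are strongly correlated.
X = [[-2.0, -2.0], [-1.0, -1.0], [0.0, 0.0], [1.0, 2.0], [2.0, 1.0]]
y = [-6.0, -3.0, 0.0, 4.0, 5.0]

unpenalized = lasso_coordinate_descent(X, y, penalty=0.0)  # ordinary least squares
penalized = lasso_coordinate_descent(X, y, penalty=4.0)    # second coefficient drops out
print([round(b, 3) for b in unpenalized], [round(b, 3) for b in penalized])
```

With no penalty both correlated predictors get a coefficient; with a large enough penalty the redundant one is set to zero and the remaining coefficient is shrunk. Refitting an ordinary regression on only the surviving variables, as described in the talk, removes that shrinkage.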
So with that information,
I was able to take the variables selected with Lasso Regularization
and then create a multiple linear regression using those variables.
So as you can see here,
the Actual by Predicted Plot looks pretty similar.
It might be slightly closer to the line of best fit here, which is a good sign.
And we can see our effect summaries,
but we're going to scroll a little bit past that
so that we get to the summary of our fit.
So compared to the simple linear regression,
the RSquare and the Adjusted RSquare are very similar, and that's probably
because points per game is such a heavily influential factor in this model,
as well as in the simple linear regression, obviously.
So the fit is not too much different.
However, adding these different variables shown here does improve
the Root Mean Square Error a little bit.
It went from about 5.3 to a little over 5.2, which is not a drastic improvement,
but an improvement nonetheless.
But the real difference here is in the AIC and comparing these models in terms
of their predictability, the AIC dropped significantly
from the simple linear regression to this multiple linear regression
with Lasso Regularization, which is definitely a good sign.
We see some of these Parameter Estimates down here and of course the results
for RSquare and Root Mean Square Error in cross validation.
So looking at some of these Parameter Estimates,
the points per game, again very significant,
it's significant at the .05 alpha level and at many levels below that.
And this is to be expected, as we've explained already.
Turnover percentage is also significant
and this has a negative relationship with points.
As one would expect, if you're turning the ball over more,
that's less opportunities to shoot the ball,
less opportunities to score.
So that checks out with just our knowledge of basketball.
Here we have the defensive pace, or the pace of the defensive team,
and this has a slightly positive relationship with points conditional
on the other factors in the model.
Again, that's to be expected because the faster a team plays,
that's generally more chances.
However, it might not have such a strong effect as turnovers or points per game.
And then finally, defensive rebounding percentage.
This is just how often the defensive team hauls in the defensive rebound.
Again, this is preventing second chance points for the offense,
so it should have a negative relationship, which it does.
So all these things check out,
and then some of these other variables are insignificant conditionally,
but included because they improve the predictability of the model.
So something like usage rate might not be significant at the .05 level,
but it nonetheless improves our predictions.
Now, points per game might not be as reliable
at the beginning of the season as it is
near the end of the season,
because it is based on
a smaller sample size.
So I looked to create an alternate model without it
to see if I could still predict
better than the simple linear regression and similarly to the Lasso regression,
but without depending on a season points per game measure.
So this alternate model, I created using backward selection by AIC.
And again I left out the season points per game,
so it still includes the other points per game measurements,
or other per game measurements rather,
like field goals attempted from three and from two,
which along with some of these other variables can be a proxy for
that average points per game.
So it's not completely robust against early season fluctuations
and small sample sizes.
But it's possible that these variables might be a little bit more representative
of how a player is going to perform in the long run.
I'm thinking that maybe
players might be a little more consistent with their attempted stats,
or the rate at which they're shooting the ball,
rather than just the amount of points they get that could vary based on
just a small sample size.
These could as well, but
I'm just using this model as an alternative,
and it turns out that it actually predicts
pretty similarly to the model with points per game in it.
So it might not necessarily be favored if points per game is available,
but it provides a similar prediction.
So we can see that the RSquare is pretty similar,
may have increased a little bit,
and the Root Mean Square Error is again similar.
The AIC suggests that the other model is a better model for predicting
based on the amount of variables included, so that is definitely something to note.
But each model has its advantages.
This one, particularly,
we can see some of the conditional relationships of the other variables,
particularly those on the defensive side, that we couldn't see as much
in the other one because points per game was dominating so much.
So we see here that two pointers and three pointers attempted
are weighing in significantly for points, which again,
that makes sense because the more shots you're taking,
the more likely you are to score more points.
And then there are some other variables that make sense,
and some others that maybe are a little confusing at first sight.
So offensive win shares
being a significant variable makes sense. It's measuring
how much a player is contributing to wins on the offensive side,
so it makes sense that it has a positive relationship,
as we can see down here.
So offensive win shares right here has a positive relationship.
And also the defensive rebounding percentage down here
decreases the estimation again, so that is pretty consistent
with what we've seen in the previous model.
Average fouls per game, average personal fouls per game down here,
has a negative effect which I thought was interesting.
So a player or team with a higher foul total
is going to negatively affect the points scored for the offensive player.
This might mean they're more aggressive.
I would assume that they are playing more intense defense in limiting points
through steals, blocks, or heavily contested shots,
and as a result, they're getting more fouls called on them.
However, this is not all good for the defense
because since the player can foul out with six fouls,
there is a certain balance to strike in the defensive end.
You don't want to be too aggressive
because then you could be giving up easier points
or you could be leading your players into foul trouble.
So that's an interesting variable, I think, to consider.
And of course, this is a conditional significance,
so it might change slightly based on the removal
or addition of certain other variables.
And finally, a couple of defensive relationships
are particularly confusing at first glance, I think.
Specifically involving blocks and defender height.
So we think of basketball as being very dependent on height.
If you are taller, you're more likely to go to the NBA.
For seven-footers, I think you have something like a 20 percent chance
of going to the NBA just because you are seven feet tall.
So this is something that we think, if you block shots more and you're taller,
you're going to be affecting the offense's points negatively.
However, we see that these relationships are actually conditionally positive,
which is very interesting that average blocks per game
and the defender height, as well as the blocking rate,
are all positive relationships.
And so initially, I was confused by this, but I think this has more to say about
the players that these players are guarding rather than
the actual variables themselves.
So what I mean by this is when you consider that taller players
with better blocking stats are big men playing power forward or center,
they're guarding other big men.
And so that makes sense a little bit more.
Guards tend to put up more points in the NBA with the emphasis
on three-point shooting now.
And a lot of offenses are run through some of these smaller players
who tend to be guarded by other smaller players.
Whereas these taller players, with more blocks,
are guarding big men who maybe aren't the focal point of the offense,
aside from certain players like Jokić and Embiid and Giannis.
But this causes a positive relationship,
but it's really more a function of the position
that these players are playing.
So I thought that was an interesting
conditional relationship to highlight within the model,
and it involves a little bit deeper thinking about the relationship between
blocking, height, and the points that an offensive player puts up.
So in terms of the model overall,
we can see that this one and the Lasso regularized model
are very similar in their predictions.
We'll see that in more detail soon when I flip to the 2021 predictions,
but the choice then might not be too significant.
Both have their advantages.
Of course, this one allows us to see the significance of more of the variables,
specifically the defensive variables,
whereas the first one has a slightly lower AIC
and might be better for predictions if the data is available.
So both leave a little bit to be desired
in terms of predicting much more reliably than the simple linear regression.
I would have liked to see this Root Mean Square Error decrease more,
and it's something I would look into as I continue this project,
looking to gather better data, trying to make the matchups
more individualized without sacrificing the normality of the response variable,
things like that.
But with that being said, these are the models that I have now,
and we can look to test these on the 2021 season thus far.
So if I switch over here to the matchups for 2021,
we have the same data table just with this year's matchups.
And so I'm just on Josh Hart's matchups right now
because I go to Villanova and he's a Villanova great,
so here we have him as a player, his team, and the defensive teams,
his stats, their stats.
As we've already seen, these are all the variables
that could be included in the model.
I apologize for the quick panning,
but here we have the predictions at the end.
So these prediction variables, or rather these columns,
are the model predictions.
So our first one, our first model, is the Optimal Multiple Linear Regression
that's using the Lasso regression and including the points per game variable.
As you can see in this game,
he is predicted to put up 12.8 points and the residual here is about five.
So in reality this residual is the points minus the prediction.
So he actually put up, this is around 13 and this is around five,
so he put up about 18 points here instead of our predicted 13.
So that's obviously not a great measure.
And then we can look at the alternative model.
The alternative model prediction is very similar,
and we will see this in greater detail as I flip back to the PowerPoint and show you
kind of a condensed version of this data table,
because right now I know that this in JMP is a little bit overwhelming
because there's so many variables, so many random numbers
being thrown at you.
So I'm going to switch back
into the PowerPoint to show you some of the predictions
for both Josh Hart here and for Kevin Durant.
All right, so now that we have our two models
and the simple linear regression to compare it to,
we can apply these predictions to games that Kevin Durant has played this season.
So here are four games played in Atlanta, Charlotte, Chicago, and Cleveland.
In the first one, he scored 31 points.
In actuality, both our models predicted close to 26 points,
so the residual is just around five.
However, we see that when Kevin Durant is close to his average points
around 28 or 29 points per game,
when he actually scores that, the predictions are very close
because points per game weighs so heavily in these models.
So, for example, in this Cleveland game,
our first model predicted 26.2 points, he actually scored 27.
Therefore, the residual is less than one.
And similarly here, the residual's less than one
for the alternative model as well.
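The residual arithmetic used throughout these comparisons is simply the actual points minus the predicted points. A tiny Python sketch, using only the figures quoted in the talk (the function name is a hypothetical helper, and 26.0 stands in for the prediction described as "close to 26"):

```python
def residual(actual_points, predicted_points):
    """Residual as used in the talk: actual points minus predicted points."""
    return actual_points - predicted_points

# Two Kevin Durant games as quoted in the talk.
first_game = residual(31, 26.0)      # scored 31, predicted close to 26
cleveland_game = residual(27, 26.2)  # scored 27, first model predicted 26.2

print(round(first_game, 1), round(cleveland_game, 1))
```

A positive residual means the player outscored the prediction; a residual near zero means the game was close to what the model expected.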
So we see that the model excels when players perform close to average,
which they will do most of the time. But there's obviously variation,
like in this game against Charlotte, he had a particularly good game.
He scored 38 points.
It's a great game for anyone, and a particularly good game even by Kevin Durant's standards.
And so the model predictions are much farther off.
And we see the same thing with Josh Hart.
His points average is a little bit lower because he, again, carries a lighter load.
Kevin Durant is a star player on the Nets,
so he's getting a lot of touches and taking a good portion of those shots.
So here we have Hart's production.
Again, his points average is closer to this 12-to-14 range.
So the residuals here are very small.
When he puts up 14,
our models predicted around 13.5 and the residuals are around 0.5.
So we have really good predictions for when he scores
an average amount of points or close to an average amount of points.
But when he scores much less or much more,
the predictions tend to suffer a little bit.
So with all this being said, I know I ran through these models.
Let's take a step back and look at the limitations in detail
and some further study that I could pursue.
So first of all, these are using averages by the offensive player and defensive team
as opposed to offensive player and individual defender.
As I've mentioned, that would be the ideal scenario,
but as I was doing it, it didn't work out like that.
So that could be causing some problems because it's duplicating some results.
Instead of being sort of a one-to-one, where
if a player is being guarded by this player,
this is how many points they will score,
it's: if a player is being guarded by a combination of players on this team,
this is how many points they will score.
So it's easier to predict but maybe a little less accurate.
Additionally, the averages are for the entire season,
which means predictions toward the beginning may be less accurate,
as I mentioned, which is something that I think the career values
try to remedy a little bit.
It could be worth adding things like points per game
in the season variables with a lag of maybe five seasons.
So taking the average points per game for the last season, two seasons ago,
three seasons ago, and so on to try to add some more variables
and maybe capture a time trend in player performance.
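One way such lagged variables could be built is sketched below in Python. The function name, the column naming scheme, and the sample averages are all hypothetical; players with fewer than five prior seasons simply get missing values for the older lags.

```python
def lag_features(season_ppg, n_lags=5):
    """Build lagged points-per-game features from prior seasons.

    season_ppg: prior-season averages ordered oldest to newest,
    so the last element is the most recent completed season.
    """
    features = {}
    for lag in range(1, n_lags + 1):
        has_season = lag <= len(season_ppg)
        features[f"ppg_lag{lag}"] = season_ppg[-lag] if has_season else None
    return features

# Made-up example: a player with only three prior seasons.
example = lag_features([10.0, 12.5, 15.0], n_lags=5)
print(example)
```

How to treat the missing lags for younger players (dropping rows, imputing, or falling back to career averages) would be a modeling decision of its own.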
Additionally, player performance is dependent on countless other factors
such as cold or hot streaks,
how well they've been performing as of late,
injuries on their own team, injuries on the opposing team
that could increase or decrease their role,
minor injuries that they are dealing with themselves
that could decrease or increase their role.
And then things like rest days,
travel time, and a lot of other intangibles
like NBA players are human and we all know some days we're not feeling the best,
other days we're feeling great, more energetic.
Those types of things could lead to better performance or worse performance.
So these intangibles are
not going to be something that I can factor into the model
but it's important to recognize that they could still affect
the points in a given game.
In terms of further study,
I would love to kind of rectify all these limitations
and look to predict a more holistic variable
such as offensive rating.
So offensive rating measures a player's points per 100 possessions
contributed to the game instead of just one aspect of the game in points.
I would love to predict something like this or flip it to the defensive side
and predict a defensive rating based on who they're likely
to go up against on offense.
So something like that I think would be really cool
and it would extend the application of this more toward coaches
and team analysts instead of maybe some of the fantasy basketball players
who are looking for a strict measurement like points or something like that.
So with all this being said,
I'm definitely going to continue working on this project.
It was a lot of fun and I love looking at these models
and interpreting from a basketball standpoint what's going on.
If you have any questions, please feel free to put those questions
in my community post in the comments. I'll be happy to answer them.
And if you have any suggestions as well for further study,
I would also be happy to take those on. Thank you so much for your time.