Hello, everyone. My name is Kevin O'Donnell.

I'm a JMP Global Customer Reference Intern,

and today I'll be presenting on a personal analytics project

that I've done: the NBA Player Matchup Model.

So just to get started,

I would like to go over my idea and the motivations for building the model.

I've always been incredibly interested in basketball.

I've been a huge basketball fan for my whole life,

and I was looking for a new analytics project to dive into,

so I ended up asking this question here:

Could we model a player's offensive success

based on their averages and the strength of their opponents as well?

So as most of us know from watching any sport,

player performance varies greatly from game to game.

It's based on a variety of factors,

one of these being a player's average points.

That will be a very strong predictor for points in an NBA game,

but it's far from the only influence on a player's offensive performance.

So going into this project,

I wanted to build a model that predicted a player's points in a game

based on the points per game,

but also a lot of other variables which we will look into on the next slide.

So ideally, this will be helpful for coaches and team analysts to,

for example, determine which of their offensive players are likely

to overperform or perform really well

based on their matchups and, vice versa, which of their players they might want

to avoid on offense because they're being guarded

by some of the best defensive players on the other team.

This could inform play calling and game planning,

so if a particular defender is weaker, coaches may look to target that matchup

or look to force a switch onto their best scorer.

In the entertainment realm,

fantasy basketball players could use this data to figure out

who to start or who to pick up off the Waiver Wire,

so it has a broad range of applications.

So here's a little bit of the overview of the data.

I'll go through this quickly, and we'll see this as I go

into the JMP demo a little bit more in detail,

but we're going to predict points per game based on seven main categories of data.

So we're going to use player offensive averages for the season,

this includes things like points per game,

three- and two-point percentages, and advanced metrics such as usage rate,

which records how often an offensive player is used on their team.

Player attributes such as height, wing span, vertical leap

will also be included for both the offensive and defensive players

because these physical attributes, I would assume, also contribute

to how many points a player will score.

Finally, in terms of the offensive players,

we'll use career averages.

The same averages that we're using for the season,

we'll also use for the career to add a little more robustness to the model,

and overall team pace and offensive rebounding percentage

are also going to be predictors because they will determine

how many possessions there will be.

If there are more possessions in the game, it's more likely that any given player

will have more points.

And so, by that same token, defensive team pace

and defensive rebounding percentage will also be used.

In terms of individual defensive stats, things like steals, blocks, fouls,

and other advanced metrics such as defensive win shares, which measures

an individual player's contribution to the team's wins on defense,

are also going to be predictors that

probably reduce the number of points scored by an offensive player.

And then finally, the defender attributes will be used as I mentioned earlier.

So before we get right into the data and the modeling in JMP,

I would like to go over the matchup data that I was using

for the bulk of the model.

So the NBA recently implemented some new personal matchup data collection

and it's based on detailed player tracking.

So it tracks the closest defender at every point,

not just the primary defender on a play.

It only tracks front court time and it will track partial possessions.

So what this means is that a player could be guarded by

as many as five different players on a single play,

and each defender would be awarded the respective amount

of matchup minutes for that possession.

So here we have an example of Terry Rozier for the Hornets

being guarded by four different players on different teams.

In the first row, Steph Curry is guarding him

in only one game for 2.7 minutes and allowed two points.

So if we think of a hypothetical where the Hornets are playing

Steph Curry's Golden State Warriors and Curry guarded Rozier for 10 seconds,

and then Klay Thompson switched on to Rozier for another 10 seconds,

both would be awarded those 10 seconds of matchup time.

However, if Rozier scored a two pointer at the end of this possession,

the points would be marked for Klay Thompson.

This is really cool tracking data and I love how specific it is,

but it did cause some problems when I tried to model the points per minute

for each player and defender combination, which was my original plan.

So I was originally going to use the offensive player stats

and defensive player stats that I mentioned in the previous slide

in every individual matchup.

But since many defenders, like you'll see here,

logged very small amounts of time,

like Stanley Johnson only guarded Rozier for an average of under one minute a game,

that's not enough time

for some of these points per minute measurements to be normal.

So going back to that Klay Thompson hypothetical,

if Klay only guarded Rozier for 10 seconds the whole game

and Rozier happened to score two points in that possession,

then Rozier's points per minute against Klay in that row would be 12.

So obviously, extrapolating that to a full game,

even excluding the back court time as this data does, it's very unrealistic.
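The arithmetic behind that 12 can be spelled out in a few lines. This is only a minimal sketch: `points_per_minute` is a hypothetical helper name, not something from the NBA tracking data or from JMP.

```python
def points_per_minute(points: float, matchup_seconds: float) -> float:
    """Scoring rate for a single offensive-player/defender row."""
    if matchup_seconds <= 0:
        raise ValueError("matchup time must be positive")
    # points * (60 seconds per minute) / seconds of matchup time
    return points * 60.0 / matchup_seconds


# The hypothetical from the talk: one two-pointer in 10 seconds of matchup time.
klay_rate = points_per_minute(points=2, matchup_seconds=10)
print(klay_rate)  # 12.0
```

The shorter the matchup time, the more a single basket inflates the rate, which is why rows with only a few seconds of matchup time distort the response.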

So I had to go with a slightly different, less ideal approach.

Instead of using the individual defensive stats,

I actually averaged them out

for each combination of player and defensive team.

So instead of using Rozier versus Curry and Rozier versus Thompson,

I would use Rozier versus the entire Golden State Warriors average based on

the amount of matchup minutes that each player defended Rozier.

So for example,

if Steph Curry and Klay Thompson both guarded him for half the amount

of possible matchup minutes, then the team average,

let's say steals per game, would just be the arithmetic mean

of Curry's steals and Klay's steals per game.

The same would go for every defensive variable

and that's obviously a simple example,

but the same goes for every player versus every team.
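The aggregation described above amounts to a matchup-minutes-weighted average of each defensive stat. A minimal sketch, assuming one record per defender; the function name, the dictionary keys, and all numbers are made up for illustration and are not the columns of the actual data table.

```python
def team_weighted_average(defenders, stat):
    """Average one defensive stat over a team's defenders, weighted by matchup minutes.

    defenders: list of dicts, each with a "matchup_minutes" key and the stat key.
    """
    total_minutes = sum(d["matchup_minutes"] for d in defenders)
    if total_minutes <= 0:
        raise ValueError("need positive total matchup minutes")
    weighted_sum = sum(d[stat] * d["matchup_minutes"] for d in defenders)
    return weighted_sum / total_minutes


# Made-up numbers, for illustration only (not real stats).
equal_split = [
    {"name": "Defender A", "matchup_minutes": 5.0, "steals_per_game": 1.5},
    {"name": "Defender B", "matchup_minutes": 5.0, "steals_per_game": 0.5},
]
print(team_weighted_average(equal_split, "steals_per_game"))  # 1.0, the plain mean

uneven_split = [
    {"name": "Defender A", "matchup_minutes": 3.0, "steals_per_game": 1.5},
    {"name": "Defender B", "matchup_minutes": 1.0, "steals_per_game": 0.5},
]
print(team_weighted_average(uneven_split, "steals_per_game"))  # 1.25, pulled toward A
```

With an even split of minutes the weighted average reduces to the arithmetic mean, as in the Curry and Thompson example; with an uneven split, the defender who logged more matchup time counts for more.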

So obviously this is not ideal because it minimizes

the individuality of the matchup.

But I had to abandon that points per minute approach

because the samples were too small and the response was heavily distorted.

So this model that I've created with the aggregated data is more accurate

than my initial attempts, even though it sacrifices some individuality.

So at this point, I'm going to switch into JMP here

and do a little bit of the demo on how I built the model

and some exploratory data analysis to begin.

So to begin, we're just going to look at some marginal relationships

with points in the Graph Builder,

and I'm able to choose any I want.

But for now we're going to look specifically at three variables,

the first being points per game.

So we see here, we have a moderate relationship

between points and points per game, and it is positive as we would expect.

The average points that someone puts up in a game,

or over the course of the season rather, is going to obviously influence

how many points they score in a game.

Similarly, something like usage rate,

the advanced statistic that I was discussing earlier,

also has a positive relationship,

just slightly weaker than the relationship between points and points per game.

Finally, if we look at some of the career stats,

if we look at career points per game, then again we see a positive relationship,

but that is a little bit weaker because it's averaged over the course

of a player's career rather than the season that we're currently in.

But hopefully this could help adjust for some major differences

in the points per game totals or averages rather.

So now that we have a sort of an idea of the data itself,

we can move into the simple linear regression

with points by points per game as a benchmark for this model.

So as I mentioned,

you can predict based on just points per game

and it will give you a decent prediction.

But we're looking to improve that prediction by adding

some of these different variables, offensive and defensive.
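For readers who want to see what the benchmark amounts to outside of JMP, a simple linear regression of game points on season points per game has a closed-form fit. This is a minimal sketch with made-up numbers; `fit_simple_regression` and `predict_points` are hypothetical names, and nothing here reproduces the JMP script shown in the demo.

```python
def fit_simple_regression(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up (season points per game, points in one game) pairs, for illustration only.
ppg = [8.0, 12.0, 15.0, 20.0, 27.0]
points = [5, 14, 13, 25, 24]

intercept, slope = fit_simple_regression(ppg, points)


def predict_points(season_ppg):
    """Benchmark prediction from season points per game alone."""
    return intercept + slope * season_ppg


print(round(predict_points(18.0), 1))
```

Any richer model has to beat this one-predictor baseline to justify its extra variables.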

So if we run this script here,

this is just comparing some runs of a simple linear regression

on the training and validation set using KFold validation throughout.

And if I run here using the Hidden Validation 2 column

that I'll be using for the remainder of these models,

then we can see the regression here.

So here's a regression plot. It looks pretty scattered.

There's not a clear linear relationship, but the RSquare is moderate,

meaning that 46.5 percent of the variation in points is accounted for

by this average points per game, which is pretty good.

And the Root Mean Square Error is about 5.27.

The Root Mean Square Error is the standard deviation of the residuals,

so it's essentially the variation in how well our model is predicting.

Its standard deviation is around five.
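Both summary numbers can be computed directly from actual and predicted values. The sketch below is an illustration with made-up data; it assumes the Root Mean Square Error divides the residual sum of squares by the error degrees of freedom (rows minus fitted parameters), which is the usual least-squares convention, and `fit_metrics` is a hypothetical name.

```python
import math


def fit_metrics(actual, predicted, n_params=2):
    """Return (RSquare, RMSE) for a fitted model.

    n_params counts fitted coefficients, including the intercept
    (2 for a simple linear regression).
    """
    n = len(actual)
    mean_actual = sum(actual) / n
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # residual sum of squares
    sst = sum((a - mean_actual) ** 2 for a in actual)           # total sum of squares
    r_squared = 1 - sse / sst
    rmse = math.sqrt(sse / (n - n_params))
    return r_squared, rmse


# Made-up values, for illustration only.
actual = [10, 20, 30, 40]
predicted = [12, 18, 33, 37]
r_squared, rmse = fit_metrics(actual, predicted)
print(round(r_squared, 3), round(rmse, 3))  # 0.948 3.606
```

An RSquare of 0.465 means the residual sum of squares is 53.5 percent of the total, and an RMSE near five means a typical miss of roughly five points per game.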

And we can also look at some other measures,

such as the AIC, which is used to compare models for predictability.

So AIC is another important measure,

and it measures how well the model will predict

relative to the number of predictors put in the model,

just to make sure you're not adding way too many.

So in this case, the AIC is very high, and we can come back to this number

as we look at the comparison between this and the multiple linear regression

that I will show later.
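The trade-off AIC encodes can be sketched with the common least-squares form, n * ln(SSE / n) + 2k, where k is the number of fitted parameters. This is an assumption for illustration: software reports may use a corrected variant (AICc) with extra constants, so absolute values will differ even though comparisons between models fit to the same rows work the same way. All numbers below are made up.

```python
import math


def aic_least_squares(sse, n_rows, n_params):
    """AIC for a Gaussian least-squares fit, up to an additive constant."""
    return n_rows * math.log(sse / n_rows) + 2 * n_params


n_rows = 100

# Made-up residual sums of squares for three candidate models.
baseline = aic_least_squares(sse=2800.0, n_rows=n_rows, n_params=2)
barely_better = aic_least_squares(sse=2700.0, n_rows=n_rows, n_params=10)
much_better = aic_least_squares(sse=2000.0, n_rows=n_rows, n_params=10)

# Eight extra parameters are not worth a small drop in error...
print(barely_better > baseline)  # True
# ...but they are worth a large one. Lower AIC is better.
print(much_better < baseline)    # True
```

Each extra parameter costs two AIC units, so added predictors only lower the AIC when they reduce the error by enough to pay for themselves.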

So while I was trying to pick the model that I was going to choose,

I decided to use the model screening feature in JMP,

which allows you to select your response variables and all the factor variables.

I ended up putting in all of the numeric variables that I had for a full model

just to see initially which models perform the best.

And I was able to choose from a variety of these different methods,

including XGBoost and Generalized Regression,

and then of course, just the normal Least Squares.

In the interest of time, this would take forever to run

because it's using some of these machine learning algorithms.

So I'm instead just going to pull up a quick screenshot of the model screening.

So here we have the output from this model screening.

It shows the RSquare.

Again, I was using KFold validation with ten folds here.

So this shows the RSquare

and the Root Mean Square Error for each run.
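Ten-fold validation means every row is held out exactly once while the model is fit on the other nine folds. A minimal sketch of how such folds can be built; `kfold_indices` is a hypothetical helper, and the random assignment here is an assumption that does not reproduce JMP's own validation column.

```python
import random


def kfold_indices(n_rows, k=10, seed=0):
    """Yield (training_rows, validation_rows) index lists for each of k folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)      # reproducible random assignment
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        validation = folds[i]
        training = [row for j, fold in enumerate(folds) if j != i for row in fold]
        yield training, validation


# Made-up table size, for illustration only.
for fold_number, (training, validation) in enumerate(kfold_indices(n_rows=25, k=10)):
    # Fit on `training`, then compute RSquare and RMSE on `validation`.
    print(fold_number, len(training), len(validation))
```

The per-run RSquare and RMSE values in a screening report come from scoring each held-out fold with the model fit on the remaining rows.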

So we can see that the Least Squares fit actually had

a very strong fit compared to some of the machine learning algorithms,

which I was surprised by. But it actually helps for interpretability

because some of these machine learning algorithms,

they're more of a black box. And if I were to use those,

it wouldn't be as easy to interpret the coefficients or see

which variables are truly significant

or which are just being used for prediction.

So because the fit of Least Squares and the Lasso Regression,

which I'll get into a little bit later, were so high,

it's actually a good sign that I'm able to use that for better interpretability.

So as we can see, the most accurate model is this multiple linear regression

with Lasso Regularization.

Lasso Regularization is a statistical technique that penalizes large coefficients,

shrinking some to exactly zero, so it selects features and can reduce multicollinearity.

Multicollinearity is the correlation between predictors

and that can negatively affect the model.

So using this technique,

we're able to take the full linear regression model shown here

and remove some of the variables to minimize the multicollinearity

and maybe satisfy some of the linear assumptions better.
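To make the variable-selection idea concrete, here is a small Lasso fit by coordinate descent with soft-thresholding. It is a sketch under stated assumptions, not the Generalized Regression implementation in JMP: predictors are standardized, the response is centered, the objective is (1/2n) times the residual sum of squares plus the penalty times the sum of absolute coefficients, and every name and number is made up.

```python
import math


def standardize(values):
    """Center a column and scale it so its mean square is 1."""
    n = len(values)
    mean = sum(values) / n
    scale = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / scale for v in values]


def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; exactly 0 inside [-penalty, penalty]."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_coordinate_descent(columns, y, penalty, n_sweeps=200):
    """Approximately minimize (1 / 2n) * sum(residual^2) + penalty * sum(|coef|).

    columns: standardized predictor columns; y: centered response.
    """
    n = len(y)
    coefs = [0.0] * len(columns)
    residual = list(y)  # equals y - fitted values while all coefs are zero
    for _ in range(n_sweeps):
        for j, col in enumerate(columns):
            # Covariance of column j with the residual that leaves column j out.
            rho = sum(c * (r + c * coefs[j]) for c, r in zip(col, residual)) / n
            new_coef = soft_threshold(rho, penalty)
            delta = new_coef - coefs[j]
            residual = [r - c * delta for c, r in zip(col, residual)]
            coefs[j] = new_coef
    return coefs


# Made-up data: x2 nearly duplicates x1, so the two predictors are highly correlated.
x1 = standardize([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = standardize([1.1, 1.9, 3.2, 3.9, 5.1, 5.8])
raw_y = [2.0, 4.5, 5.5, 8.5, 9.5, 12.5]
mean_y = sum(raw_y) / len(raw_y)
y = [v - mean_y for v in raw_y]

for penalty in (0.0, 0.5, 10.0):
    print(penalty, lasso_coordinate_descent([x1, x2], y, penalty))
```

With a large enough penalty every coefficient is exactly zero, and lowering the penalty lets predictors enter one at a time. The variables that survive at the chosen penalty are the ones carried into the follow-up multiple linear regression.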

So with that information,

I was able to take the variables selected with Lasso Regularization

and then create a multiple linear regression using those variables.

So as you can see here,

the Actual by Predicted Plot looks pretty similar.

It might be slightly closer to the line of best fit here, which is a good sign.

And we can see our effect summaries,

but we're going to scroll a little bit past that

so that we get to the summary of our fit.

So compared to the simple linear regression,

the RSquare and the Adjusted RSquare are very similar and that's probably

because points per game is such a heavily influential factor in this model,

as well as in the simple linear regression, obviously.

So the fit is not too much different.

However, adding these different variables shown here does improve

the Root Mean Square Error a little bit.

It went from about 5.3 to a little over 5.2, which is not a drastic improvement,

but an improvement nonetheless.

But the real difference here is in the AIC and comparing these models in terms

of their predictability, the AIC dropped significantly

from the simple linear regression to this multiple linear regression

with Lasso Regularization, which is definitely a good sign.

We see some of these Parameter Estimates down here and of course the results

for RSquare and Root Mean Square Error in cross validation.

So looking at some of these Parameter Estimates,

points per game is again very significant;

it's significant at the .05 alpha level and at levels well below that.

And this is to be expected, as we've explained already.

Turnover percentage is also significant

and this has a negative relationship with points.

As one would expect, if you're turning the ball over more,

that's fewer opportunities to shoot the ball,

fewer opportunities to score.

So that checks out with just our knowledge of basketball.

Here we have the defensive pace, or the pace of the defensive team,

and this has a slightly positive relationship with points conditional

on the other factors in the model.

Again, that's to be expected because the faster a team plays,

that's generally more chances.

However, it might not have such a strong effect as turnovers or points per game.

And then finally, defensive rebounding percentage.

This is just how often the defensive team hauls the defensive rebound.

Again, this is preventing second chance points for the offense,

so it should have a negative relationship, which it does.

So all these things check out,

and then some of these other variables are insignificant conditionally,

but included because they improve the predictability of the model.

So something like usage rate might not be significant at the .05 level,

but it nonetheless improves our predictions.

Now, points per game might not be available

at the beginning of the season to the extent that it is

near the end of the season,

meaning that points per game might be a little less reliable

with a smaller sample size.

So I looked to create an alternate model without it

to see if I could still predict

better than the simple linear regression and similar to the Lasso regression,

but without dependency on a points per game for the season measure.

So this alternate model, I created using backward selection by AIC.

And again I left out the season points per game,

so it still includes the other points per game measurements,

or other per game measurements rather,

like field goals attempted from three and from two,

which along with some of these other variables can be a proxy for

that average points per game.

So it's not completely robust against early season fluctuations

and small sample sizes.

But it's possible that these variables might be a little bit more representative

of how a player is going to perform in the long run.

I'm thinking that maybe

players might be a little more consistent with their attempted stats,

or the rate at which they're shooting the ball,

rather than just the amount of points they get that could vary based on

just a small sample size.

These could as well, but

I'm just using this model as an alternative,

and it turns out that it actually predicts

pretty similarly to the model with points per game in it.

So it might not necessarily be favored if points per game is available

and appropriate, but it provides a similar prediction.

So we can see that the RSquare is pretty similar,

may have increased a little bit,

and the Root Mean Square Error is again similar.

The AIC suggests that the other model is a better model for predicting

based on the amount of variables included, so that is definitely something to note.

But each model has its advantages.

This one, particularly,

we can see some of the conditional relationships of the other variables,

particularly those on the defensive side, that we couldn't see as much

in the other one because points per game was dominating so much.

So we see here that two pointers and three pointers attempted

are weighing in significantly for points, which again,

that makes sense because the more shots you're taking,

the more likely you are to score more points.

And then there are some other variables that make sense,

and some others that maybe are a little confusing at first sight.

So offensive win shares

being a significant variable makes sense. It's measuring

how much a player is contributing to wins on the offensive side,

so it makes sense that it has a positive relationship

as we can see down here.

So offensive win shares right here has a positive relationship.

And also the defensive rebounding percentage down here

decreases the estimation again, so that is pretty consistent

with what we've seen in the previous model.

Average fouls per game, average personal fouls per game down here,

has a negative effect which I thought was interesting.

So a player or team with a higher foul total

is going to negatively affect the points scored for the offensive player.

This might mean they're more aggressive.

I would assume that they are playing more intense defense in limiting points

through steals, blocks, or heavily contested shots,

and as a result, they're getting more fouls called on them.

However, this is not all good for the defense

because since the player can foul out with six fouls,

there is a certain balance to strike in the defensive end.

You don't want to be too aggressive

because then you could be giving up easier points

or you could be leading your players into foul trouble.

So that's an interesting variable, I think, to consider.

And of course, this is a conditional significance,

so it might change slightly based on the removal

or addition of certain other variables.

And finally, a couple of defensive relationships

are particularly confusing at first glance, I think.

Specifically involving blocks and defender height.

So we think of basketball as being very dependent on height.

If you are taller, you're more likely to go to the NBA.

I think seven-footers have something like a 20 percent chance

of making the NBA just by being seven feet tall.

So this is something that we think, if you block shots more and you're taller,

you're going to be affecting the offense's points negatively.

However, we see that these relationships are actually conditionally positive,

which is very interesting that average blocks per game

and the defender height, as well as the blocking rate,

are all positive relationships.

And so initially, I was confused by this, but I think this has more to say about

the players that these players are guarding rather than

the actual variables themselves.

So what I mean by this is when you consider that taller players

with better blocking stats are big men playing power forward or center,

they're guarding other big men.

And so that makes sense a little bit more.

Guards tend to put up more points in the NBA with the emphasis

on three-point shooting now.

And a lot of offenses are run through some of these smaller players

who tend to be guarded by other smaller players.

Whereas these taller players, with more blocks,

are guarding big men who maybe aren't the focal point of the offense,

aside from certain players like Jokić and Embiid and Giannis.

But this causes a positive relationship,

but it's really more a function of the position

that these players are playing.

So I thought that was an interesting

conditional relationship to highlight within the model,

and it involves a little bit deeper thinking about the relationship between

blocking, height, and the points that an offensive player puts up.

So in terms of the model overall,

we can see that this one and the Lasso regularized model

are very similar in their predictions.

We'll see that in more detail soon when I flip to the 2021 predictions,

but the choice then might not be too significant.

Both have their advantages.

Of course, this one allows us to see the significance of more of the variables,

specifically the defensive variables,

whereas the first one has a slightly lower AIC

and might be better for predictions if the data is available.

So both leave a little bit to be desired

in terms of predicting much more reliably than the simple linear regression.

I would have liked to see this Root Mean Square Error decrease more,

and it's something I would look into as I continue this project,

looking to gather better data, trying to make the matchups

more individualized without sacrificing the normality of the response variable,

things like that.

But with that being said, these are the models that I have now,

and we can look to test these on the 2021 season thus far.

So if I switch over here to the matchups for 2021,

we have the same data table just with this year's matchups.

And so I'm just on Josh Hart's matchups right now

because I go to Villanova and he's a Villanova great,

so here we have him as a player, his team, and the defensive teams,

his stats, their stats.

As we've already seen, these are all the variables

that could be included in the model.

I apologize for the quick panning,

but here we have the predictions at the end.

So these prediction variables, or rather these columns,

are the model predictions.

So our first one, our first model, is the Optimal Multiple Linear Regression

that's using the Lasso regression and including the points per game variable.

As you can see in this game,

he is predicted to put up 12.8 points and the residual here is about five.

So in reality this residual is the points minus the prediction.

So he actually put up, this is around 13 and this is around five,

so he put up about 18 points here instead of our predicted 13.

So that's obviously not a great measure.

And then we can look at the alternative model.

The alternative model prediction is very similar,

and we will see this in greater detail as I flip back to the PowerPoint and show you

kind of a condensed version of this data table,

because right now I know that this in JMP is a little bit overwhelming

because there's so many variables, so many random numbers

being thrown at you.

So I'm going to switch back

into the PowerPoint to show you some of the predictions

for both Josh Hart here and for Kevin Durant.

All right, so now that we have our two models

and the simple linear regression to compare it to,

we can apply these predictions to games that Kevin Durant has played this season.

So here are four games played in Atlanta, Charlotte, Chicago, and Cleveland.

In the first one, he actually scored 31 points.

Both our models predicted close to 26 points,

so the residual is just around five.

However, we see that when Kevin Durant is close to his average points

around 28 or 29 points per game,

when he actually scores that, the predictions are very close

because points per game weighs so heavily in these models.

So, for example, in this Cleveland game,

our first model predicted 26.2 points, he actually scored 27.

Therefore, the residual is less than one.

And similarly here, the residual's less than one

for the alternative model as well.
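The residuals quoted above follow the definition given earlier in the demo: actual points minus predicted points. A tiny sketch using the figures stated in the talk; `residual` is a hypothetical helper name.

```python
def residual(actual_points, predicted_points):
    """Residual as defined in the talk: actual points minus predicted points."""
    return actual_points - predicted_points


# Cleveland game, first model: 26.2 predicted, 27 scored.
cleveland = residual(actual_points=27, predicted_points=26.2)
print(round(cleveland, 1))  # 0.8

# Atlanta game: 31 scored, "close to 26" predicted (26 is the rounded figure quoted).
atlanta = residual(actual_points=31, predicted_points=26)
print(atlanta)  # 5
```

A positive residual means the player scored more than the model expected, so a big night shows up as a large positive number.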

So we see that the model excels when players perform close to average,

which they will do most of the time. But there's obviously variation,

like in this game against Charlotte, he had a particularly good game.

He scored 38 points.

It's a great game, a particularly good game even by Kevin Durant standards.

And so the model predictions are much farther off.

And we see the same thing with Josh Hart.

His points average is a little bit lower because he, again, carries a lighter load.

Kevin Durant is a star player on the Nets,

so he's getting a lot of touches and taking a good portion of those shots.

So here we have Hart's production.

Again, his points average is closer to this 12-to-14 range.

So the residuals here are very small.

When he puts up 14,

our models predicted around 13.5 and the residuals are around 0.5.

So we have really good predictions for when he scores

an average amount of points or close to an average amount of points.

But when he scores much less or much more,

then the predictions tend to suffer a little bit.

So with all this being said, I know I ran through these models.

Let's take a step back and look at the limitations in detail

and some further study that I could embrace.

So first of all, these are using averages by the offensive player and defensive team

as opposed to offensive player and individual defender.

As I've mentioned, that would be the ideal scenario,

but as I was doing it, it didn't work out like that.

So that could be causing some problems because it's duplicating some results.

Instead of being sort of a one-to-one,

"if a player is being guarded by this player,

this is how many points they will score,"

it's "if a player is being guarded by a combination of players on this team,

this is how many points they will score."

So it's easier to predict but maybe a little less accurate.

Additionally, the averages are for the entire season,

which means predictions toward the beginning may be less accurate,

as I mentioned, which is something that I think the career values

try to remedy a little bit.

It could be worth adding things like lagged points per game

to the season variables, with lags of up to maybe five seasons.

So taking the average points per game for the last season, two seasons ago,

three seasons ago, and so on to try to add some more variables

and maybe pick up a time trend in player performance.
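Lagged season averages of this kind are straightforward to build once season-level points per game are available. A minimal sketch with made-up numbers; `lagged_ppg_features` and the `ppg_lag_N` column names are hypothetical, and seasons with no data simply come back as `None`.

```python
def lagged_ppg_features(ppg_by_season, season, n_lags=5):
    """Return points-per-game values from the n_lags seasons before `season`.

    ppg_by_season: dict mapping a season's start year to that season's average.
    Missing seasons (for example, before a player's rookie year) map to None.
    """
    features = {}
    for lag in range(1, n_lags + 1):
        features[f"ppg_lag_{lag}"] = ppg_by_season.get(season - lag)
    return features


# Made-up season averages keyed by season start year, for illustration only.
history = {2016: 9.5, 2017: 12.0, 2018: 14.2, 2019: 13.1, 2020: 15.8}
print(lagged_ppg_features(history, season=2021, n_lags=3))
# {'ppg_lag_1': 15.8, 'ppg_lag_2': 13.1, 'ppg_lag_3': 14.2}
```

Each lag becomes its own predictor column, so a model could weigh last season more heavily than seasons further back.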

Additionally, player performance is dependent on countless other factors

such as cold or hot streaks,

how well they've been performing as of late,

injuries on their own team, injuries on the opposing team

that could increase or decrease their role,

minor injuries that they are dealing with themselves

that could decrease or increase their role.

And then things like rest days,

travel time, and a lot of other intangibles

like NBA players are human and we all know some days we're not feeling the best,

other days we're feeling great, more energetic.

Those types of things could lead to better performance or worse performance.

So these intangibles are

not going to be something that I can factor into the model

but it's important to recognize that they could still affect

the points in a given game.

In terms of further study,

I would love to kind of rectify all these limitations

and look to predict a more holistic variable

such as offensive rating.

So offensive rating measures a player's points contributed per 100 possessions

instead of just one aspect of the game in points.

I would love to predict something like this or flip it to the defensive side

and predict a defensive rating based on who they're likely

to go up against on offense.

So something like that I think would be really cool

and it would extend the application of this more toward coaches

and team analysts instead of maybe some of the fantasy basketball players

who are looking for a strict measurement like points or something like that.

So with all this being said,

I'm definitely going to continue working on this project.

It was a lot of fun and I love looking at these models

and interpreting from a basketball standpoint what's going on.

If you have any questions, please feel free to put those questions

in my community post in the comments. I'll be happy to answer them.

And if you have any suggestions as well for further study,

I would also be happy to take those on. Thank you so much for your time.