My name is Jianfeng Ding.
I'm a research statistician developer at JMP IND.
Today, I'm going to show you how I use JMP to explore the tennis data
and find out who is the GOAT, the greatest of all time, in men's tennis.
First,
I would like to give you some background information
on why I chose this topic.
When I heard that the JMP Discovery Summit 2023 would be held in Indian Wells,
I got excited as tennis is one of my favorite sports
and my youngest son plays varsity tennis at his high school.
I have watched a lot of tennis over the years.
Indian Wells is home to the Indian Wells Masters,
which is often called the fifth Grand Slam.
I thought it would be fun to use JMP to explore and analyze the tennis data
and present the results to our users at the Indian Wells Discovery Summit.
The second motivation comes from JMP.
JMP has grown bigger and richer in many ways.
There are so many wonderful features created by my colleagues.
I would like to keep myself updated with these new cool features
by applying them to the project.
Currently, you are seeing one of them, Application Builder.
Instead of using PowerPoint,
I'm using JMP Application Builder for today's presentation.
My presentation mainly includes two parts.
I will take you on a tour to explore the ATP data from 2022.
ATP stands for Association of Tennis Professionals,
which is the world governing body for men's tennis.
Then we will look at 25 years of combined ATP data to find out who is the GOAT.
First, where do I get the data?
I get the data from a GitHub repository on the web,
which was created and is maintained by Jeff Sackmann.
He is a software developer working in the field of sports statistics.
This site contains the ATP match data
from 1968 to the current year.
We can also get women's tennis data from this site.
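As a rough illustration of this step outside JMP, the sketch below builds the URL of one season's match file and parses CSV text into one record per match. The repository path, the file-name pattern, and the helper names `match_file_url` and `parse_matches` are assumptions, not something shown in the talk, and the sample rows are made up.

```python
import csv
import io

# Assumed location of the yearly match files; verify it before relying on it.
BASE_URL = "https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master"


def match_file_url(year):
    """Return the assumed URL of the ATP match file for one season."""
    return f"{BASE_URL}/atp_matches_{year}.csv"


def parse_matches(csv_text):
    """Parse CSV text into a list of dicts, one dict per match."""
    return list(csv.DictReader(io.StringIO(csv_text)))


# Made-up rows standing in for a real download.
sample = (
    "tourney_name,surface,tourney_level,winner_name,loser_name\n"
    "Demo Open,Hard,A,Player A,Player B\n"
    "Demo Cup,Clay,A,Player B,Player C\n"
)
matches = parse_matches(sample)
print(match_file_url(2022))
print(len(matches), "matches parsed")
```

To work with a real season, the text returned by `urllib.request.urlopen(match_file_url(2022)).read().decode("utf-8")` could be passed to `parse_matches`, provided the assumed URL is still valid.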
What does the data look like?
Here is the data from 2022.
It consists of about 49 variables with about 3,000 observations.
Each observation represents a match played on the ATP tour.
The yellow section contains the variables about the tournaments
and the blue section contains the variables about the players.
Each observation is a match,
so the variables usually come in pairs, one for the winner and one for the loser.
Let's look at the variables about the tournaments first.
I built a Graph Builder plot on the tournament name,
the tournament surface, and the tournament level.
In the tournament name plot,
the country with more players sits at the top.
Sorry, the tournament with more players sits at the top.
The Grand Slams, the Australian Open, Roland Garros, the US Open, and Wimbledon,
are the largest and most prestigious tournaments.
Last year, there were about 145 tournaments.
We can also see there are typically three surfaces for the tournaments.
They are clay, grass, and hard,
and usually there are more hard-surface tournaments than grass and clay.
Also, there are five levels of these tournaments.
The definitions are listed here:
A, D, F, G, M.
G stands for Grand Slam, and M stands for Masters.
The Indian Wells Masters is a Masters-level tournament.
D stands for Davis Cup, and A is the ATP Tour.
Next, let's look at the variables about the players.
I run Graph Builder again.
The plot on the left shows me which country has the most players.
On the right, it shows the players' handedness.
Do they use the right hand or the left hand?
You will see most players are right-handed.
I would also like to find out which country has more top-ranked players.
I created this filter on the winner's rank, and I can slide it.
The country with more top-ranked players will pop up.
I'm interested to see the top 100, and the US is sitting at the top.
That means the US has more top-ranked players than the other countries.
Then what about the top 10?
Look, you can either slide or type in the number.
From this, Spain pops up at the top,
and as I hover over it, I see Carlos and I also see Nadal.
As I click the US, I see the player Taylor Fritz, who ranked number nine.
You can also see from the handedness plot
that within this top 10, Nadal is left-handed.
He's one of the left-handed players in this top 10.
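The rank slider described above can be approximated in code as a filter followed by a count of distinct players per country. This is only a sketch: the column names `winner_rank`, `winner_ioc`, and `winner_name` are assumed to match the match file, the function name is hypothetical, and the rows are made up.

```python
from collections import defaultdict


def top_ranked_by_country(matches, max_rank):
    """Count distinct winners ranked at or better than max_rank, per country."""
    players = defaultdict(set)
    for m in matches:
        rank = m.get("winner_rank")
        if rank in (None, ""):
            continue  # skip matches where the winner has no recorded rank
        if int(float(rank)) <= max_rank:
            players[m["winner_ioc"]].add(m["winner_name"])
    counts = {country: len(names) for country, names in players.items()}
    # Largest count first; ties broken alphabetically by country code.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Made-up rows; real data would come from the yearly match file.
matches = [
    {"winner_name": "Player A", "winner_ioc": "ESP", "winner_rank": "3"},
    {"winner_name": "Player A", "winner_ioc": "ESP", "winner_rank": "2"},
    {"winner_name": "Player B", "winner_ioc": "ESP", "winner_rank": "6"},
    {"winner_name": "Player C", "winner_ioc": "USA", "winner_rank": "9"},
    {"winner_name": "Player D", "winner_ioc": "USA", "winner_rank": "45"},
    {"winner_name": "Player E", "winner_ioc": "FRA", "winner_rank": ""},
]
top10 = top_ranked_by_country(matches, 10)
top100 = top_ranked_by_country(matches, 100)
print(top10)
print(top100)
```

Changing `max_rank` plays the role of moving the slider or typing in a number.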
Now let's move on to check the players' age, height, and ranking.
The players' ages last year
actually ranged from 17 to 42.
In this graph, I only list the top 10 by their average ranking.
From this I find their average height is around 6'2",
which is very common for male tennis players.
I also find Rafael Nadal and Novak Djokovic
are the oldest in this list.
Now, let's look at the winning statistics,
because I would like to see who won the most matches in 2022.
I find Tsitsipas listed as number one.
But something is missing.
Where are Rafael Nadal and Djokovic?
I couldn't find them in this top 10 list of who won the most matches.
This reminds me that maybe I should look at their winning ratio
instead of just the number of matches they won.
I did some summary statistics and found their winning ratios.
Yes, you immediately see
Novak Djokovic, Rafael Nadal, and Carlos Alcaraz.
They have a pretty high ratio;
they are the top three players with the highest winning ratio,
although the number of matches they won is not as high as Tsitsipas's.
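The difference between counting wins and looking at the winning ratio can be sketched as follows. The ratio here is taken as wins divided by matches played, which is an assumption about how the summary was computed; the column names `winner_name` and `loser_name` are assumed, and the rows are made up.

```python
from collections import Counter


def win_ratios(matches):
    """Return {player: (wins, matches_played, win_ratio)}."""
    wins = Counter(m["winner_name"] for m in matches)
    losses = Counter(m["loser_name"] for m in matches)
    summary = {}
    for player in set(wins) | set(losses):
        played = wins[player] + losses[player]
        summary[player] = (wins[player], played, wins[player] / played)
    return summary


# Made-up results: A beats B three times, B beats C once, C beats A once.
matches = (
    [{"winner_name": "Player A", "loser_name": "Player B"}] * 3
    + [{"winner_name": "Player B", "loser_name": "Player C"}]
    + [{"winner_name": "Player C", "loser_name": "Player A"}]
)
summary = win_ratios(matches)
for player, (won, played, ratio) in sorted(summary.items()):
    print(player, won, played, round(ratio, 2))
```

A player with very few matches can show a high ratio, so it may help to read the ratio together with the number of matches played.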
I also noticed there are two players
who have a pretty good winning ratio,
but they didn't play many matches.
They only won three matches.
Who are they and what type of tournament did they play?
I drilled down into the data
and found that one player's name is Kovacevic,
and all three of his matches came from A-level tournaments,
and for the other player,
Safwa, all three of his matches came from the Davis Cup.
From this graph, you can definitely see
that the tournament level affects the winning ratio.
Ultimately, you care about who won the most championships, or tournament wins.
This graph puts all three related statistics in one plot.
At the bottom you see how many matches they won,
and second,
the green bar shows their match-win ratio.
The top shows you
how many total championships they won in 2022.
I see Djokovic, Carlos Alcaraz, and Rafael Nadal.
I also see one guy
who I'm not familiar with, and his name is hard to say, so let me call him FAA.
FAA doesn't have an amazing winning ratio, but he did win five titles.
Again, I drill down into the data and find
that all of FAA's titles came from A-level tournaments.
If you look at Djokovic, Alcaraz, and Nadal,
their championships came not only from A-level tournaments
but also from Grand Slam and Masters-level tournaments.
Again, this shows that tournament level affects winning.
Let's look at the seed.
What role does the seed play in the players' winning?
I have to point out the players' seeds actually vary over the years.
But in general, the higher-seeded players
tend to win more matches and more tournaments.
Grand Slam winners usually are the highest-seeded players.
But in 2022, there were two exceptions.
One is Carlos Alcaraz and the other is Taylor Swift.
Sorry, I mean Taylor Fritz.
You can see here, Carlos,
his seed started low, but he won the Miami Masters.
This helped him move to the top.
By the end of 2022,
he was ranked as the number three seed,
and he was able to win the US Open championship.
Taylor Fritz actually won the Indian Wells Masters championship in 2022.
We can see the seed definitely affects the winning.
Now, let's look at the comparison between the winner and the loser.
In this ATP data, there is a section
of serve statistics, which come for both the winner and the loser.
There are seven variables related to the serve statistics.
I'm interested in this 1stWon. What is it?
1stWon means the number of points won on first serve.
I click and build a plot.
Instead of plotting the absolute number of points,
I use the ratio,
because the number of points depends on how long your match lasted.
The ratio makes more sense.
The blue color represents the first serve percentage won
by the winners, and the pink is for the losers.
Actually, the majority of the first serve percentages won fall between 60% and 90%.
But the blue is shifted more to the right,
indicating winners have a higher first serve percentage won.
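The talk does not say which denominator the ratio uses. A minimal sketch, assuming the ratio is first-serve points won divided by first serves in, and assuming the column names `w_1stWon`, `w_1stIn`, `l_1stWon`, and `l_1stIn`, might look like this; the numbers are made up.

```python
def first_serve_won_pct(first_won, first_in):
    """First-serve points won as a share of first serves in (assumed definition)."""
    if not first_in:
        return None  # no first serves recorded, so the ratio is undefined
    return first_won / first_in


def serve_ratios(match):
    """Return (winner_ratio, loser_ratio) for one match record."""
    winner = first_serve_won_pct(int(match["w_1stWon"]), int(match["w_1stIn"]))
    loser = first_serve_won_pct(int(match["l_1stWon"]), int(match["l_1stIn"]))
    return winner, loser


# Made-up match: the winner wins 40 of 50 first-serve points, the loser 30 of 48.
match = {"w_1stWon": "40", "w_1stIn": "50", "l_1stWon": "30", "l_1stIn": "48"}
winner_ratio, loser_ratio = serve_ratios(match)
print(round(winner_ratio, 3), round(loser_ratio, 3))
```

Using a ratio rather than a raw count keeps a long five-set match comparable with a short one.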
Next, the variables I'm interested in are BP saved and BP faced.
BP faced means break points faced.
If you serve and you face a break point,
that means you give your opponent an opportunity to break you.
You'd better not face break points.
Instead of plotting them separately, my son suggested I convert them to
break points converted, which is a variable defined as
the difference between BP faced and BP saved.
Then again, we can see the blue is shifted more toward the left,
indicating winners face fewer break points and save more of them.
The pink indicates that losers
tend to face more break points and save fewer.
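The derived variable can be written directly from the definition given above: break points faced minus break points saved. The column names `w_bpFaced`, `w_bpSaved`, `l_bpFaced`, and `l_bpSaved` are assumptions about the data file, and the numbers are made up.

```python
def break_points_converted(bp_faced, bp_saved):
    """Break points faced minus break points saved, as defined in the talk."""
    return bp_faced - bp_saved


# Made-up match: the winner faces 4 break points and saves 3;
# the loser faces 10 and saves 5.
match = {"w_bpFaced": 4, "w_bpSaved": 3, "l_bpFaced": 10, "l_bpSaved": 5}
winner_bp = break_points_converted(match["w_bpFaced"], match["w_bpSaved"])
loser_bp = break_points_converted(match["l_bpFaced"], match["l_bpSaved"])
print(winner_bp, loser_bp)
```

A smaller value means fewer service games put at risk and lost, which is the direction the blue distribution leans in the plot.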
With all these statistics and variables I have shown you,
ultimately I would like to know, can I build a model?
Can I predict who is going to win and how many tournaments they can win?
I built a summary table. As I have shown you,
the ATP data comes as matches.
A player can have many matches,
so I just use Tabulate to do the summary statistics.
I got the tournament wins for each player,
their average match-winning ratio,
their height, and their average seed.
I wanted to find the correlation between these variables and tournament wins.
Clearly you can see
the match-winning ratio is highly correlated with tournament wins,
and so is the winner's seed.
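One way to check such a relationship on a per-player summary table is a Pearson correlation. The sketch below is not the JMP output from the talk; it assumes the Pearson coefficient is the measure of interest, and the per-player values are made up purely to show the calculation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up per-player summary: match-winning ratio and tournament wins.
win_ratio = [0.5, 0.6, 0.7, 0.8, 0.9]
tournament_wins = [0, 0, 1, 2, 5]
r = pearson(win_ratio, tournament_wins)
print(round(r, 3))
```

The same function could be applied to average seed or height against tournament wins once the summary table has one row per player.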
Also, I defined one variable I call the diff rank,
because I know that when you face a weak opponent or a strong opponent,
your odds of winning can be different.
I do the subtraction and introduce this variable into the model.
You also notice the height;
there is correlation between the variables.
I just happened to notice that when you're taller,
you tend to have a better ace rate
and a better 1stWon, first serve won.
Definitely the taller players have an advantage at serving.
I bring all these variables into the Fit Model platform.
I first run a least squares model,
and I get the conclusion that the winning ratio and the winner's seed
definitely affect how many tournaments you can win.
I also think, oh, this is count data: how many tournaments you will win.
Maybe I should use [inaudible 00:19:36] distribution,
and I run it and actually get a similar conclusion,
that winning ratio and winner's seed are very important variables.
But I have to point out, although I showed you earlier that
the tournament level plays a very important role in winning,
the format of the data made it hard for me to put it into the model.
It would need a lot of data manipulation.
Plus, I feel that instead of just looking at one year of ATP data,
maybe I should look at more
in order to build a complete or good predictive model.
I will keep this in mind for my future research.
With all these statistics and variables I have shown you so far,
let's get back to the topic: who is the GOAT?
I actually created a script
to get the data for the past 25 years,
as Federer started early.
I wanted to include all the matches all of them have played.
I would like to find out
who won the Grand Slam titles and who won Indian Wells.
This script is able to go to Jeff Sackmann's website,
fetch the data, do the analysis, and generate the report.
You can see that in 2023,
Alcaraz won both Indian Wells and Wimbledon,
and Novak won the Australian Open and Roland Garros.
As the list moves down,
you pretty much see their names: Djokovic, Nadal, Federer, and so on.
For almost the last 20 years, these three have been dominant.
As I keep moving to the bottom, finally, I see Andre Agassi and Pete Sampras,
who were my favorite players in the '90s.
Also you see these three guys,
Djokovic, Federer, and Nadal, sitting at the top.
This includes Grand Slam titles and Indian Wells titles.
I truly believe these three guys moved modern tennis to a higher level.
Now, let's look again
at the match wins, winning ratio, tournament titles, and Grand Slam titles.
I would like to see more detail.
The green bar here represents their match-winning ratio.
But I like this Graph Builder feature:
it allows me to put their number of match wins on top.
Then you can see,
although their winning ratios are very close,
all above 80%,
Roger Federer won the most matches, 1,263.
Move up and you will see those green bars
show how many tournament championships each of them has won.
Again, Federer won the most.
Then look at the blue bars on top:
you will see that Djokovic won the most, 23 Grand Slam titles.
Next, I want to check on their ranking.
These four lines not only show their rankings over the years
but also show their incredible professional tennis careers.
Federer started early, in 2001.
It took him about three years to move to the top,
but he stayed at the top for a long time, 18 years.
If you look at Nadal and Djokovic,
they moved very quickly to the top,
and they also stayed at the top for a long time.
A dip here usually means they had an injury or were recovering from surgery.
I know Nadal is in a recovery period right now
because he just had surgery, and Djokovic continues to play.
I truly believe that those two lines will continue for a while.
Alcaraz has just started.
We will see if he follows the same trajectory as the Big Three.
I would like to show you more detail about the individual Grand Slam matches.
Look at this plot on the left.
This shows, over the past 25 years,
how many Grand Slam matches Federer has played.
A total of 434 Grand Slam matches.
He won 373 matches and lost 61.
That brings his winning ratio to 86%.
It's amazing.
The right-hand plot shows his opponents' rankings.
I want to show how difficult it is.
Usually when your opponent has a high ranking,
it is tough to win the match.
The red dots here represent the matches won,
the blue dots represent the matches lost,
and the squares indicate the finals.
These are all Grand Slam matches.
You see, most of Federer's opponents are highly ranked players,
and only in a few cases, I guess he was lucky,
he got to play a low-ranked opponent.
We can also look at his performance in each Grand Slam.
As I click Wimbledon, you will see Federer won a lot at Wimbledon.
Then let me click Roland Garros:
in Federer's entire career, he only won once at Roland Garros.
That was in 2009.
In the other years, he pretty much lost to Nadal.
Let's see what happened in 2009.
I bring up Nadal's record and look particularly at Roland Garros.
You pretty much see all red squares.
That means he was the champion at Roland Garros.
He only lost four matches, including this one in 2009;
he lost in the fourth round.
That was the year Federer was actually able to win the championship.
I will skip Novak and Carlos and bring you the overview
of these four guys' performance in all four Grand Slams.
If I look at each one, for the Australian Open,
you pretty much see Novak Djokovic is dominant.
Then if you look at Roland Garros, Nadal is dominant.
For the US Open, they all have won it.
I guess the US Open provides an opportunity for all of them.
If you look at Wimbledon, I think both Federer and Djokovic
did pretty well there, but Federer still won more than Djokovic.
Finally, I wanted to look at their Grand Slam winning ratios.
This plot shows me, yes, Djokovic won the most Grand Slam titles.
Also, if you look at the winning ratio,
overall, Djokovic has the highest, similar to Rafael Nadal's.
In almost every category,
you can see Djokovic has a higher winning ratio,
except on clay, at Roland Garros, where Nadal is the best.
I would say, just based on winning the most Grand Slam titles
and the highest match-winning ratio, Djokovic is the GOAT.
Next, we would like to find out
who was the youngest among the four of them to win a Grand Slam title.
That was Nadal.
I think he was only 18.9 when he won his first Grand Slam title.
Alcaraz won his US Open at age 19.3.
Djokovic and Federer won their first titles in their 20s.
But look at their long, amazing careers: even at age 36,
both of them were still able to win a Grand Slam title.
I think that Djokovic will continue to win.
I think he will have more titles under his belt.
I also looked at how they played against each other.
I wanted to see their net wins against each other.
If you look at Rafael Nadal against Roger Federer,
Rafael won 24 and Roger won 17 against Rafael.
That brings their net...
Rafael has seven net wins against Roger.
Novak Djokovic has five net wins over Federer and one net win over Nadal.
Even based on net wins, I think Djokovic is the GOAT.
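Net wins between two players are simply one player's head-to-head wins minus the other's. The sketch below assumes the columns `winner_name` and `loser_name` and a hypothetical helper `net_wins`; the rows are synthetic and only reproduce the 24 and 17 quoted above, not the real match records.

```python
def net_wins(matches, player, opponent):
    """Head-to-head wins of player over opponent minus wins of opponent over player."""
    won = sum(
        1 for m in matches
        if m["winner_name"] == player and m["loser_name"] == opponent
    )
    lost = sum(
        1 for m in matches
        if m["winner_name"] == opponent and m["loser_name"] == player
    )
    return won - lost


# Synthetic rows reproducing the counts quoted in the talk (24 and 17).
matches = (
    [{"winner_name": "Rafael Nadal", "loser_name": "Roger Federer"}] * 24
    + [{"winner_name": "Roger Federer", "loser_name": "Rafael Nadal"}] * 17
)
nadal_net = net_wins(matches, "Rafael Nadal", "Roger Federer")
federer_net = net_wins(matches, "Roger Federer", "Rafael Nadal")
print(nadal_net, federer_net)
```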
I still would like to see their serve statistics, because in the ATP data,
this is the data most related to their technique.
I put all these variables into the Oneway platform and used Fit Group.
With this, you can see there's a lot of data; the sample size is big.
With all the data together, it seems that
Djokovic has better serve statistics than the rest of them.
But I realize this is a big sample size.
Sometimes a large sample size can turn a small difference
into a statistically significant difference.
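The sample-size effect can be illustrated with a two-proportion z-test, which is swapped in here for simplicity and is not the Oneway analysis used in the talk. The serve percentages and sample sizes are made up: the same two-point gap is tested once with 100 points per player and once with 20,000.

```python
import math


def two_proportion_p_value(p1, n1, p2, n2):
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # For a standard normal Z, P(|Z| > |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


# Made-up first-serve win rates: 75% versus 73%.
p_small = two_proportion_p_value(0.75, 100, 0.73, 100)
p_large = two_proportion_p_value(0.75, 20000, 0.73, 20000)
print(round(p_small, 3))
print(p_large)
```

With the small samples the gap is far from significant, while with the large samples the same gap is highly significant, even though its practical size has not changed.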
I would rather look at a subset.
I look at a smaller sample, and I look at Wimbledon.
Yes, at Wimbledon, I can still draw the conclusion that
Federer is a little bit better than the rest of them.
But once I look at another Grand Slam,
like the Australian Open, I cannot draw the same conclusion.
Overall, I think their technique is very, very similar.
For serving success rate, they have very similar statistics.
With all the statistics and variables
I have shown you, based on winning the most Grand Slam titles
and the highest match-winning ratio, Djokovic is the GOAT.
However, statistics don't paint the entire picture,
as a player can have a much larger impact than just statistics,
such as the way they play the game, their love for the game,
and especially whom this player inspires,
such as the young kids
who aspire to be just like their idols, including my son, whose dream is to play
the Easter Bowl, a prestigious tournament for youth at Indian Wells.
In the end,
it was just an honor and a privilege to watch these three great players
play tennis all at the same time,
and the future looks bright for more great tennis to watch,
as other players, such as Carlos Alcaraz,
look to follow in the Big Three's footsteps.
I had so much fun doing this project
by using features such as Graph Builder, Dashboard, and Application Builder in JMP.
These features allowed me to easily explore big data sets
and quickly identify atypical observations.
Dashboard not only puts different analyses in one report
but also allows me to stay in the report and rerun analyses after a modification.
Application Builder allows me to present the project
without having to use PowerPoint.
Although this project mainly analyzes ATP men's tennis data,
the analytical tools and the flow can easily be applied to women's tennis data
as well as to any data set in other fields that has patterns.
If you have any questions, please feel free to contact me.
Thank you.