Who is the GOAT of Men’s Tennis? Tennis Data Analysis and Visualization Using JMP - (2023-US-PO-1510)

Tennis is one of my favorite sports, and the 'big three' of Federer, Nadal and Djokovic are my favorite players. Their rivalry elevated modern tennis to new heights, but it looks like Alcaraz is in line to take the throne. It is fun to compare their records and present the results at the JMP Discovery Summit in Indian Wells, home of the Indian Wells Open.

 

I started with Association of Tennis Professionals (ATP) data from 2022 for my project. Next, I combined over 25 years of data that include all match records of Federer, Nadal, Djokovic and Alcaraz. Using JMP’s powerful analytical and visualization tools, this report provides insight into several questions: Who won the most matches in a year? What factors influence players winning matches? Most importantly, who is the GOAT?

 

 

Hi,  everyone.

 

My  name  is  Jianfeng  Ding.

I'm  a  research  statistician developer  at  JMP  IND.

Today, I'm going to show you how I used JMP to explore tennis data

and find out who is the GOAT, the greatest of all time, of men's tennis.

First,

I would like to give you some background information

on why I chose this topic.

When  I  heard  that  the  JMP  Discovery  Summit 2023  will  be  held  in  Indian  Wells,

I  got  excited as  tennis  is  one  of  my  favorite  sports

and  my  youngest  son   plays  varsity  tennis  at  his  high  school.

I  have  watched   a  lot  of  tennis  over  the  years.

Indian Wells is home to the Indian Wells Masters,

which is often called the fifth Grand Slam.

I thought it would be fun to use JMP to explore and analyze the tennis data

and present the results to our users at the Indian Wells Discovery Summit.

The second motivation comes from JMP.

JMP  has  grown  bigger and  richer  in  many  ways.

There  are  so  many  wonderful features  created  by  my  colleagues.

I  would  like  to  keep  myself  updated  with  these  new  cool  features

by  applying  them  to  the  project.

Currently,  you  are  seeing  one  of  them, Application  Builder.

Instead  of  using  PowerPoint,

I'm  using  JMP  Application  Builder for  today's  presentation.

My presentation mainly includes two parts.

I will take you on a tour to explore the ATP data from 2022.

ATP stands for Association of Tennis Professionals,

which is the world governing body for men's tennis.

Then we will look at 25 years of combined ATP data to find out who is the GOAT.

First, let's see where I got the data.

I got the data from a website and GitHub repository

that was created and maintained by Jeff Sackmann.

He  is  a  software  developer   working  in  the  field  of  sports  statistics.

His site contains the ATP match data

from 1968 to the current year.

Women's tennis data is available from the same site as well.

What does the data look like?

Here  is  the  data  from  the  year  of  2022.

It consists of about 49 variables with about 3,000 observations.

Each observation represents a match played on the ATP Tour.

The yellow section contains the variables about the tournaments

and the blue section contains the variables about the players.

Each observation is a match,

so the variables usually come in pairs, one for the winner and one for the loser.

Let's look at the variables about the tournament first.

I built a Graph Builder plot of tournament name,

tournament surface, and tournament level.

In the tournament name plot,

the country with more players sits at the top.

Sorry, the tournament with more players sits at the top.

The Grand Slams, the Australian Open, Roland Garros, the US Open, and Wimbledon,

are the largest and most prestigious tournaments.

Last year, there were about 145 tournaments.

We can also see that there are typically three surfaces for the tournaments.

They are clay, grass, and hard,

and usually there are more hard-court tournaments than grass or clay ones.

Also, there are five levels of these tournaments.

The levels are defined here.

A, D, F, G, M.

G stands for the Grand Slams, and M stands for the Masters.

The Indian Wells Masters is a Masters-level tournament.

D stands for Davis Cup, and A is the ATP Tour.

Next,  let's  look  at  the  variable about  the  players.

I ran Graph Builder again.

The plot on the left shows me which country has the most players.

On the right, it shows the players' playing hands.

Do they play right-handed or left-handed?

You will see that most players are right-handed.

I also would like to find out which country has more top-ranked players.

I created this filter on the winner's rank, and I can slide it.

The country with more top-ranked players will pop up.

I'm interested in the top 100, and the US is sitting at the top.

That means the US has more top-ranked players than any other country.

Then  what  about  the  top  10?

Look, you can either slide or type in the number.

Here, Spain popped up at the top,

and as I hover over it, I see Carlos and I also see Nadal.

As I click the US, I see the player Taylor Fritz, who ranked number nine.

You can also see from the hand plot

that within this top 10, Nadal is left-handed.

He's one of the left-handed players in this top 10.

Now let's move on to check the players' age, height, and ranking.

In last year's tournaments, the players' ages

actually range from 17 to 42.

In this graph, I only listed the top 10 with their average ranking.

From this, I find that their average height is around 6'2,

which is very common for male tennis players.

I also find that Rafael Nadal and Novak Djokovic

are the oldest in this list.

Now, let's look at the winning statistics,

because I would like to see who won the most matches in 2022.

I found that Tsitsipas is listed as number one.

Then something is missing.

Where are Rafael Nadal and Djokovic?

I couldn't find them in this top 10 list of who won the most matches.

This reminded me that maybe I should look at their winning ratio

instead of just the number of matches they won.

I did some summary statistics and found their winning ratios.
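The difference between counting wins and computing a winning ratio can be sketched in a few lines of Python. This is only an illustration, not the JMP summary used in the talk: the `win_ratios` helper and the toy match list are hypothetical, and each match is reduced to a (winner, loser) pair.

```python
from collections import Counter

def win_ratios(matches):
    """Return {player: (wins, losses, win_ratio)} from (winner, loser) pairs."""
    wins = Counter(winner for winner, _ in matches)
    losses = Counter(loser for _, loser in matches)
    table = {}
    for player in set(wins) | set(losses):
        played = wins[player] + losses[player]
        table[player] = (wins[player], losses[player], wins[player] / played)
    return table

# Toy rows for illustration only, not real 2022 results.
toy_matches = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A"), ("A", "B")]
ratios = win_ratios(toy_matches)

# Rank by ratio rather than by raw win count.
by_ratio = sorted(ratios, key=lambda p: ratios[p][2], reverse=True)
```

A player with few matches can reach a high ratio, which is why the ratio is best read next to the match count.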

Yes, you immediately see

Novak Djokovic, Rafael Nadal, and Carlos Alcaraz.

They have pretty high ratios;

they are the top three players with the highest winning ratios,

although the number of matches they won is not as high as Tsitsipas's.

I also noticed there are two players

who have pretty good winning ratios,

but they didn't play many matches.

They only won three matches.

Who are they and what type of tournament did they play?

I drilled down into the data

and found that one player's name is Kovacevic,

and all three of his matches came from A-level tournaments,

and the other player is

Safwat, whose three matches all came from the Davis Cup.

From this graph, you can definitely see

that the tournament level affects winning.

Ultimately, you care about who won the most championships, or tournament titles.

This graph puts all three related statistics in one plot.

At the bottom, you see how many matches they won,

and in the second panel,

the green bars show their match-win ratios.

The top shows you

how many total championships they won in 2022.
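One way to count championships from match-level rows is to count the winners of final-round matches. The sketch below assumes each row has a `round` field whose value is `"F"` for a final and a `winner_name` field; those names and the toy rows are assumptions for illustration, and team events or round-robin formats would need separate handling.

```python
from collections import Counter

def count_titles(rows, final_code="F"):
    """Count one title for the winner of every row marked as a final."""
    return Counter(r["winner_name"] for r in rows if r["round"] == final_code)

# Toy rows for illustration only, not real results.
toy_rows = [
    {"tourney_name": "Event 1", "round": "SF", "winner_name": "A"},
    {"tourney_name": "Event 1", "round": "F", "winner_name": "A"},
    {"tourney_name": "Event 2", "round": "F", "winner_name": "B"},
    {"tourney_name": "Event 3", "round": "F", "winner_name": "A"},
]
titles = count_titles(toy_rows)
```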

I see Djokovic, Carlos Alcaraz, and Rafael Nadal.

I also see one guy

I'm not familiar with, and his name is hard to say, so let me call him FAA.

FAA doesn't have an amazing winning ratio, but he did win five titles.

Again, I drilled down into the data and found that

all of FAA's titles came from A-level tournaments.

If you look at Djokovic, Alcaraz, and Nadal,

their championships came not only from A-level tournaments

but also from Grand Slam and Masters-level tournaments.

Again, this shows that tournament level affects winning.

Let's look at the seed.

What role does seeding play in the players' winning?

I have to point out that the players' seeds actually vary over time.

But in general, the higher-seeded players

tend to win more matches and more tournaments.

Grand Slam winners are usually the highest-seeded players.

But in 2022, two players were exceptions.

One is Carlos Alcaraz and the other is Taylor Swift.

Sorry, I mean Taylor Fritz.

You can see here that for Carlos,

his seed started low, but he won the Miami Masters.

This helped him move to the top.

By the end of 2022,

he was ranked as the number three seed,

and he was able to win the US Open championship.

Taylor Fritz actually won the 2022 Indian Wells Masters championship.

We can see that the seeds definitely affect winning.

Now, let's look at the comparison between the winner and the loser.

In this ATP data, there is a section

of serve statistics, with one set for the winner and one for the loser.

There are seven variables related to the serve statistics.

I'm interested in this first one. What is it?

The first one is the number of points won on first serve.

I click and build a plot.

Instead of plotting the absolute number of points,

I use the ratio,

because the number of points depends on how long the match was played.

So the ratio makes more sense.
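As a rough sketch of the ratio idea, the count of first-serve points won can be divided by the number of first serves that landed in. The talk does not state the denominator, so using first serves in is an assumption here, as are the column names `w_1stWon`, `w_1stIn`, `l_1stWon`, `l_1stIn` and the toy numbers.

```python
def first_serve_won_ratio(first_won, first_in):
    """Share of first-serve points won; None when no first serve landed in."""
    if first_in == 0:
        return None
    return first_won / first_in

# Toy match row for illustration only.
toy = {"w_1stWon": 45, "w_1stIn": 60, "l_1stWon": 33, "l_1stIn": 55}

winner_ratio = first_serve_won_ratio(toy["w_1stWon"], toy["w_1stIn"])
loser_ratio = first_serve_won_ratio(toy["l_1stWon"], toy["l_1stIn"])
```

Because both values are ratios, a five-set match and a two-set match can be compared on the same scale.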

The blue color represents the first-serve percentage won

by the winners, and the pink represents the losers.

Actually, the majority of first-serve percentages won are between 60% and 90%.

But the blue color is shaded more to the right,

indicating that winners have a higher first-serve percentage won.

Next, the variables I'm interested in are BP saved and BP faced.

BP faced means break points faced.

If you serve and you face a break point,

that means you give your opponent an opportunity to break you.

You'd better not face break points.

Instead of plotting them separately, my son suggested that I convert them into

break points converted, which is a variable defined as

the difference between BP faced and BP saved.
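The derived variable is a simple subtraction, sketched below. The function name and the toy numbers are hypothetical; only the definition (break points faced minus break points saved) comes from the talk.

```python
def break_points_converted(bp_faced, bp_saved):
    """Break points a server faced but did not save."""
    if bp_saved > bp_faced:
        raise ValueError("cannot save more break points than were faced")
    return bp_faced - bp_saved

# Toy numbers for illustration only.
winner_converted = break_points_converted(bp_faced=4, bp_saved=3)
loser_converted = break_points_converted(bp_faced=9, bp_saved=5)
```

A lower value is better for the server, since each converted break point is a service game lost.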

Then again, we can see the blue color is shaded more towards the left,

indicating that winners face fewer break points and save more of them.

The pink one indicates that losers

tend to face more break points and save fewer of them.

I have shown you all these statistics and variables,

but ultimately I would like to know: can I build a model?

Can I predict who is going to win and how many tournaments they can win?

I built a summary table because, as I showed you,

all the ATP data is organized by match.

A player can have many matches,

so I just used Tabulate to do the summary statistics.

I got the tournament wins for each player,

their average match-winning ratio,

their height, and their average seed.

I wanted to find the correlation between these variables and the tournament wins.

Clearly you can see

the match-winning ratio is highly correlated with tournament wins,

and so is the winner's seed.
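To make the correlation step concrete, here is a minimal Pearson correlation written with the standard library. The talk does not say which correlation coefficient was used, so Pearson is an assumption, and the per-player values are invented toy data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)

# Toy per-player summary for illustration only.
match_win_ratio = [0.50, 0.60, 0.70, 0.80, 0.90]
tournament_wins = [0, 1, 1, 3, 5]
r = pearson(match_win_ratio, tournament_wins)
```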

I also defined one variable I call the diff rank,

because I know that when you face a weak opponent or a strong opponent,

your odds of winning can be different.

I do the subtraction of the two players' ranks and introduce this variable into the model.
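A rank-difference variable of this kind could be computed as below. The direction of the subtraction is not stated in the talk, so opponent rank minus player rank is an assumption, and the function name and handling of missing ranks are hypothetical.

```python
def rank_difference(player_rank, opponent_rank):
    """Opponent rank minus player rank; positive means the player is ranked better.

    Returns None when either rank is missing.
    """
    if player_rank is None or opponent_rank is None:
        return None
    return opponent_rank - player_rank

# Toy ranks for illustration only.
favourite = rank_difference(player_rank=3, opponent_rank=40)
underdog = rank_difference(player_rank=40, opponent_rank=3)
```

With this sign convention, a large positive value marks a match against a much lower-ranked opponent.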

You will also notice the height;

there is a correlation between the variables.

I just happened to notice that when you're taller,

you tend to have a better ace rate

and better first-serve points won.

Definitely, the taller players have an advantage at serving.

I brought all these variables into the Fit Model platform.

I first ran a least squares model

and came to the conclusion that the winning ratio and the winner's seed

definitely affect how many tournaments you can win.

I also thought, oh, this is count data: how many tournaments you will win.

Maybe I should use [inaudible 00:19:36] distribution,

so I ran that and got a similar conclusion:

the winning ratio and the winner's seed are very important variables.

But I have to point out that although I showed you earlier

that the tournament level plays a very important role in winning,

the format of the data itself made it hard to put it into the model.

It would need a lot of data manipulation.

Plus, I feel that instead of just looking at one year's ATP data,

maybe I should look at more

in order to build a complete or good predictive model.

I will keep this in mind for my future research.

Those are all the statistics and variables I have shown you so far.

Let's go back to the topic: who is the GOAT?

I actually created a script,

because I wanted to get the data from the past 25 years,

as Federer started early.

I wanted to include all the matches all of them have played.

I would like to find out

who won the Grand Slam titles and who won Indian Wells.

This script is able to go to Jeff Sackmann's site,

fetch the data, do the analysis, and generate the report.
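The script in the talk is not shown, so the following Python sketch only illustrates the general shape of such a workflow: build one file URL per year, then keep the rows involving the players of interest. The base URL, the `atp_matches_<year>.csv` file naming, the year range 1999 to 2023, and the column names are all assumptions, and no network request is made here.

```python
import csv
import io

# Assumed location of the yearly match files; verify before use.
BASE_URL = "https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master"

def match_file_urls(first_year, last_year):
    """One assumed CSV URL per season, inclusive of both end years."""
    return [f"{BASE_URL}/atp_matches_{year}.csv"
            for year in range(first_year, last_year + 1)]

def matches_involving(csv_text, players):
    """Rows of a match CSV in which any of the given players won or lost."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader
            if row["winner_name"] in players or row["loser_name"] in players]

urls = match_file_urls(1999, 2023)

# Tiny inline sample standing in for a downloaded file.
sample = (
    "tourney_name,winner_name,loser_name\n"
    "Event 1,Roger Federer,Player X\n"
    "Event 2,Player Y,Player Z\n"
)
big_four = {"Roger Federer", "Rafael Nadal", "Novak Djokovic", "Carlos Alcaraz"}
kept = matches_involving(sample, big_four)
```

Each URL could then be downloaded (for example with `urllib.request.urlopen`) and its text passed to `matches_involving`.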

You can see that in 2023,

Alcaraz won both Indian Wells and Wimbledon,

and Novak won the Australian Open and Roland Garros.

As the list moves down,

you pretty much see their names: Djokovic, Nadal, Federer, and so on.

For almost the last 20 years, these three have been dominant.

As I keep moving to the bottom, I finally see Andre Agassi and Pete Sampras,

who were my favorite players in the '90s.

You also see that these three guys,

Djokovic, Federer, and Nadal, sit at the top.

This includes Grand Slam titles and Indian Wells titles.

I truly believe these three guys moved modern tennis to a higher level.

Now, let's look again

at the match wins, winning ratio, tournament titles, and Grand Slam titles.

I would like to see more detail.

The green bars here represent their match-winning ratios.

But I like this Graph Builder feature:

it allows me to put their number of matches won on top.

Then you can see that

although their winning ratios are very close,

all above 80%,

Roger Federer won the most matches, 1,263.

Move up and you will see that these green bars

show how many tournament championships each of them has won.

Again, Federer won the most.

Then look at the blue bars at the top,

and you will see that Djokovic won the most Grand Slam titles, 23.

Next, I want to check on their rankings.

These four lines not only show their rankings over the years

but also show their incredible professional tennis careers.

Federer started early, in 2001.

It took him about three years to move to the top,

but he stayed at the top for a long time, 18 years.

Look at Nadal and Djokovic:

they moved very quickly to the top,

and they also stayed at the top for a long time.

A dip here usually means they had an injury or were recovering from surgery.

I know Nadal is in a recovery period right now

because he just had surgery, and Djokovic continues to play.

I truly believe that those two lines will continue to grow for a while.

As for Alcaraz, he has just started.

We will see if he follows the same trajectory as the Big Three.

I would like to show you more detail about the individual Grand Slam matches.

Look at this plot on the left.

This shows, for the past 25 years,

how many Grand Slam matches Federer has played:

434 Grand Slam matches in total.

He won 373 matches and lost 61 matches.

That brings his winning ratio to 86%.

It's amazing.

The right-hand plot shows his opponents' rankings.

I want to show how difficult it is.

Usually, when your opponent has a high ranking,

that means the match is tough to win.

The red dots here represent the matches he won,

the blue dots represent the matches he lost,

and the squares indicate the finals.

These are all Grand Slam matches.

Look, most of Federer's opponents are high-ranked players,

and only a few times, I guess, was he lucky

enough to play an opponent with a low ranking.

We can also look at his performance in each Grand Slam.

As I click Wimbledon, you will see that Federer won a lot at Wimbledon.

Then let me click on Roland Garros.

In Federer's entire career, he only won Roland Garros once.

That was in 2009.

In the other years, he pretty much lost to Nadal.

Let's see what happened in 2009.

I bring up Nadal's record and look at Roland Garros in particular.

You pretty much see all red squares.

That means he was the champion of Roland Garros.

He only lost four matches, including this one in 2009,

in the semifinal, he lost.

That was the year Federer was actually able to win the championship.

I will skip Novak and Carlos and give you an overview

of all four guys' performance in all four Grand Slams.

If you look at the Australian Open,

you pretty much see that Novak Djokovic is dominant.

Then if you look at Roland Garros, Nadal is dominant.

As for the US Open, they have all won it.

I guess the US Open provides an opportunity for all of them.

If you look at Wimbledon, I think both Federer and Djokovic

did pretty well there, but Federer still won more than Djokovic.

Finally, I wanted to look at their Grand Slam winning ratios.

This plot shows me that, yes, Djokovic won the most Grand Slam titles.

Also, if you look at the winning ratio,

overall, Djokovic has the highest, or one similar to Rafael Nadal's.

In almost every category,

you can see Djokovic has the higher winning ratio,

except on clay, at Roland Garros, where Nadal is the best.

I would say that, based just on winning the most Grand Slam titles

and having the highest match-winning ratio, Djokovic is the GOAT.

Next, we would like to find out

who was the youngest among the four of them to win a Grand Slam title.

That was Nadal.

I think he was only 18.9 when he won his first Grand Slam title.

Alcaraz won his US Open at age 19.3.

Djokovic and Federer won their first titles in their 20s.

But look at their long, amazing careers: even at age 36,

both of them were still able to win a Grand Slam title.

I think that Djokovic will continue to win.

I think he will have more titles under his belt.

I also looked at how they played against each other.

I wanted to see their net wins against each other.

If you look at Rafael Nadal against Roger Federer,

Rafael won 24 and Roger won 17 against Rafael.

That brings their net...

Rafael has seven net wins against Roger.

Novak Djokovic has five net wins over Federer and one net win over Nadal.
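The net-win figure is just the difference between the two head-to-head counts. A minimal sketch, using the Nadal and Federer counts quoted in the talk (the function name is hypothetical):

```python
def net_wins(wins_a_over_b, wins_b_over_a):
    """Head-to-head wins of A over B minus wins of B over A."""
    return wins_a_over_b - wins_b_over_a

# Counts as quoted in the talk: Nadal 24, Federer 17.
nadal_over_federer = net_wins(24, 17)
federer_over_nadal = net_wins(17, 24)
```

The measure is antisymmetric, so one player's net wins are the other's net losses.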

Even based on net wins, I think Djokovic is the GOAT.

I would still like to see their serve statistics, because in the ATP data,

this is the data more related to their technique.

I put all these variables into Oneway and used Fit Group.

You can see there's a lot of data; the sample size is big.

With all the data together, it seems that

Djokovic has better serve statistics than the rest of them.

But I realize this is a big sample size.

Sometimes a large sample size can turn a small difference

into a statistically significant difference.
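The sample-size effect can be illustrated with a simplified two-sample z statistic, where both groups share the same standard deviation and size. This is a stand-in for intuition only, not the Oneway analysis from the talk, and the difference of 0.01 and standard deviation of 0.10 are made-up values.

```python
import math

def two_sample_z(mean_diff, sd, n_per_group):
    """z statistic for a mean difference with equal sd and equal group sizes."""
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    return mean_diff / standard_error

# Same small difference, very different sample sizes (illustrative values).
z_small = two_sample_z(mean_diff=0.01, sd=0.10, n_per_group=50)    # about 0.5
z_large = two_sample_z(mean_diff=0.01, sd=0.10, n_per_group=5000)  # about 5
```

The difference itself never changes; only the standard error shrinks as the sample grows, which pushes the statistic past the usual significance thresholds.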

I would rather look at a subset.

I looked at a smaller sample, Wimbledon only.

Yes, at Wimbledon, I can still draw the conclusion that

Federer is a little bit better than the rest of them.

But once I look at another Grand Slam,

like the Australian Open, I cannot draw the same conclusion.

Overall, I think their technique is very, very similar.

In terms of serving success rate, they have very similar statistics.

Of all the statistics and variables

I have shown you, according to the statistics of winning the most Grand Slam titles

and the highest match-winning ratio, Djokovic is the GOAT.

However, statistics don't paint the entire picture,

as a player can have a much larger impact than just statistics,

such as the way they play the game, their love for the game,

and especially who the player inspires,

such as the young kids

who aspire to be just like their idols, including my son, whose dream is to play

the Easter Bowl, a prestigious tournament for youth at Indian Wells.

In the end,

it was just an honor and a privilege to watch these three great players

play the game, play tennis, all at the same time,

and the future looks bright for more great tennis to watch,

as other players, such as Carlos Alcaraz

and others, look to follow in the Big Three's footsteps.

I had so much fun doing this project

by using features such as Graph Builder, Dashboard, and Application Builder in JMP.

These features allowed me to easily explore big data sets

and quickly identify atypical observations.

Dashboard not only can put different analyses in one report,

but also allowed me to stay in the report and rerun analyses after modification.

Application Builder allows me to present the project

without having to use PowerPoint.

Although this project mainly analyzes ATP men's tennis data,

the analytical tools and the flow can be easily applied to women's tennis data

as well as any data set that has patterns, in other fields.

If  you  have  any  questions, please  feel  free  to  contact  me.

Thank  you.