What does a winning thoroughbred horse look like?

In a previous post, I wrote about how pedigree might be used to help predict the outcomes of horse races. In particular, I discussed a metric called the Dosage Index (DI), which appeared to be a leading indicator of success (at least historically). In this post, I want to introduce the Center of Distribution (CD) as a metric that can help us predict a horse’s potential regarding speed and distance.

Specifically, with the Belmont Stakes set to be run on Saturday, I want to combine DI and CD to analyze what some horse racing fans refer to as the concept of Dual Qualifiers. Historically speaking, there is a general rule of thumb that if a horse has a DI below 4.0 and a CD below 1.0, then it has an advantage; the horse qualifies as a favorite relative to both metrics, making it a Dual Qualifier. However, I wonder if we might use analytics to improve on this historical rule of thumb.
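The rule of thumb is simple enough to write down directly. Here is a minimal Python sketch; the function name and the sample DI/CD values are illustrative assumptions, not data from the races discussed in this post.

```python
def is_dual_qualifier(di, cd, di_limit=4.0, cd_limit=1.0):
    """Return True when both metrics fall below the rule-of-thumb limits."""
    return di < di_limit and cd < cd_limit


# Made-up (DI, CD) pairs, for illustration only.
examples = {
    "Horse A": (3.0, 0.75),
    "Horse B": (4.5, 0.90),
    "Horse C": (2.2, 1.10),
}

qualifiers = [
    name for name, (di, cd) in examples.items() if is_dual_qualifier(di, cd)
]
print(qualifiers)
```

Only the horse that clears both limits is kept, which is what makes the rule a "dual" qualification.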

Explanation of Key Metrics

“Dosage,” as I explained in my earlier post, is a system that attempts to explain the potential a horse might have due to its pedigree. It is described in terms of the Dosage Profile, Dosage Index and Center of Distribution. In this discussion, I am most interested in the following terms:

Dosage Index (DI). A horse with a high DI has been bred for speed, and since this is racing, that seems important! But these high-stakes races are relatively long, and the Belmont Stakes is the longest of the Triple Crown races. Owners, trainers and fans must also consider the stamina necessary to run 12 furlongs (or 1.5 miles).

Center of Distribution (CD). The CD ranges between -2 and +2 and represents the balance of speed and stamina. Thoroughbred racehorses are bred with distance in mind, and CD points to that distance.
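This post does not spell out the formulas, so the following is a sketch based on the commonly published definitions of the dosage system rather than anything stated here. It assumes a Dosage Profile of five point totals (Brilliant, Intermediate, Classic, Solid, Professional); the function name and the sample profile are made up for illustration.

```python
import math


def dosage_metrics(brilliant, intermediate, classic, solid, professional):
    """Compute (DI, CD) from a five-category Dosage Profile."""
    total = brilliant + intermediate + classic + solid + professional
    if total == 0:
        raise ValueError("Dosage Profile has no points")

    # DI: speed points over stamina points, with Classic split evenly.
    speed = brilliant + intermediate + classic / 2
    stamina = classic / 2 + solid + professional
    di = speed / stamina if stamina else math.inf

    # CD: weighted average position on a +2 (Brilliant) to -2 (Professional) scale.
    cd = (2 * brilliant + intermediate - solid - 2 * professional) / total
    return di, cd


# Made-up profile, for illustration only.
di, cd = dosage_metrics(3, 5, 8, 0, 0)
print(di, cd)
```

Under these definitions, a profile made up entirely of Brilliant points gives a CD of +2 and one made up entirely of Professional points gives -2, which matches the range described above.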

We can use the historical rule of thumb (regarding DI and CD) to quantify a horse’s ability to “run fast,” and also to have enough stamina to “run long.” But can we improve it?

I looked at Belmont Stakes races from 2005 to 2015 and analyzed these metrics, seeking to develop a profile that would enable us to understand with greater clarity what a winning horse looks like.

Developing the Profile

Using the Clustering platform in JMP, I analyzed DI and CD using the default settings regarding the K-Means method. Three clusters were returned with the following parameters:
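For readers without JMP, the general idea of K-Means on two variables can be sketched in plain Python. This is a minimal illustration of the standard algorithm, not a reproduction of JMP's default settings (for example, it applies no column scaling, and the evenly spaced starting points are my own simplification). The (DI, CD) values are made up.

```python
import math


def kmeans(points, k, max_iter=100):
    """Basic Lloyd's k-means on a list of (DI, CD) tuples."""
    # Deterministic start: k evenly spaced points from the sorted data.
    ordered = sorted(points)
    step = (len(ordered) - 1) / max(k - 1, 1)
    centroids = [ordered[round(i * step)] for i in range(k)]

    assignments = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # nothing moved, so the clustering has converged
        assignments = new_assignments

        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return centroids, assignments


# Made-up (DI, CD) values, for illustration only.
horses = [
    (1.0, 0.20), (1.2, 0.30), (1.4, 0.25),
    (2.5, 0.70), (2.8, 0.75), (3.0, 0.80),
    (5.0, 1.20), (5.5, 1.30), (6.0, 1.25),
]
centroids, assignments = kmeans(horses, k=3)
print(assignments)
```

Because DI spans a wider numeric range than CD, unscaled distances are dominated by DI; standardizing both columns first is a common adjustment.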

[Figure: K Means Cluster report]

With the new clusters identified, I plotted all the data as a Scatterplot Matrix (an option within the K-Means analysis platform).


Next, I wondered how horses performed relative to their clusters. I used a Local Data Filter to isolate those horses finishing “in the money” and found something interesting: 21 of 34 horses were assigned to Cluster 2 (which is appropriately colored green).


Then, I took the analysis a step further. To which clusters were the race winners assigned? Another Local Data Filter revealed that in the previous 11 races, nine of the winners were assigned to Cluster 2.


My next question was: In terms of DI and CD, what values define the profile of a Cluster 2 horse? The Distribution platform in JMP provided extensive insight.



The Distribution analysis revealed that the values for this cluster range from 1.8 to 3.62 for DI and from 0.5 to 0.92 for CD. Summary statistics offered greater precision for each metric.
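Those ranges can be turned into a quick screen. The sketch below treats the Cluster 2 profile as a simple rectangle with inclusive bounds, which is my simplification: actual cluster membership depends on distance to the cluster centers, not on a box. The function name and the test values are hypothetical.

```python
# Observed Cluster 2 ranges reported by the Distribution analysis.
DI_RANGE = (1.8, 3.62)
CD_RANGE = (0.5, 0.92)


def fits_cluster2_profile(di, cd):
    """Return True when both metrics fall inside the observed ranges."""
    return (
        DI_RANGE[0] <= di <= DI_RANGE[1]
        and CD_RANGE[0] <= cd <= CD_RANGE[1]
    )


# Made-up values: the first fits the profile; the second passes the
# older DI < 4.0 / CD < 1.0 rule of thumb but falls outside the profile.
print(fits_cluster2_profile(3.0, 0.75))
print(fits_cluster2_profile(3.9, 0.95))
```

The second example shows how the profile is narrower than the historical rule of thumb.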


Developing a profile of horses is a fun exercise, but I have found similar approaches to be of great value in understanding human behavior, whether for marketing, risk, healthcare, education, crime or athletics. Quantifying your subject of interest is the first step in my preferred approach to analytics; it creates the opportunity for truly targeted modeling, and thus better results.

The Clustering platform in JMP allows you to save the formula used to create your clusters, so you can readily apply it to the wider population or your new customers. In the case of the Belmont Stakes, I used my research on the previous 11 races to determine my set of Cluster 2 horses for the 2016 race.
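Conceptually, scoring new records with a saved K-Means result amounts to assigning each one to its nearest cluster center. The sketch below illustrates that idea only; the formula JMP actually saves may differ (for example, in how it scales the columns), and the centroid values and entrant names here are hypothetical placeholders, not the results of my analysis.

```python
import math

# Hypothetical cluster centers as (DI, CD); not the values from this analysis.
CENTROIDS = {1: (1.2, 0.25), 2: (2.8, 0.75), 3: (5.5, 1.25)}


def assign_cluster(di, cd, centroids=CENTROIDS):
    """Return the label of the centroid closest to the point (di, cd)."""
    return min(
        centroids,
        key=lambda label: math.dist((di, cd), centroids[label]),
    )


# Made-up entrants, for illustration only.
entrants = {"Entrant X": (3.0, 0.80), "Entrant Y": (5.0, 1.10)}
labels = {name: assign_cluster(di, cd) for name, (di, cd) in entrants.items()}
print(labels)
```

Filtering the resulting labels for cluster 2 gives the short list of horses to watch.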


So thanks to this analysis, I know what my Belmont Stakes horses look like for 2016. Do you know what your customers, patients and students look like?



Louis Valente wrote:


So are the contenders Exaggerator, Creator and Governor Malibu?


Sam Edgemon wrote:

Yes - and speaking from an analytical perspective only; add Forever D'oro and Stradivari to the list of "cluster 2" horses.


sean s wrote:

Nicely done at 16-1.