Putting DC Metro ridership numbers in context

A few weeks ago, the Washington Post made a widely circulated chart of Washington, DC, Metro morning ridership numbers for recent inaugurations and the Jan. 21 Women’s March. Big event attendance numbers can be difficult and contentious to calculate, and the ridership counts bring an appealing objectivity to the conversation even if the counts are only roughly correlated with attendance counts. Some attendees don’t ride the Metro, and some riders don’t attend the event; however, when the counts are exceptional, something must be going on.

To see if they’re exceptional, we need to put the counts in context, so I set about looking for daily ridership data for the DC Metro. The best I found was the daily ridership estimates at Open Data DC for the years 2004 through 2015. Although it’s not the latest data, it’s enough to give the cited numbers some context by showing typical counts, three inaugurations and two Saturday rallies held in 2010: Restoring Honor (Glenn Beck) and Restore Sanity and/or Fear (Jon Stewart and Stephen Colbert). I attended the Restore Sanity rally and rode the Metro that morning, and both were body-against-body crowded, so I have that as an experiential baseline.

An initial view of the data

Here’s a first look at the count data over time in JMP. A few outliers and patterns jump out even in this coarse view.



We can already see one prominent outlier corresponding to the 2009 Inauguration. The low and high clusters suggest there are at least two categories of days, and grouping by day of week verifies that weekdays are the biggest cluster.


However, there are still a significant number of outlier counts on weekdays, especially Mondays. Hovering over a few of them confirms my suspicions that those are holidays. Splitting the graph by days of the week was a simple right-click transformation, but JMP doesn’t know US holidays, so I wrote a script to tag the official holidays, and then combined them with Sundays for an overview graph.
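
For readers who want to try the same categorization outside JMP, here is a rough Python sketch of the holiday-tagging idea. It is not the script I used: the function names and the three category labels are made up for illustration, and it assumes the standard federal holiday rules in effect during 2004 through 2015, with fixed-date holidays that fall on a weekend shifted to the nearest weekday. Inauguration Day, which is a holiday only for the DC area, is deliberately left out.

```python
from datetime import date, timedelta

def nth_weekday(year, month, weekday, n):
    """Return the n-th occurrence of a weekday (Monday=0) in a month."""
    first = date(year, month, 1)
    offset = (weekday - first.weekday()) % 7
    return first + timedelta(days=offset + 7 * (n - 1))

def last_monday_of_may(year):
    """Return the last Monday in May (Memorial Day)."""
    last = date(year, 5, 31)
    return last - timedelta(days=last.weekday())

def observed(d):
    """Assumed rule: Saturday holidays move to Friday, Sunday ones to Monday."""
    if d.weekday() == 5:
        return d - timedelta(days=1)
    if d.weekday() == 6:
        return d + timedelta(days=1)
    return d

def federal_holidays(year):
    """Observed US federal holidays for one year (rules as of 2004-2015)."""
    return {
        observed(date(year, 1, 1)),    # New Year's Day
        nth_weekday(year, 1, 0, 3),    # Martin Luther King Jr. Day
        nth_weekday(year, 2, 0, 3),    # Washington's Birthday
        last_monday_of_may(year),      # Memorial Day
        observed(date(year, 7, 4)),    # Independence Day
        nth_weekday(year, 9, 0, 1),    # Labor Day
        nth_weekday(year, 10, 0, 2),   # Columbus Day
        observed(date(year, 11, 11)),  # Veterans Day
        nth_weekday(year, 11, 3, 4),   # Thanksgiving
        observed(date(year, 12, 25)),  # Christmas Day
    }

def day_category(d):
    """Hypothetical labels: holidays are grouped with Sundays."""
    # Next year's set is included because New Year's Day can be
    # observed on Dec. 31 of the previous year.
    holidays = federal_holidays(d.year) | federal_holidays(d.year + 1)
    if d in holidays or d.weekday() == 6:
        return "Sunday/Holiday"
    if d.weekday() == 5:
        return "Saturday"
    return "Weekday"
```

With these rules, `day_category(date(2009, 1, 19))` (the MLK holiday before the 2009 Inauguration) comes back as "Sunday/Holiday", while the Restore Sanity rally date, `date(2010, 10, 30)`, is a plain "Saturday".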


I labeled a few notable days and added a smoother to show the apparent seasonal pattern (more on that later). With the categorization, we can see that the 2009 MLK holiday, which was the day before the 2009 Inauguration, and the 2013 Inauguration/MLK holiday were also exceptional in the context of holidays.

The recent ridership counts of interest are not in this data set. But the count for the 2017 Inauguration was 570,000, according to the Washington Metropolitan Area Transit Authority (WMATA) cited in a Washington Post article, while for the 2017 Women’s March it was just over 1 million, according to a tweet by WMATA.

The Restore Sanity rally stands out well from the typical Saturday ridership level (800,000 vs. 300,000); however, it’s not easy to determine a proper baseline for the inauguration dates. Inauguration Day is a local holiday for federal workers. To estimate its effect, we can use the Bureau of Labor Statistics 2013 data, which says DC had about 740,000 workers, and 200,000 of them were federal workers. The typical workday ridership is about 700,000, which is almost one ride per worker (presumably, about half the workers took the Metro two times each). Following that ratio, there would be about 200,000 fewer worker riders on Inauguration Day, and likely some non-federal workers get Inauguration Day off, too. That would put the baseline Inauguration Day ridership in the range of 400,000 to 500,000.
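
The back-of-the-envelope estimate above can be written out as a short calculation. This is only a sketch of the arithmetic: the three default figures are the rounded numbers quoted in the text, the function name is made up, and `nonfederal_share_off` is a purely hypothetical knob for the unknown fraction of non-federal workers who also get the day off.

```python
def inauguration_baseline(typical_rides=700_000,
                          total_workers=740_000,
                          federal_workers=200_000,
                          nonfederal_share_off=0.0):
    """Estimate ordinary ridership on a day when federal workers stay home."""
    # Typical workday rides per worker (a little under one).
    rides_per_worker = typical_rides / total_workers
    # Workers assumed to be off: all federal workers plus a guessed
    # share of everyone else.
    nonfederal_workers = total_workers - federal_workers
    workers_off = federal_workers + nonfederal_share_off * nonfederal_workers
    return typical_rides - rides_per_worker * workers_off

upper = inauguration_baseline()                          # federal workers only
lower = inauguration_baseline(nonfederal_share_off=0.2)  # 20% is a guess
print(round(upper), round(lower))
```

With only federal workers off, the estimate is a little over 510,000. Assuming a fifth of non-federal workers are also off (an arbitrary choice) brings it down to about 409,000, which is how a range of roughly 400,000 to 500,000 can arise.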

In case you’re wondering about some of the unlabeled outliers, you can try this interactive version of the chart and hover over points of interest. The other high Saturday ridership counts are mostly National Cherry Blossom Festival days in early April. Many of the high holiday counts were for Independence Day. The low winter days were bad storms.

Watch out for the mind bugs!

After I shared the above chart on Twitter, I looked more into the seasonal pattern and realized I had fallen for all three of the cognitive traps Alberto Cairo described as "mind bugs" in his book The Truthful Art: patternicity, storytelling and confirmation.

I saw an up-and-down pattern in the dots, created a story in my mind that ridership was tied to the seasons, and confirmed it by adding the spline smoother, which had a sine wave appearance. Fortunately, I wasn’t totally sold on that confirmation and realized that the spline smoother is constrained to be smooth no matter what the data looks like, and 11 years of fluctuations may be too much for it to sort out. So I set out to take a closer look at the data by day of year. This graph shows workdays for all 11 years.



Now I could see the pattern is not so sinusoidal. Rather than ridership flowing with the seasons, we have a few localized effects. The biggest is the drop-off during the last couple weeks of the year, and that smaller bump around day 100 is probably the National Cherry Blossom Festival in the spring. I imagine tourists account for the summer bumps, but what about the dip in August? Is that when the locals take vacation to escape the heat and humidity? Or is it because Congress is usually away in August? (DC residents: Let me know in comments!)
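
The day-of-year view is easy to reproduce in outline. The sketch below is in Python rather than JMP and makes some assumptions: the data arrives as (date, count) pairs, the function name is invented, and a per-day median stands in for whatever summary you prefer. Note that day numbers after February shift by one in leap years, which slightly blurs fixed-date effects.

```python
from collections import defaultdict
from datetime import date
from statistics import median

def median_by_day_of_year(records):
    """Group (date, count) pairs by day of year and take the median count."""
    by_day = defaultdict(list)
    for d, count in records:
        # tm_yday runs from 1 (Jan. 1) to 365, or 366 in leap years.
        by_day[d.timetuple().tm_yday].append(count)
    return {day: median(counts) for day, counts in sorted(by_day.items())}

# Placeholder counts, not real ridership figures.
sample = [
    (date(2010, 4, 10), 1),
    (date(2011, 4, 10), 3),
    (date(2012, 4, 9), 5),   # leap year: April 9 is also day 100
]
print(median_by_day_of_year(sample))
```

Pooling all the years onto one 1-to-366 axis is what makes localized effects, such as the year-end drop, stand out instead of being averaged into a smooth wave.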

After realizing my seasonal story was wrong, I redid my overall chart with a simpler smoother that just shows the long-term trend for ridership over the years.



I’m still looking for more complete data, but this data set has been interesting to explore. I learned a lot about how people use the DC Metro! Experiencing the mind bugs firsthand provided a good lesson on analytical exploration. In the end, I think I came up with a chart that provides some helpful context.


It was interesting to look at the interactive HTML version. There are some extremely low-count days. Being able to see the dates, it looks like many of them are due to weather-related events. In particular, the "zero count" and near-zero-count days on Oct. 29 and 30, 2012, were for Hurricane Sandy, and the federal government closed on the 30th. Another day in 2010 was very low, and that was for a "blizzard event". As with any moderate to big data, the influence of these points may not be very large in your analysis, but I always like to understand where those outliers come from. A more detailed data scrubbing might look for all severe weather events and group them with the weekends and holidays group.

Thanks @MathStatChem. Every outlier tells a story, it seems.