Graph Makeover: Where same-sex couples live in the US
The map coloring shows the proportion of same-sex couples in each county in 2010. The numbers are necessarily approximate; the Upshot uses Gary Gates’ adjustments of the raw American Community Survey (ACS) data to account for coding errors.
One feature of the map that struck me as remarkable is the amount of variability throughout the country. However, it also reminded me of Howard Wainer’s chapter on “The Most Dangerous Equation” in his book Picturing the Uncertain World. The equation in question, which he calls “de Moivre’s equation,” dictates that smaller samples have greater variability. In particular, the standard deviation of a sample mean is inversely proportional to the square root of the sample size. He gives examples based on disease rates and school performance. In each case, the smallest populations yield both the highest and the lowest rates.
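For reference, here is one common way to write the relationship described above. The symbols are my own notation, not Wainer's: \sigma is the standard deviation of individual observations, n is the sample size, and \bar{p} is the overall proportion. The second form is the usual normal-approximation standard error for a proportion, which is an assumption about how the equation applies to rates like these.

```latex
\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}},
\qquad
\mathrm{SE}(\hat{p}) = \sqrt{\frac{\bar{p}\,(1 - \bar{p})}{n}}
```

Quartering the sample size doubles the standard error, which is why the smallest counties swing the most.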
It’s not hard to see why with an example from this data set. Douglas County, South Dakota, has one of the highest proportions of same-sex couples in the country at 17.4 per 1,000, while nearby Hanson County has the minimum of 0. Each of these counties has fewer than 1,500 households, and given the sampling rate of the ACS for South Dakota, we can estimate that fewer than 30 households were sampled in each county. So one same-sex couple in 30 respondents for Douglas County looks like a relatively large proportion even after Gates’ adjustments. Meanwhile 0 same-sex couples in 30 looks like none for all of Hanson County.
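The arithmetic behind that example can be sketched in a few lines of Python. The national rate used below (5 per 1,000) is a hypothetical value chosen for illustration, since the exact figure isn't stated here, and the sample of 30 households is the upper bound estimated above. The sketch ignores Gates' adjustments and works with raw counts only.

```python
# Hypothetical inputs: the rate is an assumption for illustration only.
ASSUMED_RATE = 0.005      # assumption: 5 same-sex couples per 1,000 households
SAMPLED_HOUSEHOLDS = 30   # upper bound on sampled households per county


def prob_zero_observed(rate, n):
    """Probability that none of n independently sampled households is a same-sex couple."""
    return (1.0 - rate) ** n


def smallest_nonzero_estimate(n):
    """Smallest nonzero raw proportion possible with n respondents (one hit)."""
    return 1.0 / n


p_zero = prob_zero_observed(ASSUMED_RATE, SAMPLED_HOUSEHOLDS)
min_rate = smallest_nonzero_estimate(SAMPLED_HOUSEHOLDS)

print(f"P(no same-sex couples sampled) = {p_zero:.2f}")
print(f"Smallest nonzero raw estimate  = {1000 * min_rate:.1f} per 1,000")
```

Under these assumptions, a county this small can report only zero or a rate several times the assumed average, with nothing in between.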
There are “small area estimation” techniques for dealing with some of these problems. For instance, averaging nearby counties together can help smooth the extremes at the expense of possibly losing information. Another technique is to combine successive years of data, but in this case, 2010 was the first year the survey asked about unmarried partners.
My interest, though, is in finding a way to better see “where same-sex couples live,” which is the title of the Upshot article. The text of that article is careful to compare only rates for large counties, but the map has no such qualifications. Can we show the uncertainty somehow?
One graphical technique for understanding proportions with different sample sizes is a funnel plot. A funnel plot is a scatterplot of the proportions versus their corresponding sample sizes. For small sample sizes, you expect more variation. Assuming all of the samples are from the same population, we can draw curves that correspond to where we expect most of the proportions to fall. Dots far outside the curves are likely outliers, possibly not really part of the same population after all.
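As a rough sketch of how such curves can be computed, the Python below finds a size-weighted grand mean and normal-approximation limits around it. This is not the JMP script behind the graphs in this post. The function names and the example sample sizes are hypothetical, and the limits are clipped to the range 0 to 1 because a proportion can't fall outside it.

```python
import math


def weighted_grand_mean(proportions, sizes):
    """Overall proportion, weighting each group by its sample size."""
    total = sum(sizes)
    return sum(p * n for p, n in zip(proportions, sizes)) / total


def funnel_limits(p_bar, n, z=1.96):
    """Normal-approximation limits around p_bar for a sample of size n."""
    se = math.sqrt(p_bar * (1.0 - p_bar) / n)
    lower = max(0.0, p_bar - z * se)   # a proportion cannot be negative
    upper = min(1.0, p_bar + z * se)
    return lower, upper


# Hypothetical sample sizes, just to show the funnel narrowing.
for n in (30, 300, 3000, 30000):
    lo, hi = funnel_limits(0.005, n)
    print(f"n={n:>6}: {1000 * lo:5.2f} to {1000 * hi:5.2f} per 1,000")
```

Plotting the upper and lower limits against n gives the funnel shape: wide on the left where samples are small, narrow on the right.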
Here’s a funnel plot of the same-sex couple proportions, with some of the points labeled. For counties containing large cities, I’ve used the city name as the label.
The orange line shows the (weighted) grand mean, and the curves show the confidence intervals based on de Moivre’s equation. With 3,100 counties, we’d expect a few just outside the 99.8% interval assuming a normal distribution. But instead, we have many such counties, some of them far outside the interval. This suggests there is something systematically different about them, not just common random variation. Looking at the labels, we can see many large cities and a few college towns.
We can see that Douglas County has the sixth-highest proportion of same-sex couples, according to the adjusted data, but it’s well below the 95% line, so it's not that remarkable. How can we represent that on a map? One idea is to color each county by its distance above or below the mean relative to the curves. That is, we color it by a z-score, the number of standard errors above or below the mean. The inner curve represents a z-score of 1.96, and the outer curve represents a z-score of 3. Here is the resulting map.
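Here is a minimal sketch of that z-score calculation in Python, using the normal approximation for a proportion. The Douglas County rate of 17.4 per 1,000 comes from the data, but the grand mean of 5 per 1,000 and the sample size of 30 are assumptions for illustration (the latter being only the upper bound estimated earlier).

```python
import math


def z_score(p, p_bar, n):
    """Number of standard errors that proportion p lies from the grand mean p_bar."""
    se = math.sqrt(p_bar * (1.0 - p_bar) / n)
    return (p - p_bar) / se


# Douglas County, SD: the rate is from the text; mean and n are assumed values.
douglas_z = z_score(p=0.0174, p_bar=0.005, n=30)
print(f"Douglas County z-score (under these assumptions): {douglas_z:.2f}")
```

With these assumed inputs, the z-score lands inside the inner curve, which matches the point being unremarkable on the funnel plot.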
I made a custom blue-gray-red color scale with extra dark colors at the extremes and mapped to the z-score range -4 to 4, based on the mean of the non-outliers. All extreme outliers show up as the same dark red or blue, which loses some information at the edges so the middle doesn't get washed out. Of course, the proportion can’t go below zero, so only counties with large sample populations could show up as low extremes, but none do.
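The clamping idea can be sketched as follows. This is a simplified linear blue-gray-red ramp, not the custom JMP scale used for the map, and the RGB endpoint values are hypothetical choices.

```python
# Hypothetical endpoint colors; the actual scale used for the map may differ.
DARK_BLUE = (20, 50, 120)
GRAY = (200, 200, 200)
DARK_RED = (140, 20, 20)


def clamp(value, low, high):
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def lerp_color(start, end, t):
    """Linearly blend two RGB colors; t=0 gives start, t=1 gives end."""
    return tuple(round(a + (b - a) * t) for a, b in zip(start, end))


def z_to_rgb(z, limit=4.0):
    """Map a z-score to a color, saturating beyond +/- limit."""
    z = clamp(z, -limit, limit)
    end = DARK_RED if z > 0 else DARK_BLUE
    return lerp_color(GRAY, end, abs(z) / limit)


for z in (-6.0, -2.0, 0.0, 2.0, 4.0, 12.0):
    print(f"z={z:+5.1f} -> RGB {z_to_rgb(z)}")
```

Any z-score beyond the limit gets the same end color, which is the trade-off described above: detail at the extremes is given up to keep contrast in the middle.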
From this map, we can see the following:
There are some extreme high outliers, mostly around big cities.
If you know your US geography, you can notice that some smaller cities like Madison, Wisconsin, and Asheville, North Carolina, have high z-scores. (See my JMP table for an interactive version with hover labels.)
Most of the country is grayish and in the range of unremarkable variation, neither obviously higher nor lower than the mean. South Dakota certainly has less extreme variation than before.
We can still see a cluster of high values in New England though it's more concentrated than before.
The cluster of higher values in northern Wisconsin is still there but less pronounced.
The map also reminds us of some shortcomings of choropleth maps in general. In addition to the inaccuracy of color, the areas are bound to political borders and have irregular sizes. Some counties, such as San Francisco in California, are so small that we can barely see them, and others, such as those in the Southwest, have big areas that are dominated by localized populations.
If we really want to know where the same-sex couples live, we might pair the map with a chart of the highest- and lowest-scoring counties. Here are the counties with a z-score above 4. There are none below -4 or even -3; Utah County, Utah, has the lowest z-score at -2.98.
This has been a challenging data set to work with. While it’s interesting and potentially insightful, we have to remember that the sampling rates are low in places and the proportions are small enough that they can be affected by coding errors and other behaviors. Furthermore, proportions don't behave quite like "regular" numbers, but the Central Limit Theorem and the large number of counties give us some comfort in using a normal distribution. With cleaner, integer data, we could have used a binomial distribution for the confidence curves, as in Rick Wicklin's post on funnel plots.
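To give a sense of what binomial-based limits might look like with integer counts, here is a simplified Python sketch. It is not Rick Wicklin's method, which I haven't reproduced here; it simply takes raw binomial quantiles. The rate of 5 per 1,000 is again a hypothetical value, and the code is meant only for small sample sizes.

```python
import math


def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p); intended for small n only."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)


def binom_quantile(q, n, p):
    """Smallest count k with P(X <= k) >= q."""
    cumulative = 0.0
    for k in range(n + 1):
        cumulative += binom_pmf(k, n, p)
        if cumulative >= q:
            return k
    return n


def binomial_limits(n, p, coverage=0.95):
    """Lower and upper limits, as proportions, from binomial quantiles."""
    tail = (1.0 - coverage) / 2.0
    lower = binom_quantile(tail, n, p) / n
    upper = binom_quantile(1.0 - tail, n, p) / n
    return lower, upper


# Hypothetical rate of 5 per 1,000 with 30 sampled households.
low, high = binomial_limits(n=30, p=0.005)
print(f"95% limits: {1000 * low:.1f} to {1000 * high:.1f} per 1,000")
```

Because counts are whole numbers, these limits move in steps of 1/n rather than forming smooth curves, which is most visible for the smallest samples.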
My data file with graph scripts is available in the JMP User Community, and an add-in for making funnel plots in JMP will be available soon.