Visual Data Quality with Named Colors in JMP, Part 2
Mar 10, 2010 1:29 PM
After my previous exploration of colors and names revealed inconsistencies in the Wikipedia color data, I looked around for a more authoritative source of color names. No luck finding an oracle, but I did find another interesting data set.
Where the Wikipedia table provided color values for a given set of names, this data set provides names for given color values. Dolores Labs created a Mechanical Turk task asking people to name the color swatches they were assigned. The results are provided in a CSV file, which JMP opens without any trouble at all.
As you might expect, there are lots of duplicate names. Out of about 10,000 color values, 1,000 were assigned the name "blue." Conversely, sometimes the same color value would be assigned different names, such as "purple" and "violet." And because the workers typed the names in rather than picking them from a list, there is some variation in spelling. Column Recode is really handy for fixing up some of these problems. Here is a subset of the names after applying Trim Whitespace and Title Case in Col : Recode.
Color names before and after
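For readers working outside JMP, the same cleanup can be approximated in a few lines of Python. This is only a sketch of the idea, not what Recode does internally; the function name normalize_name is made up for illustration, and collapsing repeated internal spaces is an extra assumption beyond plain trimming.

```python
def normalize_name(raw):
    """Trim whitespace and title-case a free-text color name.

    Hypothetical helper: it mimics the effect of Trim Whitespace followed
    by Title Case, and additionally collapses runs of internal whitespace
    (an assumption, not something the post states).
    """
    # split() with no arguments drops leading/trailing whitespace and
    # treats any run of whitespace as a single separator.
    words = raw.split()
    # capitalize() upper-cases the first letter and lower-cases the rest,
    # which avoids str.title() turning "don't" into "Don'T".
    return " ".join(word.capitalize() for word in words)


if __name__ == "__main__":
    for raw in ["  light BLUE", "purple ", "sky   blue"]:
        print(repr(raw), "->", repr(normalize_name(raw)))
```

Applying a function like this before counting means that "purple " and "Purple" land in the same group.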
I also used Find and Replace to change "Drk" to "Dark" and to expand other abbreviations. That still left about 1,500 unique names, most of them occurring only once. I wanted to see how much variation there was in colors of the same name, so I first filtered out the names with fewer than 15 color values. To do that, I used Summarize on the color name column, which told me how many colors had each name. Then I could sort the summary table by the counts and select the rows with fewer than 15 colors. Since the tables are linked, that also selects the corresponding rows in the original table. After excluding those, I plotted the remaining rows in Graph Builder as Hue by Name.
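The abbreviation fix and the count-based filter can be sketched outside JMP as well. The code below is a minimal illustration: the (name, hue) row layout, the function names, and the ABBREVIATIONS table are all assumptions, and only the "Drk" to "Dark" entry comes from the text above.

```python
from collections import Counter

# Hypothetical lookup table; only "Drk" -> "Dark" is taken from the post.
ABBREVIATIONS = {"Drk": "Dark"}


def expand_abbreviations(name):
    """Replace whole-word abbreviations in an already title-cased name."""
    return " ".join(ABBREVIATIONS.get(word, word) for word in name.split())


def keep_common_names(rows, min_count=15):
    """Keep only rows whose name occurs at least min_count times.

    rows is assumed to be a list of (name, hue) pairs. Counting first and
    filtering second plays the role of the summary table and the linked
    row selection.
    """
    counts = Counter(name for name, _ in rows)
    return [(name, hue) for name, hue in rows if counts[name] >= min_count]


if __name__ == "__main__":
    rows = [("Blue", 240.0)] * 15 + [("Drk Blue", 235.0)] * 3
    rows = [(expand_abbreviations(name), hue) for name, hue in rows]
    kept = keep_common_names(rows)
    print(len(rows), "rows before,", len(kept), "rows after filtering")
```

The threshold of 15 matches the cutoff used here; it is exposed as a parameter so other cutoffs are easy to try.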
The graph is long, so I'll leave it to the end, but if you look closely, you might notice things like a purple called green and a green called blue. The names are ordered by average hue, which doesn't always work well for the reds: hue is an angle, and reds legitimately straddle the boundary between 0° and 360°, so a plain average of their hues can land far from red. So this is another example where a visualization helps detect data quality issues.
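One common way around the red wrap-around problem is to average hues as angles rather than as plain numbers. The sketch below shows a circular mean in Python; it is offered as an alternative ordering statistic, not as what the graph in this post uses, and the function name is made up for illustration.

```python
import math


def circular_mean_hue(hues_deg):
    """Return the circular mean of hue angles, in degrees within [0, 360].

    Each hue is treated as a point on the unit circle. The points are
    summed and the angle of the resulting vector is returned. The result
    is not meaningful when the hues cancel out (for example 0 and 180).
    """
    sin_sum = sum(math.sin(math.radians(h)) for h in hues_deg)
    cos_sum = sum(math.cos(math.radians(h)) for h in hues_deg)
    return math.degrees(math.atan2(sin_sum, cos_sum)) % 360.0


if __name__ == "__main__":
    reds = [350.0, 10.0]
    print("arithmetic mean:", sum(reds) / len(reds))  # lands on cyan
    print("circular mean:  ", circular_mean_hue(reds))  # stays at red
```

For two reds at 350° and 10°, the arithmetic mean is 180°, which is cyan, while the circular mean stays at the 0°/360° boundary where red lives.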