I used JMP to explore a recent FAA drone data set, inspired by a weekly data visualization challenge called 52Vis. The data set contains "reports of unmanned aircraft (UAS) sightings from pilots, citizens and law enforcement." I decided to focus on exploring the time data. I'll describe how I prepared the time data, noticed some quality issues in it and categorized it accordingly. Fortunately, JMP makes it easy to do visual data cleanup.
Preparing the data
The data is provided in two separate Excel files, and JMP's Excel import UI had no problem importing them with the default settings. However, the column names were a bit messy and didn't match between the two files.
I cleaned up the column names to match and then used Concatenate under the Tables menu to append one data table to the other. Concatenate even handled the columns being in different orders.
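Concatenating by column name rather than by position is what makes the differing column orders harmless. A minimal sketch of the same idea in Python (the sample rows and column names here are hypothetical, not the real FAA columns):

```python
import csv
import io

# Stand-ins for the two FAA files: same columns, different order.
file_a = "DateTime,City,State\n2015-08-01 14:30,Denver,CO\n"
file_b = "City,State,DateTime\nAustin,TX,2015-09-12 09:15\n"

def read_rows(text):
    """Read CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))

# Appending dicts keyed by column name makes the on-disk column
# order irrelevant, much like JMP's Concatenate handled it here.
combined = read_rows(file_a) + read_rows(file_b)
```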
Finding quality issues
I started playing with the data and noticed some quality issues with the time field. I created a new column that extracts the time of day from the DateTime column and made a histogram for it. I called it Time of Day Table because I'll be computing a different time of day later.
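The time-of-day extraction amounts to dropping the date part and keeping seconds since midnight. A rough Python equivalent of that formula column (the function name is mine, not JMP's):

```python
from datetime import datetime

def time_of_day_seconds(dt):
    """Seconds since midnight -- analogous to a time-of-day column
    computed from a full DateTime value."""
    return dt.hour * 3600 + dt.minute * 60 + dt.second

dt = datetime(2015, 8, 1, 14, 30, 0)
time_of_day_seconds(dt)  # 52200, i.e. 2:30 PM
```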
The most noticeable feature is the spike at midnight. It turns out there are a lot of times reported as exactly 12:00 AM, apparently as a missing value code. A second feature is the smaller but significant dip at noon. Likely, some of the noon times have had the AM/PM part miscoded, and some of the noon-hour times are reported in the midnight-hour slot (in addition to the missing value times).
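If exact midnight really is serving as a missing-value code, one conservative cleanup is to recode those values as missing before analysis. A small sketch under that assumption (a real cleanup would also need to decide what to do with genuine midnight sightings, which this throws away):

```python
def recode_midnight(seconds):
    """Treat exactly 12:00 AM (0 seconds since midnight) as missing,
    on the assumption that it's being used as a missing-value code."""
    return None if seconds == 0 else seconds
```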
While the distribution looks reasonable otherwise, it does seem to be shifted more toward the evening hours than expected. If only there were some way to quality-check the time values...
In fact, there is a way to quality-check the time values! The Report column contains free text describing the incident and often includes a report time in 24-hour GMT (note the Z suffix).
So let's extract the GMT time and compare with the time in the table. I'll make a new formula column based on a regular expression to extract the digits between "Time:" and "Z" in the report text.
And another one to convert it to a JMP time value (number of seconds).
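Those two formula columns can be sketched in Python as a regex extraction plus an HHMM-to-seconds conversion. The exact "Time: 1335Z" format below is my assumption about how the report text reads, and the function names are mine:

```python
import re

def extract_z_time(report):
    """Pull the digits between 'Time:' and 'Z' from the report text,
    mirroring the regex formula column."""
    m = re.search(r"Time:\s*(\d{3,4})Z", report)
    return m.group(1) if m else None

def hhmm_to_seconds(hhmm):
    """Convert a 24-hour HHMM string to seconds since midnight,
    the same representation as a JMP time value."""
    hhmm = hhmm.zfill(4)  # allow 3-digit times like '935'
    return int(hhmm[:2]) * 3600 + int(hhmm[2:]) * 60

t = extract_z_time("PRELIM INFO ... Time: 1335Z ...")
hhmm_to_seconds(t)  # 48900 seconds, i.e. 13:35
```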
The new time column is called Time of Day Extracted Z, and we can compare it with the table time in a scatterplot to reveal some interesting patterns.
I added two diagonal reference lines. The orange one corresponds to EST, and the diagonal clusters near it are other US time zones. The gray diagonal is where the time in the table is the same as GMT. Lots of those! Strangely, few of them occur before 1300Z (1 PM), which seems reasonable for GMT sighting times in the US but not for local times.
At this point, we can think of the times as falling into three categories:
Missing -- there was no GMT time within the free text
Identical -- the GMT time was the same as the "local" time
Offset -- the GMT time was at an offset from the local time, usually something reasonable for the US
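The three-way classification above can be expressed as a simple comparison of the two time columns. A sketch of that logic (an exact-equality test for "Identical" is a simplification; the real columns might warrant a small tolerance):

```python
def classify(z_seconds, local_seconds):
    """Classify a sighting by comparing the GMT time extracted from
    the report text with the local time in the table."""
    if z_seconds is None:
        return "Missing"    # no GMT time found in the free text
    if z_seconds == local_seconds:
        return "Identical"  # table time equals GMT -- suspicious
    return "Offset"         # differs from GMT, as a US local time should
```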
I added a couple columns to compute that classification and then plotted a histogram for each category.
Now we can see how the Identical category was skewing the local times into the evening. The Offset category has a more expected peak during daylight hours and matches well with the times where the GMT times were Missing. Both of the latter suffer from the 12:00 AM spike and 12:00 PM drop, due to coding issues. It's unclear why the Identical category has a mostly uniform distribution. It could be a combination of two offset normal distributions, one for the East Coast and one for the West Coast.
By considering the extracted GMT time, we can filter out suspicious time values for further time-based analysis. That is, we have more trust in the local times that are offset from GMT. And since those missing a GMT time have a similar distribution, we can use those with similar confidence.
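In code terms, the filter keeps the Offset rows and, given the similar distribution, the Missing rows too. A tiny illustration with made-up rows (the column name "Category" is my stand-in for the classification column):

```python
rows = [
    {"City": "Denver", "Category": "Offset"},
    {"City": "Austin", "Category": "Identical"},
    {"City": "Tampa",  "Category": "Missing"},
]

# Keep rows whose local time we have reason to trust: Offset (differs
# from GMT as expected) or Missing (no GMT in the text, but a similar
# time-of-day distribution to Offset).
trusted = [r for r in rows if r["Category"] in ("Offset", "Missing")]
```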
The 52Vis challenge is geared toward scripting solutions, and fortunately JMP offers both styles of interfaces: interactive and scripting. Though my initial explorations used the interactive interface, I made a script in the JMP Scripting Language (JSL) for the challenge that reproduces all of my steps. If you're interested, you can see the JSL in an earlier version of this post on GitHub.
A couple more things...
Even with only a few columns of data, there's still more to explore. You can use the geocoder add-in to convert the city/state to longitude and latitude for mapping.
And did you notice the altitude data in the report text I showed? A good number of the reports have altitude data and sometimes information on the appearance or size of the unmanned aircraft.