bernd_heinen

Sailing and the art of data quality assessment

Gerhard Svolba is a colleague at SAS who is not only an experienced analyst and a caring father, but also an author for SAS Press and an enthusiastic sailor. He has done valuable research about detecting data quality problems and their consequences for data analysis.
The start of the regatta (Photo used courtesy of Gerhard Svolba)

 

As a statistician, I’m well aware of the importance and the burden of a solid data quality assessment. And because I am into windsurfing and scuba diving, Gerhard and I always have a lot to talk about when we meet. Recently, he told me that he had recorded his last sailing regatta and offered to send me the log file. I accepted, and a few days later I received a JMP data table (very friendly!) with the data from his GPX logger, along with some recommendations for what to look for in the data.
The data table with logging data

 

To me, there is no data file as boring as a GPX log. It’s just a timestamp, coordinates, speed and compass heading.

 

What can you expect from a file like that? Even worse, my standard initial analysis – bringing the data into Graph Builder – just showed some odd lines.

Austria and Lake Neusiedl

 

Well, zooming way out of the picture revealed that the regatta took place on a lake in the easternmost area of Austria. And yes, Gerhard is Austrian, and you should not only listen to what he has to tell but also hear his wonderful accent.

 

Back to the regatta itself. With sailing, it is interesting to see how the race went, as there are only some buoys to pass but no prescribed track between the buoys. With the help of the Local Data Filter, I could use the timestamp to follow the moves of the sailboat.

The points in Graph Builder, the Local Data Filter and Graph Options

 

In the graph showing all the waypoints, I first set “lock scales” from the hot spot because I didn’t want to zoom into the selected area but follow the route over the entire course.

Parts of the course displayed at different slider positions

 

With the slider, I selected a tiny time slice and moved it over the scale. I could see how they kept their boat at the western part of the area, then made a long leg toward the southeast, and then a few tacking maneuvers before a sharp bend to the north (probably around a buoy). They went straight north, northwest, and then started a second round with a different tacking strategy.

Time range with no recordings

 

To my great surprise, all points vanished after the second round, and I found a sizable time period without any activity at all. Nothing is more fun than showing colleagues their mistakes! So I called Gerhard and told him about my findings.

 

“Well,” he said, “I forgot to tell you: the data is from three races, one after the other.” Good to know.

 

Now I wanted to identify the different races by a variable in the data set. There’s nothing easier than that with the interactive capabilities of JMP. I just moved the left slider handle to the origin of the scale. Now both rounds of the first race were shown. From the context menu, I picked “Name Selection in Column,” named the column and assigned the value 1. Then I did the same for the next two races with numbers 2 and 3, respectively, and calculated the sum of all in a fourth column. Now I was able to overlay the races in one graph.
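For readers who prefer a scripted route, the same labeling can be approximated by looking for long pauses between consecutive timestamps. The following is only a minimal Python sketch, not what was done in JMP: the function name `label_races`, the five-minute threshold, and the sample timestamps are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def label_races(timestamps, max_gap=timedelta(minutes=5)):
    """Return a race number for each timestamp.

    A new race is assumed to start whenever the pause since the previous
    record exceeds max_gap. The timestamps must be sorted in ascending order.
    """
    labels = []
    race = 0
    previous = None
    for ts in timestamps:
        if previous is None or ts - previous > max_gap:
            race += 1  # first record, or a long pause: next race begins
        labels.append(race)
        previous = ts
    return labels


# Hypothetical log: seconds elapsed since an arbitrary start time.
start = datetime(2000, 1, 1, 10, 0, 0)
log = [start + timedelta(seconds=s) for s in (0, 10, 20, 1200, 1210, 3000)]

print(label_races(log))  # [1, 1, 1, 2, 2, 3]
```

The threshold has to be longer than the logger's normal recording interval and shorter than the break between races, so it still takes a look at the data (or a phone call) to choose it sensibly.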

Graphical subsetting and labeling of data points


Overlay graphs with three identified races

  

So far, after just playing around with the slider and making a phone call, I had learned that I had data from three races. Once I added some subject-matter knowledge, I was also able to infer things for which I have no data: I now know where the buoys were placed, and I have pretty good information about which direction the wind was blowing.

 

A sailing boat should get the highest speed over the ground when it is sailing with wind from behind. The data has compass heading and speed, so I looked at these.

North-northwest courses selected in Distribution

 

The wind blew toward the north to northwest. Taking into consideration that a sailing boat does not provide solid ground, so the logged heading wobbles, I selected a range of compass headings around the north course, including the high values from 300 to 360 degrees and the ones below 20 degrees, to find the fastest parts of the course. To double-check my selection, I looked at the positional graph and corrected my selection a bit so that I really covered the complete distance that was traveled without tacking.
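Selecting "around north" is slightly awkward in code because the range wraps past 360 degrees. As a rough sketch of that selection in Python (the helper `in_heading_range` and the sample headings are hypothetical, not part of the original analysis):

```python
def in_heading_range(heading, start, end):
    """True if heading lies on the arc running clockwise from start to end.

    All values are compass degrees. The arc may wrap past north,
    for example start=300, end=20.
    """
    heading %= 360
    start %= 360
    end %= 360
    if start <= end:
        return start <= heading <= end
    # Wrapping arc: either above the start or below the end.
    return heading >= start or heading <= end


# Hypothetical compass readings.
headings = [348.15, 5.0, 19.9, 25.0, 180.0, 300.0, 299.9]
northish = [h for h in headings if in_heading_range(h, 300, 20)]

print(northish)  # [348.15, 5.0, 19.9, 300.0]
```

Treating the selection as one clockwise arc avoids writing two separate conditions for the values just below 360 and just above 0.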

North compass readings on southbound tacks?

 

Imagine my surprise when I found that some of the selected data points appeared on southbound legs of the course, since I thought I had selected only northern courses. Now I had another reason to call Gerhard, and this time the answer was not that easy. The GPX logger exports its data as an unformatted stream. Usually headings are reported with two decimal digits. But if the direction logged was exactly a whole number, the logger simply skipped the decimal places. The data import initially ignored that behavior and always interpreted the last two digits as decimals. For example, 34815 was correctly imported as 348.15 degrees, but 348 was falsely imported as 3.48 degrees. So a whole-number heading on a southbound leg, say 180, would be read as 1.80 degrees and fall into my northern selection.
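To make the import problem concrete, here is a small Python sketch that reproduces the described behavior and then cross-checks each logged heading against the direction of travel computed from consecutive positions. Everything in it is an assumption for illustration: the function names, the 45-degree tolerance, and the three track points are made up, and the actual importer is not shown in this post. Note also that a boat's compass heading and its course over ground can legitimately differ somewhat, which is why the tolerance is generous.

```python
import math


def naive_import(raw):
    """Mimic the faulty import: always treat the last two digits as decimals."""
    return int(raw) / 100


def bearing(lat1, lon1, lat2, lon2):
    """Initial great-circle bearing from point 1 to point 2, in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    x = math.sin(dlon) * math.cos(phi2)
    y = (math.cos(phi1) * math.sin(phi2)
         - math.sin(phi1) * math.cos(phi2) * math.cos(dlon))
    return (math.degrees(math.atan2(x, y)) + 360) % 360


def angular_difference(a, b):
    """Smallest angle between two compass directions, in degrees."""
    d = abs(a - b) % 360
    return min(d, 360 - d)


def flag_suspect_headings(track, tolerance=45.0):
    """Return indices whose heading disagrees with the direction of travel.

    track is a list of (lat, lon, heading) tuples in time order.
    """
    suspects = []
    for i in range(1, len(track)):
        lat1, lon1, _ = track[i - 1]
        lat2, lon2, heading = track[i]
        travelled = bearing(lat1, lon1, lat2, lon2)
        if angular_difference(heading, travelled) > tolerance:
            suspects.append(i)
    return suspects


print(naive_import("34815"))  # 348.15
print(naive_import("348"))    # 3.48

# Hypothetical southbound leg; the middle record was logged as a whole number.
raw_track = [(47.800, 16.750, "18012"),
             (47.799, 16.750, "180"),
             (47.798, 16.750, "17950")]
track = [(lat, lon, naive_import(raw)) for lat, lon, raw in raw_track]

print(flag_suspect_headings(track))  # [1]
```

The check can only flag suspicious records. From the digits alone, a raw value such as 348 could stand for either 348 or 3.48 degrees, so repairing the data still requires knowing how the logger writes its output.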

 

What is interesting about this finding is that it was only revealed by the combination of some logic (selecting northern courses) and a graph (showing the sailboat’s positions). Logic alone would not have found this error. Without the interactive graphs in JMP, we could not have uncovered this problem. And for the logic part, there was no need to write code; all it took was sliding some bars in the histogram.

 

I will not conclude that with JMP it is fun to assess the quality of a given data set. But at least, JMP makes it quite easy. It offers insight you can’t get using other strategies or tools, and it is fast. And as I was reminded with this data, finding the problem is often easier than identifying the cause.