Jun 23, 2011

Visualizing Data Table Differences

The JMP Scripting Guide provides a sample script for comparing two similar data tables. For trickier cases, you can export the tables as text and use an external text diff tool. When there aren't too many columns, you can also try a visual comparison in JMP.
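As a rough illustration of the text-diff route, here is a minimal Python sketch that uses the standard `difflib` module in place of an external diff tool. The function name `diff_tables` and the sample values are invented for illustration; this is not the Scripting Guide's script, and the values are not the actual Iris differences.

```python
import difflib

def diff_tables(rows_a, rows_b, name_a="table_a", name_b="table_b"):
    """Render each row as a comma-separated line and return unified-diff lines."""
    text_a = [",".join(map(str, row)) for row in rows_a]
    text_b = [",".join(map(str, row)) for row in rows_b]
    return list(
        difflib.unified_diff(text_a, text_b, fromfile=name_a, tofile=name_b, lineterm="")
    )

# Hypothetical tables that differ in one value of the second row.
table_a = [(5.1, 3.5, 1.4, 0.2), (4.9, 3.0, 1.4, 0.2)]
table_b = [(5.1, 3.5, 1.4, 0.2), (4.9, 3.1, 1.4, 0.2)]

diff_lines = diff_tables(table_a, table_b)
for line in diff_lines:
    print(line)
```

A line-based diff like this is sensitive to row order, so it works best when both tables list their rows in the same sequence.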

I recently ran across a Wikipedia page for the famous Fisher Iris data and decided to use it to test JMP's HTML Import feature. The import worked fine, but when I checked summary statistics, I could see the data wasn't quite the same as JMP's built-in sample data file. The row order wasn't the same either, so a simple row-to-row comparison wouldn't work, even after sorting, since any row with a bad value would land in a different position and throw off the alignment of the rows around it.
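To see why sorting does not rescue a row-to-row comparison, consider this small Python sketch. The two-column tables and the helper `mismatched_positions` are hypothetical, chosen only to show how a single miscoded value can misalign every sorted row.

```python
# Hypothetical two-column tables; values are invented for illustration.
reference = [(1.0, 0.1), (2.0, 0.2), (3.0, 0.3), (4.0, 0.4), (5.0, 0.5)]
# Same rows in a different order, except that 1.0 was miscoded as 9.0.
imported = [(4.0, 0.4), (9.0, 0.1), (2.0, 0.2), (5.0, 0.5), (3.0, 0.3)]

def mismatched_positions(rows_a, rows_b):
    """Sort both tables, then list the positions where the paired rows differ."""
    pairs = zip(sorted(rows_a), sorted(rows_b))
    return [i for i, (a, b) in enumerate(pairs) if a != b]

positions = mismatched_positions(reference, imported)
print(f"{len(positions)} of {len(reference)} sorted rows differ")
```

Only one row is actually wrong, but because the bad row sorts to the end, every position in the sorted comparison is reported as a mismatch.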

I ended up using a 3-D scatterplot, as shown here:

Differences in Iris data sets

I used Tables > Concatenate to combine the two tables into one, turning on the Create Source Column option so I could tell which row came from which table. I used Rows > Color or Mark by Column to color the rows based on the source column. I gave all four numeric columns to Scatterplot 3-D and turned on transparency and high-quality markers. (Scatterplot 3-D only shows 3 variables at once, of course, but you can interactively switch among the 4 columns with the controls beneath the graph.)
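Outside JMP, the same preparation can be sketched in a few lines of Python. This is only an analogy for the menu steps above, not JMP's implementation: the function `concatenate_with_source`, the `SOURCE_COLORS` mapping, the choice of which source is red or blue, and the sample rows are all assumptions made for illustration.

```python
# Hypothetical source colors with an alpha channel, standing in for JMP's
# row colors plus marker transparency: (red, green, blue, alpha).
SOURCE_COLORS = {
    "JMP": (1.0, 0.0, 0.0, 0.5),
    "Wikipedia": (0.0, 0.0, 1.0, 0.5),
}

def concatenate_with_source(tables):
    """Stack several tables into one list, tagging each row with its source."""
    combined = []
    for source, rows in tables.items():
        for row in rows:
            combined.append(
                {"source": source, "values": tuple(row), "color": SOURCE_COLORS[source]}
            )
    return combined

# Invented four-column rows; the second row differs in its last value.
tables = {
    "JMP": [(1.0, 0.1, 2.0, 0.5), (2.0, 0.2, 3.0, 0.6)],
    "Wikipedia": [(1.0, 0.1, 2.0, 0.5), (2.0, 0.2, 3.0, 0.9)],
}

combined = concatenate_with_source(tables)
for row in combined:
    print(row["source"], row["values"], row["color"])
```

The resulting list could then be passed to any plotting library that accepts per-point RGBA colors.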

Points from one source are red; points from the other source are blue; and points in both sources are purple (because of the transparency). I then selected the red and blue points and checked them manually against Fisher's original paper. It turned out the JMP data was correct, and the Wikipedia data had 6 numerical coding errors. I have since corrected the data on Wikipedia.
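The red, blue, and purple groups correspond to rows found only in the first table, only in the second, and in both. A minimal Python sketch of that classification, independent of row order, can use `collections.Counter` as a multiset. The function `classify_rows` and the sample rows are hypothetical, and the sketch assumes values compare exactly equal, which holds when both tables were parsed from the same printed decimals.

```python
from collections import Counter

def classify_rows(rows_a, rows_b):
    """Split rows into only-in-A, only-in-B, and shared, ignoring row order.

    Counters keep duplicate rows, so a row that appears twice in A and once
    in B yields one shared copy and one only-in-A copy.
    """
    count_a = Counter(map(tuple, rows_a))
    count_b = Counter(map(tuple, rows_b))
    only_a = sorted((count_a - count_b).elements())
    only_b = sorted((count_b - count_a).elements())
    both = sorted((count_a & count_b).elements())
    return only_a, only_b, both

# Invented rows: the tables share one row and disagree on the other.
rows_a = [(1.0, 0.1, 2.0, 0.5), (2.0, 0.2, 3.0, 0.6)]
rows_b = [(2.0, 0.2, 3.0, 0.9), (1.0, 0.1, 2.0, 0.5)]

only_a, only_b, both = classify_rows(rows_a, rows_b)
print("only in A (red):  ", only_a)
print("only in B (blue): ", only_b)
print("in both (purple): ", both)
```

The rows in the first two groups are the ones worth checking by hand against an authoritative source.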

A 2-D scatterplot matrix with transparency would achieve a similar result and have the advantage of showing all dimensions at once. However, I like the 3-D view in this case since I can rotate the image to better see overlapping points.
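The trade-off between the two views can be counted directly. The short Python sketch below enumerates the distinct views for four columns; the column labels are the usual names of the four Iris measurements and are used here only for illustration.

```python
from itertools import combinations

columns = ["Sepal length", "Sepal width", "Petal length", "Petal width"]

# A 3-D scatterplot shows one triple of columns at a time.
triples = list(combinations(columns, 3))
# A 2-D scatterplot matrix shows every distinct pair of columns at once.
pairs = list(combinations(columns, 2))

print(f"{len(triples)} possible 3-D views, one visible at a time")
print(f"{len(pairs)} distinct 2-D panels, all visible together")
```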
