Remember to visualize your data

Jun 11, 2023 4:28 AM

Introduction

I still think one of JMP's most powerful features is its capability to quickly create different visualizations. This post is a reminder, a discussion starter, and a wish list item in one: visualizing your data is extremely important. It includes two "simple" yet very powerful examples of why you should always visualize your data and not just trust summary/descriptive statistics.

Reminder part

Anscombe's quartet

Anscombe's quartet (wikipedia) is a fairly popular data set for demonstrating why data should be visualized. It consists of four different data sets that have almost identical descriptive statistics. The quartet was constructed by Francis Anscombe in 1973. More information can be found in the article where the data sets are presented (Anscombe, F.J., 1973. Graphs in statistical analysis. The American Statistician, 27(1), pp.17-21.).

JMP also has Anscombe.jmp as one of its sample data tables. The data table includes a table script, "The Quartet", which uses Fit Y by X to demonstrate the data set.
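As a rough sketch of the point above, the following JSL prints the mean and standard deviation of every numeric column in Anscombe.jmp and then runs the table script. The table name and script name come from this post. The loop over column names is my own choice so that nothing depends on how the columns are actually named, and the variable names (colNames, vals) are only illustrative. I have not compared this against the demo script attached to the post.

```jsl
// Open the sample table named in the post
dt = Open( "$SAMPLE_DATA/Anscombe.jmp" );

// Collect the names of all numeric columns as strings,
// so the script does not assume any particular column names
colNames = dt << Get Column Names( Numeric, String );

// Print the mean and standard deviation of each column to the log
For( i = 1, i <= N Items( colNames ), i++,
	vals = Column( dt, colNames[i] ) << Get Values;
	Show( colNames[i], Mean( vals ), Std Dev( vals ) );
);

// Run the table script mentioned above to see the four fits
dt << Run Script( "The Quartet" );
```

If the summary values in the log look nearly identical across the four data sets while the plots look nothing alike, that is the whole argument for plotting first.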

Datasaurus Dozen

I see this as a somewhat newer take on Anscombe's Quartet. You can find more information from Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics th... . The idea of the article isn't exactly to demonstrate that visualization is important, but rather to show how to create such datasets. The article uses the Datasaurus dataset created by Alberto Cairo as the baseline for creating different plots.

Datasaurus dozen from Same Stats, Different Graphs (autodesk.com)

Wish part

I hope that JMP would either add the Datasaurus Dozen as a sample data set or, even better, implement the algorithm from the paper by Matejka and Fitzmaurice, which would allow users to create this type of data set (Matejka, J. and Fitzmaurice, G., 2017, May. Same stats, different graphs: generating datasets with varied appearance and identical statistics through simulated annealing. In Proceedings of the 2017 CHI conference on human factors in computing systems (pp. 1290-1294).).

JMP already has the JMP Man Dozen.jmp sample data table, which has been built from JMP Man by using the methods demonstrated in the same Matejka and Fitzmaurice (2017) paper.

JMP Man Dozen dataset visualized in Graph Builder

Demo

Note: The demo requires JMP 16 or newer to run.

I have also committed this to my GitHub page (jthi0/visualize_your_data (github.com)). It has a slightly different script and different text than this post.

I "quickly" wrote a demo script which can be used to show Anscombe's Quartet and the Datasaurus Dozen. It is attached to this post as a zip file. Unzip visualize_your_data.zip into one folder and run demo.jsl; it should open a window with both data sets visualized.

The demo window includes a slider which can be used to make the markers more visible in all of the Graph Builders. The images on the left side have the markers hidden, and the ones on the right side have transparency set to 1 (100%).

I have also posted this to my GitHub (github.com/jthi0), but the GitHub version doesn't include the Datasaurus set, as I didn't check whether I could freely share it. You can freely download it from https://www.autodesk.com/content/dam/autodesk/www/autodesk-reasearch/Publications/pdf/SameStatsDataA... . I used DataSaurusDoze.tsv as the baseline for the demo script. Convert it to a .jmp file and save it to the same folder as demo.jsl and the support scripts.

This script could fairly easily be converted to an add-in, and there are quite a few improvements that could be made (such as exporting data from the user interface).

Discussion

Have you faced similar situations where visualization has saved you or your organization from a lot of headaches?

-Jarmo

SDF1 · Oct 19, 2022 08:40 AM

Hi @jthi ,

This is a great post and reminder of the importance of visualizing the data (literally looking at structures in the data). I don't know how many times I've come across people sharing a spreadsheet and talking about trends in the data when there are so many columns and rows of numbers that you can't make heads or tails of what is actually happening. The same goes for discussions (and potentially decisions) that are based solely on one or two data points, or perhaps simply the mean of a few data points. Without additional context and especially visualization, this is such a risky path to go down.

One thing I would add to the discussion is to try to visualize the data in multiple different ways -- heat maps, scatterplots (often in 3D across multiple variables to cycle through and see how the data might group or arrange across different variables), bar graphs, box plots, parallel plots, etc. Any approach that brings a different perspective and helps you understand the data better is good to try. And you might come across a new visualization technique that really gets the message across to your audience in a better way. Sometimes the better visualization requires some additional steps, like transformations, looking at outliers, or clustering in order to see how aggregated groups relate to one another -- I've found the Constellation Plot in the Hierarchical Clustering platform to be quite useful for this. Of course, much of it depends on the specifics of the data you're working with and what makes sense for your organization.

Thanks for the post!

DS

hogi · Oct 19, 2022 12:19 PM

I like this one (from sample data)

// Open the JMP Man Dozen sample table and run its saved table script
dt = Open( "$SAMPLE_DATA/JMP Man Dozen.jmp" );
dt << Run Script( "Trellis with Summaries" );

jthi · Oct 19, 2022 01:27 PM

It seems that the JMP Man Dozen.jmp data table is JMP's implementation of the Datasaurus Dozen, using the same paper I mentioned in my post. I didn't even know that existed, so thanks for showing that @hogi !

-Jarmo

WebDesignesCrow · Oct 21, 2022 03:45 AM

Agreed. The Graph Builder feature in JMP is a true champion. Sometimes people tend to jump straight into sophisticated statistical analysis without visualizing or exploring the data, and they miss a lot of opportunities. JMP has an advantage here.