Scagnostics JMP Add-In: A new way to explore your data
Aug 13, 2014 3:38 PM
Scagnostics, short for scatterplot diagnostics, was introduced by John and Paul Tukey and later popularized by Leland Wilkinson in Graph-Theoretic Scagnostics (2005). The measures were refined in High-Dimensional Visual Analytics: Interactive Exploration Guided by Pairwise Views of Point Distributions (2006).
The beauty of scagnostics is the ability to visually explore a data set. JMP has a built-in feature called Scatterplot Matrix (SPLOM), which lets you compare the relationships between many pairs of variables at once.
However, SPLOMs lose their effectiveness when the number of variables gets too large. Figure 1 shows a portion of the SPLOM report.
Figure 1. SPLOM for Drosophila Aging Data
Let's explore the Drosophila Aging data (in JMP Sample Data), which has 48 observations and 100 numeric variables. Notice in Figure 1 how many variables this data set contains. A SPLOM of this size is overwhelming, and it limits our ability to inspect the data visually. Figure 1 shows only about 15 percent of the full SPLOM. In a world where data sets are growing every day, we need to be able to extract meaningful information from the relationships between our variables. That’s where scagnostics comes in! Scagnostics assesses five aspects of scatterplots: outliers, shape, trend, density and coherence.
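To put a number on why the full SPLOM is hard to read, here is a small Python sketch (not part of the add-in) that counts the distinct pairs of variables, assuming one scatterplot per unordered pair:

```python
from math import comb

n_variables = 100                 # numeric variables in the Drosophila Aging data
n_pairs = comb(n_variables, 2)    # distinct unordered pairs, i.e. distinct scatterplots
print(n_pairs)                    # 4950
```

With 4,950 scatterplots to scan, summarizing each one with a handful of numbers becomes much more practical than reading every panel.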
This summer, I wrote a JMP add-in (which you can download from the File Exchange if you have a free SAS profile) that allows you to interactively explore data using nine graph-theoretic measures: Outlying, Skewed, Clumpy, Sparse, Striated, Convex, Skinny, Stringy and Monotonic. The add-in combines three existing features of JMP: Distribution, Scatterplot Matrix and Graph Builder. In the add-in's scatterplot matrix, the axes are the scagnostic measures, and each point represents one 2D scatterplot of a pair of variables. When you select a point in the scatterplot matrix in the bottom left, Graph Builder shows the corresponding scatterplot for the two variables in the bottom right.
As an example, one point has already been selected in the SPLOM in Figure 2. The corresponding variables are log2in_Tsp42Ej and log2in_CG6372. For this pair of variables, there are two discernible clusters of data, which is reflected in a high Clumpy value.
Figure 2. Scagnostics for Drosophila Aging Data – Clumpy Example
Figure 3 below shows that when we select a point with a high Monotonic value, we see a clear association, here a strong linear relationship, between the variables log2in_alpha_Cat and log2in_CG3430der.
Figure 3. Scagnostics for Drosophila Aging Data – Monotonic Example
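For readers curious about what sits behind the Monotonic value: in Wilkinson's graph-theoretic scagnostics, it is the squared Spearman rank correlation of the two variables. The following is a minimal Python sketch of that definition, not the add-in's code; the function names `average_ranks`, `pearson` and `monotonic` are made up for illustration, and the binning step of the published method is left out.

```python
from statistics import mean

def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def monotonic(x, y):
    """Squared Spearman correlation: Pearson correlation of the ranks, squared."""
    return pearson(average_ranks(x), average_ranks(y)) ** 2
```

Because the measure works on ranks, any strictly increasing or strictly decreasing relationship scores 1, whether or not it is linear. For example, `monotonic([1, 2, 3, 4, 5], [1, 8, 27, 64, 125])` is 1 even though the points follow a cubic curve, so a high Monotonic value is a prompt to look at the plot rather than proof of linearity.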
Another key aspect of scagnostics is outlier detection. Review the Graph Builder plot in Figure 4 below. When we inspect the two variables log2in_CG18178 and log2in_BcDNA_GH04120, we see two data points that visually appear to be outliers. A high Outlying value, together with a relatively high Skewed value, supports the notion that this pair of variables has major outliers.
Figure 4. Scagnostics for Drosophila Aging Data – Outlying Example
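To give a feel for how an Outlying value can be computed, here is a rough Python sketch based on the minimum spanning tree (MST) definition in Graph-Theoretic Scagnostics: a point is an outlier if every MST edge touching it is longer than q75 + 1.5 × (q75 − q25) of the MST edge lengths, and Outlying is the share of total MST length carried by edges that touch outliers. This is a simplified illustration, not the add-in's code. The function names are hypothetical, the quartile method is whatever Python's `statistics.quantiles` uses by default, and the binning and rescaling steps of the published method are omitted.

```python
from math import dist
from statistics import quantiles

def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph; returns (i, j, length) edges."""
    n = len(points)
    in_tree = [False] * n
    best = [float("inf")] * n   # shortest known distance from the tree to each vertex
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the vertex outside the tree that is closest to the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, best[u]))
        for v in range(n):
            if not in_tree[v]:
                d = dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges

def outlying(points):
    """Share of MST length on edges touching outliers (needs >= 3 distinct points)."""
    edges = mst_edges(points)
    lengths = [w for _, _, w in edges]
    q25, _, q75 = quantiles(lengths, n=4)
    omega = q75 + 1.5 * (q75 - q25)          # cutoff for an unusually long edge
    incident = {i: [] for i in range(len(points))}
    for i, j, w in edges:
        incident[i].append(w)
        incident[j].append(w)
    # A vertex is an outlier when all of its MST edges exceed the cutoff.
    outliers = {v for v, ws in incident.items() if all(w > omega for w in ws)}
    outlier_length = sum(w for i, j, w in edges if i in outliers or j in outliers)
    return outlier_length / sum(lengths)
```

As a usage example, six evenly spaced points plus one far-away point, `[(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5, 0), (100, 0)]`, give an Outlying value of 0.95, because the single long edge to the distant point accounts for 95 of the 100 units of MST length. Evenly spaced points on their own score 0.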
When we compare the original SPLOM report in Figure 1 with the recursive SPLOM and Graph Builder reports in Figures 2, 3 and 4, the latter reveal far more about the relationships in the data.