From picture to data
Intro
What do you do when you have only an image of a graph, but you need the data behind the graph? It can be a painstaking process if you decide to do it by hand, one point at a time. However, you can pull data from an image into a data table using JMP, and it only takes a few steps.
Step 1: Import the image
The first step is to load every pixel of the image into a JMP data table. While this could be done via scripting, it can also be done using the excellent Image Analyzer Add-In. I'll use a line graph of the population of Reykjavik as an example, but any image of a line graph should work.
After downloading and installing the add-in, selecting Add-Ins > Image Analyzer allows me to select the image and import it as a data table. The data table includes the RGB data (as well as some other information) for each pixel in the image.
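To make the idea of "an image as a data table" concrete, here is a minimal Python sketch that flattens a pixel grid into one record per pixel. It is not how the add-in works internally. The tiny hard-coded image, the function name `image_to_rows`, and the choice of the RGB mean as the intensity value are all assumptions made for illustration.

```python
# Hypothetical 2x3 image: a list of rows, each pixel an (R, G, B) tuple.
WHITE = (255, 255, 255)
BLUE = (0, 0, 255)
image = [
    [WHITE, BLUE, WHITE],
    [BLUE, WHITE, WHITE],
]


def image_to_rows(pixels):
    """Flatten a 2-D pixel grid into one record per pixel."""
    rows = []
    for y, line in enumerate(pixels):          # y counts down from the top
        for x, (r, g, b) in enumerate(line):   # x counts from the left
            rows.append({
                "X": x, "Y": y, "R": r, "G": g, "B": b,
                # Assumption: intensity taken as the mean of R, G, and B.
                "I": (r + g + b) / 3,
            })
    return rows


table = image_to_rows(image)
for row in table:
    print(row)
```

Each dictionary plays the role of one row in the image data table: a position plus the color information for that pixel.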
Step 2: Fix the scale
Once the table is open, recreating a black-and-white version of the image in Graph Builder is simple enough. In the image below, I've used the I column (Intensity) as a color column and reversed the Y scale, because pixel rows in an image are numbered from the top down.
Now I can see an immediate problem: the scale on the graph doesn't match the scale in JMP. This is easy to fix with some built-in JMP features and an intermediate data table.
- Create a blank data table with four columns (X, Y, Desired X, and Desired Y).
- Choose a data point in the lower left of the graph at a known location, for example, where the X-axis is the year 1896 and the Population is 0. Enter these values in the new data table in Row 1 of the Desired X and Desired Y columns.
- In the Graph Builder window, select the Crosshair tool, then hover over this data point. Note the X and Y values given. Write those in Row 1 of the X and Y columns.
- Repeat this process for a known point in the upper-right corner of the graph, recording both the current and desired values for X and Y. Your table should look like the example below.
The next step is to rescale the X and Y values in the image data table to match what is shown in the graph. Essentially, each column needs an offset and a scaling factor. Fortunately, this is just fitting a line, which is a basic function of the Fit Y by X platform. In the picture below, I've fitted one line for X and one for Y, each relating the current values to the desired values.
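With only two calibration points per axis, the fitted line passes exactly through both, so the same scale and offset can be worked out by hand. The Python sketch below shows that arithmetic. The function name `fit_line` and every number in it are hypothetical; they are not the crosshair readings from the Reykjavik image.

```python
def fit_line(p1, p2):
    """Return (m, b) so that desired = m * current + b through two points.

    Each point is a (current, desired) pair.
    """
    (c1, d1), (c2, d2) = p1, p2
    if c1 == c2:
        raise ValueError("calibration points need different current values")
    m = (d2 - d1) / (c2 - c1)   # scale
    b = d1 - m * c1             # offset
    return m, b


# Hypothetical readings: pixel column 50 is the year 1900,
# pixel column 450 is the year 2000.
m_x, b_x = fit_line((50, 1900), (450, 2000))

# Hypothetical readings: pixel row 400 is population 0,
# pixel row 100 is population 120000.
m_y, b_y = fit_line((400, 0), (100, 120000))

print(m_x, b_x)
print(m_y, b_y)
```

In this made-up example the Y scale comes out negative, because the pixel row number decreases as the population increases.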
The key here is to look at the formula for the Linear Fit. It gives both an offset and a scale, in the form of a y = mx + b equation, where m is the scale and b is the offset. These parameters can be used with the New Formula Column tool in JMP to make quick work of fixing the axes.
- In the image data table, right-click on the X column, select New Formula Column > Transform > Scale Offset...
- Enter the values for the scale factor (written here as Multiplier) and the Offset, copied from the formula in the Fit Y by X window.
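- Repeat this process for the Y column. Be sure to keep the negative sign on the multiplier: pixel Y values increase downward while the graph's Y-axis increases upward, so the fitted slope is negative.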
The image data table now contains new X and Y columns whose scales match the axes in the image. Switching the Graph Builder axes to these new columns shows how well they align with the axes in the image.
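The Scale Offset transform amounts to computing multiplier × value + offset for every row. A minimal Python sketch of that step follows; the helper name `scale_offset`, the pixel values, and the multipliers and offsets are all hypothetical stand-ins for the numbers you would copy from your own Fit Y by X report.

```python
def scale_offset(values, multiplier, offset):
    """Apply new = multiplier * old + offset to every value in a column."""
    return [multiplier * v + offset for v in values]


# Hypothetical pixel coordinates for three points on the line.
pixel_x = [50, 250, 450]
pixel_y = [400, 250, 100]

# Hypothetical fit results. The Y multiplier is negative because
# pixel rows count downward while the graph's Y-axis counts upward.
year = scale_offset(pixel_x, 0.25, 1887.5)
population = scale_offset(pixel_y, -400.0, 160000.0)

print(year)        # [1900.0, 1950.0, 2000.0]
print(population)  # [0.0, 60000.0, 120000.0]
```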
Step 3: Extract the data
The only step remaining is to eliminate all the rows in the image data table that aren't data points on the line of interest. To do this, K Means Clustering is a fast and nearly automatic tool. (Special thanks to @landonbw1 for showing me this trick.)
The original image appears to have four colors: white, blue, gray, and black. It should be a simple thing, then, to ask JMP to separate the R, G, and B columns into four clusters, with the blue line as one of those clusters. However, because of image compression, color blending, and other factors, there are actually over 500 separate colors in this image. Still, we can separate the main colors easily enough. As a rule of thumb, setting the number of clusters to twice the number of visible colors (colors that a regular human would say are in an image) creates clusters of individual graphic elements from an image file.
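If you are curious how many distinct colors an image really contains, counting the unique RGB triples is enough. The Python sketch below does this on a tiny hypothetical pixel list; the values are invented to mimic a few near-white and near-blue shades and are not taken from the Reykjavik image.

```python
from collections import Counter

# Hypothetical pixels: to the eye these are white, blue, gray, and black,
# but blending produces slightly different shades.
pixels = [
    (255, 255, 255), (255, 255, 255), (254, 254, 255),
    (0, 0, 255), (12, 20, 250), (128, 128, 128),
    (0, 0, 0), (255, 255, 255),
]

counts = Counter(pixels)
print(len(counts))            # number of distinct RGB triples
print(counts.most_common(1))  # the single most frequent color
```

Here four visible colors turn into six distinct triples, which is why asking for exactly four clusters is not always enough.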
- Select Analyze > Clustering > K Means Cluster.
- Enter the R, G, and B columns into the Y, Columns selection, then select OK.
- Enter 8 in the Number of Clusters section, then select Go.
- Select each cluster in the Cluster Summary table in turn, while looking at the Graph Builder, to determine which cluster the line belongs to. In the picture below, highlighting Cluster 5 highlights only the data points in the line from the graph.
- Select all the clusters except the cluster of interest (5, in this case), and delete these rows from the table. Now, the only remaining rows are the data points from the graph in the image. Changing a few Graph Builder options results in an interactive graph of the data in JMP, where there was just a static image before.
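The clustering and filtering steps above can be sketched in plain Python. This is a bare-bones k-means written for illustration, not JMP's implementation. The farthest-point starting centers, the seven hypothetical pixel rows, the use of 4 clusters (the toy data is too small for 8), and the trick of picking the cluster nearest a reference blue instead of inspecting each cluster by eye are all assumptions.

```python
def dist2(p, q):
    """Squared Euclidean distance between two color tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=50):
    """Cluster points into k groups; returns (labels, centers)."""
    # Assumption: deterministic farthest-point start, so runs are repeatable.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(dist2(p, c) for c in centers))
        )
    labels = None
    for _ in range(max_iter):
        # Assign every point to its nearest center.
        new_labels = [
            min(range(k), key=lambda i: dist2(p, centers[i])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Move each center to the mean of its members.
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                centers[i] = tuple(sum(ch) / len(members) for ch in zip(*members))
    return labels, centers


# Hypothetical image rows: (X, Y, R, G, B).
rows = [
    (0, 0, 255, 255, 255),
    (1, 0, 250, 250, 252),
    (2, 5, 0, 0, 255),
    (3, 6, 10, 15, 245),
    (4, 9, 128, 128, 128),
    (5, 9, 0, 0, 0),
    (6, 7, 5, 8, 250),
]
colors = [row[2:] for row in rows]
labels, centers = kmeans(colors, k=4)

# Assumption: the line is the cluster whose center is closest to pure blue.
line_cluster = min(range(4), key=lambda i: dist2(centers[i], (0, 0, 255)))

# Keep only the rows in that cluster, as (X, Y) pairs.
line_rows = [(r[0], r[1]) for r, lab in zip(rows, labels) if lab == line_cluster]
print(line_rows)
```

On a real image, checking each cluster visually, as described above, is more reliable than trusting a single reference color.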
Conclusion
There's a certain irony in the fact that it takes longer to read how to extract data from an image into a data table than it takes to actually do the process. Irony aside, however, there are also many opportunities to use this process, including:
- Surveying the available literature at the beginning of a research project.
- Reverse-engineering a process or product.
- Digitizing stress-strain curves for use in the Fatigue Model platform.
What opportunities does this open up for your work?
Population Graph - Final.jmp