Introduction
Most of our data in R&D comes from databases, Excel or CSV files, or is entered directly into JMP. But supporting data often comes in other formats, most notably PDF files. For example, think of chemical reference tables, vendor information sheets, or even historic company data that now only exist as in PDF. This type of supporting information can be very helpful to complement experimental data, and thankfully, JMP has a PDF import wizard that enables importing this type of data. Because these tables were not formatted for statistical software like JMP, importing data from PDFs often results in messy data tables that sometimes require additional cleanup steps.
In my R&D past, I was aware of the PDF import wizard in JMP and used it sometimes. There were also instances where I found PDF data to augment my research project, but I could not figure out how to clean up the table after importing the data into JMP. In those cases, I had to abandon the project and forego the additional information, which was very frustrating.
Fast forward a few years, and now I feel I have the tools to deal with any PDF file, no matter how messy the data is after import. Like many messy data problems, column formulas are the way to go to clean up imported PDF data. This is especially true when recording the steps in Workflow Builder or when capturing the script is desired to automate the process.
Who should read this blog?
Keep reading if you:
- Want to learn more about the basics of PDF import wizard
- Are getting stuck with data clean up after importing PDF data
- Want to learn more about using flag columns and column subscripts to clean up data
- Want to learn how to loop PDF clean up steps over multiple pages
Note that you do not have to read through all of it, you can jump to the section of interest below.
Overview
I find I come back to two constructs when cleaning data imported from PDFs, which I use either by themselves or in combinations. They are:
- Flag columns: Creating a binary flag to identify rows that meet a certain criterion. It can be “missing vs not missing,” “zero” versus “not zero,” and so on. Flagging columns can also be used in conjunction with selecting rows based on a condition.
- Row subscripts: Compare one cell to the next (or the previous one) and use this to generate a flagging column or to concatenate cell values from sequential cells. Use Row(), Row() +1, Row()-1 as a column subscript (used between subscript brackets “[ ]”).
I also have a number of helpful formulas that I often use.
Helpful Formulas:
- Ismissing()
- Contains()
- If ()
- Length()
- Concatenate (“||”)
In this blog post, I’ll examine the following types of issues typically associated with importing PDF data:
For each topic, you'll find both a short discussion as well as a video.
Before we take a look at these issues and how to clean them up, remember that in JMP there are almost always multiple ways to get to the same end result. However, I will pick solutions that are recorded in Workflow Builder to make the cleanup steps repeatable with a workflow. Typically, this means I will be using formula columns instead of the “quicker” right-click option (if available) or using an add-in option. Setting up the cleanup steps in a way where each step gets recorded allows me to use the workflow on other data tables that need the exact same clean-up steps. For example, check out the final example of a PDF table that spans 13 pages. On importing this PDF into JMP, 13 tables are created. As you can see in the last video, I recorded the cleanup steps for the first table with formula columns, then looped the clean-up steps in JMP to apply them to all 13 tables (done with the Group with For Each function in Workflow Builder in JMP 18), which makes cleaning the 13 tables easy! Jump to the videos with the Workflow Builder Looping examples here.
Let’s start with a simple example. If you are unfamiliar with the PDF import wizard in JMP, watch this video first to see how I use the import wizard to get data into JMP this pdf.
Issue 1: Missing cell values due to merged row header in PDF.
Often tables in PDF files are formatted as if several rows were merged and contain a single value, such as a subheader. Upon importing to JMP, the table has empty rows instead of repeating the value for each row that contains data (Figure 1).
Figure 1: Missing cell values in JMP (right side of image) due to merged row subheaders (left side of image).
One quick possible solution for the cleanup is to select the column with the missing values (e.g., Column 3 in Figure 1), then right-click on a cell and select Fill, followed by Replace Missing with Previous Value. In all versions of JMP up to JMP 18, this right-click action does not yet get recorded in Workflow Builder or in the advanced log. It will be added in JMP 19, but for earlier versions, a more robust option for replacing missing cell values with the previous value that is recorded in Workflow Builder or in the advanced log is to use a column formula. I used a column formula that, in the formula itself, assigns a variable called “var” if Column 3 is not missing, and then uses this value for the next missing value (Figure 2). This method works well for replacing previous values, because the column formula is executed from the top down. In other words, every time there is a value in Column 3, the variable “var” gets updated and the updated value is used for subsequent rows until a new update occurs for a non-missing row.Figure 2: Replacing missing with previous value column formula.
Watch me demonstrate this step in the next video:
Issue 2a: Cell value is split over multiple rows
Another example of merged rows in a PDF table that require cleanup in JMP is when a value, such as a subheader, is split over several rows in one column (Figure 3).
Figure 3: A cell value, in this case a category subheader (left), is split over multiple rows on import to JMP (right).
To clean this up, the parts of the strings must be identified and then strung together. The whole string has to be repeated for the appropriate rows and then the remaining partial strings must be deleted. One way to solve this issue is to create a Flag column that identifies missing versus non-missing, or that identifies “next row missing” versus “next row not missing.” For the example in Figure 3, the following is a possible sequence of steps (all of which are shown in Figure 4):
- Identify rows that have text and rows without text with a Flag column (Step 1).
- Use the Flag column and a formula with row subscripts to identify rows with text and concatenate with text in the next row (Step 2).
- Once there are cells with the whole string, create a flag column to identify the rows with partial text strings and delete these partial strings (Steps 3 and 4).
Replace missing values for empty cells with the entire string (See solution for issue 1).
In the example of Figure 3, I used the formulas shown in Figure 4.
Figure 4: Series of column formulas to find and correct a subheader split over two rows.
You can see me demonstrate these steps here:
Issue 2b: Cell value is split over multiple columns
For the next example, I use a different PDF file, which I found here. In this case, cell values are split over two rows AND over two columns. Take a look at the left image in Figure 5. In this example, the chemical name “Oxy-1,4-phenylenesulfonyl-1,4-phenyleneoxy-1,4-phenelyneisopropylidene-1,4-phenylene” (indicated by the blue box) is not only split over two rows but also over two columns. It happened because, higher up in the table, two values for the glass transition Tg are listed rather than one, making it impossible for a column divider to avoid cutting either the chemical name or the Tg values (see the red vertical column divider).
Figure 5: Thermal Transitions table demonstrating a cell value that is split over multiple columns and rows.
(Source: https://www3.nd.edu/~hgao/thermal_transitions_of_homopolymers.pdf)
When the column divider is placed, as shown in the left image of Figure 5, there are two cells in the Tg column that have cut-off text that belongs in the Repeating Unit column. In this case, we can use the fact that the cut-off text is letters, instead of the numbers that are in the other cells. Therefore, one way to correct this issue is to follow these steps, as shown in Figure 6:
- Create a column with the first character (Step 1).
- Use the Replace String option of the Recode platform to replace anything that is a number with nothing. For this, I used the regular expression [0-9-] to find all numbers including the “-”. (Step 2).
- Use the “Ismissing” formula to locate the text string and concatenate it with the original (cut-off) string (Step 3).
- Clean up the Tg column using conditional statements and the Text Flag column. While this method is a bit clunky, it does the job. If you know a regular expression, you could use that here as well under the column option Recode. I used the conditional statements shown in Figure 7.
After all this cleanup, I identified where the cell contents are split vertically over two rows: Rows 39/40, 41/42 and 43/44. The cells that are split have the first part of the string ending with a “-”, so that could be used as an indicator. Next, a formula construction like the one shown in Figure 4 can be used to add the text strings together. Eventually, empty data cells can be deleted to avoid duplicate rows.
Figure 6: Finding and concatenating split text strings.
Figure 7: Removing text string from numbers column.
The next video shows how I used these steps so you can see one possible solution for this issue.
Issue 3: Rows with headers or subheaders are mixed in with the data
For the last example, I go back to the PDF that lists the fatty acid compositions for fats. In this case, you can see that the data table contains rows with headers, units, or subheaders that, upon importing into JMP, are now mixed in with the data (Figure 8).
One way to identify such rows is to use the “Contains” formula to identify a symbol or text string in the header/unit row that is not found in the actual data. In Figure 8, for example, we use the formula “Contains(“%”). This formula will either result in a “0” if this symbol does not occur in the cell or in the actual location of the character in the string (resulting in an integer > 0). With a simple conditional statement, a Flag column can be generated.
For this example, a possible formula is shown in Figure 9. To remove the rows with a “1” (indicating the “%” is present), select rows that meet this criterion by using the “select where” option from the Row/Row Selection menu (Figure 10). Next, delete the selected rows. The advanced log and Workflow Builder capture this sequence of steps and can automate it for repeated use. Note that there are a limited set off selection criteria directly available under the “select where” menu, including “Contains,” so you could set this up directly and forego the Flag column.Figure 8: Rows with headers or subheaders are mixed in with the data and appear as rows in the JMP table.
Figure 9: Conditional statement with "Contains" formula to create a Flag column that identifies if the "%" character is present.
Figure 10: Adding conditional row selection criteria using a Flag column.
In this video, you can see the cleanup steps used to identify and remove unwanted rows in the data table (in this case, the subheader rows).
Bonus Videos: Import Examples with WorkflowBuilder loop
In this video, watch how all the cleanup steps performed for the first page can be applied to a second page with just a few clicks. I use the “Group with For Each” loop in Workflow Builder for the fat.pdf example shown in the discussion above. Next, I demonstrate a more extreme case where I have 13 pages in a wide table format. Each page has to be imported separately and must be cleaned up. However, once I used the PDF import wizard to import the table from each page, I can simply record the cleanup steps for one table. I can then use the "Group with For Each" option in Workflow Builder to automatically run the cleanup steps on the other 12! How is that for efficient!?!
Finally, I am 100% sure there are shorter or neater ways to address some of these issues that I don't know of...yet. Feel free to share other solutions in the comments below.