Data extraction from PDF files
I often analyze publicly available data, and sometimes I want to turn a table in a PDF file into a data table. In particular, data published by government agencies is surprisingly often not in Excel or CSV format, but is embedded as tables in PDF files.
In such cases, until four or five years ago, I would select the relevant table in the PDF file, copy it, and paste it into Excel or JMP. In many cases, however, it was not properly recognized as data, and in the worst case I gave up and entered the data manually.
That is why the "PDF Import Wizard", a feature added in JMP version 15, was a blessing to me. Creating data from a PDF, which used to take a long time, can now be done in just a few seconds! This frees up time for deeper analysis of the data.
My previous blog article (on the gender gap index) used a table from a PDF report; without the PDF Import Wizard, I would have been frustrated by the hassle of creating the data.
In this blog post, I will therefore introduce the great features of the "PDF Import Wizard". Even if you already use this function, there may be some things you don't know about, so please read on.
What you can do with the PDF Import Wizard
"PDF Import Wizard" is a function that allows you to adjust the data to be imported while referring to the preview before importing the PDF file.
When you select [File] > [Open] from the JMP menu bar and select the PDF file to import, the "PDF Import Wizard" will automatically start (*1).

A preview will be displayed on the left side of the wizard as shown above. It will automatically recognize tables in the PDF and display them in the "Table Preview" on the right.
Check if the table you want to load is recognized here, and if there are no problems, click the [OK] button and it will open as a JMP data table.

In this example, we can easily load the experimental data described in the paper.
After this, I will introduce two amazing things along with actual examples.
Amazing part 1: Tables with the same column names are automatically concatenated into one
The following is a PDF of the statistical table of employment status of general national civil servants published by the Cabinet Secretariat (*2). Suppose you want to use this table to create a data table in JMP that shows the number of part-time employees in each ministry and agency, the difference from the previous year (persons), and the year-on-year change (%).

Automatic table recognition is useful, but in practice you rarely want to read every table in a PDF; usually you want only a specific one.
In such a case, click the [Ignore all tables] button at the top right of the wizard to cancel the automatic selection.
After that, use the preview on the left to move to the page containing the table you want to load, and click the red triangle button at the top left of the page. If you select [Auto-detect this page], only the tables on that page will be automatically detected.

However, automatic detection may not work for some tables. In that case, drag a rectangle around the table in the preview, and the table will be recognized within the rectangle. In practice, this method of selecting tables by dragging is convenient.

After selecting the relevant table, select "Concatenate tables with matching column names" on the right side of the preview and click the [OK] button.

Then, the tables that were displayed in two parts will be combined into one table and loaded.
Normally, you would need to load two tables and then use [Concatenate] to combine them into one, but this wizard does it for you.
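To make the idea of concatenating tables with matching column names concrete, here is a minimal Python sketch. It is not how JMP implements the wizard; the function name `concat_matching`, the (header, rows) representation, and all values are hypothetical placeholders chosen for illustration.

```python
# Conceptual sketch only: this is NOT JMP's implementation.
# Each table is a (header, rows) pair; all values are hypothetical placeholders.

def concat_matching(tables):
    """Stack the rows of tables whose headers are identical, in first-seen order."""
    grouped = {}  # header tuple -> accumulated rows (dicts keep insertion order)
    for header, rows in tables:
        grouped.setdefault(tuple(header), []).extend(rows)
    return [(list(header), rows) for header, rows in grouped.items()]


header = ["Ministry", "Employees", "Change", "Change (%)"]
upper_part = (header, [["Ministry A", 120, 5, 4.3]])
lower_part = (header, [["Ministry B", 80, -2, -2.4]])
other = (["Year", "Total"], [[2022, 200]])  # different header, so kept separate

result = concat_matching([upper_part, other, lower_part])
for columns, rows in result:
    print(columns, rows)
```

In this sketch the two parts that share a header end up as one table with two rows, while the table with a different header stays on its own.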
Therefore, by simply modifying the created data table a little, I was able to create histograms of the number of employees, the difference from the previous year, and the year-on-year change (%), and examine the ministries and agencies that were outliers.

Amazing part 2: Combine tables that span multiple pages into one
This PDF file shows the results of the ski jumping national team competition (*3). The results from 1st to 8th place are shown, but they are not contained on one page but over two pages. The table format is organized by country, but I would like to combine these tables into one table.

This is the preview of the PDF. Select the target tables, check "Concatenate all tables into one", and click the [OK] button.

The multiple tables are loaded as one. Unlike the previous example, the PDF file used here has some tables that do not have column names, so I used "Concatenate all tables into one".
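The difference from the previous option can be sketched in Python as follows. This is only an illustration under stated assumptions, not JMP's actual behavior: I assume the tables are stacked by column position, that the first available header is reused, and that missing cells are padded. The function name `concat_all` and the sample values are hypothetical.

```python
# Conceptual sketch only: this is NOT JMP's implementation.
# Assumption: tables are stacked by column position, and the first header found is reused.

def concat_all(tables):
    """Stack every table into one, whether or not it has column names."""
    header = list(next((h for h, _ in tables if h), []))
    longest_row = max((len(row) for _, rows in tables for row in rows), default=0)
    width = max(len(header), longest_row)
    # Generate placeholder names for any column that has no name.
    header += [f"Column {i + 1}" for i in range(len(header), width)]
    stacked = []
    for _, rows in tables:
        for row in rows:
            # Pad short rows so that every row has the same number of cells.
            stacked.append(list(row) + [None] * (width - len(row)))
    return header, stacked


# Hypothetical placeholder values; the second page has no column names.
page1 = (["Rank", "Team", "Points"], [[1, "Team A", 1000.0]])
page2 = (None, [[2, "Team B", 990.5]])

columns, rows = concat_all([page1, page2])
print(columns)
print(rows)
```

Because the second table has no header, grouping by column names would not work here, which is why stacking everything by position is the natural choice in this sketch.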
After this, some data processing was required, but I was able to create a score plot for each team (for 4 players) without spending too much time. There are teams where the scores of the four players vary, and teams where they don't, which is an interesting result.

We will continue to make full use of this wonderful function and work hard on data analysis!
by Naohiro Masukawa (JMP Japan)
PDF files cited
*1: Rational design of a scalable bioprocess platform for bacterial cellulose production
https://www.sciencedirect.com/science/article/abs/pii/S0144861718312839
*2: Cabinet Secretariat General-level national civil servant employment status statistics table
https://www.cas.go.jp/jp/gaiyou/jimu/jinjikyoku/files/20220701_toukeihyou_gaiyou.pdf
*3: FIS SKI JUMPING WORLD CUP Official Results
https://medias2.fis-ski.com/pdf/2023/JP/3093/2023JP3093RL.pdf