Good day, everyone. My name is Peter Fogel. I'm an employee of CSL Behring Innovation, and it's my pleasure today to talk to you about Automated Extraction of Data from PDF Documents using what I call Customized JMP Add-ins. Let me give you a high-level overview of what we're going to do today.
First of all, in the introduction, I want to motivate why we should actually want to extract data from PDF documents. Second, in the approach, I want to show you how you can leverage JMP to really do so, and what it actually means to use JMP and to create JMP scripts. Then we want to transfer those JMP scripts into what I would call an add-in, and I want to explain a little bit why add-ins are actually the better way to store JMP scripts. Finally, I want to tell you what you can do once you are actually at the level of a JMP add-in.
Why should we actually use PDF documents and want to extract data from them? Well, on the right-hand side, you see one example of a PDF document, and you see that it actually contains quite a lot of data. Quite often, this data is unfortunately not accessible in any other way. Be it because of old software systems, be it proprietary software, be it whatever else it actually is.
Sometimes PDF documents (and here you can really replace the word PDF with any other document format) are really the only choice. You want to actually have this data; otherwise, you would need to perform a lot of manual operations on the data, which is both annoying and potentially also really demotivating for your team members.
The last point is, if you don't have the data at hand, well, you can't make the decisions you want to. Quite often, data is key to making informed decisions. Without informed decisions, well, that's really a disadvantage in today's business world.
What I want to show you now is how we can actually use structured data in PDF files, how we can leverage it using JMP, and how we can, based on that, really make decisions. Today, I'll only focus on the aspect of how to get the data out of the PDF and how to hand it over to the user. Everything else, such as how to analyze the data, could then be a topic for another talk at another time.
Before we actually start with JMP itself, let's talk a little bit about what I would call the guiding principles. The first part, I believe, is really, first of all, understand what you want to do. If you don't understand the topic itself, you can't really work with it. In this case, we know we have a PDF document, or potentially also multiple PDF documents, which we want to parse.
Then we might need to do some organization of the data, and finally, potentially also some post-processing; that depends obviously on what is in there and what specifics we have. But in the end, that could be more or less a three-step approach. From there on, you would be ready to do any data analysis you want to do. Really understand your question at hand; we'll do so in the next slides in a little bit more detail.
The next part is really to break it down into modules. The more modules you have and the better they are defined, the easier it is. Really break your problem into smaller pieces; then you can tackle each piece on its own, and it's much easier than if you have one big chunk of things to do at the same time.
The third part, I believe, is to always use JMP as best you can, because JMP really can do quite a lot of what I would call heavy lifting for you. We'll see one example, which in this case will be the PDF Wizard, but there are many, many more things that you could use, from analysis platforms like the distribution platform to other platforms. They can really do a lot for you, and in the end, you just have to grab the code and that's it. You really get it practically for free.
The fourth point, I believe: if you define modules, really also make sure that they are standardized. Standardized in this sense really means they should have defined inputs and outputs, so that if you figure out you want to do one part of one of the modules slightly differently, it still doesn't break the logic of the code, because it still has the same inputs and outputs.
The last part, I hope, should be clear: let's first focus on functionality, and only later make it user-friendly and really suitable for any end user. That's also what we will do today. We'll really focus more on functionality today and less on the appearance.
Let us now very briefly look into our PDF document, and I'll also share the actual document with you in a second. But first, let's look into this snapshot here on the right-hand side. What do we see? Well, this PDF actually consists of several pieces. The first one is typically this header, which just holds very general information, of which we might use just some, but potentially all of them.
Then we get an actual table, which is this data table here, which has both a table header as well as some sample information. If we now look into an actual sample, we can open this PDF, and I can share that with you; it really looks like that. You see this table continues and continues and continues across multiple pages.
On the last page, we'll see that there's again data, and then at some stage, we'll have some legend down here. Potentially, obviously, we'll also note that there might not necessarily be data on this page; we might rather just have the legend here. Just as background information, we know now how the structure of this document works.
More or less, we can also state: the first page is slightly different, then we'll have our interior pages, and the last page, as mentioned, can contain sample information, but it does not have to, and it always contains the legend.
If we now get into a little more detail, we'll see that each line, or let's call it entry, in this data table actually consists of a measurement date as well as, typically, an actual measurement. Those again are separated into multiple pieces.
You will have, for example, here the assay, which in this case is just called end date, or here, let's say, the user side. You might then also have some code or assay code; it depends. You will have a sample name, and you will typically also have a start and an end date. You might have some requirements, and so on and so forth, until finally you get what we call the reported result at the end.
Our idea would be really how to get that out. We'll see that, yes, this first line, if you like, of each entry holds different information than the second one. This third line just holds here what we call WG in terms of the requirements. It's not yet perfectly structured, but we see there is a system behind this data, and that really allows us to scrape the data, to parse it, and to utilize it to its full extent.
Let's now break it down into modules, as I said. What we can do is think around this three-step process again, and I believe we could also try to break it down into even more steps. The first step could be, and that is really user dependent, that the user says, Please tell me which PDFs to parse. The user tells you, it's PDF one, two, three, for example.
Then you would say: per PDF, I always do exactly the same thing, because in principle, every PDF is the same. One has more pages than the other; it doesn't matter, the logic always stays the same. You would first try to determine the number of pages. This we won't cover today, but in general, we can think it through.
Then you might want to read the general header information as we know it, and obviously process it. We certainly want to read the sample information and process that. We might want to combine the two; again, this one we'll skip today. And we obviously want to combine the information across files.
That means at that stage, we would really have all the information available that we want. Finally, we just need to ask the user, now tell us where to store it, and then we want to store the result. Again, those two last steps we won't cover today, but I guess you can imagine that this is not too complicated to achieve.
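The module breakdown above can be sketched in JSL. To be clear, the three Read/Transform function names below are placeholders for the modules discussed here, not existing JMP functions, and the file list would in practice come from the user.

```jsl
// Hypothetical skeleton of the per-PDF loop; Read Header, Read Sample Data,
// and Transform Sample Data are placeholders for the modules to be written.
files = {"report1.pdf", "report2.pdf", "report3.pdf"};  // chosen by the user
results = {};
For( i = 1, i <= N Items( files ), i++,
	dtHeader  = Read Header( files[i] );       // module: header information
	dtSamples = Read Sample Data( files[i] );  // module: sample information
	Insert Into( results, Transform Sample Data( dtSamples, dtHeader ) );
);
// the tables in results would then be concatenated and stored
```

Each module in this sketch has exactly one input and one output, which is the standardization point from the guiding principles.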
Now, let's actually jump into JMP itself. What I want to show here is: really let JMP do the heavy lifting. In this case, in particular, let the PDF Wizard do all the parsing of the data for you, and all you then have to do, if you like, is change the structure of the data. More or less, you can leverage the PDF Wizard in JMP to its full extent.
At that stage, let's switch very quickly over to JMP itself, and let's see how that works. I've taken here this example, which is just called Freigabedaten Beispiel.pdf, and we'll see what happens. If you open that either by double-clicking on it or by going via File and Open, or this shortcut File Open, then you can see that if we select the respective PDF file, we can use the PDF Wizard. Now let me make that a little bit larger for you to actually read the data.
We see that from the beginning, JMP already auto-detects some of those data tables in here, but we now want to be really specific, and in this case, we only want to look at the header. Let's ignore the rest for now and really just look at the general header table. We would say in that case, it starts here with the product and ends with the LIMS Product Specification. So we can simply draw a rectangle around it, release it, and you'll see in an instant what happens over here.
You'll see JMP recognizes that it has two lines. That seems to be about right. It also recognizes, well, in principle, I have only two fields. Now, one could argue, well, this one is one field, this one is a field, and this one is a field. So it might or might not be right; it depends a little bit on how you want to process the data whether you tell JMP, please split the data here. If we don't want to do so, we just need to note that this second part of the field starts with something like a LIMS log number.
In any case, we now have the data at hand in this format, and could just say okay, and JMP will actually open that data table for us. Now, very interestingly, what we can directly do is look into the source script, and we can see, oh, there's actually code. And this code we can really leverage. I would now just copy this code for a second. We could now actually create a first script. For this, I'll just open a script window myself. I'll very quickly open that for you.
We can add the code here. What you should see is that this code that I've just added is really the same as the code that we have down here. It has no difference whatsoever. So let's just use the code as it is. Now, if we look a little bit closer at that code, we'll see that there are a couple of things to note.
The first one would be that this is actually just the file name of the file that we used. Instead of having the long file name there, I said down here, okay, let's define that as a variable and just use the file name here. What we also see is that this table name that was here is actually the name of the table as it is returned by JMP.
In this case, we would potentially not just call it something generic like that, but rather have that information in the name. And then, more or less, we also see that JMP actually tells us how it parsed that PDF table. In this case, it says it was page one, and it says, I actually looked for data in this rectangle. Everything else was done automatically.
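As a sketch, the script the PDF Wizard returns has roughly this shape; the file name, table name, and rectangle coordinates below are placeholders, so always copy the exact values from your own Source script.

```jsl
// Sketch of a Wizard-generated import script with the file name pulled out
// into a variable, as described above; the Rect values are placeholders.
fileName = "Freigabedaten Beispiel.pdf";
dtHeader = Open(
	fileName,
	PDF Tables(
		Table(
			Table Name( "Header" ),
			Add Rows(
				Page( 1 ),
				Rect( 0.8, 0.9, 7.5, 1.8 )  // region JMP searched for data
			)
		)
	)
);
```

The only manual change relative to the Wizard's output is the fileName variable; everything inside PDF Tables comes from JMP itself.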
If we execute this statement now, we see it gets us exactly the same data as previously, and that is it. So far, so good. That is all, if you like, about the reading of a PDF file until now. However, as I said, we also wanted to look at the actual sample data, not only the header data.
Let's now do that once more. Let me enlarge that again a little bit so that we can look at it. Again, you could say, okay, in this case, let's ignore the auto-detected data. Let's again focus only on one specific part, in this case, the sample data, only here on page one. Where does the sample data start? Well, it starts here with the LIMS Proben number. It goes down exactly until here, and also out until the last column, if you look.
We can read that now as is. What we would see directly, both by looking over here and by looking over here, is that JMP actually utilizes two lines as a header, so two rows. That is not really what we desire, because only the first line is really the header. Everything else is actually content. If you click on this red triangle, you can adjust that and say, oh, I don't want to use two rows as a header, but only one.
Now, once you change that, you see, okay, we start with the end date as the first actual value here. That's perfectly fine. The other part that we might spot is that this first column, if you like, actually contains two extra columns: here, the one that holds the sample number, and here, the start date. The reason is that many of those values are actually too long for them to be broken into two columns.
We can now tell JMP, please enforce that it is broken into two columns, by right-clicking at more or less the right vertical position and telling it, please add a column divider here. We would now directly see that, yes, JMP splits that. Unfortunately, we now get a little bit of a mess for this, let's say, first column, where the word SOP is split into an S and an OP, but in return we have a start column.
Here, I would say, let's accept it as it is; obviously, keep in mind that we always split this field, which is a little bit unfortunate, but it is good enough for now.
Again, if you capture that content, you get a JMP data table, and for that, you can again use the source script to look at the code. If you compare this code to the code I captured previously here, you would see it is pretty much exactly the same, except for this part where we actually set the header or the column divider. That might be shifted a little bit, but the remainder is exactly the same.
We can really read here how that works. You see that you have one header row, you see that it's page one, you again have a defined rectangle for where you want to read, and here you have also defined column borders, as we wanted.
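In my session, the generated script for the sample table differed from the header script mainly in the extra options for the single header row and the manually added column divider, roughly along the lines below. The option names and all numbers here are placeholders reconstructed from memory; copy the real ones from your own Source script rather than typing them by hand.

```jsl
// Sketch only: header-row count, rectangle, and the forced column divider
// all appear as arguments of the table specification; values are placeholders.
dtSamplesP1 = Open(
	fileName,
	PDF Tables(
		Table(
			Table Name( "Samples p1" ),
			Add Rows(
				Page( 1 ),
				Rect( 0.8, 2.0, 7.5, 9.5 ),
				Headers( 1 ),          // one header row instead of two
				Column Borders( 2.4 )  // x position of the added divider
			)
		)
	)
);
```

Because the Wizard writes these options for you, there is no need to memorize them; the point is only that every interactive adjustment ends up as a readable argument in the script.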
Again, as previously, you could say, let's factor out this file name, and let's also factor out this table name, or replace them. And that is more or less what we now call our content file. If I close that and we just run this code once more, you will see that it creates our JMP data table as we want it. Getting a first shot at your data, more or less, seems perfectly fine and is not too complicated, I would argue.
Now, how do we go on from here? We now have the data in principle, but obviously, we need to organize it a little bit. For this, we can use a number of features, and it depends a little bit on what we want to do. There are things where we can use the log, which records more or less all your actions on the JMP graphical user interface. From there, you can really capture script code. That is something we'll see in an instance here.
In addition, you can also use the scripting index, which I highly recommend; it really holds quite a number of functions and examples and so really helps you to use them. We can use the formula editor, I believe, and we can also use the copy table script, for example, to really get things going.
Now, let's demonstrate that again on our JMP data table. In this data table, we'll see that we have a number of things in here. We now want to get that organized in a meaningful form. First of all, let's define what that format should look like. Let's open a new JMP data table, which will be, if you like, our target. Into this data table, we want to write, so let's define what should be in it.
We could, for example, say the first thing we want is that we have here the assay. We then potentially also want to have an assay code, or just code; it depends on what you want to call it. We might want to have here the sample name, because obviously that is a field that should be captured as well, as it is highly relevant.
You might also want to include a start date or an end date, and so on, and so forth, until you have included all of those fields as you want them. Now, at that stage I would also say the columns should get the right attributes, because this data over here also has attributes, if you like. We should do the same here and standardize those attributes by actually selecting the data type and saying, yes, that should be correct at that stage.
Now we have that data table, but obviously, this doesn't help us so much, because it is not reproducible yet. However, there is the option to really record that. For example, you can say, copy the table script without data, and I'll do so for a second, and I would now insert that script here as well. If we look at that, we'll see that we actually created a new data table which has the name Untitled 4, and obviously, we can change that.
It has zero rows so far, and it has all the different columns that we just created, from assay to start. We could give it a name, and I've actually created here a data table that has just the name Data for Page 1, which holds those first four attributes, as well as all the others that we actually want to have. Let's leverage that and continue with this one, as the other was really just a demonstration.
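A table script copied without data looks roughly like the sketch below; the table and column names mirror the ones defined above, while the column list is shortened and the date format is an assumption for illustration.

```jsl
// Empty target table as produced by "Copy Table Script" without data,
// then renamed; only a few of the columns are shown here.
dtTarget = New Table( "Data for Page 1",
	Add Rows( 0 ),
	New Column( "Assay", Character, "Nominal" ),
	New Column( "Assay Code", Character, "Nominal" ),
	New Column( "Sample Name", Character, "Nominal" ),
	New Column( "Start", Numeric, "Continuous", Format( "d/m/y", 10 ) )
);
```

Running this script reproducibly creates the empty target table with the standardized data types, which is exactly what the interactive steps above did once.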
Let's create that one. Let's run it, and you'll see it's just a data table as it should be, with all the fields that we want to fill from now on. What we also want to do for now is rename this data table, which is just called something generic, and call it Content in that case, and we actually want to abbreviate this LIMS Proben number to LIMS Probe for simplicity.
Now, what do we actually want to do? We want to work with the data a little bit, and I want to illustrate two examples of how we could do so. Let's look first at this column Anforderung. Within this one, you see that there is actually the AG and also the WG, and we might want to split that into two separate columns, to really make sure that in one column later on we can capture the AG values and in another one the WG values, and that the sample information is not split across three rows as here, but rather follows what I would call a tidy data format, in one row.
How could we do so? Let's, in this case, just insert a column and let's call this column AG, say, requirement, just to more or less translate the word Anforderung into English. Now, what would we want to see? We would say: if there is an AG in here, then let's capture the value after the AG in this column. If there's nothing there, then let's capture nothing. And if there's a WG, then let's also not capture anything, because that does not relate to AG.
How could we do so? Well, I would say let's build a formula. A formula typically is really the best place to start. What do we want to do? As I said, we want to do something conditional, which means if there's an AG in there, we want to see something there, and if there's no AG, then not. The easiest way to do so, I would say, is the If condition, which really tells you: if there's something, then do something, and if not, then do something else.
We would say here If Contains, and Contains really looks for a substring, if you like. We would now look at this column, which is called Anforderung, and we would look for the word AG, and we say then something should happen, and if not, then something else should happen. Now, we've actually just created a very simple If statement, and those two branches we would still have to specify.
However, even at that stage, we can actually check whether what we described makes sense. We would see that whenever there's an AG, like here or here in our column Anforderung, we get the then statement, which is good. Otherwise, we just see the else statement, which is also good. So let's modify that a little bit.
What would we want to see in the then statement? Ideally, I would say we want to see more or less what is in the Anforderung column, but really getting rid of this AG part and just keeping the remainder. To do so, you have many options. One of them, I would say, is the so-called Regex or regular expression, which really says: take what is in this column, look for this AG part in this case, replace it by nothing, and then give me back the remainder.
You would see that if we do so and look at more or less the whole expression, if there is AG with a minus, we actually get a minus as a return. If there is a less than or equal to 50 minutes, we get the less than or equal to 50 minutes. That sounds good. For the else statement, I would say, let's make it an empty statement, so nothing else is returned. And that actually really works.
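The complete column formula can then be sketched as follows; the column name Anforderung is taken from the PDF as read, and the formula belongs to the new AG requirement column.

```jsl
// Formula for the new AG requirement column: if the Anforderung cell
// mentions AG, return its text with the "AG" stripped out, else empty.
If( Contains( :Anforderung, "AG" ),
	Regex( :Anforderung, "AG", "", GLOBALREPLACE ),  // replace "AG" by nothing
	""
)
```

GLOBALREPLACE makes Regex return the whole source string with the match replaced, which is exactly the "replace AG by nothing, give me back the remainder" behavior described above.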
You see, if we go to this column, wherever you have this AG, it returns the value after the AG. That looks perfect. Now, I would use more or less this idea or this logic and include it in my script. We could also again capture the code from the data table, and we would see it contains the formula. But in principle, we could also capture it directly.
Before we do so, I have inserted here a little bit of additional logic, which means in case we actually read the last page, we saw that there was the legend, and in this case we said, let's remove the legend and it should be good. In addition, I also said that if there are any completely empty rows, I want to remove them.
Now to continue, I would say, let's look for where the samples are, and then let's capture the data of each sample. In this case, we would look into where our samples are, and, let me very quickly execute this part, we would see, okay, that is actually where each sample starts.
It looks, in this case, only for where this value of End is missing, and similarly for where the Anforderung is missing, because those are the two columns that define where only the sample information resides as we move down the column.
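In JSL, locating those sample start rows can be sketched like this; the column names End and Anforderung are taken from the table as read, and yours may differ.

```jsl
// Rows where both the End column and the Anforderung column are empty
// mark the start of a new sample block in the content table.
sampleStarts = dtContent << Get Rows Where(
	Is Missing( :End ) & Is Missing( :Anforderung )
);
Show( sampleStarts );  // a matrix of row numbers, one per sample
```

Each entry of sampleStarts can then serve as the anchor row from which the fields of that sample are collected.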
Now, iterating across each sample on its own, we would look at where the data is. Taking, for example, this Losezeit sample here as the second sample, we'd first look at where it starts. It would start in this case at row 4. We would now combine the data of those two fields to get, again, a full name.
We would look at where the assay sits. The assay, if you like, in verbal terms, would be the first part of this whole string, just before the forward slash. You could really just capture that, potentially also removing the one, because that doesn't make sense. Similarly, you could look into the code, which would be the second part here, which you could get from there, and so on, and so forth.
Now, obviously, I agree, this part of the code doesn't look too simple, but if you read it very carefully, it actually always has more or less the same structure. You look at the part of the code that handles the respective line and the respective field, and potentially do a little bit of twisting, just as we did with the AG column. If you look at this AG column, you'll see there's again our regular expression; there is the AG part that we replace by nothing, and that's more or less it.
If you have done so, you would now want to create one additional row here, where you can enter all the data that we have captured. How would we do that? We would left-click on the rows menu, say Add Rows, and enter there.
Now, interestingly enough, at that stage you can really look into the log and see that there's one statement that says Add Rows, and you can just copy this part about Add Rows. This is really more or less the same as I did here. You see there's also, in addition, this At End. Typically, that's the default value, so it doesn't matter whether I have it or not, but that's it.
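Captured from the log, appending a row and filling it can be sketched as follows; the variables starting with a lowercase c stand in for the values parsed per sample above and are placeholders here.

```jsl
// Append one row to the target table and write the captured sample values
// into it; cAssay and cSampleName are placeholders for the parsed values.
cAssay = "Losezeit";            // example values, normally parsed per sample
cSampleName = "Sample 4711";
dtTarget << Add Rows( 1, At End );
r = N Rows( dtTarget );         // index of the row we just added
dtTarget:Assay[r] = cAssay;
dtTarget:Sample Name[r] = cSampleName;
```

The same assignment pattern repeats once per captured field, which is why the full script looks long but structurally uniform.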
From there on, I could really say, once I have included that, I just copy all those values that I had previously here, everything that starts with a c, into the respective column. In principle, it should, if I remember correctly, execute that at once. It should now actually work as is. So we see the second row now was the one that was correctly added, or, if I delete them for a second, again, that should now execute as is.
We could really do so line by line by line, and we'll see that if we do that across all the samples, it should be very good. Now, let's return at that stage to the presentation and look at how we continue from there. We have actually really captured all the sample information at that stage, but we want to make it a little bit more handy, if you like. So far it's a bit of a massive chunk of code, but we can certainly break it down a little bit better.
That is what we would do now. We'd really say, let's make functions out of it. Functions have the really nice feature of telling you what you have as an input and what you have as an output. That really means you get that standardization of inputs and outputs anyway. In my eyes, it's also way easier to debug and to maintain, you have no need for any copy-paste operations, and in my eyes, it also really enforces a good documentation of the code.
Let's do so. What could we do now? As we have seen previously, when we read our data, we used this Open statement and just said, that's it. However, here we could now say, let's define a function which just takes a file name; then we read the data and we return the data. In principle, it's not too different from what we did, just that we now say it's a function which takes one argument, in this case the file name (it could also be multiple), and which returns something. If we execute that, we'll see, oh yes, that actually created exactly the data table that we initially brought in.
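As a sketch, such a reader function could look like this; the arguments inside PDF Tables are again placeholders to be copied from your own Wizard-generated script.

```jsl
// Wrap the Wizard-generated Open() in a function: one input (the file name),
// one output (the raw data table). Default Local keeps variables private.
Read Sample Data Page = Function( {fileName},
	{Default Local},
	dt = Open(
		fileName,
		PDF Tables(
			Table(
				Table Name( "Samples" ),
				Add Rows( Page( 1 ), Rect( 0.8, 2.0, 7.5, 9.5 ) )
			)
		)
	);
	dt;  // the last expression is the function's return value
);
dtRaw = Read Sample Data Page( "Freigabedaten Beispiel.pdf" );
```

A matching Transform Sample Data function would take this raw table as input and return the organized target table, giving the two-function structure described next.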
Similarly, you could also say, we just transform the data by creating a new data structure and then by actually changing the data, or let's say organizing it as we want it. If we also initialize that data, we would see, yes, that should also work as is. This now corresponds exactly to what we did previously. So it really means you have, if you like, only two functions which you can call, which I believe is a really good way of organizing your code.
Now, let's also think about the last part. The last part, in my eyes, is really a little bit around UX, or user experience. That means a little bit around how I should present it to the user. What I believe is that you can certainly play around with which data tables are visible at which stage. You see here a really short snippet showing that you can create a data table as invisible from the beginning, or you can just hide it after it has been created at the initial stage.
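The two variants for keeping intermediate tables out of the user's sight can be sketched as follows; fileName is a placeholder.

```jsl
// Variant 1: open the table invisibly from the start.
dt = Open( fileName, Invisible );

// Variant 2: hide an already visible table after it has been created.
dt << Show Window( 0 );
```

Invisible tables still exist and can be scripted against; they simply never flash a window at the user while the add-in runs.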
Or you could say, if I store data, I can provide users a link to the directory directly, which means they don't have to look for that file, but can really just click on the link and see the directory open. Or you could inform the user about the progress of your execution, so he or she knows, oh, I'm still at file 1, but already at page 8 out of 12.
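Progress reporting of the kind described here can be sketched with a small window whose text is updated per file; the files list and the parsing step inside the loop are placeholders.

```jsl
// Minimal progress reporting sketch: a window with one text box that is
// updated as each (placeholder) file is processed.
files = {"report1.pdf", "report2.pdf"};
progressWin = New Window( "Progress", tb = Text Box( "Starting..." ) );
For( i = 1, i <= N Items( files ), i++,
	tb << Set Text(
		"File " || Char( i ) || " of " || Char( N Items( files ) )
	);
	Wait( 0 );  // give the display a chance to refresh
	// ...parse files[i] here...
);
progressWin << Close Window;
```

Wait( 0 ) yields briefly so the window actually repaints between files; without it, the user may only ever see the final state.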
There are a number of options, but obviously, as I mentioned, I would tackle that only once I've really implemented the whole code. What we can state at that stage is that, yes, we now have more or less all the code in place to really run this collection of data from our PDF files into a data table.
However, there is one issue, and that is the issue of really bringing it to the user. The point being, I have one big JMP file, potentially with quite a lot of lines, and the user in principle has to, at least to some degree, interact with it. That is something I typically would want to avoid, because that is not really something users want to do, and I would also be a little bit scared that they might break the code.
Instead, I would turn to a JMP add-in, which has the nice feature of being only one file, and it just requires a one-click installation. The other part is that it's easily integrated into the JMP graphical user interface. You don't have to interact with the script, you have a lot of information at your fingertips, and there's actually a lot of information on how you can create an add-in.
There is, for example, the Add-In Manager; I've added here the link. But there's also the option to do so in a manual or script-based way. I believe that while it takes a little more effort, it's actually much better in terms of understanding. I want to show you very quickly how that works.
For that, in the folder where I've stored all the data so far, so all the JMP code so far, I've created once the functional code, which holds all the code that we've created, just in a slightly more organized form, if you like. You might really recognize, again, this read sample data page or this transform sample data. Plus I've added here an additional file which really just holds an example of additional code.
You could imagine that you potentially want to outsource the functions from the functional code to the custom functions file, for example, to make the code more readable, and so on and so forth. Now, you could say, from those two, I want to create a JMP add-in, simply by going to File and New.
There, you have the option to create the add-in. You would now have to specify a name and an ID. I've already thought about those previously, so I will not care too much here about what they are called, but please really look at the suggestions for naming JMP add-ins. You would look into which menu items you have, and so you would add a command, give it a name, let's say in this case, Launch PDF Creator, and you would have to specify whether you want to add the JSL code here directly or whether you have it in a file.
In this case, I would say, let's use it in the file as we did. It should actually be in here, and you would include that one. Similarly, you can see that there are a number of additional options like startup or exit scripts. At the end, you have to include any additional files you want to have with it. In this case, let's just assume it would be our custom function code. In the end, you can save that as, say, our example PDF data browser add-in.
Once that is stored, you can install it simply by double-clicking on it, and you will see that under Add-Ins you now have a Launch PDF Reader entry, which in this case just reads this one specific PDF. So it's still quite fixed; there's quite a lot of information which we could make more dynamic, for example the file selection, as I mentioned at the beginning. But that's at least one way you could read the data.
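As an aside, that installation step can also be done by script. A minimal sketch, assuming a hypothetical ID and folder path (JSL's built-in Register Addin() function is the relevant call; the exact arguments here are illustrative, not the talk's actual add-in):

```jsl
// Hedged sketch: register an add-in folder by script instead of
// double-clicking the .jmpaddin file. ID and path are examples only.
Register Addin(
	"com.example.pdfdatabrowser",   // unique add-in ID (reverse-DNS style)
	"C:\Addins\PDFDataBrowser\"     // folder containing the add-in files
);
```

This is handy when you want to re-register a development copy repeatedly without rebuilding the .jmpaddin archive each time.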
Now, let's return very quickly to what we could do in addition. We could also have a short look inside a JMP add-in. What is very nice about a JMP add-in is that it contains every single piece in one place. Let's look at our example PDF data browser add-in and see where it was installed.
In addition, if you look into that, you will see it holds all the JSL code that we have, plus two additional files: one defines what the add-in is named and what its ID is, and the other defines its integration into the graphical interface. If you read those two files a little carefully, you will see how you can easily adapt them to your purposes if needed.
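To make that concrete: the naming file, addin.def, is a plain key=value text file that looks roughly like the sketch below (ID and name are placeholders, and your JMP version may include further keys):

```
id=com.example.pdfdatabrowser
name=Example PDF Data Browser
```

The second file, addin.jmpcust, is an XML description of the menu customization, i.e. which menu the Launch PDF Reader command appears in and which script it runs; its exact schema is version-dependent, so the safest approach is to copy the one the add-in builder generated and edit the names and paths in place.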
The last part I want to show here is what you could do once you have it fully functional. This is what I want to show you now at this stage. We'll install what I would call the final add-in: the add-in as before, but with, let's say, some additional user-friendly tools. You can see I now have it here under this GDC menu.
I have a few more buttons to click than previously. You could say: what do I actually want to read? In this case, I want to read those seven files. As mentioned, they are all copies of each other, just to have examples here. You see there is a progress window here which, for demo purposes, waits two seconds after each file; it reads each file, and we see that the reading speed is actually quite impressive, I believe.
At the end, you see the data being processed in the background. The user sees that in principle, but not in the foreground, so the user is not annoyed by it. Only once the data are processed do we get the final result here, and we see that this is the whole data table: it holds data from the first file up to the last file, so the file number column runs up to six. That would be more or less the way.
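The loop behind that demo could be sketched in JSL roughly like this. Everything here is an assumption reconstructed from what is shown on screen, not the actual add-in code: the file list, the two-second Wait(), the column name File Number, and especially the Open( ..., PDF Tables ) call, which in real use needs table specifications recorded from JMP's PDF import wizard:

```jsl
// Hedged sketch of the batch-read loop with a simple progress window.
files = {"file0.pdf", "file1.pdf"};  // in practice, chosen by the user
progress = New Window( "Reading PDFs", tb = Text Box( "Starting..." ) );
all = Empty();                        // will hold the combined data table

For( i = 1, i <= N Items( files ), i++,
	tb << Set Text( "Reading file " || Char( i ) || " of " || Char( N Items( files ) ) );
	dt = Open( files[i], PDF Tables );  // simplified; real call needs table specs
	dt << New Column( "File Number", Numeric, Set Each Value( i - 1 ) );
	If( Is Empty( all ),
		all = dt,
		all << Concatenate( dt, Append to first table( 1 ) );
		Close( dt, No Save );
	);
	Wait( 2 );                          // demo only: slow down for the audience
);
progress << Close Window;
```

The design point is simply that the user watches a small text box update while the tables are concatenated behind the scenes, and only the finished combined table is brought to the foreground.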
Now, as I mentioned, that is already quite a lot of work. So we could still ask: what is next? Is there a next step? I would argue yes, there is. The first one, in my eyes, is really: celebrate. Getting to this stage is not a trivial task, and it is a true achievement. Really be happy about it, really congratulate yourself; it is an achievement.
The second part is that, in principle, you might want to do a little more around it. You might want to think about code versioning: how do you go back a version, or forward a version, when you have developed things further or are looking into a feature which doesn't work anymore, and things like that? Code versioning, I believe, is quite helpful.
Similarly, if you think about collaborative development, Git might be an answer there. If you think about how to ensure that even though you have tested your code once and have now changed it a little, it still works, then unit testing might be the answer. If you want to deploy add-ins to a larger user base, you still have to think a little about how that works; so far, I believe, there is no really good solution on the market.
The other part is, obviously, that I would love to hear feedback and any questions. You can reach me at this email address, and I'm happy to hear any suggestions, criticism, whatever it is; please feel free to reach out, and I hope you could learn a bit today. I'm happy to share with you the script, the code, the presentation, everything that I showed you in the last 30-ish minutes. Thank you very much and have a wonderful afternoon.