Using Point-and-Click and Basic Code-Cracking Skills to Import Tricky Data
During my time working as a scientist in R&D, there were times I wanted to augment my analyses with data that was not always in a straightforward file type that could be imported efficiently into JMP. For example, think of multiple PDF pages with chemical composition information – each with slightly different formatting. Or think about multiple web pages with data for different regions – with a different URL for each page. Another example would be including experimental pictures in a data table – and matching each of the hundreds of pictures with the corresponding file names.
Although JMP supports bringing in data from PDF formats and websites, as well as allowing you to bring in pictures into a data table, it is not always obvious how to do this efficiently when you have lots of pages or images to import. If you are not a proficient scripter (or don't know someone who is) you may end up doing each import manually…or simply forgoing the information and moving on.
The good news is that even without being able to write a script from scratch, you can use basic JSL decoding skills to bring in data from these kinds of sources efficiently. In this paper, we show a few examples of how we used JMP’s point-and-click tools, along with basic JSL deciphering skills, to automate the import of images, data from PDFs and web pages with scale.
Hi, everyone. Welcome to this presentation on using points and click and basic code-cracking skills to import tricky data. I'm joined today by my colleague, Wendy Tseng. We are both system engineers at JMP. My name is Monique Roerdink Lander.
We're going to talk about importing tricky data, and this came about from me running into a number of challenges when I was in R&D. We're going to discuss three examples today, and they all have to do with importing multiples, importing multiple images, multiple data tables from websites or multiple PDF pages.
These are all non-straightforward data sources, and they might be meant to augment or other types of data, like experimental data. When I was running into these challenges, I was initially not proficient in scripting. I did learn it to deal with these challenges, but not everyone has the time or interest to learn JSL.
That means it may be difficult to import this type of data efficiently, and you might just forego the whole thing because you can't spend time on it. It doesn't have to be difficult, especially with the introduction of JMP's Workflow Builder that was introduced in JMP 17 and got more important extended features added in JMP 18.
Today, we're going to show a just to automate importing this type of data. We're going to use Workflow Builder, and we're going to not use code, JSL straight up. We're going to do some JSL deciphering or code cracking, as we call it. We are going to assume you have a basic understanding of Workflow Builder, and if you don't, it'll just be a deep dive straight into it. We have three examples today. In the first example, my colleague Wendy will tackle Automating Importing Pictures into a data table.
This is something I have to do. I would have to do a lot. I worked in detergent formulation and would run wear-washing tests on glasses, and I would end up with hundreds of pictures of glasses. I wanted to have these pictures into my data table to link them to the data from image analysis.
For me, that was at the time quite tricky. Wendy is going to show us how to do that efficiently. After that example, in the second example, I will talk us through making the import and necessary cleanup steps from PDF files more efficiently. Then finally, Wendy will talk us through importing data from website. With that, let me turn it over to Wendy.
Thanks, Monique. In this first example, I'm going to be showing you how you can bring in images to increase your data understanding and communication. Let's take a look at what we're talking about. Instead of glasses, we are going to be looking at pictures of dogs.
This data set that you see in front of you contains pictures of dogs in this column. I have 28 rows here, 28 pictures of dogs. These pictures have been joined in with a publicly available data set by the American Kennel Club, where we have different information about dogs, including things like, grooming frequency, shedding.
Then the one we're going to look at is their demeanor. Here's a graph showing the dog's graft, popularity versus demeanor. If I hover over a particular point, you get a picture of that dog. Here's the Affenpinscher. This is a fun example, but this really brings to life what we're seeing in the data. How did I bring this data together? Well, as I mentioned, there's a publicly available data set, American Kennel Club, and I captured these pictures in a file folder and I joined them together.
We're going to take a look at this and let me first zoom out to give you a preview of what you're going to see in this first example. I'm going to start by introducing multiple file imports as a way to bring in pictures in a folder easily. Then you're going to see Workflow Builder in action to take those images and join them with an AKC data set and create a picture. We're going to see some evolutions of this example going from simple to complex.
You're going to see in the last iteration using some Custom JSL to create a user prompt in Workflow Builder. Let's first start by taking a look at using import multiple files to bring pictures in. I have a folder here that's called AKC that contains dog pictures. It also contains other files in it, but mostly dog pictures, and you can see the dog pictures have been named using the breed name. That's important because we're going to use that to join to the AKC data set, which has breed name as well. Before I begin, I'll go ahead and initiate a new workflow and hit record.
Now let's go to import multiple files and navigate to the folder where those files exist. Here's the AKC folder, and we want to select files that are only pictures. Then we'll add a file name column because we want the breed name to come in to the data table. Now you can see 28 pictures were brought in all in one step. Then I have the file name column that I can use to edit. I need to get rid of the .JPG so that I can join it to the breed name in the AKC folder.
I'll go ahead and stop the workflow. What I want to just show is that JMP is capturing these steps in the form of JSL in this workflow for me. Let's go ahead and open up the completed workflow. Here you can see the completed workflow, and I'm going to play this workflow for you. It's something that I recorded before where I brought in the data, created those steps, but one thing I want to comment on is when you use import multiple files, JMP will name the data table that comes in with the first file name, underscore the last file name.
My AKC folder, I had my first dog as the Affenpinscher and the last dog as Otterhound. I'll go ahead and play this workflow, and here's the result. It's the same plot that you saw before, with the exception that I've replaced the points with pictures of the dogs instead of the points.
Now, let's take a look at these categories of demeanor. I have four categories of demeanor, starting with alert, responsive, going to outgoing. In the full AKC data set where I have 277 rows. There are more categories than what you saw on that graph because I only had 28 pictures. Intentionally, I left out dogs that are aloof and wary. This is so I can show you how reusable this workflow is.
We are going to add four dogs to our AKC folder that are aloof, and then we'll rerun this workflow so that you can see how adaptable the workflow is. Again, this workflow was built with 28 dogs. Let me bring up the folder here. We're the last dog was that Otterhound. We're going to bring in on the right-hand side these four dogs that are aloof, and we'll drop them into the AKC folder, not the AKC2 folder.
Here's our AKC folder. I'll paste in the new dogs. Now we've added the aloof dogs to the AKC folder, and we will replay the workflow. Now you can see that we have the aloof dogs added to the graph. That's how flexible Workflow Builder is. We recorded a workflow that brings in pictures, joins it with the AKC data set, does some data cleaning, and creates a graph. Then, when we add more files to that folder, we can simply hit run, and everything will update.
Let's dive into one of these workflow steps. The next evolution of complexity, a little bit of code cracking, not too much. We can look at this import multiple files step and take a look at the JSL that's been captured. JMP is speaking to a specific folder path where my dog pictures reside.
If I have another folder that I want to reference, I can simply modify this file path. I happen to have another folder that's called AKC2 that has even more dog pictures. All I'm doing is coming in here making this small change, and then we can replay the workflow, and this time, you'll see 53 dogs instead of 32 dogs, and this resulting graph will be much more populated.
That's a pretty simple way to do some code tracking is to look at the step settings and make some small tweaks to the JSL code. Now in this final iteration, we're going to make one more change. If this is a workflow that I want to give to others so that they can run, it would make sense for me to modify this first step so that the user is prompted to set their own file folder from their own directory.
We'll come into this step again, and we'll scroll down to the bottom. Instead of having the Import Data step, we can replace this with Create Window. The only other change I would recommend here is also replacing this folder path with something generic like desktop or here I'll put documents. Let's go ahead and just play this first step.
Now you can see the folder has been defaulted to documents. All of my selections have been maintained, like the add file name column, and the user is being prompted to navigate to wherever their pictures reside. Just to wrap this example up, we looked at using import multiple files to bring in images into a single table, and then we looked at using Workflow Builder to create an adaptable workflow. At this point, I will pass it over to Monique for the second example.
Sure. Thank you, Wendy. Let me share my screen. There we are. I'm going to hide my meeting control. There we are. Good. We can all agree they were much cuter than pictures of glasses. That was a really nice example. The second example, we're going to look at data from PDF.
The why. When I was in research, I would sometimes dive into a topic and I would find this great information in some PDF file. This could be chemical reference tables, maybe compositional data. It could be surfactant information from vendors, from our raw materials. They could be analytical reports.
Often there was a lot of data, and I did not want to bring that data in manually just by copying it into a data table. How do we get that data into JMP? What I will show today is, first of all, we will start with the point-and-click PDF Import Wizard, which is very similar to the Excel Import Wizard and other import Wizards.
A great place to start. We will use that in conjunction with Workflow Builder, and then we will do similar to what Wendy was doing, manipulate the source code here and there to help us automate that import and make some parts of that process a little bit easier for us.
Then in JMP 18, a nice feature came out in Workflow Builder that really helps solve an issue with the cleanup. After we import many PDF pages, we have to clean those up, and it might be a lot of tables we need to clean up. Workflow Builder will now let us look through those. I will show that as well. Let me close this off here. To start, I will first introduce the PDF that we're going to work with. Let me open that up. Here it is.
We're going to look at this example, which is a two-page PDF file, and it's listing the fatty acid composition of common fats. On page one, it's listing composition for animal-origin fats, and on page two, it's listing that for a plant-derived fat and oils.
It just looks like a pretty neatly formatted table, but when we look at the first page in more detail, we see that although it looks nice for a PDF, for a JMP data table, there are some issues with it. That's just indicating that there are lots of things needed that we need to do after we import this data to get it in a usable format.
For example, we don't have column headers, really. This first column is really category. Then we have a column with percentages. Another column with percentages here. We have two columns here that lists fatty acids. One lists saturated or the other unsaturated.
Then we have this other column here as well. Another issue is we have subheaders here. For example, this beef fat is a subheader. It has subtotals here. That's something that we might want to take out, but we also need to take this subheader and fill that in a white space, obviously.
This is another issue. As I said, some of these subheaders are spread over two rows, and that will obviously create issues in JMP as well, because JMP will not recognize that these belong together.
Then we also have things like this data that spans multiple columns. Lots of issues that we will have to clean up after we bring this data in. Let's come back to our journal here, and let's first look at the end result. What we really want to do is we want to use JMP to bring in both pages, preferably just doing the work on one of these pages, and then do all the cleanup steps.
Again, let's preferably just on one page and then make JMP apply to both pages and then bring that data together, so we can use. I recorded that in a workflow. This is the end result, and we're going to recreate that afterwards. Let me play this workflow first.
I have two pages here that are opening up, and the workflow builder is walking through all these cleanup steps. There are a lot of those steps. It's doing them first on. The first data table, and now it's resetting and doing that on the second data table. Then, once it's done, it will put both of these data tables together and have some output from the combined data.
There we go. In this case, we're looking at saturated versus on just saturated fat. All right, let me close these outputs by rewinding or resetting my workflow. I'm actually going to close this workflow entirely, and I'm going to open up an empty workflow. I am going to dock that here to the right in my project that I'm working from today.
Okay, so let's start from the beginning and open the PDF. I will say this. I was tempted to show an example where the PDF import works perfectly, but I guess to leave that out, you just have to believe me that sometimes you have a PDF and you open that with JMP, and JMP It will just recognize the data table that you want perfectly, and it's just a matter of two clicks, and you have it in a data table. That's the perfect example.
I'm going to go to a non-perfect example. Let me open this PDF. I can double-click on my link here, and that opens up the PDF Import Wizard. You can see right away here in my preview on the left that JMP does not recognize this table as it's supposed as it used to be. We recognize part of these here, but that's not something we can use.
The first thing we need to do is select this, Ignore All Tables, and that will reset everything. Now that we reset everything, I can just use my mouse and just select this table here.
When I do that, JMP makes a guess at where my columns are. I'm going to zoom in a little bit here. That lets you see that it did not see all of the columns here. That's because of some of those issues where we have headers that span over multiple columns.
That's not a problem because I can add those manually by right-clicking. For example, I can come to the right here of my weight %, and I right-click, and I can add a column divider here. Then I have to do the same thing over here where we had that second weight % column, and I'm going to add a column divider here divider and a column divider over here.
In this case, this is as good as it gets for this table. Just to point out, under the red triangle, there are some options. For example, you can set the number of rows that you need to use as a header, which is a very useful tool as well.
In this case, I'm just going to import this first table. Now, I could obviously make the same selection here on page 2, and that wouldn't be too much work because this is just a two-page PDF file. What if I had a table that was 10 pages or 20? That would be a lot of clicking around and adding all these column dividers, and it might be a bit tedious to I'm going to do that.
I'm going to just bring in that first page, and we're going to see what we can do from there. I'm going to click okay. I forgot to start my workflow recorder, so John prompts me. I'm going to say, yes, please record. Here we have that table in and we also have this step recorded. I'm going to stop my recording. I'm going to double-click on my recording step and bring that over a little bit, so we can look at the JSL.
Here's the code for that first step of importing. We can see what's happening. This is the location where we're having this file located. Then there's a PDF table function, there's another table function, and then there's something called table name. Indeed, this is the name of my data table. I see something about rows, heteros, and then I see page 1. Then I see rectangle. That must have been that rectangle that I was drawing, and I see column border. Those must be the coordinates of where I wanted to have all those dividers.
When I see that, I am I'm really tempted when I have a second page to just play around with that. Let's do that. We wanted to bring in that second page. Let's just take this one and we place that for the two. Let's do that also here in the table name. We're going to change that one to a two.
Now I want to run this code, and I can do that directly from here by just right-clicking and say run script. Let's do that. Now a second table opened up. I'm going to make this a little bit smaller I'll take it again and look at that second table that opened up.
Right away, I can see that was successful because in this data table, I have the plant-derived oils and fats. This was my first page table with the animal fat. That was a very simple way of just modifying my script that workflow builder captured for me to import a second page. Copy pasting, let me bring this up again, copy pasting or changing this code here might be quicker than me manually clicking and adding all these column dividers. That could be something that could save time.
What I want to do next is anticipating the second part, the cleanup. First I'm going to do it, and then I'm going to show you why I want to do that. What I want to do is I want to have one step that opens up both of these data tables. What I need to know is I need to know when I import two tables with a PDF Import Wizard, how JMP will record that. If I don't know, the simplest way to do that is just start recording again. First, I'm going to reset my outputs here, close this one, just to keep this.
Now I'm just going to start recording again and click on a PDF. Now I'm going to ignore all tables. I'm not so much interested in how exactly these tables should look like. I just want to see what happens if I have two tables each on a page. Click Okay. I'm going to look at that step code for this.
This is that step code for that. For the first, I think it entered this step here. When I imported the two tables, it looks like JMP repeats the table function. I have one table here, then I have a comma, and then I have another table. That seems to be the construction for importing two tables.
Now that I know that, I can delete this step. I don't need that. I can come back to this step. Now what I want to do, I'll make this a little bit bigger still, I'm going to put my cursor to the left of this parentheses of my first table because that will highlight my closing parentheses. There it is. Now I know I need to copy all of this I'm going to simply do Control + C. Then before this, bring this down, and I'm going to Control + Pace that in, and I'm going to add that comma, just like we found out.
Now I can change this first I'm going to change that back to one. I'm going to change the title back to one and the page back to one, and then the second will have page 2 and page 2. All right, let's close all of that and see if this works.
When I play this step, I indeed get those two data tables. That's how I have that now in one step. Why did I do all of this? I will show you. I'm going to close off my outputs. I'm going to close this workflow, and I'm going to open a workflow. Let me also dock this again to the right.
This workflow has this step. Let me double-click on it where I'm importing two tables. Here's my table 1, and here's my table 2 for page 2. Then I recorded cleanup steps, and I did that all on table 1. When I highlight the first step, you can see this is on page 1. I grouped all my cleanup steps, and you can see there are a lot. We will not go into detail in this presentation about that, but we'll have a resource afterwards to look that up.
What is important is now that we did all these cleanup steps on one of those data tables, how do we do that on the second? In JMP 18, that became really easy. The only thing we have to do is we have to right-click and say group width for each. Let me back up. I have all my cleanup steps that I want to repeat selected, and then I'm going to right-click and say group width for each.
Now JMP grouped this or bucket these steps for each item. It says no source were selected, though. I have to do something about that. I have to tell JMP what to loop these steps on. For that, under Step settings, I'm going to look at the source. It's really easy. The only thing I have to tell JMP is I have to point it to the workflow step that generates my multiple outputs.
In this case, I will select Import to PDF pages step. That is the step where I have two tables created. Now JMP will know that it needs to do all of these steps on both of the outputs generated here. Let's see if that works.
I'm going to play this. We have two outputs opened. It's running. No, it's not running. It's walking through the steps on the first table. Then at some point when it's done, it should switch to the second. There it goes, and run it the second table. I think that's a really nice feature, especially if you have more than two tables.
I have a PDF that I imported recently that was 13 pages. After I imported the 13 pages, I did the cleanup on one of the tables, and then I just looked through all the steps for all 13. It was very nice. Okay, that is what I wanted to show.
We're going to head into the last example. This is an example, Import data from website. This is straight up from what I did when I was in R&D. I needed to get data on water hardness. Water hardness is an important consideration when formulating. I found some great information on a website. I very painstakingly formulated that. Wendy is going to talk us through how to do that much more elegantly. I'm going to stop sharing and turn it over to you, Wendy.
All right, sounds good. Okay, so let's take a look at example 3, and we're going to be taking a closer look at the For Each, or another way to describe For Each is simply looping. Let's go take a look at the website where Monique found this water hardness data. I'm at this HydroFlow USA, and if we come to this interactive maps area, I can navigate to a map of the US.
HydroFlow USA provides water hardness data at the city level. I'll navigate to one of the states, Minnesota, where both of us reside, and I'll click into this link to show you for this page that contains information for Minnesota, there's a table with cities, corresponding zip codes, and then water hardness in two different units.
Taking a look at this URL, the other URLs for the other states follow the same structure. It's the state name followed by the water-hardness. Monique's problem was being able to access this data on the web so that she could augment her our analysis.
Using JMP, I was able to bring in all of the water-hardness data from all of the states into a data table like this. You can see there are 855 rows because each state contains multiple rows corresponding to the cities. Being able to have this data allowed Monique to be able to augment her analysis. If I hit play on this, you can see, very simply, it allows you to recreate water hardness at the state level.
In this example 3, I'm going to show you a few tools. One of them is using JMPs Internet Open to access data from a web page. Then we are going to take that Internet open and loop. We're going to wrap that in a loop so that we can bring the data from all the states together into a table.
Let's first start by taking a look at Internet Open. I'm going to start by actually grabbing that website URL. I'm just doing a Control + C on this one web page that contains Minnesota Water Hardness, and I'll start a workflow. I'll hit record, and now we will go to File, Internet Open Web page. I'll paste the URL in and click okay.
JMP is detecting a table on that page. I'll go ahead and select OK. This should give you a taste of what's possible in terms of bringing in data that resides on a single page.
Lets I go to the workflow and stop the recording. When I take a look at what was recorded, JMP is running this JSL, or it's recorded this JSL. It's simply an open statement with the URL name and then some more details about it being an HTML table. Thinking ahead, we would not want to have to go to each of the 51 pages separately.
Ideally, we would have this single import step that we would loop through for all of the states. Now we're going to come back to water hardness, but first we are going to take a look at an example that's written up in the JMP documentation that describes how for each works.
If you follow this link in the materials, you'll navigate directly to the example. I'm going to talk through the example in the slides, and then we'll build it together. Oh, boy. Let's see. Let me bring up my slides.
I will make sure that's fixed in the version that is uploaded. Okay, so in the JMP documentation example, there's a goal to create a set of graphs, and it was simply three distribution graphs for each gender. The data set used in the example is diabetes, where there are two genders, in the way that this was constructed was creating a subset from the diabetes data set with gender 1 and gender 2, creating those distribution plots just for gender 1, and then wrapping that step 2 in a for each so that the workflow works for both gender 1 and gender 2 without having to actually record the steps for gender 2.
Let's take a look at building this from scratch. I'll initiate a new workflow and I'll hit Record. Let's go find the diabetes dataset. Here's the diabetes dataset, and here you can see gender 1, gender 2. We'll first subset by gender. Let's do that again because we want to deselect the rows. Subset by gender. Here in the preview, you can see gender 1, gender 2. I should get two tables, and I do. Here's gender 1, gender 2.
I'll start with gender 1 and create my visualizations. We'll keep it simple, and I will create a distribution with three variables. Now I'll go to my workflow. Let's go to the correct workflow. Here you can see that I've actually recorded several steps because I had to redo this a few times.
You're going to get to see the sausage being made a little bit. We're going to delete the steps that I don't need because I resubmitted it. Let's just play this to make sure we have what we need. Open the data table, subset the data table into gender 1, gender 2, and then I should just get distribution for gender 1. There you go. Gender 1, gender 2, and then this is just the data for gender 1.
Now the for each, I'll take this report step, and I'm going to right-click and say group with for each. This is the thing that I want looped. Now if I come to this for each, the workflow step that's already been selected is the previous step where I've subsetted, so no changes here necessary. Now when I rerun the workflow, you're going to see two distributions being run, one for each gender. There you go. Here's gender 1 and here's gender 2.
How can we use this type of framework to get the data that we want for water hardness from HydroFlow. Do some cleanup here. We're going to go back to the slides, so adapting this for water hardness. Knowing that I wanted to loop through the states, conceptually, I thought, let me create a table just like the diabetes example, and then I'll subset that table into individual tables for each state that contains the URL. I'll record the internet open for one of the states, and then I will wrap that internet open in a for each so that it does the internet open for all of the states.
Let's play the workflow that I've completed. One thing that you'll notice when I play this workflow is that there will only be a handful of states because it would take too long if I included all of the states. I'm going to just run the workflow right before the data cleaning steps, so we can have a conversation, and then we can play the remainder.
The States table gets opened, and then it's subsetted into the individual states that contain the URL. Now you're seeing the internet open happen for each of those states. If you look at the workspace, I have done the internet open for these five states.
If I go back to the workflow and enable the rest of the steps, you're going to see all of this data being combined into a single states table, again, in this situation with just five states, and then a graph showing that water hardness.
Taking a look at the workflow, here is that for each step. This is where I start to describe an additional complexity that exists in this example. The Import HTML, so that Internet Open, there was a need to add some custom JSL. If I expand this, I modified this import step that was recorded from the workflow builder. Now I'm going to step out back into the slide so that we can talk about what those particular changes were.
This statement here is what's recorded by Workflow Builder when I do an internet open on the Alabama. Because I want to reference all of the states, I'm going to replace this with something generic, a variable simply called web address. Then, if we think of just the Alabama table, I need to be able to reference for the web address a particular cell in this data table, and that cell contains the URL.
Stepping through the JMP scripting language, being able to speak to a data table, you say data table, and then you have the data table name inside.
Then when we're referencing a column in that data table, you would call the data table, have a colon, and then the column name. In this case, it would be data table state equals Alabama, URL is the column name. Then finally, because we want a specific cell which contains the URL, we add these brackets. For all of the states, it's just the number 1 because all the tables just contain one row, and that's the cell that contains the URL.
Just to wrap up in this example, you got to see internet open as a way to bring in data from a web page. Then we took a deeper dive into the for each using the JMP documentation example. Then you had a chance to see how I adapted that for accessing data from the internet open with a little bit of custom JSL. Then I will pass it back to Monique to wrap up.
Yes. Thanks, Wendy. We showed you today how you can bring in all sorts of data, images, data from PDFs, data from web pages, and how to automate that without scripting. Workflow Builder is a great tool. You can take it even further by taking a look at that JSL code that it records for you and making some small modifications. Don't be afraid to look at that JSL, and to dive in. Thank you for joining us today, and we hope this has been helpful to you.