Choose Language Hide Translation Bar
View Original Published Thread

Automating the Data Curation Workflow (2020-US-45MP-620)

Jordan Hiller, JMP Senior Systems Engineer, JMP
Mia Stephens, JMP Principal Product Manager, JMP

 

For most data analysis tasks, a lot of time is spent up front — importing data and preparing it for analysis. Because we often work with data sets that are regularly updated, automating our work using scripted repeatable workflows can be a real time saver. There are three general sections in an automation script: data import, data curation, and analysis/reporting. While the tasks in the first and third sections are relatively straightforward — point-and click to achieve the desired result and capture the resulting script — data curation can be more challenging for those just starting out with scripting. In this talk we review common data preparation activities, discuss the JSL code necessary to automate the process, and provide advice for generating JSL code for data curation via point-and-click.

 

 

The Data Cleaning Script Assistant Add-in discussed in this talk can be found in the JMP File Exchange.

 

 

Auto-generated transcript...

 

Speaker

Transcript

mistep Welcome to JMP Discovery summit.
I'm Mia Stephens and I'm a JMP product manager and I'm here with Jordan Hiller, who is a JMP systems engineer. And today we're going to talk about automating the data curation workflow.
And we're going to split our talk into two parts. I'm going to kick us off and set the stage by talking about the analytic workflow and where data curation fits into this workflow.
And then I'm going to turn it over to Jordan for the meat, the heart of this talk. We're going to talk about the need for reproducible data curation.
We're going to see how to do this in JMP 15. And then you're going to get a sneak peek at some new functionality in JMP 16 for recording data curation steps and the actions that you take to prepare your data for analysis.
So let's think about the analytic workflow. And here's one popular workflow. And of course, it all starts with defining what your business problem is, understanding the problem that you're trying to solve.
Then you need to compile data. And of course, you can compile data from a number of different sources and pull these data in JMP. And at the end, we need to be able to share results and communicate our findings with others.
Probably the most time-consuming part of this process is preparing our data for analysis or curating our data.
So what exactly is data curation? Well, data curation is all about ensuring that our data are useful in driving analytics discoveries. Fundamentally, we want to be able to solve a problem with the day that we have.
This is largely about data organization, data structure, and cleaning up data quality issues.
If you think about problems or common problems with data, it generally falls within four buckets. We might have incorrect formatting, incomplete data, missing data, or dirty or messy data.
And to talk about these types of issues and to illustrate how we identify these issues within our data, we're going to borrow from our course, STIPS
And if you're not familiar with STIPS, STIPS is our free online course, Statistical Thinking for Industrial Problem Solving, and it's set up in seven
discrete modules. Module 2 is all about exploratory data analysis.
And because of the interactive and iterative nature of exploratory data analysis and data curation, the last lesson in this module is data preparation for analysis. And this is all about identifying quality issues within your data and steps you might take to curate your data.
So let's talk a little bit more about the common issues. Incorrect formatting. So what do we mean by incorrect formatting?
Well, this is when your data are in the wrong form or the wrong format for analysis. This can apply your data table as a whole. So, for example, you might have your data in separate columns, but for analysis, you need your data stacked in one column.
This can apply to individual variables. You might have the wrong modeling type or data type or you might have date data, data on dates or times that's not formatted that way in JMP.
It can also be cosmeti. You might choose to remove response variables to the beginning of the data table, rename your variables, group factors together to make it easier to find them with the data table.
Incomplete data is about having a lack of data. And this can be on important variables, so you might not be capturing data on variables that can ultimately help you solve your problem
or on combinations of variables. Or it could mean that you simply don't have enough observations, you don't have enough data in your data table.
Missing data is when values for variables are not available. And this can take on a variety of different forms.
And then finally, dirty or messy data is when you have issues with observations or variables.
So your data might be incorrect. The values are simply wrong. You might have inconsistencies in terms of how people were recording data or entering data into the system.
Your data might be inaccurate, might not have a capable measurement system, there might be errors or typos.
The data might be obsolete. So you might have collected the information on a facility or machine that is no longer in service.
It might be outdated. So the process might have changed so much since you collected the data that the data are no longer useful.
The data might be censored or truncated. You might have columns that are redundant to one another. They have the same basic information content or rows that are duplicated. So dirty and messy data can take on a lot of different forms.
So how do you identify potential issues? Well, when you take a look at your data, you start to identify issues. And in fact, this process is iterative and when you start to explore your data graphically, numerically, you start to see things that might be issues that you might want to fix
or resolve. So a nice starting point is to start by just scanning the data table. When you scan your data table, you can see oftentimes some obvious issues.
And for this example, we're going to use some data from the STIPS course called Components, and the scenario is that a company manufactures small components and they're trying to improve yield.
And they've collected data on 369 batches of parts with 15 columns. So when we take a look at the data, we can see some pretty obvious issues right off the bat.
If we look at the top of the data table, we look at these nice little graphs, we can see the shapes of distributions. We can see the values.
So, for example, batch number, you see a histogram. And batch number is something you would think of being an identifier, rather than something that's continuous. So this can tell us that the data coded incorrectly.
When we look at number scrapped, we can see the shape of the distribution. We can also see that there's a negative value there, which might not be possible.
we see a histogram for process with two values, and this can tell us that we need to change the modeling type for process from continuous to nominal.
You can see more when you when you take a look at the column panel. So, for example, batch number and part number are both coded as continuous. These are probably nominal
And if you look at the data itself, you can see other issues. So, for example, humidity is something we would think of as being continuous, but you see a couple of observations that have value N/A.
And because JMP see text, the column is coded as nominal, so this is something that you might want to fix.
we can see some issues with supplier. There's a couple of missing values, some typographical errors. And notice, temperature, all of the dots indicate that we're missing values for temperature in these in these rows. So this is an issue that we might want to investigate further.
So you identify a lot of issues just by scanning the data table, and you can identify even more potential issues when you when you visualize the data one variable at a time.
A really nice starting point, and and I really like this tool, is the column viewer. The column viewer gives you numeric summaries for all of the variables that you've selected.
So for example, here I'm missing some values. And you can see for temperature that we're missing 265 of the 369 values. So this is potentially a problem if we think the temperature is an important factor.
We can also see potential issues with values that are recorded in the data table. So, for example, scrap rate and number scrap both have negative values.
And if this isn't isn't physically possible, this is something that we might want to investigate back in the system that we collected the data in.
Looking at some of the calculated statistics, we can also see other issues. So, for example, batch number and part number really should be categorical.
It doesn't make sense to have the average batch number or the average part number. So this tells you you should probably go back to the data table and change your modeling type.
Distributions tell us a lot about our data and potential issues. We can see the shapes of distributions, the centering, the spread. We can also see typos.
Customer number here, the particular problem here is that there are four or five major customers and some smaller customers. If you're going to use customer number and and analysis, you might want to use recode to group some of those smaller customers together into maybe an other category.
we have a bar chart for humidity, and this is because we have that N/A value in the column. And we might not have seen that when we scan the data table, but we can see it pretty clearly here when we look at the distribution.
We can clearly see the typographical errors for supplier.
And when we look at continuous variables, again, you can look at the shape, centering, and spread, but you can also see some unusual observations within these variables.
So, after looking at the data one variable at a time, a natural, natural progression is to explore the data two or more variables at a time.
So for example, if we look at scrap rate versus number scrap in the Graph Builder.
We see an interest in pattern. So we see these these bands and it could be that there's something in our data table that helps us to explain why we're seeing this pattern.
In fact, if we color by batch size, it makes sense to us. So where we have batches with 5000 parts, there's more of an opportunity for scrap parts than for batches of only 200.
We can also see that there's some strange observations at the bottom. In fact, these are the observations that had negative values for the number of scrap and these really stand out here in this graph.
And when you add a column switcher or data filter, you can add some additional dimensionality to these graphs. So I can look at pressure,
for example, instead of...
Well, I can look at pressure or switch to dwell. What I'm looking for here is I'm getting a sense for the general relationship between these variables and the response.
And I can see that pressure looks like it has a positive relationship with scrap rate. And if I switch to dwell, I can see there's probably not much of a relationship between
dwell and scrap rate or temperature. So these variables might not be as informative in solving the problem.
But look at speed, speed has a negative relationship. And I've also got some unusual observations at the top that I might want to investigate.
So you can learn a lot about your data just by looking at it.
And of course, there are more advanced tools for exploring outliers and missing values that are really beyond the scope of this discussion.
And as you get into the analyze phase, when you start analyzing your data or building models, you'll learn much much more about potential issues that you have to deal with.
And the key is that as you are taking a look at your data and identifying these issues, you want to make notes of these issues.
Some of them can be resolved as you're going along. So you might be able to reshape and clean your data as you proceed through the process.
But you really want to make sure that you capture the steps that you take so that you can repeat the steps later if you have to repeat the analysis or if you want to repeat the analysis on new data or other data.
And at this point is where I'm going to turn it over to to Jordan to talk about reproducible data curation and what this is all about.
Jordan Hiller Alright thanks, Mia. That was great. And we learned what you do in JMP to accomplish data curation by point and click. Let's talk now about making that reproducible.
The reason we worry about reproducibility is that your data sets get updated regularly with new data. If this was a one-time activity, we wouldn't worry too much about the point and click. But when data gets updated over and over,
it is too labor-intensive to repeat the data curation by point and click each time.
So it's more efficient to generate a script that performs all of your data curation steps, and you can execute that script with one click of a button
and do the whole thing at once. So in addition to efficiency,
it documents your process. It serves as a record of what you did. So you can refer to that later for yourself and remind yourself what you did, or for people who come after you and are responsible for this process, it's a record for them as well.
For the rest of this presentation, my goal is to show you how to generate a data curation script with point and click only.
We're hoping that you don't need to do any programming in order to get this done. That program code is going to be extracted and saved for you, and we'll talk a little bit about how that happens.
So there are two different sections. There's what you can do now in JMP 15 to obtain a data curation script, and what you'll be doing once we release JMP 16 next year.
In JMP 15 there are some data curation tasks that generate their own reusable JSL scripting code.
You just execute your point and click, and then there's a technique to grab the code. I'm going to demonstrate that. So tools like recode,
generating a new formula column with the calculation, reshaping data tables, these tools are in the tables menu. There's stack, split, join, concatenate, and update. All of these tools
in JMP 15 generate their own script after you execute them by point and click.
There are other common tasks that do not generate their own JSL script and in order to make it easier to accomplish these tasks and make them reproducible,
And it helps with the following tasks, mostly column stuff, changing the data types of columns, the modeling types, changing the display format,
renaming, reordering, and deleting columns from your data table, also setting column properties such as spec limits or value labels.
So the Data Cleaning Script Assistant is what you'll use to assist you with those tasks in JMP 15.
We are going to give you a sneak preview of JMP 16 and we're very excited about new features in the log in JMP 16, I think it's going to be called the enhanced log mode.
The basic idea is that in JMP 16 you can just point and click your way through your data curation steps as usual.
The JSL code that you need is generated and logged automatically. All you need to do is grab it and save it off. So super simple and really useful, excited to show that to you.
Here's a cheat sheet for your reference. In JMP 15 these are the the tasks on the left, common data curation tasks; it's not an exhaustive list.
And the middle column shows how you accomplish them by point and click in JMP. The method for extracting the reusable script is listed on the right. So I'm not going to cover everything in here.
But yeah, this is for you for your reference later. Let's get into a demo. And I'll show how to address some of those issues that Mia identified with the components data table. I'm going to start in JMP 15.
And the first thing that we're going to talk about are some of those column problems, changing
changing
the data types, the modeling types, that kind of thing. Now, if you were just concerned with point and click
in JMP, what you would ordinarily do is, for for let's say for humidity. This is the column you'll remember that has
some text in that and it's coming in mistakenly as a character column. So to fix that by point and click, you would ordinarily right click, get into the column info, and address those changes here.
This is one of those JMP tasks that doesn't leave behind usable script in in JMP 15. So for this, we're going to use the data cleaning script assistant instead. So here we go.
It's in the add ins menu, because I've installed it, you can install it too. Data cleaning script assistant, the tool that we need for this is Victor the cleaner. This is a
graphical user interface for making changes to columns, so we can address data types and modeling types here. We can rename columns, we can change the order of columns, and delete columns, and then save off the script. So let's make some changes here.
For humidity, that's the one with the the N/A values that caused it to come in as text. We're going to change it from a character variable to a numeric variable. And we're going to change it from nominal to continuous.
We also identified batch number needs to come...needs to get changed to to nominal; part number as well needs to get changed to nominal and the process, which is a number right now, that should also be
nominal.
fab tech. So that's not useful for me. Let's delete the facility column. I'm going to select it here by clicking on its name and click Delete.
Here are a couple of those cosmetic changes that Mia mentioned. Scrap rate is at the end of my table. I want to move it earlier. I'm going to move it to the fourth position after
customer number. So we select it and use the arrows to move it up in the order to directly after customer number.
Last change that I'm going to make is I'm going to take the pressure variable and I'm going to rename it.
My engineers in my organization called this column psi. So that's the name that I want to give that column. Alright, so that's all the changes that I want to make here.
I have some choices to make. I get to decide whether the script gets saved to the data table itself. That would make a little script section over here in the upper left panel.
Where to save it to its own window, let's save it to a script window. You can also choose whether or not the cleaning actions you specified are executed when you click ok. Let's let's keep the execution and click OK.
So now you'll see all those changes are made. Things have been rearrange, column properties have changed, etc. And
we have a script. We have a script to accomplish that. It's in its own window and this little program will be the basis. We're going to build our data curation script around it. Let's let's save this. I'm going to save it to my desktop. And I'm going to call this
v15 curation script.
changing modeling types, changing data types, renaming things, reordering things.
These all came from Victor. I'm going to document this in my code. It's a good idea to leave little comments in your code so that you can read it later. I'm going to leave a note that says this is from the Victor tool.
And let's say from DCSA, for data cleaning script assistant Victor. So that's a comment. The two slashes make a
line in your program; that's a comment. That means that the program interpreter won't try to execute that as program code. It's recognized as just a little note and you can see it in green up there.
Good idea to leave yourself little comments in your script. All right, let's move on.
The next curation task that I'm going to address is a this supplier column. Mia told us how there were some problems in here that need to be addressed.
We'll use the recode tool for this. Recode is one of the tools in JMP 15 that leaves behind its own script, just have to know where to get it. So let's do our recode and grab the script, right click
recode.
And we're going to fix these data values. I'm going to start from the red triangle. Let's start by converting all of that text to title case, that cleaned up this lower case Hersch value down here.
Let's also trim extra white space, extra space characters. That cleaned up that that leading space in this Anderson. Okay.
And so all the changes that you make in the recode tool are recorded in this list and you can cycle through and undo them and redo them and cycle through that history, if you like.
All right, I have just a few more changes to make. I'll make the manually. Let's group together the Hershes, group together the Coxes, group together all the Andersons.
Trutna and Worley are already correct. The last thing I'm going to do is address these missing values. We'll assign them to their own category of missing.
That is my recode process. I'm done with what I need to do. If I were just point and clicking, I would go ahead and click recode and I'd be done. But remember, I need to get this script. So to do that, I'm going to go to the red triangle.
Down to the script section and let's save this script to a script window.
Here it is saved to its own script window and I'm just going to paste that section to the bottom of my curation script in process.
So let's see. I'm just going to grab everything from here. I don't even really have to look at it. Right. I don't have to be a programmer, Control C, and just paste it at the bottom. And let's leave ourselves a note that this is from the recode red triangle.
Alright, and I can close this window. I no longer need it. And save these updates to my curation scripts. So that was recode and the way that you
get the
code for it.
All right, then the next task that we're going to address is calculating
a yield. Oh, I'm sorry. What I'm going to do is I'm going to actually execute that recode. Now that I've saved the script, let's execute the recode. And there it is, the recoded supplier column. Perfect.
All right, let's calculate a yield column. This is a little bit redundant, I realize we already have the scrap rate,
but for purposes of discussion, let's show you how you would calculate a new column and extract its script. This is another place in JMP 15 where you can easily get the script if you know where to look. So making our yield column.
New column, double click up here, rename it from column 16 to yield.
And let's assign it a formula.
To calculate the yield, I need to find how many good units I have in each batch, so that's going to be the batch size minus the number scrapped.
So that's the number of good units I have in every batch. I'm going to divide that by the total batch size and here is my yield column. Yes, you can see that yield here is .926.
Scrap rate is .074, 1 minus yield. So good. The calculation is correct. Now that I've created that yield column, let's grab its script. And here's the trick, right click, copy columns.
from right click, copy columns.
Paste. And there it is. Add a new column to the data table. It's called yield and here's its formula.
Now, I said, you don't need to know any programming, I guess here's a very small exception. You've probably noticed that
there are semicolons at the end of every step in JSL. That separates different JSL expressions and if you add something new to the bottom of your script, you're going to want to make sure that there's a semicolon in between. So I'm just typing a semicolon. The copy columns
function did not add the semicolon so I have to add it manually.
All right, good. So that's our yield column. The next thing I'd like to address is this.
My processes are labeled 1 and 2. That's not very friendly. I want to give them more descriptive labels. We're going to call Process Number 1, production; and Process Number 2, experimental. We'll do that with value labels.
Value labels are an example of column properties. There's an entire list of different column properties that you can add to a column.
This is things like the units of measurement. This is like if you want to change the order of display in a graph, you can use value ordering.
If you want to add control limits or spec limits or a historical sigma for your quality analysis, you can do that here as well.
Alright. So all of these are column properties that we add, metadata that we add to the columns.
And we're going to need to use the Data Cleaning Script Assistant to access the JSL script for adding these column properties. So here's how we do it. At first, we add the column properties, as usual, by point and click. I'm going to add my value labels.
Process Number 1, we're going to call production. Add.
Process Number 2, we're going to call experimental.
And by adding that value label column property,
I now get nice labels in my data table. Instead of seeing Process 1 and Process 2, I see production and experimental.
Data Cleaning Script Assistant. We will choose the property copier.
A little message has popped up saying that the column property script has been copied to the clipboard and then we'll go back to our script in process.
from the DCSA property copier.
And then paste, Control V to paste. There is the script that we need to assign those two value labels. It's done.
Very good. Okay, I have one more data curation step to go through, something else that we'll need the Data Cleaning Script Assistant for.
We want to consider only, let's say, the rows in this data table where vacuum is off. Right. So there are 313 of those rows.
And I just want to get rid of the rows in this data table where vacuum is on. So the way you do it by point and click is is selecting those, much as I did right now, and then running the table subset command.
In order to get usable code, we're going to have to use the Data Cleaning Script Assistant once again. So here's how to subset this data table to only the rows were vacuum is off. First, I'm going to use, under the row menu, under the row selection submenu, we'll use this Select Where
command in order to get some reusable script for the selection. We're going to select the rows were vacuum is off.
And before clicking okay to execute that selection, again, I will go to the red triangle, save script to the script window.
Control A. Control C to copy that and let's paste that at again
From rows.
Select Where
Control V. So there's the JSL code that selects the rows where vacuum is off. Now I need, one more time,
need to use the Data Cleaning Script Assistant to get the selected rows. Oh, let us first actually execute the selection.
There it is. Now with the row selected, we'll go up to again add ins, Data Cleaning Script Assistant, subset selected rows.
I'm being prompted to name my new data table that has the subset of the data. Let's call it a vacuum, vacuum off. That's my new data table name. Click OK, another message that the subset script has been copied to the clipboard.
And so we paste it to the bottom.
There it is.
And this is now our complete data curation script to use in JMP 15 and let's just run through what it's like to use it in practice. I'm going to close the data table that we've been working on and making corrections to doing our curation on. Let's close it and revert back to
the messy state.
Make sure I'm in the right version of JMP.
All right. Yes, here it is, the messy data. And let's say some new rows have come in because it's a production environment and new data is coming in all the time. I need to replay my data curation workflow.
run script.
It performed all of those operations. Note the value labels. Note that humidity is continuous. Note that we've subset to only the rows where vacuum is off. The entire workflow
is now reproducible with a JSL script. So that's what you need to keep in mind for JMP 15. Some tools you can extract the JSL script from directly; for others, you'll use my add in, the Data Cleaning Script Assistant.
And now we're going to show you just how much fun and how easy this is in JMP 16. I'm not going to work through the entire workflow
over again, because it would be somewhat redundant, but let's just go through some of what we went through. Here we are in JMP 16 and I'm going to open the log.
The log looks different in JMP 16 and you're going to see some of those differences presently. Let's open the the messy components data.
Here it is. And you'll notice in the log that it has a section that says I've opened the messy data table. And down here. Here is that JSL script that accomplishes
what we just did. So this is like a running log that that automatically captures all of the script that you need. It's not complete yet.
There are new features still being added to it. And I, and I assume that will be ongoing.
But already this this enhanced log feature is very, very useful and it covers most of your data curation activities.
I should also mention that, right now, what I'm showing to you is the early adopter version of JMP. It's early adopter version 5. So when we fully release
the production version of JMP 16 early next year, it's probably going to look a little bit different from what you're seeing right now.
Alright, so let's just continue and go through some of those data curation steps again. I won't go through the whole workflow, because it would be redundant. Let's just do some of them.
I'll go through some of the things we used to need Victor for. In JMP 16 we will not need the Data Cleaning Script Assistant. We just do our point and click as usual.
So, humidity, we're going to change from character to numeric and from nominal to continuous and click OK. Here's what that looks like in the structured log. It has
captured that JSL. All right, back to the data table. We are going to change the modeling type of batch number and part number and process from continuous to nominal.
That's done. That has also been captured in the log. We're going to delete the facility column, which has only one value, right click Delete columns. That's gone.
PSI.
OK, so those were all of the tool...all of the things that we did in Victor in JMP 15. Here in JMP 16, all of those are leaving behind JMP script that we can just copy and reuse down here. Beautiful.
All right. Just one more step I will show you. Let's show the subset to vacuum is off. Much, much simpler here in JMP 16. All we need to do is
select all the off vacuums; I don't even need to use the rows menu, I can just right click one of those offs and select matching cells, that selects the 313 rows where vacuum is off. And then, as usual, to perform the subset, to select to subset to only the selected rows, table subset
and we're going to create a new table called vacuum off that has only our selected rows and it's going to keep all the columns. Here we go.
That's it. We just performed all of those data curation steps. Here's what it looks like in the structured log.
And now to make this a reusable, reproducible data curation script, all that we need to do is come up to the red triangle, save the script to a script window. I'm going to save this to my data...to my desktop as a v16 curation script.
And here it is. Here's the whole script. So let's uh let's close all the data in JMP 16 and just show you what it's like to rerun that script.
Here I am back in the home window for JMP 16. Here's my curation script. You'll notice that the first line is that open command, so I don't even need to open the data table. It's going to happen in line right here.
All I need to do is, when there's new data that comes in and and this file has been updated, all that I need to do to do my data curation steps is run the script.
And there it is. All the curation steps and the subset to the to the 313 rows. So that is using the enhanced log in JMP 16 to capture all your data curation work and change it into a reproducible script.
Alright, here's that JMP 15 cheat sheet to remind you once again, these, this is what you need to know in order to extract the reusable code when you're in JMP 15 right now, and you won't have to worry about this so much once we release JMP 16
in early 2021.
So to conclude, Mia showed you how you achieve data curation in JMP. It's an exploratory and iterative process where you identify problems and fix them by point and click.
When your data gets updated regularly with new data, you need to automate that workflow in order to save time
And also to document your process and to leave yourself a trail of breadcrumbs when you when you come back later and look at what you did.
The process of automation is translating your point and click activities into a reusable JSL script.
We discussed how in JMP 15 you're going to use a combination of both built in tools and tools from the Data Cleaning Script Assistant to achieve these ends.
And we also gave you a sneak preview of JMP 16 and how you can use the enhanced log to just automatically passively capture your point and click data curation activities and leave behind a beautiful reusable
reproducible data curation script.
All right. That is our presentation, thanks very much for your time.