Automating the Data Curation Workflow (2021-EU-45MP-737)

Level: Beginner

 

Jordan Hiller, JMP Senior Systems Engineer, SAS
Mia Stephens, JMP Principal Product Manager, SAS

 

For most data analysis tasks, a lot of time is spent up front importing data and preparing it for analysis. Because we often work with datasets that are regularly updated, automating our work using scripted repeatable workflows can be a real time saver. There are three general sections in an automation script: data import, data curation, and analysis/reporting. While the tasks in the first and third sections are relatively straightforward – point and click to achieve the desired result, and capture the resulting script – data curation can be more challenging for those just starting out with scripting. In this talk we review common data preparation activities, discuss the JSL code necessary to automate the process, and demonstrate how you can use the new JMP 16 action recording and enhanced log to create a data curation script.

 

 

Auto-generated transcript...

 

Speaker

Transcript

Mia Stephens Welcome to JMP Discovery Summit. I am Mia Stephens, and I am a JMP product manager. And I'm joined by Jordan Hiller, who is a systems engineer. And today we're going to talk about automating the data curation workflow.
And this is the abstract just for reference; I'm not going to talk about it.
And we're going to break this talk into two parts. I'm going to kick it off and talk about the analytic workflow
and talk about data curation, what we mean by data curation. And we're going to see how to identify potential data quality issues in your data.
And then I'm going to turn it over to Jordan, and Jordan is going to talk about the need for reproducibility.
He's going to share with us a cheat sheet for data curation and show how to curate your data in JMP using the action recorder and the enhanced log.
So let's talk about the analytic workflow.
It all starts with having some business problem that we're trying to solve.
And of course we need to compile data, and you can compile data from a number of different sources and bring the data into JMP.
And at the end, we need to be able to share results, communicate our findings with others. Now sometimes this is maybe a one-off project,
but oftentimes we have analysis that we're going to repeat. So a core question addressed by this talk is, can you easily reproduce your results?
Can others reproduce your results? Or, if you have new or updated data, can you easily repeat your analysis, and particularly the data curation steps, on these new data? So this is what we're addressing in this talk.
But what exactly is data curation?
And why do we need to be concerned about it?
Well, data curation is all about ensuring that our data are useful in driving analytic discoveries. Fundamentally, we need to be able to solve the problems that we're trying to address.
And it's largely about data organization, data structure, and also data cleanup.
If you think about issues that we might encounter with data, they tend to fall into four general categories
you might have incorrect formatting, you might have incomplete data, missing data, or dirty or messy data.
And to help us talk about these issues, we're going to borrow some content from STIPS. And if you're not familiar with STIPS, STIPS is our free course, Statistical Thinking for Industrial Problem Solving.
And this is a course based on seven independent modules, and the second module is exploratory data analysis.
And because of the iterative and interactive nature of exploratory data analysis and data curation, the last lesson in this model...in this module is called Data Preparation for Analysis, so we're borrowing heavily from this lesson throughout this talk.
Let's break down each one of these issues. Incorrect formatting. What do we mean by incorrect formatting? Well, this is when your data are in the wrong form or the wrong format for analysis. This can apply to the data table as a whole. So, for example, you might have
data stored in separate columns, but you actually need the data stored in one column.
Or it could be that you have your data in separate data tables and you need to either concatenate or update or join the data tables together.
It can relate to individual variables. So, for example, you might have the wrong modeling type. Or you might have columns of dates in your data table that are not formatted as dates, so the analysis might not recognize that this is date data.
Formatting can also be cosmetic. So, for example, if you're dealing with a large data table, you might have many columns...
you might have column names that are not really recognizable that you might want to change.
You might have a lot of columns that you might want to group together to make it a little bit more manageable.
Your response column might be at the very end of the data table and you might want to move it up. So cosmetic issues won't necessarily get in the way of your analysis, but if you can address some of these issues, you can make your analysis a little bit easier.
Incomplete data is when you have a lack of data. This can be a lack of data on important variables. So, for example, you might not have captured data on variables that are fundamental in solving the problem. It could also be a lack of data on a combination of variables.
So, for example, you might not have enough information to estimate an interaction.
Or you might have a target variable that you're interested in that is unbalanced. So you might be studying something like defects, and maybe only 5% of your observations are for defects, so you only have a very small subset of your data where the defect is present. You might not have enough data to allow you to understand potential causes of defects. You might also not have a big enough sample size, so you simply don't have enough data to have good estimates.
Missing data is when you're missing values for variables, and this can take several different forms.
If you're missing data and the data...the missingness is not at random, this can cause a serious problem, so you might have biased estimates.
If you're missing data completely at random, this might not be a problem if you're only missing a few observations, but if you're missing a lot of data, then this can be problematic.
Dirty and messy data is when you have issues with observations or with variables. So you might have incorrect values, values are simply wrong.
You might have inconsistency. So, for example, you might have typos or typographical errors when people enter things differently.
The values might be inaccurate. So, for example, you might have issues with your measurement system. There can be errors, there can be typos, the data might be obsolete.
Obsolete data is when you have data on, for example, a facility or machine that is no longer in service.
The data might be outdated, so you might have data going back a two or three year period, but the process might have changed somewhere in that timeframe,
which means that those those historical data might not be relevant to the current process as it stands today. Your data might be censored, or it might be truncated.
You can have redundant columns, which are columns that contain essentially the same information. Or you might have duplicated observations.
So dirty or messy data can take on a lot of different forms.
So how do you identify potential issues?
Well, a good starting point is to explore your data and, in fact, identifying issues leads you into data exploration and then analysis. And as you start exploring your data, you start to identify things that might cause you problems in your analysis.
So a nice starting point is to scan the data table for obvious issues.
So we're going to use an example throughout the rest of this talk called Components, and this is an example from the STIPS course
where a company is producing small components and they have an issue with yield. So the data were collected; there are 369 batches, there are 15 characteristics that have been captured, and we want to use these data to help us understand potential root causes of low yield.
So if we start looking at the data table itself, there are some clues to what kinds of data quality issues we might have. And a really nice starting point, (and this was added in JMP 15)
is header graphs. I love header graphs. What they do is, if you have a continuous variable, they show you a histogram, so you can see the centering and the shape and the spread of the distribution.
They also show you the range of the values. If you have categorical data, they'll show you a bar chart with values for the most populous bars. So let's take a look at some of these. I'll start with batch number. Batch number is showing a histogram, and it's actually uniform in shape.
Batch number is something that's probably an identifier, so right off the bat I can see that these data are probably coded incorrectly.
Looking at the scrap column, I can see that this distribution is highly skewed, and I can also see the lowest value is -6, and this causes me to ask questions about the feasibility of having a negative scrap number.
Process is another one. I've got basically two bars and it's actually showing me a histogram with only two values.
And as I'm looking at these column graphs, these header graphs, I can look at the column panel,
and it's pretty easy to see that, for example, batch number and part number and process are all coded as continuous. When you import data into JMP, if JMP sees numbers, it's automatically going to code these columns as numeric continuous. So these are things that we might want to change.
We can also look at the data itself. So, for example, when I look at humidity (and context is really important when you're looking at your data) humidity is something that we would think of as being continuous data.
But I've got a couple of fields here where I've got N/A.
If you have alphanumeric data, if you have text data in a numeric column, the column is going to be coded as nominal when you pull the data into JMP. So this is something right off the bat that we see that we're going to need to fix.
And I can also look through the other columns. So, for example, for Supplier I see that I'm missing some values. When you pull data into JMP, if there are
empty cells for categorical data, we know that we're missing values. I can also see that there are some entries where we're not consistent in the way that the data were entered.
So I'm getting some serious clues into some potential problems with my data.
For temperature, notice all the dots. Temperature is a continuous variable, and where I see dots, it's indicating that I'm missing values. Since temperature is something that's really important for my analysis, this might be problematic.
A natural extension of this is to start to explore the data one variable at a time. One of my favorite tools when I'm first starting to look at data is the Columns Viewer. The Columns Viewer gives us numeric summaries for the variables that we've selected.
If we're missing values, there's going to be an N Missing column. And here I can see that I'm missing 265 of the 369 values for temperature, so this is a serious gap if we think temperature is going to be important in the analysis.
I can also see if I've got some strange values. So when I look at things like Mins and Maxes for number scrapped and the scrap rate, I've got negative values. And if this isn't feasible, then I've got an issue with the data or the data collection system.
It's also pretty easy to see miscoding of variables. So, for example, facility and batch number, which should probably be coded as nominal, are reporting a mean and a standard deviation.
And a good way to think about this is, if it's not physically possible to have an average batch number or part number (part number was the other variable), then these should be changed to nominal variables instead of continuous.
Distributions are the next place I go when I'm first getting familiar with my data.
And distributions, if you've got continuous data, allow you to understand the shape, centering, and spread of your data, but you can also see if you've got some unusual values.
For categorical data, you can also see how many levels you have. So, for example, customer number. If customer number is an important variable or potentially important,
I've got a lot of levels or values for customer number. When I'm preparing the data, I might want to combine these into four or five buckets, with an "other" category for those customers where I don't really have a lot of data.
Humidity. We see the problem with having N/A in the column. We see a bar chart instead of a histogram.
We can easily see what we were looking at in the data for supplier. For example, Cox Inc and Cox, Anderson spelled three different ways, Hersh is spelled three different ways.
For speed, notice for speed that we've got a mounded distribution that goes from around 60 to 140 but, at the very bottom, we see that there's a value or two
that's pretty close to zero. It might have been a data entry error, but it's definitely something that we'd want to investigate.
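(As an aside, if you wanted to launch this same exploration from a script, a Distribution call along these lines would produce the same kind of report. This is a sketch rather than the exact script JMP would save, using the column names as spoken here:)

    Distribution(
        Continuous Distribution( Column( :Speed ) ),     // histogram for the continuous speed column
        Nominal Distribution( Column( :Supplier ) )       // bar chart for the supplier column
    );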
An extension of this is to start looking at your data two variables at a time. So, for example, using Graph Builder or scatterplots.
And when you look at variables two at a time, you can see patterns and you can more easily see unusual patterns that cross more than one variable.
So, for example, if I look at scrap rate and number scrapped, I see that I've got some bands. And it might be that you have something in your data table that can explain this pattern.
And in this case, the banding is attributed to different batch sizes, so this purple band is where I have a batch size of 5,000.
And I have a lot more opportunity for scrap with a larger batch size than I do for a smaller batch size. So that might make some sense, but I also see something that doesn't make sense. These two values down here in the negative range.
So it's pretty easy to see these when I'm looking at data in two dimensions.
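(That two-variable view can also be scripted. A minimal Graph Builder sketch for the scrap rate versus number scrapped plot, colored by batch size to show the banding, would look roughly like this; options and legend settings are omitted:)

    Graph Builder(
        Variables(
            X( :Number Scrapped ),
            Y( :Scrap Rate ),
            Color( :Batch Size )    // coloring by batch size reveals the bands
        ),
        Elements( Points( X, Y ) )
    );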
I can add additional dimensionality to my graphs by using column switchers and data filters.
This is also leading me into potential analysis, so I might be interested in understanding what are the x's that might be important, that might be related to scrap rate.
And at the same time, look at data quality issues or potential issues. So for scrap rate, it looks like there's a positive relationship between pressure and scrap rate.
doesn't look like there's too much of a relationship. Scrap rate versus temperature, this is pretty flat, so there's not much going on here.
But notice speed. There's a negative relationship, but across the top I see those two values; the one value that I saw in the histogram, but there's a second value that seems to stand out.
So it could be that this value around 60 is simply an outlier, but it could be a valid point. I would probably question whether this point here down near 0 is valid or not.
So we've looked at the data table. We've looked at data one variable at a time. We've looked at the data two variables at a time, and all of this fits right in with the data exploration and leads us into the analysis.
There are more advanced tools that we might use (for example, explore outliers, explore missing) that are beyond the scope of this course, or this talk.
And when you start analyzing your data, you'll likely identify additional issues. So, for example, if you've got a lot of categories of categorical variables
and you try to fit an interaction in a regression model, JMP will give you a warning that you can't really do this. So as you start to analyze data, this whole process is iterative, and you'll identify potential issues throughout the process.
A key is that you want to make note of issues that you encounter as you're looking at your data. And some of these can be
corrected as you go along, so you can hide and exclude values, and you can reshape or re-clean your data as you go along. But you might decide that you need to collect new data; you might want to conduct a DOE so that you have more confidence in the data itself.
If you know that you're going to repeat this analysis or that somebody else will want to repeat this analysis,
then you're going to want to make sure that you capture the steps that you're taking so that you have reproducibility. Someone else can reproduce your results, or you can repeat your analysis later. So this is where I'm going to turn it over to Jordan, and Jordan's going to
talk about reproducible data curation.
Jordan Hiller Okay, thank you, Mia.
Hello, I am Jordan Hiller. I am a systems engineer for JMP. Let's drill in a little bit and talk some more about reproducibility
for your data curation.
Mia introduced this idea very nicely, but let's give a few more details. The idea here is that we want to be able to easily re-perform
all of our curation steps that we use to prepare our data for analysis, and there are three main benefits that I see to doing this. The first is efficiency: if your data changes and you need to replay these curation steps on new data in the future,
it's much more efficient to run it once with a one-click script than it is to go through all of your point-and-click activities over again. Accuracy is the second benefit. Point and click can be prone to error, and by making it a script, you ensure accurate reproduction.
And lastly is documentation, and this is maybe underappreciated. If you have a script, it documents the steps that you
took. It's a trail of breadcrumbs that you can revisit later when, inevitably, you have to revisit this project and remember, what is it that I did to prepare this data?
Having that script is a big help. So today we're going to go through a case study. I'm going to show you how to generate one of these reproducible data curation scripts
using only point and click. And the enabling technology is something new in JMP 16. It is the enhanced log and the action recording that's found in the enhanced log.
So here's what we're going to do, we are going to perform our data curation activities as usual by point and click. As we do this,
the script that we need, the computer code (it's called JSL code; JSL stands for JMP Scripting Language), is going to be captured for us automatically
in the new enhanced log. And then when we're done with our point-and-click curation, all we need to do is grab that code and save it out. We might want to make a few tweaks, a few modifications, just to make it a little bit stronger, but that part is optional.
Okay, so this is a cheat sheet that you can use. This is some of the most common data cleaning activities
and how to do them in JMP 16 in a way that leaves yourself that trail of breadcrumbs, in a way that leaves the JSL script in the enhanced log. So it covers things like operating on rows, operating on columns,
ways to modify the data table, all of our data cleaning operations and how to do them by point and click.
It's not an exhaustive list of everything that you might need to do for data cleaning, and it's not an exhaustive list of everything that's captured in the enhanced log either, but this is the most important stuff, here at your fingertips.
Alright, so let's go into our case study using that Components file that Mia introduced and make our data curation script in JMP 16 using the enhanced log.
Here we are in JMP 16. I will note that this is the last version of the early adopter program for JMP 16, so this is pre-release. However, I'm sure this is going to be very, very similar to the actual release version of JMP 16.
So, to get to the log,
I'll show it to you here. This looks different if you're used to the log from previous versions of JMP. It's divided into these two panels,
okay, a message panel at the top and a code panel at the bottom. We're going to spend some time here, but let me just give you a quick preview: I'll do some quick activities, like importing a file and maybe deleting this column.
You can see that those two steps that I did (the data import and deleting the column), they are listed up here in this message panel
and the code, the JSL code that we need for a reproducible data curation script, is down here in this bottom panel. Okay, so
it's really very exciting; the ability to just have this code and grab it whenever you need it, just by pointing and clicking, is a tremendous benefit in JMP 16.
In JMP 16, this new enhanced log view is on by default. If you want to go back to the old version of the log, that simple text log, you can do that here in the JMP preferences section. There's a new
section for the log and you can switch back to the old text view of the log, if you prefer. The default when you install JMP 16 is the enhanced log and we will talk about some of these other features a little bit later on, further on in our case study.
Alright, so I'm going to clear out
the log for now from the red triangle. Clear the log and let's start with our case study. Let's import that Components data that Mia was sharing with you. We're going to start from this csv file.
So
I'm going to perform the simplest kind of import, just by dragging it in. Oh, I had a version of it open already. I'm sorry, let me start by closing the old version
and clear the log one more time.
Okay, simple import
by dragging it into the JMP window. And now we have that file, Components A, with 369 batches, and let's now proceed with our data cleaning activities. I'll turn on
the header graphs. And first thing we can see is that the facility column has just one entry, one value in it, FabTech, so there's no variation, nothing interesting here. I'm just going to delete it with a right click, delete the column.
And again, that is captured as we go in the enhanced log.
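(For reference, the JSL that ends up in the log for these first two steps looks roughly like this. The CSV path here is an assumption; the log records whatever path you actually imported from:)

    // Import the raw CSV (the log records the full path of the file you dragged in)
    Open( "Components A.csv" );

    // Delete the constant Facility column
    Data Table( "Components A" ) << Delete Columns( "Facility" );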
Okay, what else? Let's imagine that this scrap rate column near the end of the table is really important to us and I'd like to see it earlier in the table. I'm going to move it to the fourth position
by grabbing it in the columns panel and dragging it to right after customer number. There we go.
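(The column move is captured as a data table message along these lines; this is a sketch, and the exact arguments the log writes may differ slightly:)

    // Move Scrap Rate so it appears right after Customer Number
    Data Table( "Components A" ) << Move Selected Columns(
        {:Scrap Rate},
        After( :Customer Number )
    );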
Mia mentioned that this humidity column is incorrectly represented on import, chiefly due to those N/A
alphabet characters that are causing it to come in as a character variable. So let's fix that. We are going to go into the column info with the right click here
and change the data type from character to numeric, change the modeling type from nominal to continuous. Click OK. And let's just click over to the log here, and you can see, we have four steps now that have been captured and we'll keep going.
Alright, what's next? We have several variables that need to be changed from continuous to nominal. Those are batch number, part number, and process. So with the three of those selected, I will right click and change from continuous to nominal.
And those have all been corrected.
And again, we can see that those three steps are recorded here in the log.
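(In the log, the humidity fix and the three modeling-type changes come out as column messages roughly like these, using the column names as spoken here:)

    // Humidity: character -> numeric continuous
    Column( Data Table( "Components A" ), "Humidity" ) << Data Type( Numeric ) <<
        Set Modeling Type( "Continuous" );

    // Batch Number, Part Number, and Process: continuous -> nominal
    Column( Data Table( "Components A" ), "Batch Number" ) << Set Modeling Type( "Nominal" );
    Column( Data Table( "Components A" ), "Part Number" ) << Set Modeling Type( "Nominal" );
    Column( Data Table( "Components A" ), "Process" ) << Set Modeling Type( "Nominal" );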
All right, what else? Something else a little bit cosmetic, this column, Pressure. My engineers like to see that column name as PSI, so we'll change it just by selecting that column and typing PSI. Tab out of there to go to somewhere else. That's going to be captured in the log as well.
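(Renaming a column is a one-line column message; the corresponding log entry is roughly:)

    // Rename Pressure to PSI
    Column( Data Table( "Components A" ), "Pressure" ) << Set Name( "PSI" );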
The supplier. Mia showed us that there are some, you know, inconsistent spellings. Probably too many values in here. We need to correct the character values. When you have incorrect, inconsistent character values in a column, think of the recode tool.
The recode tool is a really efficient way to address this. So with the right click on supplier, we will go to recode.
And let's group these appropriately. I'm going to start with some red triangle options. Let's convert all of the values to title case,
let's also trim that white space, so inconsistent spacing is corrected. That's already corrected a couple of problems. Let's correct everything else manually. I'm going to group together the Andersons. I'm going to group together the Coxes.
Group the Hershes. Trutna and Worley are already correct, each with a single category. And the last correction I'll make is for things that are just listed as blank or missing; I'll give them an explicit
missing label here.
All right, and when we click recode,
we've made those fixes into a new column called supplier 2.
That just has 1, 2, 3, 4, 5, 6 categories corrected and collapsed.
Good.
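(The Recode tool writes its own JSL into the log. A much-simplified stand-in for what it does, shown here with placeholder spellings since the real list of variants comes from the data, is a new character column built with a Match() formula:)

    // Simplified illustration of the recode; not the Recode tool's actual output
    Data Table( "Components A" ) << New Column( "Supplier 2",
        Character,
        "Nominal",
        Formula(
            Match( Trim( :Supplier ),
                "Cox Inc", "Cox",       // placeholder variants mapped to the clean value
                "cox", "Cox",
                "", "Missing",          // blanks get an explicit Missing label
                Trim( :Supplier )       // anything else passes through unchanged
            )
        )
    );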
Okay let's do a calculation. We're going to calculate yield here, using batch size and the number scrapped.
Right. And yeah, I realize this is a little redundant. We already have scrap rate and yield is just one minus scrap rate, but just for sake of argument, we'll perform
the calculation. So I want that yield column to get inserted right after number scrapped, so I'm going to highlight the number scrapped and then I'll go to the columns menu, choose new column.
We're going to call this thing yield.
And we're going to insert our new column after the selected column, after number scrapped, and let's give it a formula to calculate the yield.
We need the number of good units. That's going to be batch size minus number scrapped.
So that's the number of good units and we're going to divide that whole thing by
the batch size.
Number of good units divided by batch size, that's our yield calculation.
And click OK.
There's our new
yield column. We can see that it's one minus scrap rate. That's good, and let's ignore, for now, the fact that we have some yields that are greater than 100%.
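(The New Column dialog, including the formula, is captured as a single message that looks roughly like this; the step that places the column after Number Scrapped is omitted here:)

    // Yield = good units / batch size
    Data Table( "Components A" ) << New Column( "Yield",
        Numeric,
        "Continuous",
        Formula( ( :Batch Size - :Number Scrapped ) / :Batch Size )
    );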
Okay we're nearly done. Just a few more changes. I've noticed that we have two processes, and they're, for now, just labeled process 1 and process 2.
That's not very descriptive, not very helpful. Let's give them more descriptive labels. Process 1, we'll call production process; and process 2, we'll call experimental. So we'll do this with value labels, rather than recoding. I'll go into column info
and we will assign value labels to one and two.
One in this column is going to represent production.
Add that.
And two represents experimental.
Add that.
Click OK. Good. It shows one and two in the header graphs, but production and experimental here in the data table.
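(Value labels are a column property, so the log entry is a column message roughly like the one below; the exact property syntax is shown as an assumption:)

    // Label the two process codes without changing the stored values
    Column( Data Table( "Components A" ), "Process" ) << Set Property(
        "Value Labels",
        {1 = "Production", 2 = "Experimental"}
    );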
All right, one final step before we save off our script. Let's say, for sake of argument, that I'm only interested in the data...I want to proceed with analysis only when vacuum is off.
Right, so I'm going to subset the data and make a new data table that has only the rows where vacuum is off. I'll do that by right clicking one of these cells that has vacuum off
and selecting matching cells.
That selects the 313 rows where vacuum is off. And now we'll go to table subset,
create a new data table, which we will name vac_off.
Click okay.
All right, and that's our new data table with 313 rows, only showing data where vacuum is off.
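(Those last two steps, selecting the matching rows and then subsetting them, are captured roughly as follows; the "Off" value is an assumption about how the vacuum column is coded:)

    // Select the rows where Vacuum is Off, then subset them into a new table
    Data Table( "Components A" ) << Select Where( :Vacuum == "Off" );
    Data Table( "Components A" ) << Subset(
        Selected Rows( 1 ),
        Output Table( "vac_off" )
    );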
So that's the end. We have done all of our data curation and now let's go back and revisit the log and learn a little bit more about what we have.
Okay, so all of those steps, plus a few more that I didn't intend to do, have been captured here in the log.
Look over here: every line is one of the steps that we performed.
There's also some extraneous stuff. At one point I cleared out the row selection, and I don't really need to make that part of my script.
Clearing the selected rows, so let's remove that. I'm just going to right click on it and clear that item to remove it. That's good.
Okay, so messages up here, JSL code down here. I'd like to call your attention to the origin and the result. This is pretty nifty.
Whenever we do a step, whenever we do an action by point and click, there's something we do that action on and there's something that results. So that's the origin and the result. So, for instance, when we deleted the facility column...
well, maybe that's a bad example...let's choose instead changing the column info for humidity. The origin, the thing that we did it on
was the Components A table, and we see that listed here as the data table. When I hover over it, it says Bring Components A to Front, so clicking on that, yeah, that brings us to the Components A table. Very nice.
And the result is something that we did to the humidity column. We changed the humidity. We changed it to data type numeric and modeling type continuous. See that down here.
So I can click here, go to the humidity column, and it highlights; JMP selects the humidity column for us.
You'll also notice these entries are green everywhere, except this one last result in blue. Well, that's to help us keep track of our activities on different data tables. We did all of these activities on the Components A data table,
and for our last activity, we performed a subset on the Components A data table and the result was a new data table called vac_off. And so vac_off is in blue. Right, so we can use those colors to help keep track of things.
Alright, the last feature I want to show you here in the log that's
helpful is this: if you have a really long series of steps and you need to find just one, this filter box lets you find it. Let's say I want to find the subset. There it is. We found the subset data table step, and I can get directly to the code that I need.
Okay, so this is everything that we need. Our data curation steps were captured. All that we need to do to make a reproducible data curation script is go to the red triangle, and we'll save the script to a new script window.
There's the import step, the delete column step, the step that moves the scrap rate column. All of those steps are here in the script.
We have the syntax coloring to help us read the script. We have all of these helpful comments that tell us exactly what we were doing in each of those steps.
Right, so this is everything. This is all that we need and I'm going to save this. I'll save it to my desktop as import and clean...let's call it import and curate components.
Right, that is our reproducible data curation script. So if I were to go back to the JMP home window and close off everything except our new script,
here's what I do if I need to replay those data curation steps. I just open the script file and run it by clicking the Run Script button. It opens the data, does all the cleaning, does the subsetting, and creates that new data table with 313 rows.
Let's imagine now that we need to replay this on new data.
I have another version of the Components file. It's called Components B. It has 50 more rows in it, so instead of 369 rows, it has 419 rows. Imagine that, you know, we've run the process for another 50 batches and we have more data.
So it's called Components B, and I want to run this script on Components B. But you'll notice that throughout the script it's calling Components A multiple times. So we'll just have to search and replace Components A and change it to Components B. Here we go. Edit.
Search.
We will find Components A, replace with Components B. Replace all. 15 occurrences have been replaced.
You can see it here and here.
And now we simply rerun the script, and there it is on the new version of the data. You can see it has more rows; in fact, vac_off was 313 rows before and it's 358 now.
Alright, so a reproducible data curation script that can run against new data.
Okay, so here is that cheat sheet once again. This will be in the materials that we save with the talk, so you can get to this and find it. And this tells you just how to point and click your way through data curation and leave yourself a nice replayable, reproducible data curation script.
That script that we made didn't require us to do any coding at all, but I'm going to give you just a handful of tips, four tips that you can use
to enhance the scripts just a little bit. The first tip is to insert this line at the beginning of your scripts. It's a good thing to do for all your scripting. Just insert the line Names Default To Here(1).
This is to prevent your script from interacting with other scripts. It's called a namespace collision and you don't really have to understand what it does, just do it. It's a good thing to do. It's good programming practice.
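(So the very top of the saved script ends up looking like this; everything after the first line is whatever the log captured, and the file path here is just a placeholder:)

    Names Default To Here( 1 );   // keep this script's variables out of the global namespace
    Open( "Components A.csv" );   // ...followed by the rest of the captured curation steps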
Second tip is to make sure that there are semicolons in between JSL expressions.
The enhanced log is doing this for you automatically. It places that required semicolon in between every step. However, if you do any modification yourself, you're going to want to make sure that those semicolons are placed properly. So just a word to the wise.
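(For example, two expressions on consecutive lines each need their terminating semicolon:)

    x = 9;       // semicolon ends the first expression
    y = x + 1;   // and the next expression follows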
The third tip is to add comments.
Comments are a way for you to leave notes in the program and leave them in a way so that you don't mess up the program, right. It's a way to
leave something in the program that won't be interpreted by the JSL interpreter. And there are notes that the action recording in the enhanced log has left for you, but you can modify them and add to them if you like. So here are the main points about comments.
The typical format is two slashes and everything that follows the slashes is a comment. So you can do that at the beginning of a line or also at the end of the line. So the interpreter will run this X = 9 JSL expression, but then it will ignore everything after the slashes.
You can also, if you have a longer comment, you can use this format, which is a /* at the beginning and a */ at the end. That encloses a comment. So comments are useful for leaving notes for yourself but they're also useful for debugging your JSL script.
If you want to remove a line of code and make it not run, you can just preface it with those two slashes and, if you want to do that for a larger chunk of code, you can use this format. So good to know about how to use comments.
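(Putting those comment styles together, a small hypothetical fragment might look like this:)

    // A comment on its own line
    x = 9;   // a comment at the end of a line; the interpreter still runs x = 9

    /* A longer comment can
       span multiple lines like this. */

    // Commenting out a line keeps it from running while you debug:
    // Data Table( "Components A" ) << Delete Columns( "Facility" );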
The last tip I'm going to leave you with is to generalize data table references. Do you remember how we had to search and replace to make that script run on a new file name, Components B?
We had to change 15 instances of the data table name in the script. Wouldn't it be nice if we only had to change it once, instead of 15 times? So you can make your scripts more robust
by generalizing the data table references. Instead of using the names, we'll use a JSL variable to hold those table names. Here's what I'm talking about; I'll show you an example.
On the left is some code that was generated by action recording in the enhanced log.
We're opening the Big Class data table, we're operating on the age column, changing it to a continuous modeling type, and then we are creating a new calculated column. Notice that the table name shows up throughout:
on open, we use it to perform the change on the age column, and we use it over here.
Very simple, what you need to do to make this more robust and generalized.
You need to make three changes.
This first change over here, we are assigning a name. I chose BC; you can choose whatever you want. You'll see DT a lot.
So BC is the name that we're going to use to refer to that Big Class data table in the rest of the script. And so when we want to change the age column, we refer to it through BC, meaning the age column in the BC data table.
Down here, we're sending a message to the Big Class data table, and that's what the double arrow syntax means. So we're generalizing that too, so that we use the new name to send that message
to the data table. And now if we need to run this script on a new data table that's named something other than Big Class, here's the only change we need to make. We need to change just one place in the script. We don't have to do search and replace.
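(A minimal before-and-after sketch of that idea, using Big Class from the JMP sample data; the calculated "ratio" column here is just an illustration, not the column from the talk:)

    // Before: the table is referenced by name in every step
    Open( "$SAMPLE_DATA/Big Class.jmp" );
    Column( Data Table( "Big Class" ), "age" ) << Set Modeling Type( "Continuous" );
    Data Table( "Big Class" ) << New Column( "ratio", Numeric, "Continuous",
        Formula( :height / :weight )   // hypothetical calculated column
    );

    // After: assign the opened table to a variable and change the name in one place only
    bc = Open( "$SAMPLE_DATA/Big Class.jmp" );
    Column( bc, "age" ) << Set Modeling Type( "Continuous" );
    bc << New Column( "ratio", Numeric, "Continuous",
        Formula( :height / :weight )
    );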
Okay, so after those four tips, if you're ready to take your curation script to the next level, here are some next steps. You could add a file picker. It doesn't take much coding
to change it so that when somebody runs this script, they can just navigate to the file they want it to run on instead of having to edit the script manually.
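(A sketch of that file-picker idea: Pick File() is the JSL function that opens a file-selection dialog, and the minimal form shown here just asks for a caption; any filtering to CSV files would be an extra argument:)

    Names Default To Here( 1 );
    // Let the user browse to the CSV instead of hard-coding the path
    path = Pick File( "Select the components CSV file" );
    dt = Open( path );
    // ...the rest of the curation steps, sent to dt...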
So that's one nice idea. If you want to distribute the script to other users in your organization, you can wrap it up in a JMP add-in, and that way users can run the script just by choosing it from a menu inside JMP.
Really handy. And lastly, if you need to run this curation script on a schedule in order to update a master JMP data table that you keep on the network somewhere, you can use the Task Scheduler on Windows or Automator on macOS to do that.
So in summary, Mia talked about how to do your data curation by exploring the data and iterating to identify problems.
If you automate those steps, you will gain the benefits of reproducibility and those are efficiency, accuracy, and documenting your work.
To do this in JMP 16, you just point and click as usual and your data curation steps are captured by the action recording that occurs in the enhanced log.
And lastly, you can export and modify that JSL code from the enhanced log in order to create your reproducible data curation script. That concludes our talk. Thanks very much for your time and attention.