Thanks for joining, everybody.
My name is Jason Wiggins.
I'm a senior systems engineer for JMP.
I come from a fairly long career in oil and gas
and manufacturing and R&D and quality.
What we're here to talk about today is messy data.
I really believe anyone who analyzes data has encountered a data set
somewhere along the way that needed a lot of work before it could be analyzed.
Cleaning or shaping data can be difficult,
but more than that, I find that it actually can be quite frustrating,
especially when we have to do it manually.
Some of my messiest data problems have come from Excel spreadsheets.
I believe there's a couple of reasons for that.
Excel is great for many things, but for analysis, it's just not that great.
Part of the reason is that it doesn't impose the format for the data.
To my mind, data formats are as varied as the imaginations of the people using them.
Excel files also tend to be hand curated.
Misspellings and inconsistent naming conventions
are really quite common.
The example I'm presenting today
comes from my career in oil and gas.
I believe that the problem and solution that I'm going to show
can be found in many of your data sets.
My goal is for everybody to see a few possibilities
for simplifying that front end of our analytical workflow.
Let's take the exclamation point out of "wow, that's a mess!"
and just say, yeah, no problem, I understand how to deal with this.
Now, I'm also using an example
where there are data available that you can download off the Web.
I'm going to be uploading my presentation materials
in case anyone would like to practice
some of the concepts that I'm going to work through today.
All right, so let's get to this.
Our problem.
Baker Hughes has been publishing rotary rig count data.
These are the rigs that drill for either oil or gas all around the world.
They've been posting active rotary rig counts for generations.
A rig count is a very important business barometer for the oil and gas industry.
If you're in the industry, you don't need the explanation.
You're consuming these data on a weekly, if not daily, basis, and it's used broadly
all the way from downstream to upstream and exploration.
As I laid the groundwork, we are going to be dealing with Excel data
and some of the problems that we have with that.
One, many of the worksheets that Baker Hughes makes available
are not in the right format for analysis in JMP.
I also found many errors.
This data set certainly isn't the most error-prone data set
that I've worked with coming from Excel,
but there are a few errors in there that are doozies
that we'll focus on today.
And really, they're going to be around
spelling and inconsistent terms and abbreviations.
Again, in terms of the overall analytical workflow:
in order for us to even get to the analysis and knowledge generation,
knowledge sharing,
we have to get our data set into a format where we can begin to do that work.
We're going to be focusing on getting data into JMP, blending and cleaning,
and then at the end, we'll do a little bit of data exploration and visualization.
Ultimately, what we're shooting for is to have a data set
where we might look at rig count trends by state, for instance.
These trends
might be very telling about the economics in the industry over time.
We may also want to look at more of a time series-based analysis, like a bubble plot,
where we can see the change in rigs over time for the different states.
Again, in order to get to that point
let's see, JMP is pausing on me for a second.
There we go. Okay, in order to get to that point,
we really need to get the data into something that we can work with.
This is the final data set.
This is what we're going to be pushing toward.
I have a couple of ways of accounting for date in the data set.
This is what comes from Baker Hughes worksheet.
But I'm going to create a couple of other variables
that we may want to take a closer look at at times.
We're going to create some new variables:
a month-year variable and a year variable.
We're going to fix some spelling errors in the state,
and ultimately, we're going to join this with another data set
that has the lat-long coordinates for the capital city of every state.
I don't know what the best way is to show
that time series growth and contraction,
but I have to choose a point for each state, and capital cities are available.
We have to do a few things in order to make those data sets connect.
That's where we're going.
All right, that's essentially what I outlined verbally.
But I'll pause for a second and just let everybody take a look
at our analysis workflow, which is that component,
that data shaping, data blending, data visualization.
All right, let's talk about importing data.
For those of you who want to try this,
I want to point out that the data
that Baker Hughes publishes is in a binary Excel format.
At this point, JMP does not have a way to directly import these data.
If this were XLSX, which it used to be --
I'm not sure when the binary file format was adopted --
it used to be that you could do a File > Internet Open,
ping the URL, and automatically download the data.
But we can't do that, so we have to do an intermediate step.
It's pretty simple.
If we go to the website let me pull that back up again real quick.
If we click on any of these links, it'll download the data.
We open that data up in Excel and then save it as an XLSX.
There are ways to do this automatically,
but they're going to happen outside of JMP.
For those that really want to explore this and make it automatic,
there's a bit of a challenge up front.
I will point out some ways to automate in JMP after we get the data in.
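If you want a rough idea of what that outside-of-JMP conversion step could look like, here is a sketch only -- it assumes Python with pandas plus the pyxlsb and openpyxl engines installed, and the file names are placeholders rather than the real Baker Hughes ones:

```python
import pandas as pd

# Sketch of converting the binary .xlsb download into .xlsx outside of JMP.
# File names below are placeholders for whichever workbook you downloaded.
sheets = pd.read_excel("rotary_rig_count.xlsb", sheet_name=None,
                       header=None, engine="pyxlsb")

# Write every worksheet back out as .xlsx so JMP can import it
with pd.ExcelWriter("rotary_rig_count.xlsx", engine="openpyxl") as writer:
    for name, df in sheets.items():
        df.to_excel(writer, sheet_name=name, header=False, index=False)
```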
All right, so this is what we're looking at,
that's in fact the Excel sheet that we're going to load.
All right, so first things first, let's get our data in.
Column headers are on row five, and I have a couple of rows of column headers.
This is common in Excel.
People use merge cells to combine
text from more than one cell into a single label.
We want to make sure that we capture that.
We're going to say data starts on row five
and we have two rows of column headers.
I'm sorry, column headers start on row five and we have two rows of them.
Then JMP automatically knows how to adjust for where the data starts.
This is good.
I always like to take a look at the preview and just make sure that
I'm getting what I'm asking for, and this looks right.
We're importing the correct worksheet.
Let's just import that.
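As a side note for anyone prototyping the same import logic in code rather than the wizard, here is a hedged Python sketch of the "headers start on row five, two rows of headers" idea -- the file name, sheet index, and header-joining rule are my assumptions, not the wizard's exact behavior:

```python
import pandas as pd

# Column headers occupy spreadsheet rows 5 and 6 (0-based indices 4 and 5),
# mirroring the wizard settings; the file and sheet here are placeholders.
wide = pd.read_excel("rotary_rig_count.xlsx", sheet_name=0, header=[4, 5])

# The merged-cell headers arrive as a two-level index; join the levels into
# single labels like "Texas-Land", dropping pandas' "Unnamed" filler.
wide.columns = ["-".join(str(level) for level in col if "Unnamed" not in str(level))
                for col in wide.columns]
print(wide.head())
```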
All right.
This is in a wide data format.
That's not typically the format that JMP likes for doing an analysis.
Almost always we want to be in a tall data format.
What I'd like to have is a column that has these column headers as rows.
Then another column that has the rig count for each one of the column headers.
The operation that we need to do for that is a stack operation.
Let's just talk through this.
I'm actually doing the presentation in JMP 17,
and the reason I'm doing this is that there's a cool new feature in 17
that I find to be so handy for this type of work.
All right, I forgot to do one thing.
Let me back up real quick, just to keep the dialogue simple.
What I'd like to do is get rid of these summary statistic columns.
These will certainly make sense in a wide data context.
They aren't going to make sense if we stack them,
so we're just going to delete those.
We could deal with that in a lot of different ways, but let's keep it simple.
We'll just delete it out of the data table and then go back to our stack,
turn on a preview, and let's just select all of those columns.
This is great. One thing I love about the preview is,
first off, I get to see that,
yes, this is in the shape that I need for my analysis,
but I can also make some changes
and see how the data table is going to look
before I actually commit to making a data table.
If you remember, we wanted the count to be rig count.
That data column, we want to be Rig Count.
The source column
we're going to name State and Type,
and then we're going to separate those in the end.
Another example of creating new variables, we're actually going to split those apart
so we can use them as two separate variables.
But for right now, I think that's pretty descriptive.
If I hit Enter or click anywhere else, I get to see the update,
and yes, indeed, this is something that I want.
That's data shaping. Step one.
We may have to do several steps of data shaping.
In this case, it's a simple example;
stack is appropriate, and that's all we have to do.
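For anyone who thinks in code, the Stack step is the same reshape that other tools call "melt" or "unpivot." Here is a tiny, self-contained Python sketch of the idea -- the column names and counts are made up for illustration:

```python
import pandas as pd

# Tiny illustrative wide table: one identifier column plus one column per state/type
wide = pd.DataFrame({
    "Date": ["2022-01-07", "2022-01-14"],
    "Texas-Land": [260, 265],          # made-up counts
    "New Mexico-Land": [85, 88],       # made-up counts
})

# Wide-to-tall: the old column headers become rows in "State and Type",
# and the cell values become the "Rig Count" column
tall = wide.melt(id_vars=["Date"], var_name="State and Type", value_name="Rig Count")
print(tall)
```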
All right, before I show this slide, let's go over to the data set.
One of the first things that I do when I'm manipulating data tables
and working with new data sets
is to graph the data, plot the data,
or show the data in some way graphically
to help me see whether there are issues that I need to resolve.
Distribution is a great way of doing that.
I'll just pause, let everybody look.
It's probably small print. Let me make that much bigger.
I think even in the first couple of bars, few bars, half a dozen bars,
hopefully everybody is recognizing some potential issues.
One, what's wash?
I'm not sure why Washington was abbreviated.
There's probably some reason historically, these data sets are quite old.
We have abbreviations for west.
We have abbreviations for north and south.
Turns out that is the exact same abbreviation as for new.
We have several issues.
Let me scroll down. There's another doozy here.
Fortunately there's only one of them.
See if I can find it. We're looking for Tennessee.
There we go, Tennessee.
Everybody take a look at Tennessee there and see if you notice the problem.
We're missing an S, right?
That's something that I would do.
I'm a horrible speller when I'm typing, especially when I'm typing fast.
We found one spelling error there.
Now the trick is, how do I parse that out and fix all of these errors?
More importantly, how do I do that
in a way that doesn't involve 1000 steps
for renaming these different abbreviations and misspellings?
That's where regular expressions come in.
In Recode, there are a variety of different tools to deal with data cleaning problems.
I always like to choose the simplest one, but often the problems aren't that simple.
What are regular expressions?
Well, they're sequences of characters that specify a search pattern.
As a simple example, I've got a cats and dogs data set.
Each one of these characters represents a pattern, a search pattern for the text
and then a command for what it is that we're going to return.
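To make that concrete, here is a minimal, made-up cats-and-dogs example written out in Python syntax -- the text and pattern are purely illustrative, not the slide's actual example, but the pieces (the logical OR, the capture group, the backreference) are the same ones we'll use in a minute:

```python
import re

# (cats|dogs) uses | as a logical OR and parentheses as a capture group,
# so the matched word can be reused in the replacement as \1
pattern = r"I like (cats|dogs)"
print(re.sub(pattern, r"I love \1", "I like cats and I like dogs"))
# -> I love cats and I love dogs
```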
Why do we use them?
Well, they're very compact and flexible.
I can solve really complex character data problems
with a small handful of regular expressions,
and I just can't imagine how else I might do that.
They definitely take messy problems and make them simpler,
but you've got to learn a new concept.
If regular expressions are brand new to you, I have a resource that I like.
It's free, it's online; it's Debuggex.
One of my favorite parts of this is the cheat sheet or the quick reference guide.
I want to understand what those search characters mean.
I can look at this quick reference
and I can start to begin to piece together regular expressions.
I don't often use the test part of the website,
but if you had some example text
and wanted to test your regular expression against it, you could do it here.
I prefer to do that in JMP.
It just saves me time.
JMP actually has something that you'll see that we can use in a similar way.
All right, so that's what they are.
Good place to go learn about them and let's take a look at how we use them.
All right, I'm only going to fix a couple of these,
and we'll fix a couple of them together.
I'll speak through what the regular expression means as I type it.
Before we even get there.
I'd like to recommend that when we're doing this type of work,
when we recode, we dump the results into a formula column.
The reason being, if we decide to add on to this data table,
those fixes will persist in the future.
If it's the only time we're really going to be using it,
maybe we don't need the formula column, but I prefer it.
In fact, I really like to see that right up top for my personal use.
How do we get to the regular expressions?
We have a replace string utility.
Again, there are many other utilities.
Always choose the simplest one.
If we're just trying to pluck first and last word, for instance,
I don't want to write a regular expression for that,
but in this case, I got some mess I need to clean up.
We're going to use Replace String to do that.
A couple of necessary ingredients:
you have to make sure that the Use Regular Expressions check box is turned on.
Remember that preview that Debuggex shows?
Well, here's our preview.
We only have to type once and adjust once, hopefully.
Then we get to see the results.
Let's try that out on a couple of these examples.
Again, I'm going to speak what's happening as I type.
Let's work with the New ones first.
Remember, N. can be either New or North.
If I look at that data set,
it seems logical to fix the New ones first and then work on North.
I'm going to look for the character N
in the string, in the row of the column that I'm analyzing.
Then I'm going to look for a period.
The reason I'm putting this in brackets is
that a period is also a search character.
It means any character.
To be honest, you could leave the brackets out of this
and it's still going to work, but it's a little bit confusing.
If we're using a specific character
that also exists as a search character, sometimes it's nice to bracket it out.
It makes it a little more interpretable.
All right, after N we have a space.
This backslash-s is the whitespace character,
and I'm actually going to type out Mexico.
Now, I could use a search pattern,
a search character here like \w* or something like that.
I'll explain that a little bit more as I go.
But sometimes when you're writing regular expressions,
it's handy to have something that's a little bit more readable.
I'm choosing to type out the words here
and there aren't any problems with those words.
I'm just going to reference them directly.
Now with a regular expression I can use logic.
I can deal with all the new issues but one in a single line.
I'll tell you why we're going to deal with the New Hampshire one
a little bit differently.
Let's do the New Mexico.
New York.
That pipe is our logical OR, and then we'll do Jersey.
The reason I'm putting parentheses around this
is that it allows me to return the result of everything inside the parentheses,
and each set of parentheses, left to right, is referenced by 1, 2, 3, in numerical order.
I think that is probably good except, oh, we do need to have that last part,
that -Land, or, if there were offshore rigs --
I don't think there are in New York -- we'd want to capture that too.
There's my any character, and I'm looking for any character that happens
one or more times, and I put parentheses around that.
Here's where the magic is.
All right, so now I can type New.
Now I have a white space in there,
so I don't need to actually put a white space in the replacement text.
But this one and two reference what's inside the parentheses.
The 1 is going to be the result of that search
inside the first parentheses, and then the 2 is going
to capture all the characters that come behind it.
Let's scroll down and see how we did.
We get a star for anything that was recoded and it looks good.
New Jersey, New Mexico, New York.
I think we've done our job here.
Again, another just classic reason why you want to use regular expressions is
that I got in a single statement, I was able to fix three problems.
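If you want to sanity-check that pattern outside of JMP, the same expression drops into any regex engine. Here is a small Python sketch -- my reconstruction of the pattern, applied to made-up input strings, so treat it as an approximation of what went into the Recode dialog rather than the exact contents:

```python
import re

# Reconstruction of the pattern described above:
#   N[.]                  a literal N and a literal period (bracketed so . isn't "any character")
#   \s                    a whitespace character
#   (Mexico|York|Jersey)  one of the three names, captured as group 1
#   (.+)                  everything that follows (e.g. "-Land"), captured as group 2
pattern = r"N[.]\s(Mexico|York|Jersey)(.+)"

for value in ["N. Mexico-Land", "N. York-Land", "N. Jersey-Land"]:  # made-up rows
    print(re.sub(pattern, r"New \1\2", value))
# -> New Mexico-Land, New York-Land, New Jersey-Land
```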
Let's do one more.
Let's just do a couple more with our N., and then we'll move forward with this.
All right, so we'll go back to replace string.
You can have as many regular expression replace string commands as you want
within recode.
Sometimes again, that's nice.
You could get really clever
and fix a lot of issues with a single regular expression.
Sometimes it's a little more readable if
we tackle them a few at a time or, in the unique cases, one at a time.
We'll use regular expressions; let's deal with the New Hampshire problem.
Same thing we're going to do N character followed by a period.
Then we're going to do --
actually, let's not do parentheses, because we don't want the "Hamp." in the result;
we just want to look for it:
Hamp, ending with a period,
and then we want to capture all the text that is behind that.
Now we do New Hampshire with the -Land behind it,
scroll down, and it fixed it.
That's great.
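The same kind of sanity check works for the New Hampshire pattern; the input string here is made up and assumes the raw value looks like "N. Hamp.-Land":

```python
import re

# "N." then whitespace, then a literal "Hamp." (period bracketed again, not captured),
# then capture everything behind it so it can be kept in the replacement
print(re.sub(r"N[.]\sHamp[.](.+)", r"New Hampshire\1", "N. Hamp.-Land"))
# -> New Hampshire-Land
```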
I could do something similar.
We're running short on time.
I want to make sure we at least get to some of the graphing part of this.
But we could do a similar set of steps to deal with the north and the south
and the west and the many others.
Here's what that looks like in the final data table.
If I look at the recoded formula column -- once again, it's very nice to have this because it is portable.
Those are the regular expressions
that I use to fix all the data problems in that column.
Again, the benefit of doing this is that
if you're working with a really huge data set,
could you imagine going through and hand typing those
every time you want to do an analysis on new data, for instance?
It's pretty arduous.
We've saved ourselves a lot of time with just a little bit of creativity.
That's regular expressions.
Again, my intent wasn't to kind of teach regular expressions,
but really show that as an opportunity for folks to investigate
that can help deal with many of your messy data problems.
Let's play around with these new variables.
I think let's just go back to...
Yes, I'm actually going to show it in the same table
that I have the results and I'll just show you how I got there.
When we finally got all that data cleaned up, we had a date column.
If we look back here, hey, there's our date column.
I'll use this table.
A little wishy washy here today,
but just so you can see what this looks like if we're starting from scratch,
let's just use this one. I'll reference back to the complete table.
This New Formula Column has a Date Time group in it, which is really handy.
Now, you noticed that we are in a day- month- year format,
and that's just when the record was made.
Now, in an analysis context, really, we may want to just look at month and year
so we're combining some of those,
making the time component a little more coarse.
That can be helpful in seeing the trends that we want to see.
We're just going to choose Month Year;
JMP automatically dumps that formula into a new column.
If we look at that, that's what we did.
We can do the same thing with year: New Formula Column,
Date Time, Year.
Now we have two other variables that we can use in an analysis context.
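Conceptually, those two formula columns are just coarser views of the same date. If it helps to see the idea in code, here is an illustrative Python sketch with made-up dates -- not the JMP formulas themselves:

```python
import pandas as pd

# Made-up weekly record dates standing in for the Baker Hughes Date column
tall = pd.DataFrame({"Date": pd.to_datetime(["2021-11-05", "2021-12-03", "2022-01-07"])})

# Coarser time variables, analogous to the Month Year and Year formula columns
tall["Month Year"] = tall["Date"].dt.to_period("M")  # e.g. 2021-11
tall["Year"] = tall["Date"].dt.year                  # e.g. 2021
print(tall)
```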
Remember the fact that we have land and offshore tags on every state?
We probably want to split those out, and to do that, just create a formula column.
Let's see,
I actually used regular expression on it,
but you could play around with other ways of dealing with that.
I have a column that has a regular expression
that's plucking off the Alabama.
Now, for the states here, we could maybe use that first-word option,
but if I remember right, the dash doesn't work with it.
Same thing with type.
Look at that formula.
I just have a regular expression that's looking for the last piece of that string
and then returning only the last piece.
I've created four different variables that I want to use for analysis.
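The split itself is just "everything before the last dash" and "the piece after the last dash." Here is a hedged sketch of that same idea in Python, with made-up column values -- the regular expressions I actually used in the formula columns may differ in detail:

```python
import re

# Made-up values from the combined "State and Type" column
values = ["Alabama-Land", "New Mexico-Land", "Louisiana-Offshore"]

for v in values:
    state = re.sub(r"-(\w+)$", "", v)              # drop the trailing "-Land"/"-Offshore"
    rig_type = re.search(r"-(\w+)$", v).group(1)   # keep only the last piece
    print(state, "|", rig_type)
# Alabama | Land,  New Mexico | Land,  Louisiana | Offshore
```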
Let's do one more...
Part of the reason that we cared so much about the state names:
one is that if we have multiple representations of the state,
well, that's going to complicate our analysis.
But the other rationale for doing that is that
if we need to join it up with something else, like lat-long coordinates
for each state, we need to have the state names consistent.
With the data set that I have,
you could use either all caps or title case to do the join.
Joining -- that's cool, we can do that with the data set.
In fact, I think I have an intermediate
table that we can open and look at here before we do the join.
Let's do a little housekeeping here.
I'll leave that one up in case we need to go back to it.
Okay, so we have our state, it's been fixed, recoded.
We've created that state variable.
Now I want to join it with my US state capitals.
Reopen that file.
I'm going to use State and match it with State in the state capitals table.
Joins in JMP --
I always like to think of it as invoking the join
from the table I want on the left.
Then I'm going to choose the one that I want on the right.
I'm going to match again. We're going to match on state and state.
An inner join is going to be appropriate here.
I mean, if we had any missing state names,
I'm not sure how informative those might be for our analysis.
I really don't want them. In fact, they don't exist.
An inner join is appropriate.
It's just going to take anything that matches;
any non-matches are left out.
Hey, I've got this great preview in 17.
We can look at that a little bit.
Let me get down to the scroll bar and right at the very end I should see
the capital and the latitude and longitude.
Now I ended up with this extra State column, too.
If I don't want that, I can select the columns for the joined table.
If ultimately this is going to a script, maybe you want to do that
just so you don't have to go in and tweak columns after the fact.
But that's it for a join.
We just glued those two data sets together by matching state names
and we have latitude and longitude.
Maybe capital is not necessarily important.
We can just get rid of those columns.
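Under the hood, this is a plain inner join on the state name. Here is a conceptual Python sketch with tiny made-up tables (the rig counts are invented; the capital coordinates are approximate), just to show why the spelling cleanup had to come first:

```python
import pandas as pd

# Tiny stand-ins for the cleaned rig-count table and the state-capitals table
rigs = pd.DataFrame({"State": ["Texas", "New Mexico"], "Rig Count": [260, 85]})
capitals = pd.DataFrame({"State": ["Texas", "New Mexico", "Vermont"],
                         "Latitude": [30.27, 35.69, 44.26],
                         "Longitude": [-97.74, -105.94, -72.58]})

# Inner join: only state names that appear in both tables are kept,
# so a misspelled "Tennesse" would silently drop out if we hadn't fixed it
joined = rigs.merge(capitals, on="State", how="inner")
print(joined)
```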
Back to the automation component.
We did several steps along the way,
but for each step that we do in the tables menu,
there's a source script associated with it.
If we have each of the tables,
the intermediate tables, that get us to the final result,
we could steal that script, glue it together, and automate that workflow.
All right, that is it for joins.
Let's play a little bit in bubble plot and Graph Builder.
Let's do Graph Builder first
and then with whatever time we have left, we'll get to a Bubble Plot.
Again, we're going to enjoy the fruits of our labors here.
We exercised a lot of creativity.
We got a data set that is analyzable and graphical.
let's see what we can learn from it.
All right, so we have state -- there are built-in shape files in JMP,
so it'll recognize state names and build
state boundaries in a map in the graph frame.
We're going to drop that state into the map shape role.
Now what I may be interested in is the rig count by state.
I'm going to color by this.
To me, blue and red
aren't all that informative for what I'm trying to get across.
I really prefer spectral colors, but no problem.
We can change the gradient with a simple right mouse click.
I like white to red.
It seems to work for me and that's great.
We have the rig count by state.
But again, this is really coarse, right?
We have a lot of years of data -- 22 years of data here.
We may want to drill into these a little bit deeper.
Let's build another graph, and we'll look at date and rig count.
We're going to turn off that smoother.
There we go.
That just averaged everything.
What I'm doing this for is I want to steal the script behind this
and I'm going to use it to get those cool hover graphs.
Save script to the clipboard,
go back to this map right mouse click, hover label.
I'm going to paste the graphlet. Let me see how I did.
JMP knows that...
Let's just click down here because I think we're getting close.
JMP knows that I'm only looking at Texas,
so it's only going to give me the rig count trend over time for Texas.
I believe that is fairly handy.
If we look at Louisiana, for instance.
Well, let's look at North Dakota.
North Dakota is interesting because
of the shale unconventional reservoir boom that happened up there.
They had a pretty big spike.
They had a pretty big drop off,
and then they're building again in terms of rigs.
We are drilling into this.
But hey, this actually has land and offshore in here.
Let's split those out; I'll turn our control panel back on.
We'll drop type down on page.
We'll get a different graph for each of those.
Maybe we'll make this a little bit smaller,
a little bit smaller still.
Maybe I want to show missing data.
Now I have boundaries for all the states
but only the ones that have offshore rigs are colored.
Now if I hover over Louisiana, interestingly, you can notice
a decline in rigs off the coast of Louisiana in the Gulf.
That coincides a lot with what we see in these unconventional wells
that are really increasing in number quite rapidly in states in the US.
We're able to drill into a few trends.
We were able to do that because we massage the data.
We got it into a format that we can actually use.
I think we're up on time.
I will post these materials out onto the Web,
and you'll have example scripts in each one of the data tables that you can run
to get the final visualization that I was working on.
Then you can try and recreate it yourself.
Also I'll have two example data files,
but I would recommend if you're going to try this,
to actually download the data, and then you get the full experience.
With that, I really hope that I've highlighted a few different things
that could potentially save you a lot of frustration
in your data cleaning efforts.
With that, thank you very much.