Hi, welcome. You've found our talk on Regex. It's a powerful text analytics tool that Hadley and I are going to explore the basics of today in our talk.
Yes, thank you very much for clicking on this link and watching this presentation. What is Regex? Well, Regex is a function that searches for a pattern within a source string and returns a string. That definition was taken from the Regex function entry of the JMP scripting guide. I'm not sure that definition quite does it justice.
Before we go into some details about how you can use it, what the power and value of it is, what I'd like to show you here is just the format of the function. It takes in a source, a pattern, and then if you like a replacement string, it has other functionality as well. But for the purpose of this presentation, we are going to be talking about these first three inputs to the function.
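To make the source, pattern, and replacement idea concrete, here is a minimal sketch in Python rather than JMP's scripting language. The helper `regex` is hypothetical and only imitates the three inputs described above using Python's `re` module; it is an analogue under that assumption, not JMP's implementation.

```python
import re

def regex(source, pattern, fmt=None):
    # Hypothetical helper: find the first place `pattern` matches in `source`.
    m = re.search(pattern, source)
    if m is None:
        return None  # nothing matched
    # With no replacement string, return the matched text itself;
    # otherwise fill \1, \2, ... in `fmt` with the captured groups.
    return m.group(0) if fmt is None else m.expand(fmt)

whole = regex("order 66 shipped", r"\d+")
swapped = regex("order 66 shipped", r"(\w+) (\d+)", r"\2 \1")
missing = regex("no digits here", r"\d+")
```

Here `whole` holds the matched digits, `swapped` holds the two captured pieces in reverse order, and `missing` is `None` because the pattern never matches.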
Before we dive too deeply into it and show you some examples, I just like to talk a little bit about how to set up a pattern in Regex and specifically about the concept of escape characters. These are characters that can mean many things.
For example, a W can just mean a W. But \w can also mean any lowercase or uppercase letter A through Z, as well as numbers zero through nine and the underscore.
How you would refer to that is simply by typing \w. If you wanted to refer to a literal W, you would just write the letter W. Digits can be expressed in their actual form or they can be expressed generally as \d, and then \s refers to a single white space character, including tab, return, new line, vertical tab, and something called form feed.
Probably some of you watching know what that means. I'd like to mention some special characters now, so you can see the period, the question mark, the asterisk, and the plus refer to matches of different characters. So the period refers to any single character.
Question mark matches zero or one instance of whatever is put in front of it. The asterisk matches zero or more and then the plus matches one or more. Now there's some other characters as well. I won't go through all of these and there are many more that I haven't captured, but I thought to put them in this table and save them here so that if you like, you can pause this and you can see exactly what these are.
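The escape characters and special characters just described can be tried out in a few lines. This is a small sketch using Python's `re` module as a stand-in for JMP; the sample strings are made up for illustration. One caveat: in Python 3, \w also matches letters outside A through Z.

```python
import re

# \w is a letter, digit, or underscore; \d is a digit; \s is whitespace.
word_chars = re.findall(r"\w", "a_1 !")
digits = re.findall(r"\d", "room 42")
spaces = re.findall(r"\s", "a b\tc")

# A quantifier applies to whatever sits immediately in front of it.
optional_u = [re.fullmatch(r"colou?r", s) is not None for s in ("color", "colour")]
star = re.fullmatch(r"ab*", "a") is not None   # zero or more b: "a" alone is fine
plus = re.fullmatch(r"ab+", "a") is not None   # one or more b: "a" alone fails

# The period stands for any single character.
any_char = re.fullmatch(r"c.t", "cat") is not None
```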
Let's look at an example. Let's say that you wanted to extract all email addresses from blocks of text, free text with many email addresses in all different formats. How would you do that? Well, let's look at our source, which would be, for example, "For help contact support@jmp.com. It's free with your license of JMP."
If we wanted to look through this and extract the email address, we'd have to refer to it as a pattern. So that pattern is one or more instances of any character, including numbers. Perhaps we can refer to these as \w, followed by an at sign, followed by one or more instances of \w, of any character or number or letter, followed by a literal period indicated by \. and then the letters C-O-M.
If we were to set that up in a Regex function, the return result would be the email address support@jmp.com. That would be the pattern that matches. Now some of you watching this, I know what you're thinking. Not all email addresses follow this format. Some of them have other characters in them, some of them have multiple periods, some of them perhaps don't end in com, they end in something else.
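The email pattern just described can be written out as a short sketch. Python's `re.search` is used here as an assumed analogue of the JMP Regex function; the source sentence is the one from the example.

```python
import re

source = "For help contact support@jmp.com. It's free with your license of JMP."

# One or more word characters, an at sign, one or more word characters,
# a literal period (escaped as \.), then the letters com.
pattern = r"\w+@\w+\.com"

match = re.search(pattern, source)
email = match.group(0) if match else None
```

The period has to be escaped as `\.` because an unescaped period would match any single character.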
That's all very true and this isn't going to match with those. What you could do is then take this pattern and perhaps generalize it in different ways to get more email addresses. The more general the pattern, the more of what you're looking for it will match. We'll talk and we'll show you an example of how you can do that and what that process looks like.
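One way that generalizing might go is sketched below in Python. The broader pattern is an illustration made up for this sketch, not something shown in the talk, and it is still far from a complete email validator. The second address in the sample text is invented.

```python
import re

text = "Write to first.last@stats.example.org or support@jmp.com today."

# The simple pattern from the example only finds the .com address
# with a plain name in front of the at sign.
simple = re.findall(r"\w+@\w+\.com", text)

# A looser pattern: allow periods, plus signs, and hyphens in the name,
# and one or more period-separated parts after the at sign.
general = re.findall(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", text)
```

The trade-off is the usual one: each loosening picks up more real addresses but also raises the chance of matching text that is not an address.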
The first example we're going to look at is automated machine messaging indicating error messages in different parts of the system. What we want to do is extract the components that are broken from all of these messages. I'll show you how to do that. We're going to take phone numbers that have been entered manually in all different crazy formats and put them in a uniform format, and we're going to extract info from coded text.
In this case, these are file names that contain information about how different biological samples were run: the temperatures, the stress tests, the times, and so on. All of this is coded in the name of the file. We're going to pull out all those pieces and then organize them in a table so that we can work with them.
Now probably you've all clued into the fact that Peter and I are not Regex experts. I think that the word novice is probably a better description of our competency in Regex. The purpose of this talk really isn't to show off our Regex prowess and how great we are at using Regex so that everybody should be impressed.
No, the purpose of this talk is to demonstrate how powerful Regex can be, even for novices. Even with a very little bit of knowledge about how Regex works and how patterns work, you can get a lot of use and a lot of functionality. Now Regex can be intimidating, but it needn't be, because at its core it really is very simple.
We're going to take you through some examples and show you exactly how simple it is and how you can start using it right away. Without further ado, I will turn things over to Pete.
All right. Thanks, Hadley. Go ahead and get started here with the first example. Like Hadley said, this is an example where we're trying to extract out of a description here what part was actually broken. There's probably many different ways you could get at this, but we're going to show you how to do this with Regex.
I'm going to create a new column, generate a formula here, and I'm going to look for Regex in the filter, find it there, and then start with my description. That's what I want to run the Regex on. Then I'm going to define a pattern.
If you remember what Hadley shared there, there are a couple of little tricks to remember with Regex that will make it a lot easier. The first thing I'm going to do is put in a \w, which is a character, but I want this to be more than one character. I'm going to do a \w and a plus. Then after that \w and plus, I'm looking for something that has a space and says the word broken. As long as I type that out right, you'll see here that my formula result is there.
If I hit apply here, you can see that it tells me what is broken, but it also contains that word broken. Maybe I don't want that. Maybe I just want what the part is, not the word broken in there. Then if I want to do that, how I can do that is go in here and containerize this to make this a first word of the list here.
Then I'm going to just say, hey, I only want that first word. If we look at the preview here, it's just giving me that. Now if I hit apply and okay, I've extracted out what I was looking for. Now, this is a simple example and you could probably think of other ways to be able to get that specific part of this description out, but I wanted to show you how you could do that with Regex and really just a very simple start to this.
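The two steps of this example, matching a word followed by "broken" and then keeping only the containerized word, can be sketched as follows. This uses Python's `re` module as an assumed analogue of the JMP formula, and the descriptions are invented sample data.

```python
import re

# Hypothetical machine messages; the real column is not shown in the transcript.
descriptions = [
    "Error 17: pump broken at station 4",
    "valve broken",
    "all systems normal",
]

def broken_part(text):
    # The parentheses "containerize" the word so it can be returned on its own.
    m = re.search(r"(\w+) broken", text)
    return m.group(1) if m else None

parts = [broken_part(d) for d in descriptions]
```

Rows where nothing is broken come back as `None`, which plays the role of an empty cell in the new column.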
Let's look at a little bit more complex example. Here we have phone numbers that are entered randomly and they have different spacing, different delimiters in there. Sometimes there's a one, sometimes there's not. Sometimes there is extensions, sometimes there's not. We want to format that in a different way and end up with a more clean format. Here's the end result.
Unlike the last example, I think this one is a lot more difficult to do without Regex. Let's walk through how we can do this with Regex. Very similar. We're going to start, I'm going to type in Regex here and I'm going to move this down so you guys can see as we're building this Regex, the results pop up there.
I'm going to put the phone numbers in as my source, my original data, and then I'm going to start with that pattern. If we remember again from what Hadley said, we're looking for digits this time. Our pattern is digit, digit, digit, then something. We don't know what, but we'll put in that question mark because it could be many different things and then we have digit, digit, digit.
Let me pop this open a little so we can see it. Then again, we have a question mark because we don't know what that delimiter is in there. Then we have four digits. Okay, all right. If we look at a preview, you can see it catches some of these. I'm going to just hit apply and now you can see some of these numbers were captured here, but some were not. Then our output formula isn't what we're after.
Let's go back and open this up and we're going to containerize those like we did in the previous example. We're going to look at three individual words here, or three individual sets of digits, I should say. We've containerized them, we'll hit okay. Then we want an output that looks a certain way.
We want to have the first word followed by a dash, then the second word or set of digits followed by a dash, and then the third. Okay. When we hit apply here, you can see this has cleaned it up a little and at least the output format is what we're looking for, but we're missing a few. Like, let's look at this one specifically.
This one has a space here. How do we tell Regex that there might be a space, but there might not? We'll go back here and we're going to edit this a little bit. We're going to put in a potential space. I'm going to put a space with a question mark there because it might be there, might not and I'm going to hit okay and apply. There you can see it captured those two with the space.
But you can also see some of these have a one at the start, like line five here. How do we tell Regex that there might be a one there? So just like we did with the space, we're going to go in, we're going to say, "Hey, there could be a one here." If we do that and hit okay and apply, you can see that it cleaned those up.
Now we're pretty happy. We've got everything in the format that we want it. But you can see there are other examples of different styles of phone numbers here. If people have put in letters instead of numbers, it's not capturing all of that. There's more we could do with this to clean these up further, but we've taken a lot of messy phone numbers here and cleaned them up into a nicer format.
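The phone number clean-up can be sketched along the same lines. The exact pattern typed in the demo is not visible in the transcript, so the one below is a reconstruction from the description: an optional leading one, three containerized digits, an optional single character and an optional space, three more digits, another optional character, then four digits. Python's `re` stands in for JMP, and the phone numbers are invented.

```python
import re

# Reconstructed pattern (an assumption, not copied from the demo).
PHONE = r"1?(\d\d\d).? ?(\d\d\d).?(\d\d\d\d)"

def clean_phone(raw):
    m = re.search(PHONE, raw)
    # Rebuild the number as first group, dash, second group, dash, third group.
    return m.expand(r"\1-\2-\3") if m else None

samples = [
    "555-123-4567",
    "(555) 123 4567",
    "555.123.4567",
    "15551234567",
    "1-555-123-4567 x89",
    "call 555-CATS",
]
cleaned = [clean_phone(s) for s in samples]
```

As in the talk, numbers written with letters are not captured, and this sketch simply drops any extension.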
This is a good way to use Regex. Now I'm going to pass it back to Hadley for the last example.
All right, thank you very much, Pete. Very well done as well. What I'm going to do is I'm going to show you this example here, which is an example of descriptions taken from file names. The first seven digits, I think the first seven things are the name of the sample and then how it was run. Temperatures sometimes included, but not all of them. Days sometimes or weeks. Time sometimes included, but not always.
Let's extract all of this information and what we ultimately want it to look like is that. We are going to use Regex to extract the sample project code from the front, the stress condition from within, the temperatures as well as the mean of those temperatures, temperature range. Then if there is a time we'd like that as well expressed in days and not in weeks.
Let's delete all this and see how we can do it. Now, the first thing we can do is to add our project code and we could do this in Regex. But you know what, this is actually probably pretty simple to do using substring. It's this guy, the first seven. There we go. Let's not complicate our lives.
Now, the rest of it, I think, is a little bit more tricky. What I'm going to do is I'm going to open up a new script. We're going to start out, we start out old scripts and we are going to go in and grab all of these descriptor names. We're just going to create a list called Description with all the values in this column.
What I'm going to do is just show the log. You can see here that if I run Description, I've now got all my descriptions. What do we feel like starting with? Let's see, I think temperature is probably a good one to start with. What I'm going to do is just to show you that if we take the temperature code here, all of these are going to be in about the same format.
We're going to create a list container to hold whatever it is. We're going to loop over all of the items in description. Temp code at i is going to equal something applied to the description at i. Then once we get all these, we can just slap the whole thing into a new column.
What is this going to look like? Well, it's going to look like Regex, first of all, of our description, I think this is just description at i, followed by, what is it? We're talking about temperatures here. It's one digit, maybe a second digit, followed by a dash, followed by another digit and maybe a second digit. Then the letter C.
What we want is this first set of digits, followed by this second set of digits. If I run this, hopefully it works. There we go. As I'm doing this, I see that I probably could have gotten away with just doing this. That would have been fine too. I probably didn't need that second one. But if it works, it works. If it isn't broken, don't fix it. There we go.
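The temperature loop can be sketched like this. It is written in Python rather than the JMP script shown on screen, the file descriptions are invented, and the variable names simply mirror the ones spoken in the talk.

```python
import re

# Hypothetical descriptions; the real file names are not in the transcript.
description = [
    "ABC1234 Freeze Thaw 5-25C day3",
    "ABC1235 Agitation week2",
    "ABC1236 Thermal 2-8C",
]

temp_code = []  # container to hold one result per description
for d in description:
    # One digit, maybe a second, a dash, one digit, maybe a second, then C.
    m = re.search(r"(\d\d?)-(\d\d?)C", d)
    temp_code.append(m.expand(r"\1-\2") if m else None)
```

Descriptions with no temperature range give `None`, so the list lines up row for row with the original column.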
Let's move forward and what should we do next? Let's grab our time. Time is going to work exactly the same way. We're going to create a container for time. We're going to loop over descriptions for time. Now what do we want? We want our time code equals Regex. What does this look like? It looks like well, first of all, we've got our description followed by, what's our pattern?
It is the word day or the word week. Then one digit. Might there be two digits? I guess there might be. We're just going to wrap some containers around this so we have a day or a week. We don't have both. Then we have one digit and maybe a second digit. We want our second container. We don't want the word day or week. We want just this.
If I run this, let's see what time code looks like. There. You can see that where it was able to, it managed to grab the day or week and put it in. Let's take all of this and drop it into a column. But before we do that, you perhaps want this expressed as numbers rather than characters.
What I could do is run that and express the whole thing as a number instead of a character. Now we're getting closer to where we need to be. Of course we want to know whether these are days or weeks, and from the number alone we're not going to know that. That's going to affect what we need to do here.
Because if it's days, then it's fine. If it's weeks, then we should take whatever number is in here and multiply it by seven so that we are consistent in using the number of days. Then we'll put that in a new column. What is that going to look like? Well, it's going to be an if statement. If, and another Regex: if our descriptor, day or week, equals week. Once we pull this out of our description, if it's week, then take whatever time code we have and multiply it by seven.
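The day-or-week logic can be sketched as a small function. This is a Python analogue under the assumption that the time is coded as the word day or week followed by one or two digits; the function name and sample descriptions are hypothetical.

```python
import re

def time_in_days(description):
    # First container: the word day or week. Second container: one or two digits.
    m = re.search(r"(day|week)(\d\d?)", description)
    if m is None:
        return None
    value = int(m.group(2))  # express the digits as a number, not characters
    # Weeks are converted so that every row is in days.
    return value * 7 if m.group(1) == "week" else value

time_code = [
    time_in_days(d)
    for d in ("ABC1234 Agitation day3", "ABC1235 Thermal 2-8C week2", "ABC1236 Control")
]
```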
What did I do? I think I probably need to close that guy. Sorry about that, everyone. Okay, now if we run our time code, you can see that our weeks are now multiplied by seven. We can take all that and drop it into a column. All right, so far so good. What's left? Oh, yeah. We want the mean temperature rather than the ranges.
What I'd like to show you right now is how we can make use of Regex once more, and that is to take whatever was in our temperature code and again, apply Regex to it to say that if it was the lower one, the minimum one is going to be the one on the left side. The maximum temperature is going to be the container on the right side.
To set these up, I'm going to take all of this and wrap it into a loop again, like that. Now we've got our minimum temperature, our max temperature, and our mean. This is how we're going to set this up in Regex. Anytime we've got temp code and this is the pattern, take the first one, take the second one, turn them into numbers, calculate the mean, and then slap that entire thing into a new column.
Oops. Okay, so the last thing we want to do is grab this middle sample here. Now, I'm not going to walk through this in its entirety. Let me take that back. I am going to walk through this in its entirety. Some of you watching this, if Regex is as new to you as it is to me, may not get this on the first try. That's the beauty of a recording: you can pause it, you can look at this, you can try it out for yourself.
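Taking the minimum, maximum, and mean from a temperature code can be sketched as follows. It assumes the code looks like "5-25", low value on the left and high value on the right, and uses Python as a stand-in for the JMP script; the function name is made up.

```python
import re

def temperature_stats(temp_code):
    # Left container is the minimum, right container is the maximum.
    m = re.fullmatch(r"(\d\d?)-(\d\d?)", temp_code)
    if m is None:
        return None
    low, high = int(m.group(1)), int(m.group(2))
    return low, high, (low + high) / 2

stats = [temperature_stats(code) for code in ("5-25", "2-8")]
```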
But basically what we're doing is we're going through the same process. We're creating a container for stress. We're looping through all of our descriptions and we're using those each individually as the source. What are we saying? Well, there's going to be eight characters. Any letter or number or underscore potentially a space as well, although I don't think there are any spaces. Oh, yes, there are. That's why I included that.
There may be a space, too. Eight of them. Then I like this here. This is going to be some stuff. Anything, one or more of them, I think was what that meant. What this does is it just tells you to start at the beginning and start looking. Okay, and now where are you going to stop? You're going to stop when you find day. You're going to stop when you find week, or a space, an open parenthesis, a closed parenthesis, or some digits followed by C, or you get to the end of the line.
When you go through all this, what are we looking for? We're looking to extract the second parentheses thing here. This was a literal open bracket. That's what we're looking for. Just drag all of these things here and drop those into your column.
As you can see, this was a little bit more complicated. It used some more complex functionality, including lookahead. I'm not going to go into the details of that right now. But I'll just leave this up here so that you can see how that was done and how you would go about doing this for yourself. All this says is keep looking forward until you see day and then take everything before. That's what these mean.
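Since the on-screen pattern does not survive in the transcript, here is a reconstruction of the lookahead idea in Python. Everything specific in it is an assumption: the eight leading characters standing for the project code plus a separator, the list of stopping points, and the invented file names.

```python
import re

# Skip eight word-or-space characters, then lazily take "some stuff" (.+?)
# until the lookahead (?=...) sees a stopping point just ahead:
# day, week, an open parenthesis, a temperature range ending in C, or end of line.
STRESS = r"^[\w ]{8}(.+?)(?=[ _]?(?:day|week|\(|\d\d?-\d\d?C)|$)"

def stress_condition(description):
    m = re.search(STRESS, description)
    return m.group(1) if m else None

stress = [
    stress_condition(d)
    for d in (
        "ABC1234_Agitation_day3",
        "ABC1235_FreezeThaw_5-25C_week2",
        "ABC1236_Light (repeat)",
        "ABC1237_Control",
    )
]
```

The lookahead checks what comes next without consuming it, which is why the stopping text itself never ends up in the extracted stress condition.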
With that, what I'm going to do is open this up again, just to summarize: regular expressions are a specification of a pattern, frequently used to clean up or extract pieces of data. You can search for a pattern and replace it with a different string or extract different parts of the string.
You can define the pattern using the Regex function or the Regex Match function, which we didn't talk about, and which we invite you to check out in the help files, which contain lots and lots of information all about Regex, as well as examples of how you can use it to solve the problems that you're looking to solve in whatever industry or whatever situation you're dealing with.
I would like to thank you very much for your attention, and I hope you enjoy the rest of the conference and check out the other talks. Thanks again. Bye-bye.