cancel
Showing results for 
Show  only  | Search instead for 
Did you mean: 
Submit your abstract to the call for content for Discovery Summit Americas by April 23. Selected abstracts will be presented at Discovery Summit, Oct. 21- 24.
Discovery is online this week, April 16 and 18. Join us for these exciting interactive sessions.
Garbage in, Goodies out

Best practices on getting your data into shape for effective analysis.

 

 

Full Transcript (Automatically Generated)

View more...

Hi, everybody. Welcome to today's segment. And what we are going to do today is talk about a few techniques that can help us clean up data more efficiently. Data cleanup is something that is painful in many instances, but we can, we can certainly help things. If we know a few tips and tricks, and these are all built into jump, what we're doing today that was part of my goal for today's show, is, you know, those of you that know me know that I like to script things when I have a problem that needs solved again and again. But for today's show, we're not gonna use any scripts, we're not going to write any formulas, we're just going to see how much we can do with things that are built into jump. So let's get started. And, you know, you can see here what the topics we'll cover today.

 

And I think we're going to hopefully do some things that you may know they were even in jump but he didn't maybe think to use them the way we're going to use them today. So let's jump right in here. First of all, let's take a look at this data table. It's been post pasted in from somewhere. And we've got a lot of missing cells, or the cells aren't missing, but the data is. And so you can see, for example, that these counties need to span down several cells. And the states are the same way. It's a long table. So it's hard to see that but we've got to spin those down. And the same with these metrics. So the first thing I want to talk about is the different ways that jump can fill in data so that you don't have to cut and paste. So let's just run through those real quickly in case you don't know about any of these.

 

Okay, first of all, if I select a single cell and right click, I have the option to fill that to the end of the table orders or to some row of my choosing. And if I pick a row, that's too high for the given data table, then jump extends the data table to accommodate that request. If we select more than one cell at a time, then jump does a couple things. If it can figure out a pattern to Data if that's numeric data, then it might offer us well, it will offer us to the ability to continue that sequence to the end of the table or to another row. So here, let me do that. And you can see the jump is figured out what that sequence was figured out. It's an arithmetic sequence, I'm adding one each time. And here I've got a geometric sequence.

 

And if I do the same thing, you can see the jump figures out, we're multiplying by three from one row to the next. Now, if I've got categorical data, like you see at the end, you know, there's no mathematics behind that. So my choices are to fill to the end of the table or fill to a given row. And what will happen is it'll just repeat that sequence over and over. And sometimes we want to do that with numbers. And that's fine too. If I had just selected these cells and decided I would repeat the sequence. I'm going to do it to the end of the table. It goes 123123 and so forth. Finally jump. And this is something I think maybe not everyone knows about. If I highlight cells where I've got gaps in between data, then my option here is to replace the missing values with whatever is above them. So here, I'll select and you'll notice that those A's, B's and C's have gotten filled in.

 

And so that's very useful. And something we'll use in that table we saw before. And this works with numbers as well. And notice in that first column, column four, I did not go to the end of the table. So this only happens wherever you select. So if you want to do this all the way down to the bottom of the table, select the whole column, and then do this. Okay, so just a quick review of the ways that jump can fill in data so that you don't have to copy and paste. Let's let's go to our data set. So it's pretty clear what we have to do here and this is going to be pretty easy. We're going to select the state and I want to point this out. If you right click on that The column header, then you get all these column options. That's not what you're after, you need to click in the data that you've selected.

 

And that's where that fill command will show up. So we're going to replace the missings with the previous values for the states and counties. And then these metrics, we're just going to repeat over and over again, because we were lucky enough in this data set, that every county has four rows of data, births, deaths, the net international migration and the net domestic migration. So we can just come here, right click and fill to the end of the table. And we're done. Okay, we can leave the data like this, we could split it out using this column. That's a lot of times why we want to do this and so we can split the data out into a wider format, but we're ready to go. So that's a really easy example.

 

That would not have been fun with copy and paste, but it's easy with the built in tools we have. Let's go to our next example. And This is going to involve a little bit of cleanup where and this is I'm sure this has happened to you before you've you've pasted in data. Or maybe you've, you've tried to do an import, but there's been so many header rows that things didn't work out the way you wanted. And so what ends up happening is you have a header row in the first row or some other early row of jump.

 

And really, you want those to be column names. The easiest way to make that happen is to select that whole row, and you're just going to copy it. Okay, now you won't see it from a right click. So you're going to use your edit menu or you're just going to do a Ctrl C, or Command C on a Mac. Then you're gonna come over here to the left and just select all these columns. And you can do that with a Ctrl A if you need to, and then we'll Ctrl V will paste and immediately all those names go in the columns just like you wanted. Okay, now our columns are named appropriately. So that's that's the first thing is just a quick cut and paste of these column names. It saves a lot of time.

 

Now the next Thing is, I'm going to show something that a lot of you have seen before. And that is when I right click on a selected cell, I can select the cells that match that cell. So here I've got a negative four in this column, and I can select all the rows in the column that have a negative four. And that's with a right click. What you may not know is that if I do this for a pair of columns, a pair of values in the same row, they don't have to be adjacent. I don't see that menu option here. So you might not know that it exists, it still exists, you got to come to the left here in the selection bar for that row.

 

And then you can select matching cells. And if you look down below, in the lower left of the table, you can see only three rows in the data set have this criteria, but the point is, you can do it and we'll take a look at this later. It also works for multiple selections. So if I if I select a 123 or four and I select the matching cells Then I go through the whole table and select anywhere that I see a one, a two or three or a four in that in that column.

 

So how does that help us here? Well, if you look at this data, we've got this header information repeated every so often. And but the thing is, it's the same, it's the same values. It's the same corruption if you will. So I'm just going to take this, this whole row selected and right click and select the matching cells. And you can see that that is selected 10 rows, I'm just delete them. And boom, we're ready to go. Now we have to clean up a little bit, you can see that by the left justification.

 

These columns are actually being stared stored as characters and so are these. But you're going to see a cool technique later today with Peter and Harry. called them, but I won't go through the whole thing. But I'll show you where to see it called standardized attributes. If I right click and click on standardized attributes, This lets me change attributes for all six of those columns at once. So real time saver, so I could make those numeric very easily. Alright, so next, we're going to use this select matching cells, we're gonna use these last three bullets in a more complicated example.

 

So let's, let's pull that up and take a look. Now this data has been pasted in from a website, I'll show you how I got this. This really happened to me is not fun. So I went to our discovery website and I we've got let me go to the abstracts here. So we've got papers and different information about them and the authors and in everything else, and I just copied this whole page and pasted it into jump.

 

And so what I got was something that looked like this. The issue is, of course, that everything's in a single column. And I don't want it to look like that how I want it to look, I'll just run this script here. This is what I want. I want every paper on its own line, and things broken out by session, topic, title, authors level and abstract. So this would be easy with the tricks we've learned today, except for one thing.

 

And the one thing is that some papers have two authors, some papers have one author, some papers have more authors, four, or five or six authors. And that's what's going to keep us from using the trick that we used here where I could just say, Look, I know the five things that I have for every paper, and I'm just going to copy those over and over and over again. It's not going to work like that for us, because we don't always have the same number of things. So what can we do about that?

 

Well, there's a lot of things you can do about that. You saw my script. You know, that does it instantly, but you have to write a script. And if you use formula columns, you might have to write formulas. So how can we do that with with none of those things? Let's take a look. First of all, let's select all the blank cells with select matching cells and delete those rows. We're also going to delete the rows where we have, you know, non useful information. So that's some of these first few rows and then at the bottom of the table, we've got the same thing going on. So let's get rid of those.

 

We'll just highlight all those rows and get rid of them. So that's kind of first step. Next step, I'm going to delete column two here, we don't need that column. And we're going to pass this through a quick recode, where I'm just going to collapse the whitespace. Let's get rid of redundant white spaces. And we'll just do this in place. We don't need a we don't need a new column for this. Once that's done, go ahead and hit OK. A recode rather. Now let's take a look at papers. How do we know that we've got a paper Well, if you if you take a look at the first line in this data table, you'll see the word session. Okay session ID. And this this is how you set off papers.

 

There's, there's a paper, there's a paper, there's another paper. And how can I flag this? Well, the first thing I want to do is I'm going to do a Ctrl. Find Ctrl F, and I'm going to change session ID to session underscore ID. And I do that for for two reasons. One, it lets me double check. You saw it's made 44 replacements, I can see that Yeah, I had 44 papers, so that that makes me happy. But secondly, I've created a word here. Now, that is unusual. It's not something I'm likely to find. If I if I search the text of an abstract, you might find the word session, but you're not going to find session underscore ID. So it helps make that unique. Now what we're going to do, is we're going to do another recode but this time, we will allow the new formula to be made. And we're going to say, or the new column rather, I want the first word I just want the first word at every line.

 

And what's that do for us. Now I can see wherever session ID is, and we're going to use a cool copy and paste with a flagging technique to set up all the papers and group them together so that we don't mix up the contents for each paper. And here's how we'll do it. We'll create a new column with a double click, and we'll use the fill technique we had earlier. To fill this column with zeros. Then what we'll do is we'll change that first entry to a one. That's our flag that's flagging the beginning of a paper. I'm going to Ctrl C, I'm going to copy that into my buffer. And then when I go to Select matching cells in this session ID column or the, you know, the second column we created. Watch what happens when I move around. Those rows are still selected. And because of that, I can just paste into this last column I had. So now I've got a flag. It's a new mayor.

 

I could hear What I want to accomplish next, what I want to accomplish next is use these flags with a new formula column A, it's in the row section here, and we're going to do a cumulative sum and watch what happens here. Let's look at that first paper, my cumulative ones, Psalm starts at one, and then I keep adding zero, so it doesn't change until it gets to that second paper. And it keeps changing only when I find a new paper. So what's nice about this is I'm going to remove that formula. And this is now this is a grouping column. This is now something I can use. I'll just call it paper to group my papers and make sure that the information for each paper stays together. And I'm going to need this as a grouping column when I split out later because papers have different numbers of rows. Now the next thing I can do also with this, this idea of a cumulative sum, this might be counterintuitive. I'm going to change all these entries to one by filling to the end of the table, and then I'm going to group on paper. So I'm going to right click go to the new formula column, and group by that. Now, what sorts happens when we do a cumulative sum on these ones?

 

Yeah, I get counting, right? Because I'm accumulating ones 1234567. And then it starts over, in each paper that the counting starts again. Why is that nice? Let's take a look at one and two, and select the matching cells. Notice, that's always the session ID of that paper. It's always the title of the paper.

 

Now, I wish that this happened everywhere, but it starts to break down for author because here's two authors. And so I've got the numbers three and four, but I only have one author in the next paper, so I just get the number three. So the bottom line of this is the last three sections with three rows of each papers data. The last three rows are always going Topic level and industry but they don't always necessarily equal 567. They're going to equal maybe four or five, six, or 910 11. So how do we deal with that? Let's do another formula column. And we'll do this now with the distributional section and do a rank and reverse. And so now look what this is done. This is allowing us to to count backwards. We're counting those same numbers one through seven for that first paper, but now they're in reverse. why this works is is is now if I select the 123 in this column, I do always get those last three parts, I always get the topic, the level the industry, and this is exactly what's needed. So I'm going to break these down by taking out the formulas so that they don't change as I change data here and watch what we can do. Okay, I don't even I don't need this column anymore.

 

In fact, I don't need this column anymore. Okay, what do I need to do, I need to make sure that these fives sixes and sevens actually turn into high values. Now, everywhere the last three lines in the data, I want it to be high values, I don't want the authors to be able to collide as this goes 34567 authors. So I'm just going to be lazy. And type 50 6070. I know I'm not going to have 40 something authors anywhere. Now again, we'll use a a tricky form of copy paste, very convenient here. I'm going to copy these three things into the buffer, come to this column with the 321 and right click select the matching cells. Now I've got the last three rows of every table. When I come back over here, I'm going to paste these with a Ctrl V and now look what's happened.

 

I've got one two, which we know are the session and the title. I've got 50 6070 which I know is a topic level in industry. And starting at three I count as high as I need you to get all the author's my go to for my go to 12. And who knows, the point is, I can now split this table. So let's do it. We'll split it out. And we'll split by this column that we made. And we'll group by paper. And so now I don't even need this grouping column anymore. And this is exactly what we wanted. Okay, if I, if I take a look at column three, that's where the authors start. And let's look as we go along. Some papers have more authors than others. Okay, it looks like the highest number was eight. So that's six authors. We can combine all those.

 

That's what we wanted. We didn't want to lose any authors. We want them all in a single column. So we're going to highlight all of these columns with a shift to try to highlight them all with the shift and dread. There we go. And we're just going to use our columns utilities. Right here, columns utility and combine those two. And now you'll see that in between column two and column three, I have an author's column, columns, three through eight is still selected in the table. So I'm just going to hit the delete key, and they're gone. So now I have exactly what I want, I just need to do a little more cleanup. And it's pretty easy to do. I'll go through a call this first column session and title, this topic, and I've got level here, abstract. Okay, so that's all looking good.

 

And finally, I can just come through. And as a one last thing, I can do a recoding here. And so I'll just I'll recode on all of these things. And I'll do these in place. So for this first one, I'll collapse the whitespace again. And now that we've done that I can come along and Do the first word if I want to get rid of that first word that's in the advanced section. You can say I want everything but the first word that's cleaned up. You the same thing here. But first word. Do it here. And finally, we'll just do another collapsing of whitespace. Just to be vigilant. Alright, that's it. Okay, we got exactly what we wanted. With no scripting, no formulas, not any that we had to make anyway. So there's a lot you can do here to clean up nasty data. And this is this will not be the last time we talked about cumulative sums on the on the doctor. So that's all I've got for today. I hope that helps you I hope this can help you with your data cleanup and back to Julian

 

Comments

Fantastic contribution thank you @brady_brady!  Perfect for us non-scripters out there who need to corral messy data and who use the JMP GUI routinely as the "tool of choice." @PatrickGiuliano   I did not know of the "cumulative sum" trick to sequence data within a group, and I also never thought to use reverse indexing like you did for this! 

Thomas1

Great usage of cumulative sum. Thanks for sharing.

wendytseng

How does one create a sequence in JSL?  For example, row 1 = 1, row 2 = 2, row 3 = 3, row 4 = 4 and row 5 = 5....row 6 = 1, etc....

MannyUy

Thank you a non-scripter and I am sure that I will keep replaying this until it becomes more natural for me!

 

Extremely helpful video for beginners like me, thanks for posting!

good reminder with some interesting news for me