Creative and clever uses for column formulas is the focus of this episode of Ask the Data Doctor. Using interesting data like crime, automotive and nutritional information for candy bars, Brady shows us how powerful “select matching cells” can be to quickly extract information from data tables by highlighting only the appropriate rows. Learn how to do some amazing things including computations, and group and filtering. Brady also shows us how to easily select local data filters using the lasso feature, how to use the selected(rowstate()) formula to assign values dynamically and much, much more.
Full Transcript (Automatically Generated)
Welcome to today's show. And I just learned that I've got some extra time, which was great because I was I was going to be pressed for time. And I actually decided to slow things down and just cover one topic instead of two. But now I think we'll be able to cover them both. So today's show is going to show you how to do some things in jump, that build on built in features. But that adds significantly more power. And will actually start because I've got this extra time today.
We'll actually start on the selection front. And what we are going to do is first of all introduce the naming selection, name selection and column feature, which may some of you may not know is there we've got some great ways to select data and this is one of them. And actually, it's not selecting data, but it's helping you record what you selected. So let's take a look at that. Just to introduce that idea. So let's say I've got a graph and I find something interesting in the graph. And so I highlight some points. And then I look in the data table. And I would like to remember that selection so that I don't have to, you know, guess exactly how to select those points in the future. That's something that's built right into jump in the rows section of our menu. If we come down here to row selection, about halfway down, all the way to the bottom, named selection and column, what this will let us do is you get to pick a column name. And we get to say what we would like the selected columns to be labeled.
So I'll just put a one there, and unselected columns, I'll put a zero there, we'll just treat this like a flag. But I could put text if I wanted. When we're done, we hit OK. And you'll notice that all the selected rows have the label they're supposed to and the unselected rows get their label. And this is nice because if I ever came back to this table and this graph and Wanted to select those same points again, I could just select the one and then right click and select matching cells and I get my selection back.
So that's pretty nice. What I want though, is for this to be dynamic that adds significant power is if I could actually select make a selection either in the data table itself or in some report that's linked to the data table and have these points respond. And that's not happening here. So how can I get that to happen? I'm going to delete the column I just made, make a new one. And we can get this with a formula column. It's a simple formula, but you might want to write it down because it's not one that you'd probably remember offhand. It's going to be the selected function. And then as its argument, row state, and that is also a function. So we need parentheses there selected row state Okay, that's that's the function we want. Let's hit OK, and bring that in. Now watch what happens. Anytime I select a row, I get a one in that column. And the rows that are not selected, get zeros. Keep pretty straightforward what that's doing. And this, of course, works because everything's linked. It works, whether I'm in the data table, or in a report that is linked. So this is really powerful. We're going to see a couple of reasons why here.
First, the continuous case, that column, I can make it continuous, or discrete. And if I make it continuous, I can do math. I can average or I can sum and if I average, that's actually giving me the proportion of rows that I've selected. If you think about averaging ones and zeros when you take the average, it's just gonna tell you the proportions of ones you had. I could sum this up Instead, and that will give me the number of rows that got selected. And so the punchline here is that we can update a reporter graph just by making selections somewhere else in jump. So very powerful here. And to, to take a look at that in action, we'll notice that I did put this selected row state column into a data table here. Okay, I've got a continuous and a nominal version will work with the continuous version. And this dataset has complaints about people's trucks. So I've got a manufacturer name, among other things. And I've also got system flags. So 35 different columns here. And I'll just double click on the brakes column, take us over there. And so when I see a one that's telling me that this person, this row of data had a problem with the brakes. And you can see that if I select the matching cells, I look down I've got over 9000 rows of data where people had problems with their brakes.
Now rather than goes Through and scroll and do that select matching cells thing. I actually wrote a dashboard that will illustrate this point. And what the dashboard does, is down below, I've got a data table where I put event handlers in. And so that I can just click, you know, like air conditioner, there's 160 air conditioner problems apparently in this data set. And if you've looked at the report that just came up, okay, if you add those numbers in that lower graph, that's going to add to 160. Those are the 160 air conditioner, issues broken down by manufacturer name.
Similarly, up top I've got the proportion that those events represent in that manufacturers data. So what I'm saying for example, is that the 45 instances of the air conditioner problem among the Fords represent point 3% of the Ford data in the data set point 3% of Ford's rows and The data set, if you will, so we can check it out for breaks. And then we're going to get this updating. So why is this happening? It's because I am selecting by clicking this one, I'm going back to that original table and selecting those 9444 cells. And it is updating this table dynamically. So anything I do in that original table just updates this one. And what's great is that it's not dependent solely on this dashboard a row remember this is linked to that original table. So I could do something like run text Explorer. And when my text explorer results come back, I might see something I want to investigate like windshield wipers. And so I can just select the the rows that are associated with the windshield wipers. Not through a flag in the dataset, but in the text that is actually in that data set that text explorer has processed and you'll notice I've selected some rows here in the data set about 1600, almost 1700. And if I actually go back to my dashboard here, okay, these now are those 1600 rows that we selected. Okay, so this that has updated this. So bottom line is I can drive reports by looking at either sums of that flag, or averages of that flag. And it's all driven by selections I make, whether it's in a dashboard, whether it's in a report, a graph, what have you. It's a very powerful thing to do.
Now, what about if we have the nominal case? Let's take a look at a couple of those. What the nominal case is going to allow us to do is to use this selection flag as a grouping variable. So I get I can do an A versus B comparison. I can say, Well, let me let me investigate those selected cells versus everybody else. Are they the same? Are they different, they look like they're within the distribution. Let's take a look at that. And we can also do an a very powerful extension of the local data filter to be selection based instead of column based.
We'll see that as well. So here's a data set about candy bars. And when I was looking through this data set, I wanted to find a candy bar that was pretty high and non saturated fat and pretty for the calories it had. So we're, you know, the kid interested in or, you know, in this portion of the graph, and what I did was I put that selection column in that nominal column. And you'll notice that I've got a box plot here and a violin plot. But I've got my selected variable there, and it's just reading zero. And similarly, I've got a tabulate over here, and I've inverted the selection of zero. So it's showing me things that are not Selected equals zero, which is precisely selected equals one. In this case, nothing's showing up. But as soon as I select, something shows up, and so I get a list in the tabulate.
That is showing me what exactly what I selected. And that helps me not have to hover over all of these little candy bars. And in fact, that gets difficult because some of the candy bars have exactly the same amount of fat and calories. And I haven't managed to land on any of them. But sometimes when you select like, say two points, you get three or four bars. And so this is good because the tabulate will take care of that. It'll show me everything. And so this is really cool, because I could come over here and say, Well, these are some of the bars I might want. Let's compare them to the others. And I can immediately see not just in this graph, but for any for any metric that I want. I can look at the cholesterol and compare them. I can look at the sugars, the protein, I can go through this and it's not just For graphs like this, this could be for models that you fit I could have, if you think about a model, I could, I could put a local data filter on a model and say I want to filter the selected columns for this model and then build that same model, just side by side in a window and say, I want the unselected points to filter on this model. And then you'll get to see side by side model estimates. For the things you've selected. The things you haven't, think about this in the semiconductor industry, where you might be looking at any wafer map, might find something interesting in here. And you're saying okay, those points I selected, how do they compare to all the other points on that wafer? With respect to any parameter you might want to view as well as listing out, you know, whatever metric you wanted to associate with those particular locations on the way for you may have text that you wanted to render here and that could be, you know, part of your local data filter and your data. tabulate. As another example of that, let's consider, go back to clean this up a little bit.
Let's go to the San Francisco crime data, which has a lot of data, I mean too much data to ever, you know, if I if I look at this local data filter. And again, here's the tip, when you're setting this up, put your selected flag, but but select zero and then invert that. Because you won't get a one to choose. If there's nothing selected yet. You won't get a zero and a one you'll just see a zero. So you just select that and put in verse and then you're good. That's the same as picking ones. But if I turn this off, look what happens. Look at the report I get. Trying to turn this off. Let me just take off inverse. Look at what happens here. Way too much to read. This isn't I mean, this is the whole data set, and it's just it's not what I'm interested in what I want They just want to be able to highlight a little piece of this and have this update.
So not a problem. Let's just get the lasso here. Just say I just want this little piece just right there. Okay, boom, there, that that shows me exactly what I want. And this would be very difficult to do with the typical local data filter. Because if you look at the columns I have here, none of them are spatial. There's no Latin long. I mean, well, there's a Latin long, I guess, if I knew the Latin long I could, I could try to do that. But what if it's a funny shape? That would, that would be difficult. So we get around all that just by setting up with the selection flag, I can go anywhere and make any kind of shape, and it's not going to be a problem. So this in generally works anywhere that you might want to do a local data filter. If you don't have the columns that can easily help you slice and dice your data. You can always do this trick, and then slice and dice the data. Just by selecting or brushing certain points in reports. It's really, really powerful way to do that to extend that local data filter. Right. Next up, we're going to attack a problem that we hear about a lot from customers. And I think that there's a feature that's relatively recent, that a lot of people may not know. Is is called select Duplicate rows. And that lives in the rows selection menu, and it's right above name selection column. This is relatively recent. This is in the last couple of releases or so. And what's this do? Let's take a look. Let's just do that. Let's select the duplicate rows does not thing, apparently. Well, why is that that's because when I don't make a column selection, and here's the trick, if I don't make a column selection, then jump is going to look at all the columns. And so the rows have to match all the way across every column value has to be the same for a row to be deemed a duplicate. However, if I do select the column, or multiple columns, now those only those columns will be considered. And if you look at rows seven, and eight, in this respect, eight could be viewed as a duplicate of row seven. Okay, and so in that way, if I come along and try this again, ah, row eight is selected. And sometimes I have even more duplicates. If I go down here a little bit further, I'll see that rows 89 and 90 are duplicates with respect to these first two columns of row ad. Now why is this a thing and why can this help Well, we get a lot of questions about, hey, I've got some measurements that were taken at certain points in time. And I've got certain tool and part combinations for which these measurements are taken. But sometimes I get more than one, you know, I'll have to revisit this, maybe something went wrong, I have to do it again. And I'll get two, maybe three, maybe four or more rows in my data set, where I've got the same tool in part, and it's, there's measurement going on. And so I get, I don't want all of those rows, I just want the unique rows that are the most recent. And so how can I make that happen? Well, one of the ways if you want the most recent is I can sort here, just sort the date and time in descending order.
That means all the most recent dates are at the top of my data table. So the the tools and parts don't have to be contiguous, you know, they don't have to all be grouped together. For this to work, it'll just go through the table and find those duplicates for you. So I can just now do that selection. And it's going to find out those 200 selected rows that are duplicates of some other row that came first. And remember, a row gets to come first, if its date time is more recent. So I could just delete those 200 rows, if that's all I'm after, is the most recent data.
Similarly, I could I could decide I want the oldest measurement, the first measurement I want to look at, well, then I sort snd and so that now the oldest data is at the top of the data table. And I do the same trick where I might decide that hey, if I have multiple measurements, it's because I wasn't happy with you know, the standard deviation. I think the tool was messed up and I'm going to fix the tool, and I'm looking for the one with the lowest standard deviation Okay, easy enough, sort that standard deviation in ascending order. And now select the duplicate rows and delete them. So why I point this out is not a lot of people, I think know about this because it's relatively new, especially if you've been using jump a long time. And sometimes that's all you need. You're just, you know, really looking for that first occurrence, if you will, in the data table, and you want to, you want to save that and delete everything else. When things get more complicated, you're going to have to start using rankings, and we'll talk about that right now. So let's first of all, bring up a simple data set for illustrations sake. Okay, and what I'm going to do here is I'm going to do a couple things here. I want to show you you know what, I'm gonna do this. We'll make this easier for you. I'm going to put numbers here that allow me to get this data set back in the shape it is right now with respect to row orders. Okay, and I'm just gonna say I'm just gonna, this is my original order. Now I'm going to sort on group so that we can take a look right here. I'm going to put a columns of ones and those of you that watched last week's episode are going to understand what I'm about to do.
Okay, I'm unclear as to why that's not a it's not a sequence. Actually, I just want to continue sequences to the end of table. Yeah, or repeat it, whatever. Okay, there we go. All right. Now, I'm going to actually group by these letters here. And so I'm going to make a new formula column and group by that. Okay, and then I'm going to do a couple cumulative sum. Let's take a look at this. It's in the rows section of the new formulas, cumulative sum. Now what's happening? Let's just look at the A's. Okay, when I just look at the A's, I get 123456. Why? Well, because it keeps taking these ones and adding them up and accumulating them. But it starts over when I get to the B's. It starts over again, when I get to the CS.
What's nice about this is that it doesn't matter how these things are sorted. Okay, it doesn't rely on that when you write lag functions. A lot of people do this with lag functions, then the sorting matters after or you have to write a bunch of logic in there. what's what's happens when I go to the original sort order. It's still okay. If I look at my A's there's still 123456 Okay, there's still going to be numbered in the order in which they occur. In the dataset starting at the top of the table.
So what why is this a thing? Why is this powerful? Well, let's clean this up a little bit. And the first way I'm going to clean this up is to realize, if I just have a column of ones, there's actually a shorthand for dealing with that in the cumulative sum formula. So the column cumulative sum function takes a column that you'd be summing and then and then group variables if they occur. Well, I can just replace this column three, it's all ones. So I can just replace it actually with the number one. And I get the same behavior.
So in fact, they don't even need this column of ones, so we'll get rid of it. Now let's do this. I'm going to group this to make it or excuse me sort on this on these groups to make this a little more clear what's going to happen. But now I'm going to add some columns actually. And let's take a look right here at this x one variable And this x one within group, okay, I'm going to sort by x one and group. Okay, I put a rank function here. If you look at the function, it's the column rank. And it's tied though to x one, this, this is only good for that x one column. But what it does, by group it ranks the values of x one. So the lowest value is rank one. The next lowest values rank two all the way up to whatever the highest value is. I could do this in reverse order, but watch what happens. You know, make sure this is still my by variable. Yep. So let's go ahead and do this in reverse order. And we'll rank this this is the distribution section in reverse. If the trick is I don't know how many each of these groups have, they don't all have six items.
And so what happens when I look at the reverse rank? Look at what happens to me here. I need extra logic, I need to know how many items are in that group for that given column and then do subtract the rank and add one. There's a lot of drama going on. Okay. Okay, before we even go there, let's just get rid of this column and go back and take a look at these numbers. And these numbers are the same. They're the same as long as x one is sorted. These are going to match. The beauty of this column cumulative sum is, all I have to do is sort a column the way I want to sort it. And these are dynamic ranks. I don't have to write 3456 new column formulas. It's all right here as long as I remember what I sorted. If I say you know what i want to rank The X twos, okay, we'll just sort those ascending. And now these ranks actually represent x two and we could take a look at that by by sorting by group and x two.
You'll notice that these values are the ranks. Okay? If wherever I have a one for example, that's going to be the lowest value that any of these variables have for X to pick the lowest value. That's what this universe okay now the ones always show me the highest value within a given group of x two. So why this is so powerful is it now allows us to greatly extend what we did before in case we need more power for something like that, you know, select Duplicate cells. Let's let's look at what we can do with this now.
First of all, it lets us easily perform a bunch of queries. If we have to do that if we have to break out a bunch of subset tables that involve things like oh, what are the top 10 for each group I want to see their top 10 are their top three are the top three territory you know each territory want to see their top three sales, on and on and on. That's really easy with this. Okay, so let's, let's take a look. Here in the car physical data data set. I've got these cars grouped by country and type. And so I might decide, okay, within each of those categories within each combination of country and type, I want to see the three most powerful cars Actually, let's let's look at the three least powerful cars. Let's say I've got to buy a car for my teenager. I don't have a teenager yet, but when I when I buy them a car will be very large and have low horsepower.
So let's let's do this. All I have to do I've got this this column built in right now. Okay, let's take a look at the formula. It's our column, cumulative sum of ones. It's grouped by country and type. Okay, that's exactly what we want. And let's just sort that horsepower, ascending. And so now what do I know, I know that a one right here means it's the first time this country in type has shown up in the table. And that is going to mean because I sorted on horsepower, that it's the lowest horsepower for that country and type.
So I just look at the the two lowest horsepower cars, maybe I just select the one and the two. Then I select matching cells. Now all any cell with one or two get selected, and I can just bust out that subset table. And now I've got my list. You know, I could sort of clean it up and everything but the bottom line is this list now contains for each combination of country and type in this table, the two least powerful vehicles. Okay, that's what that's what this is doing. horsepower, I didn't have to write a ranking for I never will have to write a ranking formula. This is just doing it for me. And it's nice because even if I try to set this up for somebody, you know, I can't decide where they're gonna go from there. And if it's someone who's not super familiar with formula columns, they might not know to set up a rank formula. And they may want to do something like maybe they want to know power to weight ratio, you know, maybe that's what I should have looked at, it's, it's not just enough to be low powered, it has to be heavy to make sure it's really slow. And so I can select horsepower and weight. And let's do a column transform here. On ratio, we'll go reverse order. So now we have power to weight ratio. Let's sort that ascending. Okay, and these will be these cards are going to be good and slow. They got high weight and low horsepower for the for the weight. And we'll just pick let's say the top three in each category. So I just have to grab some ones, twos and threes, as long as I got all of those, select the matching cells. And now I'm ready to take a subset.
Now, you will notice the subsets coming back, you know, this should have 45 cards in it. But in fact, one of these, if you look at one of the country type combinations that we've got, it only has one car, and I think it was the other large.
Yeah, this one, this only has one car in the in the category. So that's why we only got 43 rows instead of 45. But it's so easy to do this, and I don't have to keep rewriting ranking formulas. So it's very easy to do those types of queries and just, you know, table after table after table if you've got work where you have to do that kind of thing. It's just so fast. Last thing is, maybe you've got to answer questions about you know, some do a model say on some data by country and type and you want to figure out Okay, which rows Do I need to get rid of just because there's not enough occurrences within country and type, let's say I only want to keep value, keep countries and types where they got at least 10 cars. Well, typically what you'd have to do is go through, do a summary, right by country and type. And then see that table, come back, sort the rows that come back and pick the ones that are high enough. And you'll notice I've selected 73 rows back in the main table. And so those are the 73 rows, I want these other guys don't have enough observation.
But I don't have to do all that and create, you know, run another platform, create a summary, all I got to do here. If I've got this ranking in places just go find a 10 somewhere, okay, find a 10. And I select the matching cells. Well, you don't you don't have a 10 unless you have at least 10 data points in that country and type combination. So these are precisely the combinations of country and type that have at least 10 data points. So in fact, these six combinations of country and type that are selected now are the ones we want. So I can just right click right now and select the matching cells and those are the same 73 rows that I had selected before, but I didn't have to make a summary table and sorted or do any of that work. It's all you know, these these little few seconds here a few seconds there adds up to a lot of time saved over over an analysis. So that is what I have for you today. I hope you find this useful and can go practice this. I hope that this saves you a lot of time.
Thanks for sharing more inside knowledge about JMP.
Love @brady_brady tricks. Will it be possible to share sample data for practicing?
Very help tip! I'll be creating a "selected(rowstate())" variable in just about every data table I work with going forward. thanks!