cancel
Showing results for 
Show  only  | Search instead for 
Did you mean: 
Choose Language Hide Translation Bar

Selecting and Subsetting Data

Started ‎11-08-2022 by
Modified ‎11-08-2022 by
View Fullscreen Exit Fullscreen

So in this demonstration, we'll look at selecting specific rows and using the subset command, which interactively is found in the Tables menu, to pull out a subset of the data table. So in the course journal in section I'm going to click on Ski Lodge Temperature Data. And this is a wide data table, but fairly short. And so our scenario here is that we have a ski resort that has a weather station and it records the temperature every hour and saves it to some file. And so every observation the file is going to have the date plus of the hourly temperature readings. It stats at AM, so is the temperature at AM. Now let's suppose that when the temperature falls below minus degrees Celsius, we plan to turn on the snow-making machines, because at that temperature may be too cold to snow. And so we want to look at our record of historical data to find days in the record when we would have used that snow-making machine. Perhaps we're trying to justify adding another machine or replacing the existing one. And so a couple of things here. One is that this temperature was recorded on a Fahrenheit scale, not Celsius. And two, we're using just a small subset of the data. Now these are in day order, but this is representing one of our best practices of when you're going to be working on a large data set, try to get a representative sample of observations for developing your script so that you can develop and test your script against a smaller subset. Again, hopefully representing any issues you might expect to find in the data. And then you can test it against the full data set. So in this example, hopefully this is representative, that we just took a single week of data. The other thing is that we've got Fahrenheit temperatures recorded but our cutoff is minus degrees Celsius. So while we can certainly convert minus degrees Celsius to Fahrenheit, instead we're going to reiterate this concept of working with the vectorized operations and we're going to extract all the Fahrenheit temperatures as a matrix and convert them to Celsius temperatures. So in the journal there's a pre-written script called Isolate Snow-Making Days Script. And I'm going to run that one piece at a time. And in this demonstration, I'm actually going to use the embedded log option rather than navigating back and forth to the log window. So I'm going to right click in my script editor window and select Show Embedded Log. And I'll just kind of resize this so that we can still see the entire script in the editor part of the window. So I'll select line and run that. And so that gets the reference for the current data table, the Ski Lodge Temperature data table, and stores that in dt. Then I'm going to send the message get as matrix and store that in the variable tempF. So I'm going to run line And often with a larger matrix, rather than displaying the matrix in the log we just get the matrix dimensions and the indication that that has been assigned. Now notice it's seven rows and columns. And if I navigate back to the data table there are columns in this data table. So why did we only pull out columns? Well if you look at the date column, for one thing you can see that it's nominal. So that suggests that maybe this is being treated as character data rather than numeric. And I can further confirm that, I could go to the column info of course, but I can additionally confirm that by making the column just a little wider and noting that the values are left justified, which is how JMP handles character data. So when I sent get as matrix to the data table, it ignored this character column and only extracted the numeric columns. Line then is going to take each element in the variable tempF, that matrix that's stored there, and convert that to Celsius and store that in tempC. And again I get that same message in the log, but you can hover over tempf or tempc of course to see their values. Next I want to count the rows in tempC and I've been mentioning this as a best practice, counting rows or columns outside of a loop. So I count the rows in the matrix tempC and store that in the variable nr. So I'll run line and of course there are seven. The other thing I want to do is I want to have-- I'm going to create two matrices. One which gives the minimum temperature at any given day, that's minTemp, and the other one which is going to indicate whether the snow-making machine should have been turned on or not, and that's called Machine. And they're both going to start with some number of rows, one column, and missing values. And so in line we're cascading the assign function. I use that J function to make a matrix with some number of rows, here are the number of rows from tempC, that single column as I mentioned, and the missing value. And if Machine, and minTemp, and any other variables are going to start with the same value, we can simply assign them in this cascading fashion like you see in line So I'll run that, and in the log and if you hover over Machine or minTemp you just see that column of missing values. Then I want to loop, and I start with a counter I test r less than or equal to the numbers stored in nr, and then the body of the for loop is I'm going to have a variable called obs which will take in this case, the first row and all the columns in tempC. I will find the minimum value in that row and put that in the rth, or first place, in minTemp. Then I will compare the value I just put in minTemp in that position to minus and if this comparison is true, if the value is less than minus I'll get a Boolean one is the result and put that in the first position in Machine, whereas if it's false, if it's greater than or equal to minus I'll put in Machine. Then we update, we test again, and now obs will be overwritten with the next row in the data table. And we'll do the same steps until we populate both minTemp and Machine. So I'll run lines through and we can hover over minTemp. And again it shows up in the hover help, it shows up also in the bottom of the script enter window you can see down below. And I can hover over Machine and there are those and Now the last couple steps we could have cascaded these messages, but for demonstration purposes, we're going to separate them out. So I'm going to send a new column message to dt. It'll be called minTemp, and the values will just be the ones that I just got from that for loop. So I'm going to run lines through And you can see in the background that that new column has been created. I could have added a format argument to new column to have a fixed number of decimal places, for example. This also returned a column object reference that I could have stored in a variable if I was planning to work with that column again. Then the next piece of code is setting new columns a message to dt. This one is called Make Snow. And the values are coming from that machine column vector, which has and And here my script is going to request that we use value labels where if there's a for the data value the label will be no, and if there's a for the data value the label will be yes. And then we of course, use the Boolean one to turn on the option to use value labels. So I'll run lines through Again, we are creating an instance of a column object. And I want to navigate back to the table and scroll over. So in the column right now what we're displaying is no and yes. In the columns panel If I scroll down though, you can still see that Make Snow is continuous, which means that it is numeric data for the actual data values. And another thing with value labels when I'm using them-- and that asterisk in the columns panel also indicates that there is a property, and if you clicked on that you'd see this one-- when I'm using value labels, if I hover over a cell in the data table it shows me the actual data value. But if I were to make a graph right now or any kind of analysis, it would use these labels as long as those are turned on. And then back in the script, again we could cascade these messages to the data table but for demonstration purposes, we're going to do them in these steps. I send select where to the data table, and I'm looking for the rows where the value in the column Make Snow is equal to So I'll run line And you can see in the background there are four rows that have been selected. And then finally I send the subsect command to the data table. I supply an output table name and specify the columns. And because I have four rows selected, I don't have to say which rows I want. Those will automatically be the ones pulled into the subset table. So I'll run lines through And now we have that subset table. Again at this point, we could choose to format the minimum temperature data. The source script reflects the subset. And let's actually take a quick look at that, if I right click on source and select edit. Any time I create a new script using a command from the Tables menu, whether I do it interactively or with a script, I get the source script. And so it just shows that subset was the table command used. The caution with subset is that it doesn't know why these rows were pulled out, so it hard codes those rows. There's no basis for that selection in the script that JMP writes. And JMP also overrides that default tendency to just pull the selected rows by putting a as the argument to selected rows. So if you wanted to capture the script and adapt it to another script that you were writing, you would want to get rid of the two rows I've shown, and prior to sending the subset message, make whatever logical selections of your script requires. So I'll cancel out of that script. So this has been one example of using a command from the table's menu, but also again working with the data as matrices.