I have used multiple ways to subset a long table (rows >100K, very common size of datatable). The problem arises because subsetting requires selecting rows which means a operation on the table before subsetting. And if the table is long it makes it very slow. specially when a foor loop is required to generate plots by subsetting everytime a new parameter.
it does not matter whatever you use, it is just slow: I am not sure if JMP team have any solution to speed up the subsetting??????
select rows where
get rows where
Why are you subsetting the data table?
Try selecting the data you don't want, set the Hide and Exclude row state, make your plot, and repeat for the next group.
I think i mentioned it clearly that problem arises when you select the data, which slows it down.
but i may be not able to understand what you are suggesting. could you provide JSL example. surely i will try if it speeds up things.
I've never had a problem if I use get rows where or even just the rows. You don't actually have to select them.
names default to here(1); dt = open("$SAMPLE_DATA\Big Class.jmp"); rows = dt << Get Rows Where(:sex == "M"); dt_sub = dt << Subset(rows(rows));
What are you using to select the rows? It's possible that's slow.
@vince_faller currently i use the same method. but when file size becomes bigger, even get rows matrix becomes bigger and slow.
if you use small data table like
, anything can work. problem is with Big data files.
So with this table of a million rows. It takes a sixth of second to run. It's probably something to do with your script. Like Dan said, if you gave an example script of it running slow we may be able to help you.
Names Default to Here( 1 ); dt = New Table("Test", Add Rows(1000000), New Column("X", Character, <<Set Each Value(Random Category(.3, "A", .3, "B", "C"))), New Column("Y", continuous, <<Set Each Value(Random Normal(0, 1))) ); st = HPTime(); rows = dt << Get Rows Where(:X == "C"); dt << Subset(rows(rows)); tot = HPTime() - st; show(tot); //RETURNS //tot = 148508;
If you're calling a sixth of a second sluggish, then I don't know what to tell you.
Also, if you're using a for loop to generate a bunch of plots for each parameter, is there a reason you're not using a by?
Here is a simple example of how to do the analyses by groups without subsetting
Names Default To Here( 1 ); dt = Open( "$SAMPLE_DATA/big class.jmp" ); // Reset the data table row states dt << clear row states; // Find the different values for Sex Summarize( byGroup = By( :sex ) ); // Loop across the different groups and analyze them For( i = 1, i <= N Items( byGroup ), i++, // Reset the data table row states dt << clear row states; // Hide and Exclude the unwanted rows dt << select where( :Sex != byGroup[i] ); dt << hide and exclude; // Run the analysis ow = dt << oneway( x( :age ), y( :height ) ); // Change the title of the report to show what data are in the analysis Report( ow )[Outline Box( 1 )] << set title( Report( ow )[Outline Box( 1 )] << get title || ": Sex=" || byGroup[i] ); ); // Reset the data table row states dt << clear row states;
There are two problems with your aproach.
1. you are using "select where(" which is the real root cause for bigger table. once you use select where it is operation of updating the data table of selecting rows (in my case >300K rows) you are just using Big class which is not even counted for a file size.
2. this does not work for the situation below
if i had to correlate height data on X_axis for sex="M" and height data on Y_axis for sex="F".
i hope it can help you to undestand the problem now