Re: Should you Loop through a data table or use Recode, or use Get Rows Where to...

txnelson · Jun 10, 2023 4:29 PM

A recent Community Discussion script to find and replace a numeric value with a missing value in the whole tablehad a very nice entry provided by @SDF1 that was selected as the solution

names default to here(1);
dt=current data table();
For( i = 1, i <= N Rows( dt ), i++,
	For( l = 1, l <= N Cols( dt ), l++,
		If( Column( l )[i] == 999,
			Column( l )[i] = .
		)
	)
);

When I viewed the solution, an old teaching popped into my brain that roughly stated is, "Don't loop through a data table. Looping is slow. Use the JMP built in capabilities."

So I started out to modify @SDF1 code.

I determined that my fallback standard of "Get Rows Where()" would be faster. After all, it allows all found rows to be set with one assignment statement, after the "very fast" Get Rows Where() has found the rows.

Another solution, that I have seen ~~pushed~~ recommended by the community is the Recode facility in JMP.

So:

Hypothesis: The built in JMP functions of Get Rows Where and Recode will be faster than Looping down and across an entire data table.

I selected the Semiconductor Capability data table as a good sized table, 132 columns and 1455 rows

I wrote a little script that sets 10 random values in each column to 999.

names default to here(1);
dt=current data table();
for(i=5,i<=ncols(dt),i++,
	if(col stddev(ascolumn(dt,i))!=0,
	for(k=1,k<=10,k++,
		column(dt,i)[randominteger(1,1455)]=999;
	)
));

Names Default To Here( 1 );
dt = Current Data Table();
start = HP Time();
For( i = 5, i <= N Cols( dt ), i++,
	dt << Recode Column(
		As Column( dt, i ),
		{If( _rcNow == 999, ., _rcNow )},
		Target Column( As Column( dt, i ) )
	)
);
show(hptime()-start)

This is the original value to be changed in the Discussion mentioned above. The 999 is to be changed to missing (.) by the scripts.

Method:

Startup JMP 16.
Read in the Semiconductor Capability data table
Run the script to set the 999 values in each column
Repeat the above 2 steps for each of the 3 methods and recode the time down to the millisecond.
Repeat the above steps for JMP15

Here is the script for the Recode

Names Default To Here( 1 );
dt = Current Data Table();
start = HP Time();
For( i = 5, i <= N Cols( dt ), i++,
	dt << Recode Column(
		As Column( dt, i ),
		{If( _rcNow == 999, ., _rcNow )},
		Target Column( As Column( dt, i ) )
	)
);
show(hptime()-start)

The code for the Get Rows Where

names default to here(1);
dt=current data table();
start=hptime();
for(i=5,i<= ncols(dt),i++,
	column(dt,i)[dt<<get rows where(ascolumn(dt,i)==999)]=.;
	if(mod(i,10)==0,show(i));
);
show(hptime()-start)

And here are the surprising results

The For Loop wins hands down. And from the results, the Get Rows Where should be used sparingly.

Comments and corrections are welcomed......

Jim

jthi · Apr 30, 2021 01:39 PM

I have been doing some optimization at work to some old scripts we have. Some of the stuff that that I had to change were related to get rows where. I have started to use more and more matrix operations (mostly Loc()) if I need to have fast JSL code.

For example you could change the get rows where to following (didn't test if it replaces correct values or all of them):

start=hptime();
for(i=5,i<= ncols(dt),i++,
	column(dt,i)[Loc(ascolumn(dt,i) << get as matrix, 999)]=.;
	if(mod(i,10)==0,show(i));
);
show(hptime()-start);

There is also option of using column properties Value Labels and Missing Value Codes which might work in some cases. I like using them when possible, because then you won't be changing the data.

-Jarmo

txnelson · Apr 30, 2021 01:50 PM

@jthi ,

Your matrix based code came in as the fastest.

0.299795 seconds........

Jim

jthi · Apr 30, 2021 01:58 PM

Maybe @Craige_Hales could have us even faster solution. I have gotten the idea to use Matrix calculations (especially Loc()) from his community posts/blog.

I have made one tool at work faster which opens dataset and then deletes old data (based on user selection). Originally it used << Select Where and Delete rows. I replaced it with some Loc() magic and it got also way faster. I have been really liking Loc() a lot lately.

-Jarmo

Ryan_Gilmore · Apr 30, 2021 02:26 PM

I agree that Loc() is the best method.

For me, the slow down seems to occur because of As Column. I see a significant difference between,

start = HP Time();
For( i = 5, i <= N Cols( dt ), i++,
    As Column( column(dt, i) );
);
Show( HP Time() - start );

and

start = HP Time();
For( i = 5, i <= N Cols( dt ), i++,
    As Column( dt, i );
);
Show( HP Time() - start );

Changing the As Column syntax in the code from @txnelson original question, the For loop is still the fastest but the others are more comparable.

jthi · Apr 30, 2021 02:41 PM

So we skip AsColumn() with (going more with a hunch and I have not tested that this will replace all correct values):

start=hptime();
for(i=5,i<= ncols(dt),i++,
	column(dt,i)[Loc(dt[0,i], 999)]=.;
	if(mod(i,10)==0,show(i));
);
show(hptime()-start);

-Jarmo

SDF1 · Apr 30, 2021 02:19 PM

Hi @txnelson ,

Very interesting comparison, and not the results I would have expected, either. I remember when I first started with learning programming in college and was also taught something similar: be wary of loops because they're slow.

For sure, with some calculations and in some programs, I have found For loops to be very slow -- slower than the alternatives. Not being all that great of a programmer, though, I tend to fall back on what I know how to do, which isn't always quick or elegant. Sometimes for me, the logic of a loop allows me to write a functioning code to do what I want it to do, which allows me to move on to the next steps.

I find it interesting that the For Loop and Recode options are slightly slower for JMP 16 versus 15, but the Get Rows Where improved by cutting the time almost in half, that's a substantial change.

Thanks!,

DS

jthi · Apr 30, 2021 02:53 PM

I might have been wary of using loops at some point but after taking one algorithm course (which I never did finish..) a little less so. You just have to be efficient with them, quite often you loop way too much (time complexity and Big O notation). Also using loops is way easier than trying to solve problems with pure mathematics.

With JMP I usually just go with the safe options I have learned at some point, as the speed is rarely an issue. But when the time is an issue (one application took over 6 minutes to open at work) I will start looking for ways to make it faster (managed to make it start in less than 2 minutes). In this case I managed to make it faster by using Loc() with different combinations of other functions (delete row, set row state).

In addition to Loc(), using Summarize() seems to be "way" faster when compared to my long favorite option to getting unique values in column (associative array() << get keys). When the row count increases the Associative array() << get keys gets slower and slower and Summarize() doesn't seem to suffer from same issue, at least not as fast. So I have tried to start using the Summarize() more (never know how tall the datatables get).

-Jarmo

ms · May 2, 2021 01:03 PM

Interesting discussion! The data table matrix notation, which I think first appeared in JMP 13, is quite efficient for assignment.

For this particular problem it's almost 500 times faster than the for loop (0,003 vs 1.5 seconds, JMP 15 Mac laptop).

start = HP Time();
For( i = 5, i <= N Cols( dt ), i++,
	Try( dt[Loc( dt[0, i], 999 ), i] = . )
);
Show( HP Time() - start);

The try() is needed to avoid an error for columns without "999".

Looping over columns is still necessary because direct assignment for the whole matrix, which I for a moment thought would be even faster, does not work. Error: "Multiple subscripts require single subscripts at each inner level in access or evaluation...", which on second thought makes sense.

// The left Data Table matrix syntax is OK on its own, but not for assignment. 
dt[0, 5 :: N Col( dt )][Loc( dt[0, 5 :: N Col( dt )], 999 )] = .;

jthi · May 2, 2021 02:14 PM

Went maybe a bit too deep on this (but at least got some kind of an idea how to do this type of testing later if I have to).

I wrote a some sort of testing script with different expressions to test different ways of replacing the 999 with missing. I tested with two datasets (Semiconductor Capability.jmp and Probe.jmp) and changing the amount of missing values in columns.

Loc() is fastest at replacing values (no surprise there). Missing Value Codes() is "always" fast but it has a bit more limited use cases when compared to Loc() and It is a bit more of a hassle to use if there would be multiple values to replace.

Fastest ended up being loc_2d (didn't bother changing variable names from this: Using Loc With a 2D Matrix ). Still have to loop to get values set (at least to my knowledge), so most likely the reason for this being faster than just Loc() is that it will skip some columns. Below are expr_loc2d and expr_loc:

expr_loc = Expr(
	for(i=5,i<= ncols(testData),i++,
		column(testData,i)[Loc(testData[0,i], 999)]=.;
	);
);

expr_loc_2d = Expr(
	mat = testData[0,5::NCols(testData)]; 
	nr = N Rows(mat);
	nc = N Cols(mat);
	blacklocs = Loc(mat, 999) - 1;
	rowlocs = Floor((blacklocs) / nc ) + 1;
	collocs = Mod((blacklocs), nc ) + 1 + 4;
	uniqCols = Associative Array(collocs) << get keys; 
	For(i_2d = 1, i_2d <= N Items(uniqCols), i_2d++,
		testData[rowlocs[Loc(collocs, uniqCols[i_2d])], uniqCols[i_2d]] = .; 
	);
);

Also it seems to be important to use AsColumn(Column(dt, i)) like @Ryan_Gilmore said (didn't even bother testing << get rows where without it, because it will be very slow), but I did test recode with and without Column(). Also just noticed that I missed one Column() from recode, so run couple of quick tests (not attached in data or script):

Fairly big difference between these three:

expr_recode_ascolumn = Expr(
	For(i = 5, i <= N Cols(testData), i++,
		testData << Recode Column(
			As Column(dt, i),
			{If( _rcNow == 999, ., _rcNow)},
			Target Column(As Column(testData, i))
		)
	);
);

expr_recode_ascolumn_column = Expr(
	For(i = 5, i <= N Cols(testData), i++,
		testData << Recode Column(
			As Column(dt, i),
			{If( _rcNow == 999, ., _rcNow)},
			Target Column(As Column(Column(testData, i)))
		)
	);
);

expr_recode_ascolumn_column_column = Expr(
	For(i = 5, i <= N Cols(testData), i++,
		testData << Recode Column(
			As Column(Column(dt, i)),
			{If( _rcNow == 999, ., _rcNow)},
			Target Column(As Column(Column(testData, i)))
		)
	);
);

The test script and some result data attached. Script will mostly likely have some mistakes here and there. Should be fairly easy to add new expressions if someone wants to do so. One thing I didn't test at all is using << Begin/End data update.

-Jarmo

Should you Loop through a data table or use Recode, or use Get Rows Where to change values in a data table

Re: Should you Loop through a data table or use Recode, or use Get Rows Where to change values in a data table

Re: Should you Loop through a data table or use Recode, or use Get Rows Where to change values in a data table

Re: Should you Loop through a data table or use Recode, or use Get Rows Where to change values in a data table

Re: Should you Loop through a data table or use Recode, or use Get Rows Where to change values in a data table

Re: Should you Loop through a data table or use Recode, or use Get Rows Where to change values in a data table

Re: Should you Loop through a data table or use Recode, or use Get Rows Where to change values in a data table

Re: Should you Loop through a data table or use Recode, or use Get Rows Where to change values in a data table

Re: Should you Loop through a data table or use Recode, or use Get Rows Where to change values in a data table

Re: Should you Loop through a data table or use Recode, or use Get Rows Where to change values in a data table