Should you Loop through a data table or use Recode, or use Get Rows Where to cha... - Page 3

txnelson · Jun 10, 2023 4:29 PM

A recent Community Discussion script to find and replace a numeric value with a missing value in the whole tablehad a very nice entry provided by @SDF1 that was selected as the solution

names default to here(1);
dt=current data table();
For( i = 1, i <= N Rows( dt ), i++,
	For( l = 1, l <= N Cols( dt ), l++,
		If( Column( l )[i] == 999,
			Column( l )[i] = .
		)
	)
);

When I viewed the solution, an old teaching popped into my brain that roughly stated is, "Don't loop through a data table. Looping is slow. Use the JMP built in capabilities."

So I started out to modify @SDF1 code.

I determined that my fallback standard of "Get Rows Where()" would be faster. After all, it allows all found rows to be set with one assignment statement, after the "very fast" Get Rows Where() has found the rows.

Another solution, that I have seen ~~pushed~~ recommended by the community is the Recode facility in JMP.

So:

Hypothesis: The built in JMP functions of Get Rows Where and Recode will be faster than Looping down and across an entire data table.

I selected the Semiconductor Capability data table as a good sized table, 132 columns and 1455 rows

I wrote a little script that sets 10 random values in each column to 999.

names default to here(1);
dt=current data table();
for(i=5,i<=ncols(dt),i++,
	if(col stddev(ascolumn(dt,i))!=0,
	for(k=1,k<=10,k++,
		column(dt,i)[randominteger(1,1455)]=999;
	)
));

Names Default To Here( 1 );
dt = Current Data Table();
start = HP Time();
For( i = 5, i <= N Cols( dt ), i++,
	dt << Recode Column(
		As Column( dt, i ),
		{If( _rcNow == 999, ., _rcNow )},
		Target Column( As Column( dt, i ) )
	)
);
show(hptime()-start)

This is the original value to be changed in the Discussion mentioned above. The 999 is to be changed to missing (.) by the scripts.

Method:

Startup JMP 16.
Read in the Semiconductor Capability data table
Run the script to set the 999 values in each column
Repeat the above 2 steps for each of the 3 methods and recode the time down to the millisecond.
Repeat the above steps for JMP15

Here is the script for the Recode

Names Default To Here( 1 );
dt = Current Data Table();
start = HP Time();
For( i = 5, i <= N Cols( dt ), i++,
	dt << Recode Column(
		As Column( dt, i ),
		{If( _rcNow == 999, ., _rcNow )},
		Target Column( As Column( dt, i ) )
	)
);
show(hptime()-start)

The code for the Get Rows Where

names default to here(1);
dt=current data table();
start=hptime();
for(i=5,i<= ncols(dt),i++,
	column(dt,i)[dt<<get rows where(ascolumn(dt,i)==999)]=.;
	if(mod(i,10)==0,show(i));
);
show(hptime()-start)

And here are the surprising results

The For Loop wins hands down. And from the results, the Get Rows Where should be used sparingly.

Comments and corrections are welcomed......

Jim

hogi · Dec 12, 2022 2:16 PM

hi @ron_horne , the problem is with the get rows where, so speed should be/ is independent of such settings.

On the other hand, it makes a huge difference if the column index is explicit or if it's a variable (even if the variable has an assigned value).

This was not clear to me - how good that I found this conversation!

I don't know how many jsl codes out there could be faster by a factor of ~ N(rows of data table) ...

Can the issue be fixed by what goes on in the background of JSL?

Or is there a fundamental something in the JSL architecture which specifies that such a behavior is intended and cannot be changed?

If so: a warning message could help coders not to be trapped by this issue ...

Names Default To Here( 1 );
dt = Open( "$SAMPLE_DATA/Semiconductor Capability.jmp" );

For( k = 1, k <= 4, k++,
	dt << Concatenate( dt, dt, dt, Append to first table )
);

nuke = Function( {},
	For( k = 1, k <= 10, k++,
		Column( dt, 5 )[Random Integer( 1, 1455 )] = 999
	)
);

//dt << show window(0);
//dt<< Begin Data Update();

nuke();

start = HP Time();
i = 5;
dt << get rows where( As Column( dt, i ) == 999 );
Show( (HP Time() - start) / 1000000 );

nuke();

start = HP Time();
dt << get rows where( As Column( dt, 5 ) == 999 );
Show( (HP Time() - start) / 1000000 );


//dt << End Data Update();

hogi · May 30, 2023 2:07 AM

Hm, perhaps it's possible to switch JSL to a user-friendly mode (which pre-evaluates variables if this leads to better performance)?
wish: get rows where: speed issue

If a user is aware if the issue, a very cheap solution is to add an Eval():

start = HP Time();
for(i=5,i<= ncols(dt),i++,
dt << get rows where( As Column( dt, Eval(i) ) == 999 );
);
Show( (HP Time() - start) / 1000000 );start = HP Time();
i = 5;
dt << get rows where( As Column( dt, Eval(i) ) == 999 );
Show( (HP Time() - start) / 1000000 );

hogi · Dec 13, 2022 2:05 PM

@hogi wrote:
...
If a user is aware if the issue, a very cheap solution is to add an Eval()

cheap regarding complexity of the JSL code vs. time penalty.

But for longer tables Eval(i) loses signifantliy agains Substitute(... ,NameExpr(As Column()))).

So: still quite fast (compared to the original version :), but not as fast as I would have expected.

Why? What's the fundamental difference?

Should you Loop through a data table or use Recode, or use Get Rows Where to change values in a data table

Re: Should you Loop through a data table or use Recode, or use Get Rows Where to change values in a data table

Re: Should you Loop through a data table or use Recode, or use Get Rows Where to change values in a data table

Re: Should you Loop through a data table or use Recode, or use Get Rows Where to change values in a data table