Re: Should you Loop through a data table or use Recode, or use Get Rows Where to... - Page 2

txnelson · Jun 10, 2023 4:29 PM

A recent Community Discussion script to find and replace a numeric value with a missing value in the whole tablehad a very nice entry provided by @SDF1 that was selected as the solution

names default to here(1);
dt=current data table();
For( i = 1, i <= N Rows( dt ), i++,
	For( l = 1, l <= N Cols( dt ), l++,
		If( Column( l )[i] == 999,
			Column( l )[i] = .
		)
	)
);

When I viewed the solution, an old teaching popped into my brain that roughly stated is, "Don't loop through a data table. Looping is slow. Use the JMP built in capabilities."

So I started out to modify @SDF1 code.

I determined that my fallback standard of "Get Rows Where()" would be faster. After all, it allows all found rows to be set with one assignment statement, after the "very fast" Get Rows Where() has found the rows.

Another solution, that I have seen ~~pushed~~ recommended by the community is the Recode facility in JMP.

So:

Hypothesis: The built in JMP functions of Get Rows Where and Recode will be faster than Looping down and across an entire data table.

I selected the Semiconductor Capability data table as a good sized table, 132 columns and 1455 rows

I wrote a little script that sets 10 random values in each column to 999.

names default to here(1);
dt=current data table();
for(i=5,i<=ncols(dt),i++,
	if(col stddev(ascolumn(dt,i))!=0,
	for(k=1,k<=10,k++,
		column(dt,i)[randominteger(1,1455)]=999;
	)
));

Names Default To Here( 1 );
dt = Current Data Table();
start = HP Time();
For( i = 5, i <= N Cols( dt ), i++,
	dt << Recode Column(
		As Column( dt, i ),
		{If( _rcNow == 999, ., _rcNow )},
		Target Column( As Column( dt, i ) )
	)
);
show(hptime()-start)

This is the original value to be changed in the Discussion mentioned above. The 999 is to be changed to missing (.) by the scripts.

Method:

Startup JMP 16.
Read in the Semiconductor Capability data table
Run the script to set the 999 values in each column
Repeat the above 2 steps for each of the 3 methods and recode the time down to the millisecond.
Repeat the above steps for JMP15

Here is the script for the Recode

Names Default To Here( 1 );
dt = Current Data Table();
start = HP Time();
For( i = 5, i <= N Cols( dt ), i++,
	dt << Recode Column(
		As Column( dt, i ),
		{If( _rcNow == 999, ., _rcNow )},
		Target Column( As Column( dt, i ) )
	)
);
show(hptime()-start)

The code for the Get Rows Where

names default to here(1);
dt=current data table();
start=hptime();
for(i=5,i<= ncols(dt),i++,
	column(dt,i)[dt<<get rows where(ascolumn(dt,i)==999)]=.;
	if(mod(i,10)==0,show(i));
);
show(hptime()-start)

And here are the surprising results

The For Loop wins hands down. And from the results, the Get Rows Where should be used sparingly.

Comments and corrections are welcomed......

Jim

brady_brady · May 3, 2021 11:18 PM

Here is a way to avoid looping, although there is overhead in creating a matrix and assigning back to the table after. This approach performs quickly on this data, about the same as the loc_2d, but as the table becomes huge, I would expect its relative performance to suffer, but I've not verified this.

expr_mat = expr(
	m = dt[0, 5 :: N Col( dt )];
	m[Loc( m, 999 )] = .;
	dt[0, 5 :: N Col( dt )] = m;
);

ian_jmp · May 4, 2021 08:26 AM

For the 'overhead in creating a matrix' I suspect that data table subscripting might help. @Craige_Hales would know.

brady_brady · May 4, 2021 12:38 PM

Yes... the hurdle is that while matrices can be accessed with a single index (which presumes row-major layout), data tables cannot. Thus, there is sadly no analog to

m[loc(m, 9999)] = .

where m is a pxq matrix, in the data table world... hence the lines preceding and following this one in the code above. While the code is still fast, it would be REALLY fast if single-index access were available for tables.

ian_jmp · May 6, 2021 2:13 AM

Yes, I understand @brady_brady . One could do this, however:

NamesDefaultToHere(1);

// Make an example table with some rogue '999' values
nr = 10;		// Number of rows
nc = 10;		// Number of columns
p = 0.1;		// Fraction of rogue values
val = 999;		// Rogue value
m = J(nr, nc, RandomInteger(1, 10));
m = Shape(m, nr*nc, 1);
m[RandomIndex(nr*nc, Floor(nr*nc*p))] = val;
m = Shape(m, nr, nc);
dt = AsTable(m);
dt << setName("Rogue Values");
Wait(3);

// Set rogue values to '.'
mask = J(nr*nc, 1, 1);
mask[Loc(dt[0,0] == val)] = .;		
mask = Shape(mask, nr, nc);
dt[0,0] = mask :* dt[0,0];

But I very much doubt this would be quicker than what you already have, not least because of the element wise multiplication.

hogi · Jun 5, 2023 03:23 PM

@jthi wrote:
Went maybe a bit too deep on this (but at least got some kind of an idea how to do this type of testing later if I have to).

I wrote a some sort of testing script with different expressions to test different ways of replacing the 999 with missing.
....

I just used @jthi 's script:
https://community.jmp.com/t5/Discussions/Should-you-Loop-through-a-data-table-or-use-Recode-or-use-G...
to reproduce the performance of the different approaches on a system with Jmp (16.2 & ) 17.1.

I highlighted some approaches (incl. loc) which nowadays have to be inside begin/end data update *)
to get rid of a severe time increase with increasing percentage fraction of 999s.

Differences by factor 2 could be caused by the different computers.
Who has a system with Jmp 15 and 17 - for a direct comparison?

*) although the subset was opened "invisible".

ErraticAttack · May 4, 2021 08:17 PM

You've got the comparison a bit wrong here -- the "Get Rows Where" functionality will evaluate it's expression on each row, thus you need to minimize any superfluous lookups before running the loop -- in particular, your iterator is external and it will have to do some scope walking to find it on each row. Also, it will evaluate the AsColumn function on each row, looking up both 'dt' and 'i' each time. This is _very_ expensive. The proper way to to this would be as follows:

names default to here(1);
dt=current data table();
start=hptime();
names = dt << Get Column Names( "string" );
for(i=5,i<= ncols(dt),i++,
	Eval( Parse( Eval InserT( JSL Quote(
	column(dt,i)[dt<<get rows where(:name("^names[i]^")==999)]=.;
	) ) ) );
);
show((hptime()-start)/1000000);

You will notice that this executes on the order of 3000x faster, which should put it comparable to if not better than the other methods you have

Jordan

hogi · Dec 10, 2022 02:14 PM

Ah, I didn't know:

:name() acts like As Column()?

Not so clear from the Scripting index

So, with Substitute and As column() instead of Eval(Parse(Eval Insert(JSL Quote(name("^names[i]^")it is:

start=hptime();
for(i=5,i<= ncols(dt),i++,
	Eval(Substitute(Expr(column(dt,i)[dt<<get rows where(__col__==999)]=.),
			Expr(__col__),
			Name Expr(As Column(dt, i))));
);
show((hptime()-start)/1000000)

And after this surprising finding I am almost surprised here that the get rows where is also fast if the number comparison is replaced by a character comparison

start=hptime();
for(i=5,i<= ncols(dt),i++,
	Eval(Substitute(Expr(column(dt,i)[dt<<get rows where(char(__col__)=="999")]=.),
			Expr(__col__),
			Name Expr(As Column(dt, i))));
	if(mod(i,10)==0,show(i));
);
show((hptime()-start)/1000000)

julian · Dec 10, 2022 05:15 PM

This has been such a fun and informative discussion to follow! Brilliant answers. I am truly not qualified to add anything of value here so I'm going to resort to "cheating."

The OP question was about replacing values of 999 in the data with missing because the 999 are missing value codes and we wouldn't want those included in calculations. It's not at all in the spirit of this very interesting line of questioning, but my contribution is that exploiting the missing value code column property is faster than doing any replacement on the data and achieves the same thing for *most* purposes

start = HP Time();
For( i = 5, i <= N Cols( dt ), i++,
	Column(dt,i) << Set Property( "Missing Value Codes", {999} )
);
Show( HP Time() - start);

~ 0.007 seconds on my m1 mac.

hogi · Dec 11, 2022 03:04 AM

Wow, what a wonderful trick

I feared that problems with Missing Value Code will start as soon as you try:

Col Maximum(:NPN3)

[As we know that excluded rows have to be excluded in Col aggregation by hand (or better: by GroupBy argument).]

But the surprise: Col aggregations respect Missing Value Codes

To be picky: the OP question of the original post was quite implicit:

And from the results, the Get Rows Where should be used sparingly.
Comments and corrections are welcomed......

translated - with an uneasy feeling in my belly - as: "why on earth?!?"

Thanks to @ErraticAttack for solving the puzzle

ron_horne · Dec 11, 2022 02:32 PM

thank you @txnelson
could you please estimate the effect of using invisible and private tables on the different methods? they must be the fastest way to amend any script for boosting performance.

Should you Loop through a data table or use Recode, or use Get Rows Where to change values in a data table

Re: Should you Loop through a data table or use Recode, or use Get Rows Where to change values in a data table

Re: Should you Loop through a data table or use Recode, or use Get Rows Where to change values in a data table

Re: Should you Loop through a data table or use Recode, or use Get Rows Where to change values in a data table

Re: Should you Loop through a data table or use Recode, or use Get Rows Where to change values in a data table

Re: Should you Loop through a data table or use Recode, or use Get Rows Where to change values in a data table

Re: Should you Loop through a data table or use Recode, or use Get Rows Where to change values in a data table

Re: Should you Loop through a data table or use Recode, or use Get Rows Where to change values in a data table

Re: Should you Loop through a data table or use Recode, or use Get Rows Where to change values in a data table

Re: Should you Loop through a data table or use Recode, or use Get Rows Where to change values in a data table

Re: Should you Loop through a data table or use Recode, or use Get Rows Where to change values in a data table