Solved: What is the fastest way to populate a new column with character values?

mikedriscoll · Jul 29, 2016 08:31 PM

Hi, I have a script that does an advanced table concatenation, and I want to add 'source columns' to indicate the original table names. I have a few methods that work but all are slow for large tables. I also tried a formula method but it wasn't quick either. Is there some other trick I can use?

Example below very closely matches what I'm doing. Data table sizes of 1M rows are not unusual, so I'd really like to get this operation down into the low seconds for large tables.

Thanks,

Mike

dt = new table();

dtList = {dt};

rowCount = 50000;

dt << add rows(rowCount);

time2 = today();

if(0,

dt << new column("test", character, set each value(eval(dtList[1]) << get name)); // 24 sec at 50k rows.

,

tempList = {};

for(i = 1, i <= rowCount, i++, tempList[i] = eval(dtList[1]) << get name); //104 sec at 50k rows.

dt << new column("test", character, values(templist));

);

show(eval(today()-time2));

Craige_Hales · Oct 18, 2016 8:32 PM

start = Tick Seconds();

dt = New Table();

Eval(

Eval Expr(

dt << New Column( "test",

Character,

"Nominal",

Formula( Expr( dt << getname ) )

)

);

dt << addrows( 1e7 );

dt << runformulas;

stop = Tick Seconds();

Show( stop - start );

stop - start = 3.80000000004657; // 10,000,000 rows in 4 seconds

The eval( evalExpr( ... expr( ... ) ... )) pattern is one way to manipulate an expression in JMP. To understand it, start with EvalExpr( ... ) which looks at its argument to find any Expr( ... ). the Expr parts are evaluated and the result (a string in this case) is left in place of the Expr. The result of EvalExpr is an expression, which needs to be evaluated by the outer eval(...)

There might be a faster way; I didn't explore very far.

Craige

View solution in original post

Craige_Hales · Oct 18, 2016 8:32 PM

start = Tick Seconds();

dt = New Table();

Eval(

Eval Expr(

dt << New Column( "test",

Character,

"Nominal",

Formula( Expr( dt << getname ) )

)

);

dt << addrows( 1e7 );

dt << runformulas;

stop = Tick Seconds();

Show( stop - start );

stop - start = 3.80000000004657; // 10,000,000 rows in 4 seconds

The eval( evalExpr( ... expr( ... ) ... )) pattern is one way to manipulate an expression in JMP. To understand it, start with EvalExpr( ... ) which looks at its argument to find any Expr( ... ). the Expr parts are evaluated and the result (a string in this case) is left in place of the Expr. The result of EvalExpr is an expression, which needs to be evaluated by the outer eval(...)

There might be a faster way; I didn't explore very far.

Craige

mikedriscoll · Jul 30, 2016 09:05 AM

Thanks Craige. After I posted my original message, it occurred to me that JMP's native concatenate() tables function has the 'create source column' option, and I was able to find the code to this using the standard concatenate feature and then click on the red triangle / source script to see how it worked. This option doesn't seem to be documented anywhere. My script works with two copies of the original tables, so this intermediate solution required a little manipulation to get it back to the original table names. Had it been the original tables this would have been a good solution.

Your method is fairly clean, and certainly is fast. The need for something like this comes up fairly often so I'll bookmark this one. I think it is time I learned the value of those function combinations. I've seen them used, I've used them when I've been told it's how to get something to work properly, but I rarely seem to be able to recognize when to use it. I was close this time... I thought I got the formula working (mentioned in OP) but looking at the code I remember it didn't work because the expression just showed up in the formula. I figured I would need something like this but wasn't sure how to do it.

Is there some rule of thumb regarding which types of functions can / should be wrapped in expressions to optimize my scripts? Or would you recommend just running it and doing a timing analysis and fixing what's slow, which is what I do?

Thanks again,

Mike

PS thanks for reminding me about tick seconds(). I knew there was a better function than today() for that.

Craige_Hales · Jul 30, 2016 10:14 AM

There is an easier way than my first example, at least for this case.

start = Tick Seconds();

dt = New Table();

dt << New Column( "test", Character, "Nominal", Formula( Eval( dt << getname ) ) );

dt << addrows( 1e7 );

dt << runformulas;

stop = Tick Seconds();

Show( stop - start );

The Formula argument of the NewColumn message to a data table was originally designed to be very easy to use for the common cases, something like

Formula( height + weight )

where height and weight are data table columns and the addition should be done on each row. So, Formula does not evaluate its argument, but just stores the unevaluated expression as the column's formula.

Unless...the outer function is Eval(...) which you see in the easier example above. When the formula(...) argument is processed, it looks for the Eval and uses the evaluated result. There are other places in JMP that do this too. And, there are other places, like the substitute function, that use Expr(...) to prevent an evaluation of expression arguments.

Craige

What is the fastest way to populate a new column with character values?

Re: What is the fastest way to populate a new column with character values?

Re: What is the fastest way to populate a new column with character values?

Re: What is the fastest way to populate a new column with character values?

Re: What is the fastest way to populate a new column with character values?