cancel
Showing results for 
Show  only  | Search instead for 
Did you mean: 
Choose Language Hide Translation Bar
0 Kudos

Col Median (and others): add "Where" option

☐ cool new feature
☑ could help many users!

☐ removes a „bug“

☐ nice to have

☐ nobody needs it

#myTop10_2022

 

There is a set of posts with questions how to restrict the Col ... aggregation functions to a specific subset.

Compute Column Mean while restricting observations used 

JMP10: Col Max() and Col Min() functions returned the data from excluded row 

Col Functions and Row States 

How do I use the Col Maximum Formula with a "where" condition 

Exclude data in formula 

 

Application cases:

  • excluded rows from the aggregation:
    Col Max(if(not(excluded()), :weight,.),: sex)
    or even shorter (if one doesn't need values in the excluded rows);
    Col Max(:weight, :sex, :excluded())
  • Benchmark some data with a reference data set (country A in this post) 
    :data / Col Maximum(if(:country="A", :data,.))

 

  • calculate relative performance over time (divide all values at time t>=t0  by values at t=t0)
    (here the Col Mean is actually to get the value for :time=0 to the other rows.:)
    : data / Col Mean(if(:time=0, :data, 0) :batch, :sampleID)

 

As one can see in the examples above, the easy solution is an If statement inside the Col xyz aggreagtion:

 

col Mean( if(:sex=="M", :weight,.), :age);
col Mean( if(not(excluded()), :weight,.), :age)

 

 

The problem:

The approach is EXTREMELY slow - if the subset if-condition contains a  string comparison:
How-do-I-use-the-Col-Maximum-Formula-with-a-where/M78599

Then the time scales quadratically with the number of rows and even for small data sets with just a few 1000 entries the calculation takes seconds.
For data sets with some million entries it would take months (!!!) - but will be stopped by a timeout and an error message pops up. Not nice!

 

My wish: 

fix the bug

 

Even better:

Provide an additonal where option for Col xyz aggreagtions

  • which fixes the timing problem automatically: first restricts the data set, then applys the aggregation (--> much faster, even if the cool "col..." speedup has a  bug)
  • which  make subset definitions easier,
    instead of (*), you just need:

 

col Mean(:weight, :age, where(:sex=="M"))
col Mean(:weight, :age, where(not(excluded())))

 

 

timing examples, copied from here

hogi_0-1688758905875.png

corrected version - in a prev. version the screenshot showed [ms] instead of [s]

 

more wishes submitted by  hogi_2-1702196401638.png

8 Comments
Status changed to: Acknowledged

Hi @hogi, thank you for your suggestion! We have captured your request and will take it under consideration.

SamGardner
Level VII
Status changed to: Investigating
 
SamGardner
Level VII
Status changed to: We like it! (in the queue)

@hogi this is being looked at closely as we work on ways to improve how Where( ) clauses are used throughout JMP.  Thanks for the suggestion!

hogi
Level XII

thanks

hogi
Level XII

for the time till where(...) will be available in a future version of Jmp, one can speed up the the col aggregation by orders of magnitude via the workaround
changing the data type to numeric - or use a numeric formula column.
https://community.jmp.com/t5/Discussions/Tips-and-Tricks-best-practice-with-JMP-JSL/td-p/662686 

 

hogi
Level XII

Thanks, @SamGardner  , @Sarah-Sylvestre , it's available now in JMP18 : )


Before changing the status to Delivered, could you please check if the usage of Where is like in tended in this Snippet?

Compared to the original approach is extremely slow : (

dt = Open( "$SAMPLE_DATA/Airline Delays.jmp" );

Print("old aproach");
t0 = hptime();
New Column( "max - 1st of month",
	Formula( Col Maximum( if(:Day of Month == 1 , :Arrival Delay,.), :Month) )
);
dt << run formulas();
show((hp time()-t0)/1000000); // 0.5s

Print("new  aproach - with Where");
t0 = hptime();
New Column( "max - 1st of month",
	Formula( Col Maximum( :Arrival Delay, :Month, Where( :Day of Month == 1 ) ) )
);
dt << run formulas();
show((hp time()-t0)/1000000); // 140s 

 

hogi
Level XII
Print("new  aproach - with Where, semi-bugfix");
t0 = hptime();
 myWhere = Where( :Day of Month == 1 ) ;
Eval (Eval Expr(New Column( "max - 1st of month",
	Formula( Col Maximum( :Arrival Delay, :Month, Expr(myWhere) )
))));
dt << run formulas();
show((hp time()-t0)/1000000); // 40s - better but still way too slow ...
hogi
Level XII

Ah, doesn't work yet  : (

 

Names Default to Here(1);
dt = Open( "$SAMPLE_DATA/Big Class.jmp" );

New Column( "max - 1st of month",
	Formula( Col Minimum( :weight, :sex, Where( :age ==17 ) ) )
);