cancel
Showing results for 
Show  only  | Search instead for 
Did you mean: 
  • Learn how to build custom Python data connectors and further customize JMP’s Data Connector Framework with the Python Data Connector Demo, available now in the JMP Marketplace!
  • See how to create experiments to support product design and ID useful product features. Register for June 12 webinar, 2pm US Eastern Time.

JMP Wish List

We want to hear your ideas for improving JMP. Share them here.
Choose Language Hide Translation Bar

Col Median (and others): add "Where" option

☐ cool new feature
☑ could help many users!

☐ removes a „bug“

☐ nice to have

☐ nobody needs it

 

#myTop10_2022

 

 

There is a set of posts with questions how to restrict the Col ... aggregation functions to a specific subset.

Compute Column Mean while restricting observations used 

JMP10: Col Max() and Col Min() functions returned the data from excluded row 

Col Functions and Row States 

How do I use the Col Maximum Formula with a "where" condition 

Exclude data in formula 

 

Application cases:

  • excluded rows from the aggregation:
    Col Max(if(not(excluded()), :weight,.),: sex)
    or even shorter (if one doesn't need values in the excluded rows);
    Col Max(:weight, :sex, :excluded())
  • Benchmark some data with a reference data set (country A in this post) 
    :data / Col Maximum(if(:country="A", :data,.))

 

  • calculate relative performance over time (divide all values at time t>=t0  by values at t=t0)
    (here the Col Mean is actually to get the value for :time=0 to the other rows.:)
    : data / Col Mean(if(:time=0, :data, 0) :batch, :sampleID)

 

As one can see in the examples above, the easy solution is an If statement inside the Col xyz aggreagtion:

 

col Mean( if(:sex=="M", :weight,.), :age);
col Mean( if(not(excluded()), :weight,.), :age)

 

 

The problem:

The approach is EXTREMELY slow - if the subset if-condition contains a  string comparison:
How-do-I-use-the-Col-Maximum-Formula-with-a-where/M78599

Then the time scales quadratically with the number of rows and even for small data sets with just a few 1000 entries the calculation takes seconds.
For data sets with some million entries it would take months (!!!) - but will be stopped by a timeout and an error message pops up. Not nice!

 

My wish: 

fix the bug

 

Even better:

Provide an additonal where option for Col xyz aggreagtions

  • which fixes the timing problem automatically: first restricts the data set, then applys the aggregation (--> much faster, even if the cool "col..." speedup has a  bug)
  • which  make subset definitions easier,
    instead of (*), you just need:

 

col Mean(:weight, :age, where(:sex=="M"))
col Mean(:weight, :age, where(not(excluded())))

 

 

timing examples, copied from here

hogi_0-1688758905875.png

corrected version - in a prev. version the screenshot showed [ms] instead of [s]

 

more wishes submitted by  hogi_2-1702196401638.png

8 Comments
Status changed to: Acknowledged

Hi @hogi, thank you for your suggestion! We have captured your request and will take it under consideration.

SamGardner
Level VII
Status changed to: Investigating
 
SamGardner
Level VII
Status changed to: Under Consideration

@hogi this is being looked at closely as we work on ways to improve how Where( ) clauses are used throughout JMP.  Thanks for the suggestion!

hogi
Level XIII

thanks :)

hogi
Level XIII

for the time till where(...) will be available in a future version of Jmp, one can speed up the the col aggregation by orders of magnitude via the workaround
changing the data type to numeric - or use a numeric formula column.
https://community.jmp.com/t5/Discussions/Tips-and-Tricks-best-practice-with-JMP-JSL/td-p/662686 

 

hogi
Level XIII

Thanks, @SamGardner  , @Sarah-Sylvestre , it's available now in JMP18 : )


Before changing the status to Delivered, could you please check if the usage of Where is like in tended in this Snippet?

Compared to the original approach is extremely slow : (

dt = Open( "$SAMPLE_DATA/Airline Delays.jmp" );

Print("old aproach");
t0 = hptime();
New Column( "max - 1st of month",
	Formula( Col Maximum( if(:Day of Month == 1 , :Arrival Delay,.), :Month) )
);
dt << run formulas();
show((hp time()-t0)/1000000); // 0.5s

Print("new  aproach - with Where");
t0 = hptime();
New Column( "max - 1st of month",
	Formula( Col Maximum( :Arrival Delay, :Month, Where( :Day of Month == 1 ) ) )
);
dt << run formulas();
show((hp time()-t0)/1000000); // 140s 

 

hogi
Level XIII
Print("new  aproach - with Where, semi-bugfix");
t0 = hptime();
 myWhere = Where( :Day of Month == 1 ) ;
Eval (Eval Expr(New Column( "max - 1st of month",
	Formula( Col Maximum( :Arrival Delay, :Month, Expr(myWhere) )
))));
dt << run formulas();
show((hp time()-t0)/1000000); // 40s - better but still way too slow ...
hogi
Level XIII

Ah, doesn't work yet  : (

 

Names Default to Here(1);
dt = Open( "$SAMPLE_DATA/Big Class.jmp" );

New Column( "max - 1st of month",
	Formula( Col Minimum( :weight, :sex, Where( :age ==17 ) ) )
);