☐ cool new feature
☑ could help many users!
☐ removes a „bug“
☐ nice to have
☐ nobody needs it
#myTop10_2022
There is a set of posts with questions how to restrict the Col ... aggregation functions to a specific subset.
Compute Column Mean while restricting observations used
JMP10: Col Max() and Col Min() functions returned the data from excluded row
Col Functions and Row States
How do I use the Col Maximum Formula with a "where" condition
Exclude data in formula
Application cases:
- excluded rows from the aggregation:
Col Max(if(not(excluded()), :weight,.),: sex)
or even shorter (if one doesn't need values in the excluded rows);
Col Max(:weight, :sex, :excluded())
- Benchmark some data with a reference data set (country A in this post)
:data / Col Maximum(if(:country="A", :data,.))
As one can see in the examples above, the easy solution is an If statement inside the Col xyz aggreagtion:
col Mean( if(:sex=="M", :weight,.), :age);
col Mean( if(not(excluded()), :weight,.), :age)
The problem:
The approach is EXTREMELY slow - if the subset if-condition contains a string comparison:
How-do-I-use-the-Col-Maximum-Formula-with-a-where/M78599
Then the time scales quadratically with the number of rows and even for small data sets with just a few 1000 entries the calculation takes seconds.
For data sets with some million entries it would take months (!!!) - but will be stopped by a timeout and an error message pops up. Not nice!
My wish:
fix the bug
Even better:
Provide an additonal where option for Col xyz aggreagtions
- which fixes the timing problem automatically: first restricts the data set, then applys the aggregation (--> much faster, even if the cool "col..." speedup has a bug)
- which make subset definitions easier,
instead of (*), you just need:
col Mean(:weight, :age, where(:sex=="M"))
col Mean(:weight, :age, where(not(excluded())))
timing examples, copied from here
corrected version - in a prev. version the screenshot showed [ms] instead of [s]
more wishes submitted by