Re: How to identify mins of maxes across data groups

BHarris

Suppose I took all the students in Big Class and put them in groups by gender and age, e.g. 12yo boys in one group, 12yo girls in another, etc.

With that setup, I'd like to create a formula column that has a 1 if that group has the shortest maximum height for their age across both genders, and a 0 otherwise.

For example, in this data:

Group	Name	Height	NEW COLUMN
12yo Girls	KATIE	59	0
	JANE	55	0
	JACLYN	66	0
	LILLIE	52	0
12yo Boys	TIM	60	1
	JAMES	61	1
	ROBERT	51	1

The maximum height for girls is 66 and for boys is 61, so the boys group has the shortest maximum height for their age, so the new formula column would have a 1 for Tim, James, and Robert, and a zero for Katie, Louise, Jane, etc.

Is this possible in a formula column?

(I can share what I've tried, but I'm trying to avoid the xyproblem, http://xyproblem.info here...)

hogi · Dec 10, 2024 1:29 PM

You can use the Col ... aggregations.

edit: if somebody asked for the gender with the student with smallest height per age group:
First use Col Min to find the student with the minimum height per age.
Then use Col Max to spread this knowledge among all students with the same sex.

so:

New Column( "myCol",
	Formula( Col Maximum( :height == Col Minimum( :height, :age ), :sex, :age ) ),
)

As you asked for
a formula column that has a 1 if that group has the shortest maximum height for their age across both genders, and a 0 otherwise
the approach has to be corrected to

New Column( "myCol",
	Formula(
		Col Maximum(
			:height == Col Min( Col Maximum( :height, :age, :sex ), :age ),
			:sex,
			:age
		)
	)
)

along the idea of:

what is the maximum height per age and gender
take the values and find the minimum (max) height per age
which student?
[ah, no, we don't want a random student, we just want a max height student!]
-> replace :height with Col Maximum( :height, :age, :sex ) *)
~~use col max to spread the 1 to all other students with the same sex and age~~
~~set the other gender to 0 - or 1, if there is a boy and a girl with the same height.~~
this part we don't need anymore
... and we ended up exactly with Jarmo's solution (see below)

hogi · | Posted in reply to message from hogi 12-10-2024

X-Y problem :
thanks : )

BHarris · | Posted in reply to message from hogi 12-10-2024

Wow, that seems to work, but my brain cannot even fathom how this works...

Let me try to unpack it here and maybe you can help me understand --

Starting with: Col Maximum( :height == Col Minimum( :height, :age ), :sex, :age )

Col Minimum( :height, :age ) --> returns the minimum height for all students of this age

:height == Col Minimum( :height, :age ) --> returns true/false (1/0) if this row's height column matches that minimum height

Col Maximum( <1 or 0 for this row> , :sex, :age ) --> uh... this makes no sense, I thought column maximum works by grouping the data by the group-by columns (:sex, :age), and then finds the column maximum of the values or column in the first argument -- which now is just a single true/false or 1/0 -- and then returns the corresponding max value for this row's group, i.e. this row's values for :sex and :age.

Obviously that's wrong, what am I missing here?

jthi · | Posted in reply to message from BHarris 12-10-2024

Many ways of doing this using for example different combinations of col statistical functions (like already shown). Here other option (I think it should do what you are looking for)

Col Min(Col Max(:height, :age, :sex), :age) == Col Max(:height, :age, :sex)

First Col Min() looks for the Col Max() across the age, sex groups OVER age and that is then compared to age,sex group maximums and if it is same as the age groups minimum -> value is 1.

To understand these easier, I think it is beneficial to just write them all out as separate ones and to make them easier to understand, in this case:

First column with Col Max(:height, :age, :sex) (max_height_for_agesex)
Next column then utilizes that Col Min(:max_height_for_agesex, :age) age_minmax_age
And the final one is the comparison of these two :max_height_for_agesex == :age_minmax_age

-Jarmo

BHarris · | Posted in reply to message from jthi 12-10-2024

Thanks @jthi -- I'm still confused though.

Starting with this:

Col Min(Col Max(:height, :age, :sex), :age) == Col Max(:height, :age, :sex)

When JMP evaluates the "Col Max(:height, :age, :sex)", what does it return? A single value? or a list?

I can create a new column with that as the formula and it produces a list of values that end up in the rows of the column.

But within a formula, it seems to be returning a single value, e.g.:

a = col max(:height, :name);
type(a);

... returns "Number".

And if I create a new script and enter:

a = col max(:height, :name); 
show(a);

it says "invalid row number", but without the ", :name" it works fine returning 70. It seems like JMP is doing some kind of context-specific meaning with these functions. How does this work? Is it explained anywhere?

hogi · Dec 12, 2024 2:59 PM

when you run the formula of a formula column in a cell, it calculates the value for the current row.
When you run it in JSL, it calculates it for the current row.

Sounds the same. But the default value for "current row" is "row 0" - i.e. no valid numbers.

You can also check the calculation for any row - different from the default "0".
Just put a row()=xy in front of your calculation and the calculation will be done for row xy.

so, e.g.

row()=5;	
col max(:height, :age)

will return 66.
:age in row=5 is 12 and the maximum height for age =12 is 66.

hogi · | Posted in reply to message from hogi 12-12-2024

I guess it works like this:

Col ... first runs through all the rows and stores the results of the first argument.
It calculates it row by row - e.g. if the first argument is random uniform().

In addition, it stores a lookup table for the additional group by arguments.
Then it splits the data into subsets, calculates the aggregated values - and then it distributes the aggregated values back into the respective rows.

jthi · | Posted in reply to message from BHarris 12-12-2024

When JMP evaluates the "Col Max(:height, :age, :sex)", what does it return? A single value? or a list?

Single value.

If you create a script and wish to test what is going on, you have to utilize Row() IF you are using by columns (without by columns it will work as JMP knows to use all rows). Row() will be 0 by default in JMP, so you have to change it for example to Row() = 1. I think I have seen some documentation about this in JMP Help or scripting index, but no idea where (could be related to Row(), For Each Row or Col statistical functions).

Names Default To Here(1); 

dt = open("$SAMPLE_DATA/Big Class.jmp");

max_height = Col Max(:height); // works fine as no by var used and uses all rows

Show(Row());
Col Max(:height, :age); // won't work as JMP doesn't know which age to use -> it needs row information

Row() = 1;
cur_age = :age;
cur_maxheight = Col Max(:height, :age);
Write("\!Max height for age ", cur_age, " is ", cur_maxheight);

-Jarmo

BHarris · | Posted in reply to message from jthi 12-13-2024

It sounds like you're saying that JMP is wrapping the entire column-formula in a loop that iterates "Row()" over the rows of the table, calculating a value for each row.

That *would* make sense, except for this expression:

Col Min(Col Max(:height, :age, :sex), :age)

... which by that logic, when e.g. Row() is set to i=1, then it would calculate Col Max() as the largest value in the :height column of the whole table where :age[i] == :age[1] and :sex[i] == :sex[1], which returns a single number. Then it tries to calculate Col Min() of that single number, which by definition can only ever be that number. I'm really struggling to make that make sense.

I keep thinking that Col Max() must be returning a list, then the Col Min() could be looking at all the rows of that list where :age[i] == :age[1].

But my idea here requires that JMP do something similar to what perl does, basically keeping track of whether something is being called in scalar or list context, i.e. if a function is used in a way that expects a list back, it returns a list, otherwise it returns a scalar. That way Col Max() here could return a list for Col Min(), which in turn would return a scalar value.

I'm guessing I'm missing something here and can't put my finger on it.