
Group by - sooo useful!!! how about Interpolate?

☑ cool new feature
☑ could help many users!

☐ removes a "bug"

☐ nice to have

☐ nobody needs it

 

What inspired this wish list request? 

The Group By concept in JMP is sooo useful!

You can find it everywhere. E.g. in Graph Builder, one can use X/Y grouping to generate split graphs - and summary statistics will automatically give the correct results for every single graph.

In platform menus, there is a By option, which splits the data into groups; every group is analyzed separately.

Under the porcelain layer, there are more Group By options, like the Col aggregation functions (e.g. Col Median(:height, :sex)), which automatically return separate aggregated values for groups of rows.
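As a minimal sketch of this behavior (using the Big Class sample table that ships with JMP), the grouping argument makes each row receive the aggregate of its own group:

```jsl
// Sketch using the Big Class sample data: the second argument of
// Col Mean acts as a Group By, so each row gets its group's mean.
dt = Open( "$SAMPLE_DATA/Big Class.jmp" );
dt << New Column( "mean height by sex",
	Formula( Col Mean( :height, :sex ) )
);
// Rows with :sex == "F" show the female mean, "M" rows the male mean.
```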

 

What is the improvement you would like to see? 

Please use such Group By options wherever possible.

Let's see if other users find more places where Group By could be useful ...

 

Here I want to focus on a grouped Col Interpolate function which could be used in a column formula like

New Column("interpolated", Formula(Col Interpolate(:newX, :height, :weight, by(:sex))))

It should generate an interpolation of :weight vs. :height - split by the subgroups M/F in :sex - and use the interpolation to calculate values at the x values in :newX.
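Until such a function exists, a per-group workaround with the existing Interpolate() might look like this sketch (assuming, as in the wish above, a table with columns :newX, :height, :weight, :sex, and :height sorted ascending within each group, as Interpolate requires):

```jsl
// Hedged sketch of a manual per-group interpolation - roughly what
// the proposed Col Interpolate(:newX, :height, :weight, by(:sex))
// would automate. Assumes columns :newX, :height, :weight, :sex.
dt = Current Data Table();
rowsF = dt << Get Rows Where( :sex == "F" );
xF = dt[rowsF, "height"];
yF = dt[rowsF, "weight"];
rowsM = dt << Get Rows Where( :sex == "M" );
xM = dt[rowsM, "height"];
yM = dt[rowsM, "weight"];
dt << New Column( "interpolated (manual)",
	Set Each Value(
		If( :sex == "F",
			Interpolate( :newX, xF, yF ),
			Interpolate( :newX, xM, yM )
		)
	)
);
```

Note how the grouping logic has to be written out by hand for every level of :sex - exactly the boilerplate a by() argument would remove.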

 

In a similar way, the existing Interpolate function could be lifted to a new level by providing a By() argument which can be used to interpolate subgroups.
[For this case, I think an additional For() argument is needed to specify the groups of interest and their order.]

Grouped Interpolate 

 

Interpolate(myheight, :height, :weight, by(:sex), for({"F","M"}))

 

Definitely worth having - which can be seen from the fact that users have invested the time to program their own grouped Interpolate to get these results:
https://community.jmp.com/t5/Discussions/Grouped-Interpolate/m-p/692358/highlight/true#M87807 

 

Why is this idea important? 

Group By is one of the concepts which makes JMP a million times more useful for exploratory data analysis than Excel.

Use this concept in as many places in the program as possible.

 

 

 

 


8 Comments
hogi
Level XII

Besides a simple interpolation, one could also think about fancier fits to the data - along the idea of saving a Spline formula to a data table - such that the calculated values can be used for further processing.

The disadvantage of saving the spline [as is already possible at the moment]:
it's static!

If values in the table are changed or added, the formula will not update.
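For illustration, a hedged sketch of the current, static approach: compute spline coefficients once with Spline Coef() and evaluate them with Spline Eval(). The coefficients are frozen at creation time, so the column will not react to changed or added data (the Big Class columns and lambda = 0.05 are arbitrary choices here):

```jsl
// Current, static approach (sketch): the coefficients are computed
// once, so the result does NOT update when the table changes.
dt = Open( "$SAMPLE_DATA/Big Class.jmp" );
coef = Spline Coef( dt:height << Get Values, dt:weight << Get Values, 0.05 );
dt << New Column( "spline weight (static)",
	Set Each Value( Spline Eval( :height, coef ) )
);
```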

 

 

 

 

 

 

Status changed to: Acknowledged
 
hogi
Level XII

Other functions which would benefit from a By variable:
Dif() and Lag() formula functions by column group (byVar argument)
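A hedged sketch of the current workaround for a grouped Lag, for comparison (it only works if the table is sorted by the grouping column):

```jsl
// Workaround sketch: accept Lag() only when the previous row belongs
// to the same group - requires the table to be sorted by :sex first.
dt = Open( "$SAMPLE_DATA/Big Class.jmp" );
dt << Sort( By( :sex ), Replace Table );
dt << New Column( "lag height by sex",
	Formula( If( Lag( :sex ) == :sex, Lag( :height ), . ) )
);
// The first row of each group gets a missing value.
```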

XanGregg
Staff

Hi @hogi , 

 

I'm reviewing the Col XXX functions in connection with the transform scoping, where they hook in well. Can you elaborate on the usefulness of Col Interpolate? My examples are a bit contrived and some would work better applying a fitted formula rather than interpolating against nearest values. Probably I'm missing something.

 

One use case I have is to compute a relative change against some benchmark, such as:

 

dt = Open( "$SAMPLE_DATA/SATbyyear.jmp" );
dt << New Column( "math relative", Formula( :SATMath / Col Interpolate(1992, :Year, :SATMath, :State)));

However, you could get the same result with existing functions (except Col Interpolate might better handle not all groups having the same base year).

 

 

dt << New Column( "math relative 1", Formula( :SATMath / Col Mean( If( :year==1992, :SATMath,.), :State)));

The ab.jmp example from the community discussion seems uncommon, and grouped Interpolate is not that natural there, at least as far as I've come up with (having to call it twice and not using any grouping):

dtAB << New Column( "diff", Formula(
Col Interpolate(:time, :time, if(:source table=="a",:position,.)) - Col Interpolate(:time, :time, if(:source table=="b",:position,.))));

It does save on performance by not re-gathering the X and Y arrays for each row, and maybe that's sufficient.

 

Also, thanks for linking to the Lag/Dif item. I'm also looking at a solution for that. Currently targeting:

dt << New Column( "Grouped Lag Height", Formula( Col At( :height, -1, :sex, <<relative( 1 ) ) ) );

 

 

hogi
Level XII

Hi @XanGregg 

thanks for picking up the thread. I think it can be very useful! Even with a simple interpolate : )

 

I agree with you - compared to a grouped interpolate function, a grouped fit function would be much more powerful (I did not dare to ask : )
And I agree: for some applications, Interpolate can be wrong.

 

If this is the case, I use Graph Builder with overlay. I save the formula and then use the saved formula to calculate the values at the desired x-values. A JSL function would be very practical here - but it also has a disadvantage. Actually, the "detour" via Graph Builder is NO detour - it's essential. I wouldn't dare to skip it. The graphical feedback is needed to set the fitting parameter to the right value - to TRUST the fit. Without this feedback, lambda can be wrong - and the user won't notice.

Interpolation lacks the freedom to set a parameter to a reasonable value, but it also lacks the danger of setting it completely wrong.

 

In this respect, Interpolate makes more sense than one might think ...

Especially in places where the result is already OKish, but Interpolate could make it even better.

[we are talking about grouped COL functions. The advantage of grouping is obvious and does not need to be emphasized. So I will spare you the increased complexity and show you a simple example:]

Fuzzy Merge - with and without interpolation.
And compare with curve fitting - I used the Smoother function in the plot and used a wrong value for lambda.

hogi_1-1737585152859.png

 

In our case, we have measurement curves as input, with negligible jitter along Y.

The issue: not all curves were measured at the same points.
The goal: generate a data set where different curves can be divided and subtracted.


To take your example:
one data set has data for 1992, another data set has data for 1990 and 1995.

I am sure, there are many more application cases.

Names default to here(1);
dt = New Table( "Untitled",
	Add Rows( 10 ),

	New Column( "x", Numeric, "Continuous", Format( "Best", 12 ), Formula( 10 * Row() ) ),
	New Column( "measured",	Formula( ((:x + 5 * :x) - :x ^ 2 * 100) + :x ^ 3 ))
);

dtnew = new table ("new", 
	add rows(100),
	
	new column ("x", 	set each value(row())),
	New Column( "expected",	Formula( ((:x + 5 * :x) - :x ^ 2 * 100) + :x ^ 3 ))

);

wait(0);
fuzzymerge = dtnew << Concatenate(dt);


// Sort data table
fuzzymerge << Sort( By( :x ), Replace Table );

new column ("previous", set each value(if(not(is missing(:measured)), :measured, lag(:previous))));

dt = current data table();
myRows = where(dt,  not (is missing( :measured )) );
myX = dt[myRows, "x"];
myY = dt[myRows, "measured"];

new column("interpolated", set each value(interpolate(:x, myX, myY)));

Graph Builder(
	Size( 645, 399 ),
	Summary Statistic( "Median" ),
	Graph Spacing( 4 ),
	Variables(
		X( :x ),
		Y( :expected ),
		Y( :interpolated, Position( 1 ) ),
		Y( :previous, Position( 1 ) ),
		Y( :measured, Position( 1 ) )
	),
	Elements(
		Points( X, Y( 1 ), Y( 2 ), Y( 3 ), Y( 4 ), Legend( 11 ) ),
		Smoother(
			X,
			Y( 1 ),
			Y( 2 ),
			Y( 3 ),
			Y( 4 ),
			Legend( 12 ),
			Lambda( 0.829468695095263 )
		)
	),
	SendToReport(
		Dispatch( {}, "400", ScaleBox,
			{Legend Model(
				11,
				Base( 0, 0, 0, Item ID( "expected", 1 ) ),
				Base( 1, 0, 0, Item ID( "interpolated", 1 ) ),
				Base( 2, 0, 0, Item ID( "previous", 1 ) ),
				Properties( 0, {Line Color( -4354269 )}, Item ID( "expected", 1 ) )
			), Legend Model(
				12,
				Properties( 0, {Transparency( 0 )}, Item ID( "Smooth(expected)", 1 ) ),
				Properties(
					1,
					{Transparency( 0 )},
					Item ID( "Smooth(interpolated)", 1 )
				),
				Properties( 2, {Transparency( 0 )}, Item ID( "Smooth(previous)", 1 ) )
			)}
		),
		Dispatch( {}, "400", LegendBox,
			{Legend Position( {11, [0, 1, 2, -3], 12, [-1, -1, -1, 3]} )}
		)
	)
);
hogi
Level XII

Thanks for Col At, nice.
Does << relative(1) mean Dif() - or a ratio?
Seems that https://marketplace.jmp.com/appdetails/Normalization+GUI could be simplified in the near future - or even removed?


Besides "Scoping", one might think of an order parameter for Col Functions like Col At.

The idea:

Lag implicitly uses Row() as the "timeline" to pick the previous entry.
But there might be a specific column in the data set, :myTime, which IS the timeline.
... and unfortunately, the table is not sorted by :myTime.

 

So, one could think of:

Col At (:height, -1, :sex, << order(:mytime))

 

<< order() could accept scoped values like [X].

 

E.g. the user defines a universal probability plot - and uses it in combination with a column switcher.
Whichever column is selected, thanks to the scoped value used as the first argument and << order, the plot will be right.

XanGregg
Staff

Thanks for the elaborations and example, @hogi !

By "relative" I meant that the position (second argument) is treated as relative to the current row. Otherwise -1 means last row.

order sounds interesting and would help with my SAT example where the rows are in reverse time order.

hogi
Level XII

@XanGregg , for the scoping options, one could go one step further and also add @Selected.

 

With the current functionality, one can use Selected() as a grouping variable. But this splits the data between selected and unselected points.
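For reference, a minimal sketch of that current behavior - Selected( Row State() ) used as a grouping variable in a Col function (the selection criterion here is arbitrary):

```jsl
// Sketch: the selection state acts as a two-level grouping variable,
// splitting rows into selected (1) and unselected (0).
dt = Open( "$SAMPLE_DATA/Big Class.jmp" );
dt << Select Where( :sex == "F" );  // arbitrary selection for illustration
dt << New Column( "group mean height",
	Formula( Col Mean( :height, Selected( Row State() ) ) )
);
```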

With a scoping parameter @selected, one could restrict the plot to just the selected data points.

E.g. in a graph, the user selects some data points - and gets a fit curve for just those points - reads the slope, and continues with his exploratory data analysis.
Invested time: ~2 seconds.

 

With the current functionality, one can add an If( Selected(), ... ) - but this is much more complicated than activating:

hogi_0-1740915505080.png