Solved: Re: Alphabetical Ranking of Nominal Data

SpannerHead

If I have a column of nominal data where the inputs can occur more than once, is there a way to have a column whereby the output is the alphabetical ranking of the input? I can go the scripting route but I'd prefer a simple column formula.

Slán

SpannerHead

txnelson · | Posted in reply to message from SpannerHead 12-13-2024

It is late in the day, so my solution is probably not the best, however using the Big Class data table, Here is the formula I came up with

As Constant( Summarize( theLevels = by( :age ) ) );
Contains( theLevels, Char( :age ) );

Jim

View solution in original post

txnelson · | Posted in reply to message from SpannerHead 12-13-2024

It is late in the day, so my solution is probably not the best, however using the Big Class data table, Here is the formula I came up with

As Constant( Summarize( theLevels = by( :age ) ) );
Contains( theLevels, Char( :age ) );

Jim

jthi · | Posted in reply to message from SpannerHead 12-13-2024

Until we get Add dense ranking to Ranking Tie and Col Rank functions I would use something like txnelson did show in simple cases. There are all sorts of solutions using Col statistical functions but they tend to be very difficult to understand as you have to combine multiple of them and inside of each other (you can find some from the comments of that wish list item).

-Jarmo

hogi · | Posted in reply to message from jthi 12-14-2024

On the other hand -
wich such a clever workaround I fear that it will take some while till dense ranking gets implemented.

Thanks Jim!!

hogi · | Posted in reply to message from hogi 12-14-2024

dt =  Open( "$SAMPLE_DATA/Big Class Families.jmp" );

for(i=1, i<16, i++, dt<< Concatenate(dt,append to first table()));
wait(0);
show(nrows(dt));

Print("std rank");
t0 = hptime();
New Column( "std Rank",
	Formula(
	Col Rank( :height)
	)
);
dt << run formulas();
show((hptime()-t0)/1000000);


Print("row_min rank median");
t0 = hptime();
New Column( "dense_Rank_ref1",
	Formula(
		Col Median(
			Col Rank( If( Row() == Col Minimum( Row(), :height ), :height, . ) ),
			:height
		)
	)
);
dt << run formulas();
show((hptime()-t0)/1000000);



print("summarize_contains");
t0 = hptime();
New Column( "dense rank_summarize_contains",

	Formula(
		As Constant( Summarize( theLevels = by( :height ) ) );
		Contains( theLevels, Char( :height ) );
	)
);
dt << run formulas();
show((hptime()-t0)/1000000);


print("row_min cumulative sum");
t0 = hptime();
print("- just works after sorting");
dt << Sort(	By( :height ),	Replace Table);
New Column( "dense_Rank_ref1",

	Formula(

Col Cumulative Sum(If(Row() == Col Min(Row(), :height), 1,0))
	)
);
dt << run formulas();
show((hptime()-t0)/1000000);

close(dt, nosave);

hogi · Dec 14, 2024 3:38 PM

factor 2 faster: a detour via summary table:

dt =  Open( "$SAMPLE_DATA/Big Class Families.jmp" );

for(i=1, i<16, i++, dt<< Concatenate(dt,append to first table()));

t0 = hptime();
dtsum = dt << Summary(	Group( :height ),private);
dtsum  << New Formula Column(
	Operation( Category( "Distributional" ), "Rank" ),
	Columns( :height )
);
dtsum << run formulas();

dt  << Update(
	With(dtsum),
	Match Columns( :height = :height ),
	Add Columns from Update Table( :"Rank[height]"n ),
	Replace Columns in Main Table( None )
);

show((hptime() - t0)/1000000)

This approach can be easily extended to calculate dense ranking also for subsets (groupBy).

hogi · Dec 14, 2024 4:18 PM

When you create a summary table instead of using Summarize, one can @txnelson approach to calculate dense ranks for subsets.

the disadvantage: quite slow

names default to here(1);
Try(close(dt,nosave));
dt =  Open( "$SAMPLE_DATA/Big Class Families.jmp" );

for(i=1, i<16, i++, dt<< Concatenate(dt,append to first table()));

t0 = hptime();

//dense rank - with groupby
dt << New Column( "dense rank_ grouped by",
Character, 
	formula (
	local({tmp,myAge},
	as constant (
			dt = Current Data Table();
			dtsum = dt << summary( Group( :age, :height ) );
			//dtsum:age << set name(age2);
			nr = N Rows( dt );
		);
		rw = row();
		myage = dt:age[row()];
		Eval (Eval Expr(myrows = dtsum << get rows where( dtsum:age == Expr(myAge)))); // <- use dtsum:age !!!
		tmp = contains(dtsum[myrows,2], dt:height[rw]); // <- use dt:height !!!!
		if(row()==nr, close(dtsum, noSave));
		tmp

)));

show((hptime() - t0)/1000000)

names default to here(1); Try(close(dt,nosave)); dt = Open( "$SAMPLE_DATA/Big Class Families.jmp" ); for(i=1, i<16, i++, dt<< Concatenate(dt,append to first table())); t0 = hptime(); //dense rank - with groupby dt << New Column( "dense rank_ grouped by", Character, formula ( local({tmp,myAge}, as constant ( dt = Current Data Table(); dtsum = dt << summary( Group( :age, :height ) ); //dtsum:age << set name(age2); nr = N Rows( dt ); ); rw = row(); myage = dt:age[row()]; Eval (Eval Expr(myrows = dtsum << get rows where( dtsum:age == Expr(myAge)))); // <- use dtsum:age !!! tmp = contains(dtsum[myrows,2], dt:height[rw]); // <- use dt:height !!!! if(row()==nr, close(dtsum, noSave)); tmp ))); show((hptime() - t0)/1000000)

hogi · | Posted in reply to message from hogi 12-14-2024

Better put everything into As Constant and stop the "formula" before the main part begins.

... super fast.

... with the help of a small GUI: quite flexible

... and via the main command that is allowed to sleep forever: even dynamic : )

Names Default To Here( 1 );
Try(
	dt = Current Data Table(),
	Caption( "no dat table" );
	Stop();
);


updateLBox = Function( {t},
	(t << sib) << setitems( (t << topparent)[listboxbox( 1 )] << getselected )
);

nw = New Window( "rank unique entries",
	<<Type( "Modal Dialog" ),
	<<return result,
	Border Box( Left( 10 ), Right( 10 ), bottom( 5 ),
		V List Box(
			Lineup Box( N Col( 1 ), spacing( 10 ), 
				// Text Box( "XG Boost Specification" ), 
				// colistbox only necessary if the script is standalone
				H List Box(
					Panel Box( "Select Columns", Col List Box( all, nlines( 15 ) ) ),
					Panel Box( "Cast Selected Columns into Roles",
						Lineup Box( N Col( 2 ),
							Button Box( "Unique", <<setfunction( Function( {t}, updateLBox( t ) ) ) ),
							Cunique = Col List Box( nlines( 2 ), minitems( 1 ), maxitems( 1 ) ),
							Button Box( "By Group", <<setfunction( Function( {t}, updateLBox( t ) ) ) ),
							CBy = Col List Box( nlines( 2 ), minitems( 0 ) )
						)
					)
				)
			),
			Spacer Box( size( 1, 10 ) ),
			Button Box( "OK",
				<<setfunction(
					Function( {t},
						ColsUnique = Cunique << get items;
						ColsBy = Cby << get items;
					)
				)
			),
			Spacer Box( size( 1, 10 ) ),

		)
	)
);

If( Not( nw["Button"] == 1 ),
	Stop()
);


myCol = dt << New Column( "Rank_" || Char( ColsUnique ) || If( N Items( ColsBy ), "_by_ " || Char( ColsBy ), "" ) );
colName = mycol << get name();

groupExpr = Expr( Group() );
For Each( {col}, ColsUnique, Insert Into( groupExpr, Name Expr( As Column( col ) ) ) );
For Each( {col}, ColsBy, Insert Into( groupExpr, Name Expr( As Column( col ) ) ) );
mainCommand = Substitute( Name Expr( groupExpr ), Expr( Group() ), Expr( Concat Items() ) );

If( N Items( ColsUnique ) == 1,
	columnsExpr = Eval Expr( Columns( Expr( Name Expr( As Column( ColsUnique[1] ) ) ) ) ),
	Caption( "multiple columns not implemented yet" );
	Stop();
);
GroupByExpr = Expr( Group By() );
For Each( {col}, ColsBy, Insert Into( groupByExpr, Name Expr( As Column( col ) ) ) );

Eval(
	Eval Expr(
		myCol << set formula(

			As Constant(
				dt0 = Current Data Table();
				dtsum = dt0 << Summary( Expr( Name Expr( groupExpr ) ) );
				newCol = dtsum << New Formula Column(
					Operation( Category( "Distributional" ), "Rank" ),
					Expr(
						Name Expr( Columns Expr )
					),
					Expr(
						Name Expr( GroupByExpr )
					),
				);
				newcol[1] << set name( Expr( colName ) );
				dtsum << run formulas();

				dt0 << Update(
					With( dtsum ),
					Match Columns( Expr( Parse( "dt0:" || ColsUnique[1] || " = dtsum:" || ColsUnique[1] ) ) ),
					Add Columns from Update Table( none ),
					Replace Columns in Main Table( Expr( colName ) )
				);
				Close( dtsum, noSave );
				Stop();
			);
			Expr(
				Name Expr( mainCommand ) // will never execute - ut needed for the automatic update : )
			)

			;
		)
	)
);

dt << run formulas();

Workflows\BuildingBlocks\ColumnOperations\RankUnique_with_ByGroup.jmpflow

SpannerHead · | Posted in reply to message from hogi 12-15-2024

hogi

This one's really clever. I can enter multiple values in the By Group if I hold down the Ctrl button but it only operates on the last entry for the number of inputs. Am I doing something wrong?

Slán

SpannerHead