topic Re: Column formula to summarize column by multiple columns in Discussions

Summary statistics with multiple binning columns.

ih — Fri, 13 Oct 2017 16:46:15 GMT

I often summarize a column by some combination of other categorical columns using a formula column. For example in the Detergent sample data I might want to know the average count for each combination of softness and temperature. Below are three methods to do this, which one I use depends on the size of the data table and who my audience is. Is there an easier way?

I shy away from an add-in containing a custom 'Col Mean by Multiple Columns()' function due to portability issues.

Tabluate

Create summary statistics in a new table using tabulate and then join the results to the original data table.

Pro: Easy to make very complex summary statistics.

Con: Not live, need to refresh the data if a referenced column changes during an analysis, no way for users to follow the source of the data.

Binning Formula

Create a binning formula and then use the second argument of the Col Mean() function to refrence that bin.

Pro: Easy to set up and relatively easy for users to follow.

Con: Creates extra columns that clutter the data table and have no use except for this formula, especially when creating many summary columns.

dt = Open( "$sample_data/Detergent.jmp" ); 

//binning formula
dt << New Column(
	"Bin by Temp and Softness",
	Character,
	"Nominal",
	Formula( :softness || "-" || :temperature )
);

//summary statistic
dt << New Column(
	"Mean Count by Temp and Softness Binned",
	Numeric,
	"Continuous",
	Format( "Best", 12 ),
	Formula( Col Mean( :count, :Bin by Temp and Softness) )
);

Identify bins and calculate summary statistic using JSL inside a formula

Identify similar rows in the data table then calculate the summary statistic on that data. I suspect JMP does not cache results using this method so the whole table is re-evaluted for every row. In any case it is slow for large data tables.

Pro: Tons of flexibility

Cons: Slow calculations, very hard for users to follow.

//summary statistic by bin
dt << New Column(
	"Mean Count by Temp and Softness Direct",
	Numeric,
	"Continuous",
	Format( "Best", 12 ),
	Formula(
		matches = Function( {a, b},
			out = {};
			For( i = 1, i <= N Items( a ), i++,
				found = "no";
				For( j = 1, j <= N Items( b ), j++,
					If( a[i] == b[j],
						found = "yes"
					)
				);
				If( found == "yes",
					out = Insert( out, a[i] )
				);
			);
			out;
		);
		soft = Loc( :softness << get values(), :softness );
		temp = Loc( :temperature << get values(), :temperature );
		Mean( (:count << get values())[matches( soft, temp )] );
	)
);

Re: Column formula to summarize column by multiple columns

jules — Fri, 13 Oct 2017 16:47:59 GMT

Hi @ih,

I find myself doing just this thing pretty often. One way that works for me is to use a column formula like Col Mean(:column, byvariable), and for the second parameter I use a concatenation of the two categorical columns directly (rather than creating any additional columns in the table). For the detergent example you mentioned this would be:

Col Mean( :count, :softness || :temperature )

Does this do what you need or is there something else you're looking for?

@jules

Re: Column formula to summarize column by multiple columns

ih — Fri, 13 Oct 2017 16:55:42 GMT

This is perfect, and so obvious once you point it out.

Thanks!

Re: Column formula to summarize column by multiple columns

jules — Fri, 13 Oct 2017 17:01:45 GMT

Great, I'm glad that helps! I should have mentioned that in recent versions of JMP (I think since JMP 11) you can have multiple column parameters directly in ColMean(). So, rather than concatenating you could have:

Col Mean( :count, :softness, :temperature )

I'm so used to concatenating that I still do that. Old habits die hard :)

Additionally, in JMP 12 and on, with Instant Formulas in the data table you can select those columns in the table, right-click one of their headers, select New Formula Column > Group By.

Then, select the count column, right-click the header, select New Formula Column > Aggregate > Mean

This will design the formula for you automatically; very friendly for your new users.