Choose Language Hide Translation Bar
Highlighted

## Summary statistics with multiple binning columns.

I often summarize a column by some combination of other categorical columns using a formula column.  For example in the Detergent sample data I might want to know the average count for each combination of softness and temperature. Below are three methods to do this, which one I use depends on the size of the data table and who my audience is.  Is there an easier way?

I shy away from an add-in containing a custom 'Col Mean by Multiple Columns()' function due to portability issues.

Tabluate

Create summary statistics in a new table using tabulate and then join the results to the original data table.

Pro: Easy to make very complex summary statistics.

Con: Not live, need to refresh the data if a referenced column changes during an analysis, no way for users to follow the source of the data.

Binning Formula

Create a binning formula and then use the second argument of the Col Mean() function to refrence that bin.

Pro: Easy to set up and relatively easy for users to follow.

Con: Creates extra columns that clutter the data table and have no use except for this formula, especially when creating many summary columns.

``````dt = Open( "\$sample_data/Detergent.jmp" );

//binning formula
dt << New Column(
"Bin by Temp and Softness",
Character,
"Nominal",
Formula( :softness || "-" || :temperature )
);

//summary statistic
dt << New Column(
"Mean Count by Temp and Softness Binned",
Numeric,
"Continuous",
Format( "Best", 12 ),
Formula( Col Mean( :count, :Bin by Temp and Softness) )
);``````

Identify bins and calculate summary statistic using JSL inside a formula

Identify similar rows in the data table then calculate the summary statistic on that data.  I suspect JMP does not cache results using this method so the whole table is re-evaluted for every row. In any case it is slow for large data tables.

Pro: Tons of flexibility

Cons: Slow calculations, very hard for users to follow.

``````//summary statistic by bin
dt << New Column(
"Mean Count by Temp and Softness Direct",
Numeric,
"Continuous",
Format( "Best", 12 ),
Formula(
matches = Function( {a, b},
out = {};
For( i = 1, i <= N Items( a ), i++,
found = "no";
For( j = 1, j <= N Items( b ), j++,
If( a[i] == b[j],
found = "yes"
)
);
If( found == "yes",
out = Insert( out, a[i] )
);
);
out;
);
soft = Loc( :softness << get values(), :softness );
temp = Loc( :temperature << get values(), :temperature );
Mean( (:count << get values())[matches( soft, temp )] );
)
);``````
1 ACCEPTED SOLUTION

Accepted Solutions
Highlighted  julian Community Manager

## Re: Column formula to summarize column by multiple columns

Hi @ih,

I find myself doing just this thing pretty often. One way that works for me is to use a column formula like Col Mean(:column, byvariable), and for the second parameter I use a concatenation of the two categorical columns directly (rather than creating any additional columns in the table). For the detergent example you mentioned this would be:

``Col Mean( :count, :softness || :temperature )``

Does this do what you need or is there something else you're looking for?

@julian

3 REPLIES 3
Highlighted  julian Community Manager

## Re: Column formula to summarize column by multiple columns

Hi @ih,

I find myself doing just this thing pretty often. One way that works for me is to use a column formula like Col Mean(:column, byvariable), and for the second parameter I use a concatenation of the two categorical columns directly (rather than creating any additional columns in the table). For the detergent example you mentioned this would be:

``Col Mean( :count, :softness || :temperature )``

Does this do what you need or is there something else you're looking for?

@julian

Highlighted

## Re: Column formula to summarize column by multiple columns

This is perfect, and so obvious once you point it out.

Thanks!

Highlighted  julian Community Manager

## Re: Column formula to summarize column by multiple columns

Great, I'm glad that helps! I should have mentioned that in recent versions of JMP (I think since JMP 11) you can have multiple column parameters directly in ColMean(). So, rather than concatenating you could have:

``Col Mean( :count, :softness, :temperature )``

I'm so used to concatenating that I still do that. Old habits die hard :)

Additionally, in JMP 12 and on, with Instant Formulas in the data table you can select those columns in the table, right-click one of their headers, select New Formula Column > Group By. Then, select the count column, right-click the header, select New Formula Column > Aggregate > Mean This will design the formula for you automatically; very friendly for your new users.

Article Labels

There are no labels assigned to this post.