Turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

- JMP User Community
- :
- Discussions
- :
- Summary statistics with multiple binning columns.

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page

Highlighted

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

Created:
Oct 13, 2017 9:34 AM
| Last Modified: Oct 13, 2017 9:46 AM
(5237 views)

I often summarize a column by some combination of other categorical columns using a formula column. For example in the Detergent sample data I might want to know the average count for each combination of softness and temperature. Below are three methods to do this, which one I use depends on the size of the data table and who my audience is. Is there an easier way?

I shy away from an add-in containing a custom 'Col Mean by Multiple Columns()' function due to portability issues.

**Tabluate**

Create summary statistics in a new table using tabulate and then join the results to the original data table.

Pro: Easy to make very complex summary statistics.

Con: Not live, need to refresh the data if a referenced column changes during an analysis, no way for users to follow the source of the data.

**Binning Formula**

Create a binning formula and then use the second argument of the Col Mean() function to refrence that bin.

Pro: Easy to set up and relatively easy for users to follow.

Con: Creates extra columns that clutter the data table and have no use except for this formula, especially when creating many summary columns.

```
dt = Open( "$sample_data/Detergent.jmp" );
//binning formula
dt << New Column(
"Bin by Temp and Softness",
Character,
"Nominal",
Formula( :softness || "-" || :temperature )
);
//summary statistic
dt << New Column(
"Mean Count by Temp and Softness Binned",
Numeric,
"Continuous",
Format( "Best", 12 ),
Formula( Col Mean( :count, :Bin by Temp and Softness) )
);
```

**Identify bins and calculate summary statistic using JSL inside a formula**

Identify similar rows in the data table then calculate the summary statistic on that data. I suspect JMP does not cache results using this method so the whole table is re-evaluted for every row. In any case it is slow for large data tables.

Pro: Tons of flexibility

Cons: Slow calculations, very hard for users to follow.

```
//summary statistic by bin
dt << New Column(
"Mean Count by Temp and Softness Direct",
Numeric,
"Continuous",
Format( "Best", 12 ),
Formula(
matches = Function( {a, b},
out = {};
For( i = 1, i <= N Items( a ), i++,
found = "no";
For( j = 1, j <= N Items( b ), j++,
If( a[i] == b[j],
found = "yes"
)
);
If( found == "yes",
out = Insert( out, a[i] )
);
);
out;
);
soft = Loc( :softness << get values(), :softness );
temp = Loc( :temperature << get values(), :temperature );
Mean( (:count << get values())[matches( soft, temp )] );
)
);
```

1 ACCEPTED SOLUTION

Accepted Solutions

Highlighted

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

Hi @ih,

I find myself doing just this thing pretty often. One way that works for me is to use a column formula like Col Mean(:column, byvariable), and for the second parameter I use a concatenation of the two categorical columns directly (rather than creating any additional columns in the table). For the detergent example you mentioned this would be:

`Col Mean( :count, :softness || :temperature )`

Does this do what you need or is there something else you're looking for?

3 REPLIES 3

Highlighted

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

Hi @ih,

I find myself doing just this thing pretty often. One way that works for me is to use a column formula like Col Mean(:column, byvariable), and for the second parameter I use a concatenation of the two categorical columns directly (rather than creating any additional columns in the table). For the detergent example you mentioned this would be:

`Col Mean( :count, :softness || :temperature )`

Does this do what you need or is there something else you're looking for?

Highlighted
##

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

Re: Column formula to summarize column by multiple columns

This is perfect, and so obvious once you point it out.

Thanks!

Highlighted
##

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

Re: Column formula to summarize column by multiple columns

Great, I'm glad that helps! I should have mentioned that in recent versions of JMP (I think since JMP 11) you can have multiple column parameters directly in ColMean(). So, rather than concatenating you could have:

`Col Mean( :count, :softness, :temperature )`

I'm so used to concatenating that I still do that. Old habits die hard :)

Additionally, in JMP 12 and on, with Instant Formulas in the data table you can select those columns in the table, right-click one of their headers, select New Formula Column > Group By.

Then, select the count column, right-click the header, select New Formula Column > Aggregate > Mean

This will design the formula for you automatically; very friendly for your new users.

Article Labels

There are no labels assigned to this post.