I often summarize a column by some combination of other categorical columns using a formula column. For example in the Detergent sample data I might want to know the average count for each combination of softness and temperature. Below are three methods to do this, which one I use depends on the size of the data table and who my audience is. Is there an easier way?
I shy away from an add-in containing a custom 'Col Mean by Multiple Columns()' function due to portability issues.
Tabluate
Create summary statistics in a new table using tabulate and then join the results to the original data table.
Pro: Easy to make very complex summary statistics.
Con: Not live, need to refresh the data if a referenced column changes during an analysis, no way for users to follow the source of the data.
Binning Formula
Create a binning formula and then use the second argument of the Col Mean() function to refrence that bin.
Pro: Easy to set up and relatively easy for users to follow.
Con: Creates extra columns that clutter the data table and have no use except for this formula, especially when creating many summary columns.
dt = Open( "$sample_data/Detergent.jmp" );
//binning formula
dt << New Column(
"Bin by Temp and Softness",
Character,
"Nominal",
Formula( :softness || "-" || :temperature )
);
//summary statistic
dt << New Column(
"Mean Count by Temp and Softness Binned",
Numeric,
"Continuous",
Format( "Best", 12 ),
Formula( Col Mean( :count, :Bin by Temp and Softness) )
);
Identify bins and calculate summary statistic using JSL inside a formula
Identify similar rows in the data table then calculate the summary statistic on that data. I suspect JMP does not cache results using this method so the whole table is re-evaluted for every row. In any case it is slow for large data tables.
Pro: Tons of flexibility
Cons: Slow calculations, very hard for users to follow.
//summary statistic by bin
dt << New Column(
"Mean Count by Temp and Softness Direct",
Numeric,
"Continuous",
Format( "Best", 12 ),
Formula(
matches = Function( {a, b},
out = {};
For( i = 1, i <= N Items( a ), i++,
found = "no";
For( j = 1, j <= N Items( b ), j++,
If( a[i] == b[j],
found = "yes"
)
);
If( found == "yes",
out = Insert( out, a[i] )
);
);
out;
);
soft = Loc( :softness << get values(), :softness );
temp = Loc( :temperature << get values(), :temperature );
Mean( (:count << get values())[matches( soft, temp )] );
)
);