I am trying to find an efficient way to calculate Rolling 3-day CUMULATIVE SUM.
I have some for loops that work, but the data sets I am working with are ~500k rows with many more GROUPS, SUBGROUPS and VALUE columns so it takes ~30 minutes to calculate. The example data set is very stripped down and basic.
I moved the values into a matrix and that speeds it up a little, but not enough.
There is a GROUP and SUBGROUP columns that are to be CUMULATIVE SUM seperately from eachother. Also for each row the CUMULATIVESUM can only be for DATE - 3 days. Anything before :DATE[]-3 days, and after :DATE[] is not used for that row's CUMULATIVE SUM.
I have a feeling a column formula using LAG will probably be fastest, but I cannot get it to work properly.
Speed is very important.
I have tried subsetting the tables, and that just made it slower.
See table and JSL. The top two JSL methods work perfectly, the third JSL method doesn't work properly.
Thank You!
[edit] JMP Pro 12.2.0
dt = Current Data Table();
For( i = 1, i <= N Rows( dt ), i = i + 1,
:CUMULATIVE_SUM_VALUE[i] = :VALUE[i];
For( c = i - 1, c > 0, c = c - 1,
If( :SUBGROUP[c] == :SUBGROUP[i] & :GROUP[c] == :GROUP[i] & (:DATE[i] - :DATE[c]) <= In Days( 3 ),
:CUMULATIVE_SUM_VALUE[i] = :CUMULATIVE_SUM_VALUE[i] + :VALUE[c],
c = -1
)
);
);
//////////////////////////////////////////////////////////////////
//////////////////////////////////////////////////////////////////
//1=GROUP, 2=SUBGROUP, 3=DATE, 4=VALUE, 5=CUMULATIVE_SUM_VALUE
TT = dt << Get All Columns As Matrix;
For( i = 1, i <= N Rows( TT ), i = i + 1,
TT[i, 5] = TT[i, 4];
For( c = i - 1, c > 0, c = c - 1,
If( TT[c, 2] == TT[i, 2] & TT[c, 1] == TT[i, 1] & (TT[i, 3] - TT[c, 3]) <= In Days( 3 ),
TT[i, 5] = TT[i, 5] + TT[c, 4],
c = -1
)
);
);
For Each Row( :CUMULATIVE_SUM_VALUE = TT[Row(), 5] );
//////////////////////////////////////////////////////////////////
//////////////////////////////////////////////////////////////////
dt << New Column( "CUMULATIVE_SUM_VALUE_2",
Formula(
//:VALUE;
If(
Row() == 1, :VALUE,
:GROUP == Lag( :GROUP, 1 ) & :SUBGROUP == Lag( :SUBGROUP, 1 ) & :DATE - Lag( :DATE, 1 ) <= In Days( 3 ), :VALUE + Lag( :CUMULATIVE_SUM_VALUE, 1 )
,
:GROUP != Lag( :GROUP, 1 ) & :SUBGROUP != Lag( :SUBGROUP, 1 ) & :DATE - Lag( :DATE, 1 ) <= In Days( 3 ), :VALUE
,
:GROUP == Lag( :GROUP, 1 ) & :SUBGROUP == Lag( :SUBGROUP, 1 ) & :DATE - Lag( :DATE, 1 ) > In Days( 3 ), :VALUE
,
:GROUP != Lag( :GROUP, 1 ) | :SUBGROUP != Lag( :SUBGROUP, 1 ), :VALUE
)
),
Eval Formula
);
This should be a little faster. It is a simple offset counting routine, it hinges on the the power of the Loc() and matrix functions.
Items to Note:
Names Default to Here(1);
dt = Open("c:\users\<user_id>\downloads\cumulative_sum_test_table.jmp");
//This assumes the data table is soted by Group and SubGroup and Date
scol = dt << new column("Sequence", numeric, continuous);
scol << set each value(Row());
ccol = dt << new Column("CUMULATIVE_SUM_VALUE_2", numeric, continuous);
summarize(Data Table(dt), _ttt = By(:GROUP, :SUBGROUP), _ibeg=Min(scol), _iend=Max(scol) );
dt << delete column(scol); //no longer needed
dmat = :DATE << get as matrix;
vmat = :VALUE << get as matrix;
//for each group
drng = In Days(3); //259200
For(i=1, i<=nitems(_ttt[1]), i++,
b = _ibeg[i];
e = _iend[i];
dsub = dmat[ b::e ];
vsub = vmat[ b::e ];
For(j = b, j <= e, j++,
//j=37
k= j-b+1;
//JMP12 converted a 1x1 matrix to a scalar
if(k==1,
ccol[j] = vsub[1]
, //else
idx = loc( 0<= -1 * dsub[1::k] + dmat[j] <= drng ); //should this be < not <=
//idx = loc( 0<= Floor( (-1 * dsub[1::k] + dmat[j])/86400) <= 3 ); // < ??
ccol[j] = sum(vsub[idx])
); //end if
); // end for j
);
This should be a little faster. It is a simple offset counting routine, it hinges on the the power of the Loc() and matrix functions.
Items to Note:
Names Default to Here(1);
dt = Open("c:\users\<user_id>\downloads\cumulative_sum_test_table.jmp");
//This assumes the data table is soted by Group and SubGroup and Date
scol = dt << new column("Sequence", numeric, continuous);
scol << set each value(Row());
ccol = dt << new Column("CUMULATIVE_SUM_VALUE_2", numeric, continuous);
summarize(Data Table(dt), _ttt = By(:GROUP, :SUBGROUP), _ibeg=Min(scol), _iend=Max(scol) );
dt << delete column(scol); //no longer needed
dmat = :DATE << get as matrix;
vmat = :VALUE << get as matrix;
//for each group
drng = In Days(3); //259200
For(i=1, i<=nitems(_ttt[1]), i++,
b = _ibeg[i];
e = _iend[i];
dsub = dmat[ b::e ];
vsub = vmat[ b::e ];
For(j = b, j <= e, j++,
//j=37
k= j-b+1;
//JMP12 converted a 1x1 matrix to a scalar
if(k==1,
ccol[j] = vsub[1]
, //else
idx = loc( 0<= -1 * dsub[1::k] + dmat[j] <= drng ); //should this be < not <=
//idx = loc( 0<= Floor( (-1 * dsub[1::k] + dmat[j])/86400) <= 3 ); // < ??
ccol[j] = sum(vsub[idx])
); //end if
); // end for j
);
This works great. I am seeing >200% speed increase using this method with my dataset.
There was a small bug where if there was only one row for a SubGroup (you had a JSL comment about a 1x1 matrix converted to scalar). I have no idea why this would happen.
I made sure these would stay a matrix as so:
dsub = Matrix(dmat[b :: e]);
vsub = Matrix(vmat[b :: e]);
Good catch. I made a rookie mistake not catching the corner case for a 1 row subgroup. I am so glad JMP no longer converts a 1x1 matrix to a scalar.
//JMP12 converted a 1x1 matrix to a scalar