Solved: Lag based on Month

Dukowitz62 · Jun 10, 2023 4:35 PM

Our goal is to create a 1 month lag column (rather than by row). The LAG(M/Y) column output of 01/2000 should line up with 02/2000 in the M/Y column, etc. Attached is a tiny subset example of water use by m/y; whereas, our full table contains more than 59 million rows. The full table consists of water use for more than 430,000 customers for each each month over many years. We're open to all suggestions, if using a different column such as CONS_MONTH_NUM is easier or more straightforward. Could this possibly be resolved using Group By?

peng_liu · Aug 19, 2021 09:29 PM

What Lag function does is to shift data by row. There are not other definitions of this function.

I might misunderstand the objective. Do you mean lagging variables (not M/Y) by a month, not by row? An issue which row-shifting based Lag is not desired is that you have multiple ID in the data, and you don't want to Customer A's data spill off to Customer B's rows. A JSL trick can address it. Another situation which by-row shift is not desired is that there are missing observations. For an example, you have monthly series for a year, but missing June. The row for June just does not exist in the data. A row-shifting based Lag will totally mess up the data. There may not have simple JSL tricks.

If my understand correctly about the objective, I have a thought. First, let's simplify the data for the illustration purpose. I cut the data down to three columns: SP_ID, M/Y, WTR_GAL. The objective is to shift WTR_GAL, by SP_ID, based on time M/Y.

I create a copy of this table, and create two more columns:

The formula for Lag(M/Y) is

The formula for Lag WTR_GAL is the following, which is basically a copy of WTR_GAL.

Then I join the two tables together by matching SP_ID and M/Y==Lag(M/Y). Here is the JOIN dialog

And here is the result

Notice the highlighted cell is something that one has to pay attention if using the row-shifting based Lag. But here, without row-shifting based lag, the result is what I expect. The entire operation is piecing this month data with last month data by JOIN, the result is the same as the "ideal" LAG operation.

View solution in original post

peng_liu · Aug 19, 2021 09:29 PM

What Lag function does is to shift data by row. There are not other definitions of this function.

I might misunderstand the objective. Do you mean lagging variables (not M/Y) by a month, not by row? An issue which row-shifting based Lag is not desired is that you have multiple ID in the data, and you don't want to Customer A's data spill off to Customer B's rows. A JSL trick can address it. Another situation which by-row shift is not desired is that there are missing observations. For an example, you have monthly series for a year, but missing June. The row for June just does not exist in the data. A row-shifting based Lag will totally mess up the data. There may not have simple JSL tricks.

If my understand correctly about the objective, I have a thought. First, let's simplify the data for the illustration purpose. I cut the data down to three columns: SP_ID, M/Y, WTR_GAL. The objective is to shift WTR_GAL, by SP_ID, based on time M/Y.

I create a copy of this table, and create two more columns:

The formula for Lag(M/Y) is

The formula for Lag WTR_GAL is the following, which is basically a copy of WTR_GAL.

Then I join the two tables together by matching SP_ID and M/Y==Lag(M/Y). Here is the JOIN dialog

And here is the result

Notice the highlighted cell is something that one has to pay attention if using the row-shifting based Lag. But here, without row-shifting based lag, the result is what I expect. The entire operation is piecing this month data with last month data by JOIN, the result is the same as the "ideal" LAG operation.

Lag based on Month

Re: Lag based on Month

Re: Lag based on Month