cancel
Showing results for 
Show  only  | Search instead for 
Did you mean: 
Choose Language Hide Translation Bar
Dukowitz62
Level II

Lag based on Month

Our goal is to create a 1 month lag column (rather than by row).  The LAG(M/Y) column output of 01/2000 should line up with 02/2000 in the M/Y column, etc.  Attached is a tiny subset example of water use by m/y; whereas, our full table contains more than 59 million rows.   The full table consists of water use for more than 430,000 customers for each each month over many years.  We're open to all suggestions, if using a different column such as CONS_MONTH_NUM is easier or more straightforward.  Could this possibly be resolved using Group By?

1 ACCEPTED SOLUTION

Accepted Solutions
peng_liu
Staff

Re: Lag based on Month

What Lag function does is to shift data by row. There are not other definitions of this function.

I might misunderstand the objective. Do you mean lagging variables (not M/Y) by a month, not by row? An issue which row-shifting based Lag is not desired is that you have multiple ID in the data, and you don't want to Customer A's data spill off to Customer B's rows. A JSL trick can address it. Another situation which by-row shift is not desired is that there are missing observations. For an example, you have monthly series for a year, but missing June. The row for June just does not exist in the data. A row-shifting based Lag will totally mess up the data. There may not have simple JSL tricks.

If my understand correctly about the objective, I have a thought. First, let's simplify the data for the illustration purpose. I cut the data down to three columns: SP_ID, M/Y, WTR_GAL. The objective is to shift WTR_GAL, by SP_ID, based on time M/Y.

peng_liu_0-1629422156918.png

I create a copy of this table, and create two more columns:

peng_liu_1-1629422277460.png

The formula for Lag(M/Y) is

peng_liu_2-1629422329386.png

The formula for Lag WTR_GAL is the following, which is basically a copy of WTR_GAL.

peng_liu_3-1629422363002.png

Then I join the two tables together by matching SP_ID and M/Y==Lag(M/Y). Here is the JOIN dialog

peng_liu_4-1629422496109.png

And here is the result

peng_liu_5-1629422574356.png

Notice the highlighted cell is something that one has to pay attention if using the row-shifting based Lag. But here, without row-shifting based lag, the result is what I expect. The entire operation is piecing this month data with last month data by JOIN, the result is the same as the "ideal" LAG operation.

 

 

View solution in original post

1 REPLY 1
peng_liu
Staff

Re: Lag based on Month

What Lag function does is to shift data by row. There are not other definitions of this function.

I might misunderstand the objective. Do you mean lagging variables (not M/Y) by a month, not by row? An issue which row-shifting based Lag is not desired is that you have multiple ID in the data, and you don't want to Customer A's data spill off to Customer B's rows. A JSL trick can address it. Another situation which by-row shift is not desired is that there are missing observations. For an example, you have monthly series for a year, but missing June. The row for June just does not exist in the data. A row-shifting based Lag will totally mess up the data. There may not have simple JSL tricks.

If my understand correctly about the objective, I have a thought. First, let's simplify the data for the illustration purpose. I cut the data down to three columns: SP_ID, M/Y, WTR_GAL. The objective is to shift WTR_GAL, by SP_ID, based on time M/Y.

peng_liu_0-1629422156918.png

I create a copy of this table, and create two more columns:

peng_liu_1-1629422277460.png

The formula for Lag(M/Y) is

peng_liu_2-1629422329386.png

The formula for Lag WTR_GAL is the following, which is basically a copy of WTR_GAL.

peng_liu_3-1629422363002.png

Then I join the two tables together by matching SP_ID and M/Y==Lag(M/Y). Here is the JOIN dialog

peng_liu_4-1629422496109.png

And here is the result

peng_liu_5-1629422574356.png

Notice the highlighted cell is something that one has to pay attention if using the row-shifting based Lag. But here, without row-shifting based lag, the result is what I expect. The entire operation is piecing this month data with last month data by JOIN, the result is the same as the "ideal" LAG operation.