cancel
Showing results for 
Show  only  | Search instead for 
Did you mean: 
Choose Language Hide Translation Bar
jj_jmp
Level II

How do I construct a column A, whose entries are equal to the mean/median/mode of the last 10 non-null entries of another column B?

Hello, 

I have a column A (around 15,000 rows) which has numbers and many null values (maybe 70-80% null).

I would like to construct a column B which at row i, is equal to the the mean of:   the *last 10 non-null values from A* 

--> this means take  A[row i-1] , A[row i-2], ... and go back until you have 10 non-null values (or if you reach row 1 then  take whatever number <= 10 of  non-null values you get)

 

I would also like to do in addition to mean, a median and also mode, but a smoothed version of the mode where:

I first do a Gaussian smoothing of the 10 non-null values ...  and take the argmax (mode) of that distributions. 

 

Thanks for any thoughts or suggestions!

 

***

Solution ideas: One thought I had was because I want to do several different calculations with the last 10 non-null values, I should first construct a column C whose entries are arrays equal to those last 10 non-null values ... then I can do 3 formulas which operate on column C to produce B1 (mean), B2 (medial), B3 (smoothed mode).

 

Here are some code fragments moving towards constructing this column C of arrays:

//step 1 grab the last say 50 entries (some fixed amount that will likely have 10 non-nulls)

For Each Row( dt_join, 

	to = Row();

	If( to - 50 < 0,

	from = 1,

	from = to - 50); // Quantile( 0.5, 

		values = :away_1g_pull_seconds[Index( From, To )] <<Get Values //this is an array 

		Show(values);

	If(Row() > 50, Break());

); 

//step 2 convert array to list, drop nulls, convert list of first 10 entries back to array ... 
//I don't know how to assign this to a row of C 

num_lag = 10; 

values = dt_join:A[Index( From, To )] ;

 MyList = Remove(as list(values)[1], as list(Loc(as list(values)[1], .)) );   //convert array to list and remove nulls

 Mat = Number Col Box("", MyList[1:10]) << get as matrix; //hack convert to array

 

 

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
Craige_Hales
Super User

Re: How do I construct a column A, whose entries are equal to the mean/median/mode of the last 10 non-null entries of another column B?

I'd guess it is still N^2 for the helper=AsConstant. That will be copying a 100K array 400K times...

 

edit: guess confirmed. But even lifting the helper out of the formula still leaves my code about half as fast as Brady's.

 

But...Jim's code can be tweaked to be faster...

 

dt = Current Data Table();
startTime = Tick Seconds();
theMatrix = {., ., ., ., ., ., ., ., ., .};

z = .;
dt << New Column( "B" );
dt << begindataupdate;
For Each Row(
    v = :A;
    If( v == v,
        Insert Into( theMatrix, v );
        Remove From( theMatrix, 1 );
        z = Mean( theMatrix );
    );
    :B = z;
);
stopTime = Tick Seconds();
Show( stopTime - startTime );
dt << enddataupdate;
Craige

View solution in original post

8 REPLIES 8
txnelson
Super User

Re: How do I construct a column A, whose entries are equal to the mean/median/mode of the last 10 non-null entries of another column B?

Here is a beginning for your first request.  It will create a mean for the last 10 values for column A and put the mean into column B

I don't have time right now to work more on this.  Maybe later this weekend.

Names Default To Here( 1 );
dt = Current Data Table();
theMatrix = [];
dt << New Column( "B" );
For Each Row(
	If( Is Missing( :A ) == 0,
		theMatrix = Matrix( :a ) || theMatrix
	);
	If( Length( theMatrix ) > 10,
		theMatrix[1, 11] = []
	);
	:B = Mean( theMatrix );
);
Jim
Craige_Hales
Super User

Re: How do I construct a column A, whose entries are equal to the mean/median/mode of the last 10 non-null entries of another column B?

@DonMcCormack this would make a good challenge problem.

I think this will work and be reasonably efficient.

nr = 15000;
dt = New Table( "Untitled",
    Add Rows( nr ),
    New Column( "a", Numeric, "Continuous", Format( "Best", 12 ), Set Values( J( nr, 1, If( Random Uniform() < .3, Random Integer( 0, 1 ), . ) ) ) )
);

lookback = 10;

dt << New Column( "b" || Char( lookback ),
    formula(
        // make a helper to help look up the non-missing rows
        helper = As Constant( // AsConstant is for efficiency
            temp = dt << getrowswhere( !Is Missing( a ) );
            // temp is the non-missing row numbers. Make sure row 1
            // is included, even if missing, because LocSorted will
            // return [1] when a smaller value would have been better.
            If( temp[1] != 1,
                Insert Into( temp, 1, 1 )
            );
            temp; // this is assigned to 'helper'
        );
        currentrow = Row(); // this is the target for a new mean, etc
        end = Loc Sorted( helper, currentrow )[1]; // end is the last non missing row before or at the target
        start = Max( 1, end - lookback + 1 ); // start is the row 'lookback' rows earlier
        rowset = helper[start :: end]; // rowset is the set of nonmissing rows, possibly including a missing value at row 1.
        // it is OK if a missing value is present; mean/median/mode ignore it.
        // grab the rowset from column a and report the statistic...
        Mean( dt[rowset, {a}] );// median, mode
    )
);

 

Craige

Re: How do I construct a column A, whose entries are equal to the mean/median/mode of the last 10 non-null entries of another column B?

Hello,

 

Here is another way to do this, using an intermediate table with no missings, which allows a simpler formula.

Names Default To Here( 1 );

dt = Current Data Table();

//get nonmissing rows of table and put them in a new table
m = dt << get rows where( !Is Missing( :a ) );
dt2 = New Table( "xxx", invisible, <<add rows( N Row( m ) ) );
dt2[0, 1] = dt[m, 1];

//run formula on the new table. it has no missings, so direct indexing can be used
dt2 << New Column( "Xx",
	formula(
		tab = :column1 << get data table;
		Mean( tab[Index( Max( Row() - 9, 1 ), Row() ), 1] );
	)
);

//place these answers in their corresponding rows in the orignal table, in a new column.
dt << New Column( "ans" );
dt[m, N Col( dt )] = dt2[0, 2];

//span empty data to fill
dt << New Column( "Mean10",
	formula( If( Row() == 1, :ans, !Is Missing( :ans ), :ans, Lag( :Mean10 ) ) )
);

//cleanup
dt:Mean10 << Delete Formula;
dt << delete columns( :ans );
Close( dt2, nosave );


 

When the original table is small (like yours... about 15K rows), the two approaches perform similarly--the first is a touch faster:

 

brady_brady_1-1665844715162.png

 

As the original table grows, the "separate table" approach seems to scale appreciably better. Here is a 150K row example:

 

brady_brady_0-1665844547704.png

 

A possible deal-breaker (??) is that this approach does not produce a formula in the source table, so you would need to re-run this whenever new data is appended.

 

I've attached the 150K row file in case you want to experiment.

 

Cheers,

Brady

Re: How do I construct a column A, whose entries are equal to the mean/median/mode of the last 10 non-null entries of another column B?

Here are some more results. The "separate" method seems to be scaling linearly. The "same table" method seems to hit a wall somewhere north of 450,000 rows. I'm not sure why.

 

brady_brady_2-1665849827059.png

brady_brady_1-1665849460162.png

Craige_Hales
Super User

Re: How do I construct a column A, whose entries are equal to the mean/median/mode of the last 10 non-null entries of another column B?

I'd guess it is still N^2 for the helper=AsConstant. That will be copying a 100K array 400K times...

 

edit: guess confirmed. But even lifting the helper out of the formula still leaves my code about half as fast as Brady's.

 

But...Jim's code can be tweaked to be faster...

 

dt = Current Data Table();
startTime = Tick Seconds();
theMatrix = {., ., ., ., ., ., ., ., ., .};

z = .;
dt << New Column( "B" );
dt << begindataupdate;
For Each Row(
    v = :A;
    If( v == v,
        Insert Into( theMatrix, v );
        Remove From( theMatrix, 1 );
        z = Mean( theMatrix );
    );
    :B = z;
);
stopTime = Tick Seconds();
Show( stopTime - startTime );
dt << enddataupdate;
Craige

Re: How do I construct a column A, whose entries are equal to the mean/median/mode of the last 10 non-null entries of another column B?

That is really nice--much more elegant than my approach.

txnelson
Super User

Re: How do I construct a column A, whose entries are equal to the mean/median/mode of the last 10 non-null entries of another column B?

Wow......did this exercise teach me a bunch.  My original code worked, but was sooooo much slower than the alternate methods, that I took the time to study them.  Very fast and clever code.  Then @Craige_Hales provided his latest code, which is down right speedy, but takes an approach very close to what my original code in approach.  @Craige_Hales code is much faster.  And most of the difference is in my lack of using

Begin Data Update;
.....
End Data Update;

So I have now gained a large appreciation for the holding up of the Update Messages until the end of the updating.

Thanks for the fun!

Jim
jj_jmp
Level II

Re: How do I construct a column A, whose entries are equal to the mean/median/mode of the last 10 non-null entries of another column B?

Thank so much for all your responses - I am blown away! 

I have not had chance to read them in detail but I will respond more when I do. 

 

Best,

 

Joe