Re: How to calculate moving average of column with 100 million entries in JMP 11...

jojmp · May 29, 2019 1:01 AM

I am trying to run the below formula on a column with 100 million rows to calculate the moving average.Could you please suggest an JMP11 alternative for Col Moving Average

Col Moving Average( :x, 1, 9999999, 0 ) is not supported by JMP11.

Please find below the code I have tried:

NamesDefaultToHere(1);

// Use a function to open a .CSV file: Given the path to such a file, returns a reference to the resulting table
openMyCSV = 
function({path2File}, {Default Local},
			dt = Open(
					path2File,
					columns(
						New Column( "column 1", Numeric, "Continuous", Format( "Best", 12 ) ),
						New Column( "items", Numeric, "Continuous", Format( "Best", 12 ) )
						),
					Import Settings(
						End Of Line( CRLF, CR, LF ),
						End Of Field( Comma, CSV( 0 ) ),
						Strip Quotes( 1 ),
						Use Apostrophe as Quotation Mark( 0 ),
						Use Regional Settings( 0 ),
						Scan Whole File( 1 ),
						Treat empty columns as numeric( 0 ),
						CompressNumericColumns( 0 ),
						CompressCharacterColumns( 0 ),
						CompressAllowListCheck( 0 ),
						Labels( 0 ),
						Column Names Start( 1 ),
						Data Starts( 1 ),
						Lines To Read( "All" ),
						Year Rule( "20xx" )
					)
				)
	);
	
dt = New Table( "My data file");
dt << New Column( "cc", Character, width( 40 ) );
dt << New Column( "MovAvgMean", Numeric );
dt << New Column( "MovAvgMax", Numeric );

// Put all the .csv files to process into a list
dir = Pick Directory(
	"Directory containing your csv files",
	"C:\Users\xxx\Documents\test\"
);
files = Files In Directory( dir );
For( f = N Items( files ), f >= 1, f--,
	If( !Ends With( Uppercase( files[f] ), ".CSV" ),
		Remove From( files, f )
	)
);

// Process each file - Incomplete!
For( f = 1, f <= N Items( files ), f++,
	thisTable = openMyCSV( dir || files[f] );
	thisTable << New Column( "x",
		Numeric,
		Continuous,
		Formula( (87 *:items)-(0.03998113) ));
	thisTable << New Column( "av", Numeric, Continuous, Formula( Col Moving Average( :x, 1, 9999999, 0 ) ) );
	
	// This will insure that all formula processing is complet before the next statement is run
	thisTable << run formulas; 
	
	FileName = files[f];
	// Add the summary statistics to the summary table
	dt << Add Rows( 1 );
    dt:cc[N Rows( dt )] = FileName;
	dt:MovAvgMean[N Rows( dt )] = Col Mean( thisTable:x );
	dt:MovAvgMax[N Rows( dt )] = Col Max( thisTable:x );
	
	Close( thisTable, nosave );
);

9

gzmorgan0 · May 28, 2019 06:18 PM

@jojmp, your formula for the moving average is a bit foreign

thisTable << New Column( "av", Numeric, Continuous, Formula( Col Moving Average( :x, 1, 9999999, 0 ) ) );

The third argument, 99999999, is creating some type of (historical) mean by Sequence (row). The third argument is the span of the average. More common is a span of 5 or 10 or ?? to provide a Mean for values near in time, that does not include the entire history.

Assuming that is what you want, an alternative formula can do the same thing

If( Row() == 1,
	:x,
	(:x + (Row() - 1) * Lag( :av, 1 )) / Row()
)

Attached is a version of sample data Big Class.jmp with two columns with formulas

Col Moving Average( :weight, 1, 9999, 0 )

and the comparable Custom Formula

If( Row() == 1,
	:weight,
	(:weight + (Row() - 1) * Lag( :Custom Formula, 1 )) / Row()
)

jojmp · May 29, 2019 01:16 AM

In this custom formula how do we know it is the moving average of 9999 items?

gzmorgan0 · May 29, 2019 03:42 AM

@jojmp ,

Either I do not understand your request or you do not understand Col Moving Avg(). You stated that you have 100 million rows and the span you specified in your script was 99,999,999. So the last last row would be the mean of the entire column, and the previous rows would be all values up to and including the current row. Below is a picture of Big Class and a column moving average with a span of 5 (n before). Note rows less than 6 just use all rows, and will result in a mean based on less than 6 values. From row 6 and later, the mean is computed using the current row and the previous 5 rows.

So please provide more details. Now if your files have more than 100 million rows, then the custom formula would need to be modified.

jojmp · May 29, 2019 04:06 AM

sorry for not being clear I have changed the question to more than 10000 million entries and the number of entries are not fixed it can increase to any number. Looking for a generic formula to calculate the moving average over any number of items.

gzmorgan0 · May 29, 2019 06:48 AM

@jojmp,

Given your original script where you did not even save your table and you reference a column that was never created, and given that the mean of a large number of values has a very small change by dropping one row and adding another, your question is very strange.

Below is a script, that creates a table with nr rows versus reading in a file where nr = nrow(dt). If you had 100 million rows, memory management is very important and calculations shoudl try to be order n, O(n) and not O(n*rng)

If you show the embedded log the time to do the calculations will be printed out. Also Begin Data Update() and End Data Update() are used to save on IO (rendering) time. On my PC here are some of my results:

nr	nrg	seconds
100,000	100	0.34
10,000,000	1,000,000	10.66
100,000,000	100,000,000	85.74
100,000,000	10,000,000	104.50

Note, this script assumes no missing values. The script will need to be modified and performance will degrade immensely to accomodate missing values. Change the values of nr and nrg to test your PC's performance.

Names Default to Here(1);

nr = 100000;
dt = new table("test", new column("x"), Add rows(nr));

dt:x << set each value(Random Integer(100,200));

nrg = 100;

dt << begin data update();
t0 = hp time();
mvcol = dt << new column("MV");
mvcol[1] = :x[1];
for(j=2, j<=nrg, j++,
	mvcol[j] = (:x[j] + (j-1)*mvcol[j-1])/j; //assumes no missing values
);
if(nrg < nr,
	for(j=nrg+1, j<=nr, j++,
		mvcol[j] = (:x[j] + (nrg)*mvcol[j-1] - :x[j - nrg])/ (nrg)
	);
); //end if, 
wait(0);
dt << end data update();
t1= hp time();

show((t1-t0)*1e-6);

Given all this, I still get the feeling what you described miht not be what you need.

How to calculate moving average of column with more than 10000 million entries in JMP 11?

Re: How to calculate moving average of column with 100 million entries in JMP 11?

Re: How to calculate moving average of column with 100 million entries in JMP 11?

Re: How to calculate moving average of column with 100 million entries in JMP 11?

Re: How to calculate moving average of column with 100 million entries in JMP 11?

Re: How to calculate moving average of column with 100 million entries in JMP 11?