cancel
Showing results for 
Show  only  | Search instead for 
Did you mean: 
Choose Language Hide Translation Bar
nshivan
Level I

Replace missing values with 0

Hello,

I'm trying to replace all missing values in a data table with 0.   The below script doesn't work.  I use JMP 13.  Is there any other way to do this?

 

dt = Current Data Table();
nc = dt << get column names( Numeric );
For( i = 1, i <= N Items( nc ), i++,
  nc[i][dt << get rows where( Is Missing( nc[i][] ) )] = 0
);

 Thanks! 

13 REPLIES 13

Re: Replace missing values with 0

Hi all,

I have been working on this exact problem with a hundreds of columns with hundreds of missing rows.  I wasn't happy with the performance and tried many approaches.  It seems that Recode solves this problem about 150 times faster than the current solution.  Sharing the code with anyone else that might care about speed.  Please let me know if you find something faster.  Every millisecond counts!  

replaceMissingStart = HP Time();

// replace missings with zeros
nc = dmdt << get column names( Numeric );

For( i = 1, i <= N Items( nc ), i++,
	col = Column(dt,nc[i]);
	colname = col << getname();

   	Eval(Parse(EvalInsert("\[
		dmdt << Recode Column(
			:Name("^colname^"),
			{Map Value( _rcOrig, {., 0}, Unmatched( _rcNow ) )},
			Update Properties( 1 ),
			Target Column( :Name("^colname^") )
		);
	]\")));
	
	//Code from the Community
	//Column( nc[i] )[dmdt << get rows where( Is Missing( As Column( nc[i] ) ) )] = 0;

	// standardize column attributes, change all to nominal
	col << Set Modeling Type( "Nominal" ) 
);

replaceMissingStop = HP Time();
totalReplaceTime = replaceMissingStop-replaceMissingStart;
show(totalReplaceTime/1000000);

Thanks to Rebecca Lyzinski for the idea!!!

ian_jmp
Staff

Re: Replace missing values with 0

FWIW, this sort of approach might be better. For 10,000 by 10,000 it took about 10 seconds on my laptop (including making the starting table):

NamesDefaultToHere(1);

// Size the problem . . .
nr = 10000;		// Number of rows
nc = 10000;		// Number of columns
nm = 0.1;		// Proportion of missing values

// Make a table and inject some missing values
mat = J(nr*nc, 1, RandomNormal());
mat[RandomIndex(nr*nc, nr*nc*nm)] = .;
dt = AsTable(Shape(mat, nr, nc));
dt << setName("Data with some missingness");

// Start here: Get values into a matrix (assumes contiguous numeric columns)
mat = dt << getAsMatrix;
// Reshape as a vector
mat = Shape(mat, nr*nc, 1);
// Replace missing
mat[Loc(mat, .)] = 0;
// Reshape back to a matrix
mat = Shape(mat, nc, nc);
// Make a new table
dt2 = AsTable(Shape(mat, nr, nc));
dt2 << setName("Data with missing set to zero values");

 

Craige_Hales
Super User

Re: Replace missing values with 0

Combining solutions, with data table subscripting...

nr = 10000;		// Number of rows
nc = 10000;		// Number of columns
nm = 0.9;		// Proportion of missing values

// Make a table and inject some missing values
mat = J(nr*nc, 1, RandomNormal());
mat[RandomIndex(nr*nc, nr*nc*nm)] = .;
dt = AsTable(Shape(mat, nr, nc));
dt << setName("Data with some missingness");


replaceMissingStart = HP Time();

nc = dt << get column names( Numeric );
mat = dt[0,nc];
mat[Loc(mat, .)] = 0;
dt[0,nc] = mat;


replaceMissingStop = HP Time();
totalReplaceTime = replaceMissingStop-replaceMissingStart;
show(totalReplaceTime/1000000);


About 10 seconds.

 

Craige

Re: Replace missing values with 0

Dear Ian and Craige,

Thank you so much for your submissions.  I ran some preliminary tests and Ian's solution is more than 4 times faster than Recode for this and Craige's solution is almost 10 times faster.  If anyone would like to compare all 3 solutions, try the following:

 

//Recode solution

nr = 10000;		// Number of rows
nc = 10000;		// Number of columns
nm = 0.9;		// Proportion of missing values

// Make a table and inject some missing values
mat = J(nr*nc, 1, RandomNormal());
mat[RandomIndex(nr*nc, nr*nc*nm)] = .;
dt = AsTable(Shape(mat, nr, nc));
dt << setName("Data with some missingness");

replaceMissingStart = HP Time();

// replace missings with zeros
nc = dt << get column names( Numeric );

For( i = 1, i <= N Items( nc ), i++,
	col = Column(dt,nc[i]);
	colname = col << getname();

   	Eval(Parse(EvalInsert("\[
		dt << Recode Column(
			:Name("^colname^"),
			{Map Value( _rcOrig, {., 0}, Unmatched( _rcNow ) )},
			Update Properties( 1 ),
			Target Column( :Name("^colname^") )
		);
	]\")));
	
	//Code from the Community
	//Column( nc[i] )[dmdt << get rows where( Is Missing( As Column( nc[i] ) ) )] = 0;

	// standardize column attributes, change all to nominal
	//col << Set Modeling Type( "Nominal" ) 
);

replaceMissingStop = HP Time();
totalReplaceTime = replaceMissingStop-replaceMissingStart;
show(totalReplaceTime/1000000);

Russ Wolfinger had mentioned he thought matrices would solve the problem more quickly, I guess that was what you two were thinking as well!