vince_faller
Super User

KDTable() K nearest rows with a sparse matrix

I have a data table with a structure similar to the example below: a stacked table in which roughly 10,000 to 15,000 distinct dimensions are possible. Each Thing has roughly 10 dimensions, and there is a Y value associated with each Thing/dimension pair.

 

Names default to here(1);
ndim = 12000;
dt = new table("Example", <<Add Rows(3000000),
	New Column("Thing", Nominal, <<Set Each Value(floor(Row()/10))),
	New Column("Dim", Nominal,
		<<Set Each Value((1::ndim)[Random Integer(ndim)])
	), 
	New Column("Y", <<Set Each Value(Random Normal()))
);

I then split it into a wide table that is almost entirely sparse (mostly missing values), but doing so EXPLODES memory:

// Split returns a reference to the new wide table; dt points at that table from here on
dt = dt << Split(
	Split By( :Dim ),
	Split( :Y ),
	Group( :Thing ),
	Sort by Column Property
);
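
As a rough back-of-envelope check on why the split is so expensive, the sketch below compares the cell counts of the stacked and wide layouts. It assumes each numeric cell takes 8 bytes and that every one of the 12,000 dimensions shows up at least once, so the wide table gets one column per dimension. Both are assumptions, and real memory use will include overhead on top of this.

```jsl
Names Default To Here( 1 );

nStackedRows = 3000000;  // rows in the stacked example table
rowsPerThing = 10;       // from Floor( Row() / 10 )
ndim = 12000;            // number of possible dimensions
bytesPerCell = 8;        // assumption: numeric cells are 8-byte doubles

nThings = nStackedRows / rowsPerThing;                // about 300,000 Things
stackedGB = nStackedRows * 3 * bytesPerCell / 10 ^ 9; // 3 columns: Thing, Dim, Y
wideGB = nThings * ndim * bytesPerCell / 10 ^ 9;      // one column per dimension

Show( nThings, stackedGB, wideGB );
```

Under those assumptions the stacked table is on the order of 0.07 GB, while the wide table is on the order of 29 GB, almost all of it missing values.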

My end goal is to run KDTable() << K Nearest Rows on the split table, but it bogs down badly because there is now far more data than there was originally. In reality, the missing values should be treated as 0s.

 

// This step takes a long time; I have never seen it finish because I run out of memory.
// Pull every column after the first (the Thing column) into one numeric matrix.
mat = dt[0, 2 :: N Cols( dt )];
// Change the missing values to zeros
mat[Loc( Is Missing( mat ) )] = 0;

kdt = KDTable(mat);

{rows, dist} = kdt << K nearest rows( {3, 2.0}, 1 ); 
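
For what it's worth, the zero-filled matrix can also be built straight from the stacked rows, which skips the intermediate split table and the missing-to-zero pass. The sketch below does this on a deliberately small table; dtSmall, the sizes, and the variable names are illustrative assumptions only. It still allocates a dense nThings-by-ndim matrix, so at full size it would presumably hit the same memory wall. If a (Thing, Dim) pair occurs more than once, the last Y value wins.

```jsl
Names Default To Here( 1 );

// Small illustrative sizes (assumptions), not the 3,000,000-row example above
ndim = 50;
dtSmall = New Table( "Example small", <<Add Rows( 200 ),
	New Column( "Thing", Nominal, <<Set Each Value( Floor( Row() / 10 ) ) ),
	New Column( "Dim", Nominal, <<Set Each Value( Random Integer( ndim ) ) ),
	New Column( "Y", <<Set Each Value( Random Normal() ) )
);

// Read the three stacked columns as column vectors
things = dtSmall:Thing << Get Values;
dims = dtSmall:Dim << Get Values;
ys = dtSmall:Y << Get Values;

// Start from all zeros, so absent dimensions are 0 rather than missing
nThings = Max( things ) + 1; // Thing values start at 0
mat = J( nThings, ndim, 0 );
For( i = 1, i <= N Rows( things ), i++,
	mat[things[i] + 1, dims[i]] = ys[i]
);

// Same KDTable call as above, on the directly built matrix
kdt = KDTable( mat );
{rows, dist} = kdt << K Nearest Rows( {3, 2.0}, 1 );
```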

Does anyone know a better way to do this?  

Vince Faller - Predictum