Discussions

Angela24996 · Jun 10, 2023 4:28 PM

I have a big data table and a normalization factor table.

The data table looks like this:

ID	Type	parameter1	parameter2	parameter3
01	1	data	data	data
02	1	data	data	data
03	2	data	data	data
04	2	data	data	data

The normalization factor table looks like this:

Type	parameter1	parameter2	parameter3
1	normalization factor1	normalization factor2	normalization factor3
2	normalization factor4	normalization factor5	normalization factor6

How can I multiply the first table by the second table based on type and parameter and obtain a new table like this:

ID	Type	parameter1	parameter2	parameter3
01	1	data*normalization factor1	data*normalization factor2	data*normalization factor3
02	1	data*normalization factor1	data*normalization factor2	data*normalization factor3
03	2	data*normalization factor4	data*normalization factor5	data*normalization factor6
04	2	data*normalization factor4	data*normalization factor5	data*normalization factor6

My raw data table is huge in both rows and columns so I want a fast solution.

Thank you very much.

brady_brady · Apr 1, 2021 9:56 PM

Hi,

I took a look at this and noticed a performance wall once the table got near the size of yours (actually, well before that.) I was able to improve things a bit once I began to process the table in "blocks" of about 10K rows by 1K columns. You could play with the size to see what works best, but these dimensions seemed to work well.

This method does not create a new table--it overwrites the original... so you'd want to save the original, then save the result under a different name.

The script below makes a couple of assumptions, as it is just meant to illustrate the overall concept--

- corresponding columns are in the same order in each table.

- for p parameters, the main table has p+2 columns, and the factor table has p+1 columns. there are no other columns in either table.

The actual processing time for a 1million row table with a thousand columns was about 40 seconds... each "block" took about 0.3 - 0.4 seconds to process.

The table setup portion, to mock the tables you have, took a while... probably a couple of minutes. Currently the script is set for a smaller table... modify the lines under the //table settings comment to change this.

The code is neither elegant nor compact, but it runs faster than anything else I could come up with (including using element-wise multiplication on the entire block at once... for some reason, iterating over the columns took about half the time.) I didn't try Craige's formula-based approach, though.

The initial sort is needed to keep from having to fetch rows using where() (if you have to do that, it slows things considerably.)

Cheers,

Brady

Names Default To Here( 1 );

//set limits on block size
blockRowLimit = 1e4;
blockColLimit = 1e3;

// table setup
rowsdt1 = 1e4;
colsdt1 = 1e2;

dt1 = As Table( (1 :: rowsDt1)` || J( rowsDt1, 1, Random Integer( 1, 2 ) ) || J( rowsdt1, colsdt1, Random Uniform() ), <<invisible );
Column( dt1, 1 ) << set name( "ID" );
Column( dt1, 2 ) << set name( "Type" );

dt2 = As Table( [1 1, 2 2] || J( 2, colsdt1, Random Uniform() ), <<invisible );
Column( dt2, 1 ) << set name( "Type" );
dt2 << Delete Columns( 2 );


////processing begins here

dt1 << sort( by( :type ), order( ascending ), replacetable );
type2Begin = Contains( dt1:type << getasmatrix, 2 );		//row # of first occurrence of type == 2 in sorted table.

//build matrices that show where row & column divisions of "blocks" will be
rowBreaks = (1 :: N Row( dt1 ) :: blockRowLimit) || Matrix( type2Begin ) || Matrix( N Row( dt1 ) );
rowbreaks = rowbreaks[Rank( rowbreaks )]; //sort the matrix

colBreaks = (3 :: N Col( dt1 ) :: blockColLimit) || Matrix( N Col( dt1 ) );
colBreaks = colBreaks[Rank( colBreaks )]; //sort the matrix

nrowBreaks = N Row( rowBreaks );
ncolBreaks = N Row( colBreaks );

For( i = 1, i < nrowBreaks, i++,
	lowRow = rowBreaks[i];
	highRow = if ( i == nRowBreaks - 1, rowBreaks[i+1], rowBreaks[i+1] -1 ); //we don't need to subtract 1 off the last upper bound.
	rowRange = lowRow :: highRow;
	
	If( rowrange[1] < type2Begin, factorRow = 1, factorRow = 2); //determine which row of factors we'll work with for this row range of data
	
	For( j = 1, j < ncolBreaks, j++,		
		lowCol = colBreaks[j];
		highCol = if ( j == ncolBreaks - 1, colbreaks[j+1], colbreaks[j+1] -1 );  //we don't need to subtract 1 off the last upper bound.
		For( c = lowCol, c <= highCol, c++,
			dt1[rowRange, c] *= dt2[factorRow, c - 1]   // -1 because dt1 has an ID column while dt2 does not
		);				
	);
);

dt1 << bring window to front;

View solution in original post

Craige_Hales · Apr 1, 2021 3:05 AM

You could use a virtual join and formula columns.

Virtual join adds the blue and gold keys

You can link them interactively or with JSL.

This may be using JMP16 names with the :"parameter3[Type]"n syntax; you might need name("parameter3[Type]") instead.

// make example tables. **save them** so virtual join works.
factorDT = New Table( "Imported Data 2",
	New Column( "Type", Numeric, "Continuous", Format( "Best", 12 ), Set Values( [1, 2] ) ),
	New Column( "parameter1", Numeric, "Continuous", Format( "Best", 12 ), Set Values( [1.1, 1.2] ) ),
	New Column( "parameter2", Numeric, "Continuous", Format( "Best", 12 ), Set Values( [0.8, 0.9] ) ),
	New Column( "parameter3", Numeric, "Continuous", Format( "Best", 12 ), Set Values( [1.1, 0.9] ) )
);
factorDT << save( "$temp/factor.jmp" );
dataDT = New Table( "Imported Data",
	New Column( "ID", Numeric, "Continuous", Format( "Best", 12 ), Set Values( [1, 2, 3, 4] ) ),
	New Column( "Type", Numeric, "Continuous", Format( "Best", 12 ), Set Values( [1, 1, 2, 2] ) ),
	New Column( "parameter1", Numeric, "Continuous", Format( "Best", 12 ), Set Values( [1, 2, 3, 4] ) ),
	New Column( "parameter2", Numeric, "Continuous", Format( "Best", 12 ), Set Values( [4, 3, 2, 1] ) ),
	New Column( "parameter3", Numeric, "Continuous", Format( "Best", 12 ), Set Values( [2, 3, 3, 2] ) )
);
dataDT << save( "$temp/data.jmp" );

// join the tables
factorDT:type << Set Property( "Link ID", 1 );
dataDT:type << Set Property( "Link Reference", Reference Table( "factor.jmp" ) );

// add formula columns
dataDT << New Column( "p 1", Numeric, "Continuous", Format( "Best", 12 ), Formula( :"parameter1[Type]"n * :parameter1 ) );
dataDT << New Column( "p 2", Numeric, "Continuous", Format( "Best", 12 ), Formula( :"parameter2[Type]"n * :parameter2 ) );
dataDT << New Column( "p 3", Numeric, "Continuous", Format( "Best", 12 ), Formula( :"parameter3[Type]"n * :parameter3 ) );

// force the evaluation to complete
dataDT<<RunFormulas;

Craige

Angela24996 · Apr 1, 2021 03:29 PM

Thank you for your reply. However, I have many columns and column names are complicated. Therefore I want to write a for loop to create new columns. Can you explain more on this statement?

Formula( :"parameter1[Type]"n * :parameter1 )

I am thinking about writing a for loop like this:

colnames = dataDT << Get Column Names();
nc = ncols(dataDT);

DT = New Table("normalized", New Column("ID", values(dataDT:ID << get values)),
	New Column("Type", values(dataDT:Type << get values)));
For(i = 3, i<= nc, i++,
DT << New Column( char( colnames[3]), Numeric, "Continuous", Format( "Best", 12 ), Formula( XXXX ) ));

But I don't know how to construct the formula correctly. It keeps giving me errors. Can you help me?

Thank you.

Craige_Hales · Apr 1, 2021 07:22 PM

The loop is not easy; I think this will do it:

// make example tables. **save them** so virtual join works.
factorDT = New Table( "Imported Data 2",
	New Column( "Type", Numeric, "Continuous", Format( "Best", 12 ), Set Values( [1, 2] ) ),
	New Column( "parameter1", Numeric, "Continuous", Format( "Best", 12 ), Set Values( [1.1, 1.2] ) ),
	New Column( "parameter2", Numeric, "Continuous", Format( "Best", 12 ), Set Values( [0.8, 0.9] ) ),
	New Column( "parameter3", Numeric, "Continuous", Format( "Best", 12 ), Set Values( [1.1, 0.9] ) )
);
factorDT << save( "$temp/factor.jmp" );
dataDT = New Table( "Imported Data",
	New Column( "ID", Numeric, "Continuous", Format( "Best", 12 ), Set Values( [1, 2, 3, 4] ) ),
	New Column( "Type", Numeric, "Continuous", Format( "Best", 12 ), Set Values( [1, 1, 2, 2] ) ),
	New Column( "parameter1", Numeric, "Continuous", Format( "Best", 12 ), Set Values( [1, 2, 3, 4] ) ),
	New Column( "parameter2", Numeric, "Continuous", Format( "Best", 12 ), Set Values( [4, 3, 2, 1] ) ),
	New Column( "parameter3", Numeric, "Continuous", Format( "Best", 12 ), Set Values( [2, 3, 3, 2] ) )
);
dataDT << save( "$temp/data.jmp" );

originalColumns = dataDT<<getColumnNames("string");

// join the tables
factorDT:type << Set Property( "Link ID", 1 );
dataDT:type << Set Property( "Link Reference", Reference Table( "factor.jmp" ) );

// add formula columns
for(i=3,i<=nitems(originalColumns),i+=1,
	e = "dataDT << New Column("
		|| "\!"" || originalColumns[i] || "_scaled\!"" // "parameter1_scaled"
		|| ", Numeric, \!"Continuous\!", Format( \!"Best\!", 12 ), Formula( dataDT:name(\!"" 
		|| originalColumns[i] || "[Type]" // this is the factor, in quotation marks, in name()
		|| "\!") * dataDT:name(\!"" 
		|| originalColumns[i] || // this is the data, in quotation marks, in name()
		"\!")))";
	// write(e,"\!n"); // dataDT << New Column("parameter1_scaled", Numeric, "Continuous", Format( "Best", 12 ), Formula( dataDT:name("parameter1[Type]") * dataDT:name("parameter1")))
	eval(parse(e));
);
// force the evaluation to complete
dataDT<<RunFormulas;

capture the original columns before adding the hidden columns with virtual join
build the expression e using \!" to escape the embedded quotation marks
use write() to debug your work, not print
<<RunFormulas is important for speed; the default background evaluation of column formulas is a lot slower

Craige

Angela24996 · Apr 2, 2021 06:40 PM

Thank you for the updates, I tried and it works on a trial dataset. However I do meet performance problems as pointed out by another person. Therefore I accepted his answer as the best solution.

ian_jmp · Apr 1, 2021 10:22 AM

And if you don't need the formulas, you could consider element wise multiplication of matrices which may or may not be 'better'. On my MacBook the code below took about 15 seconds.

NamesDefaultToHere(1);

nr = 10000; 	// Number of rows
nc = 10000;		// Number of Columns

dt1 = AsTable(J(nr, nc, RandomInteger(1, 20)));
dt2 = AsTable(J(nr, nc, RandomUniform(0, 1)));

m1 = dt1 << getAsMatrix;
m2 = dt2 << getAsMatrix;

dtFinal = AsTable(m1 :* m2);

Angela24996 · Apr 1, 2021 03:38 PM

Thank you, it works on a small dataset. But I have millions of rows in table 1 and only two rows in table 2. Therefore I want to avoid creating another huge table. If I can't find a better solution, I will try this method.

brady_brady · Apr 1, 2021 9:56 PM