Multithreading troubleshooting using Parallel Assign

msulzer_bsi · Oct 24, 2019 8:25 AM

I'm on a Mac trying to get multithreading to work for row-wise operations on a large data table (more than 8 million rows). The operations are simple calculations such as division or copying values as-is. I can't attach my proprietary data, so the JSL in this post generates a random matrix of roughly the same size as my dataset.

Names Default To Here(1);

//For JMP community.
dt = J(Random Integer(6000000 , 8000000) , 5 , Random Integer(10)); 
dt = As Table(dt); 

//dt = Current Data Table(); 	//Table needs to be formatted such that data table has 5 columns with names X , Y , Z , Filename , Radius. 
							//columns must be in that order for matrix functionality to work

edgeExclusion = 0.01; //in units of radial meters
stepSize = 0.0025; //in meters; this is the resolution of the desired particle size analysis. This is the side length of the square of the defect area resolution desired.
waferSize = 0.2; //in meters, diameter

If( Floor( waferSize / stepSize ) != Ceiling( waferSize / stepSize ) , Throw( "!Defect resolution size not a common factor of wafer size"));

//Takes data table as matrix (fDT) and returns matrix of x, y, stddev, stddev points, and UID.
//Requires datatable to have X , Y , Z , Filename , Radius , and UID so that matrix fDT has X , Y , Z , Radius , and UID; in that order.
stdDevCalculator2 = Function ( { fDT , fFirstRow , fLastRow , fWS , fEE , fSS } , {Default Local} , 
	
	fSubDT = fDT[ fFirstRow :: fLastRow , 0];
	outputMat = J(0,5);
	
	For( i = 1 , i <= N Rows(fSubDT) , i++ , 
	
		//If( fSubDT[i,4] > (fWS/2)-fEE , Continue()); //Possibly remove to keep number of columns same for Parallel Assign
					
		rowMat = Matrix( { Floor(fSubDT[i,1]/fSS) , Floor(fSubDT[i,2]/fSS) , fSubDT[i,3] , fSubDT[i,4] , fSubDT[i,5] } );
		rowMat = Shape(rowMat,1,5);
		If(N Rows(outputMat) == 0 , outputMat = rowMat , Concat To(outputMat , rowMat));
		
	);
		
	outputMat[1,0];
	
);

t1 = Tick Seconds();

numThreads = 8; 

//Number of rows to delete so that Parallel Assign can operate. Trimming the original data table to an exact multiple of the thread count lets tOutMat be a rectangular matrix,
//so that each thread is assigning the same number of columns.
delNum = N Rows(dt) - (Floor(N Rows(dt)/numThreads)*numThreads); 

If( delNum != 0 , 
	dt << Select Randomly(delNum);
	dt << Delete Rows(); //Trim data set
);

numCols = 5 * (N Rows(dt)/numThreads); //5 for number of columns in original stacked data table.

//outMat = stdDevCalculator2(dtMat , 1 , N Rows(dtMat) , waferSize , edgeExclusion , stepSize); //For single threaded testing

dtMat = dt << GetAsMatrix;

tOutMat = J(numThreads,numCols);

Parallel Assign(
	{tDT = dtMat, tWS = waferSize, tEE = edgeExclusion, tSS = stepSize, tSDC = Name Expr(stdDevCalculator2), tNum = numThreads },
	
	tOutMat[i,0] = tSDC( tDT, 1 + Floor((i - 1) * N Rows(tDT) / tNum) , Floor( i * N Rows(tDT) / tNum) , tWS, tEE, tSS );
);

Show(tOutMat);

Write( "elapsed time=", Tick Seconds() - t1 );

My problem is that on my Mac, Parallel Assign appears to start working, because utilization on all my cores jumps to 100% when this script runs. However, it never finishes: even after waiting much longer than the single-threaded version takes, I have to force quit.

I've done some playing around to understand how JSL prefers me to "inject" a row-wise set of values into another matrix. Take the example below:

Names Default to Here(1);
mat1 = J(3,5,0);
mat2 = J(1,5,Random Integer(5));

show(mat1);
Show(mat2);

mat1[2,0] = mat2[1,0];
Show(mat1);

mat1 will now have its second row as a complete copy of mat2. I've tried to emulate this kind of row assignment by having each thread return a single row built from N Rows(table)/numThreads rows of the input. Does anyone have any thoughts as to why I can't get this to work?
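The trimming and row-range arithmetic from the script above can be checked in isolation on small numbers. This is only a sketch: nRows, nThreads, firstRow, and lastRow are hypothetical names chosen for illustration, and it exercises just the index math, not Parallel Assign itself.

```jsl
Names Default To Here( 1 );

nRows = 19;      // hypothetical row count, deliberately not a multiple of nThreads
nThreads = 4;    // hypothetical thread count

// Same trimming formula as in the script; equivalent to Mod( nRows, nThreads )
delNum = nRows - (Floor( nRows / nThreads ) * nThreads);
nRows = nRows - delNum;    // 16 rows remain after removing 3
Show( delNum, nRows );

// Row range intended for each value of i in tOutMat[i, 0]
For( i = 1, i <= nThreads, i++,
	firstRow = 1 + Floor( (i - 1) * nRows / nThreads );
	lastRow = Floor( i * nRows / nThreads );
	Show( i, firstRow, lastRow );
);
```

With these values the four ranges are rows 1 to 4, 5 to 8, 9 to 12, and 13 to 16, so the ranges are contiguous, do not overlap, and each covers the same number of rows.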

Craige_Hales · Oct 24, 2019 01:21 PM

I think you are hoping for something Parallel Assign can't do. This code

tOutMat[i,0] =

in the Parallel Assign surprised me; the only thing Parallel Assign expects for those subscripts is JSL variable names, not a zero. It appears that the zero is quietly ignored and the right-hand side of the assignment has no way to know what the 2nd index is. Here are some examples; you can change the 2nd subscript to zero and see that it is ignored.

One. Use a variable i, local to each thread, and increment it after each use. The j,k (or j,0) index values are not used to determine the value calculated for the element.

x = J( 7, 11 );
Parallel Assign( {i = 0}, x[j, k] = (i++) );
Show( x );

x =
[ 0 0 0 0 1 1 1 1 2 2 2,
2 3 3 3 3 4 4 4 4 5 5,
5 5 6 6 6 6 7 7 7 7 8,
8 8 8 9 9 9 9 10 10 10 10,
11 11 11 11 12 12 12 12 13 13 13,
13 14 14 14 14 15 15 15 15 16 16,
16 16 17 17 17 17 18 18 18 18 19];

Two. Similar; shows the same thing another way. The threads each get a unique value that remains unchanged. The final h = 4 hints at how many processors are available. You should not depend on JMP always interleaving the calculations as shown, but that's how it is for now.

x = J( 7, 11 );
h = 0;
Parallel Assign( {i = (h++)}, x[j, k] = i );
Show( x, h );

x =
[ 0 1 2 3 0 1 2 3 0 1 2,
3 0 1 2 3 0 1 2 3 0 1,
2 3 0 1 2 3 0 1 2 3 0,
1 2 3 0 1 2 3 0 1 2 3,
0 1 2 3 0 1 2 3 0 1 2,
3 0 1 2 3 0 1 2 3 0 1,
2 3 0 1 2 3 0 1 2 3 0];
h = 4;

Three. Use a single global variable. Increment after each use. The threads lock each other out and might or might not run in any particular order.

x = J( 7, 11 );
global:g = 0;
Parallel Assign( {i = (h++)}, x[j, k] = (global:g++) );
Show( x, g );

x =
[ 0 20 39 58 1 21 40 59 2 22 41,
60 3 23 42 61 4 24 43 62 5 25,
44 63 6 26 45 64 7 27 46 65 8,
28 47 66 9 29 48 67 10 30 49 68,
11 31 50 69 12 32 51 70 13 33 52,
71 14 34 53 72 15 35 54 73 16 36,
55 74 17 37 56 75 18 38 57 76 19];
g = 77;

In all cases above, replacing k with 0 changes nothing. Internally, Parallel Assign is still computing all the index pairs and asking one of the threads to compute a value for each pair.

If you continue exploring down this path I'd suggest making separate instances of your function for each thread, something like

x = J( 2, 3 );
Parallel Assign( {f = Function( {x, y}, x * y )}, x[i, j] = f( i, j ) );
Show( x );

x = [1 2 3, 2 4 6];

which causes the local f for each thread to have a separate copy of the JSL function. I think Name Expr does not make separate copies, and the copies may need to be separate to run in parallel correctly.

You may also be able to use an assignment to a global array element. Use a one-dimensional array of the right length to drive the parallel assign: (also notice the multi-statement right-hand-side value in parentheses)

global:results = J( 5, 7, . );
driver = J( N Rows( results ), 1, 0 );
h = 0;
Parallel Assign(
	{procNum = (1 + h++), NC = N Cols( results )},
	driver[j] = 
	(
		For( k = 1, k <= NC, k += 1,
			global:results[j, k] = procNum * 100 + j * 10 + k
		); 
		procNum// value assigned to driver
	)
);
Show( global:results, h, driver );

global:results =
[ 111 112 113 114 115 116 117,
221 222 223 224 225 226 227,
331 332 333 334 335 336 337,
441 442 443 444 445 446 447,
151 152 153 154 155 156 157];
h = 4;
driver = [1, 2, 3, 4, 1];

But the global variables are probably going to make it run as slow as single threaded.

And finally: Parallel Assign tries to keep you from writing unsafe code that might allow two threads to access some internal structure, at the same time, in a way that will either produce incorrect results or a crash. It is possible to write JSL that will evade Parallel Assign's defences...save your work often if you are pushing the envelope.

Craige

msulzer_bsi · Oct 24, 2019 01:44 PM

Hi Craige,

Thanks so much for your response to my question; I was hoping that at least you would see the thread, because much of what I know about Parallel Assign comes from examples you have posted previously. I will try to digest the core of what you wrote and get back to you with any results I can apply to my specific case.

msulzer_bsi · Oct 25, 2019 1:33 PM

Hi @Craige_Hales,

I utilized your advice of creating a function definition within the parallel assign. See below:

Names Default To Here(1);
dt = Current Data Table(); 	//Table needs to be formatted such that data table has 5 columns with names X , Y , Z , Filename , Radius. 
							//columns must be in that order for matrix functionality to work

edgeExclusion = 0.01; //in units of radial meters
stepSize = 0.0025; //in meters; this is the resolution of the desired particle size analysis. This is the side length of the square of the defect area resolution desired.
waferSize = 0.2; //in meters, diameter

If( Floor( waferSize / stepSize ) != Ceiling( waferSize / stepSize ) , Throw( "!Defect resolution size not a common factor of wafer size"));

t1 = Tick Seconds();

numThreads = 8; 

//Number of rows to delete so that Parallel Assign can operate. Trimming the original data table to an exact multiple of the thread count lets tOutMat be a rectangular matrix,
//so that each thread is assigning the same number of columns.
delNum = N Rows(dt) - (Floor(N Rows(dt)/numThreads)*numThreads); 
If( delNum != 0 , 
	dt << Select Randomly(delNum);
	dt << Delete Rows(); //Trim data set; skipped when nothing needs trimming so no previously selected rows are deleted
);

dtMat = dt << GetAsMatrix;
dtMat = Shape(dtMat , numThreads);

tOutMat = J(numThreads , N Cols(dtMat));

Parallel Assign(
	{tDT = dtMat, tSS = stepSize, tf = Function ( { fDT , x , y , fSS } , 
	
			If( Mod(y,5) == 1 , Floor(fDT[x,y]/fSS) ,
				Mod(y,5) == 2 , Floor(fDT[x,y]/fSS) , 
				fDT[x,y];
			);
				
		); },
	
	tOutMat[i,j] = tf( tDT, i , j , tSS );
);

Show(tOutMat);

Write( "elapsed time=", Tick Seconds() - t1 );

The script now executes without giving me any weird output or throwing an error. I re-read your explanation that Parallel Assign computes each element of the output matrix as a function of that element's position. In my case, I'm only interested in performing calculations on the values in the first two columns of the original long data table, hence the Mod() == 1 and Mod() == 2 tests. This isn't the most flexible script, because of the strict formatting requirements on the input table, but that's a problem for later.

However, I'm now seeing something strange: the multithreaded version of the script takes ~200 seconds to execute. When I rewrite the Parallel Assign statement as a single-threaded nested For loop, like below:

For(i = 1 , i <= numThreads , i++ , 

	For(j = 1 , j <= N Cols(dtMat) , j++ , 
		
		If( Mod(j,5) == 1 , tOutMat[i,j] = Floor(dtMat[i,j]/stepSize) , 
			Mod(j,5) == 2 , tOutMat[i,j] = Floor(dtMat[i,j]/stepSize) , 
			tOutMat[i,j] = dtMat[i,j];
		);
		
	);

);

I get execution times of ~2 seconds. I shaped the input matrix, even though it's unnecessary here, so that the single-threaded and multithreaded versions would be compared on equal terms. I can't understand what sort of overhead is causing the multithreaded version to be so much slower. I've trimmed down the matrix calculation quite a bit since I first posted my question. Do you have any insights?

Thanks, msulzer

martindemel · Feb 14, 2020 1:43 AM

Hi @msulzer_bsi

A general statement: multithreading is a bad idea when the problem you are trying to solve is not parallelizable, or when the problem is so small that spawning threads costs more time than computing in a single thread.

If your problem can be broken into parts that can run in parallel, it might be a good idea to do so, but it depends on how much can run in parallel and how easy it is to join the sub-problem results. If your algorithm requires multiple joins (back to one thread) in order to split into new sub-problems on new threads, it has to be very computation-intensive to be worth the effort of parallelising. If, on the other hand, your result is just a list of sub-problem results joined once, then there is potential.

I do not see much benefit for your operation, and the speed improvement you could gain over 2 seconds cannot be large. So why make this effort and block other tasks that could have run on the other threads? Sometimes it is not worth the work to gain another 0.5 s. Unless this is a high-throughput application where another chunk of data arrives for processing every 2-3 s, I do not see a good reason to speed up this operation, even with >8MM rows of data as you said.
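For a calculation this simple, a single-threaded alternative is to skip the element loop and operate on whole columns of the matrix. The following is a minimal sketch, assuming the unshaped n x 5 layout with X and Y in the first two columns and relying on Floor() and division by a scalar working elementwise on matrices. The three-row dtMat is a made-up stand-in for the real data, and outMat is a hypothetical name.

```jsl
Names Default To Here( 1 );

stepSize = 0.0025;    // same step size as in the thread

// Hypothetical stand-in for dt << Get As Matrix: columns X, Y, Z, Radius, UID
dtMat = [0.0051 0.0126 1.5 0.0136 1,
	0.0301 0.0074 2.5 0.0309 2,
	0.0999 0.0501 0.7 0.1118 3];

outMat = dtMat;                                        // columns 3 to 5 are copied as-is
outMat[0, 1] = Floor( dtMat[0, 1] / stepSize );        // bin the whole X column at once
outMat[0, 2] = Floor( dtMat[0, 2] / stepSize );        // bin the whole Y column at once

Show( outMat );
```

Whether this beats the ~2 second nested loop on the full table has not been measured here; the point is only that no threads, reshaping, or row trimming are needed for this kind of calculation.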

/****NeverStopLearning****/

Craige_Hales · Feb 14, 2020 4:42 AM

Re-puzzling over the slow JSL: I think the tDT value is a huge matrix that is being copied a huge number of times, because it is passed to the tf function for every element. 8E6 * 8E6 numbers are copied. The threads are waiting for memory access.

tOutMat[i,j] = tf( tDT, i , j , tSS )

That copy does not happen in the fast JSL.
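If the per-call copy is indeed the cost, one thing to try is to drop the function argument and let the right-hand side index the thread-local matrix directly. This is a sketch under that assumption and has not been timed. The 2 x 10 dtMat is a made-up stand-in for the reshaped table, and each thread still receives its own tDT once through the local list.

```jsl
Names Default To Here( 1 );

stepSize = 0.0025;

// Hypothetical stand-in for the reshaped dtMat: each row packs two
// 5-column input rows (X, Y, Z, Radius, UID) side by side
dtMat = [0.0051 0.0126 1.5 0.0136 1 0.0301 0.0074 2.5 0.0309 2,
	0.0999 0.0501 0.7 0.1118 3 0.0402 0.0613 0.9 0.0733 4];

tOutMat = J( N Rows( dtMat ), N Cols( dtMat ), 0 );

Parallel Assign(
	{tDT = dtMat, tSS = stepSize},
	// No function call, so the big matrix is not passed as an argument per element
	tOutMat[i, j] = If( Mod( j, 5 ) == 1 | Mod( j, 5 ) == 2,
		Floor( tDT[i, j] / tSS ),    // X and Y positions: bin by step size
		tDT[i, j]                    // everything else: copy as-is
	)
);

Show( tOutMat );
```

Comparing Tick Seconds() for this form against the nested For loop on the real table would show whether any multithreading benefit remains once the copy is gone.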

Craige
