Matrix Operations with By variables

uday_guntupalli · ‎07-12-2018

I am hoping this feature exists, if not would be a great addition. I am looking for the ability to perform matrix operations using by variable. Currently this feature exists only in data tables as far as I am aware. Having this featute will enable sorting matrixes possible without using loops.

A = Random Index(10^4,10^2);

B = Shape(A,10,10);

C = Sort Descending(B);  // Fails since function only supports vector or list 

D = Rank(B);// Fails since function only supports vector or list

gzmorgan0 · ‎07-15-2018

I am assuming your "By Variable" request is for By Column. JMP already provides several by column functions: V Mean, V Min, etc.

I see a benefit of V Rank, but V Sort Descending, would be strange. Once you create the matrix (created with Shape or Get as Matrix), typically, the items in a row are "linked", that is, kept together. What do you expect from Sort Descending(B)? I am not knowledgable of all languages, but I have only seen a by row sort in Matlab, which uses the first element in the row to specify the row order. Do you have an example in another languge or statistical package that sorts and does not maintain row linkage?

Here is an example ByColMat() function. Uncomment "Usage," and test it out. Note Sort Descending breaks the row linkage

Names Default to Here(1);

//================================================================================= 
byColMat = Function({mat, expr_name}, {nc, imat, i, rslt},
//mat is a matrix, expr_name is a string of a valid vector function
//returns a matrix typically a row vector if the name_expr function returns 1 value
//or a matrix ofd the same size. Summary is a special case
   nc = ncol(mat);
   if(expr_name=="Summary",
   	imat=J(10,nc,empty());
	imat[1,0] = V Mean(mat);
	imat[2,0] = V Std(mat);
	imat[3,0] = V Min(mat);
	imat[4,0] = V Max(mat);
	imat[5,0] = V Sum(mat);
   	For(i = 1, i<=nc, i++,	
   	   imat[6,i] = Quantile(.25, mat[0,i]);
   	   imat[7,i] = Quantile(.50, mat[0,i]);
   	   imat[8,i] = Quantile(.75, mat[0,i]);
   	   imat[9,i] = imat[8,i] - imat[6,i];
   	   imat[10,i] = nc - Nmissing(mat[0,i]);			
   	); 
   	Write("Stats are: Mean, Std Dev, Min, Max, Sum, Q25, Median, Q75, IQR, N"); 
   , //else
    imat=[];
    mexpr = parse(expr_name || "(mat[0,i])");
   	For(i = 1, i<=nc, i++,	
       imat ||= eval(mexpr)
    );
  );//end if
  imat
); //end function

/* Usage Example
//create a random matrix to emulate a data matrix
A = J(10,10, RandomInteger(1,10^4)); 

show(A, byColMat(A, "Summary"));
show(A, byColMat(A, "Sort Descending"));
show(A, byColMat(A, "Rank"));

*/

uday_guntupalli · ‎07-16-2018

@gzmorgan0,
Your interpretation is fairly accurate. I agree with you that JMP provides some by-column functions. My request was to expand on these beyond mean and sum.

Coming to your implementation of these functions, I don't want to use loops if it is possible to avoid it. If an in-built function already uses it, typically they are optimized and I am ok with it. However, if there is a possibility to avoid using a loop, I would want to avoid it.
Now regarding the rank and sort, the other language where I have seen sort by row is in Matlab (https://www.mathworks.com/help/matlab/ref/sort.html). I have also been trying to learn SAS and one of the things that I have seen is in SAS, to get rank you sort the data set and add a row number which actually makes sense. So, the intent behind my request for a row-sort in JMP alongside a rank is to be able to have that kind of leverage where an user can just sort the dataset and automatically has the ranking order using row() function.

One example I have been looking at where I have seen use for the aforementioned request is ranking time series data (hourly / daily / weekly) .

Currently, the only way I know how to achieve the ranks on such data set is to take it back to a data table and use the Col Rank() and Col Rank() and Col Number().

A = Random Shuffle(Random Index(10^3,240)); // random hourly data for 10 days 

dt = As Table(A); 

As Column(dt,"Col1") << Set Name("TestVal"); 

dt << New Column("Day",Numeric,Continuous,Formula(Ceiling(Row()/24)))
   << New Column("RankHourByDayAsc",Numeric,Continuous,Formula(Col Rank(:TestVal,:Day)))
   << New Column("RankHourByDayDesc",Numeric,Continuous,Formula(Col Number(:TestVal,:Day) + 1 - Col Rank(:TestVal,:Day)));