BookmarkSubscribe
Choose Language Hide Translation Bar
Pat1
New Contributor

Median Absolute Deviation --> Export To Table for further tSNE

Dear all,

I am currently working on a project using 57 samples and >800 000 variables (see image attached). Now I would like to reduce the variables by using only the ones with highest median absolute deviation (MAD). The goal would be to process the data further with PC Analysis, followed by a t-SNE (I am using the nice JMP add-in).

So far I am struggling with the identification of the variables with highest MAD and directly export these variables to a data table.

Thanks in advance, best regards

Pat

0 Kudos
4 REPLIES 4
Highlighted
txnelson
Super User

Re: Median Absolute Deviation --> Export To Table for further tSNE

Here is a simple example of generating the MAD.  I suggest you turn it into a function, and then create a new data table with one row, and your >800000 columns.

Names Default To Here( 1 );
dt = Open( "$SAMPLE_DATA/big class.jmp" );

theMatrix = Column( dt, "height" ) << get as matrix;
MAD = Quantile( .5, Abs( theMatrix - Quantile( .5, theMatrix ) ) );
Jim
txnelson
Super User

Re: Median Absolute Deviation --> Export To Table for further tSNE

Here is a more complete example that you may be able to directly use by just changing the reference to the "dt" data table, and the column in the output table to delete(i.e. Sample ID)

Names Default To Here( 1 );
dt = Open( "$SAMPLE_DATA/semiconductor capability.jmp" );
dt << select where(row()==1);

dtMAD = dt <<  Subset(
	Selected Rows( 0 ),
	Rows( [1] ),
	Selected columns only( 0 )
);
dtMAD << delete columns({"Lot_id","wafer","Wafer id in lot ID","site"});

For( i = 1, i <= N Cols( dtMAD ), i++,
	theMatrix = Column( dt, Column( dtMAD, i ) << get name ) << get as matrix;
	Column( dtMAD, i )[1] = Quantile( .5, Abs( theMatrix - Quantile( .5, theMatrix ) ) );
);
Jim
ian_jmp
Staff

Re: Median Absolute Deviation --> Export To Table for further tSNE

I initially mis-read 'median' as 'mean' and wrote the following: It takes a few seconds on my Mac, but almost all of that is in making the example table:

NamesDefaultToHere(1);

// Make a table with 'nc' columns and 'nr' rows to use . . .
nc = 1000;
nr = 57;
dt = NewTable("MAD Example");
for(c=1, c<=nc, c++,
	dt << NewColumn("Col "||Char(c), Numeric, Continuous, Formula(RandomNormal(0, 1)));
);
dt << addRows(nr);
dt << runFormulas;

// Start Here: Calculate MAD for each column . . .
dt = CurrentDataTable();
mat = dt << getAsMatrix;
// (1): Get the mean of each column in a row vector
mean = VMean(mat);
// (2): 'Explode' this to make a matrix congruant with 'mat'
meanMat = Repeat(mean, nr);
// (3): Get the Absolute deviations
AD = Abs(mat - meanMat);
// (4): Get the MAD
MAD = VMean(AD);
// Find the column numbers of the 'nc2' columns that have the biggest MAD values
nc2 = Round(nc/10, 0);
rankMAD = Rank(MAD);
topRandMAD = rankMAD[(nc-nc2)+1::nc];
// Make a second table
dt2 = NewTable("Top "||Char(nc2)||" Largest MAD Values from "||(dt << getName),
			NewColumn("Column Number", Numeric, Nominal, Values(topRandMAD)),
			NewColumn("MAD", Numeric, Continuous, Values(MAD[topRandMAD]))
			);

In addition to @txnelson approach, I note that the Distribution platform will give the median absolute deviation. With 800,000 columns I might expect his approach to be quicker though. 

Pat1
New Contributor

Re: Median Absolute Deviation --> Export To Table for further tSNE

Thanks a lot @txnelson and @ian_jmp in principle your suggestions worked. Unfortunately, I used a slow workstation for the test, thus I was not able to process all variables, have to check it on a faster system. All best, Pat

0 Kudos