altug_bayram
Level IV

Big Data / Automatic Column Elimination due to Zero Variance

Hello 

 

I am analyzing a very large data set with, at a minimum, around 2,000 parameters as columns. My samples have these parameter values across roughly 250,000 rows. My problem is with the columns rather than the rows.

 

I have a discrete classification defined for all samples, with two values, say class A and class B.

 

I analyze the data further in different software, which fails if it detects parameters that have zero variance (or very close to it).

I do not want to do this work manually (as you can imagine, with 2,000 columns).

I am looking for a way, as automated as possible, to delete columns whose variance is zero (or below a defined, very small threshold) for either class A or class B. I am happy to go through steps to complete the work, but I don't want to be selecting any columns for deletion by hand.

 

Any help will be appreciated. Thanks in advance.


txnelson
Super User

Re: Big Data / Automatic Column Elimination due to Zero Variance

Here is a simple example script that will do what you want.

Names Default To Here( 1 );
dt = Open( "$SAMPLE_DATA\semiconductor capability.jmp" );

// Get all continuous data columns
colList = dt << get column names( numeric, continuous );

// Loop across all columns and find those with no variance
For( i = N Items( colList ), i >= 1, i--,
	If( Col Std Dev( Column( dt, colList[i] ) ) != 0,
		colList = Remove( colList, i, 1 )
	)
);

// Delete the columns with no variance
dt << Delete Columns( colList );
Jim

Re: Big Data / Automatic Column Elimination due to Zero Variance

Altug, 

 

If I understand your problem correctly: 

- You have a column (named Group, for example) with 2 values ("A" and "B", for example).

- You have 2000 columns with numeric data.

- You wish to delete any column where the std deviation of the rows belonging to the "A" group, or of the rows belonging to the "B" group, is below some threshold (or both).

 

Performance (and memory usage) is going to be an issue here for a table of your size. I am not sure whether the script below will work, or work quickly, on a table as big as yours, but give it a go.

 

First, make a table where the "a/b" column is the first column, and the remaining columns are the parameter columns of interest.

 

Then try this.

 

Cheers,

Brady

dt = Current Data Table();

mat = dt << Get All Columns As Matrix(); // This gets the A/B column as 1s and 2s; which is which is not important.

// Create sub-matrices... note A and B will be reversed if B occurs first in dt; the result is unaffected.
matA = mat[loc(mat[0,1]==1), 0];
matB = mat[loc(mat[0,1]==2), 0];

// Compute std devs for each column in each sub-matrix
stdvA = vstd(matA);
stdvB = vstd(matB);

// Locate columns of sufficiently low variance
cols = loc(stdvA == 0 | stdvB == 0);

// Delete them. Skip the first entry of the cols vector; it is 1 (the group column itself, which has zero variance within each group).
try(dt << delete columns((dt << get column names)[cols[2::nrow(cols)]]));

 

altug_bayram
Level IV

Re: Big Data / Automatic Column Elimination due to Zero Variance

Your understanding is correct, Brady. The only thing to emphasize: with 250,000 rows, this becomes a 2,000-column by 250,000-row matrix.

altug_bayram
Level IV

Re: Big Data / Automatic Column Elimination due to Zero Variance

Jim, 

Thanks for your guidance here. I will try it out. Thanks again.

Re: Big Data / Automatic Column Elimination due to Zero Variance

Altug, I've tested my code a bit, and although it avoids explicit looping, which is good, the << Get All Columns As Matrix() message is simply too expensive when the table gets big, and the gains (if any) of vstd() vs. table ops are too minuscule to offset this.

 

For a table of your size, a modification of Jim's code (to allow for your groups) is going to be faster. Central to this is the fact that JMP table operations are really fast. I'm not even convinced that the matrix ops, which are also really fast, are faster than the column functions.

 

Subsetting the tables initially to create two subtables--one for each group--more than pays for itself (assuming you have enough memory to house all 3 tables), as taking subsets of rows over and over is expensive.

 

The code below (again, a slight modification of Jim's, to allow for your two groups) ran for me in about 2.5 seconds for 100K rows and 1K columns, whereas the routine I first submitted took over 6 seconds.

 

This will still seem a bit slow on a table of your size... hopefully someone else will have a better idea.

 

Names Default To Here( 1 );

dt = Current Data Table();

dtcol1 = column(dt, 1);
dt1 =  dt << subset(rows(dt<<get rows Where(ascolumn(dtcol1) == dtcol1[1])), invisible, selected columns(0));
dt2 =  dt << subset(rows(dt<<get rows Where(ascolumn(dtcol1) != dtcol1[1])), invisible, selected columns(0));

// Get all continuous data columns
colList = dt << get column names( numeric, continuous );

// Loop across all columns and find those with no variance 
For( i = N Items( colList ), i >= 1, i--,
	If( (Col Std Dev( Column( dt1, colList[i] ) ) ) != 0 & Col Std Dev( Column( dt2, colList[i] ) ) != 0,
		colList = Remove( colList, i, 1 )
	)
);


// Delete the columns with no variance
dt << Delete Columns( colList );

close(dt1, nosave);
close(dt2, nosave);

 

 

 

altug_bayram
Level IV

Re: Big Data / Automatic Column Elimination due to Zero Variance

Brady

 

I tested this on a sample dataset (not the huge one yet) to understand the mechanism. I've been able to modify it to apply a threshold on the std dev rather than requiring exactly 0. My remaining issue is that I do not want to delete data in cases where the std dev <= threshold (i.e., close to 0) AND the means of the parameter for class A and class B are different.

Hence I need to add a test comparing the means with respect to A and B as a further constraint.

 

I was thinking of a Kolmogorov-Smirnov test... in which case I can simply state "delete the parameter if KS <= threshold (practically close to 0)".

 

In fact, what I am trying to do is to down-select the parameters that differ between the classes for the next step. So I started thinking that maybe the best thing to do is just apply the KS test (no need for the std dev check).

 

JMP Help provides:

 

Names Default To Here( 1 );
dt = Open( "$SAMPLE_DATA/Big Class.jmp" );
obj = Oneway( Y( :Height ), X( :sex ) );
obj << Kolmogorov Smirnov Exact Test( 1 );

 

Could this be translated to my case as (within the current loop):

obj = Oneway( Y( Column( dt1, colList[i] ) ), X( Column( dt2, colList[i] ) ) );

obj << Kolmogorov Smirnov Exact Test( 1 );

 

but I do not know how to extract the KS value from this obj object.
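
For what it's worth, here is a rough, untested sketch of one way a value like this might be pulled from a Oneway report. The Oneway call runs against the original table (Y = the parameter, X = the group column), and the outline-box title used below is only a guess; run Show Tree Structure once to see the real box names in your JMP version.

Names Default To Here( 1 );
dt = Current Data Table();
groupCol = Column( dt, 1 );  // the class A/B column
yCol = Column( dt, 2 );      // one parameter column, for illustration

obj = Oneway( Y( yCol ), X( groupCol ) );
obj << Kolmogorov Smirnov Exact Test( 1 );

rpt = obj << Report;
rpt << Show Tree Structure;  // inspect once to locate the KS results box
ksBox = rpt["Kolmogorov-Smirnov Asymptotic Test"];      // title is an assumption -- adjust to match the tree
ksVals = ksBox[Number Col Box( 1 )] << Get As Matrix;   // KS statistic / p-value column
Show( ksVals );

rpt << Close Window;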

 

By the way, I don't think the CPU time for 2,000 columns is going to bother me, as I am willing to do anything to avoid manually touching the data.

 

Thanks so much.

 

Re: Big Data / Automatic Column Elimination due to Zero Variance

Exact tests can take a very long time to compute, and given that you have 250,000 rows in your data table, I would bet they will not be calculated in this situation. I recommend, instead, using the

Summarize YByX()

command, which calculates all Fit Y By X combinations and produces a data table of p-values and LogWorth values for each y/x combination. You can then determine the correct columns to investigate further. You will need to test for zero variance in a separate step.

Duane Hayes
altug_bayram
Level IV

Re: Big Data / Automatic Column Elimination due to Zero Variance

Thanks, Duane.

 

I am trying to determine within the script, via specific outputs and comparisons, whether a parameter differs between classes A and B. I am not trying to analyze an intermediate table of results and then make the decision manually.

 

I've got 2000 parameter columns in a given dataset and many datasets of comparable size. 

 

The procedure needs to auto-delete the columns for parameters found to be nearly the same between the two classes A and B.

 

I initially wrote this down as zero-variance-seeking logic, but the issue is that a parameter with (near) zero variance but different means between the classes should not be deleted. That's when I started changing the logic to include a two-sample comparison of means.

 

My problem is that I am quite a novice at scripting and do not know how to pull a specific value out of the results of a given platform.

My other issue is that your approach seems to require specifying all the parameters in the script; doing something like that for 2,000 columns could prove very time consuming (though maybe there is an easy enough way).

 

Brady's script is actually nearly perfect, except that I need to add a two-sample test for means and figure out how to use its output as an additional filter.
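
To make that concrete, here is a rough, untested sketch along the lines of Brady's subset script. It uses a plain mean-difference threshold as the "means are practically the same" check; both threshold values below are placeholders, and a formal two-sample or equivalence test could replace the mean check.

Names Default To Here( 1 );
sdThresh = 1e-6;    // std dev at or below this is treated as "no variance" (placeholder)
meanThresh = 0.01;  // largest mean difference still treated as "the same" (placeholder)

dt = Current Data Table();
dtcol1 = Column( dt, 1 );
dt1 = dt << Subset( Rows( dt << Get Rows Where( As Column( dtcol1 ) == dtcol1[1] ) ), Invisible, Selected Columns( 0 ) );
dt2 = dt << Subset( Rows( dt << Get Rows Where( As Column( dtcol1 ) != dtcol1[1] ) ), Invisible, Selected Columns( 0 ) );

colList = dt << Get Column Names( Numeric, Continuous );

// Keep (drop from the delete list) any column that shows real variation in both
// groups, or whose group means differ by more than the threshold.
For( i = N Items( colList ), i >= 1, i--,
	sd1 = Col Std Dev( Column( dt1, colList[i] ) );
	sd2 = Col Std Dev( Column( dt2, colList[i] ) );
	dMean = Abs( Col Mean( Column( dt1, colList[i] ) ) - Col Mean( Column( dt2, colList[i] ) ) );
	If( (sd1 > sdThresh & sd2 > sdThresh) | dMean > meanThresh,
		colList = Remove( colList, i, 1 )
	);
);

// Whatever remains in colList is practically constant in at least one group
// and has practically equal group means: delete it.
dt << Delete Columns( colList );
Close( dt1, NoSave );
Close( dt2, NoSave );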

 

thanks for your help.

Re: Big Data / Automatic Column Elimination due to Zero Variance

<<Edited to reflect that there are ~250,000 rows, not ~2,000, which is the number of columns.>>

 

Altug, there is something I'd like to clarify: you've got ~250K rows of data in two groups, which I'm assuming (by default) are of roughly the same size. If so, a t-test with 125,000 samples in each group has power = .9 to detect a difference of less than 1/50th of sigma. Given that you're only getting to this point (testing means) when one or both group sigmas are very low, is such a small (relative to sigma) difference one you feel is of practical importance? I.e., if you had two processes with sigma roughly = 1 unit, would a difference in means of 0.018 unit matter to you? This is how sensitive the t-test is going to be when there are 125,000 items in each group.
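
For reference, the standard large-sample approximation for the smallest mean difference detectable at power 1 − β, with n rows per group and a two-sided test at level α, is

$$\delta \approx \left(z_{1-\alpha/2} + z_{1-\beta}\right)\,\sigma\,\sqrt{\tfrac{2}{n}},$$

which with α = 0.05, β = 0.10, and n = 125,000 per group works out to on the order of one to two hundredths of σ.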

 

If in your setting such a small difference is not important, you'd be better served by an equivalence test (which in its simplest form is two simultaneous one-sided t-tests), which allows you to specify how large a difference IS meaningful. Means that differ by less than this amount are considered practically equivalent. Using this approach, a given column would be deleted as long as the means are practically equivalent and at least one of the group standard deviations is below the threshold.

 

Here is some information and an example:

http://www.jmp.com/support/help/Equivalence_Test.shtml
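
A rough sketch of how that decision rule might look in JSL is below. The Equivalence Test message and its argument are assumptions here (check the Scripting Index entry for Oneway for the exact signature in your version); the rule itself follows the description above.

Names Default To Here( 1 );
dt = Current Data Table();
groupCol = Column( dt, 1 );
yCol = Column( dt, 2 );      // one parameter column, for illustration
practicalDiff = 0.05;        // largest mean difference considered unimportant (placeholder)

obj = Oneway( Y( yCol ), X( groupCol ) );
obj << Equivalence Test( practicalDiff );   // two one-sided t-tests (TOST) -- verify the signature

// The column would then be deleted only when this test declares the means
// practically equivalent AND at least one group standard deviation is below
// the chosen threshold.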