<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Big Data / Automatic Column Elimination due to Zero Variance in Discussions</title>
    <link>https://community.jmp.com/t5/Discussions/Big-Data-Automatic-Column-Elimination-due-to-Zero-Variance/m-p/45946#M26192</link>
    <description>&lt;P&gt;Altug, I've tested my code a bit. Although it avoids explicit looping, which is good, the &amp;lt;&amp;lt; Get All Columns As Matrix() call is&amp;nbsp;simply too expensive when the table gets big, and the gains (if any) of vstd() over table operations are too minuscule to offset this.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;For a table of your size, a modification of Jim's&amp;nbsp;code (to allow for your groups) is going to be faster. Central to this&amp;nbsp;is the fact that JMP table operations are really fast. I'm not even&amp;nbsp;convinced&amp;nbsp;that the matrix ops, which are also really fast, are faster than the column functions.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Subsetting the tables initially to create two subtables--one for each group--more than pays for itself (assuming you have enough memory to house all 3 tables), as taking subsets of rows over and over is expensive.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The code below (again, a slight modification of Jim's, to allow for your two groups) ran for me&amp;nbsp;in about 2.5 seconds for 100K rows and 1K columns, whereas the routine I first submitted took over 6 seconds.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;This will still be a bit slow on a table of your size ... hopefully someone else will have a&amp;nbsp;better idea.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-jsl"&gt;Names Default To Here( 1 );

dt = Current Data Table();

dtcol1 = column(dt, 1);
dt1 =  dt &amp;lt;&amp;lt; subset(rows(dt&amp;lt;&amp;lt;get rows Where(ascolumn(dtcol1) == dtcol1[1])), invisible, selected columns(0));
dt2 =  dt &amp;lt;&amp;lt; subset(rows(dt&amp;lt;&amp;lt;get rows Where(ascolumn(dtcol1) != dtcol1[1])), invisible, selected columns(0));

// Get all continuous data columns
colList = dt &amp;lt;&amp;lt; get column names( numeric, continuous );

// Walk the column list backward, removing from the deletion list any
// column whose std dev is nonzero in BOTH subtables
For( i = N Items( colList ), i &amp;gt;= 1, i--,
	If( Col Std Dev( Column( dt1, colList[i] ) ) != 0 &amp;amp; Col Std Dev( Column( dt2, colList[i] ) ) != 0,
		colList = Remove( colList, i, 1 )
	)
);

// What remains has zero variance in at least one group; delete those columns
dt &amp;lt;&amp;lt; Delete Columns( colList );

close(dt1, nosave);
close(dt2, nosave);
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Sat, 14 Oct 2017 00:16:59 GMT</pubDate>
    <dc:creator>brady_brady</dc:creator>
    <dc:date>2017-10-14T00:16:59Z</dc:date>
    <item>
      <title>Big Data / Automatic Column Elimination due to Zero Variance</title>
      <link>https://community.jmp.com/t5/Discussions/Big-Data-Automatic-Column-Elimination-due-to-Zero-Variance/m-p/45941#M26187</link>
      <description>&lt;P&gt;Hello&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I am analyzing a very large data set with, at minimum, around 2000 parameters as columns. My samples have these parameter values across ~250k rows. My problem is with the columns, as opposed to the rows.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have a discrete classification defined for all samples, with two values, say class A and class B.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I analyze the data further in different software, which fails if it detects parameters that have zero variance (or very close to it).&amp;nbsp;&lt;/P&gt;&lt;P&gt;I do not want to do this work manually (as you can imagine, with 2000 columns).&amp;nbsp;&lt;/P&gt;&lt;P&gt;I am looking for a way, as automated as possible, to delete columns that have zero variance (or variance below a defined, very small threshold) for either class A or class B. I go through several steps to complete the work, but I don't want to be manually selecting any column for deletion.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Any help will be appreciated. Thanks in advance.&lt;/P&gt;</description>
      <pubDate>Fri, 13 Oct 2017 21:08:02 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Big-Data-Automatic-Column-Elimination-due-to-Zero-Variance/m-p/45941#M26187</guid>
      <dc:creator>altug_bayram</dc:creator>
      <dc:date>2017-10-13T21:08:02Z</dc:date>
    </item>
    <item>
      <title>Re: Big Data / Automatic Column Elimination due to Zero Variance</title>
      <link>https://community.jmp.com/t5/Discussions/Big-Data-Automatic-Column-Elimination-due-to-Zero-Variance/m-p/45942#M26188</link>
      <description>&lt;P&gt;Here is a simple example script that will do what you want.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-jsl"&gt;Names Default To Here( 1 );
dt = Open( "$SAMPLE_DATA\semiconductor capability.jmp" );

// Get all continuous data columns
colList = dt &amp;lt;&amp;lt; get column names( numeric, continuous );

// Loop across all columns and find those with no variance
For( i = N Items( colList ), i &amp;gt;= 1, i--,
	If( Col Std Dev( Column( dt, colList[i] ) ) != 0,
		colList = Remove( colList, i, 1 )
	)
);

// Delete the columns with no variance
dt&amp;lt;&amp;lt;delete columns(colList);&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Fri, 13 Oct 2017 21:55:37 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Big-Data-Automatic-Column-Elimination-due-to-Zero-Variance/m-p/45942#M26188</guid>
      <dc:creator>txnelson</dc:creator>
      <dc:date>2017-10-13T21:55:37Z</dc:date>
    </item>
    <item>
      <title>Re: Big Data / Automatic Column Elimination due to Zero Variance</title>
      <link>https://community.jmp.com/t5/Discussions/Big-Data-Automatic-Column-Elimination-due-to-Zero-Variance/m-p/45944#M26190</link>
      <description>&lt;P&gt;Altug,&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If I understand your problem correctly:&amp;nbsp;&lt;/P&gt;
&lt;P&gt;- You have a column, (named Group, for example), with 2 values ("A" and "B", for example).&lt;/P&gt;
&lt;P&gt;- You have 2000 columns with numeric data.&lt;/P&gt;
&lt;P&gt;- You wish to delete any column where the std deviation of the rows belonging to the "A" group, OR of the rows belonging to the "B" group (or both), is below some threshold.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Performance (and memory usage) is going to be an issue here for a table of your size. I am not sure whether the below will work, or work quickly, on a table of the size you have, but give it a go.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;First, make a table where the "a/b" column is the first column, and&amp;nbsp;the remaining columns are the parameter columns of interest.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Then try this.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Cheers,&lt;/P&gt;
&lt;P&gt;Brady&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-jsl"&gt;dt = Current Data Table();

mat = dt &amp;lt;&amp;lt; Get All Columns As Matrix(); // this gets the A/B column as 1s and 2s; which is which doesn't matter.

//create subtables... note A &amp;amp; B will be reversed if B occurs first in dt; the result is unaffected.
matA = mat[loc(mat[0,1]==1), 0];
matB = mat[loc(mat[0,1]==2), 0];

//compute stddevs for each column in each subtable
stdvA = vstd(matA);
stdvB = vstd(matB);

//locate columns of sufficiently low variance
cols = loc(stdvA == 0 | stdvB == 0);

//delete them. Skip entry 1 of the cols vector; it is 1, flagging column 1 of dt (the group column, which is constant within each subtable).
try(dt &amp;lt;&amp;lt; delete columns((dt &amp;lt;&amp;lt; get column names)[cols[2::nrow(cols)]]));&lt;/CODE&gt;&lt;/PRE&gt;
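For readers following along outside JMP, the matrix logic above can be sketched in Python. This is an illustrative translation, not Brady's code: the toy data are invented, and the stdlib's pstdev stands in for JSL's vstd().

```python
from statistics import pstdev

# Toy matrix standing in for Get All Columns As Matrix(): column 0 holds the
# group codes (1s and 2s), the remaining columns are parameters. Values invented.
mat = [
    [1, 5.0, 7.0, 3.0],
    [1, 5.0, 7.1, 3.5],
    [2, 5.0, 6.9, 3.0],
    [2, 5.0, 7.2, 3.0],
]

# Split the rows into two group sub-matrices (the JSL loc(mat[0,1]==1) step).
mat_a = [row for row in mat if row[0] == 1]
mat_b = [row for row in mat if row[0] == 2]

# Per-column standard deviation within each group (the JSL vstd() step).
std_a = [pstdev(col) for col in zip(*mat_a)]
std_b = [pstdev(col) for col in zip(*mat_b)]

# Flag columns where either group has zero spread, skipping column 0,
# the group code, which is constant within each sub-matrix by construction.
flagged = [j for j in range(1, len(std_a)) if std_a[j] == 0 or std_b[j] == 0]
print(flagged)  # -> [1, 3]: column 1 is flat everywhere, column 3 is flat in group 2
```

The same one-pass structure applies: split once, compute per-column spreads, flag in a single vectorized comparison.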
&lt;P class="p1"&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 13 Oct 2017 22:47:23 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Big-Data-Automatic-Column-Elimination-due-to-Zero-Variance/m-p/45944#M26190</guid>
      <dc:creator>brady_brady</dc:creator>
      <dc:date>2017-10-13T22:47:23Z</dc:date>
    </item>
    <item>
      <title>Re: Big Data / Automatic Column Elimination due to Zero Variance</title>
      <link>https://community.jmp.com/t5/Discussions/Big-Data-Automatic-Column-Elimination-due-to-Zero-Variance/m-p/45946#M26192</link>
      <description>&lt;P&gt;Altug, I've tested my code a bit. Although it avoids explicit looping, which is good, the &amp;lt;&amp;lt; Get All Columns As Matrix() call is&amp;nbsp;simply too expensive when the table gets big, and the gains (if any) of vstd() over table operations are too minuscule to offset this.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;For a table of your size, a modification of Jim's&amp;nbsp;code (to allow for your groups) is going to be faster. Central to this&amp;nbsp;is the fact that JMP table operations are really fast. I'm not even&amp;nbsp;convinced&amp;nbsp;that the matrix ops, which are also really fast, are faster than the column functions.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Subsetting the tables initially to create two subtables--one for each group--more than pays for itself (assuming you have enough memory to house all 3 tables), as taking subsets of rows over and over is expensive.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The code below (again, a slight modification of Jim's, to allow for your two groups) ran for me&amp;nbsp;in about 2.5 seconds for 100K rows and 1K columns, whereas the routine I first submitted took over 6 seconds.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;This will still be a bit slow on a table of your size ... hopefully someone else will have a&amp;nbsp;better idea.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-jsl"&gt;Names Default To Here( 1 );

dt = Current Data Table();

dtcol1 = column(dt, 1);
dt1 =  dt &amp;lt;&amp;lt; subset(rows(dt&amp;lt;&amp;lt;get rows Where(ascolumn(dtcol1) == dtcol1[1])), invisible, selected columns(0));
dt2 =  dt &amp;lt;&amp;lt; subset(rows(dt&amp;lt;&amp;lt;get rows Where(ascolumn(dtcol1) != dtcol1[1])), invisible, selected columns(0));

// Get all continuous data columns
colList = dt &amp;lt;&amp;lt; get column names( numeric, continuous );

// Walk the column list backward, removing from the deletion list any
// column whose std dev is nonzero in BOTH subtables
For( i = N Items( colList ), i &amp;gt;= 1, i--,
	If( Col Std Dev( Column( dt1, colList[i] ) ) != 0 &amp;amp; Col Std Dev( Column( dt2, colList[i] ) ) != 0,
		colList = Remove( colList, i, 1 )
	)
);

// What remains has zero variance in at least one group; delete those columns
dt &amp;lt;&amp;lt; Delete Columns( colList );

close(dt1, nosave);
close(dt2, nosave);
&lt;/CODE&gt;&lt;/PRE&gt;
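The subset-and-loop strategy in this script translates to other languages too. Here is a minimal Python sketch of the same deletion-list logic (toy data invented for illustration; pstdev plays the role of Col Std Dev):

```python
from statistics import pstdev

# Toy stand-in for the JMP table: a dict of columns. "group" plays the role of
# column 1 in the JSL; the rest are the continuous parameter columns.
table = {
    "group": ["A", "A", "B", "B"],
    "p1": [5.0, 5.0, 5.0, 5.0],   # zero variance in both groups
    "p2": [1.0, 2.0, 1.0, 3.0],   # varies in both groups, so keep it
    "p3": [4.0, 4.5, 2.0, 2.0],   # zero variance within group B, so delete it
}

# Partition rows by whether they match the first group value (the dt1/dt2 split).
first = table["group"][0]
in_a = [g == first for g in table["group"]]

def group_std(col, mask):
    return pstdev(v for v, m in zip(col, mask) if m)

# As in the JSL loop: remove from the deletion list any column whose std dev
# is nonzero in BOTH subtables; what survives has zero spread in at least one group.
col_list = [c for c in table if c != "group"]
col_list = [c for c in col_list
            if not (group_std(table[c], in_a) != 0
                    and group_std(table[c], [not m for m in in_a]) != 0)]

for c in col_list:          # the JSL "delete columns(colList)" step
    del table[c]
print(sorted(table))        # -> ['group', 'p2']
```

Note the same inversion trick as the JSL: it is cheaper to prune the keep-worthy columns out of the deletion list than to build the deletion list directly.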
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sat, 14 Oct 2017 00:16:59 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Big-Data-Automatic-Column-Elimination-due-to-Zero-Variance/m-p/45946#M26192</guid>
      <dc:creator>brady_brady</dc:creator>
      <dc:date>2017-10-14T00:16:59Z</dc:date>
    </item>
    <item>
      <title>Re: Big Data / Automatic Column Elimination due to Zero Variance</title>
      <link>https://community.jmp.com/t5/Discussions/Big-Data-Automatic-Column-Elimination-due-to-Zero-Variance/m-p/45948#M26193</link>
      <description>&lt;P&gt;This chunk takes another tenth of a second off the previous version, by selecting once and inverting the selection rather than selecting twice. The gains grow with table size.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-jsl"&gt;dtcol1 = column(dt, 1);
dt &amp;lt;&amp;lt; select where(ascolumn(dtcol1) == dtcol1[1]);
dt1 =  dt &amp;lt;&amp;lt; subset(selected rows(1), invisible, selected columns(0));
dt &amp;lt;&amp;lt; Invert Row Selection;
dt2 =  dt &amp;lt;&amp;lt; subset(selected rows(1), invisible, selected columns(0));
&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Sat, 14 Oct 2017 04:14:04 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Big-Data-Automatic-Column-Elimination-due-to-Zero-Variance/m-p/45948#M26193</guid>
      <dc:creator>brady_brady</dc:creator>
      <dc:date>2017-10-14T04:14:04Z</dc:date>
    </item>
    <item>
      <title>Re: Big Data / Automatic Column Elimination due to Zero Variance</title>
      <link>https://community.jmp.com/t5/Discussions/Big-Data-Automatic-Column-Elimination-due-to-Zero-Variance/m-p/45977#M26219</link>
      <description>&lt;P&gt;Your understanding is correct, Brady.&amp;nbsp;The only thing to emphasize: with 250,000 rows, this becomes a 250,000 x 2000 matrix.&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sun, 15 Oct 2017 18:57:54 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Big-Data-Automatic-Column-Elimination-due-to-Zero-Variance/m-p/45977#M26219</guid>
      <dc:creator>altug_bayram</dc:creator>
      <dc:date>2017-10-15T18:57:54Z</dc:date>
    </item>
    <item>
      <title>Re: Big Data / Automatic Column Elimination due to Zero Variance</title>
      <link>https://community.jmp.com/t5/Discussions/Big-Data-Automatic-Column-Elimination-due-to-Zero-Variance/m-p/45978#M26220</link>
      <description>&lt;P&gt;Jim,&amp;nbsp;&lt;/P&gt;&lt;P&gt;thanks for your guidance here. I will try it out. thx again.&lt;/P&gt;</description>
      <pubDate>Sun, 15 Oct 2017 18:58:59 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Big-Data-Automatic-Column-Elimination-due-to-Zero-Variance/m-p/45978#M26220</guid>
      <dc:creator>altug_bayram</dc:creator>
      <dc:date>2017-10-15T18:58:59Z</dc:date>
    </item>
    <item>
      <title>Re: Big Data / Automatic Column Elimination due to Zero Variance</title>
      <link>https://community.jmp.com/t5/Discussions/Big-Data-Automatic-Column-Elimination-due-to-Zero-Variance/m-p/45980#M26221</link>
      <description>&lt;P&gt;Brady&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I tested this on a sample dataset (not the huge one yet) to understand the mechanism, and I've been able to modify it to put a threshold on the std dev rather than requiring exactly 0. My remaining issue: I do not want to delete a parameter when&lt;/P&gt;&lt;P&gt;stdev &amp;lt;= threshold (i.e., close to 0) AND the means of the parameter for classes A and B are different.&amp;nbsp;&lt;/P&gt;&lt;P&gt;Hence I need to add a test comparing the A and B means as a further constraint.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I was thinking of the Kolmogorov-Smirnov test ... in which case I can simply state&amp;nbsp;"delete the parameter if KS &amp;lt;= threshold (practically close to 0)".&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;In fact, what I am trying to do is downselect the parameters that differ for the next step. So I started thinking maybe the best thing to do is just apply KS (no need for the stdev).&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;JMP help provides&amp;nbsp;&lt;/P&gt;&lt;PRE&gt;&lt;CODE class=" language-jsl"&gt;Names Default To Here( 1 );
dt = Open( "$SAMPLE_DATA/Big Class.jmp" );
obj = Oneway( Y( :Height ), X( :sex ) );
obj &amp;lt;&amp;lt; Kolmogorov Smirnov Exact Test( 1 );&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Could this be translated to my case as (within the current loop)&lt;/P&gt;&lt;PRE&gt;&lt;CODE class=" language-jsl"&gt;obj = Oneway( Y( Column( dt1, colList[i] ) ), X( Column( dt2, colList[i] ) ) );
obj &amp;lt;&amp;lt; Kolmogorov Smirnov Exact Test( 1 );&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;But I do not know how to extract the KS value from this obj object.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;By the way, I don't think the CPU time for 2000 columns is going to bother me, as I am willing to do anything to avoid any manual touch on the data.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;thx so much.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 16 Oct 2017 01:26:57 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Big-Data-Automatic-Column-Elimination-due-to-Zero-Variance/m-p/45980#M26221</guid>
      <dc:creator>altug_bayram</dc:creator>
      <dc:date>2017-10-16T01:26:57Z</dc:date>
    </item>
    <item>
      <title>Re: Big Data / Automatic Column Elimination due to Zero Variance</title>
      <link>https://community.jmp.com/t5/Discussions/Big-Data-Automatic-Column-Elimination-due-to-Zero-Variance/m-p/45997#M26226</link>
      <description>&lt;P&gt;Exact tests can take a very long time to compute, and given you have 250,000 rows in your data table, I would bet they will not be calculated in this situation.&amp;nbsp; I recommend, instead, using the&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-jsl"&gt;Summarize YByX() &lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;command, which calculates all Fit Y By X combinations and produces a data table of p-values and LogWorth values for each Y/X combination.&amp;nbsp;You can then determine the correct columns&amp;nbsp;to investigate further.&amp;nbsp;You will need to test for zero variance in a separate step.&lt;/P&gt;</description>
      <pubDate>Mon, 16 Oct 2017 15:12:25 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Big-Data-Automatic-Column-Elimination-due-to-Zero-Variance/m-p/45997#M26226</guid>
      <dc:creator>Duane_Hayes</dc:creator>
      <dc:date>2017-10-16T15:12:25Z</dc:date>
    </item>
    <item>
      <title>Re: Big Data / Automatic Column Elimination due to Zero Variance</title>
      <link>https://community.jmp.com/t5/Discussions/Big-Data-Automatic-Column-Elimination-due-to-Zero-Variance/m-p/46031#M26247</link>
      <description>&lt;P&gt;thanks Duane.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I am trying to determine, within the script, via specific outputs and comparisons, whether a parameter differs between classes A and B. I am not trying to analyze an intermediate table of results and then make the decision.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I've got 2000 parameter columns in a given dataset, and many datasets of comparable size.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The procedure needs to auto-delete the columns found to be nearly the same between the two classes A and B for a given parameter.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I initially wrote this down as zero-variance-seeking logic, but the issue is that two parameters with zero variance but different means should not be deleted. That's when I started changing the logic to include a 2-sample test for means.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;My problem is that I am a novice at scripting and do not know how to pull a value out of the results of a given platform.&amp;nbsp;&lt;/P&gt;&lt;P&gt;My other issue is that where you specified all parameters in the script, even doing something like that could prove very time consuming for our data (maybe there is an easy enough way).&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Brady's script is actually nearly perfect, except I need to add a 2-sample test for means and know how to use its output as an additional filter.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;thanks for your help.&lt;/P&gt;</description>
      <pubDate>Tue, 17 Oct 2017 00:22:43 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Big-Data-Automatic-Column-Elimination-due-to-Zero-Variance/m-p/46031#M26247</guid>
      <dc:creator>altug_bayram</dc:creator>
      <dc:date>2017-10-17T00:22:43Z</dc:date>
    </item>
    <item>
      <title>Re: Big Data / Automatic Column Elimination due to Zero Variance</title>
      <link>https://community.jmp.com/t5/Discussions/Big-Data-Automatic-Column-Elimination-due-to-Zero-Variance/m-p/46033#M26249</link>
      <description>&lt;P&gt;&amp;lt;&amp;lt;Edited to reflect that there are ~250,000 rows, not ~2,000, which is the number of columns.&amp;gt;&amp;gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Altug, there is something I'd like to clarify: you've got ~250K rows of data in two groups, which I'm assuming (by default) are of roughly the same size. If so, a t-test with&amp;nbsp;125,000 samples in each group has power = .9 to detect a difference of&amp;nbsp;less than&amp;nbsp;1/50th sigma. Given you're only getting to this point (testing means) when one or both sigma levels is very low, is such a small&amp;nbsp;(relative to sigma) difference one you feel is of practical importance? I.e., if you had 2 processes with sigma roughly = 1 unit, would a difference in means of 0.018 unit matter to you? This is how sensitive the t-test is going to be when there are 125,000 items in each group.&lt;/P&gt;
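As a sanity check on the sensitivity claim above, a standard normal-approximation power formula can be evaluated directly. This is an editor's sketch, not Brady's calculation, and the exact constant depends on the convention used (his 0.018 figure may come from a different one):

```python
from math import sqrt
from statistics import NormalDist

# Minimal detectable difference for a two-sided two-sample test with equal
# group sizes: delta = (z_{1-alpha/2} + z_{power}) * sigma * sqrt(2/n).
z = NormalDist()
n = 125_000          # rows per group, as in the post
sigma = 1.0          # a process sigma of 1 unit, per the example
alpha, power = 0.05, 0.90

delta = (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) * sigma * sqrt(2 / n)
print(round(delta, 4))  # about 0.013 units
```

Either way, the detectable difference is indeed well under 1/50th of sigma, in line with the paragraph above.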
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If such a small difference is not important in your setting, you'd be better served by an equivalence test (in its simplest form, two simultaneous one-sided t-tests), which lets you specify how large a difference IS meaningful. Means that differ by less than this amount are considered practically equivalent. Using this approach,&amp;nbsp;a given column would be deleted as long as&amp;nbsp;its means are practically equivalent&amp;nbsp;and at least one of the group standard deviations is below the threshold.&lt;/P&gt;
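The two-one-sided-tests (TOST) idea can be sketched outside JMP as well. The Python below is a hedged illustration using the normal approximation (safe with groups of ~125,000 rows), not JMP's implementation; the function name, the delta margin, and the example numbers are all assumptions for demonstration:

```python
from math import sqrt
from statistics import NormalDist

def tost_equivalent(mean1, sd1, n1, mean2, sd2, n2, delta, alpha=0.05):
    """Two one-sided tests (TOST): declare the group means practically
    equivalent when the difference is significantly above -delta AND
    significantly below +delta. Uses the normal approximation, which is
    reasonable for very large groups."""
    se = sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    diff = mean1 - mean2
    p_lower = 1 - NormalDist().cdf((diff + delta) / se)  # H0: diff at or below -delta
    p_upper = NormalDist().cdf((diff - delta) / se)      # H0: diff at or above +delta
    return alpha > max(p_lower, p_upper)

# A 0.001-unit difference against a 0.05-unit equivalence margin: equivalent.
print(tost_equivalent(10.000, 1.0, 125_000, 10.001, 1.0, 125_000, delta=0.05))  # True
# A 0.1-unit difference against the same margin: not equivalent.
print(tost_equivalent(10.0, 1.0, 125_000, 10.1, 1.0, 125_000, delta=0.05))      # False
```

The key design point is that delta encodes the practical-importance judgment, so a tiny but statistically significant difference no longer blocks deletion.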
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Here is some information and an example:&lt;/P&gt;
&lt;P&gt;&lt;A href="http://www.jmp.com/support/help/Equivalence_Test.shtml" target="_blank"&gt;http://www.jmp.com/support/help/Equivalence_Test.shtml&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 17 Oct 2017 01:27:00 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Big-Data-Automatic-Column-Elimination-due-to-Zero-Variance/m-p/46033#M26249</guid>
      <dc:creator>brady_brady</dc:creator>
      <dc:date>2017-10-17T01:27:00Z</dc:date>
    </item>
    <item>
      <title>Re: Big Data / Automatic Column Elimination due to Zero Variance</title>
      <link>https://community.jmp.com/t5/Discussions/Big-Data-Automatic-Column-Elimination-due-to-Zero-Variance/m-p/46041#M26256</link>
      <description>&lt;P&gt;Brady,&amp;nbsp;&lt;/P&gt;&lt;P&gt;The script below, a slight modification of yours (with the std devs compared to a threshold rather than to 0), should be the starting point. As you know, I need an actual script to progress further, since the deletion must be done automatically by the script. Our internal code for the next phase (after the JMP data selection) has a particular problem when it encounters zero variance. Your prior version (i.e., below) proved that these columns can be taken out.&amp;nbsp;I am, though, trying to avoid deleting parameters that have zero variance but different means.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I cannot use a threshold that depends on the actual parameter mean &amp;amp; distribution; e.g., I cannot specify a practical difference, as that requires in-depth examination of all 2000 parameters, which is not feasible. Not sure if I mentioned it, but 2000 is just the tip of the iceberg, with more coming; a single dataset, however, contains roughly 2000 parameters.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I can use a p-value to filter parameters in or out depending on which ones cause a failure in our internal code. From that standpoint, a t-test may be the better option. However, I do not know how to insert the t-test and extract a p-value ... so I can add it as an additional filter.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I feel like I have two options:&amp;nbsp;&lt;/P&gt;&lt;P&gt;1) Use the code below as is, with no further comparison of means. It would delete all such cases even if the means differ. I then need to record the column names of those that were deleted (need help with that), so that at the end of the overall study I can turn my attention to them and analyze them in JMP to detect any difference in means. This would be a much smaller analysis, percentage-wise; I expect zero-variance cases to be a very small portion of the dataset.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;2) Or add a t-test, or any other test offering a normalized&amp;nbsp;threshold, to fine-tune the population of significant differences. E.g., I can use the p-value to confirm or reject significance.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;What do you think?&lt;/P&gt;&lt;P&gt;thanks&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;PRE&gt;&lt;CODE class=" language-jsl"&gt;Names Default To Here( 1 );

dt = Current Data Table();

dtcol1 = column(dt, 1);
dt1 = dt &amp;lt;&amp;lt; subset(rows(dt&amp;lt;&amp;lt;get rows Where(ascolumn(dtcol1) == dtcol1[1])), invisible, selected columns(0));
dt2 = dt &amp;lt;&amp;lt; subset(rows(dt&amp;lt;&amp;lt;get rows Where(ascolumn(dtcol1) != dtcol1[1])), invisible, selected columns(0));

// Get all continuous data columns
colList = dt &amp;lt;&amp;lt; get column names( numeric, continuous );

// Keep in the deletion list only the columns whose std dev is below
// threshold in at least one group
For( i = N Items( colList ), i &amp;gt;= 1, i--,
	If( Col Std Dev( Column( dt1, colList[i] ) ) &amp;gt;= 0.01 &amp;amp; Col Std Dev( Column( dt2, colList[i] ) ) &amp;gt;= 0.01,
		colList = Remove( colList, i, 1 )
	)
);

// Delete the near-zero-variance columns
dt &amp;lt;&amp;lt; delete columns(colList);

close(dt1, nosave);
close(dt2, nosave);&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;STRONG&gt;note: the thresholds on the std devs are not settled; 0.01 is just one example. I will fine-tune that.&lt;/STRONG&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 17 Oct 2017 14:22:48 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Big-Data-Automatic-Column-Elimination-due-to-Zero-Variance/m-p/46041#M26256</guid>
      <dc:creator>altug_bayram</dc:creator>
      <dc:date>2017-10-17T14:22:48Z</dc:date>
    </item>
  </channel>
</rss>

