altug_bayram
Level IV

Big Data / Automatic Column Elimination due to Zero Variance

Hello 

 

I am analyzing a very large data set with at least ~2000 parameters as columns. My samples carry these parameter values across ~250k rows. My problem is with the columns, as opposed to the rows.

 

I have a discrete classification defined for all samples, with two values, say class A and class B.

 

I am analyzing the data further in different software, which fails if it detects parameters that have zero (or near-zero) variance.

I do not want to do this work manually (as you can imagine, with 2000 columns).

I am looking for a way, as automated as possible, to delete columns whose variance is zero (or below a defined, very small threshold) for either class A or class B. I can walk through the steps to complete the work, but I don't want to be manually selecting any columns for deletion.

 

Any help will be appreciated. Thanks in advance.

11 REPLIES
altug_bayram
Level IV

Re: Big Data / Automatic Column Elimination due to Zero Variance

Brady, 

The script below is a slight modification of yours and should be the starting point (with a minor change: the std dev is compared to a threshold). As you know, I need an actual script to make further progress, since this deletion would be done automatically by the script. The next-phase internal code, which runs after the JMP data selection, has a particular issue when zero variance is encountered. Your prior version (i.e., below) proved that these columns can be taken out. I am, though, trying to avoid deleting parameters that have zero variance but different means.

 

I cannot use a threshold that depends on the actual parameter mean and distribution. For example, I cannot specify a practical difference, as that would require an in-depth examination of all 2000 parameters, which is not feasible. Not sure if I mentioned it, but 2000 is just the tip of the iceberg; more are coming. A single dataset, however, contains roughly 2000 parameters.

 

I can use a p-value to filter parameters in or out, depending on which ones cause a failure in our internal code. From that standpoint, I think a t-test may be the better option. However, I do not know how to insert the t-test and extract a p-value so that I can add it as an additional filter.
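
The closest I have gotten is the rough sketch below: it computes a Welch two-sample p-value directly from the class subsets (dt1 and dt2 from the script further down), rather than going through the Oneway platform. welchP and pCut are just names I made up, and the zero-variance guard reflects my assumption that constant-but-different means should count as significant. This needs checking before anyone relies on it.

welchP = Function( {c1, c2},
	{m1, m2, s1, s2, n1, n2, se2, tstat, df},
	m1 = Col Mean( c1 );	s1 = Col Std Dev( c1 );	n1 = Col Number( c1 );
	m2 = Col Mean( c2 );	s2 = Col Std Dev( c2 );	n2 = Col Number( c2 );
	se2 = s1 ^ 2 / n1 + s2 ^ 2 / n2;	// squared standard error of the mean difference
	If( se2 == 0,	// both classes constant: equal means -> p = 1, different means -> p = 0
		Return( If( m1 == m2, 1, 0 ) )
	);
	tstat = (m1 - m2) / Sqrt( se2 );
	df = se2 ^ 2 / ((s1 ^ 2 / n1) ^ 2 / (n1 - 1) + (s2 ^ 2 / n2) ^ 2 / (n2 - 1));	// Welch-Satterthwaite df
	2 * (1 - t Distribution( Abs( tstat ), df ));	// two-sided p-value
);

pCut = 0.05;	// placeholder cutoff
// Example use inside the filtering loop, as an extra filter:
//   p = welchP( Column( dt1, colList[i] ), Column( dt2, colList[i] ) );
//   significant = p < pCut;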

 

I feel like I have two options ... 

1) I use the code below as is, with no further two-sample means comparison. It would delete all such cases even if the means are different. I then need to record the column names of those that were deleted (need help with that; see the sketch after this list), so that at the end of the overall study I can turn my attention to them and analyze them in JMP to detect a means difference, if any. This would be a much smaller analysis; percentage-wise, I expect zero-variance cases to be a very small portion of the dataset.

 

2) My second option is to add a t-test, or any other test that offers a normalized threshold, to fine-tune the population of significant differences. E.g., I can use the p-value to confirm or reject significance.
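
For option 1, here is the kind of bookkeeping I have in mind, as a sketch (droppedNames, dtLog, and the table name are just placeholders): grab the names out of colList before the delete step and park them in a small table for the end-of-study review.

// Record what is about to be dropped, before dt << Delete Columns( colList )
droppedNames = {};
For( i = 1, i <= N Items( colList ), i++,
	Insert Into( droppedNames, Char( colList[i] ) )
);
dtLog = New Table( "Dropped Zero-Variance Columns",
	New Column( "Column Name", Character, Values( droppedNames ) )
);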

 

What do you think?

thanks 

 

----------------------------------------------------------------------------------------------------------------- 

Names Default To Here( 1 );

dt = Current Data Table();

// Split the table into two invisible subsets, one per class,
// keyed off the first column (the classifier)
dtcol1 = Column( dt, 1 );
dt1 = dt << Subset( Rows( dt << Get Rows Where( As Column( dtcol1 ) == dtcol1[1] ) ), invisible, Selected Columns( 0 ) );
dt2 = dt << Subset( Rows( dt << Get Rows Where( As Column( dtcol1 ) != dtcol1[1] ) ), invisible, Selected Columns( 0 ) );

// Get all continuous data columns
colList = dt << Get Column Names( numeric, continuous );

// Walk the list backwards and drop from it every column whose std dev
// clears the threshold in BOTH classes; what remains in colList are the
// columns with (near) zero variance in at least one class
For( i = N Items( colList ), i >= 1, i--,
	If( Col Std Dev( Column( dt1, colList[i] ) ) >= 0.01 & Col Std Dev( Column( dt2, colList[i] ) ) >= 0.01,
		colList = Remove( colList, i, 1 )
	)
);

// Delete the columns with (near) zero variance
dt << Delete Columns( colList );

Close( dt1, NoSave );
Close( dt2, NoSave );

 

Note: the thresholds on the std devs are not settled yet; 0.01 is just one example. I will fine-tune that.
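
Once it is settled, I will probably hoist the cutoff into a single variable so there is only one place to retune it, something like this (sdThreshold is just an illustrative name):

sdThreshold = 0.01;	// placeholder cutoff, to be tuned
For( i = N Items( colList ), i >= 1, i--,
	If( Col Std Dev( Column( dt1, colList[i] ) ) >= sdThreshold &
		Col Std Dev( Column( dt2, colList[i] ) ) >= sdThreshold,
		colList = Remove( colList, i, 1 )
	)
);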

Re: Big Data / Automatic Column Elimination due to Zero Variance

And this chunk takes another tenth of a second off the previous version, because it inverts the row selection instead of selecting twice. The gains are bigger on bigger tables.

 

// Select the class-A rows once, subset, then invert the selection
// to get the class-B subset, instead of running a second Select Where
dtcol1 = Column( dt, 1 );
dt << Select Where( As Column( dtcol1 ) == dtcol1[1] );
dt1 = dt << Subset( Selected Rows( 1 ), invisible, Selected Columns( 0 ) );
dt << Invert Row Selection;
dt2 = dt << Subset( Selected Rows( 1 ), invisible, Selected Columns( 0 ) );