<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Big Data / Automatic Column Elimination due to Zero Variance in Discussions</title>
    <link>https://community.jmp.com/t5/Discussions/Big-Data-Automatic-Column-Elimination-due-to-Zero-Variance/m-p/45946#M26192</link>
    <description>&lt;P&gt;Altug, I've tested my code a bit. Although it avoids explicit looping, which is good, the &amp;lt;&amp;lt; Get All Columns As Matrix() call is&amp;nbsp;simply too expensive when the table gets big, and the gains (if any) of vstd() over table operations are too minuscule to offset this.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;For a table of your size, a modification of Jim's&amp;nbsp;code (to allow for your groups) is going to be faster. Central to this&amp;nbsp;is the fact that JMP table operations are really fast. I'm not even&amp;nbsp;convinced&amp;nbsp;that the matrix ops, which are also really fast, are faster than the column functions.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Subsetting the tables initially to create two subtables--one for each group--more than pays for itself (assuming you have enough memory to house all 3 tables), as taking subsets of rows over and over is expensive.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The code below (again, a slight modification of Jim's, to allow for your two groups) ran for me&amp;nbsp;in about 2.5 seconds for 100K rows and 1K columns, whereas the routine I first submitted took over 6 seconds.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;This will still be a bit slow on a table of your size ... hopefully someone else will have a&amp;nbsp;better idea.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-jsl"&gt;Names Default To Here( 1 );

dt = Current Data Table();

dtcol1 = column(dt, 1);
dt1 =  dt &amp;lt;&amp;lt; subset(rows(dt&amp;lt;&amp;lt;get rows Where(ascolumn(dtcol1) == dtcol1[1])), invisible, selected columns(0));
dt2 =  dt &amp;lt;&amp;lt; subset(rows(dt&amp;lt;&amp;lt;get rows Where(ascolumn(dtcol1) != dtcol1[1])), invisible, selected columns(0));

// Get all continuous data columns
colList = dt &amp;lt;&amp;lt; get column names( numeric, continuous );

// Walk the column list backward, removing from the deletion list any
// column whose std dev is nonzero in BOTH subtables
For( i = N Items( colList ), i &amp;gt;= 1, i--,
	If( Col Std Dev( Column( dt1, colList[i] ) ) != 0 &amp;amp; Col Std Dev( Column( dt2, colList[i] ) ) != 0,
		colList = Remove( colList, i, 1 )
	)
);

// What remains has zero variance in at least one group; delete those columns
dt &amp;lt;&amp;lt; Delete Columns( colList );

close(dt1, nosave);
close(dt2, nosave);
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Sat, 14 Oct 2017 00:16:59 GMT</pubDate>
    <dc:creator>brady_brady</dc:creator>
    <dc:date>2017-10-14T00:16:59Z</dc:date>
    <item>
      <title>Big Data / Automatic Column Elimination due to Zero Variance</title>
      <link>https://community.jmp.com/t5/Discussions/Big-Data-Automatic-Column-Elimination-due-to-Zero-Variance/m-p/45941#M26187</link>
      <description>&lt;P&gt;Hello&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I am analyzing a very large data set with, at minimum, around 2000 parameters as columns. My samples have these parameter values across ~250k rows. My problem is with the columns, as opposed to the rows.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have a discrete classification defined for all samples, with two values, say class A and class B.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I analyze the data further in different software, which fails if it detects parameters that have zero variance (or very close to it).&amp;nbsp;&lt;/P&gt;&lt;P&gt;I do not want to do this work manually (as you can imagine, with 2000 columns).&amp;nbsp;&lt;/P&gt;&lt;P&gt;I am looking for a way, as automated as possible, to delete columns that have zero variance (or variance below a defined, very small threshold) for either class A or class B. I go through several steps to complete the work, but I don't want to be manually selecting any column for deletion.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Any help will be appreciated. Thanks in advance.&lt;/P&gt;</description>
      <pubDate>Fri, 13 Oct 2017 21:08:02 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Big-Data-Automatic-Column-Elimination-due-to-Zero-Variance/m-p/45941#M26187</guid>
      <dc:creator>altug_bayram</dc:creator>
      <dc:date>2017-10-13T21:08:02Z</dc:date>
    </item>
    <item>
      <title>Re: Big Data / Automatic Column Elimination due to Zero Variance</title>
      <link>https://community.jmp.com/t5/Discussions/Big-Data-Automatic-Column-Elimination-due-to-Zero-Variance/m-p/45942#M26188</link>
      <description>&lt;P&gt;Here is a simple example script that will do what you want.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-jsl"&gt;Names Default To Here( 1 );
dt = Open( "$SAMPLE_DATA\semiconductor capability.jmp" );

// Get all continuous data columns
colList = dt &amp;lt;&amp;lt; get column names( numeric, continuous );

// Loop across all columns and find those with no variance
For( i = N Items( colList ), i &amp;gt;= 1, i--,
	If( Col Std Dev( Column( dt, colList[i] ) ) != 0,
		colList = Remove( colList, i, 1 )
	)
);

// Delete the columns with no variance
dt&amp;lt;&amp;lt;delete columns(colList);&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Fri, 13 Oct 2017 21:55:37 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Big-Data-Automatic-Column-Elimination-due-to-Zero-Variance/m-p/45942#M26188</guid>
      <dc:creator>txnelson</dc:creator>
      <dc:date>2017-10-13T21:55:37Z</dc:date>
    </item>
    <item>
      <title>Re: Big Data / Automatic Column Elimination due to Zero Variance</title>
      <link>https://community.jmp.com/t5/Discussions/Big-Data-Automatic-Column-Elimination-due-to-Zero-Variance/m-p/45944#M26190</link>
      <description>&lt;P&gt;Altug,&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If I understand your problem correctly:&amp;nbsp;&lt;/P&gt;
&lt;P&gt;- You have a column, (named Group, for example), with 2 values ("A" and "B", for example).&lt;/P&gt;
&lt;P&gt;- You have 2000 columns with numeric data.&lt;/P&gt;
&lt;P&gt;- You wish to delete any column where the std deviation of the rows belonging to the "A" group, OR of the rows belonging to the "B" group (or both), is below some threshold.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Performance (and memory usage) is going to be an issue here for a table of your size. I am not sure whether the below will work, or work quickly, on a table of the size you have, but give it a go.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;First, make a table where the "a/b" column is the first column, and&amp;nbsp;the remaining columns are the parameter columns of interest.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Then try this.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Cheers,&lt;/P&gt;
&lt;P&gt;Brady&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-jsl"&gt;dt = Current Data Table();

mat = dt &amp;lt;&amp;lt; Get All Columns As Matrix(); // this gets the A/B column as 1s and 2s; which is which doesn't matter.

//create subtables... note A &amp;amp; B will be reversed if B occurs first in dt; the result is unaffected.
matA = mat[loc(mat[0,1]==1), 0];
matB = mat[loc(mat[0,1]==2), 0];

//compute stddevs for each column in each subtable
stdvA = vstd(matA);
stdvB = vstd(matB);

//locate columns of sufficiently low variance
cols = loc(stdvA == 0 | stdvB == 0);

//delete them. Skip entry 1 of the cols vector; it is 1, flagging column 1 of dt (the group column, which is constant within each subtable).
try(dt &amp;lt;&amp;lt; delete columns((dt &amp;lt;&amp;lt; get column names)[cols[2::nrow(cols)]]));&lt;/CODE&gt;&lt;/PRE&gt;
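For readers following along outside JMP, the matrix logic above can be sketched in Python. This is an illustrative translation, not Brady's code: the toy data are invented, and the stdlib's pstdev stands in for JSL's vstd().

```python
from statistics import pstdev

# Toy matrix standing in for Get All Columns As Matrix(): column 0 holds the
# group codes (1s and 2s), the remaining columns are parameters. Values invented.
mat = [
    [1, 5.0, 7.0, 3.0],
    [1, 5.0, 7.1, 3.5],
    [2, 5.0, 6.9, 3.0],
    [2, 5.0, 7.2, 3.0],
]

# Split the rows into two group sub-matrices (the JSL loc(mat[0,1]==1) step).
mat_a = [row for row in mat if row[0] == 1]
mat_b = [row for row in mat if row[0] == 2]

# Per-column standard deviation within each group (the JSL vstd() step).
std_a = [pstdev(col) for col in zip(*mat_a)]
std_b = [pstdev(col) for col in zip(*mat_b)]

# Flag columns where either group has zero spread, skipping column 0,
# the group code, which is constant within each sub-matrix by construction.
flagged = [j for j in range(1, len(std_a)) if std_a[j] == 0 or std_b[j] == 0]
print(flagged)  # -> [1, 3]: column 1 is flat everywhere, column 3 is flat in group 2
```

The same one-pass structure applies: split once, compute per-column spreads, flag in a single vectorized comparison.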
&lt;P class="p1"&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 13 Oct 2017 22:47:23 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Big-Data-Automatic-Column-Elimination-due-to-Zero-Variance/m-p/45944#M26190</guid>
      <dc:creator>brady_brady</dc:creator>
      <dc:date>2017-10-13T22:47:23Z</dc:date>
    </item>
    <item>
      <title>Re: Big Data / Automatic Column Elimination due to Zero Variance</title>
      <link>https://community.jmp.com/t5/Discussions/Big-Data-Automatic-Column-Elimination-due-to-Zero-Variance/m-p/45946#M26192</link>
      <description>&lt;P&gt;Altug, I've tested my code a bit. Although it avoids explicit looping, which is good, the &amp;lt;&amp;lt; Get All Columns As Matrix() call is&amp;nbsp;simply too expensive when the table gets big, and the gains (if any) of vstd() over table operations are too minuscule to offset this.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;For a table of your size, a modification of Jim's&amp;nbsp;code (to allow for your groups) is going to be faster. Central to this&amp;nbsp;is the fact that JMP table operations are really fast. I'm not even&amp;nbsp;convinced&amp;nbsp;that the matrix ops, which are also really fast, are faster than the column functions.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Subsetting the tables initially to create two subtables--one for each group--more than pays for itself (assuming you have enough memory to house all 3 tables), as taking subsets of rows over and over is expensive.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The code below (again, a slight modification of Jim's, to allow for your two groups) ran for me&amp;nbsp;in about 2.5 seconds for 100K rows and 1K columns, whereas the routine I first submitted took over 6 seconds.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;This will still be a bit slow on a table of your size ... hopefully someone else will have a&amp;nbsp;better idea.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-jsl"&gt;Names Default To Here( 1 );

dt = Current Data Table();

dtcol1 = column(dt, 1);
dt1 =  dt &amp;lt;&amp;lt; subset(rows(dt&amp;lt;&amp;lt;get rows Where(ascolumn(dtcol1) == dtcol1[1])), invisible, selected columns(0));
dt2 =  dt &amp;lt;&amp;lt; subset(rows(dt&amp;lt;&amp;lt;get rows Where(ascolumn(dtcol1) != dtcol1[1])), invisible, selected columns(0));

// Get all continuous data columns
colList = dt &amp;lt;&amp;lt; get column names( numeric, continuous );

// Walk the column list backward, removing from the deletion list any
// column whose std dev is nonzero in BOTH subtables
For( i = N Items( colList ), i &amp;gt;= 1, i--,
	If( Col Std Dev( Column( dt1, colList[i] ) ) != 0 &amp;amp; Col Std Dev( Column( dt2, colList[i] ) ) != 0,
		colList = Remove( colList, i, 1 )
	)
);

// What remains has zero variance in at least one group; delete those columns
dt &amp;lt;&amp;lt; Delete Columns( colList );

close(dt1, nosave);
close(dt2, nosave);
&lt;/CODE&gt;&lt;/PRE&gt;
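The subset-and-loop strategy in this script translates to other languages too. Here is a minimal Python sketch of the same deletion-list logic (toy data invented for illustration; pstdev plays the role of Col Std Dev):

```python
from statistics import pstdev

# Toy stand-in for the JMP table: a dict of columns. "group" plays the role of
# column 1 in the JSL; the rest are the continuous parameter columns.
table = {
    "group": ["A", "A", "B", "B"],
    "p1": [5.0, 5.0, 5.0, 5.0],   # zero variance in both groups
    "p2": [1.0, 2.0, 1.0, 3.0],   # varies in both groups, so keep it
    "p3": [4.0, 4.5, 2.0, 2.0],   # zero variance within group B, so delete it
}

# Partition rows by whether they match the first group value (the dt1/dt2 split).
first = table["group"][0]
in_a = [g == first for g in table["group"]]

def group_std(col, mask):
    return pstdev(v for v, m in zip(col, mask) if m)

# As in the JSL loop: remove from the deletion list any column whose std dev
# is nonzero in BOTH subtables; what survives has zero spread in at least one group.
col_list = [c for c in table if c != "group"]
col_list = [c for c in col_list
            if not (group_std(table[c], in_a) != 0
                    and group_std(table[c], [not m for m in in_a]) != 0)]

for c in col_list:          # the JSL "delete columns(colList)" step
    del table[c]
print(sorted(table))        # -> ['group', 'p2']
```

Note the same inversion trick as the JSL: it is cheaper to prune the keep-worthy columns out of the deletion list than to build the deletion list directly.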
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sat, 14 Oct 2017 00:16:59 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Big-Data-Automatic-Column-Elimination-due-to-Zero-Variance/m-p/45946#M26192</guid>
      <dc:creator>brady_brady</dc:creator>
      <dc:date>2017-10-14T00:16:59Z</dc:date>
    </item>
    <item>
      <title>Re: Big Data / Automatic Column Elimination due to Zero Variance</title>
      <link>https://community.jmp.com/t5/Discussions/Big-Data-Automatic-Column-Elimination-due-to-Zero-Variance/m-p/45948#M26193</link>
      <description>&lt;P&gt;This chunk takes another tenth of a second off the previous version, by selecting once and inverting the selection rather than selecting twice. The gains grow with table size.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-jsl"&gt;dtcol1 = column(dt, 1);
dt &amp;lt;&amp;lt; select where(ascolumn(dtcol1) == dtcol1[1]);
dt1 =  dt &amp;lt;&amp;lt; subset(selected rows(1), invisible, selected columns(0));
dt &amp;lt;&amp;lt; Invert Row Selection;
dt2 =  dt &amp;lt;&amp;lt; subset(selected rows(1), invisible, selected columns(0));
&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Sat, 14 Oct 2017 04:14:04 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Big-Data-Automatic-Column-Elimination-due-to-Zero-Variance/m-p/45948#M26193</guid>
      <dc:creator>brady_brady</dc:creator>
      <dc:date>2017-10-14T04:14:04Z</dc:date>
    </item>
    <item>
      <title>Re: Big Data / Automatic Column Elimination due to Zero Variance</title>
      <link>https://community.jmp.com/t5/Discussions/Big-Data-Automatic-Column-Elimination-due-to-Zero-Variance/m-p/45977#M26219</link>
      <description>&lt;P&gt;Your understanding is correct, Brady.&amp;nbsp;The only thing to emphasize: with 250,000 rows, this becomes a 250,000 x 2000 matrix.&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sun, 15 Oct 2017 18:57:54 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Big-Data-Automatic-Column-Elimination-due-to-Zero-Variance/m-p/45977#M26219</guid>
      <dc:creator>altug_bayram</dc:creator>
      <dc:date>2017-10-15T18:57:54Z</dc:date>
    </item>
    <item>
      <title>Re: Big Data / Automatic Column Elimination due to Zero Variance</title>
      <link>https://community.jmp.com/t5/Discussions/Big-Data-Automatic-Column-Elimination-due-to-Zero-Variance/m-p/45978#M26220</link>
      <description>&lt;P&gt;Jim,&amp;nbsp;&lt;/P&gt;&lt;P&gt;thanks for your guidance here. I will try it out. thx again.&lt;/P&gt;</description>
      <pubDate>Sun, 15 Oct 2017 18:58:59 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Big-Data-Automatic-Column-Elimination-due-to-Zero-Variance/m-p/45978#M26220</guid>
      <dc:creator>altug_bayram</dc:creator>
      <dc:date>2017-10-15T18:58:59Z</dc:date>
    </item>
    <item>
      <title>Re: Big Data / Automatic Column Elimination due to Zero Variance</title>
      <link>https://community.jmp.com/t5/Discussions/Big-Data-Automatic-Column-Elimination-due-to-Zero-Variance/m-p/45980#M26221</link>
      <description>&lt;P&gt;Brady&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I tested this on a sample dataset (not the huge one yet) to understand the mechanism, and I've been able to modify it to put a threshold on the std dev rather than requiring exactly 0. My remaining issue: I do not want to delete a parameter when&lt;/P&gt;&lt;P&gt;stdev &amp;lt;= threshold (i.e., close to 0) AND the means of the parameter for classes A and B are different.&amp;nbsp;&lt;/P&gt;&lt;P&gt;Hence I need to add a test comparing the A and B means as a further constraint.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I was thinking of the Kolmogorov-Smirnov test ... in which case I can simply state&amp;nbsp;"delete the parameter if KS &amp;lt;= threshold (practically close to 0)".&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;In fact, what I am trying to do is downselect the parameters that differ for the next step. So I started thinking maybe the best thing to do is just apply KS (no need for the stdev).&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;JMP help provides&amp;nbsp;&lt;/P&gt;&lt;PRE&gt;&lt;CODE class=" language-jsl"&gt;Names Default To Here( 1 );
dt = Open( "$SAMPLE_DATA/Big Class.jmp" );
obj = Oneway( Y( :Height ), X( :sex ) );
obj &amp;lt;&amp;lt; Kolmogorov Smirnov Exact Test( 1 );&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Could this be translated to my case as (within the current loop)&lt;/P&gt;&lt;PRE&gt;&lt;CODE class=" language-jsl"&gt;obj = Oneway( Y( Column( dt1, colList[i] ) ), X( Column( dt2, colList[i] ) ) );
obj &amp;lt;&amp;lt; Kolmogorov Smirnov Exact Test( 1 );&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;But I do not know how to extract the KS value from this obj object.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;By the way, I don't think the CPU time for 2000 columns is going to bother me, as I am willing to do anything to avoid any manual touch on the data.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;thx so much.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 16 Oct 2017 01:26:57 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Big-Data-Automatic-Column-Elimination-due-to-Zero-Variance/m-p/45980#M26221</guid>
      <dc:creator>altug_bayram</dc:creator>
      <dc:date>2017-10-16T01:26:57Z</dc:date>
    </item>
    <item>
      <title>Re: Big Data / Automatic Column Elimination due to Zero Variance</title>
      <link>https://community.jmp.com/t5/Discussions/Big-Data-Automatic-Column-Elimination-due-to-Zero-Variance/m-p/45997#M26226</link>
      <description>&lt;P&gt;Exact tests can take a very long time to compute, and given you have 250,000 rows in your data table, I would bet they will not be calculated in this situation.&amp;nbsp; I recommend, instead, using the&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-jsl"&gt;Summarize YByX() &lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;command, which calculates all Fit Y By X combinations and produces a data table of p-values and LogWorth values for each Y/X combination.&amp;nbsp;You can then determine the correct columns&amp;nbsp;to investigate further.&amp;nbsp;You will need to test for zero variance in a separate step.&lt;/P&gt;</description>
      <pubDate>Mon, 16 Oct 2017 15:12:25 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Big-Data-Automatic-Column-Elimination-due-to-Zero-Variance/m-p/45997#M26226</guid>
      <dc:creator>Duane_Hayes</dc:creator>
      <dc:date>2017-10-16T15:12:25Z</dc:date>
    </item>
    <item>
      <title>Re: Big Data / Automatic Column Elimination due to Zero Variance</title>
      <link>https://community.jmp.com/t5/Discussions/Big-Data-Automatic-Column-Elimination-due-to-Zero-Variance/m-p/46031#M26247</link>
      <description>&lt;P&gt;thanks Duane.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I am trying to determine, within the script, via specific outputs and comparisons, whether a parameter differs between classes A and B. I am not trying to analyze an intermediate table of results and then make the decision.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I've got 2000 parameter columns in a given dataset, and many datasets of comparable size.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The procedure needs to auto-delete the columns found to be nearly the same between the two classes A and B for a given parameter.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I initially wrote this down as zero-variance-seeking logic, but the issue is that two parameters with zero variance but different means should not be deleted. That's when I started changing the logic to include a 2-sample test for means.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;My problem is that I am a novice at scripting and do not know how to pull a value out of the results of a given platform.&amp;nbsp;&lt;/P&gt;&lt;P&gt;My other issue is that where you specified all parameters in the script, even doing something like that could prove very time consuming for our data (maybe there is an easy enough way).&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Brady's script is actually nearly perfect, except I need to add a 2-sample test for means and know how to use its output as an additional filter.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;thanks for your help.&lt;/P&gt;</description>
      <pubDate>Tue, 17 Oct 2017 00:22:43 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Big-Data-Automatic-Column-Elimination-due-to-Zero-Variance/m-p/46031#M26247</guid>
      <dc:creator>altug_bayram</dc:creator>
      <dc:date>2017-10-17T00:22:43Z</dc:date>
    </item>
    <item>
      <title>Re: Big Data / Automatic Column Elimination due to Zero Variance</title>
      <link>https://community.jmp.com/t5/Discussions/Big-Data-Automatic-Column-Elimination-due-to-Zero-Variance/m-p/46033#M26249</link>
      <description>&lt;P&gt;&amp;lt;&amp;lt;Edited to reflect that there are ~250,000 rows, not ~2,000, which is the number of columns.&amp;gt;&amp;gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Altug, there is something I'd like to clarify: you've got ~250K rows of data in two groups, which I'm assuming (by default) are of roughly the same size. If so, a t-test with&amp;nbsp;125,000 samples in each group has power = .9 to detect a difference of&amp;nbsp;less than&amp;nbsp;1/50th sigma. Given you're only getting to this point (testing means) when one or both sigma levels is very low, is such a small&amp;nbsp;(relative to sigma) difference one you feel is of practical importance? I.e., if you had 2 processes with sigma roughly = 1 unit, would a difference in means of 0.018 unit matter to you? This is how sensitive the t-test is going to be when there are 125,000 items in each group.&lt;/P&gt;
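As a sanity check on the sensitivity claim above, a standard normal-approximation power formula can be evaluated directly. This is an editor's sketch, not Brady's calculation, and the exact constant depends on the convention used (his 0.018 figure may come from a different one):

```python
from math import sqrt
from statistics import NormalDist

# Minimal detectable difference for a two-sided two-sample test with equal
# group sizes: delta = (z_{1-alpha/2} + z_{power}) * sigma * sqrt(2/n).
z = NormalDist()
n = 125_000          # rows per group, as in the post
sigma = 1.0          # a process sigma of 1 unit, per the example
alpha, power = 0.05, 0.90

delta = (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) * sigma * sqrt(2 / n)
print(round(delta, 4))  # about 0.013 units
```

Either way, the detectable difference is indeed well under 1/50th of sigma, in line with the paragraph above.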
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If such a small difference is not important in your setting, you'd be better served by an equivalence test (in its simplest form, two simultaneous one-sided t-tests), which lets you specify how large a difference IS meaningful. Means that differ by less than this amount are considered practically equivalent. Using this approach,&amp;nbsp;a given column would be deleted as long as&amp;nbsp;its means are practically equivalent&amp;nbsp;and at least one of the group standard deviations is below the threshold.&lt;/P&gt;
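The two-one-sided-tests (TOST) idea can be sketched outside JMP as well. The Python below is a hedged illustration using the normal approximation (safe with groups of ~125,000 rows), not JMP's implementation; the function name, the delta margin, and the example numbers are all assumptions for demonstration:

```python
from math import sqrt
from statistics import NormalDist

def tost_equivalent(mean1, sd1, n1, mean2, sd2, n2, delta, alpha=0.05):
    """Two one-sided tests (TOST): declare the group means practically
    equivalent when the difference is significantly above -delta AND
    significantly below +delta. Uses the normal approximation, which is
    reasonable for very large groups."""
    se = sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    diff = mean1 - mean2
    p_lower = 1 - NormalDist().cdf((diff + delta) / se)  # H0: diff at or below -delta
    p_upper = NormalDist().cdf((diff - delta) / se)      # H0: diff at or above +delta
    return alpha > max(p_lower, p_upper)

# A 0.001-unit difference against a 0.05-unit equivalence margin: equivalent.
print(tost_equivalent(10.000, 1.0, 125_000, 10.001, 1.0, 125_000, delta=0.05))  # True
# A 0.1-unit difference against the same margin: not equivalent.
print(tost_equivalent(10.0, 1.0, 125_000, 10.1, 1.0, 125_000, delta=0.05))      # False
```

The key design point is that delta encodes the practical-importance judgment, so a tiny but statistically significant difference no longer blocks deletion.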
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Here is some information and an example:&lt;/P&gt;
&lt;P&gt;&lt;A href="http://www.jmp.com/support/help/Equivalence_Test.shtml" target="_blank"&gt;http://www.jmp.com/support/help/Equivalence_Test.shtml&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 17 Oct 2017 01:27:00 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Big-Data-Automatic-Column-Elimination-due-to-Zero-Variance/m-p/46033#M26249</guid>
      <dc:creator>brady_brady</dc:creator>
      <dc:date>2017-10-17T01:27:00Z</dc:date>
    </item>
    <item>
      <title>Re: Big Data / Automatic Column Elimination due to Zero Variance</title>
      <link>https://community.jmp.com/t5/Discussions/Big-Data-Automatic-Column-Elimination-due-to-Zero-Variance/m-p/46041#M26256</link>
      <description>&lt;P&gt;Brady,&amp;nbsp;&lt;/P&gt;&lt;P&gt;The script below, a slight modification of yours (with the std devs compared to a threshold rather than to 0), should be the starting point. As you know, I need an actual script to progress further, since the deletion must be done automatically by the script. Our internal code for the next phase (after the JMP data selection) has a particular problem when it encounters zero variance. Your prior version (i.e., below) proved that these columns can be taken out.&amp;nbsp;I am, though, trying to avoid deleting parameters that have zero variance but different means.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I cannot use a threshold that depends on the actual parameter mean &amp;amp; distribution; e.g., I cannot specify a practical difference, as that requires in-depth examination of all 2000 parameters, which is not feasible. Not sure if I mentioned it, but 2000 is just the tip of the iceberg, with more coming; a single dataset, however, contains roughly 2000 parameters.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I can use a p-value to filter parameters in or out depending on which ones cause a failure in our internal code. From that standpoint, a t-test may be the better option. However, I do not know how to insert the t-test and extract a p-value ... so I can add it as an additional filter.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I feel like I have two options:&amp;nbsp;&lt;/P&gt;&lt;P&gt;1) Use the code below as is, with no further comparison of means. It would delete all such cases even if the means differ. I then need to record the column names of those that were deleted (need help with that), so that at the end of the overall study I can turn my attention to them and analyze them in JMP to detect any difference in means. This would be a much smaller analysis, percentage-wise; I expect zero-variance cases to be a very small portion of the dataset.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;2) Or add a t-test, or any other test offering a normalized&amp;nbsp;threshold, to fine-tune the population of significant differences. E.g., I can use the p-value to confirm or reject significance.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;What do you think?&lt;/P&gt;&lt;P&gt;thanks&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;PRE&gt;&lt;CODE class=" language-jsl"&gt;Names Default To Here( 1 );

dt = Current Data Table();

dtcol1 = column(dt, 1);
dt1 = dt &amp;lt;&amp;lt; subset(rows(dt&amp;lt;&amp;lt;get rows Where(ascolumn(dtcol1) == dtcol1[1])), invisible, selected columns(0));
dt2 = dt &amp;lt;&amp;lt; subset(rows(dt&amp;lt;&amp;lt;get rows Where(ascolumn(dtcol1) != dtcol1[1])), invisible, selected columns(0));

// Get all continuous data columns
colList = dt &amp;lt;&amp;lt; get column names( numeric, continuous );

// Keep in the deletion list only the columns whose std dev is below
// threshold in at least one group
For( i = N Items( colList ), i &amp;gt;= 1, i--,
	If( Col Std Dev( Column( dt1, colList[i] ) ) &amp;gt;= 0.01 &amp;amp; Col Std Dev( Column( dt2, colList[i] ) ) &amp;gt;= 0.01,
		colList = Remove( colList, i, 1 )
	)
);

// Delete the near-zero-variance columns
dt &amp;lt;&amp;lt; delete columns(colList);

close(dt1, nosave);
close(dt2, nosave);&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;STRONG&gt;note: the thresholds on the std devs are not settled; 0.01 is just one example. I will fine-tune that.&lt;/STRONG&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 17 Oct 2017 14:22:48 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Big-Data-Automatic-Column-Elimination-due-to-Zero-Variance/m-p/46041#M26256</guid>
      <dc:creator>altug_bayram</dc:creator>
      <dc:date>2017-10-17T14:22:48Z</dc:date>
    </item>
  </channel>
</rss>

