altug_bayram
Level IV

Big Data / Automatic Column Elimination due to Zero Variance

Hello 

 

I am analyzing a very large data set with at least around 2000 parameters in columns and about 250k rows of samples. My problem is with the columns, not the rows.

 

Every sample has a discrete classification with two values, say class A and class B.

 

I analyze the data further in different software, which fails if it detects parameters that have zero (or very nearly zero) variance.

I do not want to do this work manually (as you can imagine with 2000 columns).

I am looking for a way, as automated as possible, to delete columns that have zero variance (or variance below a defined, very small threshold) within either class A or class B. I can go through some steps to complete the work, but I don't want to select any column manually for deletion.

 

Any help will be appreciated. Thanks in advance.

11 REPLIES
altug_bayram
Level IV

Re: Big Data / Automatic Column Elimination due to Zero Variance

Brady, 

The script below is a slight modification of yours (the stdev is now compared to a threshold) and should be the starting point. As you know, I need an actual script to make further progress, since the deletion has to be done automatically by the script. Our internal code for the next phase, which runs after the JMP data selection, has trouble specifically when zero variance is encountered. Your prior version (i.e. the one below) proved that these columns can be taken out. However, I am trying to avoid deleting parameters that have zero variance but different means.

 

I cannot use a threshold that depends on the actual mean and distribution of each parameter. E.g., I cannot specify a practical difference, as this would require an in-depth examination of all 2000 parameters, which is not feasible. Not sure if I mentioned it, but 2000 is just the tip of the iceberg; more are coming. A single dataset, however, contains roughly 2000 parameters.

 

I can use a p-value to filter parameters in or out, depending on which ones cause a failure in our internal code. From that standpoint, I think a t-test may be a better option. However, I do not know how to add the t-test to the script and extract a p-value so that I can use it as an additional filter.

 

I feel like I have two options ... 

1) Either I use the code below as is, with no further 2-sample means comparison. It would delete all such cases even if the means are different. I would then need to record the names of the deleted columns (need help w/ that), so that at the end of the overall study I can turn my attention to them and analyze them in JMP to detect a difference in means, if any. This would be a much smaller set of parameters to analyze; percentage-wise, I expect zero-variance cases to be a very small portion of the dataset.

 

2) Or I add a t-test, or any other test that offers a normalized threshold, to fine-tune the population of significant differences. E.g., I can use a p-value to confirm or reject significance.

 

What do you think?

Thanks

 

----------------------------------------------------------------------------------------------------------------- 

Names Default To Here( 1 );

dt = current datatable();

dtcol1 = column(dt, 1);
dt1 = dt << subset(rows(dt<<get rows Where(ascolumn(dtcol1) == dtcol1[1])), invisible, selected columns(0));
dt2 = dt << subset(rows(dt<<get rows Where(ascolumn(dtcol1) != dtcol1[1])), invisible, selected columns(0));

// Get all continuous data columns
colList = dt << get column names( numeric, continuous );

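// Loop backwards over the columns and drop from the list every column whose
// std dev is at least 0.01 in both groups; what remains in colList has
// (near) zero variance in at least one group
For( i = N Items( colList ), i >= 1, i--,
	If( Col Std Dev( Column( dt1, colList[i] ) ) >= 0.01 & Col Std Dev( Column( dt2, colList[i] ) ) >= 0.01,
		colList = Remove( colList, i, 1 )
	)
);

// Delete the columns with (near) zero variance in at least one group
dt << delete columns( colList );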

close(dt1, nosave);
close(dt2, nosave);

 

Note: the thresholds on the stdevs are not settled yet; 0.01 is just one example. I will fine-tune that.
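
For option 1, one way to keep a record of what gets deleted is to copy the names into a small table just before the delete step. This is only a minimal sketch and has not been tried on a table of this size. It assumes the class labels sit in column 1 as a character column (a numeric continuous class column would itself land in the delete list). The table name "Deleted Columns", the variable names threshold and deletedCols, and the 0.01 value are placeholders. It reuses the select/invert subsetting posted elsewhere in this thread.

```jsl
Names Default To Here( 1 );
dt = Current Data Table();
threshold = 0.01; // assumed example value, same as in the script above

// Assumption: column 1 holds the class labels (A/B) as a character column.
dtcol1 = Column( dt, 1 );
dt << Select Where( As Column( dtcol1 ) == dtcol1[1] );
dt1 = dt << Subset( Selected Rows( 1 ), Invisible, Selected Columns( 0 ) );
dt << Invert Row Selection;
dt2 = dt << Subset( Selected Rows( 1 ), Invisible, Selected Columns( 0 ) );
dt << Clear Select;

// Ask for the names as strings so they can be stored in a character column.
colList = dt << Get Column Names( Numeric, Continuous, String );

// Drop from the list every column that varies enough in both classes;
// what remains is below the threshold in at least one class.
For( i = N Items( colList ), i >= 1, i--,
	If( Col Std Dev( Column( dt1, colList[i] ) ) >= threshold & Col Std Dev( Column( dt2, colList[i] ) ) >= threshold,
		colList = Remove( colList, i, 1 )
	)
);

// Record the names first, then delete.
deletedCols = colList;
If( N Items( deletedCols ) > 0,
	New Table( "Deleted Columns",
		New Column( "Column Name", Character, Set Values( deletedCols ) )
	);
	dt << Delete Columns( deletedCols );
);

Close( dt1, NoSave );
Close( dt2, NoSave );
```

The "Deleted Columns" table can be saved next to each dataset, so the dropped parameters can be revisited in JMP at the end of the study.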

Re: Big Data / Automatic Column Elimination due to Zero Variance

And this chunk takes another tenth of a second off compared with the previous one, because it inverts the row selection instead of selecting twice. The gains are bigger on bigger tables.

 

dtcol1 = column(dt, 1);
dt << select where(ascolumn(dtcol1) == dtcol1[1]);
dt1 =  dt << subset(selected rows(1), invisible, selected columns(0));
dt << Invert Row Selection;
dt2 =  dt << subset(selected rows(1), invisible, selected columns(0));
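
To check the timing on your own table, the chunk can be wrapped in a simple timer. This is a rough sketch that assumes the class column is column 1. The names start and elapsed are only illustrative, and the numbers will vary with machine and table size.

```jsl
Names Default To Here( 1 );
dt = Current Data Table();
dtcol1 = Column( dt, 1 );

start = Tick Seconds(); // time stamp before subsetting

dt << Select Where( As Column( dtcol1 ) == dtcol1[1] );
dt1 = dt << Subset( Selected Rows( 1 ), Invisible, Selected Columns( 0 ) );
dt << Invert Row Selection;
dt2 = dt << Subset( Selected Rows( 1 ), Invisible, Selected Columns( 0 ) );

elapsed = Tick Seconds() - start; // seconds spent building both subsets
Show( elapsed );

// Tidy up: leave dt with no rows selected and close the hidden subsets.
dt << Clear Select;
Close( dt1, NoSave );
Close( dt2, NoSave );
```

The chunk itself leaves rows selected in dt, so clearing the selection afterwards keeps later steps from accidentally working on one group only.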
