Get and Filter Files in Directory

robot

Community Trekker

Joined:

Feb 27, 2012

I have a JMP script that prompts the user for a directory and then filters for specific file types within that directory.  My problem is that the filtering step often takes longer than getting the files in the first place.  If the number of files is large (>100,000 files), the total time to collect and filter them can be more than 30 minutes.  Is there a faster or more efficient way to do this?  I am using JMP 11.  Thanks!


// Example.
Names Default To Here( 1 );
dir = Pick Directory( "Select a directory", "/C:/Program Files/SAS/" );

t1 = Tick Seconds();
files = Files In Directory( dir, Recursive );
t2 = Tick Seconds();
t_getfiles = t2 - t1;
n_getfiles = N Items( files );

t3 = Tick Seconds();
// Walk the list backward so removals do not shift the indices still to be visited.
For( i = N Items( files ), i >= 1, i--,
  If( !Ends With( files[i], ".jmp" ),
    Remove From( files, i )
  )
);
t4 = Tick Seconds();
t_filterfiles = t4 - t3;
n_filterfiles = N Items( files );

Show( t_getfiles, n_getfiles, t_filterfiles, n_filterfiles );

1 ACCEPTED SOLUTION

Yes, rework the filtering loop to remove the N^2 behavior.  JSL { lists } access elements by starting at the front.  You started at the back to prevent the deleted elements from messing up the indexing.  That leads to pretty much the worst case time behavior for manipulating the list, walking i elements to reach the i'th element.  Here's a reworked version that removes the front-most element from the list of files, checks it, and inserts it as the front-most element of the filtered list.


dir = "c:\";

t1 = Tick Seconds();
files = Files In Directory( dir, Recursive );
t2 = Tick Seconds();
t_getfiles = t2 - t1;
n_getfiles = N Items( files );

filteredFiles = {};
t3 = Tick Seconds();
// Pop the front-most name until the list is empty; keep only the .jmp names.
While( (testname = Remove From( files, 1 )) != {},
  If( Ends With( testname[1], ".jmp" ),
    Insert Into( filteredFiles, testname, 1 )
  )
);
t4 = Tick Seconds();
t_filterfiles = t4 - t3;
n_filterfiles = N Items( filteredFiles );

Show( t_getfiles, n_getfiles, t_filterfiles, n_filterfiles );


t_getfiles = 4.0333333333333;

n_getfiles = 171630;

t_filterfiles = 258.5;

n_filterfiles = 975;

About 5 minutes.
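
For a rough sense of scale, the N^2 behavior of the original back-to-front loop can be estimated as follows. This is only a sketch: it assumes, as described above, that reaching the i'th element of a list costs about i steps, and it counts only the element accesses, not the removals. N is the file count reported above.

```latex
\sum_{i=1}^{N} i \;=\; \frac{N(N+1)}{2},
\qquad
N = 171630 \;\Rightarrow\; \frac{171630 \times 171631}{2} = 14{,}728{,}514{,}265 \approx 1.5 \times 10^{10}\ \text{steps}
```

Under that assumption, doubling the number of files roughly quadruples the filtering time, which fits the observation that filtering can take longer than collecting the files.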

Another idea:  you can load a data table like this:


dir = "c:\";

t1 = Tick Seconds();
files = Files In Directory( dir, Recursive );
t2 = Tick Seconds();
t_getfiles = t2 - t1;
n_getfiles = N Items( files );

t3 = Tick Seconds();
// One row per file, plus a formula column that flags names ending in ".jmp".
dt = New Table( "directory",
  Add Rows( 0 ),
  New Column( "filename", Character, Nominal, Set Values( files ) ),
  New Column( "isTable",
    Numeric,
    Continuous,
    Format( "Best", 12 ),
    Formula( Ends With( :filename, ".jmp" ) )
  )
);
dt << Run Formulas;
dt << Select Where( isTable == 1 );
dtFiltered = dt << Subset( Selected Rows( 1 ) );
t4 = Tick Seconds();
t_filterfiles = t4 - t3;
n_filterfiles = N Rows( dtFiltered );

Show( t_getfiles, n_getfiles, t_filterfiles, n_filterfiles );


t_getfiles = 4.01666666666642;

n_getfiles = 171635;

t_filterfiles = 0.716666666666697;

n_filterfiles = 975;

About 5 seconds.  The << Run Formulas message is required: without it, the data table may still be evaluating the formula for the isTable column when Select Where runs, so nothing is selected and the subset comes back empty.
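
If the rest of a script needs the filtered names as a list rather than a table, one possible follow-up is sketched below. It is an illustration, not part of the timings above: the four-item sample list is made up to stand in for the result of Files In Directory, and the condition is passed straight to Select Where instead of going through a formula column.

```jsl
Names Default To Here( 1 );

// Hypothetical sample data standing in for Files In Directory( dir, Recursive ).
files = {"a.jmp", "b.txt", "sub/c.jmp", "d.jsl"};

// Load the names into a one-column table.
dt = New Table( "directory",
	New Column( "filename", Character, Nominal, Set Values( files ) )
);

// Select the matching rows directly; no formula column is involved here.
dt << Select Where( Ends With( :filename, ".jmp" ) );
dtFiltered = dt << Subset( Selected Rows( 1 ) );

// For a character column, Get Values returns a list of strings.
filteredFiles = Column( dtFiltered, "filename" ) << Get Values;

// Close the helper tables without saving them.
Close( dtFiltered, NoSave );
Close( dt, NoSave );

Show( N Items( filteredFiles ), filteredFiles );
```

This keeps the row-wise work inside the data table and hands back an ordinary list at the end; whether it matches the speed of the formula-column version on a large directory would need to be timed.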

Craige
3 REPLIES
Craige_Hales

Staff

Joined:

Mar 21, 2013

The first example is much faster in JMP 12 (similar speed to the data table example); it appears to still be showing some N^2 behavior in JMP 11.

Craige
robot

Community Trekker

Joined:

Feb 27, 2012

Thanks Craige, that works great!