I have several folders, and each folder contains several CSV files of around 500,000 rows each. I use a JSL script to concatenate all the CSV files inside each folder into one JMP table. This is repeated for each folder, so in the end there are as many concatenated JMP tables as there are folders. But it takes a lot of time.
I searched for parallel programming in JSL and came across Parallel Assign. I would appreciate some guidance on how to put the script below inside Parallel Assign to speed up the concatenation.
path = Munger( Pick Directory( "Browse to the directory of the .txt / .csv files" ), 1, "/", "" );
Print( Dir List( path ) );
folderlist = Dir List( path );
count = N Items( folderlist );
For( j2 = 1, j2 <= count, j2++, // outer loop: one pass per folder
	folderpath = path || folderlist[j2] || "/";
	Print( folderpath );
	prefilelist = Files In Directory( folderpath );
	n2 = N Items( prefilelist );
	filelist = {};
	// filter out any non-txt / non-csv files
	For( i2 = 1, i2 <= n2, i2++,
		file = prefilelist[i2];
		If( Item( 2, prefilelist[i2], "." ) == "txt" | Item( 2, prefilelist[i2], "." ) == "csv",
			Insert Into( filelist, file ),
			Show( file ) // report anything skipped
		)
	);
	nf = N Items( filelist ); // number of items in the working list
	cctable = New Table( "Combined data table" ); // make an empty table
	cctable << New Column( "Source", Character, Nominal );
	For( iii = 1, iii <= nf, iii++, // inner loop: one pass per file
		filenow = filelist[iii];
		fileopen = folderpath || filenow;
		dt = Open( fileopen, importset, Private ); // importset holds the import settings used by Open()
		dt << New Column( "Source", Character, Nominal ); // send to dt explicitly rather than relying on the current table
		:Source << Set Each Value( filenow );
		dt << Run Formulas();
		// add the current table to the bottom of the combined data table
		cctable << Concatenate( Data Table( dt ), Append to first table );
		// don't use the "Create Source Column" argument
		Close( dt, No Save ); // after concatenating the table, close it and move on
	); // end of the inner loop
);
Take a look at File > Multiple File Import (MFI). It can dig through nested directories, select files by name patterns, and concatenate similar files. It is usually faster than Open() on a CSV, even for a single file. It also generates a script that you can modify and reuse, and it can add a column with the source file information.
Start interactively, look for the checkbox to keep the window open, and do a couple of experiments.
The GUI has four filters that choose the files selected. The folder/hidden/recursive filter is always visible, and there are three checkboxes to show and enable the filename, file time, and file size filters. The file list shows what is selected; it is not for selecting files.
Worst case, you can run MFI on each directory, one at a time, and probably get a nice speed-up. Best case, it might do what you want on the entire nest of directories.
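For reference, a scripted MFI run looks something like the sketch below. The folder path is made up, and the exact message names can vary by JMP version, so treat this as a shape to recognize, not something to type in blind; the reliable approach is to run MFI interactively and let it write the script for you, then edit the folder in a loop.

Multiple File Import(
	<<Set Folder( "C:/Data/Folder1/" ), // hypothetical path; substitute your own
	<<Set Name Filter( "*.csv;*.txt" ),
	<<Set Name Enable( 1 ),
	<<Set Add File Name Column( 1 ), // replaces the hand-built "Source" column
	<<Set Import Mode( "CSVData" ),
	<<Set Stack Mode( "Stack Similar" ) // concatenate similar files into one table
) << Import Data;

To reproduce the one-table-per-folder result, you could wrap a generated script like this in the same For loop over Dir List( path ) that the original script uses, swapping the folder into <<Set Folder each time.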
Parallel Assign is designed to fill in a matrix AND to run the JSL that fills each element in an isolated environment that prevents interference between threads. It deliberately avoids things like accessing a data table from two different threads at the same time. If you find a way to do it anyway, it will probably crash.
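To make the point concrete: Parallel Assign is for element-wise numeric work on a pre-allocated matrix, roughly like the sketch below (check the Scripting Index in your JMP version for the exact form — the argument shapes here are from memory). Nothing in the body may touch a data table, which is why it cannot parallelize a concatenation job like yours.

b = J( 1000, 1000, 0 ); // pre-allocate the target matrix
// first argument: thread-local constants; second: target[iterators] = expression
Parallel Assign( {a = 3}, b[r, c] = r + c * a );

Each element is computed independently on worker threads, so the body must be a pure function of the iterators and the declared constants; Open(), Concatenate, and other table messages do not belong inside it.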
Thank you so much. I was able to incorporate Multiple File Import into my script, and now it concatenates the tables a lot faster.