topic Re: Summary of multiple CSV files in Discussions

Summary of multiple CSV files

briankwalsh — Thu, 01 Dec 2016 19:32:56 GMT

How do I direct JMP to a file folder on my desktop that have about 400 CSV files like a database, create a bunch of columns and create a summary table with 1 row per file? I can do this individually, but it takes forever and I was hoping I could develop a script to automate.

Re: Summary of multiple CSV files

vince_faller — Thu, 01 Dec 2016 21:14:33 GMT

So I'm making a few assumptions about this. I could be off.

All csv files have the same columns.

you want to create an aggregate summary (mean for example) of each column for each file.

Names Default to here(1);
dir = pick directory();
files = files in directory(dir);

dt = New Table("concat", New Column("File", character));
for(i=1, i<=nitems(files), i++, 

	if(endswith(files[i], "csv"), //just a check to make sure it's a csv
		//break();
		dt1 = open(dir||files[i], private);
		dt1 << New Column("File", character, set each value(files[i]));
		dt<<concatenate(dt1, append to first table);
		close(dt1, no save);
	)
);

dt_Sum = dt << Summary( //you'll have to put which statistics you want for your column names here
	Group( :File ),
	Mean( :ColumnA ),
	Mean( :ColumnB ),
	Mean( :ColumnC ),
	Mean( :ColumnD ),
	Freq( "None" ),
	Weight( "None" )
);

Re: Summary of multiple CSV files

Craige_Hales — Thu, 01 Dec 2016 22:00:42 GMT

(Ignore this answer; Vince has a much better approach.)

A variation on Vince's answer, if you want per-file statistics. I also imagine the columns are consistent...

path = "$desktop/csvFiles/";
files = Files In Directory( path );

answers = New Table( "Summary",
	New Column( "N Rows", Numeric, "Continuous", Format( "Fixed Dec", 12, 0 ), Set Values( [] ) ),
	New Column( "Mean(x)", Numeric, "Continuous", Format( "Best", 12 ), Set Values( [] ) ),
	New Column( "Mean(y)", Numeric, "Continuous", Format( "Best", 12 ), Set Values( [] ) ),
	New Column( "Mean(z)", Numeric, "Continuous", Format( "Best", 12 ), Set Values( [] ) ),
	New Column( "sourceFile", character )
);


For( iFile = 1, iFile <= N Items( files ), iFile += 1,
	file = path || files[iFile];
	dt = Open( file, "text" );
	dt2 = dt << Summary( Mean( :x ), Mean( :y ), Mean( :z ), Freq( "None" ), Weight( "None" ) );
	answers << addrows( 1 );
	answers:sourceFile = file || " " || Format( Creation Date( file ), "m/d/y h:m:s" );
	answers:N Rows = dt2:N Rows[1];
	answers:name( "Mean(x)" ) = dt2:name( "Mean(x)" )[1];
	answers:name( "Mean(y)" ) = dt2:name( "Mean(y)" )[1];
	answers:name( "Mean(z)" ) = dt2:name( "Mean(z)" )[1];
	Close( dt, nosave ); // linked table closes too
);

my files looked like this

x,y,z
1,2,3
4,5,6

I used the table script from a file I made by hand to get the Summary script. There might be a better way to copy the variables; in JMP 13 you could use the ideas here to assign a whole row in one statement.

Re: Summary of multiple CSV files

briankwalsh — Sat, 10 Dec 2016 17:48:27 GMT

Sorry for the delay. Was on service and having to take care of patients. Now just getting back to this.

No they don't always have the same column (A, B, C, etc.) but they do have the same name.

Brian

Re: Summary of multiple CSV files

vince_faller — Wed, 14 Dec 2016 00:57:54 GMT

can you give a screenshot of two fake tables and what you want the end result to look like? I can't figure out what you're asking.

Re: Summary of multiple CSV files

briankwalsh — Mon, 19 Dec 2016 23:32:59 GMT

Here is what the data looks like within a CSV file. The file name is also the patient identifier. How can I get it to summarize by column heading and create a table of the summarized data with the file identifier in a column? You can see what the file name looks like at the bottom on the screen shot.

I could convert all the CSVs to JMP files? Just not sure it would help.

Does anyone know if the Data Table Tools add-in would work for this case?

Re: Summary of multiple CSV files

vince_faller — Wed, 21 Dec 2016 17:47:01 GMT

Summarized how?

Re: Summary of multiple CSV files

ih — Wed, 21 Dec 2016 22:24:31 GMT

Here is a variation on Vince's response; this allows different column names in each file.

Names Default to here(1);
dir = pick directory();
files = files in directory(dir);

//table to hold summary data
dts = New Table("Summary", New Column("filename", character));


for(i=1, i<=nitems(files), i++, //for each filename

	if(endswith(files[i], "csv"), //if file is a csv
	
		//open the file
		dt1 = open( dir||files[i], private );
		
		//add a row to the summary table
		dts << Add Rows(1);
		
		//save the filename to the data table
		Column( dts, "filename" )[i] = files[i];
		
		//get list of columns
		cols = dt1 << Get Column Names();
		
		//for each column
		for ( c=1, c <= N Items(cols), c++,
		
			//add mean column to summary table if it isn't already there
			if ( Contains( dts << Get Column Names(); , Parse( Char( cols[c] ) || "_mean" ) ) < 1,
				dts << New Column( Char( cols[c] ) || "_mean" );
			);
			
			//find the mean and write it to summary table
			Column( dts, Char( cols[c] ) || "_mean" )[i] = ColMean( Column( dt1, cols[c] ) );
			
			//add max column to summary table if it isn't already there
			if ( Contains( dts << Get Column Names(); , Parse( Char( cols[c] ) || "_max" ) ) < 1,
				dts << New Column( Char( cols[c] ) || "_max" );
			);
			
			//find the max and write it to summary table
			Column( dts, Char( cols[c] ) || "_max" )[i] = ColMax( Column( dt1, cols[c] ) );
			
		); //end loop each column
		
		//close file
		close(dt1, no save);
		
	); //end check for csv
	
); //end loop for each file

Re: Summary of multiple CSV files

txnelson — Fri, 23 Dec 2016 00:57:16 GMT

Here is my take on the script that is needed. It is pretty simple and gets all numeric columns summarized, and brings along the Excel Sheet name as a new column

Below is edited based upon the issues found in the next reply

Names Default To Here( 1 );
dir = Pick Directory();
files = Files In Directory( dir );

dt = New Table( "concat", New Column( "File", character ) );
For( i = 1, i <= N Items( files ), i++, 

	If( Ends With( files[i], "csv" ), //just a check to make sure it's a csv
		// Open the file
		dt1 = Open( dir || files[i], private );
		//Get all of the numeric columns from the just opened file
		thecolumns = dt1 << get column names( numeric );
		// Create an output table of one row that contains the summary
		// As many stats as desired can be added to the below to get all of the
		// ones that are needed
		dt2 = dt1 << Summary( Mean( Eval( thecolumns ) ), Freq( "None" ), Weight( "None" ) );
		// Add the summary table to the summary table
		dt << concatenate( dt2, append to first table );
		
		// Since what is wanted is the ID, not the file name, and the ID is
		// the Sheeet name in the Excel spreadsheet, one has to get that value
		// the the name of the data table, since that is where JMP 13 places the
		// sheet name
		dt:File[N Rows( dt ) ] = dt1 << Get Name;
		// Close the no longer needed raw data table
		Close( dt1, no save );Close( dt2, no save );
	)
);

Re: Summary of multiple CSV files

ih — Wed, 04 Jan 2017 15:43:59 GMT

I think you meant to append the summary tables instead of the data tables?

Names Default To Here( 1 );
dir = Pick Directory();
files = Files In Directory( dir );

dt = New Table( "concat", New Column( "File", character ) );
For( i = 1, i <= N Items( files ), i++, 

	If( Ends With( files[i], "csv" ), //just a check to make sure it's a csv
		// Open the file
		dt1 = Open( dir || files[i] );
		//Get all of the numeric columns from the just opened file
		thecolumns = dt1 << get column names( numeric );
		// Create an output table of one row that contains the summary
		// As many stats as desired can be added to the below to get all of the
		// ones that are needed
		dt1 << Summary( Mean( Eval( thecolumns ) ), Freq( "None" ), Weight( "None" ), output table name("tmpdt"));
		// Add the summary table to the summary table
		dt << concatenate( Data Table("tmpdt"), append to first table );
		// Since what is wanted is the ID, not the file name, and the ID is
		// the Sheeet name in the Excel spreadsheet, one has to get that value
		// the the name of the data table, since that is where JMP 13 places the
		// sheet name
		dt:File[N Rows( dt ) ] = dt1 << Get Name;
		// Close the no longer needed raw data table
		Close( dt1, no save );
	)
);

Re: Summary of multiple CSV files

txnelson — Fri, 23 Dec 2016 00:58:27 GMT

You are correct.....thanks for catching the mistake

I modified my original entry to correct the mistakes you found