cancel
Showing results for 
Show  only  | Search instead for 
Did you mean: 
Browse apps to extend the software in the new JMP Marketplace
Choose Language Hide Translation Bar
briankwalsh
Level I

Summary of multiple CSV files

How do I direct JMP to a file folder on my desktop that have about 400 CSV files like a database, create a bunch of columns and create a summary table with 1 row per file? I can do this individually, but it takes forever and I was hoping I could develop a script to automate. 

1 ACCEPTED SOLUTION

Accepted Solutions
txnelson
Super User

Re: Summary of multiple CSV files

Here is my take on the script that is needed.  It is pretty simple and gets all numeric columns summarized, and brings along the Excel Sheet name as a new column

 

Below is edited based upon the issues found in the next reply

Names Default To Here( 1 );
dir = Pick Directory();
files = Files In Directory( dir );

dt = New Table( "concat", New Column( "File", character ) );
For( i = 1, i <= N Items( files ), i++, 

	If( Ends With( files[i], "csv" ), //just a check to make sure it's a csv
		// Open the file
		dt1 = Open( dir || files[i], private );
		//Get all of the numeric columns from the just opened file
		thecolumns = dt1 << get column names( numeric );
		// Create an output table of one row that contains the summary
		// As many stats as desired can be added to the below to get all of the
		// ones that are needed
		dt2 = dt1 << Summary( Mean( Eval( thecolumns ) ), Freq( "None" ), Weight( "None" ) );
		// Add the summary table to the summary table
		dt << concatenate( dt2, append to first table );
		
		// Since what is wanted is the ID, not the file name, and the ID is
		// the Sheeet name in the Excel spreadsheet, one has to get that value
		// the the name of the data table, since that is where JMP 13 places the
		// sheet name
		dt:File[N Rows( dt ) ] = dt1 << Get Name;
		// Close the no longer needed raw data table
		Close( dt1, no save );Close( dt2, no save );
	)
);

 

Jim

View solution in original post

10 REPLIES 10
vince_faller
Super User (Alumni)

Re: Summary of multiple CSV files

So I'm making a few assumptions about this. I could be off.

 

All csv files have the same columns. 

you want to create an aggregate summary (mean for example) of each column for each file.  

 

Names Default to here(1);
dir = pick directory();
files = files in directory(dir);

dt = New Table("concat", New Column("File", character));
for(i=1, i<=nitems(files), i++, 

	if(endswith(files[i], "csv"), //just a check to make sure it's a csv
		//break();
		dt1 = open(dir||files[i], private);
		dt1 << New Column("File", character, set each value(files[i]));
		dt<<concatenate(dt1, append to first table);
		close(dt1, no save);
	)
);

dt_Sum = dt << Summary( //you'll have to put which statistics you want for your column names here
	Group( :File ),
	Mean( :ColumnA ),
	Mean( :ColumnB ),
	Mean( :ColumnC ),
	Mean( :ColumnD ),
	Freq( "None" ),
	Weight( "None" )
);

 

Vince Faller - Predictum
Craige_Hales
Super User

Re: Summary of multiple CSV files

(Ignore this answer; Vince has a much better approach.)

 

A variation on Vince's answer, if you want per-file statistics.  I also imagine the columns are consistent...

 

path = "$desktop/csvFiles/";
files = Files In Directory( path );

answers = New Table( "Summary",
	New Column( "N Rows", Numeric, "Continuous", Format( "Fixed Dec", 12, 0 ), Set Values( [] ) ),
	New Column( "Mean(x)", Numeric, "Continuous", Format( "Best", 12 ), Set Values( [] ) ),
	New Column( "Mean(y)", Numeric, "Continuous", Format( "Best", 12 ), Set Values( [] ) ),
	New Column( "Mean(z)", Numeric, "Continuous", Format( "Best", 12 ), Set Values( [] ) ),
	New Column( "sourceFile", character )
);


For( iFile = 1, iFile <= N Items( files ), iFile += 1,
	file = path || files[iFile];
	dt = Open( file, "text" );
	dt2 = dt << Summary( Mean( :x ), Mean( :y ), Mean( :z ), Freq( "None" ), Weight( "None" ) );
	answers << addrows( 1 );
	answers:sourceFile = file || " " || Format( Creation Date( file ), "m/d/y h:m:s" );
	answers:N Rows = dt2:N Rows[1];
	answers:name( "Mean(x)" ) = dt2:name( "Mean(x)" )[1];
	answers:name( "Mean(y)" ) = dt2:name( "Mean(y)" )[1];
	answers:name( "Mean(z)" ) = dt2:name( "Mean(z)" )[1];
	Close( dt, nosave ); // linked table closes too
);

my files looked like this

x,y,z
1,2,3
4,5,6

 

I used the table script from a file I made by hand to get the Summary script.  There might be a better way to copy the variables; in JMP 13 you could use the ideas here to assign a whole row in one statement.

 

Craige
briankwalsh
Level I

Re: Summary of multiple CSV files

Sorry for the delay. Was on service and having to take care of patients. Now just getting back to this. 

 

No they don't always have the same column (A, B, C, etc.) but they do have the same name. 

 

Brian

vince_faller
Super User (Alumni)

Re: Summary of multiple CSV files

can you give a screenshot of two fake tables and what you want the end result to look like?  I can't figure out what you're asking.  

Vince Faller - Predictum
briankwalsh
Level I

Re: Summary of multiple CSV files

Here is what the data looks like within a CSV file. The file name is also the patient identifier. How can I get it to summarize by column heading and create a table of the summarized data with the file identifier in a column? You can see what the file name looks like at the bottom on the screen shot. 

 

I could convert all the CSVs to JMP files? Just not sure it would help. 

 

Does anyone know if the Data Table Tools add-in would work for this case?

 

vince_faller
Super User (Alumni)

Re: Summary of multiple CSV files

Summarized how?  

Vince Faller - Predictum
ih
Super User (Alumni) ih
Super User (Alumni)

Re: Summary of multiple CSV files

Here is a variation on Vince's response; this allows different column names in each file.

 

Names Default to here(1);
dir = pick directory();
files = files in directory(dir);

//table to hold summary data
dts = New Table("Summary", New Column("filename", character));


for(i=1, i<=nitems(files), i++, //for each filename

	if(endswith(files[i], "csv"), //if file is a csv
	
		//open the file
		dt1 = open( dir||files[i], private );
		
		//add a row to the summary table
		dts << Add Rows(1);
		
		//save the filename to the data table
		Column( dts, "filename" )[i] = files[i];
		
		//get list of columns
		cols = dt1 << Get Column Names();
		
		//for each column
		for ( c=1, c <= N Items(cols), c++,
		
			//add mean column to summary table if it isn't already there
			if ( Contains( dts << Get Column Names(); , Parse( Char( cols[c] ) || "_mean" ) ) < 1,
				dts << New Column( Char( cols[c] ) || "_mean" );
			);
			
			//find the mean and write it to summary table
			Column( dts, Char( cols[c] ) || "_mean" )[i] = ColMean( Column( dt1, cols[c] ) );
			
			//add max column to summary table if it isn't already there
			if ( Contains( dts << Get Column Names(); , Parse( Char( cols[c] ) || "_max" ) ) < 1,
				dts << New Column( Char( cols[c] ) || "_max" );
			);
			
			//find the max and write it to summary table
			Column( dts, Char( cols[c] ) || "_max" )[i] = ColMax( Column( dt1, cols[c] ) );
			
		); //end loop each column
		
		//close file
		close(dt1, no save);
		
	); //end check for csv
	
); //end loop for each file
txnelson
Super User

Re: Summary of multiple CSV files

Here is my take on the script that is needed.  It is pretty simple and gets all numeric columns summarized, and brings along the Excel Sheet name as a new column

 

Below is edited based upon the issues found in the next reply

Names Default To Here( 1 );
dir = Pick Directory();
files = Files In Directory( dir );

dt = New Table( "concat", New Column( "File", character ) );
For( i = 1, i <= N Items( files ), i++, 

	If( Ends With( files[i], "csv" ), //just a check to make sure it's a csv
		// Open the file
		dt1 = Open( dir || files[i], private );
		//Get all of the numeric columns from the just opened file
		thecolumns = dt1 << get column names( numeric );
		// Create an output table of one row that contains the summary
		// As many stats as desired can be added to the below to get all of the
		// ones that are needed
		dt2 = dt1 << Summary( Mean( Eval( thecolumns ) ), Freq( "None" ), Weight( "None" ) );
		// Add the summary table to the summary table
		dt << concatenate( dt2, append to first table );
		
		// Since what is wanted is the ID, not the file name, and the ID is
		// the Sheeet name in the Excel spreadsheet, one has to get that value
		// the the name of the data table, since that is where JMP 13 places the
		// sheet name
		dt:File[N Rows( dt ) ] = dt1 << Get Name;
		// Close the no longer needed raw data table
		Close( dt1, no save );Close( dt2, no save );
	)
);

 

Jim
ih
Super User (Alumni) ih
Super User (Alumni)

Re: Summary of multiple CSV files

I think you meant to append the summary tables instead of the data tables?

 

Names Default To Here( 1 );
dir = Pick Directory();
files = Files In Directory( dir );

dt = New Table( "concat", New Column( "File", character ) );
For( i = 1, i <= N Items( files ), i++, 

	If( Ends With( files[i], "csv" ), //just a check to make sure it's a csv
		// Open the file
		dt1 = Open( dir || files[i] );
		//Get all of the numeric columns from the just opened file
		thecolumns = dt1 << get column names( numeric );
		// Create an output table of one row that contains the summary
		// As many stats as desired can be added to the below to get all of the
		// ones that are needed
		dt1 << Summary( Mean( Eval( thecolumns ) ), Freq( "None" ), Weight( "None" ), output table name("tmpdt"));
		// Add the summary table to the summary table
		dt << concatenate( Data Table("tmpdt"), append to first table );
		// Since what is wanted is the ID, not the file name, and the ID is
		// the Sheeet name in the Excel spreadsheet, one has to get that value
		// the the name of the data table, since that is where JMP 13 places the
		// sheet name
		dt:File[N Rows( dt ) ] = dt1 << Get Name;
		// Close the no longer needed raw data table
		Close( dt1, no save );
	)
);