Solved: Summary of multiple CSV files

Report Inappropriate Content · Dec 1, 2016 02:32 PM

How do I direct JMP to a file folder on my desktop that have about 400 CSV files like a database, create a bunch of columns and create a summary table with 1 row per file? I can do this individually, but it takes forever and I was hoping I could develop a script to automate.

txnelson · Dec 22, 2016 4:57 PM

Here is my take on the script that is needed. It is pretty simple and gets all numeric columns summarized, and brings along the Excel Sheet name as a new column

Below is edited based upon the issues found in the next reply

Names Default To Here( 1 );
dir = Pick Directory();
files = Files In Directory( dir );

dt = New Table( "concat", New Column( "File", character ) );
For( i = 1, i <= N Items( files ), i++, 

	If( Ends With( files[i], "csv" ), //just a check to make sure it's a csv
		// Open the file
		dt1 = Open( dir || files[i], private );
		//Get all of the numeric columns from the just opened file
		thecolumns = dt1 << get column names( numeric );
		// Create an output table of one row that contains the summary
		// As many stats as desired can be added to the below to get all of the
		// ones that are needed
		dt2 = dt1 << Summary( Mean( Eval( thecolumns ) ), Freq( "None" ), Weight( "None" ) );
		// Add the summary table to the summary table
		dt << concatenate( dt2, append to first table );
		
		// Since what is wanted is the ID, not the file name, and the ID is
		// the Sheeet name in the Excel spreadsheet, one has to get that value
		// the the name of the data table, since that is where JMP 13 places the
		// sheet name
		dt:File[N Rows( dt ) ] = dt1 << Get Name;
		// Close the no longer needed raw data table
		Close( dt1, no save );Close( dt2, no save );
	)
);

Jim

View solution in original post

vince_faller · Dec 1, 2016 04:14 PM

So I'm making a few assumptions about this. I could be off.

All csv files have the same columns.

you want to create an aggregate summary (mean for example) of each column for each file.

Names Default to here(1);
dir = pick directory();
files = files in directory(dir);

dt = New Table("concat", New Column("File", character));
for(i=1, i<=nitems(files), i++, 

	if(endswith(files[i], "csv"), //just a check to make sure it's a csv
		//break();
		dt1 = open(dir||files[i], private);
		dt1 << New Column("File", character, set each value(files[i]));
		dt<<concatenate(dt1, append to first table);
		close(dt1, no save);
	)
);

dt_Sum = dt << Summary( //you'll have to put which statistics you want for your column names here
	Group( :File ),
	Mean( :ColumnA ),
	Mean( :ColumnB ),
	Mean( :ColumnC ),
	Mean( :ColumnD ),
	Freq( "None" ),
	Weight( "None" )
);

Vince Faller - Predictum

Craige_Hales · Dec 1, 2016 2:00 PM

(Ignore this answer; Vince has a much better approach.)

A variation on Vince's answer, if you want per-file statistics. I also imagine the columns are consistent...

path = "$desktop/csvFiles/";
files = Files In Directory( path );

answers = New Table( "Summary",
	New Column( "N Rows", Numeric, "Continuous", Format( "Fixed Dec", 12, 0 ), Set Values( [] ) ),
	New Column( "Mean(x)", Numeric, "Continuous", Format( "Best", 12 ), Set Values( [] ) ),
	New Column( "Mean(y)", Numeric, "Continuous", Format( "Best", 12 ), Set Values( [] ) ),
	New Column( "Mean(z)", Numeric, "Continuous", Format( "Best", 12 ), Set Values( [] ) ),
	New Column( "sourceFile", character )
);


For( iFile = 1, iFile <= N Items( files ), iFile += 1,
	file = path || files[iFile];
	dt = Open( file, "text" );
	dt2 = dt << Summary( Mean( :x ), Mean( :y ), Mean( :z ), Freq( "None" ), Weight( "None" ) );
	answers << addrows( 1 );
	answers:sourceFile = file || " " || Format( Creation Date( file ), "m/d/y h:m:s" );
	answers:N Rows = dt2:N Rows[1];
	answers:name( "Mean(x)" ) = dt2:name( "Mean(x)" )[1];
	answers:name( "Mean(y)" ) = dt2:name( "Mean(y)" )[1];
	answers:name( "Mean(z)" ) = dt2:name( "Mean(z)" )[1];
	Close( dt, nosave ); // linked table closes too
);

my files looked like this

x,y,z
1,2,3
4,5,6

I used the table script from a file I made by hand to get the Summary script. There might be a better way to copy the variables; in JMP 13 you could use the ideas here to assign a whole row in one statement.

Craige

briankwalsh · Dec 10, 2016 12:48 PM

Sorry for the delay. Was on service and having to take care of patients. Now just getting back to this.

No they don't always have the same column (A, B, C, etc.) but they do have the same name.

Brian

vince_faller · Dec 13, 2016 07:57 PM

can you give a screenshot of two fake tables and what you want the end result to look like? I can't figure out what you're asking.

Vince Faller - Predictum

briankwalsh · Dec 19, 2016 06:32 PM

Here is what the data looks like within a CSV file. The file name is also the patient identifier. How can I get it to summarize by column heading and create a table of the summarized data with the file identifier in a column? You can see what the file name looks like at the bottom on the screen shot.

I could convert all the CSVs to JMP files? Just not sure it would help.

Does anyone know if the Data Table Tools add-in would work for this case?

vince_faller · Dec 21, 2016 12:47 PM

Summarized how?

Vince Faller - Predictum

ih · Dec 21, 2016 05:24 PM

Here is a variation on Vince's response; this allows different column names in each file.

Names Default to here(1);
dir = pick directory();
files = files in directory(dir);

//table to hold summary data
dts = New Table("Summary", New Column("filename", character));


for(i=1, i<=nitems(files), i++, //for each filename

	if(endswith(files[i], "csv"), //if file is a csv
	
		//open the file
		dt1 = open( dir||files[i], private );
		
		//add a row to the summary table
		dts << Add Rows(1);
		
		//save the filename to the data table
		Column( dts, "filename" )[i] = files[i];
		
		//get list of columns
		cols = dt1 << Get Column Names();
		
		//for each column
		for ( c=1, c <= N Items(cols), c++,
		
			//add mean column to summary table if it isn't already there
			if ( Contains( dts << Get Column Names(); , Parse( Char( cols[c] ) || "_mean" ) ) < 1,
				dts << New Column( Char( cols[c] ) || "_mean" );
			);
			
			//find the mean and write it to summary table
			Column( dts, Char( cols[c] ) || "_mean" )[i] = ColMean( Column( dt1, cols[c] ) );
			
			//add max column to summary table if it isn't already there
			if ( Contains( dts << Get Column Names(); , Parse( Char( cols[c] ) || "_max" ) ) < 1,
				dts << New Column( Char( cols[c] ) || "_max" );
			);
			
			//find the max and write it to summary table
			Column( dts, Char( cols[c] ) || "_max" )[i] = ColMax( Column( dt1, cols[c] ) );
			
		); //end loop each column
		
		//close file
		close(dt1, no save);
		
	); //end check for csv
	
); //end loop for each file

txnelson · Dec 22, 2016 4:57 PM

Here is my take on the script that is needed. It is pretty simple and gets all numeric columns summarized, and brings along the Excel Sheet name as a new column

Below is edited based upon the issues found in the next reply

Names Default To Here( 1 );
dir = Pick Directory();
files = Files In Directory( dir );

dt = New Table( "concat", New Column( "File", character ) );
For( i = 1, i <= N Items( files ), i++, 

	If( Ends With( files[i], "csv" ), //just a check to make sure it's a csv
		// Open the file
		dt1 = Open( dir || files[i], private );
		//Get all of the numeric columns from the just opened file
		thecolumns = dt1 << get column names( numeric );
		// Create an output table of one row that contains the summary
		// As many stats as desired can be added to the below to get all of the
		// ones that are needed
		dt2 = dt1 << Summary( Mean( Eval( thecolumns ) ), Freq( "None" ), Weight( "None" ) );
		// Add the summary table to the summary table
		dt << concatenate( dt2, append to first table );
		
		// Since what is wanted is the ID, not the file name, and the ID is
		// the Sheeet name in the Excel spreadsheet, one has to get that value
		// the the name of the data table, since that is where JMP 13 places the
		// sheet name
		dt:File[N Rows( dt ) ] = dt1 << Get Name;
		// Close the no longer needed raw data table
		Close( dt1, no save );Close( dt2, no save );
	)
);

Jim

ih · Jan 4, 2017 7:43 AM

I think you meant to append the summary tables instead of the data tables?

Names Default To Here( 1 );
dir = Pick Directory();
files = Files In Directory( dir );

dt = New Table( "concat", New Column( "File", character ) );
For( i = 1, i <= N Items( files ), i++, 

	If( Ends With( files[i], "csv" ), //just a check to make sure it's a csv
		// Open the file
		dt1 = Open( dir || files[i] );
		//Get all of the numeric columns from the just opened file
		thecolumns = dt1 << get column names( numeric );
		// Create an output table of one row that contains the summary
		// As many stats as desired can be added to the below to get all of the
		// ones that are needed
		dt1 << Summary( Mean( Eval( thecolumns ) ), Freq( "None" ), Weight( "None" ), output table name("tmpdt"));
		// Add the summary table to the summary table
		dt << concatenate( Data Table("tmpdt"), append to first table );
		// Since what is wanted is the ID, not the file name, and the ID is
		// the Sheeet name in the Excel spreadsheet, one has to get that value
		// the the name of the data table, since that is where JMP 13 places the
		// sheet name
		dt:File[N Rows( dt ) ] = dt1 << Get Name;
		// Close the no longer needed raw data table
		Close( dt1, no save );
	)
);

Summary of multiple CSV files

Re: Summary of multiple CSV files

Re: Summary of multiple CSV files

Re: Summary of multiple CSV files

Re: Summary of multiple CSV files

Re: Summary of multiple CSV files

Re: Summary of multiple CSV files

Re: Summary of multiple CSV files

Re: Summary of multiple CSV files

Re: Summary of multiple CSV files

Re: Summary of multiple CSV files

Recommended Articles

Get Going with JMP: Essentials for Using JMP

Multiple-Group Analysis in Structural Equation Modeling

Analytics with Confidence 2: Models That Don't Generalise

Distribution new features for JMP 17