Discussions

SamKing · Dec 12, 2019 04:50 PM

I have a script that uses an input list to pull files from a directory. The file format is unusual so it takes some manipulation and each file is on the order of 60Mb so imports can take more time than I would want. My issue is that most files have the desired header row on row 15 but about 10% of the files have a header that could be anywhere from row 10-14 and there is no way to know until i open the file.

My current solution is to have the script open the first 20 rows, search for the row that contains the headers and assign that row number to a variable, close that dt, then reopen but using the variable as the header and variable + 1 as the first line. It works but it means I'm opening these files twice, though one is a limited preview only.

1) Is there a better way to do this?

2) And while I'm asking, is there a way to import a table but only import the columns that I want, based on name? My files have about 1000 columns and I only need about 15 of them.

Craige_Hales · Dec 12, 2019 07:30 PM

1) I think two imports is probably the easiest way and can be made faster...

2) You'll need to specify more than just the columns to import, but the speedup will be worth it...

use the text import wizard to mark the unwanted columns

Grab the source script, it has the JSL to omit columns

(for 1000 omitted columns it might be easier to edit the JSL than exclude with the GUI.)

Now, you can tweak the source script. Below it reads the first 100 lines. It is fast because (1) everything about the columns is specified and (2) it stops after 100 lines. For this data I should have specified Labels(0) and DataStarts(1).

Open(
	"f:\logOriginalAntenna.txt",// 100MB file takes about 2 seconds to load 100 lines
	columns(
		Omitted Column( . ),// use the import wizard to
		Omitted Column( . ),// get a source script with
		Omitted Column( . ),// omitted columns
		New Column( "c000004",
			Numeric,
			"Continuous",
			Format( "yyyy-mm-ddThh:mm:ss", 23, 3 ),
			Input Format( "yyyy-mm-ddThh:mm:ss", 3 )
		),
		New Column( "c000005", Character, "Nominal" )
	),
	Import Settings(
		End Of Line( CRLF, CR, LF ),
		End Of Field( Tab, Comma, CSV( 1 ) ),
		Strip Quotes( 1 ),
		Use Apostrophe as Quotation Mark( 0 ),
		Use Regional Settings( 0 ),
		Scan Whole File( 1 ),
		Treat empty columns as numeric( 0 ),
		CompressNumericColumns( 0 ),
		CompressCharacterColumns( 0 ),
		CompressAllowListCheck( 0 ),
		Labels( 1 ),
		Column Names Start( 1 ),
		Data Starts( 2 ),
		Lines To Read( 100 ),// specify the number of rows
		Year Rule( "20xx" )
	)
)

If your files always have the same column ordering, then specifying the format explicitly (above) will mean JMP does not need to study the file before importing it. I think JMP still studies the entire file even if you stop at 100 lines, so this could reduce the processing time, a lot.

If the files don't always have the same number of columns or order of columns, this won't work. You might be able to import the table, fix up the column names and data types, delete rows you don't want...using a single import.

Craige

View solution in original post

Craige_Hales · Dec 12, 2019 07:30 PM

1) I think two imports is probably the easiest way and can be made faster...

2) You'll need to specify more than just the columns to import, but the speedup will be worth it...

use the text import wizard to mark the unwanted columns

Grab the source script, it has the JSL to omit columns

(for 1000 omitted columns it might be easier to edit the JSL than exclude with the GUI.)

Now, you can tweak the source script. Below it reads the first 100 lines. It is fast because (1) everything about the columns is specified and (2) it stops after 100 lines. For this data I should have specified Labels(0) and DataStarts(1).

Open(
	"f:\logOriginalAntenna.txt",// 100MB file takes about 2 seconds to load 100 lines
	columns(
		Omitted Column( . ),// use the import wizard to
		Omitted Column( . ),// get a source script with
		Omitted Column( . ),// omitted columns
		New Column( "c000004",
			Numeric,
			"Continuous",
			Format( "yyyy-mm-ddThh:mm:ss", 23, 3 ),
			Input Format( "yyyy-mm-ddThh:mm:ss", 3 )
		),
		New Column( "c000005", Character, "Nominal" )
	),
	Import Settings(
		End Of Line( CRLF, CR, LF ),
		End Of Field( Tab, Comma, CSV( 1 ) ),
		Strip Quotes( 1 ),
		Use Apostrophe as Quotation Mark( 0 ),
		Use Regional Settings( 0 ),
		Scan Whole File( 1 ),
		Treat empty columns as numeric( 0 ),
		CompressNumericColumns( 0 ),
		CompressCharacterColumns( 0 ),
		CompressAllowListCheck( 0 ),
		Labels( 1 ),
		Column Names Start( 1 ),
		Data Starts( 2 ),
		Lines To Read( 100 ),// specify the number of rows
		Year Rule( "20xx" )
	)
)

If your files always have the same column ordering, then specifying the format explicitly (above) will mean JMP does not need to study the file before importing it. I think JMP still studies the entire file even if you stop at 100 lines, so this could reduce the processing time, a lot.

If the files don't always have the same number of columns or order of columns, this won't work. You might be able to import the table, fix up the column names and data types, delete rows you don't want...using a single import.

Craige

SamKing · Dec 13, 2019 10:58 AM

Thanks Craige. For the initial import to get the row with the column headers I used your advice and used the omit column for columns 2 to 1000 so I only import column 1 with 20 rows, grab what I need then close it. Then I reopen the file but using the information from the first part of the loop to properly select the column header row.

I do wish there was a better way to import specific columns for the 2nd part of the loop. In my situation the columns on my raw input file are likely to change order and quantity over time without me knowing. Would be cool if JMP had the ability to specify by name only the columns you want to import.

Craige_Hales · Dec 13, 2019 11:17 AM

glad that will help, seems like a worthy wish list item.

Craige

SamKing · Jan 24, 2020 12:24 PM

@Craige_Hales

I'm searching for a way to shorten the code for the import of these .csv-like files. I have one import that now has about 1000 columns and another with about 360, and this loops through hundreds of files now in a list called "FileList". I only want to import column 1 for the reasons stated above. My current code looks like this:

For( x = 1, x <= N Items( FileList ), x++,
		dt = Open(
			FileList[x],
			eval(columns(
				New Column( "Col1", Character, "Nominal" ), 
				Omitted Column( . ),
                Omitted Column( . ),
                Omitted Column( . ), (repeated 400 times!)
			),
			Import Settings(
				End Of Line( CRLF, CR, LF ),
				End Of Field( Spaces, Space, CSV( 0 ) ),
				Strip Quotes( 1 ),
				Column Names Start( 1 ),
				Data Starts( 1 ),
			),
			//invisible
		);

But i was trying to figure out how to use an eval() type expression but I'm not doing it correctly. I've tried the eval() expression in various places but can't seem to get it. I'm new to eval(). I'm thinking that maybe my issue is that my variable "Omit" ends up being a string but enclosed in quotes??

	Omit = repeat("Omitted Column( . ),",400)||"Omitted Column( . )";

	For( x = 1, x <= N Items( FileList ), x++,
		dt = Open(
			FileList[x],
			eval(columns(
				New Column( "Col1", Character, "Nominal" ), 
				Omit);
			),
			Import Settings(
				End Of Line( CRLF, CR, LF ),
				End Of Field( Spaces, Space, CSV( 0 ) ),
				Strip Quotes( 1 ),
				Column Names Start( 1 ),
				Data Starts( 1 ),
			),
		);

Craige_Hales · Jan 24, 2020 06:55 PM

I think you are right; you might need to use parse() to convert the string to an expression.

Here's another approach you can take which avoids strings and manipulates the expressions directly:

// testing data (I modified your end of field with a tab, below)
filelist = {"$sample_data\..\Import Data\bigclass.txt"};
x = 1; 

// build an expression with a placeholder COLUMNSEXPR
openstmt = Expr(
	Open(
		FileList[x],
		COLUMNSEXPR, // << placeholder for columns(...)
		Import Settings( End Of Line( CRLF, CR, LF ), End Of Field( tab, CSV( 0 ) ), 
		Strip Quotes( 1 ), Column Names Start( 1 ), Data Starts( 1 ) )
	)
);

// if you run the openstmt function now, it will import all the columns
//
// eval(openstmt);

// build a list using {} to make it easy to manipulate
COLUMNSLIST = {};
// insert the column you want to keep
Insert Into( COLUMNSLIST, Expr( New Column( "Col1", Character, "Nominal" ) ) );
// insert 5 omit columns...
For( i = 1, i < 5, i += 1,
	Insert Into( COLUMNSLIST, Expr( Omitted Column( . ) ) )
);
// COLUMNSLIST looks like this:
//{New Column( "Col1", Character, "Nominal" ), Omitted Column( . ),
//Omitted Column( . ), Omitted Column( . ), Omitted Column( . )}

// convert the {...} to columns(...)
Substitute Into( COLUMNSLIST, Expr( {} ), Expr( columns ) );

//show(nameexpr(COLUMNSLIST)); shows this:
//columns(New Column("Col1", Character, "Nominal"), 
//Omitted Column(.), Omitted Column(.), 
//Omitted Column(.), Omitted Column(.))
//
// if you leave out the nameexpr() it will (try to) evaluate rather than
// retrieving the expression for display. columns() is not a function in
// JSL, so evaluating makes an error message.

// nameexpr(openstmt) still show COLUMNSEXPR. replace it:
Substitute Into( openstmt, Expr( COLUMNSEXPR ), Name Expr( COLUMNSLIST ) );

// nameexpr(openstmt) is now 
//Open(
//	FileList[x],
//	columns(
//		New Column( "Col1", Character, "Nominal" ),
//		Omitted Column( . ),
//		Omitted Column( . ),
//		Omitted Column( . ),
//		Omitted Column( . )
//	),
//	Import Settings(
//		End Of Line( CRLF, CR, LF ),
//		End Of Field( tab, CSV( 0 ) ),
//		Strip Quotes( 1 ),
//		Column Names Start( 1 ),
//		Data Starts( 1 )
//	)
//)

Eval( openstmt ); // reads just the first column of the table

You could also do something similar by manipulating text, rather than expressions, and then

eval(parse( " the text you built ") )

Use the technique you are most comfortable with; there will be a minor speed improvement manipulating the expressions, but insignificant compared to reading a file from disk.

Note that the final statement above still depends on the value of X to choose a file to open, so there is no need to rebuild the expression.

Craige

Discussions

JSL Table import - Selecting correct row as headers when desired row number is variable

Re: JSL Table import - Selecting correct row as headers when desired row number is variable

Re: JSL Table import - Selecting correct row as headers when desired row number is variable

Re: JSL Table import - Selecting correct row as headers when desired row number is variable

Re: JSL Table import - Selecting correct row as headers when desired row number is variable

Re: JSL Table import - Selecting correct row as headers when desired row number is variable

Re: JSL Table import - Selecting correct row as headers when desired row number is variable

Recommended Articles