Solved: Re: Read CSV line by line to find where data starts

miguello · Jun 9, 2023 9:56 AM

I need to load aCSV file which might be in number of different formats. And I need to find what line my data starts on. I know that the column names start after a certain line, let's say "[Data starts here]".

This is what I do:

ex = Load Text File(
	filePath
);
words = Words( ex, "\!n\!r" );
headersLocation = Loc(words, "[Data starts here]");

and then I use headersLocation+1 to specify where my headers start.

The problem happens when I have an empty line in my CSV file:

line1
line2

[Data Starts Here]
Header1, header2
0,0
1,1

When I break this file into words, line "[Data STarts Here]" will be the third one, Loca() function will give me 3 in this case, while in reality it's 4th line.

So when breaking file by Words function - that empty line just doesn't make it into the list.

And I don't know how many empty lines I may have in the file before data starts.

How do I go about it?

I can break by either "\!n" or "\!r" only, and then use Left() or Right() in a for loop to remove the remaining special symbol, but it seems too brute force and not very optimal and I'm not sure it is correct. Something like this:

For Each( {value, index}, words, words[index] = Right(value, Length(value)-1););

The whole script being:

ex = Load Text File(
	filePath
);
words = Words( ex, "\!n\!r" );
For Each( {value, index}, words, words[index] = Right(value, Length(value)-1););
headersLocation = Loc(words, "[Data starts here]");

Then it seems to be working. But is this the best and even a correct solution?

jthi · Apr 7, 2022 03:56 AM

Do you need the location for JMP's data import platform? If you have no need for the data before [Data Starts Here], you could just drop it

Names Default To Here(1);
csv = "line1
line2

[Data Starts Here]
Header1, header2
0,0
1,1
";

string_to_search = "[Data Starts Here]";
data_start_idx = Contains(csv, string_to_search);

data_str = Substr(csv, data_start_idx + Length(string_to_search) + 1);

And if you need the JMP's import, you can use Char To Blob() with the string you now have:

dt = Open(
	Char To Blob(data_str),
	Import Settings(
		End Of Line(CRLF, CR, LF),
		End Of Field(Comma, CSV(0)),
		Strip Quotes(1),
		Use Apostrophe as Quotation Mark(0),
		Use Regional Settings(0),
		Scan Whole File(1),
		Treat empty columns as numeric(0),
		CompressNumericColumns(0),
		CompressCharacterColumns(0),
		CompressAllowListCheck(0),
		Labels(1),
		Column Names Start(1),
		Data Starts(2),
		Lines To Read("All"),
		Year Rule("20xx")
	)
);

-Jarmo

View solution in original post

jthi · Apr 7, 2022 03:56 AM

Do you need the location for JMP's data import platform? If you have no need for the data before [Data Starts Here], you could just drop it

Names Default To Here(1);
csv = "line1
line2

[Data Starts Here]
Header1, header2
0,0
1,1
";

string_to_search = "[Data Starts Here]";
data_start_idx = Contains(csv, string_to_search);

data_str = Substr(csv, data_start_idx + Length(string_to_search) + 1);

And if you need the JMP's import, you can use Char To Blob() with the string you now have:

dt = Open(
	Char To Blob(data_str),
	Import Settings(
		End Of Line(CRLF, CR, LF),
		End Of Field(Comma, CSV(0)),
		Strip Quotes(1),
		Use Apostrophe as Quotation Mark(0),
		Use Regional Settings(0),
		Scan Whole File(1),
		Treat empty columns as numeric(0),
		CompressNumericColumns(0),
		CompressCharacterColumns(0),
		CompressAllowListCheck(0),
		Labels(1),
		Column Names Start(1),
		Data Starts(2),
		Lines To Read("All"),
		Year Rule("20xx")
	)
);

-Jarmo

miguello · Apr 7, 2022 04:02 AM

Hmmm... Interesting suggestion. Let me try it. It might be easier after all.