Solved: Re: How to split a column with unstructured string

ReliabilityWolf · Sep 8, 2024 10:31 AM

Hi, experts

I like to split the column with unstructured string into bunches of columns. But I fail to make it because it is complicated.

my thinking is to split by line break, but can't make it from text to columns by delimiter. the alternative is to replace line break with any signals available in delimiter. the one option is to run Regex ( :Column 1, "^\R", "#",GLOBALREPLACE ), but it shows unexpected result.

hogi · Sep 8, 2024 9:24 AM

I had a similar idea : use substitute and search for 2x newlines.

To get the column names right, one could use the ":" as a separator, but:
- ":" is used also within the entries

- sample has a ":" - but misses a newline afterwards.

So I combened the two approaches:
- search 2x linebreak

- search ":"

to find the headers.

To separate the column name and the entry, one can stack the columns with the setting Contiguous.

Col2 is just used temporarily, so either it can be deleted at the end - or added as a Transform column.

Columns which are generated via Text to Columns from a Transform Column are the first ones in the table.
This makes it easy to select and stack them. Just count the new columns and select them by index.

With the Contiguous setting, some dummy columns are added, those are removed in the last step.

dt = Open( "$DOWNLOADS/example3.jmp" );
tmpCol = dt << Transform Column ( "split",
	Character,
	Formula(
		Substitute(
				Regex(
					Substitute( :Column 1, "\!r\!n", "#", "Sample:", "Sample^\!n" ),
					"#{2,}(.*?):",
					"^\1^",
					GLOBALREPLACE
				),
			"#", "\!r"
		)
	)
);
before = NItems(dt << get column names());
dt << Text to Columns(
	columns (tmpCol),
	Delimiters( "^" )
);
newCols = NItems(dt << get column names()) - before;

result = dt << Stack(
	columns(
		Eval(1::newCols)
	),
	Number of Series( Eval(newCols/2) ),
	Contiguous,
	Drop All Other Columns( 1 )
);


// Move column names up
result << Move up;

cols= result << get column names();
cols = Filter each({col}, cols, Contains(col << get name, "split"))
result << Delete Columns(cols);

Doesn't look nice and smooth ...

Plan B: set up a Turing machine to calculate the entries ; )

View solution in original post

BHarris · Sep 13, 2024 01:11 PM

And if you love regex but still find JSL mostly incomprehensible, you can recode the column, and do a string replace (see red-triangle, "Replace string...") and do a regex replace to an oddball character ("%" is one I often use), then split the column (Cols -> Text to columns) on that character.

Then you can look like a JMP super-hero by binding alt-ctrl-shift-r to "Recode" and alt-ctrl-shift-s to "Text to Columns", and it becomes just a few keystrokes to do all of that.

View solution in original post

ReliabilityWolf · Sep 8, 2024 10:32 AM

jthi · Sep 8, 2024 10:58 AM

You can get this by substituting and then replacing double linebreaks (or JMP's \!N) with something (I used ¤

Names Default To Here(1);

dt = Open("$DOWNLOADS/example3.jmp");

dt << New Column("Col2", Character, Nominal, Formula(
	Substitute(:Column 1, "\!N\!N", "¤")
));

And then one (or few more) extra steps are needed for the column names

-Jarmo

hogi · Sep 8, 2024 9:24 AM

I had a similar idea : use substitute and search for 2x newlines.

To get the column names right, one could use the ":" as a separator, but:
- ":" is used also within the entries

- sample has a ":" - but misses a newline afterwards.

So I combened the two approaches:
- search 2x linebreak

- search ":"

to find the headers.

To separate the column name and the entry, one can stack the columns with the setting Contiguous.

Col2 is just used temporarily, so either it can be deleted at the end - or added as a Transform column.

Columns which are generated via Text to Columns from a Transform Column are the first ones in the table.
This makes it easy to select and stack them. Just count the new columns and select them by index.

With the Contiguous setting, some dummy columns are added, those are removed in the last step.

dt = Open( "$DOWNLOADS/example3.jmp" );
tmpCol = dt << Transform Column ( "split",
	Character,
	Formula(
		Substitute(
				Regex(
					Substitute( :Column 1, "\!r\!n", "#", "Sample:", "Sample^\!n" ),
					"#{2,}(.*?):",
					"^\1^",
					GLOBALREPLACE
				),
			"#", "\!r"
		)
	)
);
before = NItems(dt << get column names());
dt << Text to Columns(
	columns (tmpCol),
	Delimiters( "^" )
);
newCols = NItems(dt << get column names()) - before;

result = dt << Stack(
	columns(
		Eval(1::newCols)
	),
	Number of Series( Eval(newCols/2) ),
	Contiguous,
	Drop All Other Columns( 1 )
);


// Move column names up
result << Move up;

cols= result << get column names();
cols = Filter each({col}, cols, Contains(col << get name, "split"))
result << Delete Columns(cols);

Doesn't look nice and smooth ...

Plan B: set up a Turing machine to calculate the entries ; )

hogi · Sep 8, 2024 12:31 PM

thanks @jthi! for Make New Formula Column return references to the newly created columns :
Since JMP18, New Formula Column returns a list of the newly generated column

Text to Columns could return as well a list with references to the columns which were generated:
here is the Wish: saving list of cols created by text to column

Please note:
the workaround with the associative array and remove operation doesn't work here - it shuffles the columns by sorting then "alphabetically" with 10 <2 - which ruins the stack operation.

BHarris · Sep 13, 2024 01:11 PM

And if you love regex but still find JSL mostly incomprehensible, you can recode the column, and do a string replace (see red-triangle, "Replace string...") and do a regex replace to an oddball character ("%" is one I often use), then split the column (Cols -> Text to columns) on that character.

Then you can look like a JMP super-hero by binding alt-ctrl-shift-r to "Recode" and alt-ctrl-shift-s to "Text to Columns", and it becomes just a few keystrokes to do all of that.

ReliabilityWolf · Sep 13, 2024 11:48 PM

@BHarris thanks for your advice. BTW, how do I do a regex replace linebreak with an oddball character over there?

BHarris · Sep 14, 2024 01:43 PM

A linebreak is "\n" in regex speak. So search for "\n" (no quotes), and click "Use regular expressions", then replace with "%".

ReliabilityWolf · Sep 12, 2024 10:50 AM

this is a wonderful idea to approach.