Solved: Parse string of a column and fill another column

cchueng · Oct 6, 2021 01:58 PM

Hi,

I have a table containing ProductName and I want to generate the OUTPUT column. The OUTPUT column a concatenation of ProductID with Serial#. How do I accurately parse the Serial# and create the OUTPUT column?

Serial # examples:

07-08_13 means Serial# 07, 08, 13
09-12 means Serial# 09, 10, 11, 12
01-03_05-08 means Serial# 01, 02, 03, 05, 06, 07, 08
01_04_05-08 means Serial# 01, 04, 05, 06, 07, 08
09_11_13_14 means Serial# 09,11,13,14

ProductName	OUTPUT
AABBCC_07-08_13	AABBCC-07
AABBCC_07-08_13	AABBCC-08
AABBCC_07-08_13	AABBCC-13
AABBCC_09-12	AABBCC-09
AABBCC_09-12	AABBCC-10
AABBCC_09-12	AABBCC-11
AABBCC_09-12	AABBCC-12
EEFFGG_02	EEFFGG-02
HHIIJJ_01-07	HHIIJJ-01
HHIIJJ_01-07	HHIIJJ-02
HHIIJJ_01-07	HHIIJJ-03
HHIIJJ_01-07	HHIIJJ-04
HHIIJJ_01-07	HHIIJJ-05
HHIIJJ_01-07	HHIIJJ-06
HHIIJJ_01-07	HHIIJJ-07
KKLLLMM_01-03_05-08	KKLLLMM-01
KKLLLMM_01-03_05-08	KKLLLMM-02
KKLLLMM_01-03_05-08	KKLLLMM-03
KKLLLMM_01-03_05-08	KKLLLMM-05
KKLLLMM_01-03_05-08	KKLLLMM-06
KKLLLMM_01-03_05-08	KKLLLMM-07
KKLLLMM_01-03_05-08	KKLLLMM-08
NNOOPP_01_04_05-08	NNOOPP-01
NNOOPP_01_04_05-08	NNOOPP-04
NNOOPP_01_04_05-08	NNOOPP-05
NNOOPP_01_04_05-08	NNOOPP-06
NNOOPP_01_04_05-08	NNOOPP-07
NNOOPP_01_04_05-08	NNOOPP-08
QQRRSS_09_11_13_14	QQRRSS_09
QQRRSS_09_11_13_14	QQRRSS_11
QQRRSS_09_11_13_14	QQRRSS_13
QQRRSS_09_11_13_14	QQRRSS_14

txnelson · Oct 11, 2021 9:19 PM

It just had a minor error where a -1 was not removed when I restructured my code.

Names Default To Here( 1 );
dt = Current Data Table();

dt2 = New Table( "The Output",
	New Column( "Product Name", character ),
	New Column( "ProductID", Character ),
	New Column( "OUTPUT", Character )
);

For( dtRow = 1, dtRow <= N Rows( dt ), dtRow++,
	If( dtRow == 1 | dt:ProductName[dtRow - 1] != dt:ProductName[dtRow],
		prodID = Word( 1, dt:ProductName[dtRow], "_" );
		remainder = Substr( dt:ProductName[dtRow], Length( prodID ) + 2 );
		While( Word( 1, remainder, "_" ) != "",
			theWord = Word( 1, remainder, "_" );
			If( Contains( theWord, "-" ),
				numDigets = Length( Word( 1, theWord, "-" ) ) - 1;
				zeros = Repeat( "0", numDigets );
				For( loop = Num( Word( 1, theWord, "-" ) ),
					loop <= Num( Word( 2, theWord, "-" ) ), loop++,
					dt2 << Add rows( 1 );
					theRow = N Rows( dt2 );
					dt2:ProductName[theRow] = dt:ProductName[dtRow];
					dt2:ProductID[theRow] = Word( 1, dt:ProductName[dtRow], "_-" );
					dt2:output[theRow] = dt2:ProductID[theRow] || "-" ||
					Substr( zeros, Length( Char( loop ) ) ) || Char( loop );
				);
			,
				dt2 << Add Rows( 1 );
				theRow = N Rows( dt2 );
				dt2:output[theRow] = prodID || "-" || Word( 1, theWord, "-" );
				dt2:ProductName[theRow] = dt:ProductName[dtRow];
				dt2:ProductID[theRow] = Word( 1, dt:ProductName[dtRow], "_-" );
			);
			remainder = Substr( remainder, Length( theWord ) + 2 );
		);
	)
);

Please make sure you take the time to study the code, so you completely understand how it is working.

Jim

View solution in original post

jthi · Oct 6, 2021 02:21 PM

Most likely you could write a formula with Lag to do that (or script with JSL), but you can also attempt to do that interactively but it takes quite a few steps:

1. Select ProductName column

2. Go to Cols/Utilities/Text To columns and use both - and _ as delimiters

3. Go to Tables and Stack and add everything else besides first column to Stack Columns selection

5. Remove Label column

6. Select both columns which are left and go to Rows / Row Selection / Select Dublicate Rows and delete those rows

7. Select one cell with empty value and choose Select Matching Cells and delete those rows

8. Now choose both colums and right click on the column header and create formula

9. Open the just created formula and change the , to _

10. Rename and delete columns/formulas

11. If you need to have ProductName there, you should be able to keep it there if you don't accidentally delete it in first step like i did

If you have to have - or _ depending on the serialnumber you could handle it in the concatenate formula most likely

Edit: I checked the question a bit more accurately and this is a bit more complicated than I showed here. You will have to add calculations for some of the serial numbers because they are between numbers. This solution won't work in those because some rows will be removed when duplicates are removed.

-Jarmo

Chris_Kirchberg · Oct 6, 2021 02:29 PM

@jthi ,

You beat me to it. Your explanation is more detailed and what I was thinking of too.

Chris Kirchberg, M.S.²
Data Scientist, Life Sciences - Global Technical Enablement
JMP Statistical Discovery, LLC. - Denver, CO
Tel: +1-919-531-9927 ▪ Mobile: +1-303-378-7419 ▪ E-mail: chris.kirchberg@jmp.com
www.jmp.com

Chris_Kirchberg · Oct 6, 2021 02:38 PM

Oh, your right. the - is interpreted as a range and the _ is interpreted as a true delimiter of items

Chris Kirchberg, M.S.²
Data Scientist, Life Sciences - Global Technical Enablement
JMP Statistical Discovery, LLC. - Denver, CO
Tel: +1-919-531-9927 ▪ Mobile: +1-303-378-7419 ▪ E-mail: chris.kirchberg@jmp.com
www.jmp.com

cchueng · Oct 6, 2021 03:44 PM

Thanks but I prefer a jsl rather than the interactive delimiter. Your method did not address the scenario when the Serial # 09-12. It should output rows like below. Not just AABBCC-09 and AABBCC-12. The "-" means "to" meaning range. So 09-12 means 09, 10, 11, 12.

AABBCC_09-12	AABBCC-09
AABBCC_09-12	AABBCC-10
AABBCC_09-12	AABBCC-11
AABBCC_09-12	AABBCC-12

jthi · Oct 6, 2021 03:47 PM

Ok, this might work. Requires JMP16 due to usage of For Each (can be replaced with For for other versions. This is heavily scripted solution, with little or no comments and will require specific starting table to work (needs OUTPUT column).

Names Default To Here(1);

dt = New Table("Untitled 4993",
	Add Rows(32),
	Compress File When Saved(1),
	New Column("ProductName",
		Character,
		"Nominal",
		Set Values(
			{"AABBCC_07-08_13", "AABBCC_07-08_13", "AABBCC_07-08_13", "AABBCC_09-12", "AABBCC_09-12", "AABBCC_09-12", "AABBCC_09-12", "EEFFGG_02",
			"HHIIJJ_01-07", "HHIIJJ_01-07", "HHIIJJ_01-07", "HHIIJJ_01-07", "HHIIJJ_01-07", "HHIIJJ_01-07", "HHIIJJ_01-07", "KKLLLMM_01-03_05-08",
			"KKLLLMM_01-03_05-08", "KKLLLMM_01-03_05-08", "KKLLLMM_01-03_05-08", "KKLLLMM_01-03_05-08", "KKLLLMM_01-03_05-08", "KKLLLMM_01-03_05-08",
			"NNOOPP_01_04_05-08", "NNOOPP_01_04_05-08", "NNOOPP_01_04_05-08", "NNOOPP_01_04_05-08", "NNOOPP_01_04_05-08", "NNOOPP_01_04_05-08",
			"QQRRSS_09_11_13_14", "QQRRSS_09_11_13_14", "QQRRSS_09_11_13_14", "QQRRSS_09_11_13_14"}
		)
	),
	New Column("OUTPUT",
		Character(16),
		"Nominal",
		Set Values({"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""})
	)
);


Summarize(dt, uniqProd = by(:ProductName));

valList = {};
For Each({prod}, uniqProd,
	prodPart = Word(1, prod, "_");
	serialPart = Substr(prod, Contains(prod, "_") + 1); //remove start
	possibleRanges = Words(serialPart, "_"); //if they contain -
	For Each({val}, possibleRanges,
		If(Contains(val, "-"), //range
			rangeVals = Words(val, "-");
			For Each({val1}, Index(Num(rangeVals[1]), Num(rangeVals[2])), 
				valToAdd = Substr("0", Length(Char(val1))) || Char(val1);
				Insert Into(valList, prodPart || "-" || valToAdd)
			);
		,
			Insert Into(valList, prodPart || "-" || Substr("0", Length(Char(val))) || Char(val))
		)
	);
);

dt[0,"OUTPUT"] = valList;

Wondering if something like this could be done "easily" with JMP's own pattern matching @Craige_Hales

-Jarmo

cchueng · Oct 6, 2021 05:44 PM

Hi jthi,

Thanks for the script but my table is actually dynamic. Each day, I will pull up this table from a database with a different date range and different sets of ProductName will appear. So it is not always 32 rows. Your script can almost work but can you address the array size? Thanks.

jthi · Oct 7, 2021 04:19 AM

It should be fairly easy. Do you have as many rows in the pulled data as you will have serial numbers?

-Jarmo

cchueng · Oct 9, 2021 01:22 AM

Hi jthi,

Yes, the number of rows matched the number of serial numbers.

jthi · Oct 9, 2021 03:03 AM

If that is the case, it should already be able to handle it. I did small modifications (creation of OUTPUT column if it is missing). Currently it does rely on the fact that the datatable is sorted by ProductNames, which could cause issues at some point, or then you can make sure the input table is sorted (easiest solution).

You should be able to use the For Each part can be even used without datatable as long as you have list variable named uniqProd which has ProductNames. After that is just figuring out how to add values to datatable.

Names Default To Here(1);

//Test table
/*
dt = New Table("Untitled 4993",
	Compress File When Saved(1),
	New Column("ProductName",
		Character,
		"Nominal",
		Set Values(
			{"AABBCC_07-08_13", "AABBCC_07-08_13", "AABBCC_07-08_13", "AABBCC_09-12", "AABBCC_09-12", "AABBCC_09-12", "AABBCC_09-12", "EEFFGG_02",
			"HHIIJJ_01-07", "HHIIJJ_01-07", "HHIIJJ_01-07", "HHIIJJ_01-07", "HHIIJJ_01-07", "HHIIJJ_01-07", "HHIIJJ_01-07", "KKLLLMM_01-03_05-08",
			"KKLLLMM_01-03_05-08", "KKLLLMM_01-03_05-08", "KKLLLMM_01-03_05-08", "KKLLLMM_01-03_05-08", "KKLLLMM_01-03_05-08", "KKLLLMM_01-03_05-08",
			"NNOOPP_01_04_05-08", "NNOOPP_01_04_05-08", "NNOOPP_01_04_05-08", "NNOOPP_01_04_05-08", "NNOOPP_01_04_05-08", "NNOOPP_01_04_05-08",
			"QQRRSS_09_11_13_14", "QQRRSS_09_11_13_14", "QQRRSS_09_11_13_14", "QQRRSS_09_11_13_14"}
		)
	)
);
*/

dt = Current Data Table();

Summarize(dt, uniqProd = by(:ProductName));

valList = {};
For Each({prod}, uniqProd, //loop for each ProductName
	prodPart = Word(1, prod, "_");
	serialPart = Substr(prod, Contains(prod, "_") + 1); //remove start
	possibleRanges = Words(serialPart, "_"); //if they contain -
	For Each({val}, possibleRanges, //Loop for serialnumber inside the ProductName
		If(Contains(val, "-"), //range
			rangeVals = Words(val, "-");
			For Each({val1}, Index(Num(rangeVals[1]), Num(rangeVals[2])), 
				valToAdd = Substr("0", Length(Char(val1))) || Char(val1);
				Insert Into(valList, prodPart || "-" || valToAdd)
			);
		,
			Insert Into(valList, prodPart || "-" || Substr("0", Length(Char(val))) || Char(val))
		)
	);
);

//if starting table doesn't have OUTPUT column create it
If(!Contains(dt << Get Column Names("String"), "OUTPUT"), 
	dt << New Column("OUTPUT", Character, "Nominal");
);

//dt[0,"OUTPUT"] = valList; subscripting, maybe a bit more difficult to understand
dt:OUTPUT << Set Values(valList);

Simplified version what they script tries to do:

1. Script gets all unique values in ProductName column

2. Script loops over each of those in two part For Each loop

3. First loop loops over the unique values and splits them into Product and number parts

4. Serialnumber part is then looped again to split it into single values. These single values are concatenated with Product part and then the string is inserted into valList list (this 4. isn't exactly as simple as I explained here).

5. After both loops have finished you should have a list, which has all possible serialnumbers. Product part will be in alphabetical order.

6. This list is then added to OUTPUT if it exists, if it doesn't it is first created.

-Jarmo

Parse string of a column and fill another column

Re: Parse string of a column and fill another column

Re: Parse string of a column and fill another column

Re: Parse string of a column and fill another column

Re: Parse string of a column and fill another column

Re: Parse string of a column and fill another column

Re: Parse string of a column and fill another column

Re: Parse string of a column and fill another column

Re: Parse string of a column and fill another column

Re: Parse string of a column and fill another column

Re: Parse string of a column and fill another column