topic Re: Regex for filenames with poor data standards in Discussions

Regex for filenames with poor data standards

Alicia — Thu, 08 Jun 2023 16:30:53 GMT

Hello there,

I just saw a great talk at the 2023 Discovery Summit in Sitges on Regex and I wondered if this function could be used to extract key variables from my filenames which I get into JMP using the PDF Import Wizard. The problem is that the filenames are created by multiple operators and there is no common standard. I usually just use the re-code function and some general cleaning to get out what I need but it's very time-consuming, especially when I'm importing 100+ PDFs into JMP per week.

Here is an example script to show you what my data looks like. The key variables I want to extract are "Product Type", "Batch Number" and "Box Sampled". It was mentioned in the talk that another JMP User had a similar problem and a script was created to pull out the data they need. I'm not an expert in JSL so would love to get some help with this problem but appreciate it may be too complex to automate as there are so many inconsistences in the filenames. Any help/advice would be amazing! Best regards, Alicia

New Table( "REGEX Function Filename Example",
	Add Rows( 8 ),
	New Column( "Source Table",
		Character,
		"Nominal",
		Set Selected,
		Set Values(
			{
			"TYPE-A Z012345 BOX 10 ImpactTemplate_01.pdf - 2023-02-14 14-09-03 Page(2)  Table(1)",
			"TYPE-A Z012345 BOX 129 ImpactTemplate_01.pdf - 2023-02-14 14-09-03 Page(2)  Table(1)",
			"TYPE-B Z012346 BOX12 ImpactTemplate_01.pdf - 2023-02-14 14-09-03 Page(2)  Table(1)",
			"Z012347 TYPEB BOX16 ImpactTemplate_01.pdf - 2023-02-14 14-09-03 Page(2)  Table(1)",
			"TYPEC Z12347 BOX 103 ImpactTemplate_01.pdf - 2023-02-14 14-09-03 Page(2)  Table(1)",
			"BOX 14 TYPE-C Z012350 ImpactTemplate_01.pdf - 2023-02-14 14-09-03 Page(2)  Table(1)",
			"TYPE-D     Z012450 BOX 142 ImpactTemplate_01.pdf - 2023-02-14 14-09-03 Page(2)  Table(1)",
			"TYPED Z12451 BOX7 ImpactTemplate_01.pdf - 2023-02-14 14-09-03 Page(2)  Table(1)"
			}
		)
	),
	New Column( "Product Type",
		Character,
		"Nominal",
		Set Selected,
		Set Values(
			{"TYPE-A", "TYPE-A", "TYPE-B", "TYPE-B", "TYPE-C", "TYPE-C", "TYPE-D",
			"TYPE-D"}
		)
	),
	New Column( "Batch Number",
		Character,
		"Nominal",
		Set Selected,
		Set Values(
			{"Z012345", "Z012345", "Z012346", "Z012347", "Z012347", "Z012350",
			"Z012450", "Z012451"}
		)
	),
	New Column( "Box Sampled",
		Character,
		"Nominal",
		Set Selected,
		Set Values(
			{"BOX 10", "BOX 129", "BOX 12", "BOX 16", "BOX 103", "BOX 14", "BOX 142",
			"BOX 7"}
		)
	)
)

Re: Regex for filenames with poor data standards

vince_faller — Tue, 07 Mar 2023 17:27:28 GMT

I think these hit you.

dt << New Column("MyProdType", character, formula(Regex( :Source Table, "TYPE-?(\w)", "TYPE-\1" )));
dt << New Column("MyBatchNum", character, formula(
	v=Regex( :Source Table, "Z0{0,6}(\d{0,6})", "\1"); // getting just the non zero values
	"Z"||repeat("0", 6-length(v))||v
));
dt << New Column("My Box", character, formula(Regex( :Source Table, "BOX ?(\d+)", "BOX \1" )));

Re: Regex for filenames with poor data standards

brady_brady — Tue, 07 Mar 2023 21:26:50 GMT

As Vince's solution shows, Regex makes quick work of this problem, and is definitely the preferred method for handling a problem like this.

Regex is one of those things you can't learn overnight though... you learn it (and, if you're like me, RElearn it) in bits and pieces as you use it.

While you're picking up Regex, be sure to learn how to manipulate strings with the substitute ( ) and items ( ) commands, and learn how to manipulate lists with the sort list ( ) and concat items ( ) commands. These commands are not as powerful as regex, nor are they as flexible. But, it is surprising what you can do with them, and their simplicity makes them easy to understand and remember. Think of them as hitting the "80/20" of functionality/ease of use, along with functions like Eval Insert ( ). They are well worth learning.

As an illustration of this, run the code below against the data table you've provided, and examine the 4 columns it creates.

dt = current data table ( );

dt  << new column ( "lis", character, 
	formula( 
		concatItems(
			sortList (
				items(	[1  3], 
					substitute(:source table,
						" ", "",
						"-", "",
						"TYPE", 	",TYPE-",
						"BOX", 		",BOX ",
						"Z",		",Z",
						"Impact", 	","
					), 
					","
				)
			), 
		","	
		)
	)
);

dt << Text to Columns( columns( :lis ), Delimiters( "," ) );

Re: Regex for filenames with poor data standards

Alicia — Wed, 08 Mar 2023 15:34:56 GMT

Huge thanks to @brady_brady and @vince_faller. These solutions work perfectly and will save me so much time!! Thank you both