Solved: Re: How to determine if a string has duplicate characters and extract them?

lehaofeng · Jun 9, 2023 9:10 AM

I have a table with over 1000 columns, and most of the column names contain duplicated words that need to be extracted. Could the experts here advise how to achieve this with JSL?
For example:
If the column name is abp-g5.GAIV_0099_P013_GAIV_0099_P013, extract GAIV_0099_P013.
If the column name is abp-g5.GAI1997_GAI1997, extract GAI1997.

txnelson · May 19, 2023 01:14 AM

Here is one way to get rid of the duplicates:

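// Split the name into words at "_" and "."
theList = Words( "abp-g5.GAIV_0099_P013_GAIV_0099_P013", "_." );
// Walk backward so that removing an item does not shift the items still to be checked
For( i = N Items( theList ), i >= 1, i--,
	// Loc() returns every position of the word; more than one means it is duplicated
	If( N Rows( Loc( theList, theList[i] ) ) > 1,
		Remove From( theList, i, 1 )
	)
);
// Rebuild the name: "." after the first word, "_" between the remaining words
newName = theList[1] || "." || theList[2];
For( i = 3, i <= N Items( theList ), i++,
	newName = newName || "_" || theList[i]
);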
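
To apply the same steps to every column in a table, they could be wrapped in a function and called in a loop. This is only a sketch: dedupName is a hypothetical helper name, and it assumes each name to be changed has the form prefix.word_word_... (names without a "." are left alone, because the rebuild step would otherwise swap their first "_" for a ".").

```jsl
Names Default To Here( 1 );
dt = Current Data Table();

// Hypothetical helper: the word-level de-duplication above, applied to one name
dedupName = Function( {name},
	{Default Local},
	// Assumption: only names with a "." prefix separator should be touched
	If( !Contains( name, "." ), Return( name ) );
	theList = Words( name, "_." );
	For( i = N Items( theList ), i >= 1, i--,
		If( N Rows( Loc( theList, theList[i] ) ) > 1,
			Remove From( theList, i, 1 )
		)
	);
	// The rebuild needs at least two words; otherwise keep the original name
	If( N Items( theList ) < 2, Return( name ) );
	newName = theList[1] || "." || theList[2];
	For( i = 3, i <= N Items( theList ), i++,
		newName = newName || "_" || theList[i]
	);
	newName;
);

// Rename only the columns whose name actually changes
For( k = 1, k <= N Cols( dt ), k++,
	oldName = Column( dt, k ) << Get Name;
	newName = dedupName( oldName );
	If( newName != oldName, Column( dt, k ) << Set Name( newName ) );
);
```

Because the comparison is done word by word, any word that legitimately occurs twice in a name would also be collapsed, so it may be worth running this on a copy of the table first.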

Jim

lehaofeng · May 24, 2023 10:25 PM

Thank you, your method is very good!

jthi · May 19, 2023 03:26 AM

A few questions:

What do you consider a "word"? Anything separated by "_" and "."?
What do you do if there are multiple duplicates? Are all extracted/only shortest/only longest/none?
Can the duplication happen at any point in the string? Is duplicated part always at the end?

-Jarmo

lehaofeng · May 24, 2023 10:28 PM

Thank you. This is indeed a practical problem:
Whenever a column name contains GAI, extract one copy of the repeated part that starts with GAI, no matter how many times it is repeated.
There will always be duplicates.
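
One possible reading of this rule, written as a regular expression, is sketched below. It assumes the repeated part always begins with GAI, that the copies are joined by an underscore, and that nothing follows the last copy. The third name in the list is a made-up example with three copies; it is not from the real table.

```jsl
Names Default To Here( 1 );

// First two names are from the question; the third is hypothetical (three copies)
examples = {"abp-g5.GAIV_0099_P013_GAIV_0099_P013", "abp-g5.GAI1997_GAI1997",
"abp-g5.GAI1997_GAI1997_GAI1997"};

For( k = 1, k <= N Items( examples ), k++,
	// (GAI.*?) lazily captures the shortest candidate that starts with "GAI";
	// (?:_\1)+$ requires one or more "_" plus identical copies up to the end of the name
	extracted = Regex( examples[k], "(GAI.*?)(?:_\1)+$", "\1" );
	Show( examples[k], extracted );
);
```

The lazy quantifier is what keeps the captured part to a single copy when the name contains more than two copies.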

ngambles · May 28, 2023 05:07 AM

This is an interesting question. I like @txnelson 's solution. In addition to his approach, this is a situation where regular expressions can be handy, so here is another way to do it.

The two examples provided have the following patterns:

a prefix may be present, for example "abp-g5."
the desired string/characters follow the prefix
the repeated/duplicated characters exactly match the desired string/characters
the repeated/duplicated characters are separated from the desired characters by an underscore
there is nothing other than duplicated characters at the end of the string
there is at least one repeated/duplicated string/characters present

Assuming these rules accurately define the column names that need changing, the following code will update all the column names in the "current data table" that match the above assumptions. The column names that do not match the assumptions will not be changed.

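Names Default To Here( 1 );

dt = Current Data Table();

For( k = 1, k <= N Cols( dt ), k++,
	colName = Column( dt, k ) << Get Name;
	// Regex() returns missing when the name does not fit the pattern
	newName = Regex( colName, "^.*?(.+)(?:_\1)+$", "\1" );
	// Rename only when the pattern matched, so other names are left unchanged
	If( !Is Missing( newName ),
		Try( Column( dt, k ) << Set Name( newName ) )
	);
);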
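
With over 1000 columns, it may be safer to preview the proposed names before renaming anything. The sketch below uses the same pattern but only writes the old and new names to the log. The variable names oldName and newName are illustrative, and the Is Missing check assumes that Regex returns a missing value when there is no match.

```jsl
Names Default To Here( 1 );
dt = Current Data Table();

// Dry run: log each proposed rename without changing the table
For( k = 1, k <= N Cols( dt ), k++,
	oldName = Column( dt, k ) << Get Name;
	newName = Regex( oldName, "^.*?(.+)(?:_\1)+$", "\1" );
	// Skip names that do not fit the assumed pattern
	If( !Is Missing( newName ),
		Write( oldName || " -> " || newName || "\!N" )
	);
);
```

Column names must be unique within a table, so the log is also a convenient place to spot two columns that would end up with the same extracted name.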