lehaofeng
Level V

How to determine if a string has duplicate characters and extract it out?

I have a table with over 1000 columns, and most of the column names contain duplicated words that need to be extracted. Could the experts here advise how to achieve this with JSL?
For example:
If the column name is abp-g5.GAIV_0099_P013_GAIV_0099_P013, extract GAIV_0099_P013.
If the column name is abp-g5.GAI1997_GAI1997, extract GAI1997.

1 ACCEPTED SOLUTION
txnelson
Super User

Re: How to determine if a string has duplicate characters and extract it out?

Here is one way to get rid of duplicates

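// Split the name into words at "_" and "."
theList = Words( "abp-g5.GAIV_0099_P013_GAIV_0099_P013", "_." );
// Walk backward so removals do not shift the items still to be checked
For( i = N Items( theList ), i >= 1, i--,
	// Drop this word if it occurs more than once in the list
	If( N Rows( Loc( theList, theList[i] ) ) > 1,
		Remove From( theList, i, 1 )
	)
);
// Rebuild the name: "." after the prefix, "_" between the remaining words
newName = theList[1] || "." || theList[2];
For( i = 3, i <= N Items( theList ), i++,
	newName = newName || "_" || theList[i]
);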
Jim
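
Since the question involves over 1000 column names, the same word-list approach could be wrapped in a function and called once per name. The following is only a sketch and not part of the original answer: the function name dedupeName is hypothetical, and it assumes every name has a prefix ending in "." followed by words joined with "_". Note that it drops any repeated word, even when the repeats are not adjacent.

```jsl
Names Default To Here( 1 );

// Hypothetical helper (name is an assumption): removes repeated words,
// keeping the first occurrence of each, and rebuilds the name.
dedupeName = Function( {name}, {Default Local},
	theList = Words( name, "_." );
	// Walk backward so removals do not shift unchecked items
	For( i = N Items( theList ), i >= 1, i--,
		If( N Rows( Loc( theList, theList[i] ) ) > 1,
			Remove From( theList, i, 1 )
		)
	);
	// Names with fewer than two words are returned unchanged
	If( N Items( theList ) < 2, Return( name ) );
	newName = theList[1] || "." || theList[2];
	For( i = 3, i <= N Items( theList ), i++,
		newName = newName || "_" || theList[i]
	);
	newName;
);

// The two names from the question
Show( dedupeName( "abp-g5.GAIV_0099_P013_GAIV_0099_P013" ) ); // expected "abp-g5.GAIV_0099_P013"
Show( dedupeName( "abp-g5.GAI1997_GAI1997" ) ); // expected "abp-g5.GAI1997"
```

As with the original script, the result keeps the "abp-g5." prefix. To rename a whole table, the function could be called inside a loop over the columns, passing it each column's current name.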

lehaofeng
Level V

Re: How to determine if a string has duplicate characters and extract it out?

Thank you, your method is very good!

jthi
Super User

Re: How to determine if a string has duplicate characters and extract it out?

A few questions:

  • What do you consider a "word"? Anything separated by "_" and "."?
  • What do you do if there are multiple duplicates? Are all extracted/only shortest/only longest/none?
  • Can the duplication happen at any point in the string? Is duplicated part always at the end?
-Jarmo
lehaofeng
Level V

Re: How to determine if a string has duplicate characters and extract it out?

Thank you. This is indeed a practical problem:
Whenever the column name contains a substring starting with GAI that is repeated, extract one copy of that substring, no matter how many times it is repeated.
There will always be duplicates.

ngambles
Level III

Re: How to determine if a string has duplicate characters and extract it out?

This is an interesting question. I like @txnelson's solution. In addition to his approach, this is a situation where regular expressions can be handy, so here is another way to do it.

The two examples provided have the following patterns:

  • a prefix may be present, for example "abp-g5."
  • the desired string/characters follow the prefix
  • the repeated/duplicated characters exactly match the desired string/characters
  • the repeated/duplicated characters are separated from the desired characters by an underscore
  • there is nothing other than duplicated characters at the end of the string
  • there is at least one repeated/duplicated string/characters present

Assuming these rules accurately define the column names that need changing, the following code will update all the column names in the "current data table" that match the above assumptions. The column names that do not match the assumptions will not be changed.

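names default to here(1);

dt = current data table();

for( k = 1, k <= n cols(dt), k++,
	
	colName = column(dt, k) << get name();
	// regex() returns missing when the name does not fit the pattern
	newName = regex( colName, "^.*?(.+)(?:_\1)+$", "\1" );
	// rename only when a duplicated suffix was found
	if( !is missing( newName ),
		try( column(dt, k) << set name( newName ) )
	);
);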
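
Before renaming a real table, it may help to try the pattern on literal strings. The sketch below is an addition, not part of the original reply. It applies the same pattern to the two names from the question and to one name with no repeated part; the variable name pattern is only illustrative.

```jsl
Names Default To Here( 1 );

// Same pattern as in the loop: lazy prefix, captured text, then one or
// more "_" + repeats of the captured text up to the end of the name
pattern = "^.*?(.+)(?:_\1)+$";

Show( Regex( "abp-g5.GAIV_0099_P013_GAIV_0099_P013", pattern, "\1" ) ); // expected "GAIV_0099_P013"
Show( Regex( "abp-g5.GAI1997_GAI1997", pattern, "\1" ) ); // expected "GAI1997"

// No repeated suffix here, so Regex() should return missing
Show( Is Missing( Regex( "abp-g5.GAI1997", pattern, "\1" ) ) );
```

Because the prefix part is lazy and the captured group is greedy, the first match found uses the shortest prefix and therefore the longest repeated text.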