cancel
Showing results for 
Show  only  | Search instead for 
Did you mean: 
Try the Materials Informatics Toolkit, which is designed to easily handle SMILES data. This and other helpful add-ins are available in the JMP® Marketplace
Choose Language Hide Translation Bar
twaintwist
Level III

Truncating Character strings

I have multiple data tables that include a column called 'Found in Version".    The Major versions are usually 4 digits ---   the minor versions sometimes append a alpha-character to the end of a Major version number.   For the purposes of Graphing I don't want to show the minor versions --- instead I'd like to merge the Minor versions with their associated Major versions.  For example...   if a Major version is 5.1.2.1 then I want the minor version 5.1.2.1a to also be called 5.1.2.1.....

to throw another wrinkle in the mix ----   I may also want to convert all the 5.1.2.x's  into 5.1.2---- so 5.1.2.0, 5.1.2.2a, 5.1.2.3, 5.1.2.4a, etc... would become Version 5.1.2

The current column is 'Character', 'Nominal'......      I was thinking of 'truncating' the character strings in the column to 7 characters or 5 depending on the level of detail needed/desired --- but I'm not sure how to do that or if there is a better approach.

In addition --- I'm receiving datatables from different teams --- each seems to have a different versioning scheme --- so flexibility will have to be built-in ----- not sure if I have to analyze each data element s, and compare common characters to see if they 'belong' in the same version ???

2349_Character_trunc.jpg

1 ACCEPTED SOLUTION

Accepted Solutions
twaintwist
Level III

Re: Truncating Character strings

The answers here were very helpful and led me to think I may have been making this too complicated ---  so I tried the character function 'Left' but couldn't get it to work last week....  my syntax must have been 'off' --- because it worked for me today..

For each Row (:Found in Vers Truncated = Left(:Found in Version, 7));

----  this simply truncates the character string to the left most characters... which will strip off the lesser version characters ---since the version numbers from various teams is not consistent (some teams use 4 digits with appended alpha characters, some use 3 digits with no appended alpha characters) I think this method may work but I'll have to test which version scheme is being used to determine the level of truncation....

View solution in original post

6 REPLIES 6
hai_kuo
Level I

Re: Truncating Character strings

Hi,

Not sure if this what you want:

data have;

input old :$10.;

cards;

5.1.2

5.1.2.1

5.1.2.1e

;

data want;

length new $5;

set have;

new=old;

run;

proc print;run;

Haikuo

pmroz
Super User

Re: Truncating Character strings

This will do the trick.  Loops over all values and uses regular expressions to test each value.

dt = current data table();

dt << new column("Found in Vers Truncated", character, nominal);

for (i = 1, i <= nrows(dt), i++,

    test_value = column(dt, "Found in Version")[i];

// See if the last character is one of a to z

    if (is missing(regex(test_value, "[a-z]$")),

    // then

            column(dt, "Found in Vers Truncated")[i] = test_value;

        ,

    // else

        one_length = length(test_value);

        column(dt, "Found in Vers Truncated")[i] = substr(test_value, 1, (one_length - 1));

    );

  

);

tomkari
Level I

Re: Truncating Character strings

Apologies for using SAS code, as I'm not a JMP user...perhaps the equivalent functions exist in JMP?

This is very simple and very dumb, so the data must be quite conformed:

data have;

length FoundInVersion $32;

input FoundInVersion $;

cards;

5.1.2.1

5.1.2.2

5.2.2.2a

5.2.2.3

5.2.2.3x

5.2.2.9

5.2.2.9weird

run;

data want;

length FoundInVersionGood $32;

set have;

/* Using period as delimiter, scan for first, second, and third words (SCAN), and then concatenate them with periods between (CATX). */

/* assumes there will always be at least three levels */

FoundInVersionGood = catx(".", scan(FoundInVersion,1,"."), scan(FoundInVersion,2,"."), scan(FoundInVersion,3,"."));

run;

Tom

tomkari
Level I

Re: Truncating Character strings

On the other hand, this is the nuclear weapon of ETL, guaranteed to handle everything, but more work. You may find it's the only option if your different suppliers follow significantly different processes.

Simply create a unique conformance table for each supplier:

data have;

length FoundInVersion $32;

input FoundInVersion $;

cards;

5.1.2.1

5.1.2.2

5.2.2.2a

5.2.2.3

5.2.2.3x

5.2.2.9

5.2.2.9weird

run;

data master_conversion;

length FoundInVersionGood FoundInVersion $32;

input FoundInVersionGood $ FoundInVersion $;

cards;

5.1.2 5.1.2.1

5.1.2 5.1.2.2

5.2.2 5.2.2.2a

5.2.2 5.2.2.3

5.2.2 5.2.2.3x

5.2.2 5.2.2.9

5.2.2 5.2.2.9weird

run;

proc sql;

  create table want as

  select FoundInVersionGood as FoundInVersion

         from master_conversion inner join

              have on (master_conversion.FoundInVersion = have.FoundInVersion);

quit;

Tom

wiebepo
Level III

Re: Truncating Character strings

I think a regular expression (regex()) may address the issue.

Newcolumn("Found in Version Major",character,nominal,

formula(regex(:Found in Version, ".*?(\d+\.\d+\.\d+).*","\1")

);

".*?(\d+\.\d+\.\d+).*" means find just enough of anything, .*?, followed by a number of any length, \d+, decimal one time, \., a number of any length, \d+, decimal one time, \., a number of any length, \d+, followed by anything. Only return the part in parentheses, the numbers and decimals, \1.

For the minor version, change the regular expression (regex()) to have one more decimal and number.

Newcolumn("Found in Version Minor",character,nominal,

formula(regex(:Found in Version, ".*?(\d+\.\d+\.\d+\.\d+).*","\1")

);

twaintwist
Level III

Re: Truncating Character strings

The answers here were very helpful and led me to think I may have been making this too complicated ---  so I tried the character function 'Left' but couldn't get it to work last week....  my syntax must have been 'off' --- because it worked for me today..

For each Row (:Found in Vers Truncated = Left(:Found in Version, 7));

----  this simply truncates the character string to the left most characters... which will strip off the lesser version characters ---since the version numbers from various teams is not consistent (some teams use 4 digits with appended alpha characters, some use 3 digits with no appended alpha characters) I think this method may work but I'll have to test which version scheme is being used to determine the level of truncation....