Choose Language Hide Translation Bar
Highlighted
twaintwist
Level III

Truncating Character strings

I have multiple data tables that include a column called 'Found in Version".    The Major versions are usually 4 digits ---   the minor versions sometimes append a alpha-character to the end of a Major version number.   For the purposes of Graphing I don't want to show the minor versions --- instead I'd like to merge the Minor versions with their associated Major versions.  For example...   if a Major version is 5.1.2.1 then I want the minor version 5.1.2.1a to also be called 5.1.2.1.....

to throw another wrinkle in the mix ----   I may also want to convert all the 5.1.2.x's  into 5.1.2---- so 5.1.2.0, 5.1.2.2a, 5.1.2.3, 5.1.2.4a, etc... would become Version 5.1.2

The current column is 'Character', 'Nominal'......      I was thinking of 'truncating' the character strings in the column to 7 characters or 5 depending on the level of detail needed/desired --- but I'm not sure how to do that or if there is a better approach.

In addition --- I'm receiving datatables from different teams --- each seems to have a different versioning scheme --- so flexibility will have to be built-in ----- not sure if I have to analyze each data element s, and compare common characters to see if they 'belong' in the same version ???

2349_Character_trunc.jpg

1 ACCEPTED SOLUTION

Accepted Solutions
Highlighted
twaintwist
Level III

Re: Truncating Character strings

The answers here were very helpful and led me to think I may have been making this too complicated ---  so I tried the character function 'Left' but couldn't get it to work last week....  my syntax must have been 'off' --- because it worked for me today..

For each Row (:Found in Vers Truncated = Left(:Found in Version, 7));

----  this simply truncates the character string to the left most characters... which will strip off the lesser version characters ---since the version numbers from various teams is not consistent (some teams use 4 digits with appended alpha characters, some use 3 digits with no appended alpha characters) I think this method may work but I'll have to test which version scheme is being used to determine the level of truncation....

View solution in original post

6 REPLIES 6
Highlighted
hai_kuo
Level I

Re: Truncating Character strings

Hi,

Not sure if this what you want:

data have;

input old :$10.;

cards;

5.1.2

5.1.2.1

5.1.2.1e

;

data want;

length new $5;

set have;

new=old;

run;

proc print;run;

Haikuo

Highlighted
pmroz
Super User

Re: Truncating Character strings

This will do the trick.  Loops over all values and uses regular expressions to test each value.

dt = current data table();

dt << new column("Found in Vers Truncated", character, nominal);

for (i = 1, i <= nrows(dt), i++,

    test_value = column(dt, "Found in Version")[i];

// See if the last character is one of a to z

    if (is missing(regex(test_value, "[a-z]$")),

    // then

            column(dt, "Found in Vers Truncated")[i] = test_value;

        ,

    // else

        one_length = length(test_value);

        column(dt, "Found in Vers Truncated")[i] = substr(test_value, 1, (one_length - 1));

    );

  

);

Highlighted
tomkari
Level I

Re: Truncating Character strings

Apologies for using SAS code, as I'm not a JMP user...perhaps the equivalent functions exist in JMP?

This is very simple and very dumb, so the data must be quite conformed:

data have;

length FoundInVersion $32;

input FoundInVersion $;

cards;

5.1.2.1

5.1.2.2

5.2.2.2a

5.2.2.3

5.2.2.3x

5.2.2.9

5.2.2.9weird

run;

data want;

length FoundInVersionGood $32;

set have;

/* Using period as delimiter, scan for first, second, and third words (SCAN), and then concatenate them with periods between (CATX). */

/* assumes there will always be at least three levels */

FoundInVersionGood = catx(".", scan(FoundInVersion,1,"."), scan(FoundInVersion,2,"."), scan(FoundInVersion,3,"."));

run;

Tom

Highlighted
tomkari
Level I

Re: Truncating Character strings

On the other hand, this is the nuclear weapon of ETL, guaranteed to handle everything, but more work. You may find it's the only option if your different suppliers follow significantly different processes.

Simply create a unique conformance table for each supplier:

data have;

length FoundInVersion $32;

input FoundInVersion $;

cards;

5.1.2.1

5.1.2.2

5.2.2.2a

5.2.2.3

5.2.2.3x

5.2.2.9

5.2.2.9weird

run;

data master_conversion;

length FoundInVersionGood FoundInVersion $32;

input FoundInVersionGood $ FoundInVersion $;

cards;

5.1.2 5.1.2.1

5.1.2 5.1.2.2

5.2.2 5.2.2.2a

5.2.2 5.2.2.3

5.2.2 5.2.2.3x

5.2.2 5.2.2.9

5.2.2 5.2.2.9weird

run;

proc sql;

  create table want as

  select FoundInVersionGood as FoundInVersion

         from master_conversion inner join

              have on (master_conversion.FoundInVersion = have.FoundInVersion);

quit;

Tom

Highlighted
wiebepo
Level III

Re: Truncating Character strings

I think a regular expression (regex()) may address the issue.

Newcolumn("Found in Version Major",character,nominal,

formula(regex(:Found in Version, ".*?(\d+\.\d+\.\d+).*","\1")

);

".*?(\d+\.\d+\.\d+).*" means find just enough of anything, .*?, followed by a number of any length, \d+, decimal one time, \., a number of any length, \d+, decimal one time, \., a number of any length, \d+, followed by anything. Only return the part in parentheses, the numbers and decimals, \1.

For the minor version, change the regular expression (regex()) to have one more decimal and number.

Newcolumn("Found in Version Minor",character,nominal,

formula(regex(:Found in Version, ".*?(\d+\.\d+\.\d+\.\d+).*","\1")

);

Highlighted
twaintwist
Level III

Re: Truncating Character strings

The answers here were very helpful and led me to think I may have been making this too complicated ---  so I tried the character function 'Left' but couldn't get it to work last week....  my syntax must have been 'off' --- because it worked for me today..

For each Row (:Found in Vers Truncated = Left(:Found in Version, 7));

----  this simply truncates the character string to the left most characters... which will strip off the lesser version characters ---since the version numbers from various teams is not consistent (some teams use 4 digits with appended alpha characters, some use 3 digits with no appended alpha characters) I think this method may work but I'll have to test which version scheme is being used to determine the level of truncation....

View solution in original post

Article Labels

    There are no labels assigned to this post.