Subscribe Bookmark RSS Feed

Truncating Character strings

twaintwist

Community Trekker

Joined:

Jun 14, 2012

I have multiple data tables that include a column called 'Found in Version".    The Major versions are usually 4 digits ---   the minor versions sometimes append a alpha-character to the end of a Major version number.   For the purposes of Graphing I don't want to show the minor versions --- instead I'd like to merge the Minor versions with their associated Major versions.  For example...   if a Major version is 5.1.2.1 then I want the minor version 5.1.2.1a to also be called 5.1.2.1.....

to throw another wrinkle in the mix ----   I may also want to convert all the 5.1.2.x's  into 5.1.2---- so 5.1.2.0, 5.1.2.2a, 5.1.2.3, 5.1.2.4a, etc... would become Version 5.1.2

The current column is 'Character', 'Nominal'......      I was thinking of 'truncating' the character strings in the column to 7 characters or 5 depending on the level of detail needed/desired --- but I'm not sure how to do that or if there is a better approach.

In addition --- I'm receiving datatables from different teams --- each seems to have a different versioning scheme --- so flexibility will have to be built-in ----- not sure if I have to analyze each data element s, and compare common characters to see if they 'belong' in the same version ???

2349_Character_trunc.jpg

1 ACCEPTED SOLUTION

Accepted Solutions
Solution

The answers here were very helpful and led me to think I may have been making this too complicated ---  so I tried the character function 'Left' but couldn't get it to work last week....  my syntax must have been 'off' --- because it worked for me today..

For each Row (:Found in Vers Truncated = Left(:Found in Version, 7));

----  this simply truncates the character string to the left most characters... which will strip off the lesser version characters ---since the version numbers from various teams is not consistent (some teams use 4 digits with appended alpha characters, some use 3 digits with no appended alpha characters) I think this method may work but I'll have to test which version scheme is being used to determine the level of truncation....

6 REPLIES
hai_kuo

Community Trekker

Joined:

Jan 25, 2012

Hi,

Not sure if this what you want:

data have;

input old :$10.;

cards;

5.1.2

5.1.2.1

5.1.2.1e

;

data want;

length new $5;

set have;

new=old;

run;

proc print;run;

Haikuo

pmroz

Super User

Joined:

Jun 23, 2011

This will do the trick.  Loops over all values and uses regular expressions to test each value.

dt = current data table();

dt << new column("Found in Vers Truncated", character, nominal);

for (i = 1, i <= nrows(dt), i++,

    test_value = column(dt, "Found in Version")[i];

// See if the last character is one of a to z

    if (is missing(regex(test_value, "[a-z]$")),

    // then

            column(dt, "Found in Vers Truncated")[i] = test_value;

        ,

    // else

        one_length = length(test_value);

        column(dt, "Found in Vers Truncated")[i] = substr(test_value, 1, (one_length - 1));

    );

  

);

tomkari

Community Trekker

Joined:

Jun 23, 2011

Apologies for using SAS code, as I'm not a JMP user...perhaps the equivalent functions exist in JMP?

This is very simple and very dumb, so the data must be quite conformed:

data have;

length FoundInVersion $32;

input FoundInVersion $;

cards;

5.1.2.1

5.1.2.2

5.2.2.2a

5.2.2.3

5.2.2.3x

5.2.2.9

5.2.2.9weird

run;

data want;

length FoundInVersionGood $32;

set have;

/* Using period as delimiter, scan for first, second, and third words (SCAN), and then concatenate them with periods between (CATX). */

/* assumes there will always be at least three levels */

FoundInVersionGood = catx(".", scan(FoundInVersion,1,"."), scan(FoundInVersion,2,"."), scan(FoundInVersion,3,"."));

run;

Tom

tomkari

Community Trekker

Joined:

Jun 23, 2011

On the other hand, this is the nuclear weapon of ETL, guaranteed to handle everything, but more work. You may find it's the only option if your different suppliers follow significantly different processes.

Simply create a unique conformance table for each supplier:

data have;

length FoundInVersion $32;

input FoundInVersion $;

cards;

5.1.2.1

5.1.2.2

5.2.2.2a

5.2.2.3

5.2.2.3x

5.2.2.9

5.2.2.9weird

run;

data master_conversion;

length FoundInVersionGood FoundInVersion $32;

input FoundInVersionGood $ FoundInVersion $;

cards;

5.1.2 5.1.2.1

5.1.2 5.1.2.2

5.2.2 5.2.2.2a

5.2.2 5.2.2.3

5.2.2 5.2.2.3x

5.2.2 5.2.2.9

5.2.2 5.2.2.9weird

run;

proc sql;

  create table want as

  select FoundInVersionGood as FoundInVersion

         from master_conversion inner join

              have on (master_conversion.FoundInVersion = have.FoundInVersion);

quit;

Tom

wiebepo

Community Trekker

Joined:

Oct 10, 2011

I think a regular expression (regex()) may address the issue.

Newcolumn("Found in Version Major",character,nominal,

formula(regex(:Found in Version, ".*?(\d+\.\d+\.\d+).*","\1")

);

".*?(\d+\.\d+\.\d+).*" means find just enough of anything, .*?, followed by a number of any length, \d+, decimal one time, \., a number of any length, \d+, decimal one time, \., a number of any length, \d+, followed by anything. Only return the part in parentheses, the numbers and decimals, \1.

For the minor version, change the regular expression (regex()) to have one more decimal and number.

Newcolumn("Found in Version Minor",character,nominal,

formula(regex(:Found in Version, ".*?(\d+\.\d+\.\d+\.\d+).*","\1")

);

Solution

The answers here were very helpful and led me to think I may have been making this too complicated ---  so I tried the character function 'Left' but couldn't get it to work last week....  my syntax must have been 'off' --- because it worked for me today..

For each Row (:Found in Vers Truncated = Left(:Found in Version, 7));

----  this simply truncates the character string to the left most characters... which will strip off the lesser version characters ---since the version numbers from various teams is not consistent (some teams use 4 digits with appended alpha characters, some use 3 digits with no appended alpha characters) I think this method may work but I'll have to test which version scheme is being used to determine the level of truncation....