Subscribe Bookmark RSS Feed

Extracting a character string of varying lengths and positions

jswislar

Contributor

Joined:

Nov 17, 2016

I want to extract character strings from a variable (Dx1), but the length and position of the strings varies, but they are separated by dashes. Sometimes there are multiple dashes. For example...

 

"7812-abcdef     -9"

"v7892-xyz    -0"

"812-fghkj    -0"

"17361-gf-jhyt   -9"

 

What I would like to do is create a new variable that extracts the digits/characters before the first dash, and then another variable that extracts the characters between the first and second dashes. So,

Var1      Var2

7812      abcdef

v7892    xyz

812        fghkj

17361    gf-jhyt

 

I tried following the example posted here, but it just extracted the dash. Changing the +1 to another number extacts that number of characters, but my data length varies.

 

https://community.jmp.com/t5/Uncharted/JSL-Character-String-Functions/ba-p/21323

 

Here's what I wrote. What am I doing wrong?

 

 

Trim(
Substr(
:Name( "Dx1" ),
Contains( :Name( "Dx1" ), "-" ),
1
)
)

 

 

2 ACCEPTED SOLUTIONS

Accepted Solutions
txnelson

Super User

Joined:

Jun 22, 2012

Solution

You want to use the Word() function.  If you create a new column, "Var1", and place the below formula into the column, you will get the results you want

word(1,:Dx1,"-")

For Var2 it would be:

word(2,:Dx1,"-")

The Word() function is documented in the Scripting Index

     Help==>Scripting Index==>Word

Jim
txnelson

Super User

Joined:

Jun 22, 2012

Solution

65,000 rows is not a very large table for JMP to handle.  JMP can handle data tables with billions of rows.  What the issue is, is data inconsistancies in your data table.  The code assumes that at least one numeric field will be found after finding the character values for Var2.  Apparently there is a situation where a value of Dx1 does not end with numbers.  The following is a change in the format that adjusts for that issue

Final_Value = Word( 2, :Dx1, "-" );
i = 3;
(While( Is Missing( Num( Word( i, :Dx1, "-" ) ) ) == 1 & Word( i, :Dx1, "-" ) != "",
	Final_Value = Final_Value || "-" || Word( i, :Dx1, "-" );
	i++;
) ; Final_Value);
Jim
14 REPLIES
uday_guntupalli

Community Trekker

Joined:

Sep 15, 2014

@jswislar : 
     One way to tackle this could be to use Contains and Substr functions 

a = {"abc-12","abc - 45","abc-  89"}; 
List1 = list(); 
List2 = list(); 
For(i = 1 , i <= N Items(a), i++,
		Pos = Contains(a[i],"-"); 
		Insert Into(List1,Substr(a[i],1,Pos-1)); 
		Insert Into(List2,Substr(a[i],Pos+1,Length(a[i]))); 
   );




Best
Uday
pmroz

Super User

Joined:

Jun 23, 2011

The words function is your friend here.  Use the dash and space as delimiters and you can extract what you want easily.  Here's an example that uses formulas in a table:

dt = New Table( "Untitled", Add Rows( 4 ), 
	New Column( "dx1", Character, "Nominal",
		Set Values( {"7812-abcdef     -9", "v7892-xyz    -0", "812-fghkj    -0",
			"17361-gf-jhyt   -9"}
		)
	),
// Get the first word for Var1
	New Column( "Var1", Character, "Nominal", Formula( Words( :dx1, "- " )[1] ) ),

// Get the second word for Var1
	New Column( "Var2", Character, "Nominal", Formula( Words( :dx1, "- " )[2] ) )
);
txnelson

Super User

Joined:

Jun 22, 2012

Solution

You want to use the Word() function.  If you create a new column, "Var1", and place the below formula into the column, you will get the results you want

word(1,:Dx1,"-")

For Var2 it would be:

word(2,:Dx1,"-")

The Word() function is documented in the Scripting Index

     Help==>Scripting Index==>Word

Jim
jswislar

Contributor

Joined:

Nov 17, 2016

Thanks! That works well.

jswislar

Contributor

Joined:

Nov 17, 2016

Is there a way to get WORD to ignore some dashes? Some of the second words I'm trying to extract are hyphenated, so I'm only getting a partial extraction. For example, for "938-ex-marine" I get "938" and "ex" rather than "ex-marine"

txnelson

Super User

Joined:

Jun 22, 2012

JSL will allow the building and parsing of any these complex conditions.  It is up to the humans to determine what the rules are for such parsing.

 

In your example, what is the rule to determine what hyphenated values should be read as a combined value, and when to treat them as separate values?

Jim
jswislar

Contributor

Joined:

Nov 17, 2016

The only rule that I can think of that would be consistent would be that the second word appears beween numbers (#). Ultimately I want the phrase between the hyphens. So, "#-word     -#" and "#-word-word     -#" and "#-word-word word     -#"

txnelson

Super User

Joined:

Jun 22, 2012

Here is a formula that will do what you stated as the rule....

Final_Value = Word( 2, :Dx1, "-" );
i = 3;
(While( Is Missing( Num( Word( i, :Dx1, "-" ) ) ) == 1,
	Final_Value = Final_Value || "-" || Word( i, :Dx1, "-" );
	i++;
) ; Final_Value);

However, in your sample above, the sample "17361-gf-jhyt   -9" ends up with a Var2 value of "gf-jhyt".  Is that what you want?

 

Regardless, I hope you can  see that you can construct as complex of a formula as you may need to parse the string into whatevery the correct values are.  The only limitation is that you need to be able to determine the rules that need to be used.

Jim
jswislar

Contributor

Joined:

Nov 17, 2016

Thanks, Jim. That is what I was looking for.
How do I run this script? I tried it as a column formula. That didn't work, but I didn't expect it to. I also opened a script window and ran it there, but it just stopped responding.