Our World Statistics Day conversations have been a great reminder of how much statistics can inform our lives. Do you have an example of how statistics has made a difference in your life? Share your story with the Community!
Choose Language Hide Translation Bar
Highlighted
Level I

## Comparing datasets from two different files

hi,  How can we compare List 1 with List 2 to compute the string distance using Jaro Winkler?

Output file should have Names from List 1 and List 2 and a third column that had the JW score.

https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance

List1

JOHN SMITH
TIGER WOODS
TONY MOLLY
JAMES A
MICKEY MOUSE
DONALD DUCK
DUNKIN DONUTS
TOKYO

List2:
JOHNTHAN SMITH
TIGER WOO
JOHN MOLLY
JAMES XXXX LLC
JAMES
TOKIO

6 REPLIES 6
Highlighted
Level VIII

## Re: Comparing datasets from two different files

@vishalparekh
I don't think there is an in-built function in JMP to compute the distance you are after. You will have to define custom user defined functions to achieve what you are after. If I get a chance, later I will give it a shot. you might want to read up on function() and contains() in the Scripting Index to get some ideas.

Best
Uday
Highlighted
Super User

## Re: Comparing datasets from two different files

The reading of 2 different lists from two different files is trivial, and then the combining is also not much of an issue.  The real work would be the development of a script that calculates the Jaro-Winkler distances.  There is not a builtin function to calculate the Jaro-Winkler Distance.  While JMP could handle the task, it would take the effort to develop it.

Jim
Highlighted
Level I

## Re: Comparing datasets from two different files

Do we have any other built-in function in JMP that is comparable to Jaro-Winkler?

Highlighted
Super User

## Re: Comparing datasets from two different files

I don't see any similar statistics, so I found some code in another language and ported it over to JMP.  Below is a sample script that does what I think you want.  Copy it into a script window and run it.  It generates the to lists of words, into separate data tables.....I believe that is what you want.  And then it combines them into a single data table, and creates a new column with the Jaro-Winkler statistic.

``````names default to here(1);
dt1 = New Table( "Table 1",
New Column( "List 1",
Character,
"Nominal",
Set Values(
{"JOHN SMITH", "TIGER WOODS", "TONY MOLLY", "JAMES A", "MICKEY MOUSE",
"DONALD DUCK", "DUNKIN DONUTS", "TOKYO"}
)
)
);

dt2 = New Table( "Table 2",
New Column( "List 2",
Character,
"Nominal",
Set Values(
{"JOHNTHAN SMITH", "TIGER WOO", "JOHN MOLLY", "JAMES XXXX LLC", "JAMES",
"TOKIO"}
)
)
);

// Join the tables
dtJoined = dt1 << Join(
With( dt2 ),
By Row Number,
Output Table( "Joined Table" )
);

// Create New column
dtJoined << New Column("Jaro-Winkler");

// Add the functions that create the Jaro-Winkler
jaromatch = Function( {string1, string2}, {i},

str1_len = Length( string1 );
str2_len = Length( string2 );

allowedDist = Floor( Max( str1_len, str2_len ) / 2 ) - 1;

//*** walk through string1 and match characters to string2 ***;//
matchChars = "";
For( i = 1, i <= str1_len, i++,

//*** get the part of string2 to search ***;
allowed_start = Max( 1, i - allowedDist );
//*** starting char position ***;
allowed_str2 = Substr( string2, allowed_start, i + allowedDist - allowed_start + 1 );

//*** find i char from string1 in string2 within the allowedDist ***;
position = Contains( allowed_str2, Substr( string1, i, 1 ) );

If( position > 0,
matchChars = Concat( matchChars, Substr( allowed_str2, position, 1 ) );
//*** Once a char is assigned, it can not be assigned again. So, chg char in string2. ***;
string2 = Substr( string2, 1, allowed_start + position - 2 ) || "~" ||
Substr( string2, allowed_start + position );
);
);

Trim( matchChars );
);

/* count the number of "half" transpositions */
jarotrans = Function( {string1, string2},{i},
ntrans = 0;
For( i = 1, i <= Min( Length( Trim( string1 ) ), Length( Trim( string2 ) ) ), i++,
If( Substr( string1, i, 1 ) != Substr( string2, i, 1 ),
ntrans
++)
);

ntrans / 2;
);

/* get the length of the matching characters at the beginning */
getPrefixlen = Function( {string1, string2, maxprelen},{i},
n = Min( maxprelen, Length( string1 ), Length( string2 ) );

If( n == 0,
n = 0,
For( i = 1, i <= n, i++,
If( Substr( string1, i, 1 ) != Substr( string2, i, 1 ),
n = (i - 1)
);  //*** all maxprelen characters match ***;
)
);
n;
);

/* calc the jaro distance */
jarodist = Function( {string1, string2}, {i},
//*** get number of matched characters in string1 ***;
m1 = jaromatch( string1, string2 );
m1_len = Length( m1 );
If( m1_len == 0, j_dist = 0 );

//*** get number of matched characters in string2 ***;
m2 = jaromatch( string2, string1 );
m2_len = Length( m2 );
If( m2_len == 0, j_dist = 0 );

//*** get number of transposed characters ***;
ntrans = jarotrans( m1, m2 );

//*** calc jaro distance ***;
j_dist = (m1_len / Length( string1 ) + m2_len / Length( string2 ) + (m1_len - ntrans) / m1_len) / 3;

j_dist;
);

/* calc the jaro-winkler distance */
jarowink = Function( {string1, string2, prefixscale},{i},
string1 = Uppercase( Trim( string1 ) );
string2 = Uppercase( Trim( string2 ) );

//*** check for trivial case and calc JW if needed ***;
If( string1 == string2,
returned = 1.0,
jarodist1 = jarodist( string1, string2 );
prelen = getPrefixlen( string1, string2, 4 );
returned = jarodist1 + prelen * prefixscale * (1 - jarodist1);
);

returned;
);

// Loop through the data table and create the values for the new column
For(i=1,i<=n rows(dtJoined), i++,
dtJoined:Name("Jaro-Winkler")[i] = jarowink(dtJoined:List 1[i], dtJoined:List2[i], 0.1);
)``````
Jim
Highlighted
Super User

## Re: Comparing datasets from two different files

JMP provides the function Shortest Edit Script() and the example in the Scripting Index provides a script to find the longest common sequence.

One of the scripts written for the book JSL Companion, 2nd Edition defines the Levenshtein distance and provides examples; it is attached below.

The web would not attach the jsl file, so I changed the extension to .txt

Highlighted
Level I

## Re: Comparing datasets from two different files

thank you. I will try out the options and circle back.

Article Labels

There are no labels assigned to this post.