BookmarkSubscribeRSS Feed
Choose Language Hide Translation Bar
vishalparekh

New Contributor

Joined:

Oct 25, 2018

Comparing datasets from two different files

hi,  How can we compare List 1 with List 2 to compute the string distance using Jaro Winkler?

 

Output file should have Names from List 1 and List 2 and a third column that had the JW score. 

 

https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance

 

List1

 

JOHN SMITH
TIGER WOODS
TONY MOLLY
JAMES A
MICKEY MOUSE
DONALD DUCK
DUNKIN DONUTS
TOKYO

 

 

List2:
JOHNTHAN SMITH
TIGER WOO
JOHN MOLLY
JAMES XXXX LLC
JAMES
TOKIO

6 REPLIES 6
uday_guntupalli

Community Trekker

Joined:

Sep 15, 2014

Re: Comparing datasets from two different files

@vishalparekh
      I don't think there is an in-built function in JMP to compute the distance you are after. You will have to define custom user defined functions to achieve what you are after. If I get a chance, later I will give it a shot. you might want to read up on function() and contains() in the Scripting Index to get some ideas. 

Best
Uday
txnelson

Super User

Joined:

Jun 22, 2012

Re: Comparing datasets from two different files

The reading of 2 different lists from two different files is trivial, and then the combining is also not much of an issue.  The real work would be the development of a script that calculates the Jaro-Winkler distances.  There is not a builtin function to calculate the Jaro-Winkler Distance.  While JMP could handle the task, it would take the effort to develop it.

Jim
vishalparekh

New Contributor

Joined:

Oct 25, 2018

Re: Comparing datasets from two different files

Do we have any other built-in function in JMP that is comparable to Jaro-Winkler?

txnelson

Super User

Joined:

Jun 22, 2012

Re: Comparing datasets from two different files

I don't see any similar statistics, so I found some code in another language and ported it over to JMP.  Below is a sample script that does what I think you want.  Copy it into a script window and run it.  It generates the to lists of words, into separate data tables.....I believe that is what you want.  And then it combines them into a single data table, and creates a new column with the Jaro-Winkler statistic.

names default to here(1);
dt1 = New Table( "Table 1",
	Add Rows( 8 ),
	New Column( "List 1",
		Character,
		"Nominal",
		Set Values(
			{"JOHN SMITH", "TIGER WOODS", "TONY MOLLY", "JAMES A", "MICKEY MOUSE",
			"DONALD DUCK", "DUNKIN DONUTS", "TOKYO"}
		)
	)
);

dt2 = New Table( "Table 2",
	Add Rows( 6 ),
	New Column( "List 2",
		Character,
		"Nominal",
		Set Values(
			{"JOHNTHAN SMITH", "TIGER WOO", "JOHN MOLLY", "JAMES XXXX LLC", "JAMES",
			"TOKIO"}
		)
	)
);

// Join the tables
dtJoined = dt1 << Join(
	With( dt2 ),
	By Row Number,
	Output Table( "Joined Table" )
);

// Create New column
dtJoined << New Column("Jaro-Winkler");

// Add the functions that create the Jaro-Winkler
jaromatch = Function( {string1, string2}, {i},

	str1_len = Length( string1 );
	str2_len = Length( string2 );

	allowedDist = Floor( Max( str1_len, str2_len ) / 2 ) - 1;

    //*** walk through string1 and match characters to string2 ***;//
	matchChars = "";
	For( i = 1, i <= str1_len, i++, 
	
         //*** get the part of string2 to search ***;
		allowed_start = Max( 1, i - allowedDist );
        //*** starting char position ***;
		allowed_str2 = Substr( string2, allowed_start, i + allowedDist - allowed_start + 1 );

        //*** find i char from string1 in string2 within the allowedDist ***;
		position = Contains( allowed_str2, Substr( string1, i, 1 ) );

		If( position > 0,
			matchChars = Concat( matchChars, Substr( allowed_str2, position, 1 ) );
            //*** Once a char is assigned, it can not be assigned again. So, chg char in string2. ***;
			string2 = Substr( string2, 1, allowed_start + position - 2 ) || "~" ||
			Substr( string2, allowed_start + position );
		);
	);

	Trim( matchChars );
);  

 /* count the number of "half" transpositions */
jarotrans = Function( {string1, string2},{i},
	ntrans = 0;
	For( i = 1, i <= Min( Length( Trim( string1 ) ), Length( Trim( string2 ) ) ), i++,
		If( Substr( string1, i, 1 ) != Substr( string2, i, 1 ),
			ntrans
			++)
	);

	ntrans / 2;
);

 /* get the length of the matching characters at the beginning */
getPrefixlen = Function( {string1, string2, maxprelen},{i},
	n = Min( maxprelen, Length( string1 ), Length( string2 ) );

	If( n == 0,
		n = 0,
		For( i = 1, i <= n, i++,
			If( Substr( string1, i, 1 ) != Substr( string2, i, 1 ),
				n = (i - 1)
			);  //*** all maxprelen characters match ***;
		)
	);
	n;
);

 /* calc the jaro distance */
jarodist = Function( {string1, string2}, {i},
    //*** get number of matched characters in string1 ***;
	m1 = jaromatch( string1, string2 );
	m1_len = Length( m1 );
	If( m1_len == 0, j_dist = 0 );

    //*** get number of matched characters in string2 ***;
	m2 = jaromatch( string2, string1 );
	m2_len = Length( m2 );
	If( m2_len == 0, j_dist = 0 );

    //*** get number of transposed characters ***;
	ntrans = jarotrans( m1, m2 );

    //*** calc jaro distance ***;
	j_dist = (m1_len / Length( string1 ) + m2_len / Length( string2 ) + (m1_len - ntrans) / m1_len) / 3;

	j_dist;
);

 /* calc the jaro-winkler distance */
jarowink = Function( {string1, string2, prefixscale},{i},
	string1 = Uppercase( Trim( string1 ) );
	string2 = Uppercase( Trim( string2 ) );

    //*** check for trivial case and calc JW if needed ***;
	If( string1 == string2,
		returned = 1.0,
		jarodist1 = jarodist( string1, string2 );
		prelen = getPrefixlen( string1, string2, 4 );
		returned = jarodist1 + prelen * prefixscale * (1 - jarodist1);
	);

	returned;
);

// Loop through the data table and create the values for the new column
For(i=1,i<=n rows(dtJoined), i++,
	dtJoined:Name("Jaro-Winkler")[i] = jarowink(dtJoined:List 1[i], dtJoined:List2[i], 0.1);
)
Jim
gzmorgan0

Super User

Joined:

Jul 25, 2016

Re: Comparing datasets from two different files

JMP provides the function Shortest Edit Script() and the example in the Scripting Index provides a script to find the longest common sequence.

One of the scripts written for the book JSL Companion, 2nd Edition defines the Levenshtein distance and provides examples; it is attached below.

 

The web would not attach the jsl file, so I changed the extension to .txt

 

 

vishalparekh

New Contributor

Joined:

Oct 25, 2018

Re: Comparing datasets from two different files

thank you. I will try out the options and circle back.