Jackie_
Level VI

Find common words in a string

Hi,

 

I have the following JSL code, which prints the common words and their counts. However, it doesn't appear to be counting them correctly.

The output I am getting is:

commonWords = ["c1" => 3, "cate" => 3];

 

It should be 

commonWords = ["c1" => 3, "cate" => 3, "shotp" => 3];

Any suggestions?

 

input = "CATE_N_C1_Shotp,CATE_P_C1_Shotp,CATE_P_C1_Shotp";// input string
// Define boundary characters as _ and ,
boundaryChars = "_,";
// Define the pattern for boundary and word
boundaryPat = Pat Any( boundaryChars ) | Pat Pos( 0 );
wordPat = Pat Break( boundaryChars ) | Pat Rem();
patMatchWord = boundaryPat + Pat Pos() >> position + wordPat >> word + boundaryPat;
// Associative array to count word frequency
wordsToCount = [=> ];
// Match pattern in the input
rc = Pat Match(
	input,
	patMatchWord + Pat Test(
		word = Lowercase( word ); // Convert word to lowercase for uniformity
		If( Contains( wordsToCount, word ),
			wordsToCount[word] = wordsToCount[word] + 1,
			wordsToCount[word] = 1
		);
	) + Pat R Pos( 0 ) // The pattern must reach the end
);
// Identify common words (those that appear in all elements)
elements = Words( input, "," );
totalElements = N Items( elements );
commonWords = [=> ];
For Each( {{word, count}}, wordsToCount,
	If( count == totalElements, commonWords[word] = count );
);
// Display the common words and their counts
Show( commonWords );

// commonWords = ["c1" => 2, "cate" => 2];

Thanks

 

1 ACCEPTED SOLUTION

Craige_Hales
Super User

Re: Find common words in a string

Nice! A few changes, explained below.

input = "CATE_N_C1_Shotp,CATE_P_C1_Shotp,CATE_P_C1_Shotp";// input string
// Define boundary characters as _ and ,
boundaryChars = "_,";
// Associative array to count word frequency
wordsToCount = [=> ];
// Match pattern in the input
rc = Pat Match(
	Lowercase( input ) || boundaryChars, // pre-normalize and add sentinel at end
	Pat Pos( 0 ) 
		+ Pat Repeat(
			Pat Break( boundaryChars ) >> word + Pat Span( boundaryChars ) 
			+ Pat Test(
				If( Contains( wordsToCount, word ),
					wordsToCount[word] = wordsToCount[word] + 1,
					wordsToCount[word] = 1
				);
				1; // explicitly, result of pattest is 'true'
			)
		) 
	+ Pat R Pos( 0 ) // The pattern must reach the end
);
// Identify common words (those that appear in all elements)
elements = Words( input, "," );
totalElements = N Items( elements );
commonWords = [=> ];
For Each( {{word, count}}, wordsToCount, If( count == totalElements, commonWords[word] = count ) );
// Display the common words and their counts
Show( commonWords ); // commonWords = ["c1" => 3, "cate" => 3, "shotp" => 3];

The main change is using Pat Repeat to walk through the input one token (word) at a time, and adding a sentinel separator at the end. Your original code walks the string by trying to match only one word, discovering that the word is not at the end of the string, advancing the start of the match by one character, and trying again. It misses the final "shotp" because there is no final separator. The work in Pat Test is simplified by lowercasing the input up front, at the same time the sentinel is added.

 

Pretty sure someone will propose a solution using the Words() function, which won't need a sentinel.
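
For reference, here is a minimal sketch of what such a Words()-based version might look like. It is not from the original post: the variable name tokens is made up for illustration, and it assumes a JMP version that supports For Each (already used in the code above). Words() splits on any character in its delimiter string, so the last token is picked up without a trailing separator.

```jsl
// Sketch only: count tokens with Words() instead of pattern matching.
input = "CATE_N_C1_Shotp,CATE_P_C1_Shotp,CATE_P_C1_Shotp"; // input string
boundaryChars = "_,";

// Split the lowercased input on either boundary character (no sentinel needed)
tokens = Words( Lowercase( input ), boundaryChars ); // "tokens" is a hypothetical name

// Associative array to count word frequency
wordsToCount = [=> ];
For Each( {token}, tokens,
	If( Contains( wordsToCount, token ),
		wordsToCount[token] = wordsToCount[token] + 1,
		wordsToCount[token] = 1
	)
);

// Keep the words whose total count equals the number of comma-separated elements
totalElements = N Items( Words( input, "," ) );
commonWords = [=> ];
For Each( {{word, count}}, wordsToCount,
	If( count == totalElements, commonWords[word] = count )
);
Show( commonWords );
```

Like the pattern version, this compares each word's total count with the number of elements, so a word that appears twice in one element and is missing from another would still be reported as common. If that matters for your data, count each word at most once per element.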

 

edit: Somehow I missed this sentinel from the past.

Craige
