Jackie_
Level VI

Find common words in a string

Hi,

 

I have the following JSL code, which prints the common words and their counts. However, it doesn't appear to be counting them correctly.

The output I am getting is:

commonWords = ["c1" => 3, "cate" => 3];

 

It should be 

commonWords = ["c1" => 3, "cate" =>3, "shotp" => 3]; 

Any suggestions?

 

input = "CATE_N_C1_Shotp,CATE_P_C1_Shotp,CATE_P_C1_Shotp";// input string
// Define boundary characters as _ and ,
boundaryChars = "_,";
// Define the pattern for boundary and word
boundaryPat = Pat Any( boundaryChars ) | Pat Pos( 0 );
wordPat = Pat Break( boundaryChars ) | Pat Rem();
patMatchWord = boundaryPat + Pat Pos() >> position + wordPat >> word + boundaryPat;
// Associative array to count word frequency
wordsToCount = [=> ];
// Match pattern in the input
rc = Pat Match(
	input,
	patMatchWord + Pat Test(
		word = Lowercase( word ); // Convert word to lowercase for uniformity
		If( Contains( wordsToCount, word ),
			wordsToCount[word] = wordsToCount[word] + 1,
			wordsToCount[word] = 1
		);
	) + Pat R Pos( 0 ) // The pattern must reach the end
);
// Identify common words (those that appear in all elements)
elements = Words( input, "," );
totalElements = N Items( elements );
commonWords = [=> ];
For Each( {{word, count}}, wordsToCount,
	If( count == totalElements, commonWords[word] = count );
);
// Display the common words and their counts
Show( commonWords );

// commonWords = ["c1" => 2, "cate" => 2];

Thanks

 

2 ACCEPTED SOLUTIONS
Craige_Hales
Super User

Re: Find common words in a string

Nice! A few changes, explained below.

input = "CATE_N_C1_Shotp,CATE_P_C1_Shotp,CATE_P_C1_Shotp";// input string
// Define boundary characters as _ and ,
boundaryChars = "_,";
// Associative array to count word frequency
wordsToCount = [=> ];
// Match pattern in the input
rc = Pat Match(
	Lowercase( input ) || boundaryChars, // pre-normalize and add sentinel at end
	Pat Pos( 0 ) 
		+ Pat Repeat(
			Pat Break( boundaryChars ) >> word + Pat Span( boundaryChars ) 
			+ Pat Test(
				If( Contains( wordsToCount, word ),
					wordsToCount[word] = wordsToCount[word] + 1,
					wordsToCount[word] = 1
				);
				1; // explicitly, result of pattest is 'true'
			)
		) 
	+ Pat R Pos( 0 ) // The pattern must reach the end
);
// Identify common words (those that appear in all elements)
elements = Words( input, "," );
totalElements = N Items( elements );
commonWords = [=> ];
For Each( {{word, count}}, wordsToCount, If( count == totalElements, commonWords[word] = count ) );
// Display the common words and their counts
Show( commonWords ); // commonWords = ["c1" => 3, "cate" => 3, "shotp" => 3];

The main change is using Pat Repeat to walk through the input one token (word) at a time, and adding a sentinel separator at the end. Your original code walks the string by trying to match only one word; when it discovers that the word is not at the end of the string, it advances the start of the match by one character and tries again. It misses the final "shotp" because there is no final separator. The work in Pat Test is also simplified by pre-lowercasing the input at the same time the sentinel is added.
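To see the effect of the sentinel in isolation, here is a minimal sketch that runs the same word-plus-separator pattern on the last token with and without a trailing separator. The variable names (rcNoSentinel, rcSentinel, w1, w2) are made up for this illustration and are not part of the script above.

```jsl
Names Default To Here( 1 );

// Without a trailing separator, Pat Break() reaches the end of the
// string without finding a break character, so the match fails.
rcNoSentinel = Pat Match(
	"shotp",
	Pat Pos( 0 ) + Pat Break( "_," ) >> w1 + Pat Span( "_," )
);

// With the sentinel appended, the same pattern matches and captures the word.
rcSentinel = Pat Match(
	"shotp" || "_,",
	Pat Pos( 0 ) + Pat Break( "_," ) >> w2 + Pat Span( "_," )
);

Show( rcNoSentinel, rcSentinel, w2 );
```

The first match should report 0 and the second 1, with w2 holding "shotp", which is why the last word only gets counted once the sentinel is in place.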

 

Pretty sure someone will propose a solution using the Words() function, which won't need a sentinel.
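A minimal sketch of what such a Words()-based version might look like, assuming the same input and the same definition of "common" (the count equals the number of comma-separated elements). The name tokens is introduced here for illustration. Like the pattern version, it would overcount a word that is repeated inside a single element.

```jsl
Names Default To Here( 1 );

input = "CATE_N_C1_Shotp,CATE_P_C1_Shotp,CATE_P_C1_Shotp";

// Words() treats every character of the delimiter string as a separator,
// so "_," splits on underscores and commas in one call; no sentinel needed.
tokens = Words( Lowercase( input ), "_," );
totalElements = N Items( Words( input, "," ) );

// Count how often each token occurs
wordsToCount = [=> ];
For Each( {token}, tokens,
	If( Contains( wordsToCount, token ),
		wordsToCount[token] = wordsToCount[token] + 1,
		wordsToCount[token] = 1
	)
);

// Keep only the tokens whose count equals the number of elements
commonWords = [=> ];
For Each( {{word, count}}, wordsToCount,
	If( count == totalElements, commonWords[word] = count )
);
Show( commonWords );
```

For the example input this should keep "c1", "cate", and "shotp", each with a count of 3.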

 

edit: Somehow I missed this sentinel from the past.

Craige


jthi
Super User

Re: Find common words in a string

Of course, if you need an associative array, you cannot have it ordered in any way other than how JMP orders it, but you can order the keys and values you get from it by using ranks.

 

Trying to stay with the "JMP table" solution (I made a few assumptions):

Names Default To Here(1);

input = "CATE_N_C1_Shotp,CATE_P_C1_Shotp,CATE_P_C1_Shotp";
input = lowercase(input);

elements = Words(input, ",");

dt = New Table("Data",
	New Column("Word", Character, Nominal),
	New Column("Elementnr", Numeric, Nominal)
);

For Each({element, idx}, elements,
	l = Words(element, "_");
	nr = Repeat(idx, N Items(l));
	r = N Rows(dt);
	dt << Add Rows(N Items(l));
	
	dt[r+1::r+N Items(l), 1] = l;
	dt[r+1::r+N Items(l), 2] = nr;
);

new_col1 = dt << New Column("C", Numeric, Continuous, Formula(
	Col Number(:Elementnr, :Word)
));

new_col2 = dt << New Column("R", Numeric, Continuous, Formula(
	Col Min(Col Cumulative Sum(1, :Elementnr), :Word)
));
dt << run formulas;
new_col1 << delete formula;
new_col2 << delete formula;

Summarize(dt, elem = by(:Elementnr));
element_count = N Items(elem);

dt << Delete Rows(dt << get rows where(:C != element_count));
dt << Select Duplicate Rows(Match(:Word)) << Delete Rows << Clear Select;

aa = Associative Array(:Word, :C); // results in AA
r = Rank(:Word << get values); // use Rank to return in original order

Close(dt, No save);

keys = (aa << get keys)[r]; 
// Values does not need sorting as they always have same values
values = aa << get values;

Write();

show(aa, r, keys, values);
-Jarmo


7 REPLIES
Jackie_
Level VI

Re: Find common words in a string

@Craige_Hales Another question: is it possible to retain the order of the words?

 

commonWords = ["c1" => 3, "cate" => 3, "shotp" => 3];

It should follow the order of the words in the input (CATE_N_C1_Shotp):

commonWords = ["cate" => 3, "c1" => 3, "shotp" => 3];

 

 

jthi
Super User

Re: Find common words in a string

You can also utilize JMP tables:

Names Default To Here(1);

input = "CATE_N_C1_Shotp,CATE_P_C1_Shotp,CATE_P_C1_Shotp";
input = lowercase(input);

elements = Words(input, ",");

dt = New Table("Data",
	New Column("Word", Character, Nominal),
	New Column("Elementnr", Numeric, Nominal),
	private
);

For Each({element, idx}, elements,
	l = Words(element, "_");
	nr = Repeat(idx, N Items(l));
	r = N Rows(dt);
	dt << Add Rows(N Items(l));
	
	dt[r+1::r+N Items(l), 1] = l;
	dt[r+1::r+N Items(l), 2] = nr;
);

dt_summary = dt << Summary(
	Group(:Word),
	N,
	Subgroup(:Elementnr),
	Freq("None"),
	Weight("None"),
	Link to original data table(0),
	private
);

sums = V Sum((dt_summary[0, 3::N Cols(dt_summary)] > 0)`);
valid_idx = Loc(sums >= N Items(elements));

words = Associative Array(dt_summary[valid_idx, 1], dt_summary[valid_idx, 2]);

Close(dt, no save);
Close(dt_summary, no save);

show(words);

There are also some small optimizations that could be made if necessary.

-Jarmo
Jackie_
Level VI

Re: Find common words in a string

@jthi is it possible to retain the word order?


Should be: words = ["cate" => 3, "c1" => 3, "shotp" => 3];

Craige_Hales
Super User

Re: Find common words in a string

Maybe.

 

input = "CATE_N_C1_Shotp,CATE_P_C1_Shotp,CATE_P_C1_Shotp";// input string
// Define boundary characters as _ and ,
boundaryChars = "_,";
position = 0;
wordToPosition = [=> ];
// Associative array to count word frequency
wordsToCount = [=> ];
// Match pattern in the input
rc = Pat Match(
	Lowercase( input ) || boundaryChars, // pre-normalize and add sentinel at end
	Pat Pos( 0 ) 
		+ Pat Repeat(
			Pat Break( boundaryChars ) >> word + Pat Span( boundaryChars ) >> sep
			+ Pat Test(
				position += 1;
				If( Contains( wordsToCount, word ),
					wordsToCount[word] = wordsToCount[word] + 1,
					wordsToCount[word] = 1
				);
				If( Contains( wordToPosition, word ) & wordToPosition[word] != position,
					throw("word ordering not consistent")
				);
				wordToPosition[word] = position;
				if(sep==",", position = 0);
				1; // explicitly, result of pattest is 'true'
			)
		) 
	+ Pat R Pos( 0 ) // The pattern must reach the end
);
// Identify common words (those that appear in all elements)
elements = Words( input, "," );
totalElements = N Items( elements );
commonWords = [=> ];
For Each( {{word, count}}, wordsToCount, If( count == totalElements, commonWords[word] = count ) );
// Display the common words and their counts
Show( commonWords );// commonWords = ["c1" => 3, "cate" => 3, "shotp" => 3];
Show( wordToPosition );// wordToPosition = ["c1" => 3, "cate" => 1, "n" => 2, "p" => 2, "shotp" => 4];
keys = wordToPosition << getkeys;//{"c1", "cate", "n", "p", "shotp"}
vals = wordToPosition << getvalues;// {3, 1, 2, 2, 4}
sort = Rank( vals );// [2, 3, 4, 1, 5]
keys = keys[sort];// {"cate", "n", "p", "c1", "shotp"}
vals = vals[sort];// {1, 2, 2, 3, 4}
For Each( {k}, keys, If( Contains( commonWords, k ), Show( k, commonWords[k] ) ) );
/*
k = "cate";
commonWords[k] = 3;
k = "c1";
commonWords[k] = 3;
k = "shotp";
commonWords[k] = 3;
*/

The Throw() might alert you to one potential problem: a word that appears at a different position in different elements.

Craige
jthi
Super User

Re: Find common words in a string

It can. With just one example of data it is a bit annoying to try to figure out what should be done, though. For example, can the same word appear multiple times in the same element? And can the order change within elements?
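One way to make those questions less critical is sketched below, under assumptions that are not confirmed in this thread: a word counts at most once per element, and the output order is taken from the first element. The names elementCount, seen, and orderedWords are hypothetical.

```jsl
Names Default To Here( 1 );

input = "CATE_N_C1_Shotp,CATE_P_C1_Shotp,CATE_P_C1_Shotp";
elements = Words( Lowercase( input ), "," );
totalElements = N Items( elements );

// Count the number of elements that contain each word, counting a word
// at most once per element even if it is repeated there.
elementCount = [=> ];
For Each( {element}, elements,
	seen = [=> ];
	For Each( {w}, Words( element, "_" ),
		If( !Contains( seen, w ),
			seen[w] = 1;
			If( Contains( elementCount, w ),
				elementCount[w] = elementCount[w] + 1,
				elementCount[w] = 1
			);
		)
	);
);

// Walk the first element so the common words come out in input order
// (assumption: the first element defines the desired order).
orderedWords = {};
For Each( {w}, Words( elements[1], "_" ),
	If( elementCount[w] == totalElements & !Contains( orderedWords, w ),
		Insert Into( orderedWords, w )
	)
);
Show( orderedWords );
```

Because the result is a list rather than an associative array, the order is preserved; for the example input it should be "cate", "c1", "shotp". The per-word counts remain available in elementCount if they are needed alongside the ordered list.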

-Jarmo