Find common words in a string
Hi,
I have the following JSL code, which prints the words common to every element along with their counts. However, it doesn't appear to be counting them correctly.
The output I am getting is:
commonWords = ["c1" => 3, "cate" => 3];
It should be:
commonWords = ["c1" => 3, "cate" => 3, "shotp" => 3];
Any suggestions?
input = "CATE_N_C1_Shotp,CATE_P_C1_Shotp,CATE_P_C1_Shotp";// input string
// Define boundary characters as _ and ,
boundaryChars = "_,";
// Define the pattern for boundary and word
boundaryPat = Pat Any( boundaryChars ) | Pat Pos( 0 );
wordPat = Pat Break( boundaryChars ) | Pat Rem();
patMatchWord = boundaryPat + Pat Pos()>>position + wordPat>>word + boundaryPat;
// Associative array to count word frequency
wordsToCount = [=>];
// Match pattern in the input
rc = Pat Match( input, patMatchWord
+ Pat Test(
word = Lowercase( word ); // Convert word to lowercase for uniformity
If( Contains( wordsToCount, word ),
wordsToCount[word] = wordsToCount[word] + 1,
wordsToCount[word] = 1
);
)
+ Pat R Pos( 0 ) // The pattern must reach the end
);
// Identify common words (those that appear in all elements)
elements = Words( input, "," );
totalElements = N Items( elements );
commonWords = [=>];
For Each( {{word, count}}, wordsToCount,
If( count == totalElements,
commonWords[word] = count
);
);
// Display the common words and their counts
Show( commonWords );// commonWords = ["c1" => 3, "cate" => 3];
Thanks
Accepted Solutions
Re: Find common words in a string
Nice! A few changes, explained below.
input = "CATE_N_C1_Shotp,CATE_P_C1_Shotp,CATE_P_C1_Shotp";// input string
// Define boundary characters as _ and ,
boundaryChars = "_,";
// Associative array to count word frequency
wordsToCount = [=> ];
// Match pattern in the input
rc = Pat Match(
Lowercase( input ) || boundaryChars, // pre-normalize and add sentinel at end
Pat Pos( 0 )
+ Pat Repeat(
Pat Break( boundaryChars ) >> word + Pat Span( boundaryChars )
+ Pat Test(
If( Contains( wordsToCount, word ),
wordsToCount[word] = wordsToCount[word] + 1,
wordsToCount[word] = 1
);
1; // explicitly, result of pattest is 'true'
)
)
+ Pat R Pos( 0 ) // The pattern must reach the end
);
// Identify common words (those that appear in all elements)
elements = Words( input, "," );
totalElements = N Items( elements );
commonWords = [=> ];
For Each( {{word, count}}, wordsToCount, If( count == totalElements, commonWords[word] = count ) );
// Display the common words and their counts
Show( commonWords );// commonWords = ["c1" => 3, "cate" => 3, "shotp" => 3];
;
The main change is using Pat Repeat to walk through the input one token (word) at a time, and adding a sentinel separator at the end. Your original code walks the string by trying to match only one word, discovering that the word is not at the end of the string, advancing the start of the match by one character, and trying again. It misses the final "shotp" because there is no final separator. The work in Pat Test is simplified by lowercasing the input at the same time the sentinel is added.
Pretty sure someone will propose a solution using the Words() function, which won't need a sentinel.
edit: Somehow I missed this sentinel from the past.
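For comparison, here is a minimal sketch of the Words()-based approach mentioned above. It is not from the original posts. It assumes "_" separates words and "," separates elements, and it makes one design choice that differs from the pattern version: each word is counted at most once per element (tracked in a hypothetical helper array, seen), so a word repeated inside a single element cannot reach the threshold on its own.

```jsl
Names Default To Here( 1 );
input = "CATE_N_C1_Shotp,CATE_P_C1_Shotp,CATE_P_C1_Shotp";
// Split into elements; Words() needs no trailing sentinel
elements = Words( Lowercase( input ), "," );
wordsToCount = [=> ];
For Each( {element}, elements,
	seen = [=> ]; // hypothetical helper: words already counted in this element
	For Each( {w}, Words( element, "_" ),
		If( !Contains( seen, w ),
			seen[w] = 1;
			If( Contains( wordsToCount, w ),
				wordsToCount[w] = wordsToCount[w] + 1,
				wordsToCount[w] = 1
			)
		)
	)
);
// Keep only the words that occur in every element
commonWords = [=> ];
For Each( {{w, count}}, wordsToCount,
	If( count == N Items( elements ), commonWords[w] = count )
);
Show( commonWords ); // commonWords = ["c1" => 3, "cate" => 3, "shotp" => 3];
```

If repeated words within one element should count every time, drop the seen check and the result matches the pattern-based counting for this input.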
Re: Find common words in a string
Of course, if you need an associative array, you cannot order it any way other than the way JMP does, but you can order the values you retrieve from it by using ranks.
Keeping with the "JMP table" solution (I made a few assumptions):
Names Default To Here(1);
input = "CATE_N_C1_Shotp,CATE_P_C1_Shotp,CATE_P_C1_Shotp";
input = lowercase(input);
elements = Words(input, ",");
dt = New Table("Data",
New Column("Word", Character, Nominal),
New Column("Elementnr", Numeric, Nominal)
);
For Each({element, idx}, elements,
l = Words(element, "_");
nr = Repeat(idx, N Items(l));
r = N Rows(dt);
dt << Add Rows(N Items(l));
dt[r+1::r+N Items(l), 1] = l;
dt[r+1::r+N Items(l), 2] = nr;
);
new_col1 = dt << New Column("C", Numeric, Continuous, Formula(
Col Number(:Elementnr, :Word)
));
new_col2 = dt << New Column("R", Numeric, Continuous, Formula(
Col Min(Col Cumulative Sum(1, :Elementnr), :Word)
));
dt << run formulas;
new_col1 << delete formula;
new_col2 << delete formula;
Summarize(dt, elem = by(:Elementnr));
element_count = N Items(elem);
dt << Delete Rows(dt << get rows where(:C != element_count));
dt << Select Duplicate Rows(Match(:Word)) << Delete Rows << Clear Select;
aa = Associative Array(:Word, :C); // results in AA
r = Rank(:Word << get values); // use Rank to return in original order
Close(dt, No save);
keys = (aa << get keys)[r];
// Values does not need sorting as they always have same values
values = aa << get values;
Write();
show(aa, r, keys, values);
Re: Find common words in a string
@Craige_Hales Another question: is it possible to retain the order of the words?
commonWords = ["c1" => 3, "cate" => 3, "shotp" => 3];
It should follow the word order of the input (CATE_N_C1_Shotp):
commonWords = ["cate" => 3, "c1" => 3, "shotp" => 3];
Re: Find common words in a string
You can also use JMP tables:
Names Default To Here(1);
input = "CATE_N_C1_Shotp,CATE_P_C1_Shotp,CATE_P_C1_Shotp";
input = lowercase(input);
elements = Words(input, ",");
dt = New Table("Data",
New Column("Word", Character, Nominal),
New Column("Elementnr", Numeric, Nominal),
private
);
For Each({element, idx}, elements,
l = Words(element, "_");
nr = Repeat(idx, N Items(l));
r = N Rows(dt);
dt << Add Rows(N Items(l));
dt[r+1::r+N Items(l), 1] = l;
dt[r+1::r+N Items(l), 2] = nr;
);
dt_summary = dt << Summary(
Group(:Word),
N,
Subgroup(:Elementnr),
Freq("None"),
Weight("None"),
Link to original data table(0),
private
);
sums = V Sum((dt_summary[0, 3::N Cols(dt_summary)] > 0)`);
valid_idx = Loc(sums >= N Items(elements));
words = Associative Array(dt_summary[valid_idx, 1], dt_summary[valid_idx, 2]);
Close(dt, no save);
Close(dt_summary, no save);
show(words);
There are also some small optimizations that could be made if necessary.
Re: Find common words in a string
@jthi is it possible to retain the word order?
It should be: words = ["cate" => 3, "c1" => 3, "shotp" => 3];
Re: Find common words in a string
Maybe.
input = "CATE_N_C1_Shotp,CATE_P_C1_Shotp,CATE_P_C1_Shotp";// input string
// Define boundary characters as _ and ,
boundaryChars = "_,";
position = 0;
wordToPosition = [=> ];
// Associative array to count word frequency
wordsToCount = [=> ];
// Match pattern in the input
rc = Pat Match(
Lowercase( input ) || boundaryChars, // pre-normalize and add sentinel at end
Pat Pos( 0 )
+ Pat Repeat(
Pat Break( boundaryChars ) >> word + Pat Span( boundaryChars ) >> sep
+ Pat Test(
position += 1;
If( Contains( wordsToCount, word ),
wordsToCount[word] = wordsToCount[word] + 1,
wordsToCount[word] = 1
);
If( Contains( wordToPosition, word ) & wordToPosition[word] != position,
throw("word ordering not consistent")
);
wordToPosition[word] = position;
if(sep==",", position = 0);
1; // explicitly, result of pattest is 'true'
)
)
+ Pat R Pos( 0 ) // The pattern must reach the end
);
// Identify common words (those that appear in all elements)
elements = Words( input, "," );
totalElements = N Items( elements );
commonWords = [=> ];
For Each( {{word, count}}, wordsToCount, If( count == totalElements, commonWords[word] = count ) );
// Display the common words and their counts
Show( commonWords );// commonWords = ["c1" => 3, "cate" => 3, "shotp" => 3];
Show( wordToPosition );// wordToPosition = ["c1" => 3, "cate" => 1, "n" => 2, "p" => 2, "shotp" => 4];
keys = wordToPosition << getkeys;//{"c1", "cate", "n", "p", "shotp"}
vals = wordToPosition << getvalues;// {3, 1, 2, 2, 4}
sort = Rank( vals );// [2, 3, 4, 1, 5]
keys = keys[sort];// {"cate", "n", "p", "c1", "shotp"}
vals = vals[sort];// {1, 2, 2, 3, 4}
For Each( {k}, keys, If( Contains( commonWords, k ), Show( k, commonWords[k] ) ) );
/*
k = "cate";
commonWords[k] = 3;
k = "c1";
commonWords[k] = 3;
k = "shotp";
commonWords[k] = 3;
*/
The throw() might alert you to one potential problem.
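If tracking positions inside the pattern is more than you need, a simpler sketch is to walk the first element and keep the words that are in commonWords. This is not from the original posts. It assumes the word order of the first element is the order you want, and it hard-codes commonWords to the result shown earlier in the thread so the snippet runs on its own; orderedKeys is a hypothetical name.

```jsl
Names Default To Here( 1 );
input = "CATE_N_C1_Shotp,CATE_P_C1_Shotp,CATE_P_C1_Shotp";
// Assumption: copied from the earlier result so this runs standalone
commonWords = ["c1" => 3, "cate" => 3, "shotp" => 3];
elements = Words( Lowercase( input ), "," );
firstElement = elements[1];
orderedKeys = {}; // a list keeps insertion order; an associative array does not
For Each( {w}, Words( firstElement, "_" ),
	// Keep common words only, and skip repeats within the first element
	If( Contains( commonWords, w ) & !Contains( orderedKeys, w ),
		Insert Into( orderedKeys, w )
	)
);
Show( orderedKeys ); // orderedKeys = {"cate", "c1", "shotp"};
For Each( {k}, orderedKeys, Show( k, commonWords[k] ) );
```

Unlike the version with Throw(), this does not check that the words appear in the same order in every element; it just reports the order found in the first one.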
Re: Find common words in a string
It can. With just one example of the data, though, it is hard to work out what should be done. For example, can the same word appear multiple times in the same element? And can the order change between elements?