Removing Duplicate Words

Craige_Hales · May 29, 2023 02:18 PM

Displaying duplicate words with Text Box markup.

@lehaofeng asked a question (well answered already) about removing duplicate strings within a bigger string. Here's some JSL that uses pattern matching to identify words, an associative array to locate duplicates, munger() to remove or edit the duplicates, <<markup mode on a textbox to display the results, and <<UnderlineStyle on a button box to make a web link. JMP 16 or later is required for For Each support.
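Before the full script, here is a minimal sketch of the editing step on a toy string. The sample text, the positions, and the bracket markers are made up for illustration; they are not from the poem. It assumes, as the script does, that Munger() with a length of 0 inserts its replacement text before the offset. Working from the right-most word to the left keeps the offsets of the words still waiting to be edited valid.

```jsl
// toy input (hypothetical); the two "the" words start at 1-based positions 1 and 13
text = "the cat and the hat";
positions = {13, 1}; // right-most first, so an insert never shifts a pending offset
lengths = {3, 3};
For Each( {{pos, len}}, Across( positions, lengths ),
	text = Munger( text, pos + len, 0, "]" ); // length 0: insert just after the word
	text = Munger( text, pos, 0, "[" ); // then insert just before the word
);
Show( text ); // expected: "[the] cat and [the] hat"
```

If the positions were processed left to right instead, the first pair of inserts would push the second word two characters to the right and its stored offset would be stale.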

A picture of the JSL output; clicking the link won't work here, but you can run the JSL below.

// https://www.poetryfoundation.org/poems/42916/jabberwocky
input = "’Twas brillig, and the slithy toves
      Did gyre and gimble in the wabe:
All mimsy were the borogoves,
      And the mome raths outgrabe.

“Beware the Jabberwock, my son!
      The jaws that bite, the claws that catch!
Beware the Jubjub bird, and shun
      The frumious Bandersnatch!”

He took his vorpal sword in hand;
      Long time the manxome foe he sought—
So rested he by the Tumtum tree
      And stood awhile in thought.

And, as in uffish thought he stood,
      The Jabberwock, with eyes of flame,
Came whiffling through the tulgey wood,
      And burbled as it came!

One, two! One, two! And through and through
      The vorpal blade went snicker-snack!
He left it dead, and with its head
      He went galumphing back.

“And hast thou slain the Jabberwock?
      Come to my arms, my beamish boy!
O frabjous day! Callooh! Callay!”
      He chortled in his joy.

’Twas brillig, and the slithy toves
      Did gyre and gimble in the wabe:
All mimsy were the borogoves,
      And the mome raths outgrabe.";
/*
make some rules for boundaries around words. The unicode apostrophe in ’Twas
is intentionally left out of boundaryChars, so ’Twas counts as one word; add it
if you prefer. The next three pattern matcher statements match any of the
boundary characters or the first character, a word of not-boundary characters
up to a boundary character or the last char, and a complete word with its
position and text stored into position and word.
*/      
boundaryChars = " .,:;!?“”\!n\!r";
boundaryPat = Pat Any( boundaryChars ) | Pat Pos( 0 );
wordPat = Pat Break( boundaryChars ) | Pat Rem();
patMatchWord = boundaryPat + Pat Pos()>>position + wordPat>>word + boundaryPat;
/*
set a couple of rules about the length of permitted words; scale is 1000 for 12
character words. scale is used to pack position*scale + length into a sortable
list. Finally, the wordToPos associative array will gather the results.
*/
minLength = 2;
maxLength = 12;
scale = 10^(1+ceiling(log10(maxLength))); // for packing position*scale+length
wordToPos = [=> ];
/*
Run the match. This loops over the input text using the 2nd argument pattern.
The pattern repeatedly matches a word, moving through the input text to find the
next word accepted by patMatchWord. At each position, patMatchWord sets the
variables position and word. Then the PatTest() function runs its script...
*/
rc = Pat Match( input, patMatchWord 
	+ Pat Test(
		word = Lowercase( word ); // make The and the equivalent
		wordLen = Length( word ); // check if this word is in the range to keep
		If( minLength <= wordLen <= maxLength,
			// position from pattern matching is zero-based. convert to 1-based for munger()...
			packed = (position + 1) * scale + wordLen; // pack together with sortable position			
			If( !Contains( wordToPos, word ), // has this word been added? if not...
				wordToPos[word] = {}; // create a list to hold it
				packed += .5; // flag this position is the first one for this word
			);
			Insert Into( wordToPos[word], packed ); // store the packed info
		);
		1; // pat test must return true 
	) 
	+ Pat R Pos( 0 )// the pattern must reach the end
);
If( rc == 0, Throw( "Bummer. Something has gone wrong." ) );
/*
wordToPos is an associative array; the keys are the words found in the input and
the values are lists of the packed position+length+first indicator. Many words only
occurred once and should be ignored. The others, with multiple occurrences, are 
all dumped into a list of duplicates (using the packed pos+len+1st value.)
*/
duplicates = {};
For Each( {{word, wordPosList}}, wordToPos,
	If( N Items( wordPosList ) > 1,
		Insert Into( duplicates, wordPosList[1 :: N Items( wordPosList )] )// starting at 2:: if you prefer to ignore first
	)
);
/*
the packed values are sortable. The input data will be edited from right-to-left
so the earlier indexes don't go bad; reverse will make the right-most indexes be
at the start of the list. Then break out the positions and lengths from the sorted
values. The lengths keep the 0.5 flag that marks a first occurrence.
*/
Sort List Into( duplicates ); // small to large
Reverse Into( duplicates ); //large to small
positions = Floor( duplicates / scale ); //unpack
lengths = Mod( duplicates, scale ); //unpack
/*
a few variables to control the report appearance. you might want to move colors here too.
*/
titlesize = 20;
authorsize = 14;
textsize = 12;
deletedsize = 10;
/*
make a copy of the input for cleaning, then use each position/length pair to 
modify the cleaned copy. The munger function is pretty perfect for this job.
This algorithm will be very slow if modifying long strings in a lot of places
because each modification requires splitting and rejoining a long string.
If this is critical, you'll want to use a list of short pieces of the string
and the concatitems() function to rejoin them when finished.
*/
cleaned = input;
foreach({{pos,len}},across(positions,lengths),
	if(len==floor(len),// no 0.5, not first, show as deleted
		//cleaned=munger(cleaned,pos,len,""); // delete, or...
		// ...use the <<markup textbox style to identify the deleted text
		cleaned = Munger( cleaned, pos + len, 0, "</font>" ); // furthest right first
		cleaned = Munger( cleaned, pos, 0, Eval Insert( "<font size='^deletedsize^' color='blue'>" ) );	
	,// else show as first
		cleaned = Munger( cleaned, pos + Floor( len ), 0, "</font>" ); // remove the 0.5
		cleaned = Munger( cleaned, pos, 0, "<font color='green'>" ); // furthest left goes 2nd
	);
);
/*
change double newlines to have a space between so both get used...
in <<markup mode multiple newlines collapse to one
*/
cleaned = regex(cleaned,"\!r\!r","\!r \!r",globalreplace);
/*
The window is a vlist box with titles above, poem middle, legend below.
The poem is two copies in an hlistbox, left unchanged, right cleaned up.
The hlistbox holding the two poems uses padding/margin/border to create
a floating black line. several textboxes use <<markup for color and size.
*/
New Window( "Before and After",
	V List Box(
		H Center Box( Text Box( "Jabberwocky", <<setfontsize( titlesize ) ) ),
		H Center Box( Text Box( "Lewis Carroll", <<setfontsize( authorsize ) ) ),
		H List Box(
			Text Box( input, <<setfontsize( textsize ) ),
			Spacer Box( size( 10, 1 ) ),
			Text Box( cleaned, <<setfontsize( textsize ), <<markup ),
			<<padding( Left( 9 ), Right( 9 ), top( 9 ), bottom( 9 ) ),
			<<margin( Left( 9 ), Right( 9 ), top( 9 ), bottom( 9 ) ),
			<<border( Left( 9 ), Right( 9 ), top( 9 ), bottom( 9 ) )
		),
		H Center Box(
			H List Box(
				Button Box( "poetryfoundation.org", Web( "https://www.poetryfoundation.org/poems/42916/jabberwocky" ), <<Underline Style( 1 ) ),
				Spacer Box( size( 100, 1 ) ),
				Text Box(
					"black used once   <font color='green'>green initial</font>   <font color='blue'>blue subsequent</font>",
					<<setfontsize( textsize ),
					<<markup
				)
			)
		)
	)
);
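The comment above the Munger loop mentions a faster route for long strings: collect short pieces in a list and join them once with Concat Items(). Here is a minimal sketch of that idea, not a drop-in replacement for the loop. The toy text, positions, and bracket markers are assumptions for illustration, and unlike the script it walks the positions left to right (ascending), since nothing is edited in place.

```jsl
// toy input (hypothetical); positions are 1-based and sorted ascending
text = "the cat and the hat";
positions = {1, 13};
lengths = {3, 3};
pieces = {};
cursor = 1; // first character not yet copied into pieces
For Each( {{pos, len}}, Across( positions, lengths ),
	If( pos > cursor, // copy the untouched text in front of this word
		Insert Into( pieces, Substr( text, cursor, pos - cursor ) )
	);
	Insert Into( pieces, "[" || Substr( text, pos, len ) || "]" ); // the marked-up word
	cursor = pos + len;
);
If( cursor <= Length( text ), // copy whatever follows the last word
	Insert Into( pieces, Substr( text, cursor ) )
);
result = Concat Items( pieces, "" ); // one join at the end, empty delimiter
Show( result );
```

Each piece is copied once and the long string is never split and rejoined, which is where the repeated Munger() calls spend their time. For a poem-sized input the difference is unlikely to matter.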

lehaofeng · 05-30-2023

wonderful