Challenge 1 results are in!

DonMcCormack · Apr 23, 2020 12:43 PM

Programming wisdom suggests that turning a cyclical programming task into a linear one, vectorization, improves computation time. But is this true? Under what conditions? By how much? Challenge 1 provided a great opportunity to find out. To answer the four challenge questions, I decided that for each set of seven letters I would:

Count the number of times each letter appeared in a dictionary word. This allowed me to find answers to items 2, 3 and 4.
Determine if a word had letters not in the seven-letter set. This was needed for item 1.

Challenge 1 try 1.JSL does the counting using column formulas based on the Length and Regex functions. A few additional column formulas are used to make the code easier to read. The only looping is used to reevaluate the functions for each letter set. The run time is just under 29 seconds (all times are averaged over 10 runs). Not bad, but not great. Is it the complexity of the formulas or because the formula evaluation should have been done using a loop (over a list or matrix) instead of a column? I used test simple.JSL and test full.JSL to investigate. In addition to comparing column versus loop, I looked to see if it was faster to read the input values directly from the table or from a list. Does it add time to first read the values from the table into a list or matrix?

The results below (in seconds) show that, as the formulas became more complex, the difference between formula and loop are less apparent. The functions used to build the formulas had a much greater impact on time. The general search functions (Regex and Loc) were much slower than specific search functions (Contains and Substr).

Functions	Formula	List + List Access	List + Table Access
Length	0.0525	0.1398	0.1311
Loc + Items	0.4187	0.5462	0.5383
Length + Regex	9.3744	9.2745	9.3911
Loops + Contains + Substr	3.3255	3.4139	3.5444

Putting this knowledge into practice, I created two new scripts. Challenge 1 try 2.JSL uses formulas with a loop inside one formula to count the letters. Challenge 2 Try 3.JSL uses lists rather than formulas. It also takes advantage of the condition that for any words to satisfy items 1, 2 and 3, it must contain all seven letter set letters at least once. The run times were 15.50 and 10.40 seconds, respectively. In this case, the looping was faster likely because there were fewer functions to evaluate. Either improvement is much better than my initial attempt, but still not the fastest. Improvements would have to come from faster functions, less looping/function evaluations, or a different method for answering the challenge questions. Since first two approaches seemed less likely to yield fruitful possibilities, a different method was sought.

The Challenge 1 Pattern Matching.JSL script is from @Craige_Hales. Craige recently retired from JMP Development but is still very active in the Community. He was responsible for the pattern matching and regular expression functionality in JMP. Craige’s solution vectorizes the problem by first converting the dictionary list into a long string, then applies pattern matching to parse the words into solution buckets. It runs in 3.71 seconds and offers a great example of the breadth and flexibility of pattern matching. The fastest solution, taking 2.76 seconds, was from @brady_brady. Rather than applying each letter set to the dictionary words, he starts by counting the number of occurrences for every alphabet letter in each dictionary word. While this, initially, increases the number of letter comparisons from 7 to 26, it only needs to be done once, avoiding reevaluation for each new letter set. Answer words are found by extracting the appropriate columns from the initial alphabet matrix, leading to a much faster solution. It is also the shortest script at 1146 characters.

Here is the leaderboard after the first challenge:

Name	Points
@P_Anderson	16
@MathStatChem	13
@leelaim	11
@David_Burnham	8

Stay tuned for Challenge 2 coming out May 1!

MathStatChem · ‎04-28-2020

So I used basically the same logic as @brady_brady , but my initial letter counting code was just very innefficent, using more loops and calls to lists. I substituted his approach and sure enough my script ran in about 3 seconds. Very cool!

The function items() is really key to this. I didn't realize it would do this. Converts a string to a list of its individual characters (in order, with repeated letters as separate items in the list) with just

items(string);

so items("easypeasy","") returns {"e", "a", "s", "y", "p", "e", "a", "s", "y"}.

Then using contains to reference the index of the letter of the alphabet was a really nice trick, also.

contains("abcedfghijklmnopqrstuvwxyz","a") returns 1,

contains("abcedfghijklmnopqrstuvwxyz","b") returns 2, ...

I made my code and output to the log a little more verbose, not sure if you removed comments before counting the characters.

Here is my code (improved with the faster approach):

startTime = Time Of Day( Today() );

letterSetList = {"apouter", "abdcefl", "acouter", "rstlneo", "efghijk"};

alphabet = {"a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x",
"y", "z"};
abet="abcedfghijklmnopqrstuvwxyz";
_dt = Current Data Table();

numwords = N Rows( _dt );
_c = Column( _dt, "Words" );
wordlist = _c << get values;

// get lengths of each word in word list
// also create a matrix indicating how many times each letter in the alphabet is used in each word

wordlengths = J( numwords, 1, 0 );
alphacounts = J( numwords, 26, 0 );
For( ii = 1, ii <= numwords, ii++,
	its=items(wordlist[ii],"");
	wordlengths[ii] = Length( wordlist[ii] );
	for(jj=1, jj<=nitems(its), jj++, alphacounts[ii,contains(abet,its[jj])]++);
);

// also create a matrix indicating if a letter in the alphabet is used in each word
alphaindex = alphacounts > 0;
For( ll = 1, ll <= N Items( letterSetList ), ll++,
	str = letterSetList[ll];
	// create an indicator vector for the letters in the string
	str_alphaindex = J( 1, 26, 0 );
	For( jj = 1, jj <= 26, jj++,
		For( kk = 1, kk <= Length( str ), kk++,
			str_alphaindex[1, jj] = str_alphaindex[1, jj] + (Substr( str, kk, 1 ) == alphabet[jj])
		)
	);

	// 1 The longest word using all seven letters and only those seven letters. Letters may be used more than once.

	//  words that meet this criteria will have a "zero distance" from their letter indicator row and the string letter indicator vector
	pickrow1 = Distance( alphaindex, str_alphaindex ) == 0;
    // now find the word that has the longest length
	pickrow1 = pickrow1 & (wordlengths == Max( wordlengths[Loc( pickrow1 )] ));
		
	Print( "1. Find longest word(s) that uses only the letters <" || str || "> (item " || Char( ll ) || " from itemSetList)" );
	
	If( Dim( Loc( pickrow1 ) )[1],
		Print( wordlist[Loc( pickrow1 )] );
		Print( "which is in the following row(s) from the Word List table " );
		Print( Loc( pickrow1 & (wordlengths == Max( wordlengths[Loc( pickrow1 )] )) ) );
	,
		Print( "no word(s) found" )
	);
	Print( "-------------------" );

	// 2 The longest word using all seven letters and any other letters.

	// words that meet this condition, the sum of their letter indictor rows for the letters in the string should equal 7
	pickrow2 = (V Sum( alphaindex[0, Loc( str_alphaindex )]` )` == 7);
	// now find the word that has the longest length
	pickrow2 = pickrow2 & (wordlengths == Max( wordlengths[Loc( pickrow2 )] ));
	Print(
		"2. Find longest word(s) that use all letters <" || str || "> (item " || Char( ll ) ||
		" from itemSetList) and/or any other letters"
	);
		
	If( Dim( Loc( pickrow2 ) )[1],
		Print( wordlist[Loc( pickrow2 )] );
		Print( "which is in the following row(s) from the Word List table " );
		Print( Loc( pickrow2 ) );
	,
		Print( "no word(s) found" )
	);
	
	Print( "-------------------" );
	

	// 3 The longest word using all seven letters and any other letters where no letter 
	//	from the set of seven is used more than once.

	//  words that meet this condition, meet the condition for the 2nd rule but also the sum of their letter counts for 
	// the letters in the string has to be equal to 7 
	pickrow3 = (V Sum( alphaindex[0, Loc( str_alphaindex )]` )` == 7) & (V Sum( alphacounts[0, Loc( str_alphaindex )]` )` == 7);
	// now find the word that has the longest length	
	pickrow3 = pickrow3 & (wordlengths == Max( wordlengths[Loc( pickrow3 )] ));
	
	Print(
		"3. Find longest word(s) that use all letters <" || str || "> (item " || Char( ll ) ||
		" from itemSetList) only once and/or any other letters "
	);
	If( Dim( Loc( pickrow3 ) )[1],
		Print( wordlist[Loc( pickrow3 )] );
		Print( "which is in the following row(s) from the Word List table " );
		Print( Loc( pickrow3 ) );
	,
		Print( "no word(s) found" )
	);
	Print( "-------------------" );

	// 4 The longest word in which 5 of the 7 letters appears. 
	//	Any letter including those outside of the 7 may appear once or more 
	//	except the two unused letters from the list of 7, which may not appear at all.

	// words that meet this condition have the sum of their letter indicators (for the letters in the string) equal to 5
	pickrow4 = (V Sum( alphaindex[0, Loc( str_alphaindex )]` )` == 5);
	// now find the word that has the longest length
	pickrow4 = pickrow4 & (wordlengths == Max( wordlengths[Loc( pickrow4 )] ));
	
	Print(
		"4. Find longest word(s) that uses only 5 of the 7 letters <" || str || "> (item " || Char( ll ) ||
		" from itemSetList) and/or any other letters"
	);
	If( Dim( Loc( pickrow4 ) )[1],
		Print( wordlist[Loc( pickrow4 )] );
		Print( "which is in the following row(s) from the Word List table " );
		Print( Loc( pickrow4 ) );
	,
		Print( "no word(s) found" )
	);
	Print( "-------------------" );
);

endTime = Time Of Day( Today() );
Print( endTime - startTime );

DonMcCormack · ‎04-28-2020

@MathStatChem I remove the comments and white space before counting the characters. Your new script is running in about 3.6 sec on my machine in it's current state. Slower than Brady's but faster than Craige's. Pretty dramatic difference for a few small changes! Some of the items aren't answered correctly, though. Below is the letter sets I used along with the answers.

//acehlot
1. {"acacatechol"}
2. {"scientificophilosophical"}
3. {"hypercivilization"}
4. {"macracanthrorhynchiasis", "pancreaticoduodenostomy", "pericardiomediastinitis"}

//ilosuvx
1. {"lixivious"}
2.{"subclavioaxillary"}
3.{"excitovascular"}
4. {"formaldehydesulphoxylate"}

//mnoruwy
1.{"unwormy"}
2.{"countrywoman", "journeywoman", "laundrywoman"}
3.{"jurywoman"}
4.{"formaldehydesulphoxylate"}

The lengths of your new and old scripts were 2461 and 2469 respectively.