Programming wisdom suggests that turning a cyclical programming task into a linear one, vectorization, improves computation time. But is this true? Under what conditions? By how much? Challenge 1 provided a great opportunity to find out. To answer the four challenge questions, I decided that for each set of seven letters I would:
- Count the number of times each letter appeared in a dictionary word. This allowed me to find answers to items 2, 3 and 4.
- Determine if a word had letters not in the seven-letter set. This was needed for item 1.
Challenge 1 try 1.JSL does the counting using column formulas based on the Length and Regex functions. A few additional column formulas are used to make the code easier to read. The only looping is used to reevaluate the functions for each letter set. The run time is just under 29 seconds (all times are averaged over 10 runs). Not bad, but not great. Is it the complexity of the formulas or because the formula evaluation should have been done using a loop (over a list or matrix) instead of a column? I used test simple.JSL and test full.JSL to investigate. In addition to comparing column versus loop, I looked to see if it was faster to read the input values directly from the table or from a list. Does it add time to first read the values from the table into a list or matrix?
The results below (in seconds) show that, as the formulas became more complex, the difference between formula and loop are less apparent. The functions used to build the formulas had a much greater impact on time. The general search functions (Regex and Loc) were much slower than specific search functions (Contains and Substr).
Functions
|
Formula
|
List + List Access
|
List + Table Access
|
Length
|
0.0525
|
0.1398
|
0.1311
|
Loc + Items
|
0.4187
|
0.5462
|
0.5383
|
Length + Regex
|
9.3744
|
9.2745
|
9.3911
|
Loops + Contains + Substr
|
3.3255
|
3.4139
|
3.5444
|
Putting this knowledge into practice, I created two new scripts. Challenge 1 try 2.JSL uses formulas with a loop inside one formula to count the letters. Challenge 2 Try 3.JSL uses lists rather than formulas. It also takes advantage of the condition that for any words to satisfy items 1, 2 and 3, it must contain all seven letter set letters at least once. The run times were 15.50 and 10.40 seconds, respectively. In this case, the looping was faster likely because there were fewer functions to evaluate. Either improvement is much better than my initial attempt, but still not the fastest. Improvements would have to come from faster functions, less looping/function evaluations, or a different method for answering the challenge questions. Since first two approaches seemed less likely to yield fruitful possibilities, a different method was sought.
The Challenge 1 Pattern Matching.JSL script is from @Craige_Hales. Craige recently retired from JMP Development but is still very active in the Community. He was responsible for the pattern matching and regular expression functionality in JMP. Craige’s solution vectorizes the problem by first converting the dictionary list into a long string, then applies pattern matching to parse the words into solution buckets. It runs in 3.71 seconds and offers a great example of the breadth and flexibility of pattern matching. The fastest solution, taking 2.76 seconds, was from @brady_brady. Rather than applying each letter set to the dictionary words, he starts by counting the number of occurrences for every alphabet letter in each dictionary word. While this, initially, increases the number of letter comparisons from 7 to 26, it only needs to be done once, avoiding reevaluation for each new letter set. Answer words are found by extracting the appropriate columns from the initial alphabet matrix, leading to a much faster solution. It is also the shortest script at 1146 characters.
Here is the leaderboard after the first challenge:
Stay tuned for Challenge 2 coming out May 1!
Challenge 1 Pattern Matching.jsl