Thanks to those folks who gave Challenge 7 a try. Hopefully, even if you didn’t have a chance, you’ve been following the discussions. How can we solve this challenge? Conceptually, the simplest approach would find single-character duplicates, then two-character duplicates, then three-character, and so on, increasing the length by one character until no duplicates remain. Naïvely checking all possible strings of a given length grows increasingly time-consuming as the substrings get longer. For the Covid-19 genome, roughly 30K characters long, we are guaranteed to have duplicated seven-character substrings because there are only 4⁷ = 16,384 possible combinations; yet only 6,889 of them (42%) actually occur. And as the substrings grow longer, the number of possible combinations increases rapidly while the number of actual duplicates shrinks. An efficient method for identifying duplicates must be found for this approach to work. We will come back to this after we examine other ways to tackle the problem.
Let’s focus on just finding the longest duplicate without having to worry about finding all duplicates. A simple approach is to start with a seven-character substring at the first character, since duplicates of that length are guaranteed to exist. Grow the substring one character at a time for as long as a duplicate of it exists elsewhere. Then move to the next starting position and look for an even longer duplicate substring. Continue until you are close enough to the end of the search string that a longer duplicate isn’t possible. In the Covid-19 genome, an eight-character substring ATTAAAGG can be built starting at the first character. Moving to the second character, the 11-character string TTAAAGGTTTA starting at position 2 has a duplicate at position 20063. There is no duplicate 11 characters or longer starting at position 3, so we move to position 4. And so on. The code below needs about 5 seconds to find the longest duplicated string: two overlapping 32-character strings of As. Note that in the code below, when a duplicate exists for a substring of the same length as the longest substring already found, the inner While iterates at least once more before terminating. In these situations nextLen needs to be decremented by 1.
findLongestDupe = Function({inString,charSet},{parts,nextLen,nextPos=1,nextWord,longestDupes={""}},
	parts = Items(inString,"");
	nextLen = Floor(Log(Length(inString))/Log(N Items(charSet)));
	While(nextPos <= Length(inString)-nextLen+1,
		nextWord = Concat Items(parts[nextPos::(nextPos+nextLen-1)],"");
		While(Contains(inString,nextWord,nextPos+1),
			nextLen++;
			nextWord = Concat Items(parts[nextPos::(nextPos+nextLen-1)],"");
		);
		If(Length(nextWord)>Length(longestDupes[1]),
			nextLen--; //nextLen is one more than the longest duplicate length when the search fails
			nextWord = Concat Items(parts[nextPos::(nextPos+nextLen-1)],"");
			If(Length(nextWord)==Length(longestDupes[1]),
				Insert Into(longestDupes,nextWord), //another duplicate of the current longest length
				longestDupes = Eval List({nextWord}) //a strictly longer duplicate replaces the list
			);
		);
		nextPos++;
	);
	longestDupes;
);
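A call might look like the following. This is only a hypothetical usage sketch; covidStr is a stand-in for however you load the genome into a string, and the file name is an example.
//hypothetical usage; e.g., covidStr = Load Text File("covid19 genome.txt");
dnaSet = {"A", "C", "G", "T"};
longest = findLongestDupe(covidStr, dnaSet);
Show(longest, Length(longest[1])); //the longest duplicated substring(s) and their length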
As with @Craige_Hales' code posted in the previous blog entry, I tried implementing the above approach using pattern matching. Surprisingly, the results took five to six times longer. I’m not sure exactly why, but my hunch is that (as illustrated last time) Contains is much faster when the duplicate substring occurs less frequently.
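For reference, a minimal sketch of that substitution (not the timed code): only the duplicate test in the inner While changes, with Substr limiting the search to the remainder of the string.
//sketch only: Pat Match returns 1 when the literal string nextWord
//occurs anywhere in inString after position nextPos
While(Pat Match(Substr(inString, nextPos + 1), nextWord),
	nextLen++;
	nextWord = Concat Items(parts[nextPos::(nextPos+nextLen-1)],"");
);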
We could make things go faster by taking advantage of JMP’s built-in duplicate finder, the Summary table platform. An advantage of this approach is that we can start with any substring length. How should you build the substrings? Two solutions can be found in last month’s blog post. The first atomizes the input string into characters and iterates over it:
parts = Items(inString,"");
nInds = Length(inString) - subStringLen + 1;
stringList = {};
For(i=1,i<=nInds,i++,
	Insert Into(stringList,Concat Items(parts[i::(i + subStringLen - 1)],""))
);
where inString is the input string and subStringLen is the length of the substrings being built. A non-iterative approach can be implemented using expressions:
parts = Items(inString,"");
nInds = Length(inString) - subStringLen + 1;
stringList = {};
i=1;
//every element holds the same unevaluated expression; Eval List then evaluates
//them in order, with i++ advancing the start position one character at a time
stringList[1::nInds] = Expr(Concat Items(parts[i::(i++ + subStringLen - 1)],""));
stringList = Eval List(stringList);
or pattern matching, where Pat Test records each match and then returns 0 so the match fails and the scan slides forward one character to capture the next substring:
stringList = {};
Pat Match(inString, Pat Len(subStringLen) >> nextWord + Pat Test(Insert Into(stringList,nextWord); 0));
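As a small sanity check with hypothetical inputs, all three constructions produce the same list:
//with, for example,
inString = "GATTACA";
subStringLen = 3;
//each of the three constructions above yields
//stringList = {"GAT", "ATT", "TTA", "TAC", "ACA"}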
To test the speed of each approach, I built substrings of size 4, 9, 15, and 21 from an input string 1,000,000 characters long generated at random from a four-character set. Ten iterations were run at each length. The results below show the superiority of pattern matching. Complete results can be found in Test Build Substrings.JMP.
Algorithm          Construction Time (sec) by Substring Length
                   4        9        15       21
Expressions        6.91     7.35     8.19     9.01
Loop               2.56     2.98     3.62     4.26
Pattern Matching   1.09     1.12     1.10     1.11
An ANOVA showed statistically significant differences (p << 0.001) for Algorithm, Substring Length, and the Algorithm x Substring Length interaction. Post hoc testing using Tukey’s HSD indicated no statistical difference among the substring length levels for the Pattern Matching algorithm. To see how this method performs with a larger character set, a test was run on a 1,000,000-character string built from 26 randomly selected characters. Twenty repetitions were run at the same substring lengths as above, plus 2- and 40-character substrings. While there were statistical differences, the practical differences were small.
Substring Length   Time (sec)
2                  1.11
4                  1.08
9                  1.09
15                 1.11
24                 1.13
40                 1.53
Putting substring construction and duplicate finding together might look something like the code below. Knowing the rows associated with the duplicates lets us identify their locations in the string. This will work to our advantage a little later. Making the tables Private speeds up the operations slightly.
stringList = {};
Pat Match(inString, Pat Len(subStringLen) >> nextWord + Pat Test(Insert Into(stringList,nextWord); 0));
//row i of tmpTbl holds the substring that starts at position i of inString
tmpTbl = New Table("tmp",New Column("tmp",Character,Values(stringList)),Private);
sumTbl = tmpTbl << Summary(Group(:tmp),Private);
//summary rows with N Rows > 1 are the duplicated substrings
dupeLocs = sumTbl << Get Rows Where(:N Rows > 1);
dupes = Column(sumTbl,1)[dupeLocs];
Close(tmpTbl,No Save);
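Because row i of the temporary table holds the substring that starts at character i of inString, the start positions of any particular duplicate can be read straight off that table. A small sketch, assuming it is run before the Close(tmpTbl, No Save) line:
//illustrative only: the row numbers returned are the start positions in inString
//of the first duplicated substring found above
dupeStarts = tmpTbl << Get Rows Where(:tmp == dupes[1]);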
The table below gives the average run time for a single iteration, based on 10 runs, using the Covid-19 and 1,000,000-character strings. Given the results, it’s easy to see that using this approach exclusively, without substring size optimization, will be problematic as strings and substrings grow larger.
Substring Length   Covid-19 (sec)   1M (sec)
2                  0.0517            1.68
4                  0.0504            4.25
9                  0.1837            8.29
15                 0.2226            8.63
21                 0.2209            8.85
40                 0.2957           12.30
Even with starting substring size optimization, we can’t predict how much or how little structure the input data will have. A better approach would be to employ a hybrid strategy: use Summary to identify duplicates of a given size, then, using these duplicates alone, find the longest duplicates that can be built from them. This will work, since a substring of the longest duplicate(s) must appear in the set of duplicates created from a shorter substring length. There are several possible approaches. I chose an approach that is analogous to, but different from, the final method to be discussed. To be honest, I saw the final method prior to coming up with this approach and was, no doubt, highly influenced by it. The challenge of the approach I chose was to create a data structure to keep track of the original substring positions. The code is surprisingly short, about 50 lines with generous spacing. It takes about 0.32 seconds to find all the duplicates in the Covid-19 data and 13 seconds for the 1,000,000-character string. The code can be found in findLongestDupe (Summary).JSL.
Summary is used to find the duplicates on each iteration. While this is slow for large tables, we only need to build tables as large as the number of duplicates from the previous step. Starting with a substring length one character beyond the minimum length that guarantees duplicates, the total number of duplicates is likely to decrease, if not immediately, then very quickly. If the number of rows in the first few iterations isn’t too large, the method can be used without further optimization, such as starting with longer substrings; an advantage of this method is that such additional optimization can still be applied when it is needed. The indexing used to keep track of the initial string positions works much as an associative array would, which is what the final method uses. Start by finding the locations of the duplicated strings of the initial length and save them to a temporary array. Build strings one character longer starting from these locations, making sure to remove any that would extend beyond the end of the input string. Find the locations of the duplicated strings in this new list. Because these new indices point into the previously saved array, they link each surviving string back to its original location; keep only the elements they point to. Replace the previous array with this one. Continue until there are no duplicates left, at which point the previous array contains the locations of the longest duplicates.
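The posted script is the working version; the fragment below is only a rough sketch of a single growth step, using illustrative names (locs for the saved start positions, len for the current substring length), to make the index bookkeeping concrete.
//one growth step, sketched: extend each surviving position by one character,
//drop positions that would run past the end of the string, then let Summary
//report which of the longer substrings are still duplicated
keep = {};
For(i = 1, i <= N Items(locs), i++,
	If(locs[i] + len <= Length(inString), Insert Into(keep, locs[i]))
);
newStrings = {};
For(i = 1, i <= N Items(keep), i++,
	Insert Into(newStrings, Substr(inString, keep[i], len + 1))
);
tmpTbl = New Table("tmp", New Column("tmp", Character, Values(newStrings)), Private);
sumTbl = tmpTbl << Summary(Group(:tmp), Private);
dupeStrings = Column(sumTbl, 1)[sumTbl << Get Rows Where(:N Rows > 1)];
Close(sumTbl, No Save);
Close(tmpTbl, No Save);
//newStrings lines up element for element with keep, so the positions of the
//survivors carry straight over to the next iteration
newLocs = {};
For(i = 1, i <= N Items(keep), i++,
	If(Contains(dupeStrings, newStrings[i]), Insert Into(newLocs, keep[i]))
);
locs = newLocs;
len++;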
As mentioned above, the final method is similar in flavor to the previous one. The idea for this approach is thanks to @brady_brady. There are a few important differences. First, it builds all substrings from one character up. While this sounds inefficient, by using Associative Arrays and by indexing the individual string characters as matrices, it eliminates searching the string directly and never builds character combinations that don’t exist. In general, it outperforms the previous method when the character set and input string get big. Ten runs for various input string sizes (50K – 2.5M) and character set sizes (4 to 72) were made to compare the two approaches. Mean run times are shown below for the Summary method (1) and the Associative Array method (2). Complete results can be found in Compare fastest methods.JMP.
Char Set   Mean Run Time (sec); 1 = Summary, 2 = Associative Array
Size       50K             100K            500K            1M              2.5M
           1       2       1       2       1       2       1       2       1       2
4          0.5087  0.5814  0.9081  1.2664  5.421   6.813   13.412  13.778  32.23   35.67
10         0.2724  0.4117  0.6251  0.9207  2.850   4.509   6.998    9.503  13.61   24.70
26         0.3098  0.3748  0.6786  0.7441  3.625   3.981   7.363    8.537  21.68   19.58
52         0.4247  0.3265  1.0617  0.7682  4.246   4.046   9.522    7.395  30.67   18.08
72         0.3186  0.3309  0.7308  0.7079  3.533   4.236   7.192    8.917  20.14   18.97
As with the Summary method, this approach starts by turning the input string into single characters. Rather than working with the characters directly, however, their numeric positions in the character set are used. This allows the substring building to be done with matrix operations, which tend to be faster than string operations. Specifically, the first iteration takes a matrix representation of each character set position as a key to an associative array (e.g., [3] for G) and uses the vector of that character’s locations in the input string as its value. Each subsequent iteration starts by building a new associative array, finding the input string values at 1 plus the location vector for every key in the previous array, being careful not to go beyond the last element of the input string. The value corresponding to the first element of the incremented vector is used to search the rest of the incremented vector for duplicates. If they exist, the positions they are associated with become the value of the new array element. The key for this element is built by concatenating the character set value of the duplicated element onto the original key, making it easy to recreate the character string. The search value and its duplicates, if any, are then removed from the vector, and the search continues until no values are left. This building process is repeated for each element of the previous array. Once completed, the new associative array replaces the old one and the next iteration starts. This continues until there are no new keys with which to replace the previous array; the previous array then contains the longest duplicate(s).
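Brady’s script linked below is the reference; the fragment here is only a hedged sketch of the setup step — encoding the string numerically and seeding the first associative array — using illustrative names (numStr for the encoded string, aa for the array).
//sketch of the first iteration only: each character is replaced by its position in
//charSet, and a 1 x 1 matrix key (e.g., [3] for G) maps to the vector of locations
//where that character occurs; single characters that never repeat are dropped
charSet = {"A", "C", "G", "T"};
parts = Items(inString, "");
numStr = J(N Items(parts), 1, 0);
For(i = 1, i <= N Items(parts), i++,
	numStr[i] = Contains(charSet, parts[i])
);
aa = Associative Array();
For(c = 1, c <= N Items(charSet), c++,
	locs = Loc(numStr == c);
	If(N Rows(locs) > 1, aa[J(1, 1, c)] = locs)
);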
Brady’s code can be found in findLongestDupe (AA).JSL. I have made a minor adjustment to allow for different character sets to be used.
Next time we will move from strings to lists. As mentioned in the previous blog, avoiding iteration has been the topic of two recent Community posts involving turning numeric elements to character and list indexing. Two challenges for next time, based on questions that came up in the linked posts. In both cases, you are converting numeric values to character strings.
- Write a function that converts a list of n lists of numbers (each with mᵢ elements) to character strings. Don’t assume mᵢ is the same for all lists.
- Write a function that can transpose the list from item 1. Here’s a small example:
inputList = {{12, 11, 16, 1, 15, ., 23, 8, 16, 17},
	{1, 10, ., 4, 25, ., 19, 21, ., 3},
	{11, 21, 22, 7, 3, ., 24, 10, 21, 20},
	{., ., ., ., ., ., ., ., ., .},
	{11, ., 3, ., 9, ., 23, ., 3, 16}};
outputList = {{12, 1, 11, ., 11}, {11, 10, 21, ., .}, {16, ., 22, ., 3}, {1, 4, 7, ., .},
	{15, 25, 3, ., 9}, {., ., ., ., .}, {23, 19, 24, ., 23}, {8, 21, 10, ., .},
	{16, ., 21, ., 3}, {17, 3, 20, ., 16}};
Good luck!
Compare fastest methods.jmp
Test Build Substrings.jmp
findLongestDupe (Summary).jsl