cancel
Showing results for 
Show  only  | Search instead for 
Did you mean: 
Try the Materials Informatics Toolkit, which is designed to easily handle SMILES data. This and other helpful add-ins are available in the JMP® Marketplace
Choose Language Hide Translation Bar
JMP Challenge 7

With lots going since just after the last post, including getting ready for JMP 16, which came out only a few days ago, and with no replies to my last post asking for column formula challenges that readers had encountered, I didn’t think I had the time or material to post something worth your time reading.

I’ve had a bit more time as of late, and it’s time for a new challenge. We’ll table the column formula challenges for another time and start with a new challenge and new approach. For this post, and likely a few after that, I’ll talk about a general question based on a Craige Hales response to a community post. Craige (@Craige_Hales), despite having retired from JMP Development, remains very active here in the Community, answering user questions, maintaining his Uncharted blog, and contributing to this blog. The question was about the efficiency of transposing a (multidimensional) list without using loops. I want to pose a more general question that I’ve often wondered about: “Are there better alternatives to iteration and, if so, what are they?” In most cases better means faster, but I’ll also consider code compactness if it doesn’t compromise speed too much. We’ll also take a look at different approaches to iteration to see if there are differences there, as well. Let’s start with searching text.

Strings

A common task in bioinformatics is to search a string containing genetic information to find specific substring patterns. For DNA, the string characters are comprised of only four letters: A, T, C, and G. The file Covid19.TXT (attached to this post) contains the genetic sequence for the Covid19 genome, approximately 30,000 characters long (in contrast, human chromosome 1, the longest, is approximately 290 million characters long). How quickly can it be searched to find the location of all occurrences of certain substrings from 2 to 10 characters in length?

Let’s look at four approaches, three of them iterative. Regardless of technique, be aware that substring matches can overlap. For example, BAB is contained twice in BABAB, from positions 1 to 3 and again from 3 to 5. We need something that will capture both of these occurrences. We’ll start with Substr using the code below, where inString is the string to be searched and inSeq is the substring to search for.

	findSequence = Function({inString,inSeq},{i,seqLen,checkTo,outVec},
		seqLen  = Length(inSeq);
		checkTo = Length(covid19)-seqLen+1; //Only need to go until the substring includes the last character
		outVec  = J(checkTo,1,0);
		For(i=1,i<=checkTo,i++,
			outVec[i] = Substr(inString,i,seqLen) == inSeq;
		);
		Loc(outVec);
	);

This was the least impressive of the approaches, taking an average of 1.2 seconds, regardless of substring length. Results are shown below (in microseconds). The test generated 10 random substrings for each substring length and repeated the search 20 times. Values were summarized over all substrings of a given length. Complete details for individual substrings are in Covid 19 Results Full.JMP attached to this post. The code used to generate the test is in Test Covid19 Search.JSL.

Substring Length Substr Index Contains PatMatch
2 1,227,342 60,312 79,640 4,046
3 1,245,909 62,650 22,143 3,103
4 1,215,641 66,045 4,938 2,881
5 1,218,983 72,337 1,293 2,880
6 1,220,846 76,705 357 2,956
7 1,249,843 77,127 234 3,010
8 1,212,982 84,413 132 3,078
9 1,227,892 87,479 113 3,331
10 1,208,948 89,732 106 3,742

We have seen in previous challenges that Substr is slow. What if we were to “atomize” the string, break it into individual characters and look at character ranges? The motivation for this is on the order of directly accessing string values via indexing as you can in a language like C. The atomization part is needed in JSL since you can’t index a string directly. The code I used is here:

	findSequence = Function({inString,inSeq},{i,seqLen,checkTo,outVec,inList,seqList},
		seqLen  = Length(inSeq);		
		checkTo = Length(inString)-Length(inSeq)+1; //Only need to go until the substring includes the last character
		outVec  = J(checkTo,1,0);
		inList  = Items(inString,"");
		seqList = Items(inSeq,"");
		
		For(i=1,i<=checkTo,i++,
			outVec[i] = inList[i::(i+seqLen-1)] == seqList;
		);
		Loc(outVec);
	);

While this approach sounds like it would be more time-consuming since there is the initial overhead of breaking up the searched string in addition to searching it, it’s about 13-19X faster as seen in the result table. Unlike Substr, however, search time increases as a function of substring length.

The final iterative approach uses Contains. The idea is to find the given substring, mark its location in the searched string, and repeat the process starting one character past the location until the substring can’t be found. The appeal of this method is that searching becomes very fast when there are few occurrences of the substring in the searched string. The code below is thanks to Brady Brady (@brady_brady

	findSequence = Function({inString,inSeq},{pos=0,ans={}},
		while(1,
			 pos = contains(inString, inSeq, pos);
			 if(pos, insertinto(ans, pos++), break())
		);
		ans;
	);

As you can see, while this approach starts slightly slower than the indexing method for two‑character substrings, it rapidly gets faster as the substrings get longer, i.e., as there are fewer substrings to find.

The final approach uses what is becoming one of my favorite JSL tools, pattern matching. Not only is the code fast, but it is short.

	findSequence = Function({inString,inSeq},{i=1,locs={}},
		Pat Match(inString,Pat Pos()>>r + inSeq + Pat Test(Insert Into(locs,r+1);0));
		locs;
	);

While it is slightly slower for the two-character substrings, it doesn’t seem to be affected by the length of the substring for three characters and larger.

Clearly, the iterative approach using Contains and pattern matching approach are superior methods for searching long strings. But which method, and when? Since the testing was done using different substrings, some indication of how substring frequency (# occurrences*1000/29,903) may affect search time. The plot below shows a crossover point somewhere around a frequency of 2 substrings per 1,000 characters. Because all testing was done on the same covid19 genome string, however, it is unclear how search time is affected by searched string length.

Covid19 Results Full - Graph Builder.png

To find the answers to these questions, and to explore possible interactions between factors, I ran an experiment using a space-filling design, varying searched string length (10,000 – 500,000 characters), substring length (2 – 21 characters), and frequency (0.75 – 3 substrings per 1,000 characters). 125 runs were generated using a Fast Flexible design with fractional string length values rounded. For each run, a random string of alphabetical characters was generated. Within that string, a substring made up of randomly generated numbers was inserted at random locations. Because frequency values could lead to a non-integer number of insertions, the number of insertions was taken as the ceiling of the frequency times the searched string length divided by 1,000. Each run was repeated 10 times on the same searched string/substring combination. The code used to generate the runs is in Execute Design Script.JSL and results are in Space Filling Results 1.JMP attached to this post.

A quick inspection of the results shows only 13 observations where the Contains method outperforms pattern matching. Of these, the largest difference is about 4,000 microseconds faster (whereas pattern matching was as much as 0.72 seconds faster than Contains). Fitting a Gaussian Process model to the data suggests that searched string length had the largest effect with a strong frequency by searched string length interaction. Substring length had little effect. A linear model was initially fit to the data, but residuals analysis suggested that the fit might be questionable. I have posted an interactive version of the Profiler for this and the follow up models to JMP Public.

A second space-filling design was created to get better representation of cases where Contains outperformed pattern matching. For this experiment, the ranges for both string lengths remained as before and frequency was varied between 0 and 1. The results are in Space Filling Results 2.JMP. The direction of the results was similar to before with longer searched string lengths and higher frequencies favoring pattern matching.

Combining both sets of results and fitting a Gaussian Process model suggests that for strings of 50,000 characters, Contains is superior if frequencies are below 2. At 100,000 characters, the frequency drops to 1 and at 200,000 characters, below 0.5. The conclusion from this is that for sparser shorter strings, use Contains and as string length and density increases, pattern matching will be superior.

Challenge for the month

Using the Covid19 genome, find the longest substring that appears two or more times.

Rather than leaving this unanswered until the next post, let’s discuss and share. I believe this is a very hard problem, and we will benefit from open discussion.  

Last Modified: Mar 29, 2021 11:29 AM
Comments