cancel
Showing results for 
Show  only  | Search instead for 
Did you mean: 
The Discovery Summit 2025 Call for Content is open! Submit an abstract today to present at our premier analytics conference.
See how to use JMP Live to centralize and share reports within groups. Webinar with Q&A April 4, 2pm ET.
Choose Language Hide Translation Bar
View Original Published Thread

JMP Challenge 7

With lots going since just after the last post, including getting ready for JMP 16, which came out only a few days ago, and with no replies to my last post asking for column formula challenges that readers had encountered, I didn’t think I had the time or material to post something worth your time reading.

I’ve had a bit more time as of late, and it’s time for a new challenge. We’ll table the column formula challenges for another time and start with a new challenge and new approach. For this post, and likely a few after that, I’ll talk about a general question based on a Craige Hales response to a community post. Craige (@Craige_Hales), despite having retired from JMP Development, remains very active here in the Community, answering user questions, maintaining his Uncharted blog, and contributing to this blog. The question was about the efficiency of transposing a (multidimensional) list without using loops. I want to pose a more general question that I’ve often wondered about: “Are there better alternatives to iteration and, if so, what are they?” In most cases better means faster, but I’ll also consider code compactness if it doesn’t compromise speed too much. We’ll also take a look at different approaches to iteration to see if there are differences there, as well. Let’s start with searching text.

Strings

A common task in bioinformatics is to search a string containing genetic information to find specific substring patterns. For DNA, the string characters are comprised of only four letters: A, T, C, and G. The file Covid19.TXT (attached to this post) contains the genetic sequence for the Covid19 genome, approximately 30,000 characters long (in contrast, human chromosome 1, the longest, is approximately 290 million characters long). How quickly can it be searched to find the location of all occurrences of certain substrings from 2 to 10 characters in length?

Let’s look at four approaches, three of them iterative. Regardless of technique, be aware that substring matches can overlap. For example, BAB is contained twice in BABAB, from positions 1 to 3 and again from 3 to 5. We need something that will capture both of these occurrences. We’ll start with Substr using the code below, where inString is the string to be searched and inSeq is the substring to search for.

	findSequence = Function({inString,inSeq},{i,seqLen,checkTo,outVec},
		seqLen  = Length(inSeq);
		checkTo = Length(covid19)-seqLen+1; //Only need to go until the substring includes the last character
		outVec  = J(checkTo,1,0);
		For(i=1,i<=checkTo,i++,
			outVec[i] = Substr(inString,i,seqLen) == inSeq;
		);
		Loc(outVec);
	);

This was the least impressive of the approaches, taking an average of 1.2 seconds, regardless of substring length. Results are shown below (in microseconds). The test generated 10 random substrings for each substring length and repeated the search 20 times. Values were summarized over all substrings of a given length. Complete details for individual substrings are in Covid 19 Results Full.JMP attached to this post. The code used to generate the test is in Test Covid19 Search.JSL.

Substring Length Substr Index Contains PatMatch
2 1,227,342 60,312 79,640 4,046
3 1,245,909 62,650 22,143 3,103
4 1,215,641 66,045 4,938 2,881
5 1,218,983 72,337 1,293 2,880
6 1,220,846 76,705 357 2,956
7 1,249,843 77,127 234 3,010
8 1,212,982 84,413 132 3,078
9 1,227,892 87,479 113 3,331
10 1,208,948 89,732 106 3,742

We have seen in previous challenges that Substr is slow. What if we were to “atomize” the string, break it into individual characters and look at character ranges? The motivation for this is on the order of directly accessing string values via indexing as you can in a language like C. The atomization part is needed in JSL since you can’t index a string directly. The code I used is here:

	findSequence = Function({inString,inSeq},{i,seqLen,checkTo,outVec,inList,seqList},
		seqLen  = Length(inSeq);		
		checkTo = Length(inString)-Length(inSeq)+1; //Only need to go until the substring includes the last character
		outVec  = J(checkTo,1,0);
		inList  = Items(inString,"");
		seqList = Items(inSeq,"");
		
		For(i=1,i<=checkTo,i++,
			outVec[i] = inList[i::(i+seqLen-1)] == seqList;
		);
		Loc(outVec);
	);

While this approach sounds like it would be more time-consuming since there is the initial overhead of breaking up the searched string in addition to searching it, it’s about 13-19X faster as seen in the result table. Unlike Substr, however, search time increases as a function of substring length.

The final iterative approach uses Contains. The idea is to find the given substring, mark its location in the searched string, and repeat the process starting one character past the location until the substring can’t be found. The appeal of this method is that searching becomes very fast when there are few occurrences of the substring in the searched string. The code below is thanks to Brady Brady (@brady_brady

	findSequence = Function({inString,inSeq},{pos=0,ans={}},
		while(1,
			 pos = contains(inString, inSeq, pos);
			 if(pos, insertinto(ans, pos++), break())
		);
		ans;
	);

As you can see, while this approach starts slightly slower than the indexing method for two‑character substrings, it rapidly gets faster as the substrings get longer, i.e., as there are fewer substrings to find.

The final approach uses what is becoming one of my favorite JSL tools, pattern matching. Not only is the code fast, but it is short.

	findSequence = Function({inString,inSeq},{i=1,locs={}},
		Pat Match(inString,Pat Pos()>>r + inSeq + Pat Test(Insert Into(locs,r+1);0));
		locs;
	);

While it is slightly slower for the two-character substrings, it doesn’t seem to be affected by the length of the substring for three characters and larger.

Clearly, the iterative approach using Contains and pattern matching approach are superior methods for searching long strings. But which method, and when? Since the testing was done using different substrings, some indication of how substring frequency (# occurrences*1000/29,903) may affect search time. The plot below shows a crossover point somewhere around a frequency of 2 substrings per 1,000 characters. Because all testing was done on the same covid19 genome string, however, it is unclear how search time is affected by searched string length.

Covid19 Results Full - Graph Builder.png

To find the answers to these questions, and to explore possible interactions between factors, I ran an experiment using a space-filling design, varying searched string length (10,000 – 500,000 characters), substring length (2 – 21 characters), and frequency (0.75 – 3 substrings per 1,000 characters). 125 runs were generated using a Fast Flexible design with fractional string length values rounded. For each run, a random string of alphabetical characters was generated. Within that string, a substring made up of randomly generated numbers was inserted at random locations. Because frequency values could lead to a non-integer number of insertions, the number of insertions was taken as the ceiling of the frequency times the searched string length divided by 1,000. Each run was repeated 10 times on the same searched string/substring combination. The code used to generate the runs is in Execute Design Script.JSL and results are in Space Filling Results 1.JMP attached to this post.

A quick inspection of the results shows only 13 observations where the Contains method outperforms pattern matching. Of these, the largest difference is about 4,000 microseconds faster (whereas pattern matching was as much as 0.72 seconds faster than Contains). Fitting a Gaussian Process model to the data suggests that searched string length had the largest effect with a strong frequency by searched string length interaction. Substring length had little effect. A linear model was initially fit to the data, but residuals analysis suggested that the fit might be questionable. I have posted an interactive version of the Profiler for this and the follow up models to JMP Public.

A second space-filling design was created to get better representation of cases where Contains outperformed pattern matching. For this experiment, the ranges for both string lengths remained as before and frequency was varied between 0 and 1. The results are in Space Filling Results 2.JMP. The direction of the results was similar to before with longer searched string lengths and higher frequencies favoring pattern matching.

Combining both sets of results and fitting a Gaussian Process model suggests that for strings of 50,000 characters, Contains is superior if frequencies are below 2. At 100,000 characters, the frequency drops to 1 and at 200,000 characters, below 0.5. The conclusion from this is that for sparser shorter strings, use Contains and as string length and density increases, pattern matching will be superior.

Challenge for the month

Using the Covid19 genome, find the longest substring that appears two or more times.

Rather than leaving this unanswered until the next post, let’s discuss and share. I believe this is a very hard problem, and we will benefit from open discussion.  

Comments
brady_brady
Staff

The Contains approach can be written more succintly as below. As a comparison is removed, this also has the benefit of running a bit faster, most noticeably when the substring does not exist within the main string.

 

findSequence = Function({txt, sub},{pos=0,ans={}},
	while(pos = contains(txt, sub, pos), insertinto(ans, pos++));
	ans;
);
Craige_Hales
Super User

Including overlaps? That makes it easy. Not sure how I'd do it excluding the overlaps though. This solution is a 5-10 second answer for the supplied data. And it would get painful for much longer data with much longer substrings.

The idea is to count how many of each substring occur for different substring lengths, at every position 1...length of text. Then ask "are there any counts > 1?" Use a binary search on the substring length to find the biggest substring length that produces duplicates. A little more work should be done at the end to see if there is another, more interesting, substring of the same length.

txt = Load Text File( "z:/covid-19.txt" );

// finding the longest duplicated substring, including overlaps.
// if there is one, then there will be shorter ones of every possible
// shorter length. So, as the possible length of the substring grows,
// there is a boundary at some length where all shorter lengths have
// duplicates and no longer lengths have duplicates. find the boundary.

// do a binary search for the longest string that occurs 2 or more times
// testlength is the length during a pass

// see https://en.wikipedia.org/wiki/Binary_search_algorithm#Procedure
// for a *real* binary search. this code is potentially wrong and very
// poorly tested!

// Changing the maxlength guess below from 1...1000 will produce different times.
// 20 is a good starting point, bigger than 200 or so is very slow because 
// length(txt) * maxlength strings are being manipulated, a lot.

start = Tick Seconds();

minlength = 1;
maxlength = 27; // initial guess, will (slowly) auto-adjust bigger as needed
firstpass = 1; // needed to detect guess too small
minkey = .;
minval = .;
while( 1,
	testlength = if( firstpass == 1,
		Min( Length( txt ), maxlength )// this binary search starts at the big end, then splits on subsequent passes. This lets code
	, // else
		Floor( (minlength + maxlength) / 2 )// below test that the big end is big enough on the first pass.
	);
	Write( "testing strings of length ", testlength, "\!n" );
	Wait( 0 );
	aa = [=> 0];

	patmatch( txt,
		Pat Len( testlength ) >> sequence + //
		pattest( // build the associative array of key=sequence, val=count
			aa[sequence]++;
			// usually I use 1; on the next line. that always succeeds.
			// here, I want pattest to make the matcher back up and retry...
			0; // fail! try again at next position
		),
		NULL, // unused positional replacement
		FULLSCAN // needed to bypass a heuristic and get the last little bit
	);
	keys = aa << getkeys;
	vals = aa << getvalues;
	idx = Loc Max( Matrix( vals ) );
	bigN = vals[idx];
	//
	// the binary search is for the biggest testlength that has bigN > 1.
	// bigN is the maximum repeats of a string of testlength characters.
	//
	if( bigN == 1,
		maxlength = testlength;// maxlength promises to be 1-too-big at the end
	, //else
		minlength = testlength;
		Write( " found some, minlength=", minlength, " maxlength=", maxlength, "\!n" );
		minkey = keys[idx];
		minval = vals[idx];
		if( firstpass == 1, // looking for a big-enough high end, still
			maxlength = Min( 2 * maxlength, Length( txt ) );
			Write( "Increasing max, minlength=", minlength, " maxlength=", maxlength, "\!n" );
			Continue();
		);
	);
	firstpass = 0;
	if( minlength >= maxlength - 1,
		if( Is Missing( minkey ),
			Continue();// needs one more pass
		, // else
			Break();// got it
		)
	);
);

Write( "\!n", minkey, " occurs ", minval, " times at " );
patmatch(
	txt,
	Pat Pos() >> location + //
	minkey + //
	pattest(
		Write( location + 1, " " );
		1;
	) + //
	Pat Fail() // this will also retry the match, 1 character further along
);
Write( "\!n" );

stop = Tick Seconds();
Show( stop - start );
Craige_Hales
Super User

I think the best answer should be at least twice as fast as above. That JSL throws away a lot of information about positions that have already been shown to have no duplicates of some length (and, of course, no duplicates of longer length.)

ian_jmp
Level X

Please find another approach that may or may not be better under some circumstances (I didn't have the time to check):

NamesDefaultToHere(1);

str = LoadTextFile("$DOWNLOADS/covid-19.txt");
txt = "AACA";

bStr = CharToBlob(str);
bTxt = CharToBlob(txt);

pos = {};
for(i=1, i<=(Length(str) - Length(txt)), i++,
	if(BlobPeek(bStr, i, Length(txt)) == bTxt, InsertInto(pos, i+1));
	);
Print(NItems(pos));