Count the number of Ts in a sequence

Created:
Nov 28, 2018 2:48 PM
| Last Modified: Nov 30, 2018 7:46 AM
(6228 views)

Hello,

I am trying to find a formula that counts the highest number of times a letter is repeated consecutively in a sequence. I have attached an example in which I am trying to write a formula for the poly Ts column that returns the length of the longest consecutive run of Ts in the sequence.

Thank you,

Pratish

Accepted Solutions

Created:
Nov 29, 2018 1:50 AM
| Last Modified: Nov 29, 2018 1:51 AM
(6199 views)
| Posted in reply to message from padhikari 11-28-2018

A regular expression expert might have a nice pattern to scan and find all matches, but that is beyond my REGEX skills.

I have provided two solutions that can be done using column formulas. Both might need some explanation.

The first uses nested character functions; the second uses the ShortestEditScript() function. By the way, you did not specify whether you are counting only the T sequences prior to (N1), so both solutions use the entire string. The example table is attached, and explanations are below.

Assume s2 = "(N1:25252525)AACCAA(N1)GACGTTAACAGTTCTTTG";

Character functions: Words(), Sort List(), Reverse(), list[n], Length()

`Length(Reverse(Sort List(words(s2,"ACGN():0123456789")))[1]);`

Here is the log output for the respective functions:

```
//:*/
words(s2,"ACGN():0123456789")
/*:
{"TT", "TT", "TTT"}
//:*/
Reverse(Sort List(words(s2,"ACGN():0123456789")))
/*:
{"TTT", "TT", "TT"}
//:*/
Reverse(Sort List(words(s2,"ACGN():0123456789")))[1]
/*:
"TTT"
//:*/
Length(Reverse(Sort List(words(s2,"ACGN():0123456789")))[1]);
/*:
3
```
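
The one-liner above relies on two things: Sort List() orders all-T strings by length (a shorter run sorts before a longer one), and Words() returns at least one item. A minimal sketch of a variant that does not depend on either is shown here; the names `runs` and `maxT` are illustrative and not part of the attached table. It uses the same delimiter string, so it assumes the sequence contains no characters other than those delimiters and T.

```jsl
Names Default To Here( 1 );
s2 = "(N1:25252525)AACCAA(N1)GACGTTAACAGTTCTTTG";

// Every character other than T is treated as a delimiter,
// so the list holds only the runs of T
runs = Words( s2, "ACGN():0123456789" );

// Track the longest run; maxT stays 0 when the list is empty (no T at all)
maxT = 0;
For( i = 1, i <= N Items( runs ), i++,
    maxT = Maximum( maxT, Length( runs[i] ) )
);
Show( maxT );
```

For the s2 above, the list is {"TT", "TT", "TTT"}, so the loop ends with maxT equal to 3.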

ShortestEditScript() is an interesting function. The script **9_Extra_ShortestEditDistance.jsl**, written for __JSL Companion, Applications of the JMP Scripting Language Second Edition__, documents four different methods for using this powerful and useful function. For this example, I am using Strings() and requesting matrix output. It would take too much space to document this completely in this forum, so I will just show the results and add a few comments.

```
msed = Shortest Edit Script( Strings(s2, Repeat("T",length(s2)),matrix(1)) );
maximum(msed[loc(msed[0,1]==0),4]);
```

The two strings being compared are s2 and a string of all T's created by function Repeat("T", length(s2) ).

```
msed = Shortest Edit Script( Strings(s2, Repeat("T",length(s2)),matrix(1)) );
/*:
[-1 1 . 27,
0 28 1 2,
-1 30 . 5,
0 35 3 2,
-1 37 . 1,
0 38 5 3,
-1 41 . 1,
1 . 8 34]
/* The matrix output n x 4 where n = nrow(msed)
Column1: -1 | 1| 0 -1-->remove, 1-->insert, 0-->common
Column2: position in the 1st string .-->missing / not found
Column3: position in the 2nd string .-->missing / not found
Column4: length
*/
```

So now it is a matter of finding the rows whose 1st column is 0 (matches/common/T's), which can be done with the loc() function. The length of each matching sequence is in the 4th column, so just take the maximum of those lengths. Note that msed[0,1] represents the 1st column of the matrix msed.

`loc(msed[0,1]==0)`

/*:

[2, 4, 6]

//:*/

msed[loc(msed[0,1]==0),4]

/*:

[2, 2, 3]

//:*/

maximum(msed[loc(msed[0,1]==0),4]);

/*:

3
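
The extraction steps can also be tried without calling Shortest Edit Script() at all. The sketch below starts from the matrix printed in the log above, entered as a literal, and adds a guard for the case where no row is marked as common. The guard and the names `rows` and `maxT` are assumptions added for illustration, not part of the attached table.

```jsl
Names Default To Here( 1 );

// Matrix copied from the log output above
msed = [-1 1 . 27,
    0 28 1 2,
    -1 30 . 5,
    0 35 3 2,
    -1 37 . 1,
    0 38 5 3,
    -1 41 . 1,
    1 . 8 34];

// Rows whose first column is 0 are the common (matching) runs
rows = Loc( msed[0, 1] == 0 );

// Column 4 holds the run lengths; report 0 if nothing matched
maxT = If( N Row( rows ) > 0,
    Maximum( msed[rows, 4] ),
    0
);
Show( maxT );
```

With this matrix, rows is [2, 4, 6], the lengths are [2, 2, 3], and maxT is 3, matching the log output.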

It will be interesting to see other solutions.
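
One possible regular-expression variant, offered only as a hedged sketch and not as a tested column formula: replace every run of non-T characters with a single space using Regex() with GLOBALREPLACE, then split on spaces. This removes the need to list every delimiter character by hand. The names `spaced`, `runs`, and `maxT` are illustrative, and the guard assumes that Regex() returns missing when the pattern never matches.

```jsl
Names Default To Here( 1 );
s2 = "(N1:25252525)AACCAA(N1)GACGTTAACAGTTCTTTG";

// Collapse each run of non-T characters into one space
spaced = Regex( s2, "[^T]+", " ", GLOBALREPLACE );

// Assumption: Regex() returns missing when nothing matches,
// which happens only if s2 consists entirely of T's
If( Is Missing( spaced ), spaced = s2 );

// What remains between the spaces are the runs of T
runs = Words( spaced, " " );
maxT = 0;
For( i = 1, i <= N Items( runs ), i++,
    maxT = Maximum( maxT, Length( runs[i] ) )
);
Show( maxT );
```

Compared with the Words() delimiter list, the character class `[^T]` keeps working if the sequences later contain characters that were not anticipated.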

Replies

Re: Count the number of Ts in a sequence

Here's a simple brute force approach; not sure of the performance relative to @gzmorgan0's methods.

```
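s2 = "(N1:25252525)AACCAA(N1)GACGTTAACAGTTCTTTG";
len = length(s2);
// Start with a run of T's as long as the whole string
tstring = repeat("T", len);
maxlen = 0;
// Shorten the run by one character per pass until it is found in s2
for (i = len, i >= 1, i--,
    if (contains(s2, tstring),
        // tstring currently has length i, so i is the longest run
        maxlen = i;
        break();
    ,
        tstring = substr(tstring, 2);
    );
);
show(maxlen);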
```

Re: Count the number of Ts in a sequence

In the spirit of 'other solutions' here's another brute force one:

```
NamesDefaultToHere(1);
// Given a string and a single character, finds the longest sequence of that character
// and returns the length and starting position of that sequence. If the sequence
// occurs more than once, only the first is identified
findLongestRepeatedCharacter =
Function({str, char}, {Default Local},
    n = Length(str);
    count = 0;
    currentCount = 1;
    // Traverse the string except for the last character
    for (i = 1, i <= n-1, i++,
        thisChar = Substr(str, i, 1);
        nextChar = Substr(str, i+1, 1);
        // If the current character and the next are both 'char' ...
        if((thisChar == char & nextChar == char),
            // ... increment 'currentCount'
            currentCount++,
            // ... else if they're not ...
            if(currentCount > count,
                // ... record 'currentCount' if it's bigger than we've seen so far
                count = currentCount;
            );
            // ... and reset 'currentCount'
            currentCount = 1;
        );
    );
    // Record a run that reaches the end of the string; the loop above
    // only records a run when it is followed by a different character
    if(currentCount > count,
        count = currentCount;
    );
    // Build the sequence we've found
    seq = Repeat(char, count);
    // Find where it occurs
    pos = Munger(str, 1, seq);
    // Return the results
    if (pos == 0,
        EvalList({0, pos}),
        EvalList({count, pos})
    );
);
// Try it out
str = "(N1:25252525)AACCAA(N1)GACGTTAACAGTTCTTTG";
Print(findLongestRepeatedCharacter(str, "T"));
Print(findLongestRepeatedCharacter(str, "A"));
Print(findLongestRepeatedCharacter(str, "X"));
```

Re: Count the number of Ts in a sequence

The table formula using ShortestEditScript() used s2 in one part of the formula where it should have used Sequence. The table with the corrected formula is attached.