cancel
Showing results for 
Show  only  | Search instead for 
Did you mean: 
Choose Language Hide Translation Bar
SaraHorta23
Level I

How do I identify words (using formula in column) that have more than 2 characters different from my reference word?

Hi Everyone, 

 

I have a column composed by words that have combination of different characters (ie. AAAAAAAAAA - word with 10 characters). I want to identify in an adjacent column the words that have more than 2 characters different (ie. ABAAAACEAA). The position at which the letters are different doesn't matter and additionally, some words can also be shorter or longer than my reference word.

Therefore, I want to achieve a formula that would allow me to distinguish/identify words that have less/more than 2 different characters difference from my reference sequence and have the same length. I'm still a beginner in JMP and don't know if this is possible to achieve. Let me know if you have any suggestions

 

 

 

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
jthi
Super User

Re: How do I identify words (using formula in column) that have more than 2 characters different from my reference word?

Thank you for extra info! Check out Shortest Edit Script from Scripting index:

jthi_0-1644415162212.png

With that you might be able to do something, unless it is too slow. Pat Match could also maybe do this, but I haven't (yet) looked into how to use it properly.

 

editList = Shortest Edit Script(:List of sequences, "FEBRUARY");
common = "";
For(i = 1, i <= N Items(editList), i++,
	If(editList[i][1] == "Common",
		common = common || editList[i][2]
	)
);
Length(common) < Length(:List of sequences) - 2;

jthi_1-1644415221426.png

 

-Jarmo

View solution in original post

9 REPLIES 9
ian_jmp
Staff

Re: How do I identify words (using formula in column) that have more than 2 characters different from my reference word?

Take a look at the attached table, which should give you a start. The heart of it is a Formula Column (called 'Decision' and denoted by the '+' sign in the columns panel of the table). Clicking on the '+' gives:

Screenshot 2022-02-09 at 10.15.32.png

You need some knowledge of JSL to understand this, and can use 'Help > Scripting Index' to figure out exactly how it works if you need to.

 

jthi
Super User

Re: How do I identify words (using formula in column) that have more than 2 characters different from my reference word?

Many ways to do this. Here is one additional option:

Names Default To Here(1);

/*
a = "ABAAAACEAA";
b = "AAAAAAAAAA";
Show(!IsMissing(Substitute(a, Left(a,1), "")));
Show(!IsMissing(Substitute(b, Left(b,1), "")));
*/
New Table("Untitled 2",
	Add Rows(3),
	New Column("Column 1", Character, "Nominal", Set Values({"ABAAAACEAA", "AAAAAAAAAA", "BBBBAAA"})),
	New Column("Column 2",
		Numeric,
		"Continuous",
		Format("Best", 12),
		Formula(!Is Missing(Substitute(:Column 1, Left(:Column 1, 1), "")))
	)
);

Column1 is the column with characters

Formula(!Is Missing(Substitute(:Column 1, Left(:Column 1, 1), "")))

in this simple example idea is to replace all values in the string found from first position and if there are still some characters left, we know that there were more than 1 character in the string. If you have more than one value to replace, you should be able to use same idea by modifying Substitute function arguments

-Jarmo
SaraHorta23
Level I

Re: How do I identify words (using formula in column) that have more than 2 characters different from my reference word?

Hi Jthi, thanks for the suggestion.

To be honest I'm still trying to understand the formula :'). In the file I'm working on, I have only one reference word and I have to screen more than a million row entries with every possible combination of position/letter. I assume that this formula would not work because I cannot define those values in the formula. Would it be possible to adapt it considering a large database Ive?

 

 

SaraHorta23
Level I

Re: How do I identify words (using formula in column) that have more than 2 characters different from my reference word?

Hi Ian,

 

Thanks for the feedback. I tried to implement the formula you wrote but I don't get where can I define that my reference sequence is AAAAAAA and not something else. I replaced the "" in the items selection by "AAAAAAAA" but the result was not correct. Any suggestion how can I upgrade the formula you wrote?

ian_jmp
Staff

Re: How do I identify words (using formula in column) that have more than 2 characters different from my reference word?

Looks like I misunderstood what you meant by "I want to identify in an adjacent column the words that have more than 2 characters different". What I suggested would flag rows where each cell in the source column contains a string that is itself made up of more than two distinct characters (so there is no reference sequence or string).

Craige_Hales
Super User

Re: How do I identify words (using formula in column) that have more than 2 characters different from my reference word?

We need some more examples. What is the number of differences for (don't over-specify, indicate where it isn't important.)

 

ref = "AEIOUA" // trying to understand the rule for the repeated letter

 

test = "AEIOUA" // nDiff = 0 for identical string

test = "AAEIOU" // nDiff = ? for same letters, reordered

test = "AEIOU" // nDiff = ? for same letters, missing one on right

test = "EIOUA" // nDiff = ? for same letters, missing one on left

test = "EIOU" // nDiff = ?

test = "AA" // dDiff = ?

test = "AXA" // dDiff = ?

test = "AXXXXA" // dDiff = ? for matching length

test = "AEIUOA" // dDiff = ? for internal letters swapped

test = "AEIOUEIOU" // nDiff = ? for repeated letters

 

These answers would help choose an algorithm:

You said ~1e6 test words; how many ref words are you looking at?

Will you need answers other than a "yes/no there are more than 2 differences"? Is a count of diffs going to be useful?

Craige
SaraHorta23
Level I

Re: How do I identify words (using formula in column) that have more than 2 characters different from my reference word?

Hi Craige,

Thanks for helping out. I have 1 to 5e6 and only one reference word. I basically want to filter out words that have more than 2 letter different comparing to a reference word and therefore I do not require more than yes/no or 0/1 as answer. Going through your questions:

 

test = "AEIOUA" // nDiff = 0 for identical string

test = "AAEIOU" // nDiff = 1 - for same letters, reordered.

- The first A is only one kept at the same position. (5 diff letters)

 

test = "AEIOU" // nDiff = 1 for same letters, missing one on right 

test = "EIOUA" // nDiff = 1 for same letters, missing one on left

test = "EIOU" // nDiff = 1

test = "AA" // dDiff = 1

test = "AXA" // dDiff = 1

- Length is not same as AEIOUA. For this I applied a function Len to filter those words out.

 

test = "AXXXXA" // dDiff = 1 for matching length

- The first and last A are kept at the same position. (4 diff letters)

 

test = "AEIUOA" // dDiff = 0 for internal letters swapped

Only 2 letters different from AEIOUA, so it will pass the criteria. 

 

test = "AEIOUEIOU" // nDiff = 1 for repeated letters

Long sequence. Removed using Len.

 

 

Sorry, I was not clear when explaining my question/problem. I recreated an example of what i'm looking for in addition to your questions. 

 

Reference sequenceFEBRUARY
  
List of sequencesLabel - more than 2 characters different from reference?
FEBAUARYNo
WAARUARYYes
YRAURBEFYes
FEBRURRRNo
FEBRUARWNo
YEAURBEFYes
YEAURARYYes
jthi
Super User

Re: How do I identify words (using formula in column) that have more than 2 characters different from my reference word?

Thank you for extra info! Check out Shortest Edit Script from Scripting index:

jthi_0-1644415162212.png

With that you might be able to do something, unless it is too slow. Pat Match could also maybe do this, but I haven't (yet) looked into how to use it properly.

 

editList = Shortest Edit Script(:List of sequences, "FEBRUARY");
common = "";
For(i = 1, i <= N Items(editList), i++,
	If(editList[i][1] == "Common",
		common = common || editList[i][2]
	)
);
Length(common) < Length(:List of sequences) - 2;

jthi_1-1644415221426.png

 

-Jarmo
SaraHorta23
Level I

Re: How do I identify words (using formula in column) that have more than 2 characters different from my reference word?

Thanks Jthi! Works nicely