Fuzzy regex (e.g. {e<=2}) correct usage in Python

Question

I am trying to find strings which are at most two mistakes 'away' from the original pattern string (i.e. they differ by at most two letters).

However, the following code isn't working as I would expect, at least not from my understanding of fuzzy regex:

import regex
res = regex.findall("(ATAGGAGAAGATGATGTATA){e<=2}", "ATAGAGCAAGATGATGTATA", overlapped=True)
print res
>> ['ATAGAGCAAGATGATGTATA']  # the second string

As you can see, the two strings differ on three letters rather than at most two:

the first has: ATAGGAGAAGATGATGTATA

the second has: ATAGAGCAAGATGATGTATA

and yet the result shows the second string, as though it's within e<=2 (this also happens with overlapped=False, so that can't be it).

What am I missing here? And is there any way of getting this to find only strings within the Hamming 2-ball of the given pattern?

Is it possible that a swap of letters is considered to be only one change? And if so - how can I avoid this?

Let me precise: the ATAGAGCAAGATGATGTATA is the correct result as per the expression. You asked to find ATAGGAGAAGATGATGTATA where any 2 differences (substitutions, insertion or deletion) can be found. AGC is in fact 2 differences: G is removed and C is inserted. I suggest using a non-regex approach here (there is an example code in Wikipedia). Just get all the necessary permutations of the string and check them with that method. — Wiktor Stribiżew
– Wiktor Stribiżew, Commented Oct 3, 2017 at 8:31
Thanks Wiktor, that makes sense. Regarding your suggestion of creating all permutations first and then checking them - I did that, but it's unfortunately much slower... If anyone is familiar with a way of getting the regex expression to relate to hamming distances only (in this case - at most two positions at which the corresponding letters are different) I would love to know. Thanks! — Anon
– Anon, Commented Oct 3, 2017 at 8:40

Oleh Rybalchenko · Accepted Answer · 2017-10-03 08:50:16Z

Let's check this snippet for fuzzy counts:

>>> pattern_string = 'ATAGGAGAAGATGATGTATA'
>>> query_string = 'ATAGAGCAAGATGATGTATA'
>>> r = regex.compile('(%s){e<=2}' % pattern_string)
>>> r.match(query_string)
<regex.Match object; span=(0, 20), match='ATAGAGCAAGATGATGTATA', fuzzy_counts=(0, 1, 1)>

fuzzy_counts=(0, 1, 1) means that in this case, we get no substitutions, 1 insertion, and 1 deletion. So your filter works because the total count of errors is 2.

But it seems that you need to filter only by substitutions count, so you can modify the regex:

import regex
res = regex.findall("(ATAGGAGAAGATGATGTATA){s<=2}", "ATAGAGCAAGATGATGTATA", overlapped=True)
print res

Check this great example from docs:

{i<=3} permit at most 3 insertions, but no other types
{d<=3} permit at most 3 deletions, but no other types
{s<=3} permit at most 3 substitutions, but no other types
{i<=1,s<=2} permit at most 1 insertion and at most 2 substitutions, but no deletions
{e<=3} permit at most 3 errors
{1<=e<=3} permit at least 1 and at most 3 errors
{i<=2,d<=2,e<=3} permit at most 2 insertions, at most 2 deletions, at most 3 errors in total, but no substitutions

Mark Amery · Accepted Answer · 2019-07-07 12:40:50Z

Your mistake is to assume that "errors" are the same thing as "substitutions", when this is not the case.

The regex package's fuzzy matching understands three kinds of errors - insertions, deletions, and substitutions. An error distance specified with e, as you've used, can be made up of any combination of those errors. And ATAGGAGAAGATGATGTATA can be edited into ATAGAGCAAGATGATGTATA with only two such operations (1 deletion and 1 insertion), as shown by the sequence alignment below:

ATAGGAG-AAGATGATGTATA
ATAG-AGCAAGATGATGTATA

is there any way of getting this to find only strings within the Hamming 2-ball of the given pattern?

Yes. Note that Hamming distance is a kind of edit distance that measures the minimum number of substitutions required to edit one string to another of equal length. So to only match strings within the Hamming 2-ball of pattern, we need to tell regex to match anything within 2 substitutions, which we can do by using the s error type instead of e:

import regex
res = regex.findall("(ATAGGAGAAGATGATGTATA){s<=2}", "ATAGAGCAAGATGATGTATA", overlapped=True)
print res

Is it possible that a swap of letters is considered to be only one change?

Not in the regex package as it currently stands. The standard term of art for a "swap" of two characters is a "transposition". Edit distances that include transpositions as a possible edit (e.g. Dameau-Levenshtein distance, in which edits can be insertions, substitutions, deletions, or transpositions of adjacent characters) do exist and are useful for some applications (e.g. typo correction). However, at the time of writing, the fuzzy matching in the regex package does not have any support for them at all.

Collectives™ on Stack Overflow

Fuzzy regex (e.g. {e<=2}) correct usage in Python

2 Answers 2

Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related