I am trying to find strings which are at most two mistakes 'away' from the original pattern string (i.e. they differ by at most two letters).
However, the following code isn't working as I would expect, at least not from my understanding of fuzzy regex:
import regex
# {e<=2} allows up to two errors of any kind (substitution, insertion, deletion)
res = regex.findall("(ATAGGAGAAGATGATGTATA){e<=2}", "ATAGAGCAAGATGATGTATA", overlapped=True)
print(res)
# Output: ['ATAGAGCAAGATGATGTATA'], i.e. the second string
As you can see, the two strings differ in three positions rather than at most two:
the first has: ATAGGAGAAGATGATGTATA
the second has: ATAGAGCAAGATGATGTATA
and yet the result shows the second string, as though it's within e<=2 (this also happens with overlapped=False, so that can't be it).
What am I missing here? And is there any way of getting this to find only strings within the Hamming 2-ball of the given pattern?
Is it possible that a swap of letters is considered to be only one change? And if so - how can I avoid this?
ATAGAGCAAGATGATGTATA is the correct result as per the expression. You asked to find ATAGGAGAAGATGATGTATA where any 2 differences (substitutions, insertions or deletions) can be found. AGC is in fact 2 differences: G is removed and C is inserted. I suggest using a non-regex approach here (there is example code on Wikipedia). Just get all the necessary permutations of the string and check them with that method.
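To make the difference concrete, here is a minimal pure-Python sketch (not the Wikipedia code referred to above) that compares the two ways of counting. The helper names `hamming`, `levenshtein` and `hamming_matches` are my own, chosen for illustration. It shows that the two strings are 3 apart when only substitutions count, but 2 apart once insertions and deletions are allowed, which is what `{e<=2}` permits.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Edit distance counting substitutions, insertions and deletions."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # delete x
                           cur[j - 1] + 1,            # insert y
                           prev[j - 1] + (x != y)))   # substitute (or match)
        prev = cur
    return prev[-1]


def hamming_matches(pattern, text, max_mismatches=2):
    """Every window of len(pattern) in text within max_mismatches substitutions."""
    n = len(pattern)
    return [text[i:i + n] for i in range(len(text) - n + 1)
            if hamming(pattern, text[i:i + n]) <= max_mismatches]


pattern = "ATAGGAGAAGATGATGTATA"
candidate = "ATAGAGCAAGATGATGTATA"

print(hamming(pattern, candidate))          # 3: three positions differ
print(levenshtein(pattern, candidate))      # 2: delete G, insert C
print(hamming_matches(pattern, candidate))  # []: not inside the Hamming 2-ball
```

If you would rather stay with the regex module, its fuzzy syntax also lets you restrict the error type: as far as I know, writing `{s<=2}` instead of `{e<=2}` allows substitutions only, which should correspond to the Hamming 2-ball. I have not run that against your data, so treat it as something to verify.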