I'm testing the new Python regex module, which supports fuzzy string matching, and have been impressed with its capabilities so far. However, I'm having trouble carving out exceptions to a fuzzy match. Here is a case in point. I want ST LOUIS, and every variation of ST LOUIS within an edit distance of 1, to match ref, with one exception: the edit must not be an insertion of the letter N, S, E, or W to the left of the leftmost character. In the example below, I want inputs 1 - 3 to match ref and input 4 to fail, but this ref matches all four inputs. Does anyone familiar with the new regex module know of a workaround?

import regex

input1 = 'ST LOUIS'
input2 = 'AST LOUIS'
input3 = 'ST LOUS'
input4 = 'NST LOUIS'


ref = '([^NSEW]|(?<=^))(ST LOUIS){e<=1}'

match = regex.fullmatch(ref,input1)
match
<_regex.Match object at 0x1006c6030>
match = regex.fullmatch(ref,input2)
match
<_regex.Match object at 0x1006c6120>
match = regex.fullmatch(ref,input3)
match
<_regex.Match object at 0x1006c6030>
match = regex.fullmatch(ref,input4)
match
<_regex.Match object at 0x1006c6120>
  • have you tried (?<=[^NSEW]|^)(ST LOUIS){e<=1} Commented Feb 4, 2013 at 17:29
  • @Some1.Kill.The.DJ: It should have the same effect... No, you removed the fuzzy matching part, which he needs. Commented Feb 4, 2013 at 17:31
  • @nhahtdh hmm..edited it..but i guess fuzzy matching would not be implemented on lookarounds.. Commented Feb 4, 2013 at 17:34
  • Have you considered a simple two-pass approach? Commented Feb 4, 2013 at 17:47
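The two-pass idea from the last comment could look roughly like this. It is only a sketch using the standard library, not the regex module: `within_one_edit`, `is_blocked_left_insertion` and `two_pass_match` are hypothetical helper names, and it assumes the rule is read as "accept anything within one edit of the target, then veto the single case where N, S, E or W was inserted in front".

```python
TARGET = "ST LOUIS"
BLOCKED_PREFIXES = "NSEW"


def within_one_edit(s: str, target: str) -> bool:
    """True if s differs from target by at most one insertion, deletion or substitution."""
    if abs(len(s) - len(target)) > 1:
        return False
    # Skip the common prefix, then the common suffix of what remains.
    i = 0
    while i < len(s) and i < len(target) and s[i] == target[i]:
        i += 1
    rest_s, rest_t = s[i:], target[i:]
    while rest_s and rest_t and rest_s[-1] == rest_t[-1]:
        rest_s, rest_t = rest_s[:-1], rest_t[:-1]
    # Whatever is left is the edit itself: at most one character on each side.
    return len(rest_s) <= 1 and len(rest_t) <= 1


def is_blocked_left_insertion(s: str, target: str) -> bool:
    """True if s is exactly target with one of N, S, E, W inserted in front."""
    return (
        len(s) == len(target) + 1
        and s[0] in BLOCKED_PREFIXES
        and s[1:] == target
    )


def two_pass_match(s: str, target: str = TARGET) -> bool:
    # Pass 1: the fuzzy test. Pass 2: veto the one edit that is not allowed.
    return within_one_edit(s, target) and not is_blocked_left_insertion(s, target)


for s in ("ST LOUIS", "AST LOUIS", "ST LOUS", "NST LOUIS"):
    print(s, two_pass_match(s))
```

Under that reading, inputs 1 to 3 pass and input 4 is rejected. A substitution of the first letter, such as 'ET LOUIS', is still accepted, because the veto only targets insertions.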

1 Answer

Try a negative lookahead instead:

(?![NEW]|SS)(ST LOUIS){e<=1}

(ST LOUIS){e<=1} matches any string that satisfies the fuzzy conditions placed on it. You want to prevent the match from starting with [NSEW], and a negative lookahead, (?![NSEW]), does that for you. But your desired string already starts with an S, so you only want to exclude strings where an extra S has been added in front. Such a string would start with SS, which is why S is taken out of the character class and SS is added to the negative lookahead as an alternative.

Note that if you allow more than one error (e > 1), this probably won't work as desired, because the lookahead only inspects the first one or two characters of the input.
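The guard can also be checked without the fuzzy engine. Below is a minimal sketch using only the standard library: `re` applies the same negative lookahead at position 0, and a hand-written Levenshtein function stands in for `{e<=1}`. The names `edit_distance` and `matches` are hypothetical helpers, and the sketch assumes fullmatch semantics, where a fuzzy match over the whole string amounts to an edit distance within the allowed error count.

```python
import re

TARGET = "ST LOUIS"
# Same guard as in the answer: reject N, E, W, or a doubled S at the start.
GUARD = re.compile(r"(?![NEW]|SS)")


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # drop ca from a
                curr[j - 1] + 1,           # add cb to a
                prev[j - 1] + (ca != cb),  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]


def matches(s: str, max_errors: int = 1) -> bool:
    """Lookahead guard at position 0, then the edit-distance bound."""
    return GUARD.match(s) is not None and edit_distance(s, TARGET) <= max_errors


for s in ("ST LOUIS", "AST LOUIS", "ST LOUS", "NST LOUIS"):
    print(s, matches(s))

# With two errors allowed, an N inserted behind another inserted
# character gets past the guard, since only the start is inspected.
print(matches("ANST LOUIS", max_errors=2))
```

With one error allowed, inputs 1 to 3 pass and input 4 is rejected. The last line shows the caveat: 'ANST LOUIS' is accepted once two errors are allowed. The guard also rejects a substituted first letter such as 'ET LOUIS', which may or may not be what you want.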

4 Comments

Ah yes, thought there was no S in there. Fixed for insertion.
Wow, that worked! Your suggestion makes no sense to me though. I'll check it off as an accepted answer, but could you (or someone else) explain the rationale for taking S out of the character class and placing it within an alternation instead (as a double S, no less)?
@user1185790, added a description.
Ah. That didn't even occur to me. Thanks so much for your help. This explains why the example I was going to give was working as expected - because there was no repeated letter between the character exception and the first letter of the matching string.
