python complementing a complex regular expression

Question

Trying to learn regular expressions and despite some great posts on here and links to a regEx site, I have a case I was trying to hack out of sheer stubbornness that defied producing the match I was looking for. To understand it, consider the following code which allows us to pass in a list of strings and a pattern and find out if the pattern either matches all items in the list or matches none of them:

import re
def matchNone(pattern, lst):
    return not any([re.search(pattern, i) for i in lst])

def matchAll(pattern, lst):
    return all([re.search(pattern, i) for i in lst])

To help w/ debugging, this simple code allows us to just add _test to a function call and see what is being passed to the any() or all() functions which ultimately return the result:

def matchAll_test(pattern, lst):
    return [re.search(pattern, i) for i in lst]

def matchNone_test(pattern, lst):
    return ([re.search(pattern, i) for i in lst])

This pattern and list produces True from matchAll():

wordPattern = "^[cfdrp]an$"
matchAll(wordPattern, ['can', 'fan', 'dan', 'ran', 'pan']) # True

This pattern on the surface appears to work with matchNone() in our effort to reverse the pattern:

wordPattern = "^[^cfdrp]an|[cfdrp](^an)$"
matchNone(wordPattern, ['can', 'fan', 'dan', 'ran', 'pan']) # True

It returns True as we hoped it would. But a true reversal of this pattern would return False for a list of values where none of them are equivalent to our original list ['can', 'fan', 'dan', 'ran', 'pan'] regardless of what else we pass into it. (i.e. "match anything except these 5 words")

In testing to see what changes to the words in this list will get us a False, we quickly discover the pattern is not as successful as it first appears. If it were, it would fail for matchNone() on anything not in the aforementioned list.

These permutations helped uncover the short-comings of my pattern tests:

["something unrelated", "p", "xan", "dax", "ccan", "dann", "ra"]

In my exploration of above, I tried other permutations as well taking the original list, using the _test version of the functions and changing one letter at a time on the original words, and or modifying one term or adding one term from permutations like what is above.

If anyone can find the true inverse of my original pattern, I would love to see it so I can learn from it.

To help with your investigation:

This pattern also works with matchAll() for all words, but I could not seem to create its inverse either: "^(can|fan|dan|ran|pan)$"

Thanks for any time you expend on this. I'm hoping to find a regEx guru on here who spots the mistake and can propose the right solution.

(^an) doesn't mean "anything but an". Outside of a character class, ^ matches the start of the string (and also the beginning of a line in MULTILINE mode). — user2357112
– user2357112, Commented Mar 22, 2017 at 19:06
Even if it did mean "anything but an", you still wouldn't have correctly inverted your regex, though. — user2357112
– user2357112, Commented Mar 22, 2017 at 19:07
Responding to all comments on here. Whereas the first pattern successfully matches the 5 words in the list, the "reverse" or "inverse" I was looking for (but was grasping for words) is a pattern that matches everything passed to it except the 5 words in our sample. Regarding the comments on "^", on various regEx sites and in the class I am taking, this character has two contexts: at the very start of a pattern it means start of the pattern, but inside [] and possibly inside () or placed in the middle of a pattern, it means "not this". . . — TMWP
– TMWP, Commented Mar 22, 2017 at 19:17
My pattern tests all get partway there, but all of them fail when I pass in some of the given test cases so they are not true "reversals" or the original and fail at goal of match all except what is in the list. As stated, there are probably other ways around this use case, but finding the regEx equivelence of what I am looking for will go a long way to helping me grasp its syntax and some of the confusing results I see with patterns I try. — TMWP
– TMWP, Commented Mar 22, 2017 at 19:17

R1tschY · Accepted Answer · 2017-03-22 20:11:48Z

3

I hope I understood your question. This is the solution that I found:

^(?:[^cfdrp].*|[cfdrp][^a].*|[cfdrp]a[^n].*|.{4,}|.{0,2})$

[^cfdrp].*: if text starts not with c, f, d, r or p than match
[cfdrp][^a].*: text starts with c, f, d, r or p: match if second character is not an a
[cfdrp]a[^n].*: text starts with [cfdrp]a: match if third character is not a n.
.{4,}: match anything with more than 3 characters
.{0,2}: match anything with 0, 1, or 2 characters

It is equal to:

^(?:[^cfdrp].*|.[^a].*|..[^n].*|.{4,}|.{0,2})$

edited Mar 22, 2017 at 20:11

answered Mar 22, 2017 at 20:03

R1tschY

3,2501 gold badge27 silver badges32 bronze badges

Sign up to request clarification or add additional context in comments.

4 Comments

TMWP Over a year ago

could you explain the "?:" ... having trouble getting even the help on this part of the syntax, but I think you may have gotten it. Doing some testing and some reverse engineering of your explanation to better understand and validate it and if it works as I expect, the answer will be accepted shortly.

TMWP Over a year ago

Your answer was accepted. If you have not done so already, if you could vote for the question it would be appreciated. My questions with answers and comments but no votes almost cost me the ability to post new questions last week because of a weird algorithm on this site.

R1tschY Over a year ago

@TMWP: without the ?: it will be the same for your use case. It generates a non-capturing group instead of a capturing group. Important is that everything between ^ and $ is in one group. Otherwise it is something like: (^[^cfdrp].*)|(.[^a].*)|(..[^n].*)|(.{4,})|(.{0,2}$)

TMWP Over a year ago

Thanks. I have been learning exactly what I hoped to from this exchange. Analyzing the patterns you submitted, I now feel like I have a better understanding of regEx so I will write more patterns that work in the future. Best wishes and have a good weekend.

mbomb007 · Accepted Answer · 2018-02-01 15:59:33Z

1

What you are looking to do is find the complement. To do this for any regular expression is a difficult problem. There is no built-in for complementing a regex.

There is an open challenge on PPCG to do this. One comment explains the difficulty:

It's possible, but crazy tedious. You need to parse the regexp into an NFA (eg Thompson's algorithm), convert the NFA to a DFA (powerset construction), complete the DFA, find the complement, then convert the DFA to a RE (eg Brzozowski's method). ie slightly harder than writing a complete RE engine!

There are Python libraries out there that will convert from a regular expression (the original specification refers to a "regular language", which only has literals, "or", and "star" -- much simpler than the type of regex you're thinking of [more info here]) to an NFA, to a DFA, complement it, and convert it back. It's pretty complicated.

Here's a related SO question: Finding the complement of a DFA?

In summary, it's far simpler to find the result of the original regular expression instead, then use Boolean negation.

edited Feb 1, 2018 at 15:59

answered Mar 22, 2017 at 20:09

mbomb007

4,4013 gold badges52 silver badges79 bronze badges

8 Comments

TMWP Over a year ago

This was helpful and explains why as patterns grow large, the answer should not be attempted if another way can be found. I think with 5 terms, it should be possible and am evaluating another solution on here as I type this. Thanks for the useful comments. The whole point was to understand RegEx syntax better which is why an actual pattern-match to this limited example was what was being looked for. Best wishes.

mbomb007 Over a year ago

Here is an implementation in Python. I put a comment at the bottom of the code explaining what operations are allowed. It's not quite the complement you're looking for, because it only includes letters that are in the input. Also, it's not always fully simplified. In this, + means OR. * is the same. So a+b's complement is (a+b)(a+b)(a+b)*+() (an a or b, followed by another, and then zero or more of them, OR the empty string.)

TMWP Over a year ago

Did you create the code on "implementation in Python" link above? If so - that's an amazing piece of code that must have taken you long to do. I only just saw this comment. Oddly though, the human solution I accepted, though not as robust (for any scenario) was a cleaner answer that passed all tests I had devised to throw at it. But your code theoretically could solve anything, even something a human could not so that is really impressive!

mbomb007 Over a year ago

@TMWP I told you, it's not a Python regex that it outputs. You need to read what I said more thoroughly. The formal definition of a regular expression DOES NOT have character classes, or negation, or non-greedy modifiers, or capture groups. It only has + (which acts like Python's |), * (same as Python's *), and concatenation.

mbomb007 Over a year ago

So if you want the complement of Python's a+, you have to use aa*. If you want Python's a|b, you must use a+b. Etc. Then after receiving the result, you'd have to convert it back into Python's syntax. The complement of aa* is aaa*+(), which means "2 or more a's or empty string", which converts back to aa+| in Python, which is the complement if the only letter in the alphabet is a. The code does not have the ability to use a larger alphabet than the regex uses yet.

|

Collectives™ on Stack Overflow

python complementing a complex regular expression

2 Answers 2

4 Comments

8 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

4 Comments

8 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related