Is there any mechanism in Python for combining compiled regular expressions?

I know it's possible to compile a new expression by extracting the plain-old-string .pattern property from existing pattern objects. But this fails in several ways. For example:

import re

first = re.compile(r"(hello?\s*)")

# one-two-three or one/two/three - but not one-two/three or one/two-three
second = re.compile(r"one(?P<r1>[-/])two(?P=r1)three", re.IGNORECASE)

# Incorrect - back-reference \1 would refer to the wrong capturing group now,
# and we get an error "redefinition of group name 'r1' as group 3; was 
# group 2 at position 47" for the `(?P)` group.
# Result is also now case-sensitive, unlike 'second' which is IGNORECASE
both = re.compile(first.pattern + second.pattern + second.pattern)

The result I'm looking for is achievable like so in Perl:

$first = qr{(hello?\s*)};

# one-two-three or one/two/three - but not one-two/three or one/two-three
$second = qr{one([-/])two\g{-1}three}i;

$both = qr{$first$second$second};

A test shows the results:

test($second, "...one-two-three...");                   # Matches
test($both, "...hello one-two-THREEone-two-three...");  # Matches
test($both, "...hellone/Two/ThreeONE-TWO-THREE...");    # Matches
test($both, "...HELLO one/Two/ThreeONE-TWO-THREE...");  # No match

sub test {
  my ($pat, $str) = @_;
  print $str =~ $pat ? "Matches\n" : "No match\n";
}

Is there a library somewhere that makes this use case possible in Python? Or a built-in feature I'm missing somewhere?

(Note - one very useful feature in the Perl regex above is \g{-1}, which unambiguously refers to the immediately preceding capture group, so that there are no collisions of the type that Python is complaining about when I try to compile the combined expression. I haven't seen that anywhere in Python world, not sure if there's an alternative I haven't thought of.)

  • The first regex shouldn't be required here to reproduce the problem, right? (Or maybe it causes the issue with the case sensitivity?) Commented Feb 23, 2018 at 22:29
  • When you add the three patterns in your example, you get a new pattern '(hello?\s*)one(?P<r1>[-/])two(?P=r1)threeone(?P<r1>[-/])two(?P=r1)three', which is likely not what you want. Adding the patterns simply concatenates them as strings. Perhaps you want to join them with '|'? Commented Feb 23, 2018 at 22:32
  • It does not work because these are not regular expressions in the strict sense. For instance, a true regular expression has no notion of what was "previously matched". Turning them into real regular expressions would do the job. However, I do see what you are trying to take advantage of. Commented Feb 23, 2018 at 22:33
  • If you want Perl-like features and more, use the regex module, not re. Note that in the re module, case sensitivity (or any other behaviour you switch on or off with a flag) always applies to the whole pattern, even if you use (?i) in the pattern itself. Commented Feb 23, 2018 at 22:48
  • Thanks @CasimiretHippolyte for the reference, I'll check it out. I didn't see the ability to work with pre-compiled regexes at first skim, but maybe it's there. Thanks also for the tip about (?i), I would definitely have tripped on that. Commented Feb 23, 2018 at 22:55
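
On the point about flags raised in the comments: a global flag such as re.IGNORECASE or a leading (?i) does apply to the whole pattern, but from Python 3.6 onward re also accepts the scoped form (?i:...), which limits the flag to one group. A minimal sketch, assuming Python 3.6 or later; the pattern and variable names here are illustrative and not from the thread:

```python
import re

# Only the "one...three" part is case-insensitive; "hello" stays case-sensitive.
# Assumes Python 3.6+, where scoped inline flags such as (?i:...) are available.
# (?i:...) is non-capturing, so ([-/]) is group 2 and \2 refers back to it.
scoped = re.compile(r"(hello?\s*)(?i:one([-/])two\2three)")

matches_mixed_case_tail = scoped.search("hello ONE-Two-three") is not None
matches_upper_hello = scoped.search("HELLO one-two-three") is not None

print(matches_mixed_case_tail)  # True
print(matches_upper_hello)      # False
```

This only scopes the flag; it does nothing about clashing group names when the same sub-pattern is pasted in twice.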

2 Answers

Ken, this is an interesting problem. I agree with you that the Perl solution is very slick. I came up with something, but it is not as elegant. Maybe it gives you some ideas for exploring a solution in Python. The idea is to simulate the concatenation by running the compiled patterns one after another with Python's re methods.

import re

first = re.compile(r"(hello?\s*)")
second = re.compile(r"one(?P<r1>[-/])two(?P=r1)three", re.IGNORECASE)

text = "...hello one-two-THREEone/two/three..."
#text = "...hellone/Two/ThreeONE-TWO-THREE..."

# Find 'first' anywhere, then require 'second' twice, each match starting
# exactly where the previous one ended.
m1 = first.search(text)
if m1:
    m2 = second.match(text, m1.end())
    if m2 and second.match(text, m2.end()):
        print('Matches')

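This works in most cases, but not for the input below. There the first search greedily matches "hello", taking the "o" that belongs to "one", and because the patterns run separately the engine cannot backtrack to give it up: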

...hellone/Two/ThreeONE-TWO-THREE...

So yes, I admit it is not a complete solution to your problem. I hope it helps, though.
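
The same idea can be sketched as a small helper so the nesting does not grow with the number of patterns. This is only an illustration: match_in_sequence is a hypothetical name, not a library function, and it keeps the limitation described above, since each pattern still runs on its own.

```python
import re

def match_in_sequence(patterns, text):
    """Return True if the patterns match back to back somewhere in text.

    Hypothetical helper: every pattern is matched independently, so the
    engine cannot backtrack from one pattern into the previous one.
    """
    head, *rest = patterns
    for start in head.finditer(text):
        pos = start.end()
        for pat in rest:
            m = pat.match(text, pos)  # must match exactly at offset pos
            if m is None:
                break
            pos = m.end()
        else:
            return True  # every remaining pattern matched in turn
    return False

first = re.compile(r"(hello?\s*)")
second = re.compile(r"one(?P<r1>[-/])two(?P=r1)three", re.IGNORECASE)

ok = match_in_sequence([first, second, second],
                       "...hello one-two-THREEone/two/three...")
missed = match_in_sequence([first, second, second],
                           "...hellone/Two/ThreeONE-TWO-THREE...")
print(ok)      # True
print(missed)  # False: "hello" swallows the "o" of "one"
```

Because each compiled pattern keeps its own flags and its own group names, there is no name clash and no flag leakage, at the cost of losing backtracking across pattern boundaries.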

1 Comment

Good technique. It sort of falls down if you want to build up a complex expression from multiple smaller expressions, because then the "concatenation" code gets very hairy. But for simple cases this can work.

I'm not a Perl expert, but it doesn't seem like you're comparing apples to apples. You're using named capture groups in Python, but I don't see any named capture groups in the Perl example. This causes the error you mention, because this

both = re.compile(first.pattern + second.pattern + second.pattern)

tries to create two capture groups named r1.

For example, if you use the regex below, then try to access group_one by name, would you get the numbers before "some text" or after?

# Not actually a valid regex
r'(?P<group_one>[0-9]*)some text(?P<group_one>[0-9]*)'
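
A short check of what re does with that pattern, as a sketch: compiling it raises re.error instead of picking one of the two groups. The variable names are illustrative.

```python
import re

try:
    re.compile(r"(?P<group_one>[0-9]*)some text(?P<group_one>[0-9]*)")
    compiled = True
except re.error as exc:
    # The message reports that group name 'group_one' is being redefined.
    compiled = False
    message = str(exc)

print(compiled)  # False
```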

Solution 1

An easy solution is probably to remove the names from the capture groups, and to pass re.IGNORECASE when compiling the combined pattern as well. The code below compiles, although I'm not sure the resulting pattern will match exactly what you want it to match.

import re

first = re.compile(r"(hello?\s*)")
second = re.compile(r"one([-/])two([-/])three", re.IGNORECASE)
both = re.compile(first.pattern + second.pattern + second.pattern, re.IGNORECASE)

Solution 2

What I'd probably do instead is define the separate regular expressions as strings, then you can combine them however you'd like.

pattern1 = r"(hello?\s*)"
pattern2 = r"one([-/])two([-/])three"
first = re.compile(pattern1, re.IGNORECASE)
second = re.compile(pattern2, re.IGNORECASE)
both = re.compile(r"{}{}{}".format(pattern1, pattern2, pattern2), re.IGNORECASE)

Or better yet, for this specific example, don't write pattern2 out twice; express the repetition in the regex itself:

both = re.compile("{}({}){{2}}".format(pattern1, pattern2), re.IGNORECASE)

which gives you the following regex:

r'(hello?\s*)(one([-/])two([-/])three){2}'
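
To see how this combined pattern behaves, here is a quick check against two of the question's test strings. The third sample, with mixed separators, is an added input used only for illustration.

```python
import re

pattern1 = r"(hello?\s*)"
pattern2 = r"one([-/])two([-/])three"
both = re.compile("{}({}){{2}}".format(pattern1, pattern2), re.IGNORECASE)

samples = [
    "...hello one-two-THREEone-two-three...",
    "...HELLO one/Two/ThreeONE-TWO-THREE...",
    "...hello one-two/threeone/two-three...",  # mixed separators (added sample)
]
results = [both.search(s) is not None for s in samples]
print(results)  # [True, True, True]
```

All three match: the global re.IGNORECASE also covers "HELLO", and the two independent ([-/]) groups accept mixed separators, so this is looser than the Perl version in the question.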

3 Comments

The reason I'm not "comparing apples to apples" is because the proper solution IMO (relative capture groups) exists in Perl, but not in Python. I don't know of any solution in Python. In your solution, please note that your pattern2 matches "one-two/three", which I do not wish to be part of the solution space. That was the reason for using named capture groups or relative groups in the first place.
I see. If you plug your regex for pattern2 into my solution, it fixes the "one-two/three" match, but it still leaves both "HELLO" and "hello" matching. So I don't think there's a way to meet all your requirements with Python's re module, at least not all in one step. There's no way to directly concatenate two pattern objects, and the re.IGNORECASE flag will need to be set on the whole regex, not just a part of it.
Yes - even if I used my regex for pattern2, it couldn't be compiled with another copy of itself.
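
One possible workaround for the two obstacles discussed in these comments, sketched under stated assumptions: rewrite each pattern's source so its named groups get a unique suffix, and wrap it in a scoped flag group instead of passing a global flag. combine and SCOPED_FLAGS are hypothetical names, not part of re. The sketch assumes Python 3.6+ for (?i:...), and it does not renumber numeric back-references such as \1, handle conditional groups, or detect an escaped literal "\(?P<" in the source.

```python
import re

# Flags that Python 3.6+ can scope to a group, keyed by their inline letter.
SCOPED_FLAGS = {"i": re.IGNORECASE, "m": re.MULTILINE, "s": re.DOTALL}

def combine(*patterns):
    """Hypothetical helper: concatenate compiled patterns into a new one."""
    parts = []
    for index, pat in enumerate(patterns):
        # Suffix every named group, (?P<name>, and every named back-reference,
        # (?P=name), with the part index so a pattern can be reused.
        source = re.sub(
            r"\(\?P([<=])(\w+)",
            lambda m: "(?P{}{}_{}".format(m.group(1), m.group(2), index),
            pat.pattern,
        )
        letters = "".join(k for k, flag in SCOPED_FLAGS.items() if pat.flags & flag)
        # "(?:...)" when no flags are set, "(?i:...)" and similar otherwise.
        parts.append("(?{}:{})".format(letters, source))
    return re.compile("".join(parts))

first = re.compile(r"(hello?\s*)")
second = re.compile(r"one(?P<r1>[-/])two(?P=r1)three", re.IGNORECASE)
both = combine(first, second, second)

results = [
    both.search("...hello one-two-THREEone-two-three...") is not None,
    both.search("...hellone/Two/ThreeONE-TWO-THREE...") is not None,
    both.search("...HELLO one/Two/ThreeONE-TWO-THREE...") is not None,
]
print(results)  # [True, True, False]
```

With these inputs the results line up with the Perl test in the question: the back-references stay tied to their own copy of the sub-pattern, and "hello" remains case-sensitive.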
