regular expression matches in Python

Question

I have a question about regular expressions. When using the alternation (|) construct:

$ python
Python 2.7.3 (default, Sep 26 2012, 21:51:14) 
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> for mo in re.finditer('a|ab', 'ab'):
...     print mo.start(0), mo.end(0)
... 
0 1

we get only one match, which is expected, since only the first (leftmost) branch that gets accepted is reported. My question is whether it is possible, and if so how, to construct a regular expression that would yield both (0,1) and (0,2). Also, how can this be done in general for any regex of the form r1 | r2 | ... | rn?

Similarly, is it possible to achieve this for the *, +, and ? constructs? By default:

>>> for mo in re.finditer('a*', 'aaa'):
...     print mo.start(0), mo.end(0)
... 
0 3
3 3
>>> for mo in re.finditer('a+', 'aaa'):
...     print mo.start(0), mo.end(0)
... 
0 3
>>> for mo in re.finditer('a?', 'aaa'):
...     print mo.start(0), mo.end(0)
... 
0 1
1 2
2 3
3 3

My second question: why do empty strings match at the ends, but not anywhere else, as is the case with * and ? above?

EDIT:

I think I realize now that both questions were misguided: as @mgilson said, re.finditer only returns non-overlapping matches, and I guess that as soon as the regular expression accepts (part of) a string, the engine stops searching from that position. Thus, this is impossible with the default settings of Python's matching engine.

Still, I wonder: if Python uses backtracking in regex matching, it should not be very difficult to make it continue searching after accepting a string. But this would break the usual behavior of regular expressions.
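To illustrate the idea of "continuing after acceptance", here is a minimal sketch of a toy backtracking matcher that yields every end position instead of stopping at the first one. It is not how Python's re module works internally; the tuple-based pattern representation and the names ends and all_matches are hypothetical, made up for this example, and only literals, alternation and a greedy + are covered.

```python
def ends(node, text, start):
    """Yield each index `end` such that `node` matches text[start:end]."""
    kind = node[0]
    if kind == "lit":
        # ("lit", "ab") matches the literal string "ab"
        if text.startswith(node[1], start):
            yield start + len(node[1])
    elif kind == "alt":
        # Try every branch in order instead of stopping at the first success.
        for branch in node[1:]:
            yield from ends(branch, text, start)
    elif kind == "plus":
        for mid in ends(node[1], text, start):
            if mid > start:
                # Greedy: explore further repetitions before settling for fewer.
                yield from ends(node, text, mid)
            yield mid
    else:
        raise ValueError("unknown node kind: %r" % (kind,))


def all_matches(node, text):
    """Yield (start, end) for every way `node` can match inside `text`."""
    for start in range(len(text) + 1):
        for end in ends(node, text, start):
            yield (start, end)


a_or_ab = ("alt", ("lit", "a"), ("lit", "ab"))  # toy form of a|ab
a_plus = ("plus", ("lit", "a"))                 # toy form of a+

print(list(all_matches(a_or_ab, "ab")))  # [(0, 1), (0, 2)]
print(list(all_matches(a_plus, "aaa")))  # [(0, 3), (0, 2), (0, 1), (1, 3), (1, 2), (2, 3)]
```

Because the matcher is a generator, "accepting" a match just means yielding it; asking for the next item resumes the backtracking where it left off.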

EDIT2:

This is possible in Perl. See answer by @Qtax below.

mgilson · Accepted Answer · 2013-02-07 02:13:58Z

1

I don't think this is possible. The docs for re.finditer state:

Return an iterator yielding MatchObject instances over all non-overlapping matches for the RE pattern in string

(emphasis is mine: "non-overlapping")
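A small sketch of what "non-overlapping" means in practice, written for Python 3 and using only the standard re module. The lookahead trick shown second is a common workaround, not something the docs quoted above prescribe, and it still reports at most one match per start position.

```python
import re

text = "aaaa"

# finditer resumes scanning at the end of the previous match, so the
# occurrence of "aa" that starts at index 1 is never reported.
plain = [m.span() for m in re.finditer(r"aa", text)]
print(plain)  # [(0, 2), (2, 4)]

# A zero-width lookahead with a capturing group: the overall match is empty,
# so the scan advances one character at a time and group 1 holds the text.
overlapping = [m.span(1) for m in re.finditer(r"(?=(aa))", text)]
print(overlapping)  # [(0, 2), (1, 3), (2, 4)]

# Still only the first accepted branch per start position.
alt = [m.span(1) for m in re.finditer(r"(?=(a|ab))", "ab")]
print(alt)  # [(0, 1)]
```

So the lookahead gives overlapping start positions, but it does not produce both (0,1) and (0,2) for a|ab.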

In answer to your other question about why empty strings don't match elsewhere: I think it is because the rest of the string has already been consumed by an earlier match, and finditer only yields non-overlapping matches (see the answer to the first part ;-).
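The following sketch (Python 3, standard re module) is one way to check that reasoning. It assumes nothing beyond Pattern.match and Pattern.fullmatch with their pos and endpos arguments: the empty string is acceptable to a* at every index, but a greedy quantifier prefers the longest match, so finditer only reaches an empty match where nothing is left to consume.

```python
import re

text = "aaa"
pattern = re.compile(r"a*")

# Anchored attempt at every index: greedy a* takes as many characters as it can.
greedy_spans = [pattern.match(text, pos).span() for pos in range(len(text) + 1)]
print(greedy_spans)  # [(0, 3), (1, 3), (2, 3), (3, 3)]

# The empty string is nevertheless accepted by a* at every index.
empty_ok = [pattern.fullmatch(text, pos, pos) is not None
            for pos in range(len(text) + 1)]
print(empty_ok)  # [True, True, True, True]

# finditer continues from the end of the previous match, so after (0, 3)
# the only position left to try is index 3.
found = [m.span() for m in pattern.finditer(text)]
print(found)  # [(0, 3), (3, 3)]
```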

answered Feb 7, 2013 at 2:13

mgilson

2 Comments

mgilson Over a year ago

@answerers -- If you prove me wrong on this point, please @notify me. I'm interested to know how this one turns out :)

Timo Over a year ago

Of course, the second question was foolish, should have read the docs :)

Qtax · Accepted Answer · 2013-02-07 03:16:00Z

1

Just want to mention that you can do such things in Perl, using an expression like:

(?:a|ab)(?{ say $& })(?!)

The (?{ code }) construct executes the code every time the regex engine reaches that position in the pattern. Here that position is right after your regex, so the code prints the text matched so far. The (?!) that follows always fails, which forces the regex engine to backtrack and try the next possible match, and so on.

This will work for any kind of expression.

Example:

perl -E "$_='ab'; /(?:a|ab)(?{ say $& })(?!)/"

Output:

a
ab

Another example:

perl -E "$_='aaaa'; /a+(?{ say $& })(?!)/"

Output:

aaaa
aaa
aa
a
aaa
aa
a
aa
a
a
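For comparison, a rough Python 3 approximation of the same result with the standard re module: a brute-force sketch that tests every (start, end) pair with Pattern.fullmatch. The helper name all_spans is hypothetical. It is not equivalent to the Perl trick: it reports each span once, in positional rather than backtracking order, it needs a quadratic number of match attempts, and patterns that use $ or lookahead may behave differently because endpos truncates what the engine can see.

```python
import re


def all_spans(pattern, text):
    """Return every (start, end) such that text[start:end] is accepted by pattern."""
    compiled = re.compile(pattern)
    spans = []
    for start in range(len(text) + 1):
        for end in range(start, len(text) + 1):
            # fullmatch forces the engine to backtrack until the whole
            # region [start, end) is matched, or to give up.
            if compiled.fullmatch(text, start, end):
                spans.append((start, end))
    return spans


print(all_spans(r"a|ab", "ab"))       # [(0, 1), (0, 2)]
print(len(all_spans(r"a+", "aaaa")))  # 10
```

The ten spans for a+ on 'aaaa' correspond to the ten lines printed by the Perl example above.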

answered Feb 7, 2013 at 3:16

Qtax

7 Comments

Timo Over a year ago

Very cool, I did not know about that extension before. I wonder if something similar exists in Python or in other languages such as JavaScript? I am currently reading the Python docs and hope to find something similar :)

Qtax Over a year ago

@Timo, surely not in JavaScript, and no other languages/libs that I know of have such code-execution features. But some libs (PCRE?) probably let you set some options to get the same result as in this case.

Timo Over a year ago

Yup, did not find anything about that in Python docs. I guess this proves that Perl is still the best tool to use with anything related to regular expressions :D

mgilson Over a year ago

@Timo -- It looks like there might be something similar to what you are looking for in the regex module ...

mgilson Over a year ago

although I can't seem to get it to work... (it seemed like the overlapped=True keyword would be what you wanted)
