Python regex: findall() and search()

Question

I have the following Python regex:

>>> import re
>>> p = re.compile(r"(\b\w+)\s+\1")

\b  : word boundary
\w+ : one or more word characters (letters, digits, underscore)
\s+ : one or more whitespace characters (space, \t, \n, ...)
\1  : backreference to group 1 (the part between the parentheses)

This regex should find every doubled occurrence of a word, provided the two occurrences are next to each other with only whitespace in between.
The regex seems to work fine when using the search function:

>>> p.search("I am in the the car.")

<_sre.SRE_Match object; span=(8, 15), match='the the'>

The found match is the the, just as I had expected. The weird behaviour is in the findall function:

>>> p.findall("I am in the the car.")

['the']

The found match is now only the. Why the difference?

Because findall returns only the capturing groups if there are any (or the complete match otherwise).
– Sebastian Proske, Commented Apr 17, 2017 at 14:15
docs.python.org/3/library/re.html#re.findall "If one or more groups are present in the pattern, return a list of groups"
– melpomene, Commented Apr 17, 2017 at 14:16
Oh, now I see. Thank you. So I have to use a non-capturing group to solve the issue? I will try it out right now.
– K.Mulier, Commented Apr 17, 2017 at 14:18
I get a sre_constants.error: invalid group reference 1 at position 13 error when changing my group from (...) to (?:...) to make it non-capturing. Perhaps that's because a backreference to a non-capturing group is impossible?
– K.Mulier, Commented Apr 17, 2017 at 14:22
@K.Mulier: well, you can't do that because then you have nothing for \1 to match against.
– Martijn Pieters, Commented Apr 17, 2017 at 14:22

Martijn Pieters · Accepted Answer · 2017-04-17 14:31:15Z

When the pattern contains capturing groups, findall() returns only those groups rather than the full match. From the documentation:

If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
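As a minimal sketch of the rule quoted above, the snippet below runs findall() with zero, one, and two capturing groups on the sentence from the question. The first pattern (a literal "the the") is only an illustration and does not appear in the question.

```python
import re

text = "I am in the the car."

# No capturing group: findall() returns each full match as a string.
no_group = re.findall(r"\bthe\s+the\b", text)

# One capturing group: findall() returns only that group's text.
one_group = re.findall(r"(\b\w+)\s+\1", text)

# Two capturing groups: findall() returns one tuple per match.
two_groups = re.findall(r"((\b\w+)\s+\2)", text)

print(no_group)    # ['the the']
print(one_group)   # ['the']
print(two_groups)  # [('the the', 'the')]
```

The middle case is exactly what the question observed: the match itself is still "the the", but only group 1 is reported.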

You can't avoid using groups when using backreferences, but you can put a new group around the whole pattern:

>>> p = re.compile(r"((\b\w+)\s+\2)")
>>> p.findall("I am in the the car.")
[('the the', 'the')]

The outer group is now group 1, so the backreference has to point to group 2. There are two groups, so each entry in the result is a tuple of two strings. Using a named group might make this more readable:

>>> p = re.compile(r"((?P<word>\b\w+)\s+(?P=word))")

You can filter that back to just the outer group result:

>>> [m[0] for m in p.findall("I am in the the car.")]
['the the']
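An alternative that is not part of the answer above, offered only as a sketch: re.finditer() yields match objects instead of group strings, so the original single-group pattern can stay unchanged and group(0) gives the whole match. The longer sample sentence here is made up for illustration.

```python
import re

p = re.compile(r"(\b\w+)\s+\1")
text = "I am in the the car, and it is is red."  # hypothetical sample text

# finditer() yields match objects, so group(0) is the whole match
# while group(1) is still the captured word.
full_matches = [m.group(0) for m in p.finditer(text)]
words = [m.group(1) for m in p.finditer(text)]

print(full_matches)  # ['the the', 'is is']
print(words)         # ['the', 'is']
```

This avoids the extra outer group, at the cost of a list comprehension over the match objects.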

1 Comment

K.Mulier Over a year ago

Great answer! Thank you Martijn :-)
