I have the following Python regex:

>>> p = re.compile(r"(\b\w+)\s+\1")

\b    :   word boundary
\w+   :   one or more word characters (letters, digits, or underscore)
\s+   :   one or more whitespace characters (space, \t, \n, ...)
\1    :   backreference to group 1 (the part between the parentheses)

This regex should find every doubled occurrence of a word, provided the two occurrences are next to each other with only whitespace in between.
The regex seems to work fine when using the search function:

>>> p.search("I am in the the car.")

<_sre.SRE_Match object; span=(8, 15), match='the the'>

The found match is the the, just as I had expected. The weird behaviour is in the findall function:

>>> p.findall("I am in the the car.")

['the']

The found match is now only the. Why the difference?

  • Because findall returns only the capturing groups if there are any (or the complete match otherwise). Commented Apr 17, 2017 at 14:15
  • docs.python.org/3/library/re.html#re.findall "If one or more groups are present in the pattern, return a list of groups" Commented Apr 17, 2017 at 14:16
  • Oh, now I see. Thank you. So I have to use a non-capturing group to solve the issue? I will try it out right now.. Commented Apr 17, 2017 at 14:18
  • I get a sre_constants.error: invalid group reference 1 at position 13 error when changing my group from (...) to (?:...) to make it non-capturing. Perhaps that's because a backreference to a non-capturing group is impossible? Commented Apr 17, 2017 at 14:22
  • @K.Mulier: well, you can't do that because then you have nothing for \1 to match against.. Commented Apr 17, 2017 at 14:22
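
The error discussed in the comments above can be reproduced with a short sketch. It assumes Python 3, where the exception is exposed as re.error; the variable names are only illustrative.

```python
import re

# Capturing version from the question: \1 refers to group 1.
capturing = re.compile(r"(\b\w+)\s+\1")

# Non-capturing version: (?:...) creates no numbered group,
# so \1 has nothing to refer to and compilation fails.
try:
    re.compile(r"(?:\b\w+)\s+\1")
    compile_error = None
except re.error as exc:
    compile_error = str(exc)

print(compile_error)
```

So the group has to stay a capturing one for as long as the pattern uses a backreference to it.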

1 Answer

When using groups in a regular expression, findall() returns only the groups; from the documentation:

If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
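
A minimal sketch of the three cases the documentation describes, using the sentence from the question. The first pattern (a literal match with no groups) is only an illustration and is not part of the original question.

```python
import re

text = "I am in the the car."

# No groups: findall returns the complete matches.
no_groups = re.findall(r"the\s+the", text)

# One group: findall returns only what that group captured.
one_group = re.findall(r"(\b\w+)\s+\1", text)

# Two groups: findall returns one tuple per match, one item per group.
two_groups = re.findall(r"((\b\w+)\s+\2)", text)

print(no_groups)   # ['the the']
print(one_group)   # ['the']
print(two_groups)  # [('the the', 'the')]
```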

You can't avoid using groups when using backreferences, but you can put a new group around the whole pattern:

>>> p = re.compile(r"((\b\w+)\s+\2)")
>>> p.findall("I am in the the car.")
[('the the', 'the')]

The outer group is group 1, so the backreference should be pointing to group 2. You now have two groups, so there are two results per entry. Using a named group might make this more readable:

>>> p = re.compile(r"((?P<word>\b\w+)\s+(?P=word))")

You can filter that back to just the outer group result:

>>> [m[0] for m in p.findall("I am in the the car.")]
['the the']
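
As an alternative that is not part of the answer above, finditer() yields match objects instead of group contents, so the original one-group pattern can be kept unchanged. A small sketch:

```python
import re

# Original pattern from the question, with a single capturing group.
p = re.compile(r"(\b\w+)\s+\1")

text = "I am in the the car."

# group(0) is always the complete match, regardless of how many groups exist.
full_matches = [m.group(0) for m in p.finditer(text)]
spans = [m.span() for m in p.finditer(text)]

print(full_matches)  # ['the the']
print(spans)         # [(8, 15)]
```

This also gives access to the match positions, which findall() does not return.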

1 Comment

Great answer! Thank you Martijn :-)
