re.findall failing for regex with grouping in Python

Question

I'm writing a Python program that uses a regex to find email addresses. re.findall gives the wrong output whenever I use parentheses for grouping. Can anyone point out the mistake or suggest an alternative solution?

Here are two snippets to illustrate:

pat = "[\w]+[ ]*@[ ]*[\w]+.[\w]+"
re.findall(pat, '[email protected] .rtrt.. [email protected] ')

gives the output

['[email protected]', '[email protected]']

However, if I use grouping in this regex and modify the code as

pat = "[\w]+[ ]*@[ ]*[\w]+(.[\w]+)*"
re.findall(pat, '[email protected] .rtrt.. [email protected] ')

the output is

['.com', '.com']

To confirm that the regex is correct, I tried the regex from the second example at http://regexpal.com/ with the same input string, and both email addresses were matched successfully.

You have used character classes in all the places where you shouldn't have, and failed to use one where you should have (or used escaping). Also, that regex fails on loads of valid addresses like [email protected]. I expect that allowing spaces around the @ (which is of course invalid) is done on purpose?
– Tim Pietzcker, Commented Mar 17, 2012 at 8:11

huon · Accepted Answer · 2012-03-17 13:16:08Z

In Python, re.findall returns the whole match only if the pattern contains no groups; if there are groups, it returns the groups instead (a list of strings for one group, a list of tuples for several). To get around this, use a non-capturing group, (?:...). In this case:

pat = r"[\w.]+ *@ *\w+(?:\.\w+)*"
re.findall(pat, '[email protected] .rtrt.. [email protected] ')
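A minimal sketch of the difference between the two kinds of group. The sample addresses (user@example.com and first.last@mail.example.org) are made up for illustration and are not the ones from the question:

```python
import re

# Hypothetical sample text; the addresses are made up for illustration.
text = "user@example.com .rtrt.. first.last@mail.example.org "

# Capturing group: findall returns only what the group captured
# (the last repetition of the group, since it is followed by *).
capturing = r"[\w.]+ *@ *\w+(\.\w+)*"

# Non-capturing group: findall returns the whole match.
non_capturing = r"[\w.]+ *@ *\w+(?:\.\w+)*"

with_group = re.findall(capturing, text)
whole_match = re.findall(non_capturing, text)

print(with_group)   # ['.com', '.org']
print(whole_match)  # ['user@example.com', 'first.last@mail.example.org']
```

The only change between the two patterns is the ?: at the start of the group.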

edited Mar 17, 2012 at 13:16

answered Mar 17, 2012 at 8:01

huon


2 Comments

Tim Pietzcker Over a year ago

You have reproduced all of the errors in @anu.agg's original regex. A somewhat better version (although still much less than optimal) would be "[\w.]+ *@ *\w+(?:\.\w+)*".

huon Over a year ago

@TimPietzcker, ooh, yes, I just modified the group without properly thinking. Replaced.

Honest Abe · Accepted Answer · 2012-03-17 14:21:52Z

You would use groups if you wanted to do something like separate the user from the host.
(The hyphens in the character classes are optional; they are there because some email addresses contain hyphens.)

pat = r'([\w.-]+)@([\w.-]+)'
re.findall(pat, '[email protected] .rtrt.. [email protected] ')

Output:

[('abc', 'cs.stansoft.edu.com'), ('myacc', 'gmail.com')]

To illustrate further, we could replace the host and keep the user from group 1 (\1):

emails = '[email protected] .rtrt.. [email protected] '
pat = r'([\w.-]+)@([\w.-]+)'
re.sub(pat, r'\[email protected]', emails)

Output:

'[email protected] .rtrt.. [email protected] '
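As a self-contained sketch of the same substitution idea: the addresses and the replacement host example.org below are assumptions chosen for illustration, not the values used above.

```python
import re

# Hypothetical input; both addresses are made up for illustration.
emails = "user@example.com .rtrt.. other@mail.example.net "
pat = r"([\w.-]+)@([\w.-]+)"

# \g<1> is the explicit spelling of the \1 backreference: it inserts
# whatever group 1 (the user part) captured. The host is replaced
# with the assumed placeholder example.org.
replaced = re.sub(pat, r"\g<1>@example.org", emails)

print(replaced)  # 'user@example.org .rtrt.. other@example.org '
```

The \g<1> form avoids ambiguity if the replacement text were to start with a digit right after the backreference.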

To have findall return the whole email address, simply remove the parentheses from the pattern:

pat = r'[\w.-]+@[\w.-]+'
re.findall(pat, '[email protected] .rtrt.. [email protected] ')

Output:

['[email protected]', '[email protected]']
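If you want both the whole match and the individual groups, one possible alternative is re.finditer, which yields match objects instead of strings. A rough sketch, again with made-up addresses that are only assumptions for illustration:

```python
import re

# Hypothetical input; both addresses are made up for illustration.
text = "user@example.com .rtrt.. other@mail.example.net "
pat = r"([\w.-]+)@([\w.-]+)"

# group(0) is always the whole match, regardless of how many
# capturing groups the pattern has; group(1) and group(2) are
# the user and host parts.
results = [(m.group(0), m.group(1), m.group(2))
           for m in re.finditer(pat, text)]

for whole, user, host in results:
    print(whole, user, host)
# user@example.com user example.com
# other@mail.example.net other mail.example.net
```

This keeps the capturing groups in the pattern while still giving access to the full address.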
