I'm writing a Python program that uses a regex to find email addresses. The re.findall function gives the wrong output whenever I use round brackets for grouping. Can anyone point out the mistake or suggest an alternative solution?

Here are two snippets that illustrate the problem:

pat = "[\w]+[ ]*@[ ]*[\w]+.[\w]+"
re.findall(pat, '[email protected] .rtrt.. [email protected] ')

gives the output

['[email protected]', '[email protected]']

However, if I use grouping in this regex and modify the code as

pat = "[\w]+[ ]*@[ ]*[\w]+(.[\w]+)*"
re.findall(pat, '[email protected] .rtrt.. [email protected] ')

the output is

['.com', '.com']

To check the regex, I tried the pattern from the second example at http://regexpal.com/ with the same input string, and both email addresses were matched in full.

  • +1 for excellently asked question. Commented Mar 17, 2012 at 8:10
  • You have used character classes in all the places where you shouldn't have, and failed to use one where you should have (or used escaping). Also, that regex fails on loads of valid addresses like [email protected]. I expect that allowing spaces around the @ (which is of course invalid) is done on purpose? Commented Mar 17, 2012 at 8:11

2 Answers

In Python, re.findall returns the whole match only if the pattern contains no groups. If the pattern contains groups, it returns the groups instead (a list of strings for one group, a list of tuples for several). To get around this, use a non-capturing group (?:...). In this case:

import re
pat = r"[\w.]+ *@ *\w+(?:\.\w+)*"  # raw string, so the backslashes reach the regex engine unchanged
re.findall(pat, '[email protected] .rtrt.. [email protected] ')
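To see the difference side by side, here is a small sketch. The sample text and both addresses are made up for illustration (they are not the ones from the question); the patterns are the capturing and non-capturing variants of the regex above.

```python
import re

# Made-up sample text; the addresses are placeholders, not from the question.
text = "alice@example.com .rtrt.. bob@mail.example.org "

# One capturing group: findall returns only what the group captured
# (for a repeated group, its last repetition in each match).
capturing = re.findall(r"[\w.]+ *@ *\w+(\.\w+)*", text)

# Non-capturing group: no groups are reported, so findall returns whole matches.
non_capturing = re.findall(r"[\w.]+ *@ *\w+(?:\.\w+)*", text)

print(capturing)      # ['.com', '.org']
print(non_capturing)  # ['alice@example.com', 'bob@mail.example.org']
```

Note that the capturing version still matches the full address internally; findall just chooses to report the group instead of the match.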

2 Comments

You have reproduced all of the errors in @anu.agg's original regex. A somewhat better version (although still much less than optimal) would be "[\w.]+ *@ *\w+(?:\.\w+)*".
@TimPietzcker, ooh, yes, I just modified the group without properly thinking. Replaced.

You would use groups if you wanted to do something like separate the user from the host:
(The hyphen in each character class is optional; it is there because some addresses contain hyphens.)

pat = r'([\w\.-]+)@([\w\.-]+)'
re.findall(pat, '[email protected] .rtrt.. [email protected] ')

Output:

[('abc', 'cs.stansoft.edu.com'), ('myacc', 'gmail.com')]

To further illustrate we could replace the host, and keep the user from group 1 (\1):

emails = '[email protected] .rtrt.. [email protected] '
pat = r'([\w\.-]+)@([\w\.-]+)'
re.sub(pat, r'\[email protected]', emails)

Output:

'[email protected] .rtrt.. [email protected] '
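As a self-contained sketch of the same substitution idea: the sample text and the replacement host newhost.example below are assumptions chosen for illustration, not values from this answer. The replacement uses \g<1>, which is equivalent to \1 here but stays unambiguous if a digit ever follows the reference.

```python
import re

# Made-up sample text and replacement host; neither comes from the answer.
emails = "alice@example.com .rtrt.. bob@mail.example.org "
pat = r"([\w.-]+)@([\w.-]+)"

# \g<1> inserts the text captured by group 1 (the user part);
# the host captured by group 2 is dropped and replaced.
replaced = re.sub(pat, r"\g<1>@newhost.example", emails)

print(replaced)  # 'alice@newhost.example .rtrt.. bob@newhost.example '
```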

Simply remove the parentheses from the pattern to match the whole email:

pat = r'[\w\.-]+@[\w\.-]+'
re.findall(pat, '[email protected] .rtrt.. [email protected] ')

Output:

['[email protected]', '[email protected]']
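If you want both the whole address and its parts, one option is re.finditer, which yields match objects instead of strings. This is only a sketch under the same assumptions as before (made-up sample addresses): group(0) is the full match, and group(1) and group(2) are the captured user and host.

```python
import re

# Made-up sample text; the addresses are placeholders.
text = "alice@example.com .rtrt.. bob@mail.example.org "
pat = re.compile(r"([\w.-]+)@([\w.-]+)")

# Each match object exposes the whole match (group 0) and every group,
# so nothing is lost by keeping the parentheses in the pattern.
results = [(m.group(0), m.group(1), m.group(2)) for m in pat.finditer(text)]

for whole, user, host in results:
    print(whole, user, host)
# alice@example.com alice example.com
# bob@mail.example.org bob mail.example.org
```

This avoids having to choose between a grouped and an ungrouped pattern just to satisfy findall.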
