2

I'm trying to use a regex to extract any Jabber accounts (email-style addresses) from this page.

I've tried using regex:

\w+@[\w.-]+|\{(?:\w+, *)+\w+\}@[\w.-]+

...but it's not producing the desired results.

3
  • Welcome to SO! I tweaked some of the wording and added a tag to help improve your chance of getting an answer. You may also want to try adding more specific info about what happens when you run the code that isn't working. Good luck! Commented Mar 5, 2015 at 22:11
  • Have a look at regular-expressions.info/email.html. Better yet, scroll down to the section "The Official Standard: RFC 5322" and get scared. Regex is not the right tool for this task. Commented Mar 5, 2015 at 22:17
  • Your question has been asked many times on Stack Overflow. See stackoverflow.com/questions/201323/… for my default answer for this.... Commented Mar 6, 2015 at 0:41

3 Answers

5

This might work:

[^\s@<>]+@[^\s@<>]+\.[^\s@<>]+

import re

# Python 3: the ur'' prefix exists only in Python 2, so a plain raw string is used.
p = re.compile(r'[^\s@<>]+@[^\s@<>]+\.[^\s@<>]+', re.MULTILINE | re.IGNORECASE)
test_str = r'...'
print(p.findall(test_str))

See example.
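To make the behavior of this pattern concrete, here is a minimal sketch that runs it over a short sample string. The sample text and the addresses in it are made up for illustration and are not taken from the page in the question.

```python
import re

# Pattern from the answer above: runs of characters that are not
# whitespace, "@", "<" or ">", joined by "@" and containing a dot after it.
PATTERN = re.compile(r'[^\s@<>]+@[^\s@<>]+\.[^\s@<>]+')

# Hypothetical sample text; the addresses are invented for this example.
sample = "Contact <alice@example.org> or bob.smith@jabber.example.net for help"

matches = PATTERN.findall(sample)
print(matches)  # ['alice@example.org', 'bob.smith@jabber.example.net']
```

Because the character classes only exclude whitespace, "@", "<" and ">", the pattern is quite permissive: an address followed directly by a comma, for example, is returned with the comma attached, so some post-processing may be needed.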


2 Comments

Pretty close, but .@... is not a valid address IMHO. In general, the character . (dot, period, full stop) is allowed provided that it is not the first or last character and does not appear two or more times consecutively. For matching email-address-like patterns, your attempt is fine.
@dognose: I did not try to create a generic regex, only something that would work in this case. A lot has already been said about email validation regex for Python here: stackoverflow.com/questions/8022530/…, no need to continue it here IMO.
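If the dot rule from the first comment matters for a given page, one option is to keep the permissive pattern and filter its matches afterwards. The following is only a sketch under that assumption: `has_valid_dots` is a hypothetical helper name, the sample addresses are invented, and the check covers just the dot rule, not full address validation.

```python
import re

# Same permissive pattern as in the answer above.
CANDIDATE = re.compile(r'[^\s@<>]+@[^\s@<>]+\.[^\s@<>]+')


def has_valid_dots(address):
    """Return True if the local part has no leading, trailing, or doubled dot."""
    local = address.split('@', 1)[0]
    return not (local.startswith('.') or local.endswith('.') or '..' in local)


# Hypothetical sample: one acceptable address and two that break the dot rule.
sample = "ok.name@example.org .@example.org a..b@example.org"

filtered = [m for m in CANDIDATE.findall(sample) if has_valid_dots(m)]
print(filtered)  # ['ok.name@example.org']
```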
4
import re

s = '''
...YOUR HTML page source code HERE..........

'''

# Case-insensitive pattern: local part, "@", domain, and a 2-6 letter TLD.
reobj = re.compile(r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6}\b", re.IGNORECASE)
# Python 3: str is already Unicode, so no .decode('utf-8') is needed.
print(reobj.findall(s))

Result

['[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]']


0

Try this one:

reg_emails=r'^((([0-9a-zA-Z]+)[\_\.\-])*([0-9a-zA-Z]+))@((([0-9a-zA-Z]+)[\_\.\-])*([0-9a-zA-Z]+))\.((([0-9a-zA-Z]+)[\_\.\-])*([0-9a-zA-Z]+))$'
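Note that this pattern is anchored with ^ and $, so it checks whether one whole string looks like an address rather than finding addresses inside a larger text. A minimal sketch of one way to use it for extraction, assuming the page text can be split on whitespace first (the sample text below is invented):

```python
import re

# Pattern copied from the answer above; it is anchored at both ends.
reg_emails = r'^((([0-9a-zA-Z]+)[\_\.\-])*([0-9a-zA-Z]+))@((([0-9a-zA-Z]+)[\_\.\-])*([0-9a-zA-Z]+))\.((([0-9a-zA-Z]+)[\_\.\-])*([0-9a-zA-Z]+))$'
EMAIL = re.compile(reg_emails)

# Hypothetical sample text with two address-like tokens and two other tokens.
sample = "alice@example.org not-an-address bob.smith@jabber.example.net .@example.org"

# Searching the whole text finds nothing because of the anchors.
print(EMAIL.search(sample))  # None

# Testing each whitespace-separated token individually does work.
valid = [token for token in sample.split() if EMAIL.match(token)]
print(valid)  # ['alice@example.org', 'bob.smith@jabber.example.net']
```

Splitting on whitespace is an assumption that suits plain text; addresses embedded in HTML attributes or followed by punctuation would need a different tokenization step.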

