Extract emails from html using regex

Question

I'm trying to extract any jabber accounts (emails) using regex from this page.

I've tried using regex:

\w+@[\w.-]+|\{(?:\w+, *)+\w+\}@[\w.-]+

...but it's not producing the desired results.
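One way to see where this pattern may fall short is to run it on a small made-up string. The sample text and variable names below are assumptions for illustration, not taken from the page in question. In this sketch, \w+ cannot cross a dot, so a dotted local part is cut short, and [\w.-]+ accepts a trailing sentence period as part of the domain:

```python
import re

# The pattern from the question, unchanged.
pattern = re.compile(r'\w+@[\w.-]+|\{(?:\w+, *)+\w+\}@[\w.-]+')

# Hypothetical sample text (not from the original page).
text = 'first.last@example.org, {bob, carol}@jabber.example.net.'

matches = pattern.findall(text)
print(matches)
# ['last@example.org', '{bob, carol}@jabber.example.net.']
```

Whether either effect is the actual cause here depends on the page contents, which the question does not show.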

Welcome to SO! I tweaked some of the wording and added a tag to help improve your chance of getting an answer. You may also want to try adding more specific info about what happens when you run the code that isn't working. Good luck!
– Jaydles, Commented Mar 5, 2015 at 22:11
Have a look at regular-expressions.info/email.html. Better yet, scroll down to the section "The Official Standard: RFC 5322" and get scared. Regex is not the tool for this task.
– Jason Hu, Commented Mar 5, 2015 at 22:17
Your question has been asked many times on Stack Overflow. See stackoverflow.com/questions/201323/… for my default answer to this.
– bmhkim, Commented Mar 6, 2015 at 0:41

Wiktor Stribiżew · Accepted Answer · 2015-03-05 21:40:29Z

5

This might work:

[^\s@<>]+@[^\s@<>]+\.[^\s@<>]+

import re

# No flags are needed: the pattern has no ^/$ anchors and no literal letters.
p = re.compile(r'[^\s@<>]+@[^\s@<>]+\.[^\s@<>]+')
test_str = r'...'
print(p.findall(test_str))

See example.
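As a rough illustration of how this pattern behaves on markup, here is a minimal sketch on a made-up HTML fragment (the fragment and the variable names are assumptions). Because the character class only excludes whitespace, @, < and >, a match stops at tag boundaries but can absorb attribute text that touches the address:

```python
import re

pattern = re.compile(r'[^\s@<>]+@[^\s@<>]+\.[^\s@<>]+')

# Hypothetical fragment: one address in a text node, one inside an attribute.
html = '<td>alice@example.org</td>\n<a href="mailto:carol@example.com">mail</a>'

matches = pattern.findall(html)
print(matches)
# ['alice@example.org', 'href="mailto:carol@example.com"']
```

The text-node address comes out clean, while the mailto: address drags the attribute name and quotes along, so some post-processing would likely be needed on pages that use such links.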

answered Mar 5, 2015 at 21:40

Wiktor Stribiżew


2 Comments

dognose Over a year ago

Pretty close, but .@... is not a valid address IMHO. In general, the character . (dot, period, full stop) is allowed provided that it is not the first or last character and does not appear two or more times consecutively. For matching email-address-like patterns, your attempt is fine.

Wiktor Stribiżew Over a year ago

@dognose: I did not try to create a generic regex, only something that would work in this case. A lot has already been said about email validation regex for Python here: stackoverflow.com/questions/8022530/…, no need to continue it here IMO.
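The dot rule raised in the comments can be checked directly. The sketch below compares the answer's pattern with a stricter variant; the stricter pattern is an assumption written for this illustration (it also applies the dot rule to the domain) and is not a complete address validator:

```python
import re

# Pattern from the answer.
loose = re.compile(r'[^\s@<>]+@[^\s@<>]+\.[^\s@<>]+')

# Hypothetical stricter variant: both sides are dot-separated runs of
# non-dot characters, so no leading, trailing, or doubled dots.
strict = re.compile(r'[^\s@<>.]+(?:\.[^\s@<>.]+)*@[^\s@<>.]+(?:\.[^\s@<>.]+)+')

candidates = ['.@...', 'a..b@example.org', 'first.last@example.org']
results = {
    c: (bool(loose.fullmatch(c)), bool(strict.fullmatch(c)))
    for c in candidates
}
for candidate, (loose_ok, strict_ok) in results.items():
    print(candidate, loose_ok, strict_ok)
# .@... True False
# a..b@example.org True False
# first.last@example.org True True
```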

Aaron · Accepted Answer · 2015-03-06 00:18:54Z

4

s = '''
...YOUR HTML page source code HERE..........

'''

import re

# Case-insensitive match for address-like tokens; the TLD is limited to 2-6 letters.
reobj = re.compile(r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6}\b", re.IGNORECASE)
print(reobj.findall(s))

Result

['[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]']
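One property of this pattern worth knowing is the {2,6} limit on the final label. The sketch below uses made-up addresses (assumptions, not from the page) to show that a longer top-level domain is either skipped or, when the domain has more than one dot, silently cut short:

```python
import re

reobj = re.compile(r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6}\b", re.IGNORECASE)

# Hypothetical addresses chosen to probe the TLD length limit.
six_letters = reobj.findall("user@example.museum")
too_long = reobj.findall("user@example.technology")
truncated = reobj.findall("user@mail.host.technology")

print(six_letters)  # ['user@example.museum']
print(too_long)     # []
print(truncated)    # ['user@mail.host']
```

If the page only contains short top-level domains, this limit does not matter.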

answered Mar 6, 2015 at 0:18

Aaron


slfan · Accepted Answer · 2017-09-10 09:16:40Z

0

Try this one:

reg_emails=r'^((([0-9a-zA-Z]+)[\_\.\-])*([0-9a-zA-Z]+))@((([0-9a-zA-Z]+)[\_\.\-])*([0-9a-zA-Z]+))\.((([0-9a-zA-Z]+)[\_\.\-])*([0-9a-zA-Z]+))$'
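Note that this pattern is anchored with ^ and $, so as written it validates a single string and does not find addresses embedded in markup. A minimal sketch of the difference, using a made-up HTML fragment and stripping the anchors by slicing (both are assumptions for illustration):

```python
import re

# Pattern copied verbatim from the answer.
reg_emails = r'^((([0-9a-zA-Z]+)[\_\.\-])*([0-9a-zA-Z]+))@((([0-9a-zA-Z]+)[\_\.\-])*([0-9a-zA-Z]+))\.((([0-9a-zA-Z]+)[\_\.\-])*([0-9a-zA-Z]+))$'

# Hypothetical sample markup.
html = '<li>alice@example.org</li><li>bob_smith@jabber.example.net</li>'

# Anchored: accepts one whole string, finds nothing inside markup.
is_valid = re.match(reg_emails, 'alice@example.org') is not None
anchored_hits = re.findall(reg_emails, html)

# Drop the leading ^ and trailing $ to search inside a larger text.
# The pattern has capturing groups, so take group(0) for the full match.
unanchored = reg_emails[1:-1]
found = [m.group(0) for m in re.finditer(unanchored, html)]

print(is_valid, anchored_hits, found)
# True [] ['alice@example.org', 'bob_smith@jabber.example.net']
```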

edited Sep 10, 2017 at 9:16

slfan


answered Sep 10, 2017 at 8:48

ytldsimage
