import re
import requests

url = "https://www.siliconvalleypediatricdentistry.com/"
res = requests.get(url)
html = res.text
# Two patterns I tried:
# re.findall(r'([\w0-9._-]+@[\w0-9._-]+\.[\w0-9_-]+)', html)
# re.findall(r"(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)", html)

I found plenty of questions about this, but most of the answers extract "wrong" emails.

This is the output I am getting:

['[email protected]',
 '[email protected]',
 '[email protected]',
 '[email protected]',
 '[email protected]',
 '[email protected]',
 '[email protected]']

Some of them are just JS scripts. Is there a safer regex to use, or a module that does this?

  • Email addresses are more complex than often thought. IMO the easiest way is to use a simpler regex, e.g. \S+@\S+, and actually send an email to that address. Commented Apr 7, 2020 at 19:21
  • @Jan How can I check whether an email address exists without sending an email? Commented Apr 7, 2020 at 19:29
  • You can't. And even if you do send, many hosts won't respond with an error if the username doesn't exist. Commented Apr 7, 2020 at 19:32

2 Answers


This works for me:

re.findall(r'([\w-]+@[\w-]+\.[a-zA-Z]{1,5})',html)

Basically, we force the part after the final dot to be letters only (e.g. .com), so versioned JS script references such as name@1.2.3 don't match.
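As a rough sketch of how this behaves, the snippet below runs the pattern on a made-up page fragment. The names `sample_html`, `ASSET_EXTENSIONS`, and `extract_emails` are hypothetical and not part of the original answer. It also assumes you want to drop matches that look like asset file names (for example `logo@2x.png`), which the pattern alone would still accept because `png` is letters only.

```python
import re

# Pattern from the answer above: the part after the last dot must be letters only.
EMAIL_RE = re.compile(r'[\w-]+@[\w-]+\.[a-zA-Z]{1,5}')

# Hypothetical page fragment, not taken from the real site.
sample_html = '''
<script src="https://cdn.example.com/npm/bootstrap@4.4.1/dist/js/bootstrap.min.js"></script>
<a href="mailto:info@example.com">info@example.com</a>
<img src="/img/logo@2x.png">
'''

# Assumed list of file extensions that can be mistaken for a top-level domain.
ASSET_EXTENSIONS = ('.png', '.jpg', '.jpeg', '.gif', '.svg', '.webp', '.js', '.css')


def extract_emails(html):
    """Return unique, sorted matches that do not end in a known asset extension."""
    candidates = set(EMAIL_RE.findall(html))
    return sorted(c for c in candidates if not c.lower().endswith(ASSET_EXTENSIONS))


print(extract_emails(sample_html))
```

The versioned script reference is rejected by the regex itself, while the `@2x` image name is only removed by the extra extension filter, so the filter list may need adjusting for a given site.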


1 Comment

\w already includes [0-9_], so your character class can be shortened to [-.\w].

You can try this:

r'^\w+([\.-]?\w+)*@\w+([\.-]?\w+)*(\.\w{2,6})+$'

Or you can use your own regex and check whether the email addresses are valid with:

from validate_email import validate_email
is_valid = validate_email('[email protected]')
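Note that the pattern above is anchored with ^ and $, so it matches only a complete string and is not suited to `re.findall` over a whole page. One possible way to use it, sketched below under that assumption, is to collect loose candidates first (here with the `\S+@\S+` pattern suggested in the comments on the question) and then validate each one. The function name `find_valid_emails` and the sample text are hypothetical.

```python
import re

# Loose candidate pattern, as suggested in the comments on the question.
CANDIDATE_RE = re.compile(r'\S+@\S+')
# Anchored pattern from this answer; it only matches a complete string.
VALID_RE = re.compile(r'^\w+([\.-]?\w+)*@\w+([\.-]?\w+)*(\.\w{2,6})+$')


def find_valid_emails(text):
    """Collect whitespace-delimited candidates, then keep those the anchored pattern accepts."""
    found = []
    for candidate in CANDIDATE_RE.findall(text):
        # Trim punctuation that commonly surrounds an address in running text.
        token = candidate.strip('.,;:()<>"\'')
        if VALID_RE.match(token) and token not in found:
            found.append(token)
    return found


sample_text = 'Write to info@example.com, not to bootstrap@4.4.1/dist/js.'
print(find_valid_emails(sample_text))
```

This works on plain text; for raw HTML you would probably want to extract the visible text or the `mailto:` links first, since `\S+` also swallows tags and attribute quotes.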

2 Comments

Nice module. Does it actually send an email? If not, how does it check?
It checks whether the SMTP server exists, and it can check whether that server has the email address without sending an email. More info at: pypi.org/project/validate_email
