How can I find an email address inside HTML code with Nokogiri? I suppose I will need to use a regex, but I don't know how.

Example code

    <html>
    <title>Example</title>
    <body>
    This is an example text.
    [email protected]
    </body>
    </html>

There is an answer covering the case where the address is in a mailto: href, but that is not my case. The email addresses are sometimes inside a link, but not always.

Thanks

  • This is definitely not a Nokogiri question; it's a text parsing question in Ruby. I tagged it with Ruby and regex to improve the responses you get. Commented Nov 27, 2012 at 23:03

2 Answers

If you're just trying to extract an email address from a string that happens to be HTML, you don't need Nokogiri.

    html_string   = "Your HTML here..."
    # String#[] with a regex returns the first match, or nil when nothing matches
    email_address = html_string[/[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}/i]

This isn't a perfect solution, though: the RFC's definition of a 'valid' email address is very lenient, so most regular expressions you come across (the one above included) miss valid edge-case addresses. For example, according to the RFC,

[email protected]

is a valid email address, but it will not be matched by the regular expression above as it stands.
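To make the limitation concrete, here is a minimal sketch of how that pattern behaves on a few inputs. The constant name STRICT_RE, the helper first_email, and the sample addresses are all made up for illustration; they are not from the question.

```ruby
# The pattern from the answer above; STRICT_RE is an illustrative name.
STRICT_RE = /[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}/i

# Hypothetical helper: returns the first match as a String, or nil if there is none.
def first_email(text)
  text[STRICT_RE]
end

first_email("Contact: jane.doe@example.com")  #=> "jane.doe@example.com"
first_email("No address here")                #=> nil
# The {2,4} limit cuts a longer top-level domain short instead of failing:
first_email("curator@example.museum")         #=> "curator@example.muse"
```

Because the pattern is unanchored, a too-long top-level domain yields a truncated match rather than no match, which is easy to overlook.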

2 Comments

The real reason this is not a perfect solution is that it finds only the first email address on the page.
This is a perfect solution for the question asked here. The question had nothing to do with parsing multiple addresses.

Just use a regex on the HTML string; there is no need for Nokogiri (as @deefour suggested). For the regex itself, I'd suggest the one (called AUTO_EMAIL_RE) used by the rails_autolink gem:

/[\w.!#\$%+-]+@[\w-]+(?:\.[\w-]+)+/

This should catch those edge cases that stricter regex filters miss:

RE = /[\w.!#\$%+-]+@[\w-]+(?:\.[\w-]+)+/

RE.match('[email protected]')
#=> #<MatchData "[email protected]">

RE.match('[email protected]')
#=> #<MatchData "[email protected]">

Note that if you really want to match all valid email addresses, you're going to need a mighty big regex.
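Since the addresses in the question can appear both in plain text and inside links, one option is to collect every occurrence with String#scan rather than stopping at the first match. This is only a sketch: the names EMAIL_RE and extract_emails and the sample HTML are assumptions made for illustration.

```ruby
# Same pattern as RE above, under an illustrative name.
EMAIL_RE = /[\w.!#\$%+-]+@[\w-]+(?:\.[\w-]+)+/

# Hypothetical helper: every distinct address in the string, in order of appearance.
# The pattern has no capturing groups, so scan returns the full matches.
def extract_emails(html)
  html.scan(EMAIL_RE).uniq
end

# Made-up sample page with one address in plain text and one inside a link.
html = <<~HTML
  <html>
  <body>
  Write to first.last+tag@example.org or
  <a href="mailto:support@example.co.uk">support@example.co.uk</a>.
  </body>
  </html>
HTML

extract_emails(html)
#=> ["first.last+tag@example.org", "support@example.co.uk"]
```

The uniq call is there because an address inside a mailto link usually shows up twice, once in the href and once in the link text.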
