Parse text/html part of email source using JavaScript

Question

Using JavaScript, I need to parse the Content-Type text/html portion of an email message and extract just the HTML part. Here's an example of the relevant part of the mail source:

------=_Part_1504541_510475628.1327512846983
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: 7bit


<html ... a bunch of html ...

/html>

I want to extract everything between (and including) the <html> tags after text/html. How do I do this?

NOTE: I'm OK with a hacky regex. I don't expect this to be bulletproof.

Ωmega · Accepted Answer · 2012-07-05 14:10:11Z

5

Based on RFC/MIME documentation, the encapsulation boundary is defined as a line consisting entirely of two hyphen characters ("-", decimal code 45) followed by the boundary parameter value from the Content-Type header field.

Note: JavaScript (before ES2018 added the s flag) has no /s modifier to make the dot . match all characters, including line breaks. To match absolutely any character, you can use a character class that contains a shorthand class and its negated version, such as [\s\S].
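To make the note concrete, here is a minimal sketch comparing the dot with [\s\S] on a two-line string. The sample string and variable names are made up for illustration.

```javascript
// Two-line sample string (an assumption for illustration).
var sample = "<b>line one\nline two</b>";

// "." does not match "\n", so this pattern cannot span the line break.
var withDot = /<b>(.*)<\/b>/.exec(sample);

// [\s\S] matches any character, including line breaks.
var withClass = /<b>([\s\S]*)<\/b>/.exec(sample);

console.log(withDot === null);          // the dot version finds no match
console.log(withClass[1].split("\n"));  // both lines end up in capture group 1
```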

Regex:

\n--[^\n\r]*\r?\nContent-Type: text\/html[\s\S]*?\r?\n\r?\n([\s\S]*?)\r?\n\r?\n--

JavaScript:

var matches = /\n--[^\n\r]*\r?\nContent-Type: text\/html[\s\S]*?\r?\n\r?\n([\s\S]*?)\r?\n\r?\n--/i.exec(mail);
var html = matches ? matches[1] : null; // capture group 1 holds the HTML body
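As a quick check of the pattern, here is a minimal sketch run against a made-up message. The sample mail string and its short boundary are assumptions for illustration. The tail of the pattern is written as \r?\n\r?\n-- so that a stray carriage return is not captured when the message uses CRLF line endings.

```javascript
// Made-up two-part message with CRLF line endings (assumption for illustration).
var mail = [
  "MIME-Version: 1.0",
  "Content-Type: multipart/alternative;",
  "    boundary=\"----=_Part_1\"",
  "",
  "------=_Part_1",
  "Content-Type: text/plain; charset=UTF-8",
  "",
  "Hello",
  "",
  "------=_Part_1",
  "Content-Type: text/html; charset=UTF-8",
  "Content-Transfer-Encoding: 7bit",
  "",
  "<html><body><p>Hello</p></body></html>",
  "",
  "------=_Part_1--",
  ""
].join("\r\n");

// Delimiter line, then a Content-Type: text/html header, then the remaining
// headers up to the first blank line, then the body up to the next
// blank line followed by "--".
var pattern = /\n--[^\n\r]*\r?\nContent-Type: text\/html[\s\S]*?\r?\n\r?\n([\s\S]*?)\r?\n\r?\n--/i;

var matches = pattern.exec(mail);
var html = matches ? matches[1] : null; // null when no text/html part is found

console.log(html);
```

Like the original, this assumes that Content-Type is the first header of the part and that a blank line precedes the next delimiter.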

edited Jul 5, 2012 at 14:10

answered Jul 3, 2012 at 21:43

Ωmega


Steve Smith · Answer · 2012-07-04 15:11:44Z

3

The answer by Ωmega is close but you can't be sure that the boundary contains the - character.

You first need to look within the headers. The headers and body of the actual email content will be separated by \r\n\r\n. You should see a header something like:

Content-Type: multipart/alternative;
    boundary="----=_Part_1504541_510475628.1327512846983"

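This boundary value is what you use to find the actual divider: each delimiter line is -- followed by the boundary value. You can then construct a regexp just like Ωmega's but substitute in this divider, escaping any regex metacharacters it contains.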

The only thing to be aware of is that the last boundary will have -- at the end in addition to the normal boundary content.
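The approach described above could be sketched roughly as follows. The helper names escapeRegExp and extractHtmlPart are hypothetical, the sample message is made up around the boundary from the question, and, as in Ωmega's regex, the sketch assumes Content-Type is the first header of the text/html part.

```javascript
// Escape characters that have special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: returns the body of the first text/html part, or null.
function extractHtmlPart(mail) {
  // Read the boundary parameter from the Content-Type header.
  var boundaryMatch = /boundary="?([^"\r\n;]+)"?/i.exec(mail);
  if (!boundaryMatch) {
    return null;
  }
  // Each delimiter line is "--" followed by the boundary value.
  var delimiter = "--" + escapeRegExp(boundaryMatch[1]);
  var pattern = new RegExp(
    "\\r?\\n" + delimiter + "\\r?\\n" +            // opening delimiter line
    "Content-Type: text/html[\\s\\S]*?\\r?\\n\\r?\\n" + // part headers up to the blank line
    "([\\s\\S]*?)" +                                // part body
    "\\r?\\n" + delimiter,                          // next (or closing "...--") delimiter
    "i"
  );
  var partMatch = pattern.exec(mail);
  return partMatch ? partMatch[1].trim() : null;
}

// Made-up message for illustration.
var boundary = "----=_Part_1504541_510475628.1327512846983";
var sampleMail = [
  "MIME-Version: 1.0",
  "Content-Type: multipart/alternative;",
  "    boundary=\"" + boundary + "\"",
  "",
  "--" + boundary,
  "Content-Type: text/plain; charset=UTF-8",
  "",
  "Hello",
  "",
  "--" + boundary,
  "Content-Type: text/html; charset=UTF-8",
  "Content-Transfer-Encoding: 7bit",
  "",
  "<html><body><p>Hello</p></body></html>",
  "",
  "--" + boundary + "--",
  ""
].join("\r\n");

console.log(extractHtmlPart(sampleMail));
```

Because the pattern only matches the start of the trailing delimiter, it works for both an ordinary delimiter and the final one that ends in --.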

answered Jul 4, 2012 at 15:11

Steve Smith


1 Comment

Ωmega Over a year ago

Steve, I have edited my answer with a note from the documentation: the boundary line has to start with at least two - characters...

Leo · Answer · 2012-07-03 21:12:31Z

2

var html = source.toString().substr(source.toString().indexOf("\n\n")).trim();
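The one-liner above assumes that source holds only the text/html part (headers, a blank line, then the body), that lines end in \n, and that a blank line exists. A slightly more defensive sketch, with a hypothetical helper name bodyAfterHeaders, might look like this:

```javascript
// Hypothetical helper: returns everything after the first blank line,
// or null when no header/body separator is found.
function bodyAfterHeaders(source) {
  var text = String(source);
  // Accept both LF and CRLF line endings for the blank separator line.
  var separator = /\r?\n\r?\n/.exec(text);
  if (!separator) {
    return null;
  }
  return text.slice(separator.index + separator[0].length).trim();
}

// Made-up part for illustration.
var part =
  "Content-Type: text/html; charset=UTF-8\r\n" +
  "Content-Transfer-Encoding: 7bit\r\n" +
  "\r\n" +
  "<html><body>Hi</body></html>\r\n";

console.log(bodyAfterHeaders(part));
```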

answered Jul 3, 2012 at 21:12

Leo

