2

Using JavaScript, I need to parse the text/html part of an email message and extract just the HTML. Here's an example of the relevant part of the mail source:

------=_Part_1504541_510475628.1327512846983
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: 7bit


<html ... a bunch of html ...

</html>

I want to extract everything between (and including) the <html> tags after text/html. How do I do this?

NOTE: I'm OK with a hacky regex. I don't expect this to be bulletproof.

3 Answers

5

Based on RFC/MIME documentation, the encapsulation boundary is defined as a line consisting entirely of two hyphen characters ("-", decimal code 45) followed by the boundary parameter value from the Content-Type header field.

Note: When this answer was written, JavaScript had no /s modifier to make the dot . match every character, including line breaks (ES2018 has since added the s flag). To match absolutely any character in any engine, you can use a character class that combines a shorthand class with its negated version, such as [\s\S].
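To make the difference concrete, here is a minimal sketch comparing the plain dot with [\s\S]. The sample string and variable names are made up for illustration.

```javascript
// Hypothetical sample text containing line breaks.
var sample = "<p>\nhello\n</p>";

// Without a dot-all flag, the dot does not match line terminators,
// so this pattern cannot span the line breaks and finds nothing.
var dotMatch = /<p>.*<\/p>/.exec(sample);

// [\s\S] matches any character at all, including "\n".
var classMatch = /<p>[\s\S]*<\/p>/.exec(sample);

console.log(dotMatch === null);        // true
console.log(classMatch[0] === sample); // true
```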


Regex:

\n--[^\n\r]*\r?\nContent-Type: text\/html[\s\S]*?\r?\n\r?\n([\s\S]*?)\n\r?\n--

JavaScript:

// matches[1] holds the captured HTML body; matches is null when nothing is found
var matches = /\n--[^\n\r]*\r?\nContent-Type: text\/html[\s\S]*?\r?\n\r?\n([\s\S]*?)\n\r?\n--/i.exec(mail);
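As a rough check, the pattern can be run against a small made-up message. The message text and boundary value below are assumptions, not taken from a real mail. Note that the pattern as written expects an empty line between the HTML body and the next delimiter, which this sample provides.

```javascript
// Hypothetical two-part message with CRLF line endings; the boundary value is made up.
var mail = [
  "Content-Type: multipart/alternative; boundary=\"----=_Part_1_2.3\"",
  "",
  "------=_Part_1_2.3",
  "Content-Type: text/plain; charset=UTF-8",
  "",
  "Hello",
  "",
  "------=_Part_1_2.3",
  "Content-Type: text/html; charset=UTF-8",
  "Content-Transfer-Encoding: 7bit",
  "",
  "<html><body>Hello</body></html>",
  "",
  "------=_Part_1_2.3--",
  ""
].join("\r\n");

var pattern = /\n--[^\n\r]*\r?\nContent-Type: text\/html[\s\S]*?\r?\n\r?\n([\s\S]*?)\n\r?\n--/i;
var found = pattern.exec(mail);

// On CRLF input, group 1 can end with a stray "\r", so trim it.
var htmlBody = found ? found[1].trim() : null;
console.log(htmlBody); // <html><body>Hello</body></html>
```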

3

The answer by Ωmega is close but you can't be sure that the boundary contains the - character.

You first need to look within the headers. The headers and body of the actual email content will be separated by \r\n\r\n. You should see a header something like

Content-Type: multipart/alternative;
    boundary="------=_Part_1504541_510475628.1327512846983"

This boundary is what you can then use to find the actual divider. You can then construct a regexp just like Ωmega's but substitute in this divider.

The only thing to be aware of is that the last boundary will have -- at the end in addition to the normal boundary content.
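The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: extractHtmlPart and escapeRegExp are hypothetical helper names, the message is a single-level multipart with no nesting, and the body is not transfer-decoded.

```javascript
// Escape characters that have special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: returns the text/html body of a raw message, or null.
function extractHtmlPart(raw) {
  // The top-level headers end at the first blank line.
  var headerEnd = raw.search(/\r?\n\r?\n/);
  if (headerEnd === -1) return null;
  var headers = raw.slice(0, headerEnd);

  // Read the boundary parameter, quoted or unquoted.
  var m = /boundary\s*=\s*(?:"([^"]+)"|([^\s;]+))/i.exec(headers);
  if (!m) return null;
  var boundary = m[1] || m[2];

  // Each delimiter line is "--" + boundary; the last one has an extra "--" suffix.
  var delimiter = new RegExp("\\r?\\n--" + escapeRegExp(boundary) + "(?:--)?");
  var parts = raw.slice(headerEnd).split(delimiter);

  for (var i = 0; i < parts.length; i++) {
    // Drop the line break that follows the delimiter line.
    var part = parts[i].replace(/^\r?\n/, "");
    var split = part.search(/\r?\n\r?\n/);
    if (split === -1) continue; // preamble, epilogue, or a part without a body
    var partHeaders = part.slice(0, split);
    if (/^Content-Type:\s*text\/html/im.test(partHeaders)) {
      return part.slice(split).trim();
    }
  }
  return null;
}

// Usage with a made-up message (the boundary value is an assumption).
var message = [
  "Content-Type: multipart/alternative;",
  "    boundary=\"----=_Part_1_2.3\"",
  "",
  "------=_Part_1_2.3",
  "Content-Type: text/plain; charset=UTF-8",
  "",
  "Hello",
  "------=_Part_1_2.3",
  "Content-Type: text/html; charset=UTF-8",
  "Content-Transfer-Encoding: 7bit",
  "",
  "<html><body>Hello</body></html>",
  "------=_Part_1_2.3--",
  ""
].join("\r\n");

console.log(extractHtmlPart(message)); // <html><body>Hello</body></html>
```

Building the delimiter from the header, rather than matching any line that starts with --, avoids stopping early on HTML content that happens to contain such a line.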

1 Comment

Steve, I have edited my answer with a note from the documentation: the boundary line has to start with at least two - characters...
2
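var text = source.toString(); // assumes source holds a single part with LF line endings
var html = text.slice(text.indexOf("\n\n")).trim(); // everything after the first blank line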
