
My users will send me posts by email, à la Posterous.

I'm using Google App Engine (GAE) to receive and parse emails. GAE returns the text part of the message.

I need to extract the post from the plain text part of the message.

The plain text can be "contaminated" with promotional headers, footers, signatures, etc.

I would also like to leave out phrases such as "please post this:" that some people candidly include.

How would you achieve this?

Are there any tools (simpler than regex) I can use?

UPDATE

Examples:

(in all these examples the post is "Lorem ipsum sit amet...")

=====

Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.

Victor P
[email protected]
visit my blog at: www.example.com/victor

=====

Hello, I like your page. Please can you include this: Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.

=====

Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.

Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.

Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.

=====

If you find more examples of what an email can look like, please feel free to add them to the post.

  • It's far from trivial, but depending on the specifics, a few tricks could address a fair proportion of the targeted situations. Can you provide more details, or examples showing the particular format you have at hand (for example, "plain text = no mark-up at all")? Commented Dec 7, 2009 at 14:55
  • post a couple of samples on what should and shouldn't get through the regex. Commented Dec 7, 2009 at 15:22
  • I edited the post and added some examples Commented Dec 7, 2009 at 15:52

1 Answer


I would go with a list of compiled regular expressions. Something along the lines of:

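import re

# Patterns for text that should be stripped from the message body.
regexes = (
    re.compile(r"visit my blog at: .*$", re.IGNORECASE),
    re.compile(r"please post this:", re.IGNORECASE),
    re.compile(r"please can you include this:", re.IGNORECASE),
    # etc
)

for filePath in files:  # files: an iterable of paths to the message texts
    with open(filePath) as file:
        for line in file:
            # Apply every pattern to the line, then print the cleaned line once.
            for regex in regexes:
                line = regex.sub("", line)
            print(line, end="")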
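
Since GAE hands over the message text as a string rather than a file, the same idea can be wrapped in a function that works on the whole body. What follows is only a rough sketch under stated assumptions: the name extract_post, the two patterns, and the rule "drop the last paragraph if it contains a signature marker" are illustrative guesses based on the examples in the question, not a complete solution.

```python
import re

# Assumed lead-in phrases, taken from the examples in the question.
# Only a prefix on the first line (up to 200 characters) is considered.
LEAD_IN = re.compile(
    r"\A[^\n]{0,200}?\bplease (?:post|can you include) this:\s*",
    re.IGNORECASE,
)

# Assumed signature marker; extend with other promotional phrases as needed.
SIGNATURE_MARKER = re.compile(r"^\s*visit my blog at:", re.IGNORECASE)


def extract_post(text):
    """Return the post with a known lead-in and trailing signature removed."""
    # Remove a greeting/request prefix such as "... Please can you include this:".
    text = LEAD_IN.sub("", text.strip(), count=1)

    # Split the body into paragraphs separated by blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

    # Heuristic: if any line of the last paragraph matches a signature marker,
    # treat the whole paragraph (name, email, blog link) as a signature.
    if len(paragraphs) > 1 and any(
        SIGNATURE_MARKER.match(line) for line in paragraphs[-1].splitlines()
    ):
        paragraphs.pop()

    return "\n\n".join(paragraphs)
```

Working on paragraphs instead of single lines lets the name and email lines of a signature go away together with the "visit my blog at:" line. Messages that match none of the patterns come back unchanged, so new phrases have to be added to the patterns as they turn up.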