1

I need to parse an email file in emlx format (the Mac OS X Mail file format) and extract some information from it using regular expressions in Python.

The email contains the following format, with a lot of text before and after.

...

Name and Address (multi line)

Delivery estimate: SOMEDATE

BOOKNAME
AUTHOR and PRICE

SELLER

...

The example is as follows.

...

Engineer1 
31500 N. Mopac Circle.
Company, Building A, 3K.A01
Dallas, TX 78759
United States

Delivery estimate: February 3, 2011

1 "Writing Compilers and Interpreters"
Ronald Mak; Paperback; $21.80

Sold by: Textbooksrus LLC

...

How can I parse the email to extract them? I normally use line = file.readline(); for line in lines, but in this case some of the info is multi-line (the address for example).

The thing is that this information is just one part of a big file, so I need a way to detect it.

1
  • I found the bug. Zip code 78759 is actually in Austin, not Dallas ;-) Commented Feb 3, 2011 at 22:42

3 Answers

1

I don't think you need regular expressions. You could load the file with readlines(), then iterate over the lines looking for "Delivery estimate:" with the string method startswith() (it is a method of str objects, not a function in the string module). At that point, you have the line number where the data is located.

You can get the address by scanning backwards from the line number to find the block of text delimited by blank lines. Don't forget to use strip() when looking for blank lines.

Then do a forward scan from the delivery estimate line to pick up the other info. This is likely simpler than a regular expression, and probably faster too.
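
The scan described above could look roughly like this. It is only a sketch: the function name extract_order, the dictionary keys, and the sample text are made up for illustration, and it assumes the "Sold by:" line marks the end of the interesting part.

```python
def extract_order(lines):
    """Locate the 'Delivery estimate:' line and collect the blocks around it."""
    stripped = [line.strip() for line in lines]
    # Find the anchor line; raises StopIteration if it is missing.
    idx = next(i for i, line in enumerate(stripped)
               if line.startswith("Delivery estimate:"))

    # Scan backwards: skip blank lines, then collect the address block.
    i = idx - 1
    while i >= 0 and not stripped[i]:
        i -= 1
    address = []
    while i >= 0 and stripped[i]:
        address.insert(0, stripped[i])
        i -= 1

    # Scan forwards: keep non-blank lines up to and including "Sold by:".
    details = []
    for line in stripped[idx + 1:]:
        if line:
            details.append(line)
        if line.startswith("Sold by:"):
            break

    return {
        "address": address,
        "estimate": stripped[idx].split(":", 1)[1].strip(),
        "details": details,
    }


# Hypothetical sample standing in for the real email body.
SAMPLE = """\
Some unrelated text before.

Engineer1
31500 N. Mopac Circle.
Company, Building A, 3K.A01
Dallas, TX 78759
United States

Delivery estimate: February 3, 2011

1 "Writing Compilers and Interpreters"
Ronald Mak; Paperback; $21.80

Sold by: Textbooksrus LLC

Some unrelated text after.
"""

result = extract_order(SAMPLE.splitlines())
print(result["address"])
print(result["estimate"])
print(result["details"])
```

With a real file you would pass in the result of readlines() instead of SAMPLE.splitlines().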


0

Do data = file.read(), which gives you the whole shebang as one string, and then add line-start and line-end anchors (^ and $, with re.MULTILINE) to your regex where needed.
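
Something along these lines might work on the whole string. It is a sketch rather than a tested parser for real emlx files: the group names and the sample text are invented here, and it assumes the layout is exactly as in the question (address block, delivery line, book line, author/price line, seller line).

```python
import re

PATTERN = re.compile(
    r"""
    (?P<address>(?:^.*\S.*\n)+)                  # consecutive non-blank lines
    \s*^Delivery\ estimate:\ (?P<estimate>.+)\n  # the anchor line
    \s*^(?P<book>.+)\n                           # quantity and title
    (?P<author_price>.+)\n                       # author; format; price
    \s*^Sold\ by:\ (?P<seller>.+)                # seller line
    """,
    re.MULTILINE | re.VERBOSE,
)

# Hypothetical sample standing in for data = file.read().
data = """\
Some unrelated text before.

Engineer1
31500 N. Mopac Circle.
Company, Building A, 3K.A01
Dallas, TX 78759
United States

Delivery estimate: February 3, 2011

1 "Writing Compilers and Interpreters"
Ronald Mak; Paperback; $21.80

Sold by: Textbooksrus LLC

Some unrelated text after.
"""

match = PATTERN.search(data)
if match:
    address_lines = match.group("address").strip().splitlines()
    print(address_lines)
    print(match.group("estimate"))
    print(match.group("book"))
    print(match.group("author_price"))
    print(match.group("seller"))
```

re.MULTILINE makes ^ match at the start of every line, and re.VERBOSE allows the pattern to be spread over several commented lines (which is why the literal spaces are escaped).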


0

You could split on blank lines (the double newline "\n\n") and work from there:

>>> s= """
... Engineer1 
... 31500 N. Mopac Circle.
... Company, Building A, 3K.A01
... Dallas, TX 78759
... United States
... 
... Delivery estimate: February 3, 2011
... 
... 1 "Writing Compilers and Interpreters"
... Ronald Mak; Paperback; $21.80
... 
... Sold by: Textbooksrus LLC
... """
>>> # strip() drops the leading/trailing newline so the first block starts at the name
>>> name, estimate, author_price, seller = s.strip().split("\n\n")
>>> print(name)
Engineer1 
31500 N. Mopac Circle.
Company, Building A, 3K.A01
Dallas, TX 78759
United States
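
If the order details sit somewhere inside a larger file, the same splitting idea may still work: split everything into blocks and look for the block that starts with "Delivery estimate:". A rough sketch, where find_order_blocks and the sample text are hypothetical and the block order is assumed to match the question:

```python
import re


def find_order_blocks(text):
    """Return (address, estimate, book, seller) blocks, or None if not found."""
    # Split on blank lines, tolerating stray whitespace on the "empty" line.
    blocks = [b.strip() for b in re.split(r"\n\s*\n", text) if b.strip()]
    for i, block in enumerate(blocks):
        if block.startswith("Delivery estimate:"):
            # Need one block before and two blocks after the anchor.
            if i == 0 or i + 2 >= len(blocks):
                return None
            return blocks[i - 1], block, blocks[i + 1], blocks[i + 2]
    return None


# Hypothetical sample standing in for the full file contents.
big_text = """\
Some unrelated text before.

Engineer1
31500 N. Mopac Circle.
Company, Building A, 3K.A01
Dallas, TX 78759
United States

Delivery estimate: February 3, 2011

1 "Writing Compilers and Interpreters"
Ronald Mak; Paperback; $21.80

Sold by: Textbooksrus LLC

Some unrelated text after.
"""

found = find_order_blocks(big_text)
if found:
    address, estimate, book, seller = found
    print(address)
    print(estimate)
    print(book)
    print(seller)
```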

1 Comment

The thing is that this information is just one part of a big file, so I need a way to detect it.
