3

I am using imaplib to read Gmail messages in my Python command window. The only problem is that the emails come through with newlines and carriage returns (\r\n) showing in the text. Also, the text does not seem to be formatted correctly: instead of Amount: $36.49, it returns Amount: =2436.49. How can I go about cleaning up this text? Thanks!

Sample email content:

r\nItem name: Scanner\r\nItem=23: 130585100869\r\nPurchase Date: Oct 7, 2011\r\nUnit Price: =2436.49 USD\r\nQty: 1\r\nAmount: =2436.49USD\r\nSubtotal: =2436.49 USD\r\nShipping and handling: =240.00 USD\r\nInsurance - not offered

Code:

import imaplib
import libgmail
import re
import email
from BeautifulSoup import BeautifulSoup

USER = '[email protected]'
PASSWORD = 'password'

#connecting to the gmail imap server
imap_server = imaplib.IMAP4_SSL('imap.gmail.com', 993)
imap_server.login(USER, PASSWORD)
imap_server.select('Inbox')

typ, response = imap_server.search(None, '(SUBJECT "payment received")')

Data = []

for i in response[0].split():
    results, data = imap_server.fetch(i, "(RFC822)")
    Data.append(data)
    break

for i in Data:
    print i
3
  • 1
    This is not HTML, so Beautiful Soup will not help you here. Does it help to know that \r\n is a line terminator, and (if this is the encoding it appears to be) all occurrences of =XX need to be replaced with the ASCII character with hexadecimal codepoint XX? Commented Feb 15, 2012 at 17:54
  • 1
    Are those actual \r\n characters or carriage-return-linefeeds? Commented Feb 15, 2012 at 17:54
  • 1
    Oh, you asked for modules: quopri will decode the =XX notation for you. Commented Feb 15, 2012 at 17:56

4 Answers

6

The data is in quoted-printable encoding. Here is a little data massager that should get you what you want:

text = '''\r\nPurchase Date: Oct 7, 2011\r\nUnit Price: =2436.49 USD\r\nQty: 1\r\nAmount: =2436.49 USD\r\nSubtotal: =2436.49 USD\r\nShipping and handling: =240.00 USD\r\nInsurance - not offered : ----\r\n----------------------------------------------------------------------\r\nTax: --\r\nTotal: =2436.49 USD\r\nPayment: =2436.49 USD\r\nPayment sent to: emailaddress=40gmail.com\r\n----------------------------------------------------------------------\r\n\r\nSincerely,\r\nPayPal\r\n=20\r\n----------------------------------------------------------------------\r\nHelp Center:=20\r\nhttps://www.paypal.com/us/cgi-bin/helpweb?cmd=3D_help\r\nSecurity Center:=20\r\nhttps://www.paypal.com/us/security\r\n\r\nThis email was sent by an automated system, so if you reply, nobody will =\r\nsee it. To get in touch with us, log in to your account and click =\r\n=22Contact Us=22 at the bottom of any page.\r\n\r\n'''

raw_data = text.decode("quopri")  # Python 2: replace each =XX with the real character

# Split on the first colon only, so values that contain ':' stay intact
data = [map(str.strip, l.split(":", 1)) for l in raw_data.splitlines() if ": " in l]

print data
# [['Purchase Date', 'Oct 7, 2011'], ['Unit Price', '$36.49 USD'], ['Qty', '1'], ['Amount', '$36.49 USD'], ['Subtotal', '$36.49 USD'], ['Shipping and handling', '$0.00 USD'], ['Insurance - not offered', '----'], ['Tax', '--'], ['Total', '$36.49 USD'], ['Payment', '$36.49 USD'], ['Payment sent to', 'emailaddress@gmail.com'], ['Help Center', ''], ['Security Center', '']]

Now your data is in a format that is much easier to process. I hope it helps.

Edit: to make it even cuter:

>>> cooked = dict(data)
>>> print cooked["Unit Price"]
$36.49 USD
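
The snippet above relies on Python 2, where str has a decode("quopri") method. As a rough sketch of the same idea on Python 3, assuming the body arrives as bytes and is UTF-8 compatible, something like the following may work. The parse_fields helper and the shortened sample are made up for illustration; quopri.decodestring is the standard library call.

```python
import quopri

# Shortened, hypothetical sample in the same shape as the PayPal email body.
sample = (
    b"Unit Price: =2436.49 USD\r\n"
    b"Qty: 1\r\n"
    b"Payment sent to: emailaddress=40gmail.com\r\n"
)


def parse_fields(raw, charset="utf-8"):
    """Decode quoted-printable bytes and collect 'Key: value' lines into a dict."""
    # Assumption: the decoded bytes are valid in the given charset.
    decoded = quopri.decodestring(raw).decode(charset)
    fields = {}
    for line in decoded.splitlines():
        if ": " in line:
            # Split on the first colon only, so values may contain ':'.
            key, value = line.split(":", 1)
            fields[key.strip()] = value.strip()
    return fields


fields = parse_fields(sample)
print(fields["Unit Price"])
```

Building the dict directly avoids the intermediate list of pairs; if the same label can appear twice in a message, the later value wins, so keep a list instead when that matters.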

3

The \r\n issue

The \r\n problem is caused by you not printing strings, but internal representations thereof. Try this to understand what I mean:

print ['test\n']
print 'test\n'

The i that you print above is a list, not a string, so the first form kicks in: Python prints the list's internal representation, with \r\n shown as escape sequences. Try this:

print(Data[0][0][1])

I identified this by inspecting the object. You should read the documentation of the libraries you are using to understand what exactly this object is composed of, why this particular field holds the message, and how to convert the Data object into something more palatable.
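
As a hedged illustration of what that inspection turns up: the data part of a fetch result is a list in which each message appears as a (header, body) tuple, followed by a bare closing item. The sketch below is written for Python 3 and uses a hand-made response, so it runs without a server. The message_bodies helper and the fake_data value are hypothetical, not part of imaplib.

```python
def message_bodies(fetch_data):
    """Yield the raw message bytes from the data part of an imaplib fetch() result."""
    for part in fetch_data:
        # Message parts are (header, body) tuples; the trailing b')' is a bare item.
        if isinstance(part, tuple) and len(part) == 2:
            yield part[1]


# Hypothetical stand-in for the second value returned by imap_server.fetch(i, "(RFC822)").
fake_data = [(b"1 (RFC822 {23}", b"Subject: hi\r\n\r\nQty: 1\r\n"), b")"]

bodies = list(message_bodies(fake_data))
print(repr(bodies[0]))            # representation: escapes are shown
print(bodies[0].decode("ascii"))  # the text itself: real line breaks
```

The two print calls show the same difference as the list-versus-string example above.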

The encoding issue

Try:

import quopri
print quopri.decodestring(Data[0][0][1])

1 Comment

Advice: when you copy-paste code, make sure to select it and press the double brace button to indent it by 4 spaces. Also check the preview of your text below the input box, before you submit, to make sure it looks alright. It is generally a "turn-off" for others when they see badly formatted questions -- it's in your interest to make them look attractive. Now to the actual problem at hand...
1

If these are actually email messages, you can use the email module to get you started. You can use it to do the proper quoted-printable decoding and get some clean text.
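
A minimal sketch of that approach on Python 3, assuming a single-part text/plain message with a quoted-printable body. The raw bytes below are a made-up stand-in for what an RFC822 fetch returns; the calls used (email.message_from_bytes, policy.default, get_body, get_content) are from the standard library.

```python
import email
from email import policy

# Hypothetical raw message, shaped like the bytes returned by an RFC822 fetch.
raw = (
    b"Subject: payment received\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: text/plain; charset="utf-8"\r\n'
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"Unit Price: =2436.49 USD\r\n"
    b"Payment sent to: emailaddress=40gmail.com\r\n"
)

# policy.default gives an EmailMessage, whose get_content() undoes the
# transfer encoding and the charset in one step.
msg = email.message_from_bytes(raw, policy=policy.default)
body = msg.get_body(preferencelist=("plain",))
text = body.get_content()

lines = text.splitlines()
print(msg["Subject"])
print(lines[0])
```

For multipart messages, get_body picks the first part matching the preference list, so the same code should still find a plain-text part when there is one; it returns None otherwise.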

After that, though, you will need to write your own code to extract the parts you want. This is not a standard format for which parsers would exist. I would use regular expressions.
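
One possible shape for such an expression, as a sketch only: it assumes the text has already been decoded and that amounts always look like "$12.34 USD". The AMOUNT_LINE pattern and the sample text are invented for illustration and would need adjusting for other currencies or layouts.

```python
import re

# Hypothetical decoded text in the same layout as the email above.
text = "Unit Price: $36.49 USD\r\nQty: 1\r\nShipping and handling: $0.00 USD\r\n"

# One 'Label: $amount CUR' pair per line; other lines are ignored.
AMOUNT_LINE = re.compile(
    r"^(?P<label>[^:\r\n]+):\s*\$(?P<amount>\d+(?:\.\d{2})?)\s*(?P<currency>[A-Z]{3})\s*$",
    re.MULTILINE,
)

amounts = {
    m.group("label").strip(): (m.group("amount"), m.group("currency"))
    for m in AMOUNT_LINE.finditer(text)
}
print(amounts)
```

The amounts are kept as strings here; converting them with decimal.Decimal is a reasonable next step if you need arithmetic.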

Note that \r\n is most likely just the carriage-return character followed by a linefeed character, not "backslash, r, backslash, n". When Python shows the representation of a string, as it does in the interactive interpreter or when printing a list, it displays control characters as escape sequences.


0

Just use split and then check to see if the line matches what you're looking for.

You can pretty it up a bit, but this is a fairly simple way to handle it.

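f = yourBlockOfText

# Splits on the literal four characters \r\n; if the text holds real
# CR/LF characters, use f.splitlines() instead.
text = f.split('\\r\\n')
for line in text:
    if line.startswith("Unit"):
        print line
    elif line.startswith("Payment sent to: "):
        print line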
