Basically, I want to read all new emails from an inbox and put them in a database. I'm using Python because it has imaplib, but I know nothing about it.

Currently, I have something like this:

def primitive_get_text_blocks(email_message_instance):
    maintype = email_message_instance.get_content_maintype()
    if maintype == 'multipart':
        return_parts = ""
        for part in email_message_instance.get_payload():
            if part.get_content_maintype() == 'text':
                return_parts += " " + part.get_payload()
        return return_parts
    elif maintype == 'text':
        return email_message_instance.get_payload()
    return ""

fromField=con.escape(email_message["From"])
contentField=con.escape(primitive_get_text_blocks(email_message))

primitive_get_text_blocks is copy-pasted from somewhere. The result is that I get database entries like this:

<META http-equiv=3D"Content-Type" content=3D"text/html; charset=3DUTF-8">

From what I understand, that has something to do with the text being encoded in UTF-7. So I changed to get_payload(decode=True), but that gives me byte arrays. If I append another decode('utf-8'), it sometimes crashes with errors like

'codec error can't decode to ...'.

I don't know how encodings work; I only want a unicode string with the body of my email.

Why is there no simple convert(charset from, charset to)? How do I get a readable email body (and address)? I found the question "IMAP Fetch Encoding" and tried decode_header, but got no further.

--

I assume an encoding is the way bytes represent characters, so with that in mind, shouldn't decode take a byte array and spit out a string? Here on Stack Overflow I also came across somebody claiming it had something to do with being encoded with UTF-8 and UTF-7. What does that even mean?

I did google this, and there appear to be tons of duplicates, but the answers they got didn't really help me (I've tried most of them).

  • That's not UTF-7, that's quoted-printable. Generally you should expect most single-part body parts to be either QP or base64-encoded. The Content-Transfer-Encoding header tells you which (or no encoding, which is one of 7bit, 8bit, or binary). Commented May 27, 2014 at 12:38
  • For text parts, you should not assume UTF-8 or try to guess; you should be examining the charset attribute of the Content-Type header. Commented May 27, 2014 at 12:40
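
To illustrate the first comment: the `=3D` in the database entry is the quoted-printable escape for `=`. Below is a minimal sketch using the standard-library quopri module. The sample bytes are the line quoted in the question, and using UTF-8 for the second step is only an assumption based on what that sample itself declares.

```python
import quopri

# Sample taken from the question: "=3D" is the quoted-printable escape for "=".
raw = b'<META http-equiv=3D"Content-Type" content=3D"text/html; charset=3DUTF-8">'

# Step 1: undo the transfer encoding (bytes in, bytes out).
decoded_bytes = quopri.decodestring(raw)

# Step 2: apply the charset to turn bytes into text.
# Assumption: UTF-8, because that is what this sample declares.
text = decoded_bytes.decode("utf-8")
print(text)
```

These are two separate layers: the transfer encoding (quoted-printable or base64) wraps the bytes, and the charset says how those bytes map to characters. get_payload(decode=True) only undoes the first layer, which is why it returns bytes.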

1 Answer

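Turns out it's quite easy. Most of the documentation I found dates from the glorious past when the unicode function was still a real thing; in Python 3, 'str' does the same job.

So to recap: pass 'decode=True' to 'get_payload' and wrap the result in str(..., 'utf-8'). This assumes the part really is UTF-8; as the comments point out, the charset attribute of the Content-Type header says which encoding to use.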
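
A rough sketch of how this could be combined with the advice from the comments, in Python 3. The names get_text_blocks and get_header_text, the fallback_charset parameter, and the sample message are hypothetical and only here for illustration. Falling back to UTF-8 with errors="replace" is an assumption, chosen so that a badly labelled mail does not crash the import.

```python
import email
from email.header import decode_header, make_header


def get_text_blocks(msg, fallback_charset="utf-8"):
    """Return all text/* parts of a message joined into one str."""
    blocks = []
    # walk() yields the message itself and then every nested part.
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        # decode=True undoes quoted-printable/base64 and returns bytes.
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        # Use the charset declared in Content-Type; fall back if it is missing.
        charset = part.get_content_charset() or fallback_charset
        try:
            blocks.append(payload.decode(charset, errors="replace"))
        except LookupError:
            # The declared charset is unknown to Python.
            blocks.append(payload.decode(fallback_charset, errors="replace"))
    return " ".join(blocks)


def get_header_text(msg, name):
    """Return a header such as From as a readable str ('' if absent)."""
    value = msg[name]
    if value is None:
        return ""
    # decode_header splits the RFC 2047 encoded words; make_header rejoins them.
    return str(make_header(decode_header(value)))


# Hypothetical sample message, not taken from the question.
raw = (
    b"From: =?utf-8?q?J=C3=BCrgen?= <j@example.com>\r\n"
    b"Content-Type: text/plain; charset=iso-8859-1\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"Caf=E9 =3D coffee\r\n"
)
msg = email.message_from_bytes(raw)
body = get_text_blocks(msg)
sender = get_header_text(msg, "From")
print(body)
print(sender)
```

Note that the sample body is ISO-8859-1, so a hard-coded str(..., 'utf-8') would fail on it, whereas reading the charset from the part handles it.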
