Python email encoding and decoding problems

Question

Basically I want to read all new emails from an inbox and put them in a database. I use Python because it has imaplib, but I know nothing about it.

Currently, I have something like this:

def primitive_get_text_blocks(email_message_instance):
    maintype = email_message_instance.get_content_maintype()
    if maintype == 'multipart':
        return_parts = ""
        for part in email_message_instance.get_payload():
            if part.get_content_maintype() == 'text':
                return_parts+= " "+ part.get_payload()
        return return_parts
    elif maintype == 'text':
        return email_message_instance.get_payload()
    return ""

fromField=con.escape(email_message["From"])
contentField=con.escape(primitive_get_text_blocks(email_message))

primitive_get_text_blocks is copy-pasted from somewhere. The result is that I get database entries like this:

<META http-equiv=3D"Content-Type" content=3D"text/html; charset=3DUTF-8">

From what I understand, that has something to do with being encoded in utf-7. So I changed to get_payload(decode=True), but that gives me byte-arrays. If I append another decode('utf-8'), it sometimes crashes with errors like

'codec error can't decode to ...'.

I don't know how encodings work, I only want a unicode string with the body of my email.

Why is there no simple convert(charset from, charset to)? How do I get a readable email body (and address)? I found the question "IMAP Fetch Encoding", but using decode_header got me no further.

--

I assume an encoding is the way bytes represent characters, so with that in mind, shouldn't decode take a byte array and spit out a string? Here on Stack Overflow I came across somebody claiming it had something to do with being encoded with utf-8 and utf-7. What does that even mean?

I did google, and there appear to be tons of duplicates, but the answers they got didn't really help me out (I've tried most of them).

That's not UTF-7, that's quoted-printable. Generally you should expect most single-part body parts to be either QP or base64-encoded. The Content-Transfer-Encoding header tells you which (or no encoding, which is one of 7bit, 8bit, or binary).
– tripleee, Commented May 27, 2014 at 12:38
For text parts, you should not assume UTF-8 or try to guess; you should be examining the charset attribute of the Content-Type header.
– tripleee, Commented May 27, 2014 at 12:40
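
To illustrate the two comments above, here is a minimal sketch that takes the line from the question and undoes the two layers separately: first the quoted-printable transfer encoding, then the character set. It uses the standard-library quopri module; the variable names are only for illustration, and the charset is taken from what the line itself declares (UTF-8).

```python
import quopri

# The line from the question, exactly as it ended up in the database:
# still quoted-printable, so "=" is written as "=3D".
raw = b'<META http-equiv=3D"Content-Type" content=3D"text/html; charset=3DUTF-8">'

# Step 1: undo the Content-Transfer-Encoding (quoted-printable -> raw bytes).
decoded_bytes = quopri.decodestring(raw)

# Step 2: turn the bytes into text using the declared charset (UTF-8 here).
text = decoded_bytes.decode("utf-8")

print(text)
```

In a real message you would not call quopri yourself; get_payload(decode=True) performs step 1 based on the Content-Transfer-Encoding header, which leaves only step 2 to do.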

user3679326 · Accepted Answer · 2014-05-28 23:56:11Z

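Turns out it's quite easy. Even though a lot of documentation still points to the glorious Python 2 past, when the unicode function was a real thing, in Python 3 str does the same job.

So to recap: pass decode=True to get_payload, which undoes the transfer encoding and returns bytes, and wrap the result in str(..., 'utf-8'). That only works when the part really is UTF-8; as the comments on the question point out, the charset attribute of the Content-Type header tells you which codec to use.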
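
A minimal sketch of that recap, assuming Python 3 and the standard-library email package. The function name get_text_blocks, the UTF-8 fallback, and the sample message are my own assumptions for illustration, not part of the original code. It reads the charset from each part with get_content_charset() and uses errors="replace" so an odd byte does not crash the import.

```python
import email


def get_text_blocks(msg):
    """Return all text/* parts of a message joined into one str."""
    blocks = []
    for part in msg.walk():  # walk() also yields the message itself if it is not multipart
        if part.get_content_maintype() != "text":
            continue
        # decode=True undoes quoted-printable/base64 and returns bytes.
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        # Assumption: fall back to UTF-8 when no charset is declared.
        charset = part.get_content_charset() or "utf-8"
        try:
            blocks.append(payload.decode(charset, errors="replace"))
        except LookupError:
            # The declared charset is unknown to Python; fall back again.
            blocks.append(payload.decode("utf-8", errors="replace"))
    return " ".join(blocks)


# Hypothetical sample message: Latin-1 text sent as quoted-printable.
raw = (
    b"From: sender@example.com\r\n"
    b"Content-Type: text/plain; charset=iso-8859-1\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"caf=E9 =3D coffee\r\n"
)
msg = email.message_from_bytes(raw)
body = get_text_blocks(msg)
print(body)
```

Decoding with a hard-coded 'utf-8' would fail on this sample, because the byte 0xE9 is not valid UTF-8 on its own. That is most likely the kind of codec error described in the question.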
