Python Decoding/Encoding Problems

Question

I know that a lot of people have reported problems with string encodings in Python, but no matter what I try, I can't figure out how to fix mine. I'm using TCP sockets to connect to a web server and send it an HTTP request. I read the response into a series of buffers, decode each one, and concatenate them into a single response string. When I do this, however, I get a UnicodeDecodeError. I want my program to work with many different websites, so is there a solution that would work with just about any site I give it?

Thank you for your time.

Some code:

def getAllFromSocket(socket):
    '''Reads all data from a socket and returns a string of it.'''
    more_bytes = True
    message = ''
    if(socket!=None):
        while(more_bytes):
            buffer = socket.recv(1024)
            if len(buffer) == 0:
                more_bytes = False
            else:
                message += buffer.decode('utf-8')
    return message

So when I do this:

received_message = getAllFromSocket(my_sock)

I get:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xd0 in position 1023: unexpected end of data

Can you give some sample code/data that illustrates your problem?
– Michael Mior, Commented Apr 3, 2012 at 2:58
Search for "PyCon 2012 unicode" on YouTube. There's an awesome video on Unicode in Python 2/3.
– Christopher Mahan, Commented Apr 3, 2012 at 2:59
You most probably need to parse the Content-Type header and decode appropriately. There is no "magic" solution (except for using a library instead of rolling your own code for a problem that has been solved one hundred thousand times).
– Niklas B., Commented Apr 3, 2012 at 3:00
Yeah, the data you're receiving isn't UTF-8. Determine the actual encoding and decode it using that.
– agf, Commented Apr 3, 2012 at 3:21
@Hudson: The header will be ASCII (no characters with code points > 0x7f). Can't you just use urllib for the request? It's part of the Python stdlib.
– Niklas B., Commented Apr 3, 2012 at 12:28

Vlad the Impala · Accepted Answer · 2012-04-03 03:20:04Z

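You can try detecting the encoding of the data with UnicodeDammit (part of Beautiful Soup). Make sure the data you're getting is actually UTF-8. You can also choose to ignore decoding errors, which silently drops any bytes that can't be decoded: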

buffer.decode("utf-8", "ignore")
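One thing worth checking before ignoring errors: the message in the question ("byte 0xd0 in position 1023: unexpected end of data") is what you would see if a multi-byte UTF-8 character were split across two 1024-byte recv() calls. If that is the cause, "ignore" would silently drop that character. A minimal sketch of the alternative, assuming the response really is UTF-8: collect the raw bytes first and decode once at the end. The names read_all_bytes and FakeSocket are hypothetical and exist only for this illustration.

```python
def read_all_bytes(sock, chunk_size=1024):
    """Read from sock until the peer closes the connection; return raw bytes."""
    chunks = []
    while True:
        chunk = sock.recv(chunk_size)
        if not chunk:  # an empty result means the peer closed the connection
            break
        chunks.append(chunk)
    return b"".join(chunks)


class FakeSocket:
    """Hypothetical stand-in for a real socket, used only for demonstration."""

    def __init__(self, data):
        self._data = data

    def recv(self, n):
        chunk, self._data = self._data[:n], self._data[n:]
        return chunk


# 'д' encodes to two bytes (0xd0 0xb4); a chunk size of 1 forces a split
# in the middle of the character, which per-chunk decoding cannot handle.
payload = "д".encode("utf-8")
text = read_all_bytes(FakeSocket(payload), chunk_size=1).decode("utf-8")
```

If the text has to be decoded as it arrives, codecs.getincrementaldecoder("utf-8") from the standard library keeps the partial bytes between calls instead of raising.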


1 Comment

Niklas B. Over a year ago

The data doesn't seem to be UTF-8, so this is a bad workaround at best.
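For sites that are not UTF-8, the approach suggested in the comments on the question (read the charset from the Content-Type header) could look roughly like the sketch below. It assumes raw holds the complete response as bytes, and it does not handle chunked transfer encoding, compression, or a charset declared only inside the HTML. The function name decode_http_response and the ISO-8859-1 fallback are assumptions made for this example.

```python
from email import message_from_string


def decode_http_response(raw, default_charset="iso-8859-1"):
    """Decode an HTTP response body using the charset named in its headers."""
    head, _, body = raw.partition(b"\r\n\r\n")
    # ISO-8859-1 maps every byte to a character, so this decode cannot fail.
    header_text = head.decode("iso-8859-1")
    # Drop the status line; the remaining lines parse like e-mail headers.
    _status_line, _, header_lines = header_text.partition("\r\n")
    headers = message_from_string(header_lines)
    charset = headers.get_content_charset() or default_charset
    try:
        return body.decode(charset, errors="replace")
    except LookupError:  # the server named a charset Python doesn't know
        return body.decode(default_charset, errors="replace")


raw = (b"HTTP/1.1 200 OK\r\n"
       b"Content-Type: text/html; charset=ISO-8859-1\r\n"
       b"\r\n"
       b"caf\xe9")
page = decode_http_response(raw)
```

In practice a library such as urllib from the standard library already handles the request and header parsing, which is why the comments recommend it over hand-rolled socket code.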
