
I know that a lot of people have reported problems with string encodings in Python, but no matter what I try, I can't figure out how to fix mine. I'm using TCP sockets to connect to a web server and send it an HTTP request. I read the response into a series of 1024-byte buffers, decode each one, and concatenate the results into a single string. When I read the response, however, I get a UnicodeDecodeError. I want my program to work with many different websites, so is there a solution that would work with just about any site I give it?

Thank you for your time.

Some code:

def getAllFromSocket(sock):
    '''Reads all data from a socket and returns a string of it.'''
    message = ''
    if sock is not None:
        while True:
            buffer = sock.recv(1024)
            if len(buffer) == 0:
                # An empty read means the peer closed the connection.
                break
            # Decode each chunk separately and append it to the result.
            message += buffer.decode('utf-8')
    return message

So when I do this:

received_message = getAllFromSocket(my_sock)

I get:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xd0 in position 1023: unexpected end of data
  • 2
    Can you give some sample code/data that illustrates your problem? Commented Apr 3, 2012 at 2:58
  • 1
    Search for "PyCon 2012 unicode" on YouTube. There's an awesome video on Unicode in Python 2/3. Commented Apr 3, 2012 at 2:59
  • 1
    You most probably need to parse the Content-Type header and decode appropriately. There is no "magic" solution (except for using a library instead of rolling your own code for a problem that has been solved one hundred thousand times). Commented Apr 3, 2012 at 3:00
  • 1
    Yeah, the data you're receiving isn't utf-8. Determine the actual type and decode it from that. Commented Apr 3, 2012 at 3:21
  • 1
    @Hudson: The header will be ASCII (no characters with code points > 0x7f). Can't you just use urllib for the request? It's part of the Python stdlib. Commented Apr 3, 2012 at 12:28
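
The suggestion in the comments to parse the Content-Type header could look roughly like the sketch below. The helper names `charset_from_content_type` and `decode_http_body` are hypothetical, and the sketch assumes the whole response has already been collected as bytes, that the body is neither chunked nor compressed, and that ISO-8859-1 is an acceptable fallback when no charset is declared.

```python
from email.message import Message

DEFAULT_CHARSET = "iso-8859-1"  # assumed fallback; adjust for your use case


def charset_from_content_type(header_value, default=DEFAULT_CHARSET):
    """Return the charset parameter of a Content-Type value, or the default."""
    if not header_value:
        return default
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset(failobj=default)


def decode_http_body(raw):
    """Split a raw HTTP response (bytes) and decode the body as text."""
    head, _, body = raw.partition(b"\r\n\r\n")
    content_type = ""
    # Skip the status line, then look for the Content-Type header.
    for line in head.decode("iso-8859-1").split("\r\n")[1:]:
        name, _, value = line.partition(":")
        if name.strip().lower() == "content-type":
            content_type = value.strip()
    charset = charset_from_content_type(content_type)
    try:
        return body.decode(charset, errors="replace")
    except LookupError:
        # The server named a codec Python does not know.
        return body.decode(DEFAULT_CHARSET, errors="replace")
```

Pages may also declare their encoding in a `<meta>` tag, or declare it incorrectly, so this is only a first step. A library such as urllib handles the request and header parsing for you.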

1 Answer


You can try detecting the encoding of the data with UnicodeDammit (part of Beautiful Soup). Check that what you're receiving really is UTF-8 before decoding it as such. You can also tell the decoder to ignore bytes it cannot decode:

buffer.decode("utf-8", "ignore")
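
The error in the question ("byte 0xd0 in position 1023: unexpected end of data") suggests that a multi-byte UTF-8 sequence was cut in half at the end of a 1024-byte chunk, which can happen even when the data is valid UTF-8. A minimal sketch of one way around that, assuming the encoding is already known, is to use an incremental decoder from the standard `codecs` module. The helper name `decode_chunks` is hypothetical.

```python
import codecs


def decode_chunks(chunks, encoding="utf-8", errors="strict"):
    """Decode an iterable of byte chunks whose boundaries may split a character."""
    decoder = codecs.getincrementaldecoder(encoding)(errors=errors)
    # The decoder buffers an incomplete trailing sequence until the next chunk.
    parts = [decoder.decode(chunk) for chunk in chunks]
    # final=True raises if the data ends in the middle of a character.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


data = "Привет".encode("utf-8")
chunks = [data[:1], data[1:]]  # split inside the first two-byte character
text = decode_chunks(chunks)
```

Decoding `data[:1]` on its own raises the same kind of UnicodeDecodeError as in the question. An equivalent approach is to concatenate all the received bytes first and call `decode` once at the end.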

1 Comment

The data doesn't seem to be UTF-8, so this is a bad workaround at best.
