6

I have a string, say s = 'Chocolate Moelleux-M\xe8re'. When I call unicode(s) on it, I get:

In [14]: unicode(s)
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 20: ordinal not in range(128)

Similarly, when I try to decode it with s.decode(), I get the same error:

In [13]: s.decode()
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 20: ordinal not in range(128)

How can I decode such a string into unicode?

2 Answers

11

I have run into this problem many times. My data contained strings in several different encodings, so I wrote a function that decodes a string heuristically, based on characteristic features of each encoding.

import re
import sys

def decode_heuristically(string, enc=None, denc=sys.getdefaultencoding()):
    """
    Try to interpret 'string' using several possible encodings (Python 2).
    @input : string, optional preferred encoding, default encoding.
    @output: a tuple (decoded_string, flag_decoded, encoding), where
             flag_decoded is 0 for a clean decode and 1 for a forced one.
    """
    if isinstance(string, unicode):
        return string, 0, "utf-8"
    try:
        new_string = unicode(string, "ascii")
        return new_string, 0, "ascii"
    except UnicodeError:
        encodings = ["utf-8", "iso-8859-1", "cp1252", "iso-8859-15"]

        if denc != "ascii":
            encodings.insert(0, denc)

        if enc:
            encodings.insert(0, enc)

        for enc in encodings:
            # Bytes 0x80-0x9f are control characters in the ISO-8859 family,
            # so their presence suggests a different encoding (e.g. cp1252).
            if (enc in ("iso-8859-15", "iso-8859-1") and
                    re.search(r"[\x80-\x9f]", string) is not None):
                continue

            # These bytes map to different characters in iso-8859-15
            # (e.g. 0xa4 is the euro sign), so prefer iso-8859-15 for them.
            if (enc in ("iso-8859-1", "cp1252") and
                    re.search(r"[\xa4\xa6\xa8\xb4\xb8\xbc-\xbe]", string)
                    is not None):
                continue

            try:
                new_string = unicode(string, enc)
            except UnicodeError:
                pass
            else:
                # Accept only if the decoded text round-trips to the same bytes.
                if new_string.encode(enc) == string:
                    return new_string, 0, enc

        # Nothing decoded cleanly: force a decode that drops undecodable
        # bytes, and keep the candidate that preserves the most characters.
        output = [(unicode(string, enc, "ignore"), enc) for enc in encodings]
        output = [(len(new_string[0]), new_string) for new_string in output]
        output.sort()
        new_string, enc = output[-1][1]
        return new_string, 1, enc

In addition, this link gives good background on why encodings matter: Why we need sys.setdefaultencoding in py script
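
For readers who only need the core idea of the heuristic function above, here is a much smaller sketch of trial decoding. It is not the function from this answer: the name first_clean_decode and the candidate list are made up for illustration, and it skips the byte-range checks and the forced fallback. It is written to run on both Python 2.7 and Python 3.

```python
def first_clean_decode(data, candidates=("ascii", "utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes strictly.

    Hypothetical helper: the order of 'candidates' is an assumption.
    latin-1 accepts every byte sequence, so it has to come last.
    """
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding decoded the data")


# The byte string from the question: not ASCII, not valid UTF-8.
text, used = first_clean_decode(b'Chocolate Moelleux-M\xe8re')

# The same word stored as UTF-8 is caught by the utf-8 candidate instead.
utf8_text, utf8_used = first_clean_decode(b'M\xc3\xa8re')
```

Keep in mind that a strict decode succeeding only means the bytes are valid in that encoding, not that the guess is the encoding the author actually used.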

4

You need to tell s.decode your encoding. In your case s.decode('latin-1') seems fitting.
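
A minimal sketch of the difference, using the string from the question. The b prefix is only there so the same lines run on Python 2.7 and Python 3; in Python 2 a plain 'Chocolate Moelleux-M\xe8re' literal is the same byte string.

```python
s = b'Chocolate Moelleux-M\xe8re'

# With no codec given, Python 2 falls back to ASCII, which cannot
# represent byte 0xe8. Decoding as ASCII explicitly shows the same failure.
try:
    s.decode('ascii')
    error_position = None
except UnicodeDecodeError as exc:
    error_position = exc.start  # index of the offending byte

# Naming the encoding maps byte 0xe8 to U+00E8 (e with grave accent).
text = s.decode('latin-1')
```

Here error_position is 20, matching the position reported in the traceback in the question.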

5 Comments

Is it going to help me in all situations? Is there a generalised solution?
Can we remove those characters, such as the '\x' one in my example, from the original string?
@alis: You could use chardet (chardet.feedparser.org) to guess the encoding.
s.decode('ascii','ignore') will take out all 'weird' characters
@alis: This converts Chocolate Moelleux-Mère into Chocolate Moelleux-Mre. I don't understand how this could be an actual solution for anything. Further, assume you encounter an ISO-8859-5 encoded version of Мойст Шоколад Матери. If you decode that by ignoring all non-ascii characters, all that will remain are two blanks. In other words, decode your strings by specifying the matching encoding. In your example, unicode(s, 'latin-1').
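
To make the last comment concrete, here is a small sketch of what the 'ignore' error handler does to both examples. The Russian text is the one quoted in the comment; encoding it to ISO-8859-5 here just simulates receiving such bytes. The lines run on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
s = b'Chocolate Moelleux-M\xe8re'

# 'ignore' silently drops every byte that is not ASCII.
stripped = s.decode('ascii', 'ignore')

# Simulate an ISO-8859-5 encoded byte string.
original = u'Мойст Шоколад Матери'
russian_bytes = original.encode('iso-8859-5')

# Every Cyrillic letter is a non-ASCII byte, so only the spaces survive.
leftover = russian_bytes.decode('ascii', 'ignore')

# Decoding with the matching encoding loses nothing.
restored = russian_bytes.decode('iso-8859-5')
```

After this, stripped is 'Chocolate Moelleux-Mre', leftover is two spaces, and restored equals the original text.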
