So I'm trying to load this line in as a name for a model:

"Auf der grünen Wiese (1953)"

but I get the error

UnicodeDecodeError: 'utf8' codec can't decode byte 0xfc in position 70: invalid start byte

I'm looking at http://docs.python.org/2/howto/unicode.html#the-unicode-type but I'm still not exactly sure how to fix this. I can convert the string to unicode with the option to replace or ignore the error, but I don't think that is the ideal solution?

I also see that django provides a few functions to help with this stuff: https://docs.djangoproject.com/en/dev/ref/unicode/ but I'm still not quite sure how to approach it.

1 Answer

The line is encoded using latin1. To properly decode it you should do (assuming Python 2.x):

line = 'Auf der gr\xfcnen Wiese (1953)'
name = line.decode('latin1')

If you are reading this from a file, you can also do:

import codecs
# codecs.open decodes the latin1 bytes to unicode as it reads
with codecs.open(path, 'r', 'latin1') as f:
    name = f.readline().strip()
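
If you don't know up front whether a given source is UTF-8 or latin1, one possible approach is to try strict UTF-8 first and only fall back to latin1 when that fails. This is just a sketch: decode_name is a made-up helper, not part of Python or Django, and the latin1 fallback is an assumption about your data. It also shows why the replace option from the question is lossy.

```python
# Bytes as they would arrive from a latin1-encoded source (0xfc is "ü" in latin1).
raw = b"Auf der gr\xfcnen Wiese (1953)"


def decode_name(data, fallback="latin-1"):
    """Hypothetical helper: try strict UTF-8, then fall back to `fallback`."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # 0xfc is not a valid UTF-8 start byte, so we end up here for `raw`.
        return data.decode(fallback)


name = decode_name(raw)

# The "replace" option avoids the exception but throws the character away:
# the undecodable byte becomes U+FFFD (the replacement character).
lossy = raw.decode("utf-8", "replace")
```

Keep in mind that a latin1 decode never raises, because every byte value maps to some character. If the data is really in a third encoding (cp1252, for example), the fallback will quietly return the wrong characters, so it is still better to find out the real encoding of your source.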

2 Comments

So generally, are most strings that have symbols in different languages encoded as latin1? Languages that have similar characters to English I mean.
That's a complicated question. It depends on your data source. UTF-8 is generally more common than latin-1 these days, at least in my experience, but it's dangerous to generalize without context. joelonsoftware.com/articles/Unicode.html is a very good and justifiably popular explanation of the basics of character sets, Unicode and the different encodings. It should help the Python tools make more sense if you're not already well grounded in this area.
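
To make the difference concrete, here is a small sketch (the variable names are only for illustration) that encodes the same text both ways. In latin-1 the "ü" is the single byte 0xfc, which is exactly the byte the error message complains about; in UTF-8 it is the two bytes 0xc3 0xbc.

```python
text = u"gr\u00fcnen"  # "grünen" as a unicode string

latin1_bytes = text.encode("latin-1")  # one byte per character
utf8_bytes = text.encode("utf-8")      # "ü" takes two bytes here

# Decoding with the matching codec gets the original text back.
roundtrip_latin1 = latin1_bytes.decode("latin-1")
roundtrip_utf8 = utf8_bytes.decode("utf-8")
```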
