72

I tried using some code like this to read a JSON file (encoded using UTF-8):

# Count the lines in the file
input = open("json/world_bank.json")
i = 0
for l in input:
    i += 1
print(i)

But I got a UnicodeDecodeError. However, it started working once I tried explicitly specifying an encoding:

input = open("json/world_bank.json", encoding="utf8")

I thought the open function would use "utf8" as the default encoding? Why does it need to be specified?

3 Comments

  • What does sys.getfilesystemencoding() return on your system? Commented Mar 30, 2016 at 9:47
  • @marcelm Here it is 'mbcs'. Commented Mar 30, 2016 at 9:48
  • Ah hmm, that doesn't tell me too much; could you check open("json/world_bank.json").encoding as well? Commented Mar 30, 2016 at 11:29

2 Answers

90

Python 3's UTF-8 default only applies to conversions between the bytes and str types, that is, to str.encode() and bytes.decode(). open() instead chooses a default encoding based on the environment. From the documentation:

encoding is the name of the encoding used to decode or encode the file. This should only be used in text mode. The default encoding is platform dependent (whatever locale.getpreferredencoding() returns), but any text encoding supported by Python can be used. See the codecs module for the list of supported encodings.

For example, a Windows machine with a Western Europe/North America locale will normally use the 8-bit Windows-1252 character set (Python calls this encoding 'cp1252').
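
To make the mismatch concrete, here is a minimal sketch, not the asker's actual file: it writes a one-character UTF-8 file to a temporary path and reads it back with two encodings. The sample character, the temporary file, and the helper name try_read are assumptions made for illustration. The first print shows what open() would fall back to on your machine.

```python
import locale
import os
import tempfile

# The encoding open() falls back to when none is given (varies by platform).
print("default for open():", locale.getpreferredencoding(False))

# Hypothetical sample data: U+00C1 is stored as the bytes C3 81 in UTF-8.
text = "\u00c1"
fd, path = tempfile.mkstemp(suffix=".json")
os.close(fd)
with open(path, "w", encoding="utf-8") as f:
    f.write(text)


def try_read(path, encoding):
    """Return the decoded file contents, or None if decoding fails."""
    try:
        with open(path, encoding=encoding) as f:
            return f.read()
    except UnicodeDecodeError:
        return None


utf8_result = try_read(path, "utf-8")     # decodes correctly
cp1252_result = try_read(path, "cp1252")  # byte 0x81 is undefined in Python's cp1252 codec
os.remove(path)

print(utf8_result == text, cp1252_result)
```

Not every UTF-8 file fails under cp1252. Many byte sequences decode without an error and silently produce wrong characters instead, which is another reason to pass encoding="utf-8" explicitly.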


7 Comments

Fortunately there are recent attempts to end this madness... someday.
3.9 is installed on my machine and it's still using Windows 1252 encoding. PEP 597 linked by @Jeyekomon now says Python 3.10.
This is a good example of a very bad decision that was made a long time ago. Why give the pretense of being cross-platform when things that do not have to be cross-platform are not cross-platform by default?
The madness will finally be ended in Python 3.15. PEP 686: Make UTF-8 mode default has been accepted.
The madness will finally be ended with the demise of Windows.
sys.getfilesystemencoding()    # 'utf-8'
locale.getpreferredencoding()  # 'cp1252'
8

The problem can also be solved by setting the environment variable PYTHONUTF8=1 (supported since Python 3.7). This makes open() use UTF-8 by default, rather than the platform's default encoding.
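
One way to check the effect without changing your system settings is to start a child interpreter with the variable set. This is only a sketch: the variable names and the one-line child program are made up for illustration, and it assumes sys.executable points at a Python 3.7 or newer interpreter.

```python
import os
import subprocess
import sys

# Code run in the child interpreter: report UTF-8 mode and the default text encoding.
child_code = (
    "import locale, sys; "
    "print(sys.flags.utf8_mode, locale.getpreferredencoding(False))"
)

# Copy the current environment and switch UTF-8 mode on for the child only.
env = dict(os.environ, PYTHONUTF8="1")

result = subprocess.run(
    [sys.executable, "-c", child_code],
    env=env,
    capture_output=True,
    text=True,
    check=True,
)
flag, encoding = result.stdout.split()
print(flag, encoding)
```

The child should report a UTF-8 mode flag of 1 and a UTF-8 default encoding (the exact spelling of the name differs between Python versions), regardless of the locale of the parent process.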

3 Comments

This is called "UTF-8 Mode", which forces Python to ignore local environment locales. See docs.python.org/3/library/os.html#utf8-mode. As always, it's better to fix the root cause by setting the correct locale, which should lead to a healthier system.
@AlastairMcCormack I would say it's better to fix the root cause by specifying the encoding in the program. There are any number of reasons why the file the program needs to read would be in a different encoding from the one described in the "correct locale". As they say, explicit is better than implicit.
@KarlKnechtel ah...well...yes and no 😀. I'm not saying that you shouldn't explicitly set the encoding when opening a known file type (as the OP did). My point was that overriding Python's locale detection by using "utf8-mode" is unwise as you'll lose two important features: 1) Terminal/console encoding detection. This was particularly important on Windows consoles. 2) A sensible default open encoding. On Windows machines, this is more important, where text files written by local MS apps will be encoded using the local 8-bit codepage.
