10

I am using the Python interpreter in Windows 7 terminal.
I am trying to wrap my head around unicode and encodings.

I type:

>>> s='ë'
>>> s
'\x89'
>>> u=u'ë'
>>> u
u'\xeb'

Question 1: Why is the encoding used in the string s different from the one used in the unicode string u?

I continue, and type:

>>> us=unicode(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x89 in position 0: ordinal
not in range(128)
>>> us=unicode(s, 'latin-1')
>>> us
u'\x89'

Question 2: I tried the latin-1 encoding on a hunch to turn the string into a unicode string (actually, I tried a bunch of other ones first, including utf-8). How can I find out which encoding the terminal used to encode my string?

Question 3: How can I make the terminal print ë as ë instead of '\x89' or u'\xeb'? Hmm, stupid me. print(s) does the job.

I already looked at this related SO question, but no clues from there: Set Python terminal encoding on Windows

2
  • related: Python, Unicode, and the Windows console Commented Aug 2, 2016 at 12:16
  • In the first question, you're talking about representation, not encoding. s is an object that contains a single byte with a specific value; u is an object containing a single character. In both cases you see the repr of the object reported back. Commented May 4, 2024 at 21:34

8 Answers

14

Unicode is not an encoding. You encode into byte strings and decode into Unicode:

>>> '\x89'.decode('cp437')
u'\xeb'
>>> u'\xeb'.encode('cp437')
'\x89'
>>> u'\xeb'.encode('utf8')
'\xc3\xab'

The Windows terminal uses legacy DOS code pages. For US Windows it is:

>>> import sys
>>> sys.stdout.encoding
'cp437'

Windows applications use Windows code pages. Python's IDLE will show the Windows encoding:

>>> import sys
>>> sys.stdout.encoding
'cp1252'

Your results may vary.
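
Since the values differ between machines, a small sketch like the following can report what Python sees in the current session. The helper name describe_encodings is made up for illustration; the calls it wraps (sys.stdout.encoding, sys.stdin.encoding, locale.getpreferredencoding() and sys.getdefaultencoding()) are standard library. It is written to run on both Python 2 and Python 3.

```python
from __future__ import print_function

import locale
import sys


def describe_encodings():
    """Collect the encodings Python reports for this process.

    Hypothetical helper for illustration; not part of any library.
    """
    return {
        # Encoding of the console (or None when output is redirected on Python 2)
        "stdout": getattr(sys.stdout, "encoding", None),
        "stdin": getattr(sys.stdin, "encoding", None),
        # The ANSI/Windows code page on Windows, e.g. cp1252
        "preferred": locale.getpreferredencoding(),
        # Used for implicit str/unicode conversion on Python 2
        "default": sys.getdefaultencoding(),
    }


report = describe_encodings()
for name in sorted(report):
    print(name, "->", report[name])
```

In a Windows 7 console the stdout entry would typically be an OEM code page such as cp437 or cp850, while the preferred entry would be a Windows code page such as cp1252.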


1 Comment

Thanks for the sys.stdout.encoding tip. Now it is clear to me how I can determine the encoding used in the terminal
7

Avoid Windows Terminal

I'm not going out on a limb by saying that the 'terminal' (more appropriately, the 'DOS prompt') that ships with Windows 7 is absolute junk. It was bad in Windows 95, NT, XP, Vista, and 7. Maybe they fixed it with PowerShell; I don't know. However, it is indicative of the kind of problems that were plaguing OS development at Microsoft at the time.

Output to a file instead

Set the PYTHONIOENCODING environment variable and then redirect the output to a file.

set PYTHONIOENCODING=utf-8

python myscript.py > output.txt

You can then view the UTF-8 version of your output in Notepad++.
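
If you would rather not depend on the environment variable, a rough alternative is to open the file with an explicit encoding inside the script. This is only a sketch: the temporary directory and the sample text are assumptions for illustration, and the file name output.txt is reused from above. io.open behaves the same on Python 2 and Python 3.

```python
import io
import os
import tempfile

text = u'\xeb'  # LATIN SMALL LETTER E WITH DIAERESIS

# Write to a scratch directory so the example does not clobber real files.
path = os.path.join(tempfile.mkdtemp(), 'output.txt')

# The encoding is stated explicitly, so the console code page is irrelevant.
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(text)

# Read the raw bytes back to confirm what ended up on disk.
with io.open(path, 'rb') as f:
    raw = f.read()

print(repr(raw))  # the two UTF-8 bytes 0xc3 0xab
```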

Install win-unicode-console

win-unicode-console can fix your problems. You should try it out:

pip install win-unicode-console

If you are interested in a thorough discussion of the issue of Python and command-line output, check out Python issue 1602. Otherwise, just use the win-unicode-console package.

py -m run script.py

This runs it per script. Alternatively, you can follow the package's directions to call win_unicode_console.enable() on every invocation by adding it to usercustomize or sitecustomize.

2 Comments

Starting with Python 3.6 the Windows console is much more usable. Outputting to the console bypasses the code page nonsense entirely and works directly with Unicode.
That's good to know. Next time I'm doing some Python development for a Windows shop, I'll try to push them forward.
2

In case others land on this page when searching: the easiest way is to set the code page in the terminal first

CHCP 65001

then run your program.

This works well for me. For PowerShell, start it with

powershell.exe -NoExit /c "chcp.com 65001"

This is from python: unicode in Windows terminal, encoding used?

1 Comment

print already works for the OP. The issue is not the console encoding.
1

Read through this Python HOWTO about Unicode after you read this section from the tutorial:

Creating Unicode strings in Python is just as simple as creating normal strings:

>>> u'Hello World !'
u'Hello World !'

To answer your first question, they are different because only when using u'' are you creating a unicode string.

2nd question:

sys.getdefaultencoding()

returns the default encoding.

But to quote from the link:

Python users who are new to Unicode sometimes are attracted by default encoding returned by sys.getdefaultencoding(). The first thing you should know about default encoding is that you don't need to care about it. Its value should be 'ascii' and it is used when converting byte strings StrIsNotAString to unicode strings.
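
To make the quoted point concrete, here is a minimal sketch of what the 'ascii' default means for the byte from the question. It decodes explicitly instead of relying on any default. The choice of cp437 as the console code page is an assumption taken from the other answers and may not match your machine.

```python
raw = b'\x89'  # the byte the console produced for the character in the question

# Decoding with 'ascii' (the Python 2 default) fails for any byte >= 0x80.
error = None
try:
    raw.decode('ascii')
except UnicodeDecodeError as exc:
    error = exc

# Naming the encoding explicitly avoids depending on the default at all.
# Assumption: the console code page is cp437.
text = raw.decode('cp437')

print(type(error).__name__)  # UnicodeDecodeError
print(repr(text))
```

This is the same failure as the unicode(s) call in the question, and it is why the default encoding is not something to tune: the fix is to state the real encoding at the point of decoding.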


1

You've answered question 1 as you ask it: the first string is an encoded byte-string, but the second is not an encoding at all, it refers to a unicode code-point, which for "LATIN SMALL LETTER E WITH DIAERESIS" is hex eb.

Now, the question of what the first encoding is is an interesting one. I would normally expect it to be either utf-8, or, since you're on Windows, ISO-8859-1 or Win-1252 (which aren't exactly the same thing, but close enough). However, the normal representation of that letter in utf-8 is c3 ab, and in Win-1252 it's actually the same as the unicode code-point, i.e. hex eb. So, it's a bit of a mystery.
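
One way to narrow this down is to encode the character with a handful of likely encodings and compare the bytes against the 89 the interpreter showed. The list of encodings below is just a guess at plausible candidates, not an exhaustive set, and the variable names are made up for illustration.

```python
from __future__ import print_function

char = u'\xeb'  # LATIN SMALL LETTER E WITH DIAERESIS

# Candidate encodings chosen for illustration only.
candidates = ('cp437', 'cp850', 'latin-1', 'cp1252', 'utf-8')

table = {}
for name in candidates:
    data = char.encode(name)
    # bytearray yields integers on both Python 2 and Python 3.
    table[name] = ' '.join('%02x' % b for b in bytearray(data))

for name in candidates:
    print('%-8s %s' % (name, table[name]))
# cp437    89
# cp850    89
# latin-1  eb
# cp1252   eb
# utf-8    c3 ab
```

Only the DOS code pages produce 89, which points at the console's OEM code page rather than a Windows code page or utf-8.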

2 Comments

OK. But why then, when I turn the string s into a unicode string by doing us=unicode(s, 'latin-1'), is the resulting unicode string us not shown as u'\xeb'?
As Mark says, your encoding is probably CP850 rather than Latin-1.
1

It appears you are using code page CP850, which makes sense as this is the historical code page for DOS which has been carried forward to the terminal window.

>>> s
'\x89'
>>> us=unicode(s,'CP850')
>>> us
u'\xeb'
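
When you know which character you typed, you can also go the other way and test which encodings decode the observed byte back to it. The function name candidate_encodings and the list of names to try are hypothetical, chosen only for this sketch.

```python
from __future__ import print_function


def candidate_encodings(raw, expected, names):
    """Return the encodings from `names` that decode `raw` to `expected`."""
    matches = []
    for name in names:
        try:
            if raw.decode(name) == expected:
                matches.append(name)
        except (UnicodeDecodeError, LookupError):
            # Undecodable byte, or a codec this Python build does not have.
            pass
    return matches


names = ['ascii', 'utf-8', 'latin-1', 'cp1252', 'cp437', 'cp850', 'cp1251']
found = candidate_encodings(b'\x89', u'\xeb', names)
print(found)  # ['cp437', 'cp850']
```

Both cp437 and cp850 map 0x89 to this character, so this byte alone cannot distinguish them; sys.stdout.encoding or the chcp command tells you which one the console is actually using.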

1 Comment

That is a sad state of affairs and central to what plagues the OS
1
  1. Actually, a unicode object has no 'encoding'. You should read up on Unicode in Python to avoid constant confusion. This presentation looks adequate: http://farmdev.com/talks/unicode/

  2. You are on a Russian version of Windows, right? Your terminal uses cp1251.

2 Comments

Good point about 'unicode object has no encoding'. I am not on a Russian version of Windows, but an English one.
The linked presentation is indeed very good (at least for me)
1

As you've figured out:

>>> a = "ё"
>>> a
'\xf1'
>>> print a
ё

Do you open any file when you get such errors? If so, try to open it with

import codecs
f = codecs.open('filename.txt','r','utf-8')

