Understanding encoding and decoding in Python

Question

I'm looking into how encoding works in Python 2.7, and I can't quite understand some aspects of it. I've worked with files in different encodings, and so far I was doing okay, until I started to work with a certain API that requires Unicode strings:

u'text'

and I was using Normal strings

'text'

This raised a lot of problems.

So I want to know how to go from a Unicode string to a normal string and back, because the data I'm working with is handled as normal strings, and the only place I know how to get Unicode strings without issues is the Python shell.

What I've tried is:

>>> foo = "gurú"
>>> bar = u"gurú"
>>> foo
'gur\xa3'
>>> bar
u'gur\xfa'

Now, to get a Unicode string what I do is:

>>> foobar = unicode(foo, "latin1")
>>> foobar
u'gur\xa3'

But this doesn't work for me, since I'm doing some comparisons in my code like this:

>>> foobar in u"Foo gurú Bar"
False

Which fails, even if the original value is the same, because of the encoding.

[Edit]

I'm using Python Shell on Windows 10.

It's because you're using the wrong encoding; latin1 is incorrect. If you're using Windows you should try mbcs, because that uses the native encoding for your flavor of Windows.
– Mark Ransom, Commented Jul 19, 2017 at 22:01
@MarkRansom You are right, I was indeed not using the correct encoding; unfortunately for me, mbcs was not the way to go. But I found a proper response in another question on Stack Overflow, so I'll add it as an answer and link to it, for further questions.
– S. Tyr, Commented Jul 26, 2017 at 13:45
The only time mbcs won't work is if you're in a command window; I should have thought of that. I'm glad you figured out your answer.
– Mark Ransom, Commented Jul 30, 2017 at 3:20

S. Tyr · Accepted Answer · 2017-07-26 14:23:36Z

1

The Windows terminal uses legacy DOS code pages. For US Windows it is:

>>> import sys
>>> sys.stdout.encoding
'cp437'

Windows applications use Windows code pages. Python's IDLE will show the Windows encoding:

>>> import sys
>>> sys.stdout.encoding
'cp1252'

Your results may vary!... Source
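
Since the value differs between machines and shells, it may help to look the encoding up at run time instead of hard-coding it. What follows is only a sketch: `guess_console_encoding` and its `default` parameter are names made up for illustration, and falling back to the locale's preferred encoding is an assumption about what is reasonable when `sys.stdout.encoding` is missing.

```python
import locale
import sys


def guess_console_encoding(default="utf-8"):
    """Return the encoding stdout reports, with fallbacks (hypothetical helper)."""
    # sys.stdout.encoding can be None (for example when output is piped in
    # Python 2), and a replaced stdout object may lack the attribute entirely.
    encoding = getattr(sys.stdout, "encoding", None)
    if not encoding:
        # Fall back to the locale's preferred encoding, then to the default.
        encoding = locale.getpreferredencoding(False) or default
    return encoding


print(guess_console_encoding())
```

The printed name depends on where the script runs, so treat it as a starting point and check it against the bytes you actually receive.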

So if you want to go from a normal string to Unicode and back, you first have to find out the encoding your system uses for normal strings in Python 2.x, and then use it to make the proper conversion.
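
The two directions can be sketched as a pair of small helpers. The names `to_unicode` and `to_bytes` are hypothetical, and the encoding is passed in explicitly because it depends on your system; `cp850` is only used here because it is what my console reported.

```python
def to_unicode(raw, encoding):
    """Decode a byte string (a "normal" string in Python 2) into Unicode."""
    return raw.decode(encoding)


def to_bytes(text, encoding):
    """Encode a Unicode string back into a byte string."""
    return text.encode(encoding)


raw = b"gur\xa3"                 # bytes a cp850 console produces for "gurú"
text = to_unicode(raw, "cp850")  # byte 0xA3 maps to U+00FA under cp850

assert text == u"gur\xfa"
assert to_bytes(text, "cp850") == raw  # the round trip restores the bytes
```

The same calls run under Python 3, where the byte string is a `bytes` object and the Unicode string is a plain `str`.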

I leave you with an example:

>>> import sys
>>> sys.stdout.encoding
'cp850'
>>>
>>> foo = "gurú"
>>> bar = u"gurú"
>>> foo
'gur\xa3'
>>> bar
u'gur\xfa'
>>>
>>> foobar = unicode(foo, 'cp850')
>>> foobar
u'gur\xfa'
>>>
>>> foobar in u"Foo gurú Bar"
True
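
To see why the latin1 attempt in the question returned False, here is a short sketch that decodes the same byte string with both encodings. The byte values are taken from the sessions above; the variable names are just for illustration.

```python
raw = b"gur\xa3"

as_latin1 = raw.decode("latin1")  # byte 0xA3 becomes U+00A3 (pound sign)
as_cp850 = raw.decode("cp850")    # byte 0xA3 becomes U+00FA (u with acute)

haystack = u"Foo gur\xfa Bar"

print(as_latin1 in haystack)  # False
print(as_cp850 in haystack)   # True
```

Decoding never fails with latin1, because every byte maps to some character, so a wrong guess shows up only later as a mismatched comparison.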

edited Jul 26, 2017 at 14:23

answered Jul 26, 2017 at 14:08

S. Tyr
