
I'm looking into how encoding works in Python 2.7, and I can't quite understand some aspects of it. I've worked with files in different encodings and was doing okay so far, until I started working with a certain API that requires Unicode strings

u'text'

and I was using Normal strings

'text'

This caused a lot of problems.

So I want to know how to go from a Unicode string to a normal string and back, because the data I'm working with is handled as normal strings, and the only place I know how to get Unicode strings without issues is the Python shell.

What I've tried is:

>>> foo = "gurú"
>>> bar = u"gurú"
>>> foo
'gur\xa3'
>>> bar
u'gur\xfa'

Now, to get a Unicode string, what I do is:

>>> foobar = unicode(foo, "latin1")
>>> foobar
u'gur\xa3'

But this doesn't work for me, since I'm doing some comparisons in my code like this:

>>> foobar in u"Foo gurú Bar"
False

This fails, even though the original value is the same, because of the encoding.

[Edit]

I'm using Python Shell on Windows 10.

  • It's because you're using the wrong encoding; latin1 is incorrect. If you're using Windows you should try mbcs, because that uses the native encoding for your flavor of Windows. Commented Jul 19, 2017 at 22:01
  • @MarkRansom You are right, I was indeed not using the correct encoding. Unfortunately, mbcs was not the way to go for me. But I found a proper response in another question on Stack Overflow, so I'll add it as an answer and link to it for future reference. Commented Jul 26, 2017 at 13:45
  • The only time mbcs won't work is if you're in a command window; I should have thought of that. I'm glad you figured out your answer. Commented Jul 30, 2017 at 3:20

1 Answer

The Windows terminal uses legacy DOS code pages. For US Windows it is:

>>> import sys
>>> sys.stdout.encoding
'cp437'

Windows applications use Windows code pages. Python's IDLE will show the Windows encoding:

>>> import sys
>>> sys.stdout.encoding
'cp1252'

Your results may vary! (Source)
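To make the difference between these code pages concrete, here is a small sketch that decodes the same bytes with each codec mentioned on this page. The bytes are written with escapes so the result does not depend on the console, and the `results` dictionary is just an illustrative name. It runs on Python 2.7 and 3.

```python
from __future__ import print_function

# The bytes a cp437/cp850 console produces when you type "gurú".
raw = b'gur\xa3'

# The Unicode string we expect: U+00FA is "ú".
expected = u'gur\xfa'

results = {}
for codec in ('cp437', 'cp850', 'cp1252', 'latin1'):
    # Byte 0xA3 is "ú" in the DOS code pages but "£" in cp1252 and latin1.
    results[codec] = raw.decode(codec) == expected
    print(codec, results[codec])
```

Only the DOS code pages give back the original text, which is why decoding with latin1 produced a string that never matched.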

So if you want to go from a normal string to Unicode and back, you first have to find out the encoding your console uses, because that is the encoding of the bytes in normal strings typed at the Python 2.x shell. Then use that encoding to make the conversion.
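As a rough sketch of the round trip, assuming the console reports cp850 (replace it with whatever `sys.stdout.encoding` gives you): `decode` goes from a normal string to Unicode, and `encode` goes back. On Python 2.7, `normal.decode(enc)` is equivalent to `unicode(normal, enc)`. The variable names are only illustrative.

```python
# Assumption: this is the value sys.stdout.encoding reported on your console.
console_encoding = 'cp850'

# A normal (byte) string as the cp850 console would store "gurú".
normal = b'gur\xa3'

# Normal string -> Unicode string.
as_unicode = normal.decode(console_encoding)

# Unicode string -> normal string, using the same encoding.
back = as_unicode.encode(console_encoding)

# The comparison from the question now succeeds.
found = as_unicode in u'Foo gur\xfa Bar'
```

Using the same encoding in both directions is what keeps the round trip lossless.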

I leave you with an example:

>>> import sys
>>> sys.stdout.encoding
'cp850'
>>>
>>> foo = "gurú"
>>> bar = u"gurú"
>>> foo
'gur\xa3'
>>> bar
u'gur\xfa'
>>>
>>> foobar = unicode(foo, 'cp850')
>>> foobar
u'gur\xfa'
>>>
>>> foobar in u"Foo gurú Bar"
True
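If you do this conversion in several places, it may help to wrap it. The helpers below are hypothetical (`console_encoding` and `to_unicode` are names made up for this sketch, not part of any library), and the `utf-8` fallback is an assumption for the case where `sys.stdout.encoding` is `None`, which can happen when output is redirected.

```python
import sys


def console_encoding(default='utf-8'):
    """Return the console's encoding, or `default` when it is unknown."""
    # sys.stdout.encoding can be None, e.g. when output is piped to a file.
    return getattr(sys.stdout, 'encoding', None) or default


def to_unicode(raw, encoding=None):
    """Decode a normal (byte) string; return Unicode strings unchanged."""
    if isinstance(raw, bytes):  # on Python 2.7, bytes is the same type as str
        return raw.decode(encoding or console_encoding())
    return raw


# Passing the encoding explicitly keeps the result independent of the console.
foobar = to_unicode(b'gur\xa3', 'cp850')
```

With this, `to_unicode(foo) in u"Foo gurú Bar"` works in the shell without hard-coding the code page, as long as `foo` was typed in that same console.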