Python: use a Unicode string as its original text characters, not only when printing

Question

I am getting a Unicode string from a method, and I want it to appear as the original text characters rather than as Unicode escapes.

a=u'\u2018\u0997\u09c7\u09ae\u09bf\u0982 \u09aa\u09cd\u09b2\u09be\u099f\u09ab\u09b0\u09cd\u09ae\u2019 \u09a4\u09c8\u09b0\u09bf \u0995\u09b0\u09ac\u09c7 \u09ab\u09c7\u09b8\u09ac\u09c1\u0995'

print a #‘গেমিং প্লাটফর্ম’ তৈরি করবে ফেসবুক

print always works, but my use case is different. I want to serve this text from my RESTful API, or at least use it as a string of the original characters. If I leave it as it is, I suspect my clients, who will be using it in HTML, won't be able to use it easily.

The end result should look like this:

{title: ‘গেমিং প্লাটফর্ম’ তৈরি করবে ফেসবুক }

but the json.dumps output looks like this:

json.dumps({'a': u})
'{"a": "\\\\u0996\\\\u09be\\\\u09b2\\\\u09bf\\\\u09df\\\\u09be\\\\u099c\\\\u09c1\\\\u09b0\\\\u09c0\\\\u09a4\\\\u09c7 \\\\u09a6\\\\u09c1\\\\u0987 \\\\u0997\\\\u09cd\\\\u09b0\\\\u09c1\\\\u09aa\\\\u09c7\\\\u09b0 \\\\u09b8\\\\u0982\\\\u0998\\\\u09b0\\\\u09cd\\\\u09b7\\\\u09c7 \\\\u09a8\\\\u09be\\\\u09b0\\\\u09c0\\\\u09b8\\\\u09b9 \\\\u0986\\\\u09b9\\\\u09a4 \\\\u09e7\\\\u09e6"}'

So chances are I need something like:

blog={}
blog['title']= str(a) # or something else

I have tried the following so far, with no luck:

>>> str(a) 

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128)

>>> a.encode('utf-8')
'\xe2\x80\x98\xe0\xa6\x97\xe0\xa7\x87\xe0\xa6\xae\xe0\xa6\xbf\xe0\xa6\x82 \xe0\xa6\xaa\xe0\xa7\x8d\xe0\xa6\xb2\xe0\xa6\xbe\xe0\xa6\x9f\xe0\xa6\xab\xe0\xa6\xb0\xe0\xa7\x8d\xe0\xa6\xae\xe2\x80\x99 \xe0\xa6\xa4\xe0\xa7\x88\xe0\xa6\xb0\xe0\xa6\xbf \xe0\xa6\x95\xe0\xa6\xb0\xe0\xa6\xac\xe0\xa7\x87 \xe0\xa6\xab\xe0\xa7\x87\xe0\xa6\xb8\xe0\xa6\xac\xe0\xa7\x81\xe0\xa6\x95'

>>> a.encode('utf8')
'\xe2\x80\x98\xe0\xa6\x97\xe0\xa7\x87\xe0\xa6\xae\xe0\xa6\xbf\xe0\xa6\x82 \xe0\xa6\xaa\xe0\xa7\x8d\xe0\xa6\xb2\xe0\xa6\xbe\xe0\xa6\x9f\xe0\xa6\xab\xe0\xa6\xb0\xe0\xa7\x8d\xe0\xa6\xae\xe2\x80\x99 \xe0\xa6\xa4\xe0\xa7\x88\xe0\xa6\xb0\xe0\xa6\xbf \xe0\xa6\x95\xe0\xa6\xb0\xe0\xa6\xac\xe0\xa7\x87 \xe0\xa6\xab\xe0\xa7\x87\xe0\xa6\xb8\xe0\xa6\xac\xe0\xa7\x81\xe0\xa6\x95'

>>> a.__str__()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128)

>>> a.decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128)

Your JSON output example looks wrong to me, repr() should convert that to '{"a": "\\u0996..."}', not to '{"a": "\\\\u0996..."}'. Probably you printed that by writing a="\u0996..." rather than a=u"\u0996...".
– roeland, Commented Aug 24, 2016 at 2:10
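
A minimal sketch of the difference this comment describes. It is written for Python 3, where a raw string `r"\u0996..."` stands in for the Python 2 byte-string literal `"\u0996..."` (an assumption made so the example runs on a current interpreter). The variable names are illustrative only.

```python
import json

real = u"\u0996\u09be"      # two actual characters: the escapes are interpreted
literal = r"\u0996\u09be"   # twelve ASCII characters: backslash, 'u', four hex digits, twice

single = json.dumps({"a": real})     # JSON text holds \u0996 with one backslash
double = json.dumps({"a": literal})  # JSON escapes the backslash itself: \\u0996

print(single)
print(double)
print(repr(double))  # repr doubles every backslash again, which gives \\\\u0996
```

If the output in the question came from the second case, the text was never a Unicode string of Bengali characters in the first place, which would explain the quadrupled backslashes.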

ShadowRanger · Accepted Answer · 2016-08-23 05:44:25Z

3

You're misunderstanding the repr of a Python object. Those escapes in your literal string are actually being converted internally to the "real" characters that Python is displaying when you print (that is, internally, it's storing a single Unicode ordinal for each of the escapes, not the escapes themselves). You don't need to encode it unless you need the raw bytes in a particular encoding (and decoding it is nonsensical; unicode objects have that method in Py2, but it's usually wrong to use it, because unicode is by definition not encoded bytes).

Basically, just use the unicode object you've got; it's the text you expect. It just may not display that way in the interactive interpreter, which echoes the repr of the object, and the repr shows the escapes instead of the actual characters (partially to ensure it won't error out if you lack the fonts or language support to display the real characters). Unicode-friendly libraries will work with it exactly the way you expect, and the length is usually the character count (in Py2, on 16-bit wchar systems with non-BMP ordinals, this may not be true, but it usually is).
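
A small sketch of that point, using the first word of the title from the question. It targets Python 3, where `ascii()` produces the escaped view that the Python 2 interpreter echoes for a unicode object; the variable names are made up for illustration.

```python
# The escapes in the source literal become real characters in memory.
a = u'\u0997\u09c7\u09ae\u09bf\u0982'

char_count = len(a)          # counts code points, not the 30 characters typed in the literal
first_ordinal = ord(a[0])    # the single ordinal stored for the first escape
escaped_view = ascii(a)      # escaped representation, similar to a Python 2 repr

print(char_count)
print(hex(first_ordinal))
print(escaped_view)          # shows \uXXXX escapes
print(a)                     # shows the actual characters, if the terminal can render them
```

The object is the same in both of the last two lines; only the way it is displayed differs.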

That said, I'd recommend switching to Python 3 for any non-ASCII intensive work; Python 2 support for Unicode is less consistent and has many more gaps and pitfalls. Many third party packages, and even some built-in packages (cough csv cough) are not unicode friendly, so you end up needing to explicitly encode to use them, then decode their results.


5 Comments

sadaf2605 Over a year ago

At the end of the day I want to serve it from a RESTful API, so does that mean keeping it in Unicode form is good to go? But I don't like the fact that I can't read what is written there when the API response is viewed in a browser. Is there anything I can do about it?

ShadowRanger Over a year ago

@sadaf2605: REST APIs usually have their own specific data transfer format, e.g. JSON. Typically, you'd take that dictionary you made, pass it to json.dump/json.dumps, then send what it produces out on the wire (many frameworks do this for you automatically if you tell them to send a dict). The receiver on the other side then parses what it receives with their own JSON library (maybe not even Python's), but as long as you produced legal JSON, they can parse it on their end however they like; if their parser is standardized, it should get the same logical results.
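
As a rough sketch of what this looks like with the standard `json` module (Python 3 syntax assumed; the `blog` dictionary mirrors the one in the question): the default output escapes non-ASCII characters, while the `ensure_ascii=False` option keeps them readable. Both forms are legal JSON and parse back to the same data.

```python
import json

title = (u'\u2018\u0997\u09c7\u09ae\u09bf\u0982 '
         u'\u09aa\u09cd\u09b2\u09be\u099f\u09ab\u09b0\u09cd\u09ae\u2019')
blog = {'title': title}

escaped = json.dumps(blog)                       # ASCII only, uses \uXXXX escapes
readable = json.dumps(blog, ensure_ascii=False)  # keeps the original characters

print(escaped)
print(readable)

# A JSON parser on the receiving side recovers identical text from either form.
same_after_parsing = json.loads(escaped) == json.loads(readable) == blog
print(same_after_parsing)
```

Whether to use `ensure_ascii=False` is mostly a matter of human readability; a conforming client should see the same title either way.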

sadaf2605 Over a year ago

Thank you for helping me this far, but json.dump on a Unicode dictionary seems to make it even worse! :(

ShadowRanger Over a year ago

@sadaf2605: It's not "worse", it's JSON encoded. The whole point is that it's a non-Python specific encoding that can be interpreted by any JSON libraries on the receiver side. The internet works in bytes with specific encodings, it can't send the logical idea of a character, but rather bytes that define the character in a mutually agreed upon format.
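
To illustrate the "bytes in a mutually agreed format" idea, here is a hedged sketch (Python 3 assumed) that encodes the JSON text as UTF-8 for sending and decodes it on the other side. No real network is involved; `body` simply stands in for what would go over the wire, and the names are hypothetical.

```python
import json

original = {'title': u'\u0997\u09c7\u09ae\u09bf\u0982'}

# Sender: logical text -> JSON text -> UTF-8 bytes.
json_text = json.dumps(original, ensure_ascii=False)
body = json_text.encode('utf-8')

# Receiver: UTF-8 bytes -> JSON text -> logical text again.
received = json.loads(body.decode('utf-8'))

print(type(body).__name__)
print(received == original)
```

The bytes look nothing like the title, yet the receiver ends up with exactly the same characters because both sides agree on JSON and UTF-8.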

sadaf2605 Over a year ago

Thank you for your patience and clarification, I really appreciate that :)
