
I'm receiving unicode data from a client, stored in a dictionary called "data". The following code

variable1 = '\u03b5\u0061\u0073\u0064\u0066'
print("TYPE1 = " + str(type(variable1)))
print("VAR1 = " + variable1)

variable2 = data['text']
print("TYPE2 = " + str(type(variable2)))
print("VAR2 = " + variable2)

prints

TYPE1 = <class 'str'>
VAR1 = εasdf
TYPE2 = <class 'str'>
VAR2 = \u03b5\u0061\u0073\u0064\u0066

This suggests that the data from the client is somehow not interpreted properly. Writing the variables to file also gives the exact same result: the file has the literal "\u03b5\u0061\u0073\u0064\u0066". How can I "reinterpret" that unicode string so that I get the same result as the inline variable?

The following did NOT work:

  • eval(variable2) (Error: "unexpected character after line continuation character")

With print(variable2.encode().decode()), I get VAR2 = Îµ.

By using .encode('ascii').decode('unicode_escape'), I get UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

In the shell:

>>> "\u03b5\u0061\u0073\u0064\u0066"
'εasdf'
  • What about u'\u03b5\u0061\u0073\u0064\u0066'? Commented Aug 22, 2018 at 8:56

2 Answers

It depends on how consistently the input data got corrupted (or encoded in some particular way), but for the given example the following should work:

>>> data = '\\u03b5\\u0061\\u0073\\u0064\\u0066'
>>> print(data)
\u03b5\u0061\u0073\u0064\u0066
>>> text = data.encode('ascii').decode('unicode_escape')
>>> print(text)
εasdf

The "unicode_escape" codec is provided exactly for Python-style Unicode escapes. It also works with escapes of the form \xNN and \UNNNNNNNN (eight hex digits), mixed with literal ASCII characters.
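
As a quick check of that claim, here is a minimal sketch that mixes the three escape forms with plain ASCII. The sample string is made up for illustration and does not come from the question.

```python
# Hypothetical sample: literal backslash escapes mixed with plain ASCII text.
raw = r"\xe9 \u03b5 \U0001F600 plain"

# str -> bytes (ASCII only), then let the codec interpret the escapes.
decoded = raw.encode("ascii").decode("unicode_escape")

# Show the first five code points instead of printing the characters,
# so the result does not depend on the terminal's encoding.
print([f"U+{ord(c):04X}" for c in decoded[:5]])
# ['U+00E9', 'U+0020', 'U+03B5', 'U+0020', 'U+1F600']
```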

A few things to note:

  • The .encode('ascii') step is necessary, because .decode only exists for bytes, not str.
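  • If you have a mixture of non-ASCII literals and Unicode escapes (as is allowed in Python str literals), you can try encode('utf-8'), but be aware that unicode_escape decodes the non-escape bytes as Latin-1, so non-ASCII literals encoded as UTF-8 will likely come out garbled.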
  • eval doesn't work here because there are no quotes around the data.
  • It's possible that your data originates from JSON, where the \uNNNN escapes also exist (but not the \xNN and \UNNNNNNNN ones). If this is the case, you have to deal separately with characters above U+FFFF (e.g. emojis), which are represented by surrogate pairs.
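
For the mixed-literal and JSON cases in the notes above, two possible approaches can be sketched as follows. Both function names are hypothetical helpers, not part of any library. The first assumes every backslash in the input starts a valid Python escape. The second assumes the text contains no unescaped double quotes or raw control characters, since it is wrapped in quotes and handed to the standard json module.

```python
import json


def decode_mixed(text: str) -> str:
    """Decode Python-style escapes in text that may also hold non-ASCII literals.

    Characters outside Latin-1 are first turned into backslash escapes,
    so the unicode_escape step (which reads bytes as Latin-1) sees only
    bytes it can interpret correctly.
    """
    return text.encode("latin-1", "backslashreplace").decode("unicode_escape")


def decode_json_escapes(text: str) -> str:
    """Decode JSON-style \\uNNNN escapes, combining surrogate pairs.

    Assumption: text has no unescaped '"' or raw control characters.
    """
    return json.loads(f'"{text}"')


mixed = decode_mixed("ε\\u0061\\u0073")                    # literal ε plus two escapes
emoji = decode_json_escapes("\\ud83d\\ude00 \\u03b5")      # surrogate pair plus ε
print(len(mixed), len(emoji))
```

The JSON route has the advantage that a surrogate pair such as \ud83d\ude00 comes back as a single code point, which unicode_escape alone does not do.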

9 Comments

Updated my question with your suggestions and the results.
The data does indeed originate from JSON.
Have you tried .encode('utf-8') instead of ASCII? Maybe you could also update your example to reproduce the UnicodeEncodeError as well.
I don't know how to reproduce it without the client sending it (that would have made it a lot simpler). In the Python shell, printing these literals gives the right output. I guess str = "Îµ" is the closest thing: if you can make that into an ε then I guess it would work.
It would seem that "Îµ" indeed is what is actually received.
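
The UnicodeEncodeError at positions 0-1 reported in the question would be consistent with the two UTF-8 bytes of ε having been decoded as Latin-1 somewhere upstream. If that diagnosis holds (it is an assumption, not something the thread confirms), reversing it could look roughly like this. The function name and sample value are hypothetical.

```python
def repair_mojibake(text: str) -> str:
    """Undo a UTF-8 -> Latin-1 mis-decoding.

    Assumption: every character in text fits in Latin-1 and the
    resulting bytes form valid UTF-8; otherwise this raises.
    """
    return text.encode("latin-1").decode("utf-8")


# U+00CE U+00B5 are the Latin-1 readings of the UTF-8 bytes CE B5 (ε).
received = "\u00ce\u00b5asdf"
repaired = repair_mojibake(received)
print(repaired == "\u03b5asdf")  # True
```

Fixing the decoding step on the receiving side, where the bytes are first turned into str, is usually preferable to repairing the string afterwards.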

Escaping the unicode (JavaScript example below)

function escapeUnicode(str) {
    return str.replace(/[^\0-~]/g, function(ch) {
        return "\\u" + ("000" + ch.charCodeAt().toString(16)).slice(-4);
    });
}

before sending the data and using

input.encode("utf-8").decode('unicode-escape')

seemed to work.
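
To see what this round trip does, here is a rough Python sketch. escape_unicode_like_js is a hypothetical stand-in that imitates the JavaScript function above (it works on UTF-16 code units, as charCodeAt does), and unescape is a hypothetical decoder that additionally recombines surrogate pairs. Like the JavaScript version, it does not escape literal backslashes, so input containing backslashes would be ambiguous.

```python
def escape_unicode_like_js(text: str) -> str:
    """Imitate the JavaScript escapeUnicode: \\uXXXX per UTF-16 code unit."""
    units = text.encode("utf-16-be", "surrogatepass")
    out = []
    for i in range(0, len(units), 2):
        unit = int.from_bytes(units[i:i + 2], "big")
        # The JS regex /[^\0-~]/ leaves code units 0x00..0x7E untouched.
        out.append(chr(unit) if unit <= 0x7E else f"\\u{unit:04x}")
    return "".join(out)


def unescape(escaped: str) -> str:
    """Decode the escapes, then merge surrogate pairs into single code points."""
    decoded = escaped.encode("ascii").decode("unicode_escape")
    return decoded.encode("utf-16", "surrogatepass").decode("utf-16")


original = "\u03b5asdf \U0001F600"
wire = escape_unicode_like_js(original)
print(wire)  # \u03b5asdf \ud83d\ude00
print(unescape(wire) == original)  # True
```

Without the final UTF-16 step, characters above U+FFFF would remain as two separate surrogate code points in the decoded string.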
