Reinterpreting Unicode strings

Question

I'm receiving Unicode data from a client, stored in a dictionary called "data". The following code

variable1 = '\u03b5\u0061\u0073\u0064\u0066'
print("TYPE1 = " + str(type(variable1)))
print("VAR1 = " + variable1)

variable2 = data['text']
print("TYPE2 = " + str(type(variable2)))
print("VAR2 = " + variable2)

prints

TYPE1 = <class 'str'>
VAR1 = εasdf
TYPE2 = <class 'str'>
VAR2 = \u03b5\u0061\u0073\u0064\u0066

This suggests that the data from the client is somehow not interpreted properly. Writing the variables to a file gives the same result: for variable2, the file contains the literal text "\u03b5\u0061\u0073\u0064\u0066". How can I "reinterpret" that string so that I get the same result as with the inline variable?

The following did NOT work:

eval(variable2) fails with the error "unexpected character after line continuation character".

print(variable2.encode().decode()) prints VAR2 = Îµ.

.encode('ascii').decode('unicode_escape') raises UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

In the shell:

>>> "\u03b5\u0061\u0073\u0064\u0066"
'εasdf'

What about u'\u03b5\u0061\u0073\u0064\u0066'?

– Joop Eggen, commented Aug 22, 2018 at 8:56

lenz · Accepted Answer · 2018-08-22 05:20:01Z

1

It depends on how consistently the input data got corrupted (or encoded in some particular way), but for the given example the following should work:

>>> data = '\\u03b5\\u0061\\u0073\\u0064\\u0066'
>>> print(data)
\u03b5\u0061\u0073\u0064\u0066
>>> text = data.encode('ascii').decode('unicode_escape')
>>> print(text)
εasdf

The "unicode_escape" codec is provided exactly for Python-style Unicode escapes. It also works with escapes of the form \xNN and \U000NNNNN, mixed with literal ASCII characters.
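As a small illustration of the mixed case described above, the following sketch decodes a made-up input string (not taken from the question) that combines \xNN, \uNNNN and \UNNNNNNNN escapes with plain ASCII text:

```python
# Hypothetical input: literal backslash escapes mixed with plain ASCII text.
# The r prefix keeps the backslashes as ordinary characters.
raw = r'caf\xe9 \u03b5asdf \U0001F600 ok'

# encode('ascii') turns the str into bytes; unicode_escape then interprets
# the escape sequences the same way a Python string literal would.
decoded = raw.encode('ascii').decode('unicode_escape')

print(decoded)
```

This only works while the input is pure ASCII; any literal non-ASCII character makes the encode('ascii') step raise UnicodeEncodeError.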

A few things to note:

The .encode('ascii') step is necessary, because .decode only exists for bytes, not str.
If you have a mixture of non-ASCII literals and Unicode escapes (as is allowed in Python str literals), you can try encode('utf-8'), but I haven't thought this through.
eval doesn't work here because there are no quotes around the data.
It's possible that your data originates from JSON, where the \uNNNN escapes also exist (but not the \xNN and \U000NNNNN ones). If this is the case, you have to deal separately with characters above U+FFFF (e.g. emoji), which are represented by surrogate pairs.
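To make the surrogate-pair point concrete, here is a minimal sketch with a made-up input string. It assumes the escaped text is a valid JSON string body, meaning it contains no unescaped double quotes or control characters; if that does not hold, the json.loads call below will fail or misparse.

```python
import json

# Hypothetical payload fragment: one BMP escape plus a surrogate pair
# that together encodes U+1F600.
raw = r'\u03b5asdf \ud83d\ude00'

# unicode_escape decodes each \uNNNN on its own, so the pair stays as
# two lone surrogate code points.
via_escape = raw.encode('ascii').decode('unicode_escape')

# The JSON parser combines the pair into a single code point.
# Wrapping in quotes is only safe under the assumption stated above.
via_json = json.loads('"' + raw + '"')

# Lone surrogates can be merged afterwards with a round trip through UTF-16.
repaired = via_escape.encode('utf-16', 'surrogatepass').decode('utf-16')

print(len(via_escape), len(via_json), repaired == via_json)
```

If the data really is JSON, parsing the whole payload with a JSON parser is usually simpler than decoding individual fields by hand.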

answered Aug 22, 2018 at 5:20

lenz


9 Comments

Christian Neverdal Over a year ago

Updated my question with your suggestions and the results.

Christian Neverdal Over a year ago

The data does indeed originate from JSON.

lenz Over a year ago

Have you tried .encode('utf-8') instead of ASCII? Maybe you could also update your example to reproduce the UnicodeEncodeError as well.

Christian Neverdal Over a year ago

I don't know how to reproduce it without the client sending it (that would have made it a lot simpler). In the Python shell, printing these literals gives the right output. I guess str = "Îµ" is the closest thing: if you can make that into an ε then I guess it would work.

Christian Neverdal Over a year ago

It would seem that "Îµ" indeed is what is actually received.
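If the received text really is "Îµ", one plausible explanation (an assumption, not something confirmed in this thread) is that the UTF-8 bytes of ε were decoded as Latin-1 somewhere between the client and the server. Under that assumption, the damage can be reversed like this:

```python
# Assumed received text: the two characters U+00CE and U+00B5 followed by
# "asdf", which displays as "Îµasdf".
received = '\u00ce\u00b5asdf'

# Undo the wrong decoding step: recover the original bytes via Latin-1,
# then decode them as UTF-8, which is what they were to begin with.
repaired = received.encode('latin-1').decode('utf-8')

print(repaired)
```

If the wrong codec was Windows-1252 rather than Latin-1, some characters map to different bytes, so the encode step would need to use that codec instead. Fixing the decoding at the point where the request body is read is preferable to repairing strings afterwards.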


Christian Neverdal · Accepted Answer · 2018-08-22 08:14:28Z

0

Escaping the non-ASCII characters on the client (JavaScript example below)

function escapeUnicode(str) {
    return str.replace(/[^\0-~]/g, function(ch) {
        return "\\u" + ("000" + ch.charCodeAt().toString(16)).slice(-4);
    });
}

before sending the data and using

input.encode("utf-8").decode('unicode-escape')

seemed to work.
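One caveat with encode("utf-8") followed by unicode-escape: the codec reads every byte that is not part of an escape as Latin-1, so the combination is only safe when the client has escaped every non-ASCII character. The sketch below uses a made-up mixed input and a hypothetical helper, decode_u_escapes, that replaces only \uNNNN sequences and leaves all other characters alone:

```python
import re

# Hypothetical helper (not from the answers above): replace each \uNNNN
# sequence with its character and leave every other character untouched.
def decode_u_escapes(text):
    return re.sub(
        r'\\u([0-9a-fA-F]{4})',
        lambda match: chr(int(match.group(1), 16)),
        text,
    )

# Made-up input: one literal non-ASCII character followed by an escape.
mixed = 'ε' + r'\u0061sdf'

# The UTF-8 bytes of the literal ε are read as Latin-1 and come out mangled.
mangled = mixed.encode('utf-8').decode('unicode_escape')

# The regex-based helper keeps the literal ε intact.
fixed = decode_u_escapes(mixed)

print(ascii(mangled), fixed)
```

The helper does not merge surrogate pairs, so characters above U+FFFF would still need separate handling.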

answered Aug 22, 2018 at 8:14

Christian Neverdal
