Python 2.7: decoding both UTF-8 and unicode-escape causes UnicodeEncodeError

Question

I have a TSV file in which, on some lines, a particular column contains a mixed format such as Hapoel_Be\u0027er_Sheva_A\u002eF\u002eC\u002e, which should read Hapoel_Be'er_Sheva_A.F.C. once decoded.

And here is the code I use to read the file and split the columns:

with open(path, 'rb') as f:
    for line in f:
        cols = line.decode('utf-8').split('\t')
        text = cols[3].decode('unicode-escape')  # the column with the mixed format described above

Error message:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u0160' in position 6: ordinal not in range(128)

How can I convert the first (mixed) format into the second while reading the file? I'm using Python 2.7.

Thank you so much.

Is this python 2 or 3?

– FHTMitchell, Commented Sep 6, 2018 at 15:38
@FHTMitchell sorry forgot to specify. it's python 2.7.

– userofstackoverflow, Commented Sep 6, 2018 at 15:41

FHTMitchell · Accepted Answer · 2018-09-06 15:43:06Z

1

You can use ast.literal_eval to convert the raw bytes into a unicode string:

import ast

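raw_bytes = br'Hapoel_Be\u0027er_Sheva_A\u002eF\u002eC\u002e'
print(raw_bytes)  # Python 2 prints the text as-is; Python 3 shows it as b'...'

# The u prefix matters on Python 2: a plain "..." literal leaves \uXXXX escapes untouched.
# Note: this breaks if the data itself contains a double quote.
unicode_string = ast.literal_eval('u"{}"'.format(raw_bytes.decode('utf8')))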

output of unicode_string:

Hapoel_Be'er_Sheva_A.F.C.

Update - tested in python 2.7 and works a charm


1 Comment

userofstackoverflow Over a year ago

Thank you for the effort, but it triggers an error. I think that's because I'm already decoding the whole line (I've edited the question accordingly).

Mark Ransom · Accepted Answer · 2018-09-25 23:30:27Z

1

You can use decode('unicode-escape') to convert those hex sequences to characters.

>>> 'Hapoel_Be\\u0027er_Sheva_A\\u002eF\\u002eC\\u002e'.decode('unicode-escape')
u"Hapoel_Be'er_Sheva_A.F.C."

Edit: according to your update to the question, you actually have a combination of hex sequences and Unicode characters outside of the ASCII range. The error comes from an automatic conversion that Python 2.7 attempts when you try to use .decode() on a Unicode string - decode only works on byte strings, so it tries to convert from Unicode using the ASCII codec. Python 3 won't allow this mistake.
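To make that hidden step visible, here is a minimal sketch that performs the ASCII encode explicitly, which is roughly what Python 2.7 does behind the scenes when .decode() is called on a unicode object. The helper name implicit_decode and the sample strings are made up for illustration; written this way, it should behave the same on Python 2.7 and Python 3.

```python
def implicit_decode(text, codec):
    # Hypothetical helper: spell out the hidden ASCII encode that Python 2
    # performs when .decode() is called on a unicode object.
    return text.encode('ascii').decode(codec)

ascii_only = u'Be\\u0027er'            # only ASCII characters, including a literal escape
mixed = u'\u0160ibenik_Be\\u0027er'    # starts with a real non-ASCII character

decoded = implicit_decode(ascii_only, 'unicode-escape')  # succeeds

try:
    implicit_decode(mixed, 'unicode-escape')
    error_message = None
except UnicodeEncodeError as exc:
    error_message = str(exc)  # same kind of 'ascii' codec error as in the question

print(decoded)
print(error_message)
```

As long as the column holds only ASCII text, the implicit encode goes unnoticed, which would explain why the error shows up only on some lines.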

To fix this you need a double conversion, one to convert those non-ASCII characters to hex sequences and another to convert them back. The 'unicode-escape' codec will double up the backslashes so those must be corrected as well.

>>> print u'Hapoel_Be\\u0027er_Sheva_A\\u002eF\\u002eC\\u002e\u0160'.encode('unicode-escape').replace(b'\\\\u', b'\\u').decode('unicode-escape')
Hapoel_Be'er_Sheva_A.F.C.Š
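Applied to the reading loop from the question, the same round trip might look like the sketch below. The function name decode_unicode_escapes is hypothetical, and an in-memory io.BytesIO with one made-up line stands in for the real TSV file; the column index 3 is taken from the question.

```python
import io

def decode_unicode_escapes(text):
    """Turn literal \\uXXXX sequences in a unicode string into characters."""
    escaped = text.encode('unicode-escape')      # non-ASCII becomes escapes, backslashes are doubled
    escaped = escaped.replace(b'\\\\u', b'\\u')  # undo the doubling on the original escapes
    return escaped.decode('unicode-escape')

# Made-up stand-in for the file: one UTF-8 encoded line, mixed-format name in column 3.
sample = u'1\tx\ty\t\u0160ibenik_Be\\u0027er_A\\u002eF\\u002eC\\u002e\n'.encode('utf-8')

names = []
with io.BytesIO(sample) as f:
    for line in f:
        cols = line.decode('utf-8').rstrip(u'\n').split(u'\t')
        names.append(decode_unicode_escapes(cols[3]))
```

One caveat with this approach: the replace step cannot tell an escape sequence apart from data that genuinely contains a backslash followed by u, so it assumes the column never holds such text.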

edited Sep 25, 2018 at 23:30

answered Sep 6, 2018 at 15:46


3 Comments

FHTMitchell Over a year ago

damn that's much better than mine

Mark Ransom Over a year ago

@FHTMitchell I always consider any kind of eval a last resort. One of the guiding principles of Python was that there should always be one obvious way to do something, but it sure violates that principle a lot.

userofstackoverflow Over a year ago

@MarkRansom it caused an error. I'll be posting more details and the error message.
