I have a TSV file in which, on some lines, a particular column mixes escape sequences with regular text, for example Hapoel_Be\u0027er_Sheva_A\u002eF\u002eC\u002e, which should read Hapoel_Be'er_Sheva_A.F.C.

And here is the code I use to read the file and split the columns:

with open(path, 'rb') as f:
    for line in f:
        cols = line.decode('utf-8').split('\t')
        # This is the column that has the mixed format mentioned above
        text = cols[3].decode('unicode-escape')

Error message:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u0160' in position 6: ordinal not in range(128)

How can I convert the first (mixed) format into the second while reading the file? I'm using Python 2.7.

Thank you so much,

  • Is this Python 2 or 3? Commented Sep 6, 2018 at 15:38
  • @FHTMitchell Sorry, I forgot to specify. It's Python 2.7. Commented Sep 6, 2018 at 15:41

2 Answers

You can use ast.literal_eval to convert the raw bytes into a unicode string:

import ast

raw_bytes = br'Hapoel_Be\u0027er_Sheva_A\u002eF\u002eC\u002e'
print(raw_bytes)  # b'Hapoel_Be\u0027er_Sheva_A\u002eF\u002eC\u002e'

unicode_string = ast.literal_eval('"{}"'.format(raw_bytes.decode('utf8')))

output of unicode_string:

Hapoel_Be'er_Sheva_A.F.C.

Update: tested in Python 2.7 and it works like a charm.

1 Comment

Thank you for the effort, but it triggers an error. I think that's because I'm already decoding the whole line (the question has been edited accordingly).

You can use decode('unicode-escape') to convert those hex sequences to characters.

>>> 'Hapoel_Be\\u0027er_Sheva_A\\u002eF\\u002eC\\u002e'.decode('unicode-escape')
u"Hapoel_Be'er_Sheva_A.F.C."

Edit: according to your update to the question, you actually have a combination of hex sequences and Unicode characters outside of the ASCII range. The error comes from an automatic conversion that Python 2.7 attempts when you try to use .decode() on a Unicode string - decode only works on byte strings, so it tries to convert from Unicode using the ASCII codec. Python 3 won't allow this mistake.
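As a minimal sketch of that implicit step (variable names are illustrative only): doing the ASCII encoding explicitly on a unicode string that contains Š fails with the same kind of error that Python 2.7 raises behind the scenes when .decode() is called on unicode. This snippet is written so that it should run on both Python 2.7 and Python 3.

```python
# A unicode string with one character outside the ASCII range.
text = u'Hapoel_\u0160'

error = None
try:
    # Python 2.7 effectively performs this encode step first when
    # .decode() is called on a unicode object; here it is done explicitly.
    text.encode('ascii')
except UnicodeEncodeError as exc:
    error = exc
    print(type(exc).__name__)
```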

To fix this you need a double conversion, one to convert those non-ASCII characters to hex sequences and another to convert them back. The 'unicode-escape' codec will double up the backslashes so those must be corrected as well.

>>> print u'Hapoel_Be\\u0027er_Sheva_A\\u002eF\\u002eC\\u002e\u0160'.encode('unicode-escape').replace(b'\\\\u', b'\\u').decode('unicode-escape')
Hapoel_Be'er_Sheva_A.F.C.Š
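Applied to the reading loop from the question, the same double conversion might look like the sketch below. The helper names unescape_column and read_column are hypothetical, and the sketch assumes the file is UTF-8 and that the column index is 3, as in the question. It uses io.open so that each line is already unicode on Python 2.7 as well as Python 3.

```python
import io


def unescape_column(text):
    """Turn literal backslash-u escapes in a unicode string into characters."""
    # Escape everything to ASCII bytes; backslashes already present get doubled.
    escaped = text.encode('unicode-escape')
    # Undo the doubling for the \u sequences that were already in the text.
    escaped = escaped.replace(b'\\\\u', b'\\u')
    # Now every \uXXXX sequence, old or new, decodes to its character.
    return escaped.decode('unicode-escape')


def read_column(path, index=3):
    """Yield the unescaped column `index` from each line of a UTF-8 TSV file."""
    with io.open(path, encoding='utf-8') as f:
        for line in f:
            cols = line.rstrip(u'\n').split(u'\t')
            yield unescape_column(cols[index])
```

Text that itself contains doubled backslashes in front of a u is outside what this sketch handles, so it is worth checking the data for that case first.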

3 Comments

damn that's much better than mine
@FHTMitchell I always consider any kind of eval a last resort. One of the guiding principles of Python was that there should always be one obvious way to do something, but it sure violates that principle a lot.
@MarkRansom it caused an error. I'll be posting more details and the error message.
