How can I change unicode to ascii and drop unrecognized characters

Question

My file is in unicode. However, for some reason, I want to change it to plain ascii while dropping any characters that are not recognized in ascii. For example, I want to change u'This is a string�' to just 'This is a string'. Following is the code I use to do so.

ascii_str = unicode_str.encode('ascii', 'ignore')

However, I still get the following annoying error.

UnicodeDecodeError: 'ascii' codec can't decode byte 0xf3 in position 0: 
  ordinal not in range(128)

How can I solve this problem? I am fine with plain ascii strings.

Try print repr(unicode_str). Also, please post your complete traceback.
– Joran Beasley, Commented Nov 12, 2014 at 19:17
Look at my solution. I think reading the file (if there is one) with the right encoding is the best starting point.
– wenzul, Commented Nov 12, 2014 at 19:47
I can highly recommend this library for converting Unicode to ASCII: pypi.python.org/pypi/Unidecode
– Simeon Visser, Commented Nov 12, 2014 at 20:29

wenzul · Accepted Answer · 2014-11-12 20:28:19Z

3

I assume that your unicode_str is a real unicode string.

>>> u"\xf3".encode("ascii", "ignore")
''

If it is not (that is, it is a byte string), use this instead:

>>> "\xf3".decode("ascii", "ignore").encode("ascii")

The best approach is always to find out which encoding you are dealing with and then decode the data with it, so that you end up with a proper unicode string. In other words, make sure unicode_str is either a real unicode string from the start or is read with the right codec. I assume there is a file involved, so the best option would be:

import codecs
f = codecs.open('unicode.rst', encoding='utf-8')
for line in f:
    print repr(line)
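To connect this back to the original question, here is a minimal sketch that reads the file with an explicit codec and then drops whatever is not ASCII. The function name read_as_ascii and the UTF-8 default are assumptions for illustration; use whatever encoding the file really has. It is written so that it should also run on Python 3.

```python
import io


def read_as_ascii(path, encoding="utf-8"):
    """Read a text file and return its contents with non-ASCII characters dropped.

    `encoding` is an assumption: pass the codec the file was actually written with.
    """
    # io.open decodes the bytes on disk into text using the given codec.
    with io.open(path, encoding=encoding) as f:
        text = f.read()
    # Encoding with 'ignore' silently drops every character outside ASCII;
    # decoding the result again gives back a text string that is pure ASCII.
    return text.encode("ascii", "ignore").decode("ascii")
```

Decoding first and encoding second keeps the two steps separate, so the lossy step can only ever drop characters rather than raise.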

A last-resort approach would be to keep only the characters found in string.printable:

>>> import string
>>> a = "abc\xf3abc"
>>> "".join(b for b in a if b in string.printable)
'abcabc'

edited Nov 12, 2014 at 20:28

answered Nov 12, 2014 at 19:32

wenzul


1 Comment

Mark Ransom Over a year ago

This. The error message indicates an error in decoding, not encoding.
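A small sketch of why the traceback mentions decoding: in Python 2, calling .encode() on a byte string first decodes it implicitly with the ascii codec, and that hidden step is the one that fails. The snippet below makes the step explicit so that it also runs on Python 3; the sample bytes are made up for illustration.

```python
raw = b"\xf3abc"  # bytes, not text (a plain Python 2 str behaves like this)

# This is the step Python 2 performs silently before encoding a byte string.
try:
    raw.decode("ascii")
    error_name = None
    message = ""
except UnicodeDecodeError as exc:
    error_name = type(exc).__name__
    message = str(exc)

print(error_name)
print(message)

# Telling the decode step to ignore bad bytes avoids the error.
cleaned = raw.decode("ascii", "ignore")
print(cleaned)
```

The 'ignore' handler passed to encode() never gets a chance to run in the failing case, because the exception is raised by the implicit decode that comes before it.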

score 1 · Accepted Answer · 2014-11-12 19:24:38Z

1

You need to decode it. If you have a file:

with open('example.csv', 'rb') as f:
    csv = f.read().decode("utf-8")

If you want to decode a string, you can do it this way:

data.decode('UTF-8')

UPDATE: You can use ord() to get the ASCII code of every character:

d=u'This is a string'
l=[ord(s) for s in d.encode('ascii', 'ignore')]
print l

If you need to join them back into a string, convert each code back to a character with chr() first (join only accepts strings):

print "".join(chr(c) for c in l)

edited Nov 12, 2014 at 19:24

answered Nov 12, 2014 at 19:14

user4179775

2 Comments

MetallicPriest Over a year ago

After decoding it, how can I convert it to ascii and ignore the characters not recognized by ascii?

user4179775 Over a year ago

@MetallicPriest You can use ord() to get the ASCII code of every character. I've updated the post.

Kasravnd · Accepted Answer · 2014-11-12 19:35:50Z

1

Because your string contains a replacement character (a symbol defined in the Unicode standard at code point U+FFFD, in the Specials block), you need to tell the interpreter that the literal is a unicode string by adding a u prefix in front of it:

>>> unicode_str=u'This is a string�'
>>> unicode_str.encode('ascii', 'ignore')
'This is a string'
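If it is not known in advance whether the value is text or raw bytes, a small helper can handle both cases. This is only a sketch: the name to_ascii and the UTF-8 default are assumptions, not something from the question, and it is written so that it should also run on Python 3.

```python
def to_ascii(value, encoding="utf-8"):
    """Return `value` as text containing only ASCII characters.

    `encoding` is an assumed default and only matters when `value` is bytes.
    """
    # Raw bytes must be decoded to text first; undecodable bytes are dropped.
    if isinstance(value, bytes):
        value = value.decode(encoding, "ignore")
    # Encoding with 'ignore' drops every non-ASCII character, U+FFFD included.
    return value.encode("ascii", "ignore").decode("ascii")


print(to_ascii(u"This is a string\ufffd"))
print(to_ascii(b"\xc3\xb3abc"))
```

The first call drops the replacement character, and the second drops a UTF-8 encoded accented letter while keeping the ASCII characters after it.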

edited Nov 12, 2014 at 19:35

answered Nov 12, 2014 at 19:22

Kasravnd

