4

In a text file I'm processing, I have characters like ����. Not sure what they are.

I'm wondering how to remove/convert these characters.

I have tried to convert it to ASCII using .encode('ascii', 'ignore'). Python told me the character is not within 0,128.

I have also tried unicodedata: unicodedata.normalize('NFKD', text).encode('ascii','ignore'), with the same error.

Can anyone help?

Thanks!

1
  • od -x reports bfef efbd bdbf bfef efbd bdbf. Commented Jun 30, 2012 at 0:54

2 Answers

8

You can always take a Unicode string and use the code you showed:

my_ascii = my_uni_string.encode('ascii', 'ignore')

If that gave you an error, then you didn't really have a Unicode string to begin with. If that is true, then you have a byte string instead. You'll need to know what encoding it's using, and you can turn it into a Unicode string with:

my_uni_string = my_byte_string.decode('utf8')

(assuming your encoding is UTF-8).
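To make the two steps concrete, here is a minimal sketch in Python 3 syntax (an assumption; the code in this thread is Python 2, where the byte type is `str` and the text type is `unicode`). The sample bytes are hypothetical: plain ASCII around the `ef bf bd` sequence that the `od -x` comment on the question reports, which is the UTF-8 encoding of U+FFFD, the replacement character. In Python 2, calling `.encode` on a byte string first decodes it implicitly as ASCII, and that implicit step is most likely where the original error came from.

```python
# Hypothetical sample: "abc", then the bytes EF BF BD twice, then "def".
raw = b"abc\xef\xbf\xbd\xef\xbf\xbddef"

# Step 1: bytes -> text. This assumes the data really is UTF-8.
text = raw.decode("utf-8")

# Step 2: text -> ASCII bytes, silently dropping anything outside ASCII.
ascii_bytes = text.encode("ascii", "ignore")
ascii_text = ascii_bytes.decode("ascii")

print(repr(text))
print(ascii_text)
```

If the decode step raises `UnicodeDecodeError`, the data is probably not UTF-8 and a different encoding name is needed.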

This split between byte string and Unicode string can be confusing. My presentation, Pragmatic Unicode, or, How Do I Stop The Pain can help you to keep it all straight.


3 Comments

Thank you for the presentation. But how do I find out the encoding of the original text?
@cheng I'm not sure you can just intuit the encoding of a random string easily. It's probably shown somewhere to you though, in the file or elsewhere
As is explained in the presentation, you have to know the encoding by some prior agreement. You can guess the encoding, but the only way to know for sure is to have a spec that explains what the encoding is.
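When no spec is available, guessing can be sketched as trying a short list of candidate encodings in order. This is only an illustration: the helper name `guess_decode` and the candidate list are hypothetical, and a decode that succeeds does not prove the guess is right (`latin-1`, for instance, accepts any byte sequence).

```python
def guess_decode(data, candidates=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding_name) for the first candidate that decodes cleanly."""
    for name in candidates:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue  # this candidate does not fit; try the next one
    raise ValueError("none of the candidate encodings fit")


# Valid UTF-8 is accepted by the first candidate.
utf8_result = guess_decode(b"caf\xc3\xa9")

# A lone 0xE9 byte is not valid UTF-8, so the helper falls through to cp1252.
fallback_result = guess_decode(b"caf\xe9")
```

Putting the strictest encoding first matters: UTF-8 rejects most byte sequences that were not written as UTF-8, while single-byte encodings accept almost anything.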
1

It's not perfect (especially for shorter strings) but the chardet library would be of use here:

http://pypi.python.org/pypi/chardet

To have chardet figure out the encoding and then decode the byte string to unicode, you would do:

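import chardet
# detect() returns a dict; 'encoding' is chardet's best guess and may be None
encoding = chardet.detect(some_string)['encoding']
# Python 2: decode the byte string into a unicode object
unicode_string = unicode(some_string, encoding)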

Of course, you won't be able to encode them as ascii if they're out of the ascii range.
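As a rough illustration of that last point, the sketch below (Python 3 syntax, with a hypothetical sample string) shows what the standard error handlers do with characters outside the ASCII range, along with the NFKD approach mentioned in the question.

```python
import unicodedata

# Hypothetical sample: "café", a space, and U+FFFD (the replacement character).
text = "caf\u00e9 \ufffd"

# The default handler ('strict') refuses characters outside ASCII.
try:
    text.encode("ascii")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

ignored = text.encode("ascii", "ignore")    # drops non-ASCII characters
replaced = text.encode("ascii", "replace")  # substitutes '?' for each one

# NFKD splits 'é' into 'e' plus a combining accent, so the base letter survives.
folded = unicodedata.normalize("NFKD", text).encode("ascii", "ignore")
```

U+FFFD has no decomposition, so it is dropped in every variant; once a character has been replaced by U+FFFD, the original character cannot be recovered from the text.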
