0

How can I replace Unicode characters using re in Python? I'm looking for something like this:

line.replace('Ã','')
line.replace('¢','')
line.replace('â','')

Or is there a way to remove all non-ASCII characters from a file? I converted a PDF file to ASCII text, and the output contains some non-ASCII characters (e.g. the bullets from the PDF).

Please help me.

2
  • Please respect language, you are not on IRC. Commented Jul 5, 2011 at 11:14
  • Even on IRC it is not appropriate Commented Jul 5, 2011 at 11:38

4 Answers

1

Edit after feedback in comments.

Another solution is to check the numeric value of each character and keep only those below 128, since ASCII covers the range 0-127. Like so:

# coding=utf-8

def removeUnicode():
    text = "hejsanäöåbadasd wodqpwdk"
    asciiText = ""
    for char in text:
        # Keep only characters in the ASCII range (0-127).
        if ord(char) < 128:
            asciiText = asciiText + char

    return asciiText

import timeit
start = timeit.Timer("removeUnicode()", "from __main__ import removeUnicode")
print "Time taken: " + str(start.timeit())

Here's an altered version of jd's answer with benchmarks:

# coding=utf-8

def removeUnicode():
    text = u"hejsanäöåbadasd wodqpwdk"
    # Python 2: a byte str must be decoded to unicode before encoding to ASCII.
    if isinstance(text, str):
        return text.decode('utf-8').encode("ascii", "ignore")
    else:
        return text.encode("ascii", "ignore")

import timeit
start = timeit.Timer("removeUnicode()", "from __main__ import removeUnicode")
print "Time taken: " + str(start.timeit())

Output of the first solution with a str as input:

computer:~ Ancide$ python test1.py
Time taken: 5.88719677925

Output of the first solution with a unicode string as input:

computer:~ Ancide$ python test1.py
Time taken: 7.21077990532

Output of the second solution with a str as input:

computer:~ Ancide$ python test1.py
Time taken: 2.67580914497

Output of the second solution with a unicode string as input:

computer:~ Ancide$ python test1.py
Time taken: 1.740680933

Conclusion

Encoding is faster and takes less code, so it is the better solution.
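
For readers on Python 3, where str is already Unicode and print is a function, the same comparison could be sketched roughly as below. The names strip_by_ord and strip_by_encode are made up for this illustration and are not part of the answer above. Timings depend on the machine, so none are shown.

```python
import timeit

TEXT = "hejsanäöåbadasd wodqpwdk"


def strip_by_ord(text=TEXT):
    # Keep only code points in the ASCII range 0-127.
    return "".join(ch for ch in text if ord(ch) < 128)


def strip_by_encode(text=TEXT):
    # Encode to ASCII, dropping anything that cannot be encoded,
    # then decode back so the result is a str rather than bytes.
    return text.encode("ascii", "ignore").decode("ascii")


if __name__ == "__main__":
    for func in (strip_by_ord, strip_by_encode):
        # timeit.timeit accepts a callable directly; no Timer object needed.
        seconds = timeit.timeit(func, number=10000)
        print(func.__name__, seconds)
```

Both functions return the same text, so the choice between them comes down to speed and readability.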

3 Comments

Your benchmarks are useless. You should always use timeit for timing Python stuff: (a) it measures just the actual code, not interpreter initialisation and destruction, and (b) it does many runs. As you should expect, unicode.encode() is oodles faster, at about 0.889, while your removeUnicode takes over 7; even dealing with a str with str.decode('utf-8').encode('ascii', 'ignore') is not too bad, around 2.5 (str.decode('ascii', 'ignore') is slow, about 13.3).
Thanks for your feedback, Chris. I've edited my answer according to your feedback. Are you content with my edit?
yep, it'll do; I'd recommend using just timeit.timeit(removeUnicode) instead of instantiating a Timer object as it's shorter and easier.
1

You have to encode your Unicode string to ASCII, ignoring any error that occurs. Here's how:

>>> u'uéa&à'.encode('ascii', 'ignore')
'ua&'
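
In Python 3 the same call returns bytes, so a decode step is needed to get text back. The sketch below also applies the idea to a whole file, which is what the question asks about. The helper name strip_non_ascii_file and the assumption that the input file is UTF-8 are illustrative choices, not part of the original answer.

```python
def strip_non_ascii(text):
    # Drop every character that has no ASCII encoding.
    return text.encode("ascii", "ignore").decode("ascii")


def strip_non_ascii_file(src_path, dst_path, encoding="utf-8"):
    # Hypothetical helper: assumes the source file is encoded as `encoding`.
    with open(src_path, encoding=encoding) as src, \
            open(dst_path, "w", encoding="ascii") as dst:
        for line in src:
            dst.write(strip_non_ascii(line))


print(strip_non_ascii("uéa&à"))  # ua&
```

If the converted PDF text is in some other encoding, pass that encoding instead of the default.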

1

Why replace characters by hand when you can convert the encoding with

title.decode('latin-1').encode('utf-8')

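or, if you want to deal with undecodable bytes (errors='replace' substitutes a placeholder character for them; errors='ignore' drops them),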

unicode(title, errors='replace')
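
As a rough Python 3 illustration of the two calls above and of the difference between the 'replace' and 'ignore' error handlers, consider the sketch below. The sample bytes are an assumption chosen for the example: the word "café" encoded as Latin-1.

```python
raw = b"caf\xe9"  # assumed sample input: 'café' encoded as Latin-1

# Re-encode Latin-1 bytes as UTF-8 (Python 3 form of the decode/encode call).
as_utf8 = raw.decode("latin-1").encode("utf-8")

# The same bytes are not valid ASCII, so the error handler decides the result.
replaced = raw.decode("ascii", errors="replace")  # bad byte becomes U+FFFD
ignored = raw.decode("ascii", errors="ignore")    # bad byte is dropped

print(as_utf8)
print(repr(replaced))
print(repr(ignored))
```

Only 'ignore' actually removes the offending byte; 'replace' keeps a visible marker where it was.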

1 Comment

You typed 'replace' where you meant 'ignore'.
0

Try passing the re.UNICODE flag when compiling the pattern, like this:

re.compile("pattern", re.UNICODE)

For more information, see the documentation for the re module.
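
Note that re.UNICODE only changes how classes such as \w and \b match (and is already the default for str patterns in Python 3); it does not remove anything by itself. A minimal sketch of a pattern that does strip non-ASCII characters, written for Python 3, might look like this. The pattern and the name remove_non_ascii are assumptions for illustration and do not come from the answer above.

```python
import re

# Matches one or more consecutive characters outside the ASCII range 0x00-0x7F.
NON_ASCII = re.compile(r"[^\x00-\x7F]+")


def remove_non_ascii(text, replacement=""):
    # Each run of non-ASCII characters is replaced by `replacement`.
    return NON_ASCII.sub(replacement, text)


print(remove_non_ascii("• first item â€¢ second item"))
```

Passing a replacement such as "?" instead of the empty string makes it easier to see where characters were removed.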
