How to replace Unicode values using re in Python?

Question

How can I replace Unicode values using re in Python? I'm looking for something like this:

line.replace('Ã','')
line.replace('¢','')
line.replace('Ã¢','')

Or is there a way to remove all the non-ASCII characters from a file? I converted a PDF file to ASCII, and the output contains some non-ASCII characters (e.g. the bullets from the PDF).

Please help me.

Please respect language, you are not on IRC.

– Emre Yazici, Commented Jul 5, 2011 at 11:14
Even on IRC it is not appropriate

– user2665694, Commented Jul 5, 2011 at 11:38

Kumar · Accepted Answer · 2017-10-08 09:22:01Z

1

Edit after feedback in comments.

Another solution is to check the numeric value of each character and keep only those under 128, since ASCII covers 0 to 127. Like so:

# coding=utf-8
import timeit

def removeUnicode():
    text = "hejsanäöåbadasd wodqpwdk"
    asciiText = ""
    for char in text:
        # ASCII covers code points 0-127, so keep only those characters
        if ord(char) < 128:
            asciiText = asciiText + char

    return asciiText

start = timeit.Timer("removeUnicode()", "from __main__ import removeUnicode")
print("Time taken: " + str(start.timeit()))
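For readers on Python 3, the same check can be written more compactly. This is only a minimal sketch, assuming Python 3 (where every str is Unicode text); the function name keep_ascii is hypothetical and not part of the original answer.

```python
def keep_ascii(text):
    """Return text without any character whose code point is 128 or higher."""
    # ASCII covers code points 0 through 127, so anything above is dropped.
    return "".join(char for char in text if ord(char) < 128)


if __name__ == "__main__":
    print(keep_ascii("hejsanäöåbadasd wodqpwdk"))  # prints: hejsanbadasd wodqpwdk
```

Building the result with str.join avoids the repeated string concatenation used in the loop version.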

Here's an altered version of jd's answer with benchmarks:

# coding=utf-8
import timeit

def removeUnicode():
    text = u"hejsanäöåbadasd wodqpwdk"
    if isinstance(text, str):
        # Python 2 byte string: decode to unicode first, then drop non-ASCII characters
        return text.decode('utf-8').encode("ascii", "ignore")
    else:
        # Already a unicode string: encode directly, ignoring unencodable characters
        return text.encode("ascii", "ignore")

start = timeit.Timer("removeUnicode()", "from __main__ import removeUnicode")
print("Time taken: " + str(start.timeit()))

Output first solution using a str string as input:

computer:~ Ancide$ python test1.py
Time taken: 5.88719677925

Output first solution using a unicode string as input:

computer:~ Ancide$ python test1.py
Time taken: 7.21077990532

Output second solution using a str string as input:

computer:~ Ancide$ python test1.py
Time taken: 2.67580914497

Output second solution using a unicode string as input:

computer:~ Ancide$ python test1.py
Time taken: 1.740680933

Conclusion

Encoding is the faster solution, and it takes less code, so it is the better solution.

edited Oct 8, 2017 at 9:22

Kumar

answered Jul 5, 2011 at 11:43

rzetterberg

3 Comments

Chris Morgan Over a year ago

Your benchmarks are useless. You should always use timeit for timing Python code: (a) it measures just the actual code, not interpreter initialisation and destruction, and (b) it does many runs. As you should expect, unicode.encode() is oodles faster, about 0.889, while your removeUnicode takes over 7; even dealing with a str with str.decode('utf-8').encode('ascii', 'ignore') is not too bad, around 2.5 (str.decode('ascii', 'ignore') is slow, about 13.3).

rzetterberg Over a year ago

Thanks for your feedback, Chris. I've edited my answer according to your feedback. Are you content with my edit?

Chris Morgan Over a year ago

yep, it'll do; I'd recommend using just timeit.timeit(removeUnicode) instead of instantiating a Timer object as it's shorter and easier.
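The shorter form suggested in this comment could look like the sketch below. It assumes Python 3, uses a hypothetical stand-in function named remove_non_ascii, and picks number=10000 arbitrarily to keep the run short; timeit.timeit accepts a callable directly.

```python
import timeit


def remove_non_ascii():
    # Hypothetical stand-in for the function being benchmarked.
    text = "hejsanäöåbadasd wodqpwdk"
    return text.encode("ascii", "ignore").decode("ascii")


# Passing the callable avoids building a Timer object and a setup string.
elapsed = timeit.timeit(remove_non_ascii, number=10000)
print("Time taken: " + str(elapsed))
```

The printed time depends on the machine and on the chosen number of runs, so it is not comparable to the figures quoted in the answer.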

jd. · Accepted Answer · 2011-07-05 11:15:25Z

1

You have to encode your Unicode string to ASCII, ignoring any error that occurs. Here's how:

>>> u'uéa&à'.encode('ascii', 'ignore')
'ua&'
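In Python 3 the result of encode is a bytes object, so a decode step is needed to get text back. A minimal sketch under that assumption (the helper name to_ascii is hypothetical):

```python
def to_ascii(text):
    """Drop every character that cannot be represented in ASCII."""
    # errors="ignore" silently skips characters the ASCII codec cannot encode.
    ascii_bytes = text.encode("ascii", errors="ignore")
    # Decode back so the caller receives str rather than bytes.
    return ascii_bytes.decode("ascii")


if __name__ == "__main__":
    print(to_ascii("uéa&à"))  # prints: ua&
```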

answered Jul 5, 2011 at 11:15

jd.

Tauquir · Accepted Answer · 2011-07-05 11:42:07Z

1

Why do you want to replace if you can use

title.decode('latin-1').encode('utf-8')

or if you want to ignore

unicode(title, errors='replace')
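The two error handlers behave differently, which the following sketch tries to illustrate. It assumes Python 3 (where bytes.decode takes the place of the Python 2 unicode() call) and uses a made-up sample string:

```python
raw = "uéa".encode("utf-8")  # the é becomes two non-ASCII bytes

# "ignore" drops undecodable bytes entirely.
ignored = raw.decode("ascii", errors="ignore")

# "replace" substitutes U+FFFD (the replacement character) for each bad byte.
replaced = raw.decode("ascii", errors="replace")

print(ascii(ignored))   # prints: 'ua'
print(ascii(replaced))  # prints: 'u\ufffd\ufffda'
```

So 'replace' keeps a visible marker where data was lost, while 'ignore' removes it without a trace.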

answered Jul 5, 2011 at 11:42

Tauquir

1 Comment

user Over a year ago

You typed 'replace' for ignore

servik · Accepted Answer · 2011-07-05 11:17:40Z

0

Try passing the re.UNICODE flag when compiling the pattern, like this:

re.compile("pattern", re.UNICODE)

For more information, see the re module documentation.
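Applied to the original question, a regular expression can strip every character outside the ASCII range in one pass. This is a sketch assuming Python 3 and is not taken from the answer above; the names NON_ASCII and strip_non_ascii are hypothetical. In Python 3, str patterns match Unicode by default, so the re.UNICODE flag is not needed for this pattern.

```python
import re

# Matches any run of characters outside the ASCII range (code points 0-127).
NON_ASCII = re.compile(r"[^\x00-\x7F]+")


def strip_non_ascii(text, replacement=""):
    """Replace each run of non-ASCII characters in text with replacement."""
    return NON_ASCII.sub(replacement, text)


if __name__ == "__main__":
    print(strip_non_ascii("uéa&à"))  # prints: ua&
```

Passing a replacement such as a space keeps words apart when a bullet or similar symbol sat between them.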

answered Jul 5, 2011 at 11:17

servik
