Handle Unicode characters with Python regexes

Question

I'm writing a simple application where I want to replace certain words with other words. I'm running into problems with words that use single quotes such as aren't, ain't, isn't.

I have a text file with the following

aren’t=ain’t
hello=hey

I parse the text file and create a dictionary out of it

u'aren\u2019t' = u'ain\u2019t'
u'hello' = u'hey'

Then I try to replace all the characters in a given text

text = u"aren't"

def replace_all(text, dict):
    for i, k in dict.iteritems():
        #replace all whole words of I with K in lower cased text, regex = \bSTRING\b
        text = re.sub(r"\b" + i + r"\b", k , text.lower())
    return text

The problem is that re.sub() doesn't match u'aren\u2019t' with u"aren't".

What can I do so that my replace_all() function will match both "hello" and "aren't" and replace them with the appropriate text? Can I do something in Python so that my dictionary doesn't contain Unicode? Could I convert my text to use a Unicode character, or could I modify the regex to match the Unicode character as well as all the other text?

The expected result is that the text "aren't" is replaced with "ain't". (Pim, commented Feb 24, 2011 at 15:21)

Mikel · Accepted Answer · 2011-02-23 22:59:53Z

3

I guess your problem is:

text = u"aren't"

instead of:

text = u"aren’t"

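(Note the different apostrophes: the first is the ASCII apostrophe, U+0027, and the second is the right single quotation mark, U+2019.)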

Here's your code modified to make it work:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re

d = {
    u'aren’t': u'ain’t',
    u'hello': u'hey'
    }
#text = u"aren't"
text = u"aren’t"


def replace_all(text, d):
    # Lowercase once, then replace each whole word (regex: \bWORD\b).
    text = text.lower()
    for old, new in d.iteritems():
        # re.escape keeps any regex metacharacters in the key literal.
        text = re.sub(r"\b" + re.escape(old) + r"\b", new, text)
    return text

if __name__ == '__main__':
    newtext = replace_all(text, d)
    print newtext

Output:

ain’t
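If the input text may contain either kind of apostrophe, one option is to fold both the dictionary and the text to a single form before substituting. The following is a minimal Python 3 sketch under that assumption; `APOSTROPHE_MAP` and `normalize_apostrophes` are hypothetical names introduced for illustration, not part of the original code.

```python
import re

# Hypothetical mapping: typographic apostrophes folded to the ASCII one (U+0027).
APOSTROPHE_MAP = {
    ord("\u2019"): "'",  # RIGHT SINGLE QUOTATION MARK
    ord("\u2018"): "'",  # LEFT SINGLE QUOTATION MARK
}


def normalize_apostrophes(s):
    """Return s with curly apostrophes replaced by the ASCII apostrophe."""
    return s.translate(APOSTROPHE_MAP)


def replace_all(text, d):
    """Replace whole words in lower-cased text, ignoring apostrophe style."""
    text = normalize_apostrophes(text.lower())
    for old, new in d.items():
        pattern = r"\b" + re.escape(normalize_apostrophes(old.lower())) + r"\b"
        replacement = normalize_apostrophes(new)
        # A function replacement avoids backslash processing in the new text.
        text = re.sub(pattern, lambda m: replacement, text)
    return text


d = {"aren\u2019t": "ain\u2019t", "hello": "hey"}
print(replace_all("Hello, they aren't here", d))
print(replace_all("they aren\u2019t here", d))
```

With this approach the output always uses the ASCII apostrophe, whichever form appeared in the input. If the typographic form should be preserved, the folding would have to be applied only while matching.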

1 Comment

Pim Over a year ago

Was able to fix my problems, which came from the text having different types of apostrophes.

Adam Rosenfield · Answer · 2011-02-23 22:52:44Z

0

This works fine for me in Python 2.6.4:

>>> re.sub(ur'\baren\u2019t\b', 'rep', u'aren\u2019t')
u'rep'

Make sure that your pattern string is a Unicode string; otherwise it might not work.
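For readers on Python 3, where the `ur''` prefix is no longer valid syntax and every `str` is Unicode, the same check might look like the sketch below. Building the pattern with `re.escape` is an addition here, not part of the original answer.

```python
import re

# In Python 3 every str is Unicode, so no u/ur prefix is needed.
pattern = r"\b" + re.escape("aren\u2019t") + r"\b"

# Text that uses the same typographic apostrophe (U+2019) is matched.
result = re.sub(pattern, "rep", "aren\u2019t")
print(result)

# The ASCII apostrophe (U+0027) is a different character, so this text is left unchanged.
unchanged = re.sub(pattern, "rep", "aren't")
print(unchanged)
```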

eos87 · Answer · 2011-02-23 22:53:18Z

0

Try saving your file with UTF-8 encoding.
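Once the file is saved as UTF-8, it also helps to decode it explicitly when reading it, so that the dictionary keys are Unicode strings. The sketch below assumes the `old=new` line format shown in the question; `load_replacements` is a hypothetical helper name, and a temporary file stands in for the real word list.

```python
import io
import os
import tempfile


def load_replacements(path):
    """Parse 'old=new' lines from a UTF-8 file into a dict of Unicode strings."""
    replacements = {}
    with io.open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or "=" not in line:
                continue  # skip blank or malformed lines
            old, new = line.split("=", 1)
            replacements[old.strip()] = new.strip()
    return replacements


# Demo: write a small UTF-8 file, load it, then clean up.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(u"aren\u2019t=ain\u2019t\nhello=hey\n".encode("utf-8"))
try:
    words = load_replacements(tmp.name)
finally:
    os.remove(tmp.name)

print(words == {u"aren\u2019t": u"ain\u2019t", u"hello": u"hey"})
```

`io.open` with an explicit `encoding` behaves the same way on Python 2 and Python 3, which is why it is used here instead of the built-in `open`.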

kjhughes · Answer · 2013-12-23 17:40:53Z

0

u"aren\u2019t" == u"aren't"

False

u"aren\u2019t" == u"aren’t"

True
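To see why the two strings compare unequal, it can help to print the code point and Unicode name of each apostrophe. This is a small Python 3 sketch; `code_points` is a hypothetical helper name.

```python
import unicodedata


def code_points(s):
    """Return a 'U+XXXX NAME' description for every character in s."""
    return ["U+%04X %s" % (ord(ch), unicodedata.name(ch, "UNKNOWN")) for ch in s]


print(code_points("'"))       # the ASCII apostrophe typed on most keyboards
print(code_points("\u2019"))  # the typographic apostrophe from the text file
```

The first call reports U+0027 APOSTROPHE and the second U+2019 RIGHT SINGLE QUOTATION MARK, which confirms that they are distinct characters.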

answered Feb 23, 2011 at 23:37 by intrepion; edited Dec 23, 2013 at 17:40 by kjhughes
