2

I'm writing a simple application where I want to replace certain words with other words. I'm running into problems with words that use single quotes such as aren't, ain't, isn't.

I have a text file with the following

aren’t=ain’t
hello=hey

I parse the text file and create a dictionary out of it

u'aren\u2019t' = u'ain\u2019t'
u'hello' = u'hey'

Then I try to replace all the characters in a given text

text = u"aren't"

def replace_all(text, dict):
    for i, k in dict.iteritems():
        #replace all whole words of I with K in lower cased text, regex = \bSTRING\b
        text = re.sub(r"\b" + i + r"\b", k , text.lower())
    return text

The problem is that re.sub() doesn't match u'aren\u2019t' against u"aren't".

What can I do so that my replace_all() function matches both "hello" and "aren't" and replaces them with the appropriate text? Can I do something in Python so that my dictionary doesn't contain Unicode? Could I convert my text to use the Unicode character, or could I modify the regex to match the Unicode character as well as all the other text?

2
  • what output would you like to get? Commented Feb 23, 2011 at 22:52
  • The expected result is that the text "aren't" is replaced with "ain't". Commented Feb 24, 2011 at 15:21

4 Answers

3

I guess your problem is:

text = u"aren't"

instead of:

text = u"aren’t"

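(Note the different apostrophes: the first is the ASCII apostrophe ' (U+0027), the second is the typographic apostrophe ’ (U+2019), which is what your dictionary contains.)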

Here's your code modified to make it work:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re

d = {
    u'aren’t': u'ain’t',
    u'hello': u'hey'
    }
#text = u"aren't"
text = u"aren’t"


def replace_all(text, d):
    # Lower-case the text once, then replace each whole word; regex = \bWORD\b
    text = text.lower()
    for word, replacement in d.items():
        # re.escape keeps any regex metacharacters in the key literal
        text = re.sub(r"\b" + re.escape(word) + r"\b", replacement, text)
    return text

if __name__ == '__main__':
    newtext = replace_all(text, d)
    print(newtext)

Output:

ain’t
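
If the incoming text can contain either kind of apostrophe, one option is to normalize both the text and the dictionary before replacing. This is only a minimal sketch: it assumes U+2018 and U+2019 are the only typographic quotes in your data, and normalize_apostrophes and replace_all_normalized are names I made up for illustration.

```python
import re

def normalize_apostrophes(s):
    # Hypothetical helper: fold typographic single quotes to the ASCII "'".
    # Assumption: only U+2018 and U+2019 occur in the data.
    return s.replace(u"\u2019", u"'").replace(u"\u2018", u"'")

def replace_all_normalized(text, mapping):
    # Normalize and lower-case the text once, then replace whole words.
    text = normalize_apostrophes(text.lower())
    for word, replacement in mapping.items():
        word = normalize_apostrophes(word.lower())
        pattern = r"\b" + re.escape(word) + r"\b"
        text = re.sub(pattern, normalize_apostrophes(replacement), text)
    return text

mapping = {u"aren\u2019t": u"ain\u2019t", u"hello": u"hey"}
print(replace_all_normalized(u"Hello, they aren't here", mapping))
print(replace_all_normalized(u"they aren\u2019t here", mapping))
```

Note that the result then always uses the ASCII apostrophe, which may or may not be what you want in the output.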

1 Comment

I was able to fix my problem, which came from the text containing different types of apostrophes.
0

This works fine for me in Python 2.6.4:

>>> re.sub(ur'\baren\u2019t\b', 'rep', u'aren\u2019t')
u'rep'

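Make sure that your pattern string is a Unicode string (note the ur prefix above). In a Python 2 byte string, \u2019 is not treated as a Unicode escape, so the pattern would not match the apostrophe.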
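
If you would rather have the regex accept either apostrophe, the pattern can be built with a character class in place of the apostrophe. A rough sketch, assuming the ASCII apostrophe and U+2019 are the only two forms you need; apostrophe_tolerant_pattern is a hypothetical helper, not something from the re module.

```python
import re

APOSTROPHES = u"'\u2019"  # ASCII apostrophe and U+2019 (assumed to be the only forms)

def apostrophe_tolerant_pattern(word):
    # Split the word on either apostrophe, escape the pieces, and rejoin them
    # with a character class that matches both apostrophe forms.
    char_class = u"[" + APOSTROPHES + u"]"
    pieces = re.split(char_class, word)
    body = char_class.join(re.escape(piece) for piece in pieces)
    return re.compile(r"\b" + body + r"\b", re.UNICODE | re.IGNORECASE)

pattern = apostrophe_tolerant_pattern(u"aren\u2019t")
print(pattern.sub(u"ain't", u"They aren't here"))
print(pattern.sub(u"ain't", u"They AREN\u2019T here"))
```

Both calls should replace the word, whichever apostrophe the input uses. The replacement string here uses the ASCII apostrophe only to keep the printed output plain.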

0

Try saving your file with UTF-8 encoding.
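
Once the file is saved as UTF-8, it also needs to be decoded as UTF-8 when it is read. A minimal sketch, assuming the key=value format from the question; load_mapping and the file name words.txt are made up for the example.

```python
import io
import os
import tempfile

def load_mapping(path):
    # Decode the file explicitly as UTF-8 so U+2019 arrives as one character.
    mapping = {}
    with io.open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line or "=" not in line:
                continue  # skip blank or malformed lines
            key, value = line.split("=", 1)
            mapping[key.strip().lower()] = value.strip()
    return mapping

# Demo with a throwaway file; the name and contents are illustrative only.
path = os.path.join(tempfile.mkdtemp(), "words.txt")
with io.open(path, "w", encoding="utf-8") as handle:
    handle.write(u"aren\u2019t=ain\u2019t\nhello=hey\n")

mapping = load_mapping(path)
print(sorted(mapping) == [u"aren\u2019t", u"hello"])
```

This only makes sure the dictionary holds the characters that are really in the file; it does not change which apostrophe the text to be searched uses.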

0

>>> u"aren\u2019t" == u"aren't"
False
>>> u"aren\u2019t" == u"aren’t"
True
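
To see why the two strings compare unequal, it can help to print the code point and Unicode name of each apostrophe. A small sketch; describe is a hypothetical helper name.

```python
import unicodedata

def describe(s):
    # Return (code point, Unicode name) for every non-alphanumeric character.
    return [("U+%04X" % ord(ch), unicodedata.name(ch, "UNKNOWN"))
            for ch in s if not ch.isalnum()]

print(describe(u"aren't"))        # [('U+0027', 'APOSTROPHE')]
print(describe(u"aren\u2019t"))   # [('U+2019', 'RIGHT SINGLE QUOTATION MARK')]
```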
