I'm writing a simple application that replaces certain words with other words. I'm running into problems with words that contain an apostrophe, such as aren't, ain't, and isn't.
I have a text file with the following contents:
aren’t=ain’t
hello=hey
I parse the text file and create a dictionary out of it:
u'aren\u2019t' = u'ain\u2019t'
u'hello' = u'hey'
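A parsing step like this could be sketched roughly as follows. This is only an illustration under assumptions: the file is taken to be UTF-8 with one `key=value` pair per line, and the names `parse_replacements` and `load_replacements` are hypothetical, not taken from the actual application.

```python
import io


def parse_replacements(lines):
    """Build a {word: replacement} dict from an iterable of 'key=value' lines."""
    replacements = {}
    for line in lines:
        line = line.strip()
        # Skip blank lines and lines without a separator (assumed to be noise).
        if not line or u"=" not in line:
            continue
        key, value = line.split(u"=", 1)
        replacements[key.strip()] = value.strip()
    return replacements


def load_replacements(path):
    """Read the file as UTF-8 (an assumption) so U+2019 is decoded correctly."""
    with io.open(path, encoding="utf-8") as handle:
        return parse_replacements(handle)


# In-memory sample mirroring the two lines of the text file above.
sample_lines = [u"aren\u2019t=ain\u2019t\n", u"hello=hey\n"]
parsed = parse_replacements(sample_lines)
```

Decoding the file explicitly is what produces the `\u2019` character in the keys: the apostrophe in the file is the typographic one, not the ASCII `'`.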
Then I try to replace all matching words in a given text:
import re

text = u"aren't"

def replace_all(text, dict):
    for i, k in dict.iteritems():
        # Replace every whole-word occurrence of i with k in the lower-cased text; regex = \bSTRING\b
        text = re.sub(r"\b" + i + r"\b", k, text.lower())
    return text
The problem is that re.sub() doesn't match u'aren\u2019t' against u"aren't".
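The mismatch can be reproduced in isolation with a small check like the one below. It is a sketch only; the variable names are illustrative. It compares the fifth character of each string, which is where the two apostrophes differ.

```python
import re

curly_key = u"aren\u2019t"   # key as parsed from the text file
straight_text = u"aren't"    # text typed with the ASCII apostrophe

# The strings differ in exactly one character, so they are not equal.
same_string = (curly_key == straight_text)

# The same whole-word pattern used in replace_all() finds nothing.
match = re.search(r"\b" + curly_key + r"\b", straight_text)

# Code points of the apostrophe in each string (index 4 in both).
codepoints = [hex(ord(ch)) for ch in (curly_key[4], straight_text[4])]
```

Here `same_string` is `False`, `match` is `None`, and `codepoints` holds `0x2019` for the dictionary key and `0x27` for the text.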
What can I do so that my replace_all() function matches both "hello" and "aren't" and replaces them with the appropriate text? Can I do something in Python so that my dictionary doesn't contain the Unicode apostrophe? Could I convert my text to use the Unicode character instead, or could I modify the regex so that it matches the Unicode character as well as all the other text?
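One of the options mentioned, converting everything to a single apostrophe form before matching, might look like the sketch below. It is not a definitive fix. The names `normalize_apostrophes` and `replace_all_normalized` are hypothetical, and the set of characters treated as apostrophes is an assumption that may need extending.

```python
import re

# Characters treated as apostrophes here (an assumption, not an exhaustive list):
# U+2019 right single quote, U+2018 left single quote, U+02BC modifier apostrophe.
APOSTROPHE_RE = re.compile(u"[\u2019\u2018\u02bc]")


def normalize_apostrophes(s):
    """Map typographic apostrophes to the ASCII apostrophe."""
    return APOSTROPHE_RE.sub(u"'", s)


def replace_all_normalized(text, replacements):
    """Whole-word replacement after normalizing both the text and the dictionary."""
    text = normalize_apostrophes(text.lower())
    for word, replacement in replacements.items():
        # re.escape guards against regex metacharacters in the dictionary keys.
        pattern = r"\b" + re.escape(normalize_apostrophes(word)) + r"\b"
        new_word = normalize_apostrophes(replacement)
        # A function replacement avoids backslash interpretation in new_word.
        text = re.sub(pattern, lambda m, w=new_word: w, text)
    return text


replacements = {u"aren\u2019t": u"ain\u2019t", u"hello": u"hey"}
result_straight = replace_all_normalized(u"Hello, they aren't here", replacements)
result_curly = replace_all_normalized(u"They aren\u2019t here", replacements)
```

With this approach both inputs are matched, and the output uses the ASCII apostrophe throughout. If the typographic apostrophe should be preserved in the output, the normalization could be applied to the pattern and text only, leaving the replacement values untouched.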