
I need to delete some Unicode symbols from the string 'بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ'

I know they exist here for sure. I tried:

re.sub('([\u064B-\u0652\u06D4\u0670\u0674\u06D5-\u06ED]+)', '', 'بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ')

but it doesn't work: the string stays the same. What am I doing wrong?


2 Answers

112

Are you using Python 2.x or 3.x?

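If you're using 2.x, make the pattern a unicode string with the 'u' prefix. In a plain 2.x byte string, \u064B is not an escape at all, so the character class ends up containing the literal characters backslash, u, 0, 6 and so on instead of the code points you meant. Since it's a regex, it's also good practice to make the pattern a raw string with the 'r' prefix. Finally, wrapping the entire pattern in parentheses is superfluous.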

re.sub(ur'[\u064B-\u0652\u06D4\u0670\u0674\u06D5-\u06ED]+', '', ...)

http://docs.python.org/tutorial/introduction.html#unicode-strings
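For Python 3, where every str is already Unicode and the ur prefix is a syntax error, the same fix can be sketched with a plain raw string. This is a minimal sketch, assuming Python 3.3 or later (where the re module itself understands \uXXXX escapes); the names DIACRITICS and strip_diacritics are hypothetical and chosen only for illustration.

```python
import re

# Character class copied from the question. With a raw string in Python 3,
# the re module (3.3+) interprets the \uXXXX escapes itself.
DIACRITICS = re.compile(r'[\u064B-\u0652\u06D4\u0670\u0674\u06D5-\u06ED]+')


def strip_diacritics(text):
    """Return text with every character matched by DIACRITICS removed."""
    return DIACRITICS.sub('', text)


text = 'بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ'
cleaned = strip_diacritics(text)
print(cleaned)
print(len(text), len(cleaned))
```

Compiling the pattern once is only a convenience for repeated use; a one-off re.sub call with the same raw-string pattern behaves the same way.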

Edit:

It's also good practice to use the re.UNICODE/re.U/(?u) flag for unicode regexes, but it only affects shorthand classes such as \w and the word boundary \b. This pattern uses neither, so the flag makes no difference here.
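The point about the flag can be checked directly. The sketch below assumes Python 3, where str patterns are Unicode-aware by default and re.ASCII switches that behaviour off; the sample strings are made up for illustration.

```python
import re

# An Arabic word with diacritics, written with escapes so each code point is visible.
marked = '\u0628\u0650\u0633\u0652\u0645\u0650'

# An explicit code point range matches the same characters with or without the flag.
explicit = r'[\u064B-\u0652]+'
without_flag = re.sub(explicit, '', marked)
with_flag = re.sub(explicit, '', marked, flags=re.UNICODE)
print(without_flag == with_flag)

# The shorthand class \w is where Unicode-aware and ASCII-only matching differ.
sample = '\u0628\u0633\u0645 abc'  # the same word without diacritics, then ASCII text
unicode_words = re.findall(r'\w+', sample)                # Unicode-aware matching
ascii_words = re.findall(r'\w+', sample, flags=re.ASCII)  # ASCII-only matching
print(unicode_words)
print(ascii_words)
```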


3 Comments

Hmm, did not know you could concatenate both u and r prefixes. That's pretty cool!
@BalthazarRouberol I get SyntaxError: invalid syntax in Python 3.6
You can't use the ur prefix in Python 3. Just use r.
80

Use unicode strings. Use the re.UNICODE flag.

>>> myre = re.compile(ur'[\u064B-\u0652\u06D4\u0670\u0674\u06D5-\u06ED]+', 
                      re.UNICODE)
>>> myre
<_sre.SRE_Pattern object at 0xb20b378>
>>> mystr = u'بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ'
>>> result = myre.sub('', mystr)
>>> len(mystr), len(result)
(38, 22)
>>> print result
بسم الله الرحمن الرحيم

Read Joel Spolsky's article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".

4 Comments

@nosklo, why don't curly braces that set a repetition count, such as {5}, work with Unicode characters? I'm having problems with them, yet + works fine. Do you have any idea? Thanks!
@securecurve I have no idea, and without my magic crystal ball there's no way to help. I just tested it, and it works fine for me. If it doesn't work for you, I suggest you ask a new question, providing your code and the result you're getting.
If you want to use re in Python, be aware that it doesn't support Unicode character properties (like \p{L}). The regex module at pypi.python.org/pypi/regex does.
The re.UNICODE flag is useless here, since it only affects the shorthand character classes \w, \d, \s.
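Since re has no \p{...} properties, one standard-library way to get a similar effect is to filter on the Unicode general category instead of a regex. This is a different technique from the one in the answers, sketched here under the assumption that "nonspacing mark" (category Mn) is the property you care about; strip_nonspacing_marks is a hypothetical helper name.

```python
import unicodedata


def strip_nonspacing_marks(text):
    """Drop every character whose Unicode general category is Mn (nonspacing mark)."""
    return ''.join(ch for ch in text if unicodedata.category(ch) != 'Mn')


# An Arabic word with kasra and sukun marks, written with escapes for clarity.
marked = '\u0628\u0650\u0633\u0652\u0645\u0650'
print(strip_nonspacing_marks(marked))
```

Note that this is not equivalent to the character class from the question: it removes nonspacing marks in every script, and it leaves alone any character in the question's ranges that is not category Mn, such as U+06D4.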
