Python and regular expression with Unicode

Question

I need to delete some Unicode symbols from the string 'بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ'

I know the string contains them. I tried:

re.sub('([\u064B-\u0652\u06D4\u0670\u0674\u06D5-\u06ED]+)', '', 'بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ')

but it doesn't work: the string stays the same. What am I doing wrong?

ʞɔıu · Accepted Answer · 2008-12-26 16:03:26Z

112

Are you using Python 2.x or 3.0?

If you're using 2.x, try making the regex a Unicode string with the 'u' prefix, so that the \uXXXX escapes are actually interpreted. Since it's a regex, it's also good practice to make it a raw string with the 'r' prefix. Also, putting your entire pattern in parentheses is superfluous.

re.sub(ur'[\u064B-\u0652\u06D4\u0670\u0674\u06D5-\u06ED]+', '', ...)

http://docs.python.org/tutorial/introduction.html#unicode-strings
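For readers on Python 3, here is a minimal sketch of the same substitution. It assumes Python 3.3 or later, where str literals are already Unicode and the re module itself understands \uXXXX escapes, so a plain r prefix is enough. The names DIACRITICS and strip_diacritics are illustrative only and do not come from the answer above.

```python
import re

# Same character class as in the question, without the superfluous group.
# Assumption: Python 3.3+, where re parses \uXXXX escapes inside a raw string.
DIACRITICS = re.compile(r'[\u064B-\u0652\u06D4\u0670\u0674\u06D5-\u06ED]+')


def strip_diacritics(text):
    """Return text with every character matched by DIACRITICS removed."""
    return DIACRITICS.sub('', text)


# The first word of the question's phrase, written with escapes so the
# code points are visible: ba, kasra, seen, sukun, meem, kasra.
word = '\u0628\u0650\u0633\u0652\u0645\u0650'
stripped = strip_diacritics(word)
print(len(word), len(stripped))
```

Only the three base letters should remain after the call, since kasra (U+0650) and sukun (U+0652) both fall inside the first range of the class.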

Edit:

It's also good practice to use the re.UNICODE/re.U/(?u) flag for unicode regexes, but it only affects shorthands such as \w or \b. This pattern uses none of them, so the flag would not change its behavior.
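To make the point about the flag concrete, here is a small sketch, assuming Python 3. There the default is the reverse of 2.x: str patterns are Unicode-aware unless re.ASCII is passed. So the example toggles re.ASCII rather than re.UNICODE. The sample text and variable names are made up for illustration.

```python
import re

# Three Latin letters, a space, then three Arabic letters (ba, seen, meem).
text = 'abc \u0628\u0633\u0645'

# \w is a shorthand class, so its meaning depends on the flag.
unicode_words = re.findall(r'\w+', text)                # Unicode-aware (Python 3 default)
ascii_words = re.findall(r'\w+', text, flags=re.ASCII)  # \w limited to [a-zA-Z0-9_]

# An explicit code point range does not depend on the flag at all.
explicit_default = re.findall(r'[\u0621-\u064A]+', text)
explicit_ascii = re.findall(r'[\u0621-\u064A]+', text, flags=re.ASCII)

print(unicode_words, ascii_words)
print(explicit_default == explicit_ascii)
```

The \w results differ between the two calls, while the explicit range gives the same matches either way, which is why the flag makes no difference to the pattern in the question.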

edited Dec 26, 2008 at 16:03

answered Dec 26, 2008 at 14:57

3 Comments

Balthazar Rouberol Over a year ago

Hmm, did not know you could concatenate both u and r prefixes. That's pretty cool!

Umair Ayub Over a year ago

@BalthazarRouberol I get SyntaxError: invalid syntax in Python 3.6

Mansour.M Over a year ago

You can't use ur in python 3. Just use r.

nosklo · Accepted Answer · 2008-12-26 15:55:11Z

80

Use unicode strings. Use the re.UNICODE flag.

>>> myre = re.compile(ur'[\u064B-\u0652\u06D4\u0670\u0674\u06D5-\u06ED]+', 
                      re.UNICODE)
>>> myre
<_sre.SRE_Pattern object at 0xb20b378>
>>> mystr = u'بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ'
>>> result = myre.sub('', mystr)
>>> len(mystr), len(result)
(38, 22)
>>> print result
بسم الله الرحمن الرحيم

Read the article by Joel Spolsky called The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

answered Dec 26, 2008 at 15:55

4 Comments

securecurve Over a year ago

@nosklo, why don't the curly braces that set the number of characters, such as {5}, work with Unicode characters? I'm having problems with them, yet + works fine. Do you have any idea? Thanks!

nosklo Over a year ago

@securecurve I have no idea, and without my magic crystal ball there's no way to help. I just tested it, and it works fine for me. If it doesn't work for you, I suggest you ask a new question, providing your code and the result you're getting.

noisy Over a year ago

If you use re in Python, be aware that it doesn't support Unicode character properties (like \p{L}). pypi.python.org/pypi/regex does.

nhahtdh Over a year ago

re.UNICODE flag is useless here, since it only affects shorthand character classes \w, \d, \s.
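Following up on the comment about \p{...} properties: if installing the third-party regex module is not an option, one possible workaround is to filter by Unicode general category with the standard library's unicodedata module. This is only a sketch, and it is not equivalent to the character class in the question: it removes every nonspacing mark (category Mn) in any script, and it leaves alone the characters in the question's ranges that are not nonspacing marks. The function name strip_nonspacing_marks is hypothetical.

```python
import unicodedata


def strip_nonspacing_marks(text):
    """Drop every character whose Unicode general category is Mn."""
    return ''.join(ch for ch in text if unicodedata.category(ch) != 'Mn')


# ba, kasra, seen, sukun, meem, kasra: the kasra and sukun are Mn.
word = '\u0628\u0650\u0633\u0652\u0645\u0650'
print(len(word), len(strip_nonspacing_marks(word)))
```

Whether this broader filter is appropriate depends on the text. For a fixed, known set of code points, the explicit character class from the answers is more precise.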
