Regular expression with backslash in Python3

Question

I'm trying to match a specific substring in a string with a regular expression, for example matching "\ue04a" in "\ue04a abc". But something seems to be wrong. Here's my code:

m = re.match('\\([ue]+\d+[a-z]+)', "\ue04a abc")

The returned m is empty, even when I tried using three backslashes in the pattern. What's wrong?

"\ue04a abc" == " abc" -- there's no actual backslash in your string — anthony sottile
– anthony sottile, Commented Jun 12, 2018 at 4:46

tripleee · Accepted Answer · 2018-06-12 05:10:56Z

Backslashes in regular expressions in Python are extremely tricky. With regular strings (single or triple quotes) there are two passes of backslash interpretation: first, Python itself interprets backslashes (so "\t" represents a single character, a literal tab) and then the result is passed to the regular expression engine, which has its own semantics for any remaining backslashes.

Generally, using r"\t" is strongly recommended, because this removes the Python string parsing aspect. This string, with the r prefix, undergoes no interpretation by Python -- every character in the string simply represents itself, including backslash. So this particular example represents a string of length two, containing the literal characters backslash \ and t.
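As a minimal sketch of the two passes described above (the variable names `plain` and `raw` are only illustrative), the following compares a regular string with a raw string and then shows what the regular expression engine does with each pattern:

```python
import re

plain = "\t"   # Python turns the escape into one tab character
raw = r"\t"    # raw string: backslash and 't' stay as two characters

print(len(plain), len(raw))  # 1 2

# The regex engine applies its own escape rules to the raw pattern,
# so the two-character pattern \t still matches a real tab.
print(re.match(raw, plain) is not None)  # True

# To match a literal backslash followed by 't', the regex itself needs
# an escaped backslash, written r"\\t" as a raw string.
print(re.match(r"\\t", raw) is not None)  # True
```

With the r prefix, the only escaping left to think about is the regular expression engine's own.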

It's not clear from your question whether the target string "\ue04a abc" should be interpreted as a string of length five containing the Unicode character U+E04A (which is in the Private Use Area, aka PUA, meaning it doesn't have any specific standard use) followed by space, a, b, c -- in which case you would use something like

m = re.match(r'[\ue000-\uf8ff]', "\ue04a abc")

to capture any single code point in the traditional Basic Multilingual Plane PUA -- or whether you want to match a literal string which begins with the two characters backslash \ and u, followed by four hex digits:

m = re.match(r'\\u[0-9a-fA-F]{4}', r"\ue04a abc")

where the former is how Python (and hence most Python programmers) would understand your question, but both interpretations are plausible.
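To make the two interpretations concrete, here is a small sketch that runs both patterns against both kinds of target string. The names `as_unicode`, `as_literal`, `pua` and `escaped` are introduced here for illustration only:

```python
import re

as_unicode = "\ue04a abc"    # 5 characters: U+E04A, space, a, b, c
as_literal = r"\ue04a abc"   # 10 characters, starting with a real backslash

pua = re.compile(r'[\ue000-\uf8ff]')        # one code point in the BMP PUA
escaped = re.compile(r'\\u[0-9a-fA-F]{4}')  # backslash, 'u', four hex digits

print(len(as_unicode), len(as_literal))  # 5 10

# Each pattern matches only the string written for its interpretation.
print(pua.match(as_unicode) is not None, pua.match(as_literal) is not None)          # True False
print(escaped.match(as_unicode) is not None, escaped.match(as_literal) is not None)  # False True
```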

The above shows how to match the "mystery sequence" "\ue04a"; it should not then be hard to extend the code to match a longer string containing this sequence.

I see the difference. The substring should be interpreted as a unicode substring. Thank you!

Rakesh · Accepted Answer · 2018-06-12 04:56:09Z


This should help.

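import re
# Raw strings on both sides: the pattern matches a literal backslash, and the
# target string actually contains one.
m = re.match(r'(\\ue\d+[a-z]+)', r"\ue04a abc")
if m:
    print(m.group())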

Output:

\ue04a

edited Jun 12, 2018 at 4:56

answered Jun 12, 2018 at 4:48


2 Comments

Yujian Over a year ago

The m is still empty. I'm using Python 3.6.

Rakesh Over a year ago

Add r before "\ue04a abc", i.e. write the target as r"\ue04a abc".
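A short sketch of why the r prefix on the target string matters for this answer's pattern (the names `pattern`, `no_prefix` and `with_prefix` are illustrative): without the prefix, Python has already replaced the escape with the single character U+E04A, so there is no backslash left for the pattern to match.

```python
import re

pattern = r'(\\ue\d+[a-z]+)'  # literal backslash, "ue", digits, then letters

# Without r, the target starts with the character U+E04A, not a backslash.
no_prefix = re.match(pattern, "\ue04a abc")
# With r, the target starts with the six characters \ u e 0 4 a.
with_prefix = re.match(pattern, r"\ue04a abc")

print(no_prefix)             # None
print(with_prefix.group(1))  # \ue04a
```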
