I'm trying to match a specific substring within a string using a regular expression, like matching "\ue04a" in "\ue04a abc". But something seems to be wrong. Here's my code:

m = re.match('\\([ue]+\d+[a-z]+)', "\ue04a abc")

The returned m is an empty object, even when I tried using three backslashes in the pattern. What's wrong?

  • "\ue04a abc" == " abc" -- there's no actual backslash in your string Commented Jun 12, 2018 at 4:46
  • @Rakesh Thanks, but it doesn't work. Commented Jun 12, 2018 at 4:49

2 Answers

Backslashes in regular expressions in Python are extremely tricky. With regular strings (single or triple quotes) there are two passes of backslash interpretation: first, Python itself interprets backslashes (so "\t" represents a single character, a literal tab) and then the result is passed to the regular expression engine, which has its own semantics for any remaining backslashes.

Generally, using r"\t" is strongly recommended, because this removes the Python string parsing aspect. This string, with the r prefix, undergoes no interpretation by Python -- every character in the string simply represents itself, including backslash. So this particular example represents a string of length two, containing the literal characters backslash \ and t.
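
A quick way to convince yourself of the difference is to compare lengths. This is only a minimal sketch; the names `plain` and `raw` are made up for illustration:

```python
import re

plain = "\t"   # Python itself turns this into one character: a tab
raw = r"\t"    # raw string: two characters, backslash and t

print(len(plain))  # 1
print(len(raw))    # 2

# The regex engine reads the two-character pattern \t as "match a tab",
# so both patterns end up matching a literal tab character.
print(re.match(plain, "\tx") is not None)  # True
print(re.match(raw, "\tx") is not None)    # True
```

The two patterns behave the same here only because the regex engine happens to understand \t as well; for escapes like \d or \b, which Python string literals treat differently or not at all, the raw form is the safe one.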

It's not clear from your question whether the target string "\ue04a abc" should be interpreted as a string of length five containing the Unicode character U+E04A (which is in the Private Use Area, aka PUA, meaning it doesn't have any specific standard use) followed by space, a, b, c -- in which case you would use something like

m = re.match(r'[\ue000-\uf8ff]', "\ue04a abc")

to capture any single code point in the traditional Basic Multilingual Plane PUA, or whether you want to match a literal string that begins with the two characters backslash \ and u, followed by four hex digits:

m = re.match(r'\\u[0-9a-fA-F]{4}', r"\ue04a abc")

where the former is how Python (and hence most Python programmers) would understand your question, but both interpretations are plausible.
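
To make the two readings concrete, here is a small side-by-side sketch. The variable names (`as_unicode`, `as_literal`, `pua`, `escape`) are my own and not from the question:

```python
import re

as_unicode = "\ue04a abc"    # 5 characters: U+E04A, space, a, b, c
as_literal = r"\ue04a abc"   # 10 characters: \, u, e, 0, 4, a, space, a, b, c

print(len(as_unicode))  # 5
print(len(as_literal))  # 10

pua = re.compile(r'[\ue000-\uf8ff]')        # one BMP private-use code point
escape = re.compile(r'\\u[0-9a-fA-F]{4}')   # literal backslash, u, four hex digits

# Each pattern matches only the string it was written for.
print(pua.match(as_unicode) is not None)     # True
print(pua.match(as_literal) is not None)     # False
print(escape.match(as_literal).group())      # \ue04a
print(escape.match(as_unicode) is not None)  # False
```

Note that the first pattern relies on the re module understanding \uXXXX escapes inside a raw string pattern, which it does for str patterns in Python 3.3 and later.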

The above show how to match the "mystery sequence" "\ue04a"; it should not then be hard to extend the code to match a longer string containing this sequence.
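
As one possible extension, assuming the Unicode interpretation, re.search finds the character anywhere in a longer string, and extra groups can capture what follows it. The sample text and group layout here are hypothetical:

```python
import re

text = "abc \ue04a def"   # hypothetical longer input

# re.search scans the whole string; re.match only looks at the start.
m = re.search(r'[\ue000-\uf8ff]', text)
if m:
    print(m.start())            # 4
    print(hex(ord(m.group())))  # 0xe04a

# Capture the private-use character and the word after it.
m2 = re.match(r'([\ue000-\uf8ff]) ([a-z]+)', "\ue04a abc")
if m2:
    print(m2.group(2))          # abc
```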

1 Comment

I see the difference. The substring should be interpreted as a unicode substring. Thank you!

This should help.

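import re

# Raw strings on both sides: the pattern matches a literal backslash,
# then "ue", one or more digits, and one or more lowercase letters.
m = re.match(r'(\\ue\d+[a-z]+)', r"\ue04a abc")
if m:
    print(m.group())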

Output:

\ue04a

2 Comments

The m is still empty. I'm using Python 3.6.
Add r before "\ue04a abc"
