0

I am trying to write a regular expression that will match all cases of

[[any text or char her]]

in a series of text.

Eg:

My name is [[Sean]]
There is a [[new and cool]] thing here.

This all works fine using my regex.

data = "this is my tes string [[ that does some matching ]] then returns."
p = re.compile("\[\[(.*)\]\]")
data = p.sub('STAR', data)

The problem is when I have multiple instances of the match occuring :[[hello]] and [[bye]]

Eg:

data = "this is my new string it contains [[hello]] and [[bye]] and nothing else"
p = re.compile("\[\[(.*)\]\]")
data = p.sub('STAR', data)

This will match the opening bracket of hello and the closing bracket of bye. I want it to replace them both.

1
  • 2
    You should inlude your programming language in the tags of your question so that people can help you better. Commented Oct 31, 2012 at 12:07

3 Answers 3

3

.* is greedy and matches as much text as it can, including ]] and [[, so it plows on through your "tag" boundaries.

A quick solution is to make the star lazy by adding a ?:

p = re.compile(r"\[\[(.*?)\]\]")

A better (more robust and explicit but slightly slower) solution is to make it clear that we cannot match across tag boundaries:

p = re.compile(r"\[\[((?:(?!\]\]).)*)\]\]")

Explanation:

\[\[        # Match [[
(           # Match and capture...
 (?:        # ...the following regex:
  (?!\]\])  # (only if we're not at the start of the sequence ]]
  .         # any character
 )*         # Repeat any number of times
)           # End of capturing group
\]\]        # Match ]]
Sign up to request clarification or add additional context in comments.

5 Comments

n.b. the second method proposed will slow down the regex considerably
@BillyMoon: I just timeit.timeit()ed it. There's not much of a difference (3.8 µs vs. 4.2 µs, about 10 %).
thanks so much. I dont use regex that much, so the first example is a lot more readable to me - therefore easier to maintain.
I am surprised, because look-around assertions always add quite an overhead as the pointer must traverse the whole string to check it
@BillyMoon: In this regex, it only needs to "look ahead" at the next two characters (because it starts at the current position in the string, so no additional string traversal required). That's not so bad.
2

Use ungreedy matching .*? <~~ the ? after a + or * makes it match as few characters as possible. The default is to be greedy, and consume as many characters as possible.

p = re.compile("\[\[(.*?)\]\]")

Comments

1

You can use this:

p = re.compile(r"\[\[[^\]]+\]\]")

>>> data = "this is my new string it contains [[hello]] and [[bye]] and nothing else"
>>> p = re.compile(r"\[\[[^\]]+\]\]")
>>> data = p.sub('STAR', data)
>>> data
'this is my new string it contains STAR and STAR and nothing else'

1 Comment

Good idea, and measurably the fastest (3.2 µs on my machine vs. 3.8 µs for .*?). The only drawback is that single closing brackets cannot be part of the match (within double brackets), but that sounds like a reasonable tradeoff.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.