Regular expression find and replace multiple

Question

I am trying to write a regular expression that will match all cases of

[[any text or char here]]

in a series of text.

Eg:

My name is [[Sean]]
There is a [[new and cool]] thing here.

This all works fine using my regex.

import re

data = "this is my test string [[ that does some matching ]] then returns."
p = re.compile(r"\[\[(.*)\]\]")
data = p.sub('STAR', data)

The problem arises when the match occurs more than once in the string, e.g. [[hello]] and [[bye]].

Eg:

data = "this is my new string it contains [[hello]] and [[bye]] and nothing else"
p = re.compile(r"\[\[(.*)\]\]")
data = p.sub('STAR', data)

This matches everything from the opening brackets of [[hello]] to the closing brackets of [[bye]] as a single match. I want each of them replaced separately.

You should include your programming language in the tags of your question so that people can help you better.
– Edwin Dalorzo, Commented Oct 31, 2012 at 12:07

Tim Pietzcker · Accepted Answer · 2012-10-31 12:23:06Z

3

.* is greedy and matches as much text as it can, including ]] and [[, so it plows on through your "tag" boundaries.

A quick solution is to make the star lazy by adding a ?:

p = re.compile(r"\[\[(.*?)\]\]")
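As a minimal sketch of the difference, the snippet below runs the greedy and the lazy pattern on the string from the question. The variable names are illustrative only.

```python
import re

data = "this is my new string it contains [[hello]] and [[bye]] and nothing else"

greedy = re.compile(r"\[\[(.*)\]\]")   # .* grabs as much as it can
lazy = re.compile(r"\[\[(.*?)\]\]")    # .*? stops at the first ]] it can reach

# Greedy: a single match running from the first [[ to the last ]].
greedy_result = greedy.sub("STAR", data)
# Lazy: each [[...]] pair is matched and replaced on its own.
lazy_result = lazy.sub("STAR", data)

print(greedy_result)
print(lazy_result)
```

With the greedy pattern only one STAR appears, because both tags and the text between them are consumed by one match; the lazy pattern yields one STAR per tag.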

A better (more robust and explicit but slightly slower) solution is to make it clear that we cannot match across tag boundaries:

p = re.compile(r"\[\[((?:(?!\]\]).)*)\]\]")

Explanation:

\[\[        # Match [[
(           # Match and capture...
 (?:        # ...the following regex:
  (?!\]\])  # only if we're not at the start of the sequence ]]
  .         # any character
 )*         # Repeat any number of times
)           # End of capturing group
\]\]        # Match ]]
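The commented breakdown above can be kept in the code itself by compiling with re.VERBOSE, which ignores whitespace and # comments inside the pattern. This is only a sketch; the name tag and the sample sentence are made up for illustration.

```python
import re

tag = re.compile(
    r"""
    \[\[            # Match [[
    (               # Match and capture...
      (?:           # ...the following group:
        (?!\]\])    # only if we are not at the start of ]]
        .           # any character (except a newline, by default)
      )*            # repeated any number of times
    )               # End of capturing group
    \]\]            # Match ]]
    """,
    re.VERBOSE,
)

data = "My name is [[Sean]]. There is a [[new and cool]] thing here."

captured = tag.findall(data)      # contents of each [[...]] pair
replaced = tag.sub("STAR", data)  # every pair replaced separately

print(captured)
print(replaced)
```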

edited Oct 31, 2012 at 12:23

answered Oct 31, 2012 at 12:07

Tim Pietzcker


5 Comments

Billy Moon Over a year ago

n.b. the second method proposed will slow down the regex considerably

Tim Pietzcker Over a year ago

@BillyMoon: I just timeit.timeit()ed it. There's not much of a difference (3.8 µs vs. 4.2 µs, about 10 %).

Mark Over a year ago

Thanks so much. I don't use regex that much, so the first example is a lot more readable to me and therefore easier to maintain.

Billy Moon Over a year ago

I am surprised, because look-around assertions always add quite an overhead as the pointer must traverse the whole string to check it

Tim Pietzcker Over a year ago

@BillyMoon: In this regex, it only needs to "look ahead" at the next two characters (because it starts at the current position in the string, so no additional string traversal required). That's not so bad.
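The timing comparison discussed in these comments could be reproduced roughly as follows. This is a sketch, not the exact benchmark that was run: the helper time_pattern and the repetition count are assumptions, and absolute numbers will differ by machine and Python version.

```python
import re
import timeit

data = "this is my new string it contains [[hello]] and [[bye]] and nothing else"

patterns = {
    "lazy": re.compile(r"\[\[(.*?)\]\]"),
    "lookahead": re.compile(r"\[\[((?:(?!\]\]).)*)\]\]"),
}

def time_pattern(pattern, number=100_000):
    """Return the average duration of one sub() call in microseconds."""
    total = timeit.timeit(lambda: pattern.sub("STAR", data), number=number)
    return total / number * 1e6

# Hypothetical helper above; timings vary from run to run.
timings = {name: time_pattern(p) for name, p in patterns.items()}
for name, usec in timings.items():
    print(f"{name}: {usec:.2f} microseconds per call")
```

Only the relative difference between the two patterns is meaningful here, and it is worth repeating the measurement on input that resembles the real data.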

Billy Moon · Answer · 2012-10-31 12:07:49Z

2

Use ungreedy (lazy) matching: .*? A ? placed after a + or * makes it match as few characters as possible. The default is to be greedy and consume as many characters as possible.

p = re.compile(r"\[\[(.*?)\]\]")

answered Oct 31, 2012 at 12:07

Billy Moon


jvallver · Answer · 2012-10-31 12:19:20Z

1

You can use this:

p = re.compile(r"\[\[[^\]]+\]\]")

>>> data = "this is my new string it contains [[hello]] and [[bye]] and nothing else"
>>> p = re.compile(r"\[\[[^\]]+\]\]")
>>> data = p.sub('STAR', data)
>>> data
'this is my new string it contains STAR and STAR and nothing else'

edited Oct 31, 2012 at 12:19

answered Oct 31, 2012 at 12:12

jvallver


1 Comment

Tim Pietzcker Over a year ago

Good idea, and measurably the fastest (3.2 µs on my machine vs. 3.8 µs for .*?). The only drawback is that single closing brackets cannot be part of the match (within double brackets), but that sounds like a reasonable tradeoff.
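The drawback mentioned in this comment can be seen with a small, made-up input in which a single ] appears inside the double brackets. The sample string and variable names are illustrative assumptions.

```python
import re

negated = re.compile(r"\[\[[^\]]+\]\]")  # content may not contain any ]
lazy = re.compile(r"\[\[(.*?)\]\]")      # content may contain a single ]

sample = "keep [[a]b]] please"

# The negated class stops at the first ], which is not followed by
# another ], so the pattern fails and the text is left untouched.
negated_result = negated.sub("STAR", sample)
# The lazy pattern keeps extending until it finds a real ]] pair.
lazy_result = lazy.sub("STAR", sample)

print(negated_result)
print(lazy_result)
```

If tag contents can never contain a ], the negated character class is a reasonable choice; otherwise one of the other two patterns is safer.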
