Python regex for matching bb code

Question

I'm writing a very simple BBCode parser. If I want to replace hello, i'm a [b]bold[/b] text, I have success replacing this regex

r'\[b\](.*)\[\/b\]'

with this

<strong>\g<1></strong>

to get hello, i'm a <strong>bold</strong> text.

If I have two or more tags of the same type, it fails. For example:

i'm [b]bold[/b] and i'm [b]bold[/b] too

gives

i'm <strong>bold[/b] and i'm [b]bold</strong> too

How can I solve this problem? Thanks.

I think you forgot to close the last [b] tag in your example. So your example string should be this one: "i'm [b]bold[/b] and i'm [b]bold too[/b]" ;)
– Andrea Zilio, Commented Jan 31, 2010 at 14:41
It's going to have to be very simple, since [b][i]this[/i][/b] use case will defeat it.
– Robert Rossney, Commented Jan 31, 2010 at 17:33

danben · Accepted Answer · 2010-01-31 14:40:13Z

7

You shouldn't use regular expressions to parse non-regular languages (like matching tags). Look into a parser instead.

Edit - a quick Google search takes me here.
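To give a rough idea of what "a parser" can mean here, below is a minimal sketch of a stack-based converter. It is not the resource linked above, and the `TAGS` table, the `TOKEN` pattern, and the `bbcode_to_html` name are all assumptions made up for this illustration. The regex is only used to find individual tags (tokenizing); the nesting is tracked with a stack.

```python
import re

# Hypothetical tag table for this sketch: BBCode name -> HTML element.
TAGS = {"b": "strong", "i": "em"}

# Matches a single opening or closing tag such as [b] or [/b].
TOKEN = re.compile(r"\[(/?)(\w+)\]")


def bbcode_to_html(text):
    out = []    # output fragments, joined at the end
    stack = []  # names of the BBCode tags that are currently open
    pos = 0
    for m in TOKEN.finditer(text):
        out.append(text[pos:m.start()])  # plain text before this tag
        pos = m.end()
        closing = m.group(1) == "/"
        name = m.group(2)
        if name not in TAGS:
            out.append(m.group(0))       # unknown tag: keep it literally
        elif not closing:
            stack.append(name)
            out.append("<%s>" % TAGS[name])
        elif stack and stack[-1] == name:
            stack.pop()
            out.append("</%s>" % TAGS[name])
        else:
            out.append(m.group(0))       # mismatched close: keep it literally
    out.append(text[pos:])
    while stack:                         # close anything left open
        out.append("</%s>" % TAGS[stack.pop()])
    return "".join(out)


print(bbcode_to_html("i'm [b]bold[/b] and i'm [b]bold[/b] too"))
print(bbcode_to_html("[b]outer [b]inner[/b][/b]"))
```

A real BBCode implementation would also need HTML escaping, tags with arguments such as [url=...], and a policy for badly nested input, none of which this sketch attempts.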

answered Jan 31, 2010 at 14:40

danben


2 Comments

Mike Over a year ago

I'm new to Python. I know this post was a long time ago, but why is it that a parser would be recommended over regex? How do the two process things differently? Thanks

danben Over a year ago

@Mike Hayes: This isn't specific to Python - it is language theory. One simple example of why you need a parser to parse something like matching tags is the string <b>I am nesting my <b>bold tags</b></b>. If you just match between pairs of <b> and </b>, you get the wrong text in this example. To learn more, you should read about the difference between regular languages (for which you can use regular expressions) and context-free languages (for which you need a parser).
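The nesting example from the comment above can be tried directly. This is just a small demonstration using Python's `re` module; the variable names are arbitrary.

```python
import re

text = "<b>I am nesting my <b>bold tags</b></b>"

# Non-greedy: stops at the first </b>, which belongs to the inner tag.
lazy = re.search(r"<b>(.*?)</b>", text).group(1)

# Greedy: runs to the last </b>. That happens to pair up here, but it
# breaks as soon as there are two sibling tags in the same string.
greedy = re.search(r"<b>(.*)</b>", text).group(1)

print(lazy)    # I am nesting my <b>bold tags
print(greedy)  # I am nesting my <b>bold tags</b>
```

Neither pattern counts how many tags are open, which is what matching nested pairs correctly would require.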

Andrea Zilio · Accepted Answer · 2010-01-31 14:46:29Z

5

Just change your regular expression from:

r'\[b\](.*)\[\/b\]'

to

r'\[b\](.*?)\[\/b\]'

The * qualifier is greedy; appending a ? to it makes it non-greedy, so it matches as little text as possible.

Here's a more complete explanation taken from the Python re documentation:

The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn’t desired; if the RE <.*> is matched against '<H1>title</H1>', it will match the entire string, and not just '<H1>'. Adding '?' after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using .*? in the previous expression will match only '<H1>'.

Source: http://docs.python.org/library/re.html
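As a quick check, here is the question's own example run through both patterns with `re.sub`. The `\/` escape from the original pattern is not needed in Python, so it is written as a plain `/` here; the variable names are just for illustration.

```python
import re

text = "i'm [b]bold[/b] and i'm [b]bold[/b] too"
replacement = r"<strong>\g<1></strong>"

# Greedy: (.*) swallows everything up to the LAST [/b].
greedy = re.sub(r"\[b\](.*)\[/b\]", replacement, text)

# Non-greedy: (.*?) stops at the FIRST [/b] after each [b].
lazy = re.sub(r"\[b\](.*?)\[/b\]", replacement, text)

print(greedy)  # i'm <strong>bold[/b] and i'm [b]bold</strong> too
print(lazy)    # i'm <strong>bold</strong> and i'm <strong>bold</strong> too
```

Note that `.` does not match a newline by default, so bold text spanning several lines would need the `re.DOTALL` flag.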

edited Jan 31, 2010 at 14:46

answered Jan 31, 2010 at 14:40

Andrea Zilio
