2

I'm writing a very simple BBCode parser. If I want to convert hello, i'm a [b]bold[/b] text, I can do it by replacing this regex

r'\[b\](.*)\[\/b\]'

with this

<strong>\g<1></strong>

to get hello, i'm a <strong>bold</strong> text.

If I have two or more tags of the same type, it fails. eg:

i'm [b]bold[/b] and i'm [b]bold[/b] too

gives

i'm <strong>bold[/b] and i'm [b]bold</strong> too

How can I solve this problem? Thanks

3
  • I think you forgot to close the last [b] tag in your example. So your example string should be this one: "i'm [b]bold[/b] and i'm [b]bold too[/b]" ;) Commented Jan 31, 2010 at 14:41
  • It's going to have to be very simple, since [b][i]this[/i][/b] use case will defeat it. Commented Jan 31, 2010 at 17:33
  • I corrected that missing [/b] tag. Commented Jan 31, 2010 at 21:46

2 Answers

7

You shouldn't use regular expressions to parse non-regular languages (like matching tags). Look into a parser instead.

Edit - a quick Google search takes me here.
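To illustrate what "a parser" can mean at this scale, here is a minimal sketch, not a full BBCode implementation. It uses a regex only to find the tags and a stack to track which ones are open, so nested tags of the same type pair up correctly. The TAGS table and the bbcode_to_html name are assumptions made for this example, and the sketch does not HTML-escape the surrounding text.

```python
import re

# Hypothetical tag table for this sketch: BBCode name -> HTML element.
TAGS = {"b": "strong", "i": "em"}

# Matches an opening tag like [b] or a closing tag like [/b].
TOKEN = re.compile(r'\[(/?)(\w+)\]')

def bbcode_to_html(text):
    out = []
    stack = []  # currently open tags, innermost last
    pos = 0
    for m in TOKEN.finditer(text):
        out.append(text[pos:m.start()])  # plain text before this tag
        pos = m.end()
        closing = m.group(1) == "/"
        name = m.group(2)
        if name not in TAGS:
            out.append(m.group(0))  # unknown tag: leave as text
        elif not closing:
            stack.append(name)
            out.append("<%s>" % TAGS[name])
        elif stack and stack[-1] == name:
            stack.pop()
            out.append("</%s>" % TAGS[name])
        else:
            out.append(m.group(0))  # stray closing tag: leave as text
    out.append(text[pos:])
    # Close anything still open so the output stays balanced.
    while stack:
        out.append("</%s>" % TAGS[stack.pop()])
    return "".join(out)

print(bbcode_to_html("[b]I am nesting my [b]bold tags[/b][/b]"))
print(bbcode_to_html("[b][i]this[/i][/b]"))
```

The stack is what a single regex substitution lacks: each closing tag is matched against the innermost open tag rather than the nearest or the farthest one in the string.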


2 Comments

I'm new to Python. I know this post was a long time ago, but why is it that a parser would be recommended over regex? How do the two process things differently? Thanks
@Mike Hayes: This isn't specific to Python - it is language theory. One simple example of why you need a parser to parse something like matching tags is the string <b>I am nesting my <b>bold tags</b></b>. If you just match between pairs of <b> and </b>, you get the wrong text in this example. To learn more, you should read about the difference between regular languages (for which you can use regular expressions) and context-free languages (for which you need a parser).
5

Just change your regular expression from:

r'\[b\](.*)\[\/b\]'

to

r'\[b\](.*?)\[\/b\]'

The * qualifier is greedy: it matches as much text as it can, so it runs to the last [/b] in the string. Appending a ? makes it non-greedy, so it stops at the first [/b] that follows.

Here's a more complete explanation, taken from the Python re documentation:

The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn’t desired; if the RE <.*> is matched against '<H1>title</H1>', it will match the entire string, and not just '<H1>'. Adding '?' after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using .*? in the previous expression will match only '<H1>'.

Source: http://docs.python.org/library/re.html
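As a quick check, the sketch below runs both patterns against the string from the question with re.sub. The variable names are only for illustration.

```python
import re

text = "i'm [b]bold[/b] and i'm [b]bold[/b] too"
replacement = r'<strong>\g<1></strong>'

# Greedy: .* runs to the last [/b] in the string.
greedy = re.sub(r'\[b\](.*)\[\/b\]', replacement, text)

# Non-greedy: .*? stops at the first [/b] after each [b].
lazy = re.sub(r'\[b\](.*?)\[\/b\]', replacement, text)

print(greedy)  # i'm <strong>bold[/b] and i'm [b]bold</strong> too
print(lazy)    # i'm <strong>bold</strong> and i'm <strong>bold</strong> too
```

Note that . does not match a newline by default, so a tag whose content spans several lines needs the re.DOTALL flag.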
