Example:

a = "bzzzzzz <!-- blabla --> blibli * bloblo * blublu"

I want to capture the first comment. A comment may be either

(<!-- .* -->) or (\* .* \*)

This works:

re.search("<!--(?P<comment> .* )-->",a).group(1)

So does this:

re.search("\*(?P<comment> .* )\*",a).group(1)

But if I want either form to be captured as the comment, I tried something like:

re.search("(<!--(?P<comment> .* )-->|\*(?P<comment> .* )\*)",a).group(1)

But it doesn't work.

Thanks

1
  • BTW, your regexes are greedy and would fail on something like <!-- first comment --> real material <!-- second comment -->. Commented Sep 23, 2011 at 15:45
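
To make the greediness point concrete, here is a small sketch on the string from the comment above. The only change between the two patterns is .* versus .*?; the variable names are just for illustration.

```python
import re

text = "<!-- first comment --> real material <!-- second comment -->"

# Greedy: .* runs as far as it can, so the match ends at the LAST "-->".
greedy = re.search(r"<!--( .* )-->", text).group(1)

# Non-greedy: .*? stops at the first "-->" that lets the match succeed.
lazy = re.search(r"<!--( .*? )-->", text).group(1)

print(repr(greedy))
print(repr(lazy))
```

The greedy version swallows everything between the first opener and the last closer, while the non-greedy one returns only the first comment.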

5 Answers

2

Try a conditional expression:

>>> for m in re.finditer(r"(?:(<!--)|(\*))(?P<comment> .*? )(?(1)-->)(?(2)\*)", a):
...   print(m.group('comment'))
...
 blabla
 bloblo

1

The exception you get in the "doesn't work" part is quite explicit about what is wrong:

sre_constants.error: redefinition of group name 'comment' as group 3; was group 2

Both groups have the same name; just rename the second one:

>>> re.search("(<!--(?P<comment> .* )-->|\*(?P<comment2> .* )\*)",a).group(1)
'<!-- blabla -->'
>>> re.search("(<!--(?P<comment> .* )-->|\*(?P<comment2> .* )\*)",a).groups()
('<!-- blabla -->', ' blabla ', None)
>>> re.findall("(<!--(?P<comment> .* )-->|\*(?P<comment2> .* )\*)",a)
[('<!-- blabla -->', ' blabla ', ''), ('* bloblo *', '', ' bloblo ')]
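
Once the groups have distinct names, you still need to pick whichever one took part in the match. A minimal sketch, assuming you only want the first comment: first_comment is a hypothetical helper name, and I have switched to non-greedy .*? (my change, not part of the patterns above) so that a later comment is not swallowed.

```python
import re

a = "bzzzzzz <!-- blabla --> blibli * bloblo * blublu"

# Same idea as above: two alternatives, two distinct group names.
pattern = re.compile(r"<!--(?P<comment> .*? )-->|\*(?P<comment2> .*? )\*")

def first_comment(text):
    """Return the text of the first comment of either kind, or None."""
    m = pattern.search(text)
    if m is None:
        return None
    # Only one alternative participates in a match; the other group is None.
    html_style, star_style = m.group("comment", "comment2")
    return html_style if html_style is not None else star_style

print(repr(first_comment(a)))
```

Testing against None rather than using "or" keeps the helper correct even if you later loosen the pattern so that a comment can be empty.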

1

As Gurney pointed out, you have two captures with the same name. Since you're not actually using the name, just leave that out.

Also, the r"" raw string notation is a good habit.

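Oh, and a third thing: you're grabbing the wrong index. 0 is the whole match, 1 is the whole "either-or" block, and 2 is the inner capture of the <!-- --> alternative, which is the one that matches first in your string. (The inner capture of the * * alternative is numbered 3.)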

re.search(r"(<!--( .* )-->|\*( .* )\*)",a).group(2)
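
Group numbers follow the order of the opening parentheses, so it is worth checking what each index holds for both kinds of comment. A quick sketch with two made-up input strings (html_first and star_first are illustrative names only):

```python
import re

pattern = re.compile(r"(<!--( .* )-->|\*( .* )\*)")

# One input per alternative, to see which group gets filled.
html_first = pattern.search("x <!-- blabla --> y")
star_first = pattern.search("x * bloblo * y")

print(pattern.groups)        # number of capturing groups in the pattern
print(html_first.groups())   # groups 1..3 when the <!-- --> branch matches
print(star_first.groups())   # groups 1..3 when the * * branch matches
```

The branch that did not match leaves its group as None, so group(2) is only the right choice when the <!-- --> form matched.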

1 Comment

There can never be an index 3 with this regex.
0

re.findall might be a better fit for this:

import re

# Keep your regex simple. You'll thank yourself a year from now. Note that
# this doesn't include the surrounding spaces. It also uses non-greedy matching
# so that you can embed multiple comments on the same line, and it doesn't
# break on strings like '<!-- first comment --> fragment -->'.
pattern = re.compile(r"(?:<!-- (.*?) -->|\* (.*?) \*)")

inputstring = 'bzzzzzz <!-- blabla --> blibli * bloblo * blublu foo ' \
              '<!-- another comment --> goes here'

# Now use re.findall to search the string. Each match will return a tuple
# with two elements: one for each of the groups in the regex above. Pick the
# non-blank one. This works even when both groups are empty; you just get an
# empty string.
results = [first or second for first, second in pattern.findall(inputstring)]
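
If you also need to know where each comment sits in the string, a variant of the same idea uses re.finditer together with the match object's lastindex attribute, which is the number of the last group that matched. This is only a sketch under the same assumptions as the code above (same input string, exactly one group participating per match); the name comments is mine.

```python
import re

pattern = re.compile(r"<!-- (.*?) -->|\* (.*?) \*")

inputstring = ('bzzzzzz <!-- blabla --> blibli * bloblo * blublu foo '
               '<!-- another comment --> goes here')

# For every match, exactly one of the two groups participated, and
# m.lastindex is its number (1 or 2). Keep the start offset as well.
comments = [(m.start(), m.group(m.lastindex))
            for m in pattern.finditer(inputstring)]

for offset, text in comments:
    print(offset, repr(text))
```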

0

You could go one of two ways (if supported by Python):

1: Branch reset (?|pattern|pattern|...)
(?|<!--( .*? )-->|\*( .*? )\*)    capture group 1 always contains the comment text

2: Conditional expression (?(condition)yes-pattern|no-pattern)
(?:(<!--)|\*)(?P<comment> .*? )(?(1)-->|\*)    here the condition is "did group 1 capture anything?"

Modifiers: s (single line) and g (global).
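
On the "if supported by Python" question: as far as I know, the built-in re module accepts the conditional expression but not the branch reset syntax (the third-party regex module is reported to support it). A sketch of option 2 in Python, assuming re.DOTALL as the counterpart of the s modifier and re.finditer as the counterpart of g:

```python
import re

a = "bzzzzzz <!-- blabla --> blibli * bloblo * blublu"

# Group 1 is set only when the opener was "<!--"; the conditional then
# demands "-->" as the closer, otherwise it demands "*".
pattern = re.compile(r"(?:(<!--)|\*)(?P<comment> .*? )(?(1)-->|\*)", re.DOTALL)

comments = [m.group("comment") for m in pattern.finditer(a)]
print(comments)

# With re.DOTALL, "." also matches newlines, so comments may span lines.
multiline = pattern.search("* line one\nline two *").group("comment")
print(repr(multiline))
```

The advantage over the two-alternative patterns is that the comment text always ends up in the same named group, whichever delimiter was used.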
