0

I'm starting to learn the re module. First, here is the original code:

import re
cheesetext = u'''<tag>I love cheese.</tag>
<tag>Yeah, cheese is all I need.</tag>
<tag>But let me explain one thing.</tag>
<tag>Cheese is REALLY I need.</tag>
<tag>And the last thing I'd like to say...</tag>
<tag>Everyone can like cheese.</tag>
<tag>It's a question of the time, I think.</tag>'''

def action1(source):
  regex = u'<tag>(.*?)</tag>'
  pattern = re.compile(regex, re.UNICODE | re.DOTALL | re.IGNORECASE)
  result = pattern.findall(source)
  return(result)

def action2(match, source):
  pattern = re.compile(match, re.UNICODE | re.DOTALL | re.IGNORECASE)
  result = bool(pattern.findall(source))
  return(result)

result = action1(cheesetext)
result = [item for item in result if action2(u'cheese', item)]
print result
>>> [u'I love cheese.', u'Yeah, cheese is all I need.', u'Cheese is REALLY I need.', u'Everyone can like cheese.']

Now, what I need: I want to do the same thing with a single regex. This is just an example; I have to process much more data than these cheesy texts. :-) Is it possible to combine these two actions into one regex? In other words, how can I use conditions in a regex?

1
  • By the way, it looks like you're trying to parse SGML/HTML/XML using regular expressions. That's not always the best way to go: regular expressions treat everything as a flat string, while markup languages describe a tree. Whatever you do, do not try to escape HTML using regular expressions, or samy will be your hero. Commented Feb 8, 2012 at 9:25
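
As a rough illustration of the tree-based alternative this comment points to, the sketch below uses the standard library's xml.etree.ElementTree instead of a regex. It assumes the input is well-formed XML once wrapped in a single root element; the `snippet` sample and the `root` wrapper element are made up for the example, and real-world HTML would likely need a more forgiving parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, shortened from the question's text.
snippet = (
    '<tag>I love cheese.</tag>'
    '<tag>But let me explain one thing.</tag>'
    '<tag>Cheese is REALLY I need.</tag>'
)

# Assumption: wrapping the fragments in one root element makes them valid XML.
root = ET.fromstring('<root>' + snippet + '</root>')

# Walk the tree and keep the text of every <tag> that mentions "cheese",
# ignoring case.
matches = [
    element.text
    for element in root.iter('tag')
    if element.text and 'cheese' in element.text.lower()
]
print(matches)
```

Here the parser finds the element boundaries, so the filter only has to look at each element's text.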

3 Answers

2
>>> p = u'<tag>((?:(?!</tag>).)*cheese.*?)</tag>'
>>> patt = re.compile(p, re.UNICODE | re.DOTALL | re.IGNORECASE)
>>> patt.findall(cheesetext)
[u'I love cheese.', u'Yeah, cheese is all I need.', u'Cheese is REALLY I need.', u'Everyone can like cheese.']

This uses a negative-lookahead assertion. A good explanation of this is given by Tim Pietzcker in this question.
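
To see what the lookahead buys you, here is a small sketch comparing a naive pattern with the one above. The two-line `sample` text and the variable names are made up for the illustration; the flags are a subset of those used in the answer.

```python
import re

# Hypothetical two-element sample: only the second element mentions cheese.
sample = '<tag>no dairy here</tag>\n<tag>I love cheese.</tag>'
flags = re.DOTALL | re.IGNORECASE

# Naive version: with DOTALL, the lazy .*? is free to run past the first
# </tag> until it finally reaches "cheese" in the next element.
naive = re.findall(r'<tag>(.*?cheese.*?)</tag>', sample, flags)

# Version from the answer: (?:(?!</tag>).)* consumes one character at a time,
# but only while the text ahead is not "</tag>", so the part before "cheese"
# cannot leave the current element.
tempered = re.findall(r'<tag>((?:(?!</tag>).)*cheese.*?)</tag>', sample, flags)

print(naive)
print(tempered)
```

The naive pattern returns a single item that spans both elements, while the lookahead version returns only the text of the element that actually contains "cheese".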

5 Comments

you need the negative-lookahead on both sides of "cheese"
Why? You're already using a reluctant .*?, so the match will stop at </tag> anyway.
Ha, no problem. I'm pretty sure your version also works, it just does some unnecessary computation.
@beerbajay: Thanks, I think it's the best answer! One question: can I add two more conditions here, so that an item is in the list only if "cheese" is not part of 'BARcheeseFOO' or 'FOOcheeseBAR'? I don't understand where I must insert the condition.
The more conditions you have, the more difficult to read the regex becomes. You can have these conditions, but it's almost easier to do the analysis in several steps. Also, what about this case: <tag>I love cheese, but hate BARcheeseFOO</tag>?
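
A minimal sketch of the several-steps approach suggested in the last comment, assuming that "not part of BARcheeseFOO" can be read as "cheese must appear as a whole word". The names `TAG_RE`, `WORD_RE` and `sample` are made up for the illustration, and whether the mixed case from the comment should be kept is an assumption, not something the thread settles.

```python
import re

# Step 1: extract the content of every <tag> element.
TAG_RE = re.compile(r'<tag>(.*?)</tag>', re.DOTALL)
# Step 2: require "cheese" as a standalone word; \b is a word boundary,
# so "BARcheeseFOO" does not count.
WORD_RE = re.compile(r'\bcheese\b', re.IGNORECASE)

# Hypothetical sample, including the edge case raised in the comment.
sample = (
    '<tag>I love cheese.</tag>\n'
    '<tag>BARcheeseFOO only</tag>\n'
    '<tag>I love cheese, but hate BARcheeseFOO</tag>'
)

contents = TAG_RE.findall(sample)
kept = [text for text in contents if WORD_RE.search(text)]
print(kept)
```

With this reading, the element that contains both a standalone "cheese" and "BARcheeseFOO" is kept. If it should be dropped instead, a second filter in step 2 is easier to add than another clause in a single large regex.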
1

You can use |.

>>> import re
>>> m = re.compile("(Hello|Goodbye) World")
>>> m.match("Hello World")
<_sre.SRE_Match object at 0x01ECF960>
>>> m.match("Goodbye World")
<_sre.SRE_Match object at 0x01ECF9E0>
>>> m.match("foobar")
>>> m.match("Hello World").groups()
('Hello',)

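In addition, if you need actual conditions, you can use lookahead assertions, (?=...) and (?!...), backreferences to previously matched groups such as (?P=name), and the conditional (?(id/name)yes-pattern|no-pattern), which branches on whether a group matched. See Python's re module docs.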
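
For a true if/then inside a pattern, Python's re module has the group-based conditional (?(id/name)yes-pattern|no-pattern). The sketch below is only an illustration with a made-up pattern and made-up inputs: it requires a closing ">" exactly when an opening "<" was matched.

```python
import re

# Group 1 optionally matches "<". The conditional (?(1)>) then demands ">"
# only if group 1 took part in the match; otherwise it matches nothing.
pattern = re.compile(r'^(<)?(\w+)(?(1)>)$')

candidates = ['<tag>', 'tag', '<tag', 'tag>']
results = dict((text, bool(pattern.match(text))) for text in candidates)
print(results)
```

Both '<tag>' and 'tag' match, while the unbalanced '<tag' and 'tag>' do not.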

1

I suggest using a negative lookahead to make sure the match never runs across a </tag>:

re.findall(r'<tag>((?:(?!</tag>).)*?cheese(?:(?!</tag>).)*?)</tag>', cheesetext, re.DOTALL | re.IGNORECASE)
