python's re: multiple regex

Question

I'm just starting to learn the re module. First, here is my original code:

import re
cheesetext = u'''<tag>I love cheese.</tag>
<tag>Yeah, cheese is all I need.</tag>
<tag>But let me explain one thing.</tag>
<tag>Cheese is REALLY I need.</tag>
<tag>And the last thing I'd like to say...</tag>
<tag>Everyone can like cheese.</tag>
<tag>It's a question of the time, I think.</tag>'''

def action1(source):
  regex = u'<tag>(.*?)</tag>'
  pattern = re.compile(regex, re.UNICODE | re.DOTALL | re.IGNORECASE)
  result = pattern.findall(source)
  return(result)

def action2(match, source):
  pattern = re.compile(match, re.UNICODE | re.DOTALL | re.IGNORECASE)
  result = bool(pattern.findall(source))
  return(result)

result = action1(cheesetext)
result = [item for item in result if action2(u'cheese', item)]
print result
# prints: [u'I love cheese.', u'Yeah, cheese is all I need.', u'Cheese is REALLY I need.', u'Everyone can like cheese.']

Now, what I need: I want to do the same thing with a single regex. This is only an example; I have to process much more data than these cheesy texts. :-) Is it possible to combine these two actions into one regex? In other words, how can I use conditions in a regex?

By the way, it looks like you're trying to parse SGML/HTML/XML using regular expressions. That's not always the best way to go: regular expressions treat everything as a flat string, while markup languages describe a tree. Whatever you do, do not try to escape HTML using regular expressions, or samy will be your hero.
– cha0site, Commented Feb 8, 2012 at 9:25

Community · Accepted Answer · 2017-05-23 12:11:23Z

2

>>> p = u'<tag>((?:(?!</tag>).)*cheese.*?)</tag>'
>>> patt = re.compile(p, re.UNICODE | re.DOTALL | re.IGNORECASE)
>>> patt.findall(cheesetext)
[u'I love cheese.', u'Yeah, cheese is all I need.', u'Cheese is REALLY I need.', u'Everyone can like cheese.']

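This uses a negative lookahead assertion: (?:(?!</tag>).)* consumes one character at a time, but only while the text ahead is not </tag>, so the part before "cheese" cannot run past the end of the current tag. A good explanation of this is given by Tim Pietzcker in this question.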
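To see why the lookahead matters, here is a minimal sketch comparing the pattern above with a naive version that uses a plain .*? before "cheese". The two-tag sample string and the names naive and tempered are made up for illustration; the code is written for Python 3, so results print without the u prefix.

```python
import re

FLAGS = re.DOTALL | re.IGNORECASE

# Hypothetical two-tag sample: only the second tag mentions cheese.
text = "<tag>no dairy here</tag><tag>I love cheese.</tag>"

# Naive: a plain lazy dot may run across "</tag>" to reach "cheese".
naive = re.compile(r"<tag>(.*?cheese.*?)</tag>", FLAGS)

# Tempered: each character is consumed only if "</tag>" does not start there.
tempered = re.compile(r"<tag>((?:(?!</tag>).)*cheese.*?)</tag>", FLAGS)

print(naive.findall(text))     # ['no dairy here</tag><tag>I love cheese.']
print(tempered.findall(text))  # ['I love cheese.']
```

The naive pattern starts in the first tag and spills into the second one, while the lookahead version rejects the first tag outright.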

edited May 23, 2017 at 12:11

answered Feb 8, 2012 at 9:29

beerbajay

5 Comments

ptitpoulpe Over a year ago

you need the negative-lookahead on both sides of "cheese"

beerbajay Over a year ago

Why? You're already using a reluctant .*?, so the match will stop at </tag> anyway.

beerbajay Over a year ago

Ha, no problem. I'm pretty sure your version also works, it just does some unnecessary computation.

ghostmansd Over a year ago

@beerbajay: Thanks, I think it's the best answer! One question: can I add two more conditions here, so that a text is included only if "cheese" is not part of 'BARcheeseFOO' or 'FOOcheeseBAR'? I don't understand where I should insert the conditions.

beerbajay Over a year ago

The more conditions you have, the harder the regex becomes to read. You can have these conditions, but it's often easier to do the analysis in several steps. Also, what about this case: <tag>I love cheese, but hate BARcheeseFOO</tag>?
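
The several-steps approach suggested in the last comment could look like the sketch below. It assumes that "not part of BARcheeseFOO" means "cheese" must appear as a whole word, which \b (word boundary) expresses, and that a tag containing both a standalone "cheese" and BARcheeseFOO should be kept; the thread leaves that case open. The names TAG_RE, WORD_RE and tags_with_word are hypothetical, and the code targets Python 3.

```python
import re

# Step 1: pull out the body of every tag.
TAG_RE = re.compile(r"<tag>(.*?)</tag>", re.DOTALL | re.IGNORECASE)
# Step 2: accept a body only if "cheese" occurs as a whole word (assumption).
WORD_RE = re.compile(r"\bcheese\b", re.IGNORECASE)

def tags_with_word(source):
    """Return tag bodies that contain 'cheese' as a standalone word."""
    return [body for body in TAG_RE.findall(source) if WORD_RE.search(body)]

# Made-up sample covering the cases raised in the comments.
sample = ("<tag>I love cheese.</tag>"
          "<tag>BARcheeseFOO only</tag>"
          "<tag>I love cheese, but hate BARcheeseFOO</tag>")

filtered = tags_with_word(sample)
print(filtered)  # ['I love cheese.', 'I love cheese, but hate BARcheeseFOO']
```

Keeping the two steps separate means each condition lives in its own small pattern, so a further rule becomes another filter and the main regex stays as it is.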

cha0site · Accepted Answer · 2012-02-08 09:13:30Z

1

You can use |.

>>> import re
>>> m = re.compile("(Hello|Goodbye) World")
>>> m.match("Hello World")
<_sre.SRE_Match object at 0x01ECF960>
>>> m.match("Goodbye World")
<_sre.SRE_Match object at 0x01ECF9E0>
>>> m.match("foobar")
>>> m.match("Hello World").groups()
('Hello',)

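In addition, if you need actual conditions, the re module offers lookahead assertions, (?=...) and (?!...), backreferences to previously matched groups such as (?P=name), and a true conditional, (?(id/name)yes-pattern|no-pattern), which tests whether a group matched. See Python's re module docs.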
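
As a rough illustration of these constructs (the sample strings are invented, and the code targets Python 3), the sketch below shows a negative lookahead, a named backreference, and the conditional form (?(1)...) that the re module also provides.

```python
import re

# Negative lookahead: "cheese" not immediately followed by "cake".
m = re.search(r"cheese(?!cake)", "cheesecake and cheese")
print(m.start())  # 15

# Named backreference: the closing quote must equal the opening one.
quoted = re.search(r"(?P<quote>['\"]).*?(?P=quote)", "say 'cheese' now")
print(quoted.group(0))  # 'cheese'

# Conditional: require ">" only if group 1 ("<") took part in the match.
bracketed = re.compile(r"^(<)?\w+(?(1)>)$")
print(bool(bracketed.match("<tag>")))  # True
print(bool(bracketed.match("tag")))    # True
print(bool(bracketed.match("<tag")))   # False
```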

answered Feb 8, 2012 at 9:13

cha0site

ptitpoulpe · Accepted Answer · 2012-02-08 09:15:54Z

1

I suggest using a negative lookahead to check that no </tag> appears inside the match:

re.findall(r'<tag>((?:(?!</tag>).)*?cheese(?:(?!</tag>).)*?)</tag>', cheesetext, re.UNICODE | re.DOTALL | re.IGNORECASE)
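
As a quick check, the sketch below runs this pattern next to the one-sided version from the other answer. It assumes the re.DOTALL | re.IGNORECASE flags from the question (without re.IGNORECASE the capitalised "Cheese" would not match); the sample string and variable names are made up, and the code targets Python 3.

```python
import re

FLAGS = re.DOTALL | re.IGNORECASE

# Lookahead only before "cheese"; a lazy dot after it.
one_sided = re.compile(r"<tag>((?:(?!</tag>).)*cheese.*?)</tag>", FLAGS)
# Lookahead on both sides of "cheese".
two_sided = re.compile(
    r"<tag>((?:(?!</tag>).)*?cheese(?:(?!</tag>).)*?)</tag>", FLAGS)

# Made-up sample: the middle tag has no cheese.
sample = ("<tag>I love cheese.</tag>\n"
          "<tag>No dairy.</tag>\n"
          "<tag>Cheese is all I need.</tag>")

print(one_sided.findall(sample))  # ['I love cheese.', 'Cheese is all I need.']
print(two_sided.findall(sample))  # ['I love cheese.', 'Cheese is all I need.']
```

On this sample both patterns return the same list, which is consistent with the comment that the second lookahead is redundant when the trailing part is a lazy .*? followed directly by </tag>.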

answered Feb 8, 2012 at 9:15

ptitpoulpe
