Python regex unexpected behaviour

Question

import re

str1='<a href="/states/florida/433" title="florida"><img alt="florida" src="http://abc.com"'
str2='<a href="/states/florida/433" title="florida">'
pat = re.compile('/states/.*/([^"]+)')
if ( pat.findall(str2) == pat.findall(str1)):
    print "TRUE"
else:
    print "FALSE"

OUTPUT: FALSE

output2 (pat.findall(str2)): 433
output1 (pat.findall(str1)): abc.com

Can somebody explain?

What's strange? Your RegEx is working properly.

– Madbreaks, Commented Jan 30, 2013 at 18:36

Rohit Jain · Accepted Answer · 2013-01-30 18:35:48Z

3

Use the reluctant quantifier .*? instead of the greedy .* and all will be well:

pat = re.compile('/states/.*?/([^"]+)')

Quantifiers are greedy by default: they match as much of the string as they can while still leaving enough for the rest of the pattern to match. Adding ? after the quantifier makes it reluctant, so it matches as little as possible and stops at the first occurrence of the following character, / in this case.
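To see the difference concretely, here is a small sketch that runs both patterns over the two strings from the question. It is written for Python 3 (print as a function), and the names greedy and reluctant are just labels chosen for this illustration.

```python
import re

str1 = '<a href="/states/florida/433" title="florida"><img alt="florida" src="http://abc.com"'
str2 = '<a href="/states/florida/433" title="florida">'

# Original pattern: .* runs as far as it can, then backs up to the last usable "/".
greedy = re.compile(r'/states/.*/([^"]+)')
# Fixed pattern: .*? stops at the first "/" that lets the rest of the pattern match.
reluctant = re.compile(r'/states/.*?/([^"]+)')

greedy_results = (greedy.findall(str1), greedy.findall(str2))
reluctant_results = (reluctant.findall(str1), reluctant.findall(str2))

print(greedy_results)     # (['abc.com'], ['433'])
print(reluctant_results)  # (['433'], ['433'])
```

With the reluctant version both calls return the same list, so the comparison in the question comes out equal.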

answered Jan 30, 2013 at 18:35

Rohit Jain


1 Comment

david king Over a year ago

Additionally, I'd say not to use .* here at all. Use the regex to say what a state name looks like too, so that you know what you're grabbing before you grab it.
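As a rough sketch of that suggestion: the pattern below assumes a state name is made of lowercase letters and hyphens and that the trailing id is all digits. Both assumptions are guesses based on the single URL in the question, so adjust the character classes to whatever the real URLs look like.

```python
import re

str1 = '<a href="/states/florida/433" title="florida"><img alt="florida" src="http://abc.com"'
str2 = '<a href="/states/florida/433" title="florida">'

# Assumed shapes (not confirmed by the question):
#   state name = lowercase letters and hyphens, id = digits only.
strict = re.compile(r'/states/([a-z-]+)/(\d+)')

matches1 = strict.findall(str1)
matches2 = strict.findall(str2)

print(matches1)  # [('florida', '433')]
print(matches2)  # [('florida', '433')]
```

Because neither group can cross a / or a quote, there is no .* left to run away to the end of the string.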

Fabien · Accepted Answer · 2013-01-30 18:36:05Z

1

On the first string, your regexp matches everything from /states/ up to abc.com:

<a href="/states/florida/433" title="florida"><img alt="florida" src="http://abc.com
         /states/                                .*                         /([^"]+)

and not

<a href="/states/florida/433" title="florida"><img alt="florida" src="http://abc.com
         /states/  .*   /([^"]+)

Quantifiers are greedy, so .* eats as much of the string as it can.
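If you want to check this yourself, a minimal sketch (Python 3) is to look at the whole match rather than just the captured group. The variable names here are only for illustration.

```python
import re

str1 = '<a href="/states/florida/433" title="florida"><img alt="florida" src="http://abc.com"'

greedy = re.compile(r'/states/.*/([^"]+)')
m = greedy.search(str1)

whole_match = m.group(0)  # everything the pattern consumed
captured = m.group(1)     # only the part inside the parentheses

print(whole_match)  # /states/florida/433" title="florida"><img alt="florida" src="http://abc.com
print(captured)     # abc.com
```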

answered Jan 30, 2013 at 18:36

Fabien


nhahtdh · Accepted Answer · 2013-01-30 19:01:16Z

1

Your RegEx is working properly:

<a href="/states/florida/433" title="florida"><img alt="florida" src="http://abc.com"
         ^^^^^^^^............................................................^^^^^^^
         /states/                      .*/                                     [^"]+

And:

<a href="/states/florida/433" title="florida">
         ^^^^^^^^........^^^

If you don't want to consume the whole string in the first case, add ? after the quantifier (.*? instead of .*) to make it non-greedy, so that the pattern says "/states/ followed by any number of characters up until the first / followed by one or more non-quote characters".
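For comparison with the diagrams above, this short sketch (Python 3, illustrative variable names) prints the full text matched by the non-greedy pattern in each string.

```python
import re

str1 = '<a href="/states/florida/433" title="florida"><img alt="florida" src="http://abc.com"'
str2 = '<a href="/states/florida/433" title="florida">'

non_greedy = re.compile(r'/states/.*?/([^"]+)')

# group(0) is the whole matched text, not just the captured id.
matched = [non_greedy.search(s).group(0) for s in (str1, str2)]

print(matched)  # ['/states/florida/433', '/states/florida/433']
```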

edited Jan 30, 2013 at 19:01

nhahtdh


answered Jan 30, 2013 at 18:39

Madbreaks


Trevor · Accepted Answer · 2013-01-30 18:51:06Z

0

Your pattern is greedy (you can read about greedy and non-greedy regex patterns here: http://docs.python.org/2/library/re.html and here: http://www.itworld.com/nl/perl/01112001). Changing the pattern from

'/states/.*/([^"]+)'

to

'/states/.*?/([^"]+)'

makes the script print TRUE. Here's the full modified source:

import re

str1='<a href="/states/florida/433" title="florida"><img alt="florida" src="http://abc.com"'
str2='<a href="/states/florida/433" title="florida">'
pat = re.compile('/states/.*?/([^"]+)')
if ( pat.findall(str2) == pat.findall(str1)):
    print "TRUE"
else:
    print "FALSE"

answered Jan 30, 2013 at 18:51

Trevor
