2
import re

str1='<a href="/states/florida/433" title="florida"><img alt="florida" src="http://abc.com"'
str2='<a href="/states/florida/433" title="florida">'
pat = re.compile('/states/.*/([^"]+)')
if pat.findall(str2) == pat.findall(str1):
    print("TRUE")
else:
    print("FALSE")

Output: FALSE

Result for str2: 433
Result for str1: abc.com

Can somebody explain?

1
  • What's strange? Your RegEx is working properly. Commented Jan 30, 2013 at 18:36

4 Answers

3

Use the reluctant quantifier .*? instead of the greedy .* and all will be well:

pat = re.compile('/states/.*?/([^"]+)')

Quantifiers are greedy by default: they consume as much of the string as they can while still leaving enough for the rest of the pattern to match. Adding ? after a quantifier makes it reluctant, so it consumes as little as possible and stops at the first place where the rest of the pattern can match. Here that is the first / after the state name.
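A minimal sketch of the difference, run against the two strings from the question. The names greedy and lazy are only for illustration; everything else comes from the question.

```python
import re

str1 = '<a href="/states/florida/433" title="florida"><img alt="florida" src="http://abc.com"'
str2 = '<a href="/states/florida/433" title="florida">'

# Greedy: .* runs to the end of the string, then backtracks to the LAST "/"
# that is followed by at least one non-quote character.
greedy = re.compile(r'/states/.*/([^"]+)')

# Reluctant: .*? grows one character at a time and stops at the FIRST "/"
# that lets the rest of the pattern match.
lazy = re.compile(r'/states/.*?/([^"]+)')

greedy_results = (greedy.findall(str1), greedy.findall(str2))
lazy_results = (lazy.findall(str1), lazy.findall(str2))

print(greedy_results)  # (['abc.com'], ['433'])
print(lazy_results)    # (['433'], ['433'])
```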


1 Comment

Additionally, I'd avoid .* here altogether. Make the regex describe what a state name looks like too, so that you know what you're grabbing before you grab it.
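One way to follow that suggestion, as a sketch only: it assumes a state name is any run of characters containing neither a slash nor a double quote, which the thread does not actually specify. STATE_URL and extract_ids are hypothetical names.

```python
import re

str1 = '<a href="/states/florida/433" title="florida"><img alt="florida" src="http://abc.com"'
str2 = '<a href="/states/florida/433" title="florida">'

# Assumption: a state name contains no "/" and no '"'.
# [^/"]+ cannot cross a slash, so greediness no longer matters here.
STATE_URL = re.compile(r'/states/[^/"]+/([^"]+)')

def extract_ids(html):
    """Return whatever follows /states/<name>/ up to the closing quote."""
    return STATE_URL.findall(html)

ids1 = extract_ids(str1)
ids2 = extract_ids(str2)
print(ids1 == ids2)  # True
```

Because the state-name part is described explicitly, the pattern fails to match rather than silently capturing text from a later attribute.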
1

On the first URL, your regexp matches almost the whole string:

<a href="/states/florida/433" title="florida"><img alt="florida" src="http://abc.com
         /states/                                .*                         /([^"]+)

and not

<a href="/states/florida/433" title="florida"><img alt="florida" src="http://abc.com
         /states/  .*   /([^"]+)

Quantifiers are greedy by default, so .* consumes as much of the string as it can.
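To see this directly, the sketch below prints the full text consumed by the original pattern on the first string. It uses re.search and the match object's group method; the variable names are illustrative.

```python
import re

str1 = '<a href="/states/florida/433" title="florida"><img alt="florida" src="http://abc.com"'

pat = re.compile(r'/states/.*/([^"]+)')
m = pat.search(str1)

whole_match = m.group(0)  # everything the pattern consumed
captured = m.group(1)     # only the parenthesised group

print(whole_match)
print(captured)
```

The first line printed runs from /states/ all the way to abc.com, which shows that .* swallowed the title, the img tag, and most of the src URL.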


1

Your RegEx is working properly:

<a href="/states/florida/433" title="florida"><img alt="florida" src="http://abc.com"
         ^^^^^^^^............................................................^^^^^^^
         /states/                      .*/                                     [^"]+

And:

<a href="/states/florida/433" title="florida">
         ^^^^^^^^........^^^

If you don't want to consume the whole string in the first case, add ? after the quantifier (.*?) to make it non-greedy. The pattern then says "/states/ followed by as few characters as possible up to the first / that is followed by one or more non-quote characters".


0

Your pattern is greedy (you can read about greedy and non-greedy regex patterns here: http://docs.python.org/2/library/re.html and here: http://www.itworld.com/nl/perl/01112001). Changing the pattern from

'/states/.*/([^"]+)'

to

'/states/.*?/([^"]+)'

prints TRUE. Here's the full modified source:

import re

str1='<a href="/states/florida/433" title="florida"><img alt="florida" src="http://abc.com"'
str2='<a href="/states/florida/433" title="florida">'
pat = re.compile('/states/.*?/([^"]+)')
if pat.findall(str2) == pat.findall(str1):
    print("TRUE")
else:
    print("FALSE")

