Python: Regex finds only part of sought string

Question

The content variable contains a multiline string:

content = """
/blog/1:text:Lorem ipsum dolor sit amet, consectetur ### don't need this
<break>
text:Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore
<break>
text:Excepteur sint occaecat cupidatat non proident.

/blog/16:text:Other Lorem ipsum dolor ### SEEKING THIS!!!
<break>
text:Other, really other
<break>
text:Blah blah.
"""

I'm trying to find the desired occurrence with the pattern /blog/16:

re.findall('^(?ism)%s?:(.*?)(\n\n)' % '/blog/16', content)

and I expect to get this:

[(u'/blog/16:text:Other Lorem ipsum dolor ### SEEKING THIS!!!
<break>
text:Other, really other
<break>
text:Blah blah.', u'\n\n')]

but I get the wrong result (/blog/1):

[(u'/blog/1:text:Lorem ipsum dolor sit amet, consectetur ### don't need this
<break>
text:Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore
<break>
text:Excepteur sint occaecat cupidatat non proident.', u'\n\n')]

What is my mistake?

It is not clear. What is the pattern you are looking for and what is the problem?
– thefourtheye, Commented Apr 26, 2014 at 7:00
"What is my mistake?" Answer: your mistake is that you didn't post the sample pattern you want to match.
– Ravi Dhoriya ツ, Commented Apr 26, 2014 at 7:02
Sorry for that, I'm looking for /blog/16, but it finds /blog/1. Updated the question.
– Vlad T., Commented Apr 26, 2014 at 7:03

jonrsharpe · Accepted Answer · 2014-04-26 07:14:43Z

2

Once you insert the blog text, this part of your regex:

/blog/16?:

Means "match: /blog/1 literally; then 6 literally (zero or one times); then : literally". Instead, try:

(?ism)^/blog/16:(.*?)$

This finds all of /blog/16: literally at the start of the line, then does a non-greedy search for any characters up to the end of a line (i.e. captures the rest of the text on the line).
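To make the difference concrete, here is a minimal sketch comparing the two patterns. The sample content is a shortened stand-in for the one in the question, and the variable names are only for illustration. The flags are written at the very start of each pattern, which newer Python versions require for inline global flags.

```python
import re

# Shortened stand-in for the content string from the question.
content = """
/blog/1:text:Lorem ipsum
<break>
text:Duis aute

/blog/16:text:Other Lorem ipsum
<break>
text:Blah blah.
"""

# The trailing ? applies only to the "6", so "/blog/1:" matches as well.
optional_six = re.compile(r"(?m)^/blog/16?:")
matched = [m.group(0) for m in optional_six.finditer(content)]
print(matched)

# Without the ?, only the /blog/16 line matches;
# (.*?)$ captures the rest of that one line.
fixed = re.findall(r"(?ism)^/blog/16:(.*?)$", content)
print(fixed)
```

The first list contains both prefixes, the second only the remainder of the /blog/16 line.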

You might find regex101 useful for developing and testing regular expressions.

vinit_ivar · Accepted Answer · 2014-04-26 07:09:22Z

2

I think you meant ?: to introduce a non-capturing group, but that only works inside parentheses: (?:...). Right now, your ? says "0 or 1 of the previous element," which means that the 6 is optional.
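A small sketch of the two readings of ?: side by side, assuming a recent Python 3 (re.fullmatch). The test strings and variable names are made up for illustration.

```python
import re

# Outside parentheses, ? is a quantifier on the "6" and ":" is a literal colon.
quantifier_match = re.fullmatch(r"/blog/16?:", "/blog/1:")
print(quantifier_match is not None)

# Inside parentheses, ?: opens a non-capturing group,
# so (.*) is the only numbered group.
grouped = re.fullmatch(r"/blog/(?:16):(.*)", "/blog/16:text")
print(grouped.groups())

# With "16" required, the shorter prefix no longer matches.
print(re.fullmatch(r"/blog/(?:16):(.*)", "/blog/1:text"))
```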

Vlad T. Over a year ago

Thank you, I thought it related to the whole pattern, not to the last character.

thefourtheye · Accepted Answer · 2014-04-26 07:23:39Z

2

After the string substitution is done, your pattern looks like this:

^(?ism)/blog/16?:(.*?)(\n\n)

Here, ? means "match the previous element 0 or 1 times", and the previous element is just the 6. So, when the input is /blog/1:, the 6 is matched 0 times and the match is allowed.

The actual RegEx you are looking for is,

import re
# print() with parentheses works in both Python 2 and Python 3
print(re.findall('(?ims)(/blog/16:.*)(?:/blog|$)', content))

Output

['/blog/16:text:Other Lorem ipsum dolor ### SEEKING THIS!!!\n<break>\ntext:Other, really other\n<break>\ntext:Blah blah.\n']
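Note that the greedy .* in that pattern runs to the end of the string, so it returns just this block only because /blog/16 is the last entry in content. Below is a minimal sketch of a variant that stops at the blank line instead, assuming blocks are always separated by an empty line. find_block is a hypothetical helper name, and the sample content is shortened. re.escape keeps any special characters in the prefix literal.

```python
import re

# Shortened stand-in for the content string from the question.
content = """
/blog/1:text:Lorem ipsum
<break>
text:Duis aute

/blog/16:text:Other Lorem ipsum
<break>
text:Blah blah.
"""


def find_block(prefix, text):
    """Return every block that starts with '<prefix>:' at the start of a line.

    Assumes a block ends at the first blank line or at the end of the text.
    """
    # Lazy .*? with a lookahead stops before "\n\n" or before the final newline.
    pattern = r"(?ms)^%s:.*?(?=\n\n|\n?\Z)" % re.escape(prefix)
    return re.findall(pattern, text)


print(find_block('/blog/16', content))
print(find_block('/blog/1', content))
```

Because the colon directly follows the escaped prefix, '/blog/1' no longer matches the '/blog/16' block and vice versa.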
