2

I'm using Python 2.7.3 on Linux. Here is an interpreter session, verbatim:

>>> f = open("feed.xml")
>>> text = f.read()
>>> import re
>>> regexp1 = re.compile(r'</?item>')
>>> regexp2 = re.compile(r'<item>.*</item>')
>>> regexp1.findall(text)
['<item>', '</item>', '<item>', '</item>', '<item>', '</item>', '<item>', '</item>']
>>> regexp2.findall(text)
[]

Is this a bug, or is there something I'm not understanding about Python regular expressions?

2 Answers

5

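By default, '.' does not match a newline, so '.*' cannot span an <item> element that is split across several lines, which is presumably the case in your feed. Try compiling with re.DOTALL: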

regexp2 = re.compile(r'<item>.*</item>', re.DOTALL)
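To see the difference for yourself, here is a small sketch. The `text` string is a made-up stand-in for feed.xml (the real file isn't shown in the question), and the variable names are only for illustration. Note that with re.DOTALL the greedy '.*' swallows everything from the first <item> to the last </item>, so if you want one match per item you probably also want the non-greedy '.*?'.

```python
import re

# Hypothetical stand-in for the contents of feed.xml (assumption: each
# <item> element spans several lines, as in a typical RSS feed).
text = """<rss>
<item>
<title>First</title>
</item>
<item>
<title>Second</title>
</item>
</rss>"""

# Without re.DOTALL, '.' stops at every newline, so nothing matches.
no_flag = re.compile(r'<item>.*</item>').findall(text)

# With re.DOTALL, the greedy '.*' runs from the first <item>
# all the way to the last </item>: a single, oversized match.
greedy = re.compile(r'<item>.*</item>', re.DOTALL).findall(text)

# The non-greedy '.*?' stops at the nearest </item>: one match per item.
lazy = re.compile(r'<item>.*?</item>', re.DOTALL).findall(text)

print("%d %d %d" % (len(no_flag), len(greedy), len(lazy)))  # prints: 0 1 2
```

The same flag can be written inline as `(?s)` at the start of the pattern if you would rather not pass a second argument to re.compile.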

0

Here is the best answer to this question: Don't use regular expressions to parse non-regular languages such as XML. It drove one S-O user insane. Another relevant link.
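For what it's worth, the parser route is not much more code than the regex. A minimal sketch using the standard library's xml.etree.ElementTree, assuming a feed shaped roughly like the made-up `feed` string below (the real feed.xml isn't shown, so the <title> children are a guess):

```python
import xml.etree.ElementTree as ET

# Hypothetical feed contents; the element names under <item> are assumptions.
feed = """<rss><channel>
<item><title>First</title></item>
<item><title>Second</title></item>
</channel></rss>"""

# Parse the whole document once, then walk every <item> element,
# however deeply it is nested and however it is split across lines.
root = ET.fromstring(feed)
titles = [item.findtext('title') for item in root.iter('item')]

print(titles)
```

For a file on disk, `ET.parse("feed.xml").getroot()` would take the place of `ET.fromstring(feed)`.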

5 Comments

This doesn't address his misunderstanding of regular expressions, however.
A valid point, but I'm only using this code for a quick hack and thus don't want or need to learn any new APIs.
I finally followed the link to the insane S-O user. I'd retract my downvote for that if I could :)
@chepner: made a trivial (whitespace only) edit so you can retract the downvote.
@Jangler: Quick hacks often become scripts that you rely on. If you learn the new API, you can do your next quick hack with it.
