I was just reviewing a previous post I made and noticed a number of people suggesting that I shouldn't use regex to parse XML. In that case the XML was relatively simple, and regex didn't pose any problems. I was also parsing a number of other code formats, so for the sake of uniformity it made sense. But I'm curious how this might pose a problem in other cases. Is this just a 'don't reinvent the wheel' type of issue?
- @Michael waiting for the link. – ApprenticeHacker, Dec 20, 2011 at 14:37
- You can use regex for extracting bits of information from small, predictable, restricted snippets of XML, no problem, but regex is not meant for parsing XML as a whole. It's like using a ball-peen hammer to peel an orange. – BoltClock, Dec 20, 2011 at 14:37
- It actually is a good question - it would be good to have a definitive answer here, which could be referred to whenever there are questions regarding parsing XML with regular expressions... – Avi, Dec 20, 2011 at 14:38
- This answer is about parsing HTML, but nevertheless insightful: stackoverflow.com/questions/4231382/… – martin clayton, Dec 20, 2011 at 14:51
- The best answer is stackoverflow.com/a/1732454/135078 (Beware Zalgo) – Kelly S. French, Jan 12, 2012 at 22:38
3 Answers
The real trouble is nested tags. Nested tags are very difficult to handle with regular expressions. It's possible with balanced matching, but that's only available in .NET (as balancing groups) and a few other flavors, such as Perl and PCRE, which offer recursive patterns instead. But even with the power of balanced matching, an ill-placed comment could potentially throw off the regular expression.
For example, this is a tricky one to parse...
<div>
    <div id="parse-this">
        <!-- oops</div> -->
        try to get this value with regex
    </div>
</div>
You could be chasing edge cases like this for hours with a regular expression, and maybe find a solution. But really, there's no point when there are specialized XML, XHTML, and HTML parsers out there that do the job more reliably and efficiently.
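To make that concrete, here is a minimal sketch in Python (my choice of language; the thread doesn't name one) that runs a naive regex and a real parser over the snippet above. The pattern and the variable names are made up for illustration, and the standard-library `xml.etree.ElementTree` stands in for whichever parser you'd actually use.

```python
import re
import xml.etree.ElementTree as ET

doc = """<div>
<div id="parse-this">
<!-- oops</div> -->
try to get this value with regex
</div>
</div>"""

# Naive regex: lazily match up to the first "</div>".
# That first "</div>" sits inside the comment, so the capture stops too early.
regex_value = re.search(
    r'<div id="parse-this">(.*?)</div>', doc, re.S
).group(1).strip()

# Parser: the default ElementTree parser does not keep comments in the tree,
# so the element's text is just the character data around the comment.
target = ET.fromstring(doc).find(".//div[@id='parse-this']")
parser_value = target.text.strip()

print(repr(regex_value))   # '<!-- oops'
print(repr(parser_value))  # 'try to get this value with regex'
```

You could of course patch the pattern to skip comments, but then CDATA sections, attribute values containing `>`, and so on are waiting in line behind it.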
This has been discussed so many times here on SO. See e.g.
Can you provide some examples of why it is hard to parse XML and HTML with a regex?
Why it's not possible to use regex to parse HTML/XML: a formal explanation in layman's terms
Just follow the links on the right side of the screen to more answers.
My conclusion:
Simply put, a regular expression is not a parser; it's a tool for finding patterns.
If you want to find a very specific pattern in a (ht|x)ml file, go on, regex is perfect for that.
But if you are searching for something in every Foo tag, which could have attributes in different orders, which can be nested, and which can be malformed (and still valid), then use a parser, because that's not pattern matching anymore.
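As a rough illustration of the "every Foo tag" case, here is a small Python sketch. The document, the `id`/`name` attributes, and the pattern are all invented for the example, and `xml.etree.ElementTree` is just one parser you could pick.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical input: same element, but attribute order, quoting and nesting vary.
doc = """<root>
  <Foo id="1" name="first"/>
  <Foo name='second' id="2"/>
  <Foo id="3"><Foo name="nested" id="4"/></Foo>
</root>"""

# A pattern written against the first Foo only: id, then name, double quotes.
pattern = re.compile(r'<Foo id="(\d+)" name="(\w+)"')
regex_names = [name for _id, name in pattern.findall(doc)]

# The parser doesn't care about attribute order, quote style, or nesting depth.
parser_names = [
    foo.get("name")
    for foo in ET.fromstring(doc).iter("Foo")
    if foo.get("name") is not None
]

print(regex_names)   # ['first']
print(parser_names)  # ['first', 'second', 'nested']
```

The regex isn't wrong for the exact pattern it describes; it just describes one spelling of the tag, while the parser works on the structure.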
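XML is not a regular language (that's a technical term): elements can nest to arbitrary depth, and matching each opening tag with its closing tag takes memory that a true regular expression doesn't have. So you will never be able to parse it correctly using a regular expression. You might be successful 99% of the time, but then someone will find a way of writing the XML that throws you.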
If you're writing some kind of screen-scraper then a 99% success rate might be adequate. For most applications, it isn't.
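Here is a toy sketch in Python of what "not regular" means in practice. The tag name `a`, the sample string, and the helper `first_element_body` are all hypothetical, and the helper assumes well-formed input with no attributes or comments. Neither the lazy nor the greedy pattern finds the matching close tag, while a few lines that count depth do, and that counter is exactly the memory a plain regular expression lacks.

```python
import re

nested = "<a><a>inner</a> tail</a><a>sibling</a>"

# Lazy stops at the first </a>; greedy runs on to the last one. Both are wrong.
lazy = re.match(r"<a>(.*?)</a>", nested).group(1)
greedy = re.match(r"<a>(.*)</a>", nested).group(1)

def first_element_body(text, tag="a"):
    """Return the body of the first <tag> element by tracking nesting depth."""
    open_tag, close_tag = f"<{tag}>", f"</{tag}>"
    tokens = re.compile(f"{re.escape(open_tag)}|{re.escape(close_tag)}")
    depth = 0
    start = None
    for m in tokens.finditer(text):
        if m.group() == open_tag:
            if depth == 0:
                start = m.end()      # body begins after the outermost open tag
            depth += 1
        else:
            depth -= 1
            if depth == 0:
                return text[start:m.start()]  # matching close tag found
    return None  # unbalanced input

print(lazy)                        # <a>inner
print(greedy)                      # <a>inner</a> tail</a><a>sibling
print(first_element_body(nested))  # <a>inner</a> tail
```

Once you're tracking depth like this you're already writing a (bad) parser, which is the point of using a real one.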
r'[\s \t,]*("[^"]+"|\'[^\']+\'|[^ \t,]+)[ \t,]*' and r'[\s \t]*([+-]?"[^"]+"|\'[^\']+\'|[^ \t]+)[ \t]*' respectively. I throw up a little in my mouth thinking about the fact I wrote a generator for these abominations. ;^P And this is still (extremely) fragile to quote balances!