
I am trying to match the first node in an XML document using the regex

~<(\S+).*>.*</\1>~

but it matches nothing unless the text is below a certain length. In one document, the regex only found a match after I had stripped the text down to 1186 characters. In the following example, I had to strip the text down to 960 characters before the regex succeeded. As you can imagine, this seemingly inconsistent behavior is very confusing. I would appreciate any information on why this is occurring.

Original text:

<?xml version="1.0"?> <catalog> <book id="bk101"> <author>Gambardella, Matthew</author> <title>XML Developer's Guide</title> <genre>Computer</genre> <price>44.95</price> <publish_date>2000-10-01</publish_date> <description>An in-depth look at creating applications with XML.</description> </book> <book id="bk102"> <author>Ralls, Kim</author> <title>Midnight Rain</title> <genre>Fantasy</genre> <price>5.95</price> <publish_date>2000-12-16</publish_date> <description>A former architect battles corporate zombies, an evil sorceress, and her own childhood to become queen of the world.</description> </book> <book id="bk103"> <author>Corets, Eva</author> <title>Maeve Ascendant</title> <genre>Fantasy</genre> <price>5.95</price> <publish_date>2000-11-17</publish_date> <description>After the collapse of a nanotechnology society in England, the young survivors lay the foundation for a new society.</description> </book> <book id="bk104"> <author>Corets, Eva</author> <title>Oberon's Legacy</title> <genre>Fantasy</genre> <price>5.95</price> <publish_date>2001-03-10</publish_date> <description>In post-apocalypse England, the mysterious agent known only as Oberon helps to create a new life for the inhabitants of London. Sequel to Maeve Ascendant.</description> </book> <book id="bk105"> <author>Corets, Eva</author> <title>The Sundered Grail</title> <genre>Fantasy</genre> <price>5.95</price> <publish_date>2001-09-10</publish_date> <description>The two daughters of Maeve, half-sisters, battle one another for control of England. 
Sequel to Oberon's Legacy.</description> </book> <book id="bk106"> <author>Randall, Cynthia</author> <title>Lover Birds</title> <genre>Romance</genre> <price>4.95</price> <publish_date>2000-09-02</publish_date> <description>When Carla meets Paul at an ornithology conference, tempers fly as feathers get ruffled.</description> </book> <book id="bk107"> <author>Thurman, Paula</author> <title>Splish Splash</title> <genre>Romance</genre> <price>4.95</price> <publish_date>2000-11-02</publish_date> <description>A deep sea diver finds true love twenty thousand leagues beneath the sea.</description> </book> <book id="bk108"> <author>Knorr, Stefan</author> <title>Creepy Crawlies</title> <genre>Horror</genre> <price>4.95</price> <publish_date>2000-12-06</publish_date> <description>An anthology of horror stories about roaches, centipedes, scorpions and other insects.</description> </book> <book id="bk109"> <author>Kress, Peter</author> <title>Paradox Lost</title> <genre>Science Fiction</genre> <price>6.95</price> <publish_date>2000-11-02</publish_date> <description>After an inadvertant trip through a Heisenberg Uncertainty Device, James Salway discovers the problems of being quantum.</description> </book> <book id="bk110"> <author>O'Brien, Tim</author> <title>Microsoft .NET: The Programming Bible</title> <genre>Computer</genre> <price>36.95</price> <publish_date>2000-12-09</publish_date> <description>Microsoft's .NET initiative is explored in detail in this deep programmer's reference.</description> </book> <book id="bk111"> <author>O'Brien, Tim</author> <title>MSXML3: A Comprehensive Guide</title> <genre>Computer</genre> <price>36.95</price> <publish_date>2000-12-01</publish_date> <description>The Microsoft MSXML3 parser is covered in detail, with attention to XML DOM interfaces, XSLT processing, SAX and more.</description> </book> <book id="bk112"> <author>Galos, Mike</author> <title>Visual Studio 7: A Comprehensive Guide</title> <genre>Computer</genre> <price>49.95</price> 
<publish_date>2001-04-16</publish_date> <description>Microsoft Visual Studio 7 is explored in depth, looking at how Visual Basic, Visual C++, C#, and ASP+ are integrated into a comprehensive development environment.</description> </book> </catalog>

Trimmed (successful) text:

<?xml version="1.0"?> <catalog> <book id="bk101"> <author>Gambardella, Matthew</author> <title>XML Developer's Guide</title> <genre>Computer</genre> <price>44.95</price> <publish_date>2000-10-01</publish_date> <description>An in-depth look at creating applications with XML.</description> </book> <book id="bk102"> <author>Ralls, Kim</author> <title>Midnight Rain</title> <genre>Fantasy</genre> <price>5.95</price> <publish_date>2000-12-16</publish_date> <description>A former architect battles corporate zombies, an evil sorceress, and her own childhood to become queen of the world.</description> </book> <book id="bk103"> <author>Corets, Eva</author> <title>Maeve Ascendant</title> <genre>Fantasy</genre> <price>5.95</price> <publish_date>2000-11-17</publish_date> <description>After the collapse of a nanotechnology society in England, the young survivors lay the foundation for a new society.</description> </book> <book id="bk104"> <author>Core</catalog>

I apologize for the formatting of the texts, but I did not want to add anything to the data (such as newline characters) that would make it behave differently for others.

EDIT: I have been testing the regex using this site.

  • Why the regex? Just use one of the 4 or 5 XML parsers that PHP has built in. Commented Jul 15, 2013 at 16:53
  • FOR SCIENCE! Look, this isn't about XML parsers, this is about PHP's regex exhibiting odd behavior. Either there is something wrong with my regex, or there is something wrong with PHP's regex implementation, or there is an option or something of which I am unaware. I just want to know why it's behaving as it is. Commented Jul 15, 2013 at 16:58
  • Maybe you can try something with pcre.backtrack_limit (using ini_set)? Commented Jul 15, 2013 at 17:03
  • I believe what Pieter is getting at, in case it's not clear, is that your expression might be so "flexible" that it needs to attempt a horrific number of paths (and backtracking) before being fully satisfied that it has not found a match, and you may be encountering a limit (either a backtracking limit or a memory/time limit) before that occurs. Commented Jul 15, 2013 at 17:05
  • As far as I can tell, the problem lies with the first .* being greedy. Try putting a ? after it, as in .*?, so that it matches only up to the first >. Otherwise it will match up to the last > (meaning pretty much the entire XML document). Commented Jul 15, 2013 at 17:07

3 Answers


Like many other PHP functions, preg_match() has a return value.

You can use that return value to decide how the script should proceed.

In your case, you never check whether the return value is FALSE. As your example shows, it is FALSE.

According to the manual, a return value of FALSE signals an error. You can learn more about that error by calling preg_last_error(), which returns the last error code. For your call to preg_match(), it gives:

int(2) - PREG_BACKTRACK_LIMIT_ERROR
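
As a minimal sketch of the check described above: the helper names pregErrorName() and describeMatch() are made up for illustration, and the generated subject is a stand-in for the catalog in the question. Whether this particular subject trips the limit depends on the pcre.backtrack_limit setting and on whether the PCRE JIT is enabled, so no output is claimed here.

```php
<?php
// Hypothetical helper: map the integer from preg_last_error() to its constant name.
function pregErrorName(int $code): string
{
    $names = [
        PREG_NO_ERROR              => 'PREG_NO_ERROR',
        PREG_INTERNAL_ERROR        => 'PREG_INTERNAL_ERROR',
        PREG_BACKTRACK_LIMIT_ERROR => 'PREG_BACKTRACK_LIMIT_ERROR',
        PREG_RECURSION_LIMIT_ERROR => 'PREG_RECURSION_LIMIT_ERROR',
        PREG_BAD_UTF8_ERROR        => 'PREG_BAD_UTF8_ERROR',
        PREG_BAD_UTF8_OFFSET_ERROR => 'PREG_BAD_UTF8_OFFSET_ERROR',
    ];
    return $names[$code] ?? "unknown error ($code)";
}

// Hypothetical helper: run preg_match() and keep "error" (false)
// distinct from "no match" (0) by using strict comparison.
function describeMatch(string $pattern, string $subject): string
{
    $result = preg_match($pattern, $subject, $matches);
    if ($result === false) {
        return 'error: ' . pregErrorName(preg_last_error());
    }
    return $result === 1 ? 'match: ' . $matches[0] : 'no match';
}

// The pattern from the question, run against a generated single-line subject.
$pattern = '~<(\S+).*>.*</\1>~';
$subject = '<catalog> ' . str_repeat('<book><title>x</title></book> ', 300) . '</catalog>';

// Prints either the match or the error name, depending on the PCRE settings.
echo describeMatch($pattern, $subject), "\n";
```

The strict comparison with false matters because preg_match() returns 0 for "no match", and a loose check such as if (!$result) would treat an error and a non-match the same way.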


1 Comment

This is truthfully what I was looking for. Thank you very much.

You can get better control over your quantifiers by using more restrictive character classes:

example with a lazy quantifier:

$pattern = '~<([^>\s]++)[^>]*+>.*?</\1>~';

example with only possessive quantifiers (much better):

$pattern = '~<([^>\s]++)[^>]*+>(?>[^<]++|<(?!/\1>))+</\1>~';

But these two patterns don't deal with nested structures. To handle those, you must use a recursive pattern:

$pattern = '~<([^>/\s]++)[^>]*+>(?>[^<]++|(?R))*</\1>~';



details:

second pattern: (?>[^<]++|<(?!/\1>))+

(?>           # open an atomic group
   [^<]++     # all characters but < one or more times (possessive)
  |           # OR
   <(?!/\1>)  # < not followed by /, the content of the first backreference
              #  (the tag name here), and >
)+            # close the atomic group and repeat one or more times

The goal is to match everything up to </\1>: the group matches either characters that are not <, or a < that is not followed by /tagname>.

More information about possessive quantifiers and atomic groups.


third pattern: the recursive pattern

<                                
  ([^>/\s]++)     # tagname, 
                  # note that you must exclude the / to avoid closing tags
  [^>]*+          # remaining characters in the tag (e.g. attributes)
>


(?>               # open an atomic group
   [^<]++         # all characters but <, one or more times (possessive)
  |               # OR
   (?R)           # repeat the whole pattern
)*                # close the atomic group, repeat zero or more times

</\1>             # close tag with the first back reference
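
To see the recursive pattern in action, here is a small sketch that runs it on a shortened document in the same shape as the catalog from the question. The helper name firstElement() and the sample data are assumptions for illustration; the pattern itself is the third one above, unchanged.

```php
<?php
// Recursive pattern from the answer, unchanged.
$pattern = '~<([^>/\s]++)[^>]*+>(?>[^<]++|(?R))*</\1>~';

// Shortened, made-up sample in the same single-line shape as the question's data.
$xml = '<?xml version="1.0"?> <catalog> <book id="bk101"> <title>XML Developer\'s Guide</title> </book>'
     . ' <book id="bk102"> <title>Midnight Rain</title> </book> </catalog>';

// Hypothetical helper: return the first element that has a matching closing tag,
// or null if there is none. A false return from preg_match() is raised as an error.
function firstElement(string $pattern, string $xml): ?array
{
    $result = preg_match($pattern, $xml, $m);
    if ($result === false) {
        throw new RuntimeException('preg_match failed, error code ' . preg_last_error());
    }
    if ($result === 0) {
        return null;
    }
    return ['tag' => $m[1], 'text' => $m[0]];
}

$first = firstElement($pattern, $xml);
if ($first !== null) {
    // The <?xml ...?> declaration has no closing tag, so the attempt starting
    // there fails and the match begins at the first real element.
    echo $first['tag'], "\n";
    echo strlen($first['text']), " characters matched\n";
}
```

Each nested element is consumed by one (?R) call, so the outer </\1> is only tested once the inner elements have been matched. Note that self-closing tags such as <br/> are not handled by this pattern.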

1 Comment

Could you by any chance explain a little bit of what is going on? This works beautifully, but I have no idea what is happening (at least in the second one).

Well, first of all - the general attitude is that XML should not be parsed with RegEx. Use SimpleXML instead, if possible. And as nickb has said, way too greedy...

2 Comments

While the only use case example I have of this occurrence is using xml as text, that does not mean that it cannot happen with other texts. This isn't really an answer- it's a cop out.
Yeah, sorry - I was focussed more on "getting the job done" than on the details of the regex...
