1

I have some XML inside a content placeholder that I need to extract, like:

<asp:Content ID="Content2" ContentPlaceHolderID="header" runat="server">
    <div>
        <categories>
            <category>
                <name>item 1</name>
                <categories>
                    <category>
                        <name>item 1.1.</name>
                    </category>
                    <category>
                        <name>item 1.2.</name>
                    </category>
                </categories>
            </category>
        </categories>
    </div>
</asp:Content>

And so on. I'll build the proper HTML using LINQ to XML over the root categories, but I'm failing to extract all the XML with a regular expression. Is there a better way to extract the XML?

  • Don't use regex for this; it doesn't work. Use a real XML parser. Commented Dec 4, 2011 at 22:50
  • I need to extract the whole XML tree, given the root element. But it's important to keep in mind that the XML will be surrounded by HTML. Commented Dec 4, 2011 at 22:53

2 Answers

1

See "Reading XML documents using LINQ to XML" and "XML Made Easy with LINQ to XML".

Does it matter if the XML is surrounded by other markup? Just give the root element to LINQ and work your way through it. That is simple, robust, and easy to maintain. In general, don't even think about doing what you are about to do.
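
As a rough illustration of the "give the root to the parser" idea, here is a minimal sketch in Python rather than C#. It rests on several assumptions that are not in the question: the page contains a single top-level <categories> block, the tags carry no attributes, and the whole page cannot be handed to the parser directly because the asp: prefix is undeclared. The helper names extract_root and to_tree are hypothetical.

```python
import xml.etree.ElementTree as ET

# Sample page taken from the question (inner categories condensed).
page = '''<asp:Content ID="Content2" ContentPlaceHolderID="header" runat="server">
    <div>
        <categories>
            <category>
                <name>item 1</name>
                <categories>
                    <category><name>item 1.1.</name></category>
                    <category><name>item 1.2.</name></category>
                </categories>
            </category>
        </categories>
    </div>
</asp:Content>'''


def extract_root(markup, tag="categories"):
    """Return the text from the first <tag> to the last </tag>, inclusive.

    Assumption: there is only one top-level <tag> block in the markup.
    """
    open_tag, close_tag = "<%s>" % tag, "</%s>" % tag
    start = markup.find(open_tag)
    end = markup.rfind(close_tag)
    if start == -1 or end < start:
        raise ValueError("no <%s> element found" % tag)
    return markup[start:end + len(close_tag)]


def to_tree(categories):
    """Convert a <categories> element into nested (name, children) tuples."""
    tree = []
    for category in categories.findall("category"):
        name = category.findtext("name", default="").strip()
        nested = category.find("categories")
        tree.append((name, to_tree(nested) if nested is not None else []))
    return tree


# Only the well-formed fragment is parsed, so the surrounding markup is ignored.
root = ET.fromstring(extract_root(page))
tree = to_tree(root)
print(tree)
```

The plain string search only locates the outer boundaries; the nesting itself is handled by the parser, which is the part a regular expression cannot do reliably.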


0

The following regex matches your XML. It also captures everything inside the asp:Content tags and places it in Group 1.

(?s)<asp:Content ID="[^"]*"\W+ContentPlaceHolderID="[^"]*"\W+runat="[^"]*">(.*?)</asp:Content>

Note that (?s) is the inline modifier that turns on the "dot matches new line" mode in certain regex flavors, such as .NET, Java, Perl, Python, and PCRE (used by PHP's preg functions).

If you are using a different regex flavor, you will need to remove (?s) and activate "dot matches new line" differently.
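
To make the effect of the modifier concrete, here is a small sketch in Python, one of the flavors listed above. It assumes a shortened subject string made up for illustration, and compares the inline (?s) form, the equivalent re.DOTALL flag, and the same pattern with neither.

```python
import re

# The pattern from this answer, without the leading (?s).
PATTERN = (r'<asp:Content ID="[^"]*"\W+ContentPlaceHolderID="[^"]*"'
           r'\W+runat="[^"]*">(.*?)</asp:Content>')

# Hypothetical subject: one single-line placeholder, one multi-line placeholder.
subject = (
    '<asp:Content ID="a" ContentPlaceHolderID="b" runat="server">first</asp:Content>\n'
    '<asp:Content ID="Content2" ContentPlaceHolderID="header" runat="server">\n'
    '<div>\n<categories></categories>\n</div>\n'
    '</asp:Content>'
)

# Inline modifier at the start of the pattern.
inline = re.findall('(?s)' + PATTERN, subject)

# Same behavior requested through a flag instead of the inline modifier.
flagged = re.findall(PATTERN, subject, flags=re.DOTALL)

# Without either, "." stops at newlines, so the multi-line block is missed.
without = re.findall(PATTERN, subject)

print(len(inline), len(flagged), len(without))
```

With a single capturing group, re.findall returns the Group 1 captures directly, which mirrors what the PHP code below reads from $result[1].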

The following code retrieves the group captures. To show a general solution, the subject string contains two of these placeholders.

<?php
$subject='
<asp:Content ID="blah" ContentPlaceHolderID="blah" runat="blah">Capture Me!</asp:Content>
<asp:Content ID="Content2" ContentPlaceHolderID="header" runat="server">
<div>
<categories>
<category>
     <name>item 1</name>
            <categories>
                <category>
                    <name>item 1.1.</name>
                </category>
                <category>
                    <name>item 1.2.</name>
                </category>
            </categories>
        </category>
    </categories>
</div>
</asp:Content>
';

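// PREG_PATTERN_ORDER: $result[0] holds the full matches, $result[1] the Group 1 captures.
// PREG_OFFSET_CAPTURE: each entry is an array of [matched string, byte offset].
preg_match_all('%(?s)<asp:Content ID="[^"]*"\W+ContentPlaceHolderID="[^"]*"\W+runat="[^"]*">(.*?)</asp:Content>%', $subject, $result, PREG_OFFSET_CAPTURE | PREG_PATTERN_ORDER);
// Loop over the matches, not over $result itself, which has one entry per group.
for ($i = 0; $i < count($result[1]); $i++) {
    echo "Capture number: ".$i."<br />".htmlentities($result[1][$i][0])."<br /><br />";
    // echo "Match number: ".$i."<br />".htmlentities($result[0][$i][0])."<br /><br />";
}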
?>

Here is the output:

Capture number: 0
Capture Me!

Capture number: 1
<div> <categories> <category> <name>item 1</name> <categories> <category> <name>item 1.1.</name> </category> <category> <name>item 1.2.</name> </category> </categories> </category> </categories> </div>

If you also want to display the whole match (not just the capture), just uncomment the second echo line in the for loop.

I think this is what you were looking for?
