Regular expression to extract HTML tags

Question

I have XML inside a content placeholder that I need to extract, like:

<asp:Content ID="Content2" ContentPlaceHolderID="header" runat="server">
    <div>
        <categories>
            <category>
                <name>item 1</name>
                <categories>
                    <category>
                        <name>item 1.1.</name>
                    </category>
                    <category>
                        <name>item 1.2.</name>
                    </category>
                </categories>
            </category>
        </categories>
    </div>
</asp:Content>

And so on. I'll build the proper HTML using LINQ to XML over the root categories, but I'm failing to extract all of the XML with a regular expression. Is there a better way to extract the XML?

Don't use regex for this, it doesn't work. Use a real XML parser.
– Greg Hewgill, Commented Dec 4, 2011 at 22:50
I need to extract the whole XML tree given the root element. But it's important to keep in mind that the XML will be surrounded by HTML.
– user989818, Commented Dec 4, 2011 at 22:53

Community · Accepted Answer · 2017-05-23 11:55:30Z

1

See "Reading XML documents using LINQ to XML" and "XML Made Easy with LINQ to XML".

Does it matter if the XML is surrounded by HTML? Just give the root to LINQ and work your way through it. That is simple, robust and easy to maintain. In general, don't even think about parsing the XML with a regular expression.
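As a rough illustration of handing the root element to a parser instead of a regex, the sketch below uses Python's xml.etree.ElementTree in place of LINQ to XML (a swapped-in library, not the one the question uses). It assumes the page contains a single outermost <categories> tree whose tags carry no attributes. The helper names extract_root and category_names are hypothetical.

```python
import xml.etree.ElementTree as ET

PAGE = '''<asp:Content ID="Content2" ContentPlaceHolderID="header" runat="server">
    <div>
        <categories>
            <category>
                <name>item 1</name>
                <categories>
                    <category>
                        <name>item 1.1.</name>
                    </category>
                    <category>
                        <name>item 1.2.</name>
                    </category>
                </categories>
            </category>
        </categories>
    </div>
</asp:Content>'''


def extract_root(page, tag="categories"):
    """Slice out the outermost <tag>...</tag> element and parse it (assumes one tree)."""
    start = page.find("<" + tag + ">")
    end = page.rfind("</" + tag + ">")
    if start == -1 or end == -1:
        raise ValueError("no <%s> element found" % tag)
    # len(tag) + 3 covers the "</", the tag name and the ">" of the closing tag.
    return ET.fromstring(page[start:end + len(tag) + 3])


def category_names(categories, depth=0):
    """Walk nested <categories>/<category> elements, yielding (depth, name) pairs."""
    for category in categories.findall("category"):
        yield depth, category.findtext("name")
        nested = category.find("categories")
        if nested is not None:
            yield from category_names(nested, depth + 1)


names = list(category_names(extract_root(PAGE)))
print(names)  # [(0, 'item 1'), (1, 'item 1.1.'), (1, 'item 1.2.')]
```

The surrounding asp:Content and div markup is never parsed, so the undeclared asp: prefix does not trip the XML parser. Only the sliced <categories> subtree has to be well-formed.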

edited May 23, 2017 at 11:55 by Community Bot

answered Dec 4, 2011 at 22:58 by FailedDev


zx81 · Accepted Answer · 2011-12-05 19:42:49Z

The following regex matches your xml. It also captures everything inside the asp:content tags and places it in Group 1.

(?s)<asp:Content ID="[^"]*"\W+ContentPlaceHolderID="[^"]*"\W+runat="[^"]*">(.*?)</asp:Content>

Note that (?s) is the inline modifier that turns on the "dot matches new line" mode in certain regex flavors, such as .NET, Java, Perl, Python, PCRE for PHP's preg functions.

If you are using a different regex flavor, you will need to remove (?s) and activate "dot matches new line" differently.
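To make the effect of the modifier concrete, here is a minimal sketch in Python, one of the flavors listed above. It runs the same pattern three ways: with the inline (?s), with the equivalent re.DOTALL flag, and with neither. The shortened subject string is made up for illustration.

```python
import re

# Same pattern as above, without the inline modifier.
PATTERN = r'<asp:Content ID="[^"]*"\W+ContentPlaceHolderID="[^"]*"\W+runat="[^"]*">(.*?)</asp:Content>'

# Illustrative subject: one single-line placeholder and one multi-line placeholder.
subject = '''
<asp:Content ID="blah" ContentPlaceHolderID="blah" runat="blah">Capture Me!</asp:Content>
<asp:Content ID="Content2" ContentPlaceHolderID="header" runat="server">
<div>
<name>item 1</name>
</div>
</asp:Content>
'''

# Inline modifier: the dot also matches newlines, so both placeholders are captured.
inline = re.findall("(?s)" + PATTERN, subject)

# Equivalent: pass the flag instead of embedding it in the pattern.
flagged = re.findall(PATTERN, subject, flags=re.DOTALL)

# Neither: the dot stops at newlines, so only the single-line placeholder matches.
without = re.findall(PATTERN, subject)

print(len(inline), len(flagged), len(without))  # 2 2 1
```

Because the pattern has exactly one capturing group, re.findall returns the Group 1 text of each match directly.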

The following code retrieves the group captures. To show a general solution, the subject string contains two of these placeholders.

<?php
$subject='
<asp:Content ID="blah" ContentPlaceHolderID="blah" runat="blah">Capture Me!</asp:Content>
<asp:Content ID="Content2" ContentPlaceHolderID="header" runat="server">
<div>
<categories>
<category>
     <name>item 1</name>
            <categories>
                <category>
                    <name>item 1.1.</name>
                </category>
                <category>
                    <name>item 1.2.</name>
                </category>
            </categories>
        </category>
    </categories>
</div>
</asp:Content>
';

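// With PREG_PATTERN_ORDER, $result[0] holds the full matches and $result[1] the Group 1 captures.
preg_match_all('%(?s)<asp:Content ID="[^"]*"\W+ContentPlaceHolderID="[^"]*"\W+runat="[^"]*">(.*?)</asp:Content>%', $subject, $result, PREG_OFFSET_CAPTURE | PREG_PATTERN_ORDER);
// Loop over the matches; count($result) would count the groups, not the matches.
for ($i = 0; $i < count($result[1]); $i++) {
    // With PREG_OFFSET_CAPTURE each entry is array(text, offset), hence the trailing [0].
    echo "Capture number: ".$i."<br />".htmlentities($result[1][$i][0])."<br /><br />";
    // echo "Match number: ".$i."<br />".htmlentities($result[0][$i][0])."<br /><br />";
}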
?>

Here is the output:

Capture number: 0
Capture Me!

Capture number: 1
<div> <categories> <category> <name>item 1</name> <categories> <category> <name>item 1.1.</name> </category> <category> <name>item 1.2.</name> </category> </categories> </category> </categories> </div>

If you also want to display the whole match (not just the capture), just uncomment the second echo line in the for loop.

I think this is what you were looking for?
