Load HTML containing namespaces with DOMDocument

Question

I have a problem: I want to load an HTML snippet that contains namespaced elements using DOMDocument.

<div class="something-first">
    <div class="something-child something-good another something-great">
        <my:text value="huhu">
    </div>
</div>

But I can't figure out how to preserve the namespaces. I tried loading it with loadHTML(), but HTML does not have namespaces, so they get stripped.

I also tried loading it with loadXML(), but that doesn't work either, because <my:text value="huhu"> is not well-formed XML.

What I need is a loadHTML() method that doesn't strip namespaces, or a loadXML() method that tolerates markup that is not well-formed. In other words, a combination of these two methods.

My code so far:

$html = '<div class="something-first">
    <div class="something-child something-good another something-great">
        <my:text value="huhu">
    </div>
</div>';

libxml_use_internal_errors(true);

$domDoc = new DOMDocument();
$domDoc->formatOutput = false;
$domDoc->resolveExternals = false;
$domDoc->substituteEntities = false;
$domDoc->strictErrorChecking = false;
$domDoc->validateOnParse = false;

$domDoc->loadHTML($html/*, LIBXML_NOERROR | LIBXML_NOWARNING*/);
$xpath = new DOMXPath($domDoc);
$xpath->registerNamespace('my', 'http://www.example.com/');

// -----> This results in zero nodes because the namespace gets stripped by loadHTML()
$nodes = $xpath->query('//my:*');
var_dump($nodes);

Is there a way to achieve what I want? I would be grateful for any advice.

EDIT: I opened an enhancement request for libxml2 to provide an option to preserve namespaces in HTML: https://bugzilla.gnome.org/show_bug.cgi?id=711670

Loading something that is neither valid XML nor valid HTML is always going to be tricky when using loadXML or loadHTML...
– lonesomeday, Commented Nov 8, 2013 at 9:52
Is it possible to declare the namespace? Something like <my:root_node xmlns:my="http://www.w3.org/TR/html4/">…<my:text>…. DOMDocument should be able to handle namespaces when loaded through loadXML() or load().
– jazZRo, Commented Nov 8, 2013 at 10:09
I have deleted my answer as it doesn't fit your needs. Maybe, sad but true, it simply doesn't work. Definitely an interesting question. +1
– hek2mgl, Commented Nov 8, 2013 at 10:09
@jazZRo No, it won't work because <my:text value="huhu"> is not valid XML :-(.
– TiMESPLiNTER, Commented Nov 8, 2013 at 10:10
@jazZRo Yeah, that's what I was asking myself too. But when parsing just snippets of HTML like a <div>, it is common that the namespace declaration isn't available in that snippet.
– hek2mgl, Commented Nov 8, 2013 at 10:10

hek2mgl · Accepted Answer · 2013-11-08 13:54:11Z

2

First, namespaces are allowed in XML (or XHTML) only. HTML does not support namespaces.

If the markup is XHTML and the xmlns declaration is present in the snippet, you can access elements by namespace using DOMDocument::getElementsByTagNameNS():

$html = <<<EOF
<div xmlns:my="http://www.example.com/" class="something-first">
    <div class="something-child something-good another something-great">
        <my:text value="huhu" />
    </div>
</div>
EOF;

$domDoc = new DOMDocument();
$domDoc->loadXML($html);
var_dump(
  // it is possible to use wildcard `*` here
  $domDoc->getElementsByTagNameNS('http://www.example.com/', '*')
);

However, since the namespace declaration is usually defined on the root <html> element rather than on child nodes, the code above will not work for most snippets.

So part two of the solution would be to check whether the declaration is present and, if not, inject it (working on this).
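A minimal sketch of that second step, assuming the fragment is already well-formed XML and is only missing the xmlns declaration. The helper name loadFragmentWithNamespace and the <wrapper> element are hypothetical and chosen for illustration; they are not part of DOMDocument. Instead of checking for an existing declaration, the sketch always wraps the fragment in a root element that declares the prefix:

```php
<?php
// Hypothetical helper (not part of DOMDocument): wraps a fragment in a root
// element that declares the prefix, then parses the result as XML.
function loadFragmentWithNamespace(string $fragment, string $prefix, string $uri): DOMDocument
{
    $wrapped = sprintf(
        '<wrapper xmlns:%s="%s">%s</wrapper>',
        $prefix,
        htmlspecialchars($uri, ENT_QUOTES | ENT_XML1),
        $fragment
    );

    // Collect parse errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $loaded = $doc->loadXML($wrapped);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if ($loaded === false) {
        // Assumption: the fragment itself must already be well-formed XML.
        throw new RuntimeException('Fragment is not well-formed XML.');
    }

    return $doc;
}

$fragment = '<div class="something-first">
    <div class="something-child something-good another something-great">
        <my:text value="huhu" />
    </div>
</div>';

$doc = loadFragmentWithNamespace($fragment, 'my', 'http://www.example.com/');
$nodes = $doc->getElementsByTagNameNS('http://www.example.com/', '*');
```

This only covers the missing declaration. A fragment with an unclosed <my:text value="huhu"> is still rejected by loadXML(), so the helper throws in that case.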

As I said, the code above works for XML / XHTML only. How to do this with HTML is still open (see the discussion below).

edited Nov 8, 2013 at 13:54

answered Nov 8, 2013 at 9:53

hek2mgl


22 Comments

TiMESPLiNTER Over a year ago

This won't work because the namespace gets stripped during the parsing of my HTML snippet with loadHTML().

hek2mgl Over a year ago

Yeah, you are right. It seems you can only select the text nodes (let me dig more into this).

TiMESPLiNTER Over a year ago

I want to access all elements with the namespace my. So accessing the elements with //text unfortunately isn't an option either :-(. It would be great if you could find a way to achieve what I want :-).

hek2mgl Over a year ago

I'm searching for a way

TiMESPLiNTER Over a year ago

So far so good. I had that earlier today too. The problem is that you have to feed it valid XML. So if your snippet is missing a closing </div> or something like that, loadXML() will fail.


Anthony · Accepted Answer · 2013-11-08 17:15:55Z

2

Technically it's neither valid XML nor HTML (or XHTML), because HTML does not allow namespaced elements, while well-formed XML requires that empty elements be closed (for example, self-closed as <my:text />) and that the namespace prefix be declared. So you're basically asking, "How can I have DOMDocument treat this invalid HTML as valid XML even though it's not valid XML either?", which is going to prove difficult, and one might ask why libxml should be updated to allow for this. If I update your snippet to:

$html = <<<XML
<div xmlns:my="http://www.example.com/" class="something-first">
    <div class="something-child something-good another something-great">
        <my:text value="huhu" />
    </div>
</div>
XML;

adding the namespace declaration and closing the my:text element, it works just fine with:

$domDoc = new DOMDocument();
$domDoc->loadXML($html);
echo $domDoc->saveXML();

Notice that the namespace is not stripped out here. In your original snippet the namespace is stripped out, as I understand it, because the markup is not valid XML or HTML: XPath can't query by the namespace since the prefix was never declared via xmlns and was therefore dropped.
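As a sketch of the XPath side, assuming the corrected snippet with the xmlns declaration and the self-closed my:text element: once the document is loaded with loadXML(), the //my:* query from the question matches, provided the prefix registered on the DOMXPath object maps to the same URI as in the document.

```php
<?php
$xml = <<<XML
<div xmlns:my="http://www.example.com/" class="something-first">
    <div class="something-child something-good another something-great">
        <my:text value="huhu" />
    </div>
</div>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);

$xpath = new DOMXPath($doc);
// The prefix used in the query is resolved through this registration;
// what has to match the document is the namespace URI.
$xpath->registerNamespace('my', 'http://www.example.com/');

$nodes = $xpath->query('//my:*');
foreach ($nodes as $node) {
    echo $node->nodeName, ' => ', $node->getAttribute('value'), PHP_EOL;
}
```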

So I guess the question is: why are you petitioning for invalid XML support rather than adding that closing slash? Is it because the data comes from an external source, or because the unclosed empty tag is valid in some context?

answered Nov 8, 2013 at 17:15

Anthony


8 Comments

hek2mgl Over a year ago

Nice to see another opinion here. Unfortunately it is the same as mine (what else could you say; I think it is just as you and I said). However, the <fb:*> elements are a pain in the ass! Do you really think Facebook writes invalid HTML (just a question)? Maybe we should ask them...

hek2mgl Over a year ago

You might not have noticed that the behaviour differs between loadHTML and loadXML (like me before). I think this is a reasonable question, as it is a real-world problem (the OP hasn't designed the HTML; it could be anything).

Anthony Over a year ago

My guess is that Facebook serves valid XHTML, though I can't say for sure, since I don't ever interact with Facebook. If the xmlns for the fb namespace is provided, then it's valid. It's one thing for HTML to be malformed, but XML is generally parsed more strictly, and with namespaces the xmlns declaration is required, not just a best practice. Chrome won't display the original snippet, so why should a less forgiving lexer?

Anthony Over a year ago

@hek2mgl - ignore my last comment; fat fingers on a touch screen. Something I find really interesting is the ongoing pushback against closing empty elements. There are probably 100+ questions related to the goal of not enforcing this rule, not to mention tons of back-and-forth in the HTML spec on whether to enforce it, but to me it always made sense that if you have a tag that could be interpreted as an opening tag but does not have a closing tag (like <br>), it should have some polite indicator (like <br/>) informing the parser that there's no end tag coming.

hek2mgl Over a year ago

Yeah. I cannot understand this discussion (<br> or <br/>). It's <br/>. That's it, period! :) However, we need to discuss these <fb:*> elements (if you like, of course), because they aren't just served by Facebook; they are included in several (millions?) of other (HTML) sites. I'm tired for today, but I would really like to find the final answer here (for that reason I would even deal with the devil and create an FB account, if necessary) :)
