I have a problem: I want to load an HTML snippet that contains namespaced elements with DOMDocument.

<div class="something-first">
    <div class="something-child something-good another something-great">
        <my:text value="huhu">
    </div>
</div>

But I can't figure out how to preserve the namespaces. I tried loading it with loadHTML() but HTML does not have namespaces and so they get stripped.

I tried loading it with loadXML(), but this doesn't work either, because <my:text value="huhu"> is not well-formed XML.

What I need is a loadHTML() method that doesn't strip namespaces, or a loadXML() method that does not validate the markup; in other words, a combination of these two methods.

My code so far:

$html = '<div class="something-first">
    <div class="something-child something-good another something-great">
        <my:text value="huhu">
    </div>
</div>';

libxml_use_internal_errors(true);

$domDoc = new DOMDocument();
$domDoc->formatOutput = false;
$domDoc->resolveExternals = false;
$domDoc->substituteEntities = false;
$domDoc->strictErrorChecking = false;
$domDoc->validateOnParse = false;

$domDoc->loadHTML($html/*, LIBXML_NOERROR | LIBXML_NOWARNING*/);
$xpath = new DOMXPath($domDoc);
$xpath->registerNamespace('my', 'http://www.example.com/');

// This returns zero nodes because loadHTML() strips the namespace prefix.
$nodes = $xpath->query('//my:*');
var_dump($nodes);

Is there a way to achieve what I want? I would be grateful for any advice.

EDIT: I opened an enhancement request for libxml2 to provide an option to preserve namespaces in HTML: https://bugzilla.gnome.org/show_bug.cgi?id=711670

  • Loading something that is neither valid XML nor valid HTML is always going to be tricky when using loadXML or loadHTML... Commented Nov 8, 2013 at 9:52
  • Is it possible to declare the namespace? Something like <my:root_node xmlns:my="http://www.w3.org/TR/html4/">…<my:text>…. DOMDocument should be able to handle namespaces when loaded through loadXML() or load(). Commented Nov 8, 2013 at 10:09
  • I have deleted my answer as it doesn't fit your needs. But maybe, sad but true, it simply can't be done. Definitely an interesting question. +1 Commented Nov 8, 2013 at 10:09
  • @jazZRo No, it won't work because <my:text value="huhu"> is not valid XML :-(. Commented Nov 8, 2013 at 10:10
  • @jazZRo Yeah, that's what I was asking myself too. But when parsing just snippets of HTML, like a <div>, it is common that the namespace declaration isn't available in the snippet. Commented Nov 8, 2013 at 10:10

2 Answers

First, namespaces are allowed in XML (or XHTML) only. HTML does not support namespaces.


Given that it is XHTML and the xmlns declaration is present in the snippet, then you can access elements by namespace using DOMDocument::getElementsByTagNameNS():

$html = <<<EOF
<div xmlns:my="http://www.example.com/" class="something-first">
    <div class="something-child something-good another something-great">
        <my:text value="huhu" />
    </div>
</div>
EOF;

$domDoc = new DOMDocument();
$domDoc->loadXML($html);
var_dump(
  // it is possible to use wildcard `*` here
  $domDoc->getElementsByTagNameNS('http://www.example.com/', '*')
);

However, the namespace declaration is commonly defined on the root element <html> rather than on sub-nodes, so the code above will not work for most snippets.

So part two of the solution would be to check whether the declaration is present and, if not, inject it. (working on this)
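A minimal sketch of that injection idea, assuming the snippet is otherwise well-formed XML (every element closed, including <my:text />). The helper name loadNamespacedFragment() and the wrapper element fragment-root are hypothetical, introduced only for illustration; they are not part of DOMDocument. Instead of detecting a missing declaration, the sketch always wraps the fragment in a root element that declares the prefix:

```php
<?php
// Hypothetical helper (not part of the DOM extension): wraps a fragment in a
// root element that declares the given prefix, then parses it as XML.
// Returns null if the fragment is not well-formed XML.
function loadNamespacedFragment(string $fragment, string $prefix, string $uri): ?DOMDocument
{
    $wrapped = sprintf(
        '<fragment-root xmlns:%s="%s">%s</fragment-root>',
        $prefix,
        htmlspecialchars($uri, ENT_QUOTES | ENT_XML1),
        $fragment
    );

    // Collect parser errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $ok = $doc->loadXML($wrapped);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $ok ? $doc : null;
}

// The fragment itself carries no xmlns:my declaration.
$fragment = '<div class="something-first">'
    . '<div class="something-child"><my:text value="huhu" /></div>'
    . '</div>';

$doc = loadNamespacedFragment($fragment, 'my', 'http://www.example.com/');
$found = $doc->getElementsByTagNameNS('http://www.example.com/', '*');
echo $found->length, "\n";
```

This does nothing for the unclosed <my:text value="huhu"> from the question: loadXML() still rejects it, and the helper returns null in that case.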


As I said, the code above works for XML / XHTML only. It is still open how to do that with HTML. (check the discussion below)

22 Comments

This won't work because the namespace gets stripped when my HTML snippet is parsed with loadHTML().
Yeah, you are right. You can only select the text nodes. (It seems so; let me dig more into this.)
I want to access all elements with the namespace my, so accessing the elements with //text unfortunately isn't an option either :-(. It would be great if you found a way to achieve what I want :-).
I'm searching for a way.
So far so good. I had that earlier today too. The problem is that you have to put in valid XML. So if your snippet is missing a closing </div> or something like that, loadXML() will fail.

Technically it's neither valid XML nor valid HTML (or XHTML): HTML does not allow namespaced elements, while well-formed XML requires that empty elements be self-closing and that the namespace prefix be declared. So you're basically asking, "How can I have DOMDocument treat this invalid HTML as valid XML even though it's not valid XML either?" That is going to prove difficult, and one might ask why libxml should be updated to allow it. If I update your snippet to:

$html = <<<XML
<div xmlns:my="http://www.example.com/" class="something-first">
    <div class="something-child something-good another something-great">
        <my:text value="huhu" />
    </div>
</div>
XML;

(adding the namespace declaration and closing the my:text element), it works just fine with:

$domDoc = new DOMDocument();
$domDoc->loadXML($html);
echo $domDoc->saveXML();

Notice that the namespace is not stripped out here. In your original snippet the namespace is stripped out, as I understand it, because the markup is neither valid XML nor valid HTML. The XPath query can't match by namespace because the namespace was never declared via xmlns and the prefix was therefore dropped.
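For comparison with the XPath query from the question, here is a rough sketch of the same //my:* query against the corrected snippet. It assumes the declaration and the closing slash are both present. The variable names are only illustrative.

```php
<?php
$xml = <<<XML
<div xmlns:my="http://www.example.com/" class="something-first">
    <div class="something-child something-good another something-great">
        <my:text value="huhu" />
    </div>
</div>
XML;

$domDoc = new DOMDocument();
$domDoc->loadXML($xml);

$xpath = new DOMXPath($domDoc);
// The prefix registered here only needs to map to the same URI that the
// document declares; XPath matches on the URI, not on the prefix string.
$xpath->registerNamespace('my', 'http://www.example.com/');

$matches = $xpath->query('//my:*');
foreach ($matches as $node) {
    // Print the qualified name, the namespace URI and the value attribute.
    printf("%s (%s) value=%s\n", $node->nodeName, $node->namespaceURI, $node->getAttribute('value'));
}
```

With the declaration in place the query should find the my:text element; without it, loadXML() reports an unbound prefix rather than a usable namespace.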

So I guess the question is: Why are you petitioning for invalid XML support rather than adding that closing slash? Is it because the data is from an external source or because in some context the empty non-closing tag is valid?

8 Comments

Nice to see another opinion here. Unfortunately it is the same as mine (what else could you say? I think it is just as you and I said). However, the <fb:*> elements are a pain in the ass! Do you really think Facebook writes invalid HTML (just a question)? Maybe we should ask them.
You might not have seen that the behaviour differs between loadHTML and loadXML (like me before). I think this is a reasonable question as it is a real-world problem. (The OP hasn't designed the HTML; it could be anything.)
My guess is that Facebook serves valid XHTML, though I can't say for sure, since I don't ever interact with Facebook. If the xmlns for the fb namespace is provided, then it's valid. It's one thing for HTML to be malformed, but XML is generally parsed more strictly, and with namespaces the xmlns declaration is a requirement, not just a best practice. Chrome won't display the original snippet, so why should a less forgiving lexer?
@hek2mgl - ignore last comment. fat fingers on a touch screen. Something I find really interesting is the on-going pushback for closing empty elements. There are probably 100+ questions related to the goal of not enforcing this rule, not to mention tons of back-and-forth in the HTML spec on whether to enforce this, but to me it always made sense that if you have a tag that could be interpreted as an opening tag but does not have closing tag (like <br>) it should have some polite indicator (like <br/>) informing the parser that there's no end tag coming.
Yeah. I cannot understand this discussion (<br> or <br/>). It's <br/>. That's it, period! :) However, we need to discuss these <fb:*> elements (if you like, of course), because they aren't served just by Facebook; they are included in several (millions?) of other (HTML) sites. I'm tired for today, but I would really like to find the final answer here. (For that reason I would even deal with the devil and create a Facebook account, if necessary.) :)