PHP DOMDocument adds extra tags [duplicate]

Question

I'm trying to parse a document, find all the image tags, and change each src to something different.

$domDocument = new DOMDocument();

$domDocument->loadHTML($text);

$imageNodeList = $domDocument->getElementsByTagName('img');

foreach ($imageNodeList as $Image) {
  $Image->setAttribute('src', 'lalala');
  $domDocument->saveHTML($Image);
}

$text = $domDocument->saveHTML();

The $text initially looks like this:

<p>Hi, this is a test, here is an image<img src="http://example.com/beer.jpg" width="60" height="95" /> Because I like Beer!</p>

and this is the output $text:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>Hi, this is a test, here is an image<img src="lalala" width="60" height="95"> Because I like Beer!</p></body></html>

I'm getting a bunch of extra tags (html, body, and the DOCTYPE declaration at the top) that I don't really need. Is there any way to set up the DOMDocument so that it doesn't add these extra tags?

Wiktor Stribiżew · Accepted Answer · 2015-07-15 09:22:38Z

22

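You just need to pass two flags as the second argument of loadHTML(): LIBXML_HTML_NOIMPLIED (do not add implied html/body elements) and LIBXML_HTML_NODEFDTD (do not add a default doctype). Both are available as of PHP 5.4. I.e.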

$domDocument->loadHTML($text, LIBXML_HTML_NOIMPLIED|LIBXML_HTML_NODEFDTD);

See IDEONE demo:

$text = '<p>Hi, this is a test, here is an image<img src="http://example.com/beer.jpg" width="60" height="95" /> Because I like Beer!</p>';
$domDocument = new DOMDocument;
$domDocument->loadHTML($text, LIBXML_HTML_NOIMPLIED|LIBXML_HTML_NODEFDTD);
$imageNodeList = $domDocument->getElementsByTagName('img');

foreach ($imageNodeList as $Image) {
    // Changing the attribute is enough; the document is serialized once, after the loop
    $Image->setAttribute('src', 'lalala');
}

$text = $domDocument->saveHTML();
echo $text;

Output:

<p>Hi, this is a test, here is an image<img src="lalala" width="60" height="95"> Because I like Beer!</p>
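Note that with LIBXML_HTML_NOIMPLIED the parser works best when the fragment has a single root element; with several top-level elements, libxml may restructure the markup. A minimal sketch of one common workaround, assuming a temporary wrapper <div> around the fragment (the function name rewrite_image_sources and the sample input are hypothetical, not part of the original answer):

```php
<?php
// Hypothetical helper: rewrite every <img src> in an HTML fragment
// without getting DOCTYPE/html/body wrappers in the result.
function rewrite_image_sources(string $html, string $newSrc): string
{
    // Collect parser warnings internally instead of emitting them
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    // Assumption: a temporary <div> gives the fragment a single root element
    $doc->loadHTML(
        '<div>' . $html . '</div>',
        LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
    );

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('img') as $img) {
        $img->setAttribute('src', $newSrc);
    }

    // Serialize only the children of the wrapper, so the <div> itself is dropped
    $out = '';
    foreach ($doc->documentElement->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo rewrite_image_sources('<p>One <img src="a.jpg"></p><p>Two</p>', 'lalala');
```

loadHTML() does not assume UTF-8 input by default, so non-ASCII text may need extra handling that this sketch leaves out.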

answered Jul 15, 2015 at 9:22

Wiktor Stribiżew

3 Comments

Mike Over a year ago

For me that just strips all html out of there. My paragraphs are gone too.

Wiktor Stribiżew Over a year ago

@Mike: That is impossible as the code does not remove anything. Maybe the HTML you have is not fully valid. Try adding libxml_use_internal_errors(true); before initializing the DOMDocument with $domDocument = new DOMDocument;.

Mike Over a year ago

@WiktorStribiżew I was using it to strip the Script tags out of a text field as per here: stackoverflow.com/questions/7130867/…

nickb · Accepted Answer · 2012-03-14 16:23:01Z

4

Unfortunately, DOMDocument won't let you do this directly. Try stripping the extra tags from the output instead:

$text = preg_replace('/^<!DOCTYPE.+?>/', '', str_replace( array('<html>', '</html>', '<body>', '</body>'), array('', '', '', ''), $domDocument->saveHTML()));

edited Mar 14, 2012 at 16:23

nickb

answered Jan 26, 2011 at 1:39

bowens

2 Comments

Enrico Detoma Over a year ago

it should read: $text = preg_replace('/^<!DOCTYPE.+?>/', '', str_replace( array('<html>', '</html>', '<body>', '</body>'), array('', '', '', ''), $domDocument->saveHTML()));

sglessard Over a year ago

preg_replace, really?

mhitza · Accepted Answer · 2011-01-26 00:59:04Z

1

If you are up for a hack, this is how I managed to get around this annoyance: load the string as XML and save it as HTML. :)
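A minimal sketch of that hack, assuming the fragment is well-formed XML with a single root element (loadXML() rejects unclosed tags and HTML-only entities such as &nbsp;). The sample string is the one from the question; the variable names are illustrative:

```php
<?php
$text = '<p>Hi, this is a test, here is an image<img src="http://example.com/beer.jpg" width="60" height="95" /> Because I like Beer!</p>';

$doc = new DOMDocument();

// Parse as XML: no html/body elements are implied, but the input must be well-formed
if (!$doc->loadXML($text)) {
    throw new RuntimeException('Fragment is not well-formed XML');
}

foreach ($doc->getElementsByTagName('img') as $img) {
    $img->setAttribute('src', 'lalala');
}

// Serialize with the HTML serializer, so <img> is written without the XML self-closing slash
$result = trim($doc->saveHTML());
echo $result;
```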

answered Jan 26, 2011 at 0:59

mhitza

Tomer Almog · Accepted Answer · 2015-01-12 23:09:56Z

0

You can use SmartDOMDocument (http://beerpla.net/projects/smartdomdocument-a-smarter-php-domdocument-class/). From its description:

DOMDocument has an extremely badly designed "feature" where if the HTML code you are loading does not contain <html> and <body> tags, it adds them automatically (yup, there are no flags to turn this behavior off).

Thus, when you call $doc->saveHTML(), your newly saved content now has <html>, <body> and DOCTYPE in it. Not very handy when trying to work with code fragments (XML has a similar problem).

SmartDOMDocument contains a new function called saveHTMLExact() which does exactly what you would want – it saves HTML without adding that extra garbage that DOMDocument does.

answered Jan 12, 2015 at 23:09

Tomer Almog

Richard · Accepted Answer · 2024-10-12 17:41:04Z

I found a function posted on GitHub that uses DOMDocument but does NOT add extra tags; it just fixes malformed ones in your code. Best I've seen so far. https://gist.github.com/hubgit/1322324

Here's my version of it:

function repair_html($html){
  // hide DOM parsing errors
  libxml_use_internal_errors(true);
  libxml_clear_errors();

  // load the possibly malformed HTML into a DOMDocument
  $dom = new DOMDocument();
  $dom->recover = true;
  $rnd = mt_rand(9, 9999) . time();  // just in case we have something else with ID "repair"
  $dom->loadHTML('<?xml encoding="UTF-8"><body id="repair' . $rnd . '">' . $html . '</body>'); // input UTF-8

  // copy the document content into a new document
  $doc = new DOMDocument();
  foreach ($dom->getElementById('repair' . $rnd)->childNodes as $child) {
    $doc->appendChild($doc->importNode($child, true));
  }
  
  // output the new document as HTML
  $doc->encoding = 'UTF-8'; // output UTF-8
  $doc->formatOutput = false;
  return trim($doc->saveHTML());
}

lonesomeday · Accepted Answer · 2011-01-26 00:51:41Z

-2

If you're going to save as HTML, you have to expect a valid HTML document to be created!

There is another option: DOMDocument::saveXML has an optional parameter allowing you to access the XML content of a particular element:

$el = $domDocument->getElementsByTagName('p')->item(0);
$text = $domDocument->saveXML($el);

This presumes that your content only has one p element.

answered Jan 26, 2011 at 0:51

lonesomeday

1 Comment

Dr.Molle Over a year ago

Depending on the elements used in the document, it's not always a good idea to use saveXML() to retrieve HTML source. The generated XML uses the self-closing shorthand for all elements without content, which will break the HTML document (e.g. <script src="some.js"/>). You'll need to parse the result and correct it, or transform it using XSLT, to get a valid HTML document.
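To illustrate the concern about saveXML(), here is a small sketch comparing it with saveHTML() called on a single node (saveHTML() accepts a node argument as of PHP 5.3.6). The sample markup is made up for the comparison:

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<div><script src="some.js"></script><p>Hi</p></div>');

$div = $doc->getElementsByTagName('div')->item(0);

// XML serialization collapses the empty <script> element into <script src="some.js"/>
$asXml = $doc->saveXML($div);

// HTML serialization of the same node keeps the explicit </script> closing tag
$asHtml = $doc->saveHTML($div);

echo $asXml, "\n", $asHtml, "\n";
```

Serializing the node with saveHTML() also leaves out the DOCTYPE, html and body wrappers, since only the selected element is written.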
