9

I'm trying to parse a document, find all the image tags, and replace each src attribute with a different value.

$domDocument = new DOMDocument();

$domDocument->loadHTML($text);

$imageNodeList = $domDocument->getElementsByTagName('img');

foreach ($imageNodeList as $Image) {
  $Image->setAttribute('src', 'lalala');
  $domDocument->saveHTML($Image);
}

$text = $domDocument->saveHTML();

The $text initially looks like this:

<p>Hi, this is a test, here is an image<img src="http://example.com/beer.jpg" width="60" height="95" /> Because I like Beer!</p>

and this is the output $text:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>Hi, this is a test, here is an image<img src="lalala" width="60" height="95"> Because I like Beer!</p></body></html>

I'm getting a bunch of extra markup (the html and body tags, and the DOCTYPE declaration at the top) that I don't really need. Is there any way to set up the DOMDocument so it doesn't add this?


6 Answers

22

You just need to pass two flags as the second argument of loadHTML(): LIBXML_HTML_NOIMPLIED|LIBXML_HTML_NODEFDTD (this needs PHP 5.4 or later and a reasonably recent libxml2). I.e.

$domDocument->loadHTML($text, LIBXML_HTML_NOIMPLIED|LIBXML_HTML_NODEFDTD);

See IDEONE demo:

$text = '<p>Hi, this is a test, here is an image<img src="http://example.com/beer.jpg" width="60" height="95" /> Because I like Beer!</p>';
$domDocument = new DOMDocument;
$domDocument->loadHTML($text, LIBXML_HTML_NOIMPLIED|LIBXML_HTML_NODEFDTD);
$imageNodeList = $domDocument->getElementsByTagName('img');

foreach ($imageNodeList as $Image) {
    // Changing the attribute is enough; the final saveHTML() call below serializes it
    $Image->setAttribute('src', 'lalala');
}

$text = $domDocument->saveHTML();
echo $text;

Output:

<p>Hi, this is a test, here is an image<img src="lalala" width="60" height="95"> Because I like Beer!</p>

3 Comments

For me that just strips all html out of there. My paragraphs are gone too.
@Mike: That is impossible as the code does not remove anything. Maybe the HTML you have is not fully valid. Try adding libxml_use_internal_errors(true); before initializing the DOMDocument with $domDocument = new DOMDocument;.
@WiktorStribiżew I was using it to strip the Script tags out of a text field as per here: stackoverflow.com/questions/7130867/…
4

Older versions of DOMDocument unfortunately won't let you turn this off. Try stripping the wrapper from the output instead:

$text = preg_replace('/^<!DOCTYPE.+?>/', '', str_replace( array('<html>', '</html>', '<body>', '</body>'), array('', '', '', ''), $domDocument->saveHTML()));

2 Comments

it should read: $text = preg_replace('/^<!DOCTYPE.+?>/', '', str_replace( array('<html>', '</html>', '<body>', '</body>'), array('', '', '', ''), $domDocument->saveHTML()));
preg_replace, really?
1

If you are up for a hack, this is how I got around this annoyance: load the string as XML and save it as HTML. :)
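
A minimal sketch of what that hack could look like, using the markup from the question. It assumes the fragment is well-formed XML with a single root element; anything else (unclosed tags, HTML-only entities such as &nbsp;) will make loadXML() fail. Passing documentElement to saveHTML() is my own choice here, to serialize just the fragment.

```php
<?php
// Fragment from the question; note the self-closed <img />, which XML requires
$text = '<p>Hi, this is a test, here is an image<img src="http://example.com/beer.jpg" width="60" height="95" /> Because I like Beer!</p>';

$domDocument = new DOMDocument();
// Parse as XML: no html/body wrapper and no DOCTYPE get added
$domDocument->loadXML($text);

foreach ($domDocument->getElementsByTagName('img') as $image) {
    $image->setAttribute('src', 'lalala');
}

// Serialize the root element with the HTML serializer
$result = $domDocument->saveHTML($domDocument->documentElement);
echo $result;
```

Since the XML parser is strict, this only suits input you control or have already cleaned up.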


0

You can use SmartDOMDocument (http://beerpla.net/projects/smartdomdocument-a-smarter-php-domdocument-class/). From its description:

DOMDocument has an extremely badly designed "feature" where if the HTML code you are loading does not contain <html> and <body> tags, it adds them automatically (yup, there are no flags to turn this behavior off).

Thus, when you call $doc->saveHTML(), your newly saved content now has <html>, <body> and DOCTYPE in it. Not very handy when trying to work with code fragments (XML has a similar problem).

SmartDOMDocument contains a new function called saveHTMLExact() which does exactly what you would want – it saves HTML without adding that extra garbage that DOMDocument does.
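
If you don't want to pull in the library, the same idea can be approximated with plain DOMDocument by serializing only the children of the implied body element. This is a rough sketch, not SmartDOMDocument's actual code; saveBodyFragment() is a hypothetical helper name, and it relies on saveHTML() accepting a node argument (PHP 5.3.6 or later) plus PHP 7 type declarations.

```php
<?php
// Hypothetical helper (not part of SmartDOMDocument): serialize only the
// children of <body>, so the implied wrapper and DOCTYPE never show up.
function saveBodyFragment(DOMDocument $doc): string
{
    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    $html = '';
    foreach ($body->childNodes as $child) {
        // saveHTML($node) dumps just that node and its subtree
        $html .= $doc->saveHTML($child);
    }
    return $html;
}

$doc = new DOMDocument();
$doc->loadHTML('<p>Hi<img src="http://example.com/beer.jpg" width="60" height="95" /> Beer!</p>');

foreach ($doc->getElementsByTagName('img') as $image) {
    $image->setAttribute('src', 'lalala');
}

$fragment = saveBodyFragment($doc);
echo $fragment;
```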


0

I found a function someone kindly posted on GitHub which, although it uses DOMDocument, does NOT add extra tags; it just fixes malformed ones in your code. Best I've seen so far. https://gist.github.com/hubgit/1322324

Here's my version of it:

function repair_html($html){
  // hide DOM parsing errors
  libxml_use_internal_errors(true);
  libxml_clear_errors();

  // load the possibly malformed HTML into a DOMDocument
  $dom = new DOMDocument();
  $dom->recover = true;
  $rnd = mt_rand(9, 9999) . time();  // just in case we have something else with ID "repair"
  $dom->loadHTML('<?xml encoding="UTF-8"><body id="repair' . $rnd . '">' . $html . '</body>'); // input UTF-8

  // copy the document content into a new document
  $doc = new DOMDocument();
  foreach ($dom->getElementById('repair' . $rnd)->childNodes as $child) {
    $doc->appendChild($doc->importNode($child, true));
  }
  
  // output the new document as HTML
  $doc->encoding = 'UTF-8'; // output UTF-8
  $doc->formatOutput = false;
  return trim($doc->saveHTML());
}


-2

If you're going to save as HTML, you have to expect a valid HTML document to be created!

There is another option: DOMDocument::saveXML has an optional parameter that lets you serialize just one particular element:

$el = $domDocument->getElementsByTagName('p')->item(0);
$text = $domDocument->saveXML($el);

This presumes that your content only has one p element.
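
Note that saveHTML() takes the same kind of node argument (since PHP 5.3.6), which may be the safer choice when the element contains empty tags. A small sketch comparing the two serializers; the markup here is made up for illustration.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<div><script src="some.js"></script><p>text</p></div>');

$el = $doc->getElementsByTagName('div')->item(0);

// XML serialization collapses the empty element to <script src="some.js"/>
$asXml = $doc->saveXML($el);

// HTML serialization keeps the explicit closing tag
$asHtml = $doc->saveHTML($el);

echo $asXml, "\n", $asHtml, "\n";
```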

1 Comment

Depending on which elements the document contains, it's not always a good idea to use saveXML() to retrieve HTML source. The generated XML uses the self-closing shorthand for every element without content, which will break the HTML document (e.g. <script src="some.js"/>). You'll need to parse the result and correct it, or transform it using XSLT, to get a valid HTML document.
