12

I cannot figure out how to stop DOMDocument from mangling these characters.

<?php

$doc = new DOMDocument();
$doc->substituteEntities = false;
$doc->loadHTML('<p>¯\(°_o)/¯</p>');
print_r($doc->saveHTML());

?>

Expected Output: ¯\(°_o)/¯

Actual Output: &macr;\(&deg;_o)/&macr;

http://codepad.org/W83eHSsT
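One common way DOMDocument output ends up mangled, which may or may not be what happened here, is that UTF-8 input gets decoded one byte at a time as ISO-8859-1. The sketch below reproduces that effect without DOMDocument at all; the Latin-1 fallback is an assumption about the parser, and the variable names are only illustrative.

```php
<?php
// UTF-8 bytes for "¯" (U+00AF), written as escapes so the file's own encoding does not matter.
$utf8 = "\xC2\xAF";

// Assumption: the parser treats each byte as one ISO-8859-1 character.
$misread = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

// Serialising the misread text as named entities gives two entities for one character.
$entities = htmlentities($misread, ENT_QUOTES | ENT_HTML401, 'UTF-8');

echo $entities, PHP_EOL; // &Acirc;&macr;
```

If the output contains pairs like this, the input encoding was probably not declared to the parser.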

  • Why would you want that gibberish in an HTML document? Commented Aug 21, 2011 at 0:02
  • Anyway, it's more likely that your editor/file transfer program/the fact that PHP code is not Unicode is "mangling" them, than it is that DOMDocument has any problem. Commented Aug 21, 2011 at 0:03
  • I found out the answer here: stackoverflow.com/questions/2142120/… Just use mb_convert_encoding($string, 'html-entities', 'utf-8'); Commented Aug 21, 2011 at 0:39
  • possible duplicate of PHP DOMDocument loadHTML not encoding UTF-8 correctly Commented Feb 11, 2013 at 10:17
  • Well, this works. Commented Jul 15, 2013 at 15:53
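The entity-conversion idea from the comments can be sketched roughly as follows. Note the swap: this uses mb_encode_numericentity instead of mb_convert_encoding with 'html-entities', because the 'HTML-ENTITIES' pseudo-encoding is deprecated as of PHP 8.2. The conversion map and variable names are assumptions for illustration.

```php
<?php
// UTF-8 bytes for <p>¯\(°_o)/¯</p>, written as escapes so the file's encoding does not matter.
$html = "<p>\xC2\xAF\\(\xC2\xB0_o)/\xC2\xAF</p>";

// Replace every non-ASCII code point with a numeric entity so the input is pure ASCII.
$map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
$ascii = mb_encode_numericentity($html, $map, 'UTF-8');

$doc = new DOMDocument();
$doc->loadHTML($ascii);

// The DOM now holds the intended characters, whatever encoding the parser assumed.
$text = $doc->getElementsByTagName('p')->item(0)->textContent;
echo $text, PHP_EOL;
```

Because the parser only ever sees ASCII, the result does not depend on how it guesses the input encoding. Whether saveHTML() then writes raw characters or entities is a separate question.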

3 Answers

6

I found a hint in the comments of the DOMDocument::loadHTML documentation:

(Comment from <mdmitry at gmail dot com> 21-Dec-2009 05:02: "You can also load HTML as UTF-8 using this simple hack:")

Just add '<?xml encoding="UTF-8">' before the HTML-input:

$doc = new DOMDocument();
//$doc->substituteEntities = false;
$doc->loadHTML('<?xml encoding="UTF-8">' . '<p>¯\(°_o)/¯</p>');
print_r($doc->saveHTML());

2 Comments

It doesn't work. I tried everything on that page already. codepad.org/Sr3d710Q
It does work for me, using UTF-8 for the PHP files; I've tested that. I don't know what Codepad is doing internally, but they are returning entities…
3
<?xml version="1.0" encoding="utf-8">

at the top of the document takes care of the tags, for both saveXML and saveHTML.


0

PHP's DOMDocument will not convert characters to HTML entities if the HTML is properly loaded as UTF-8 and contains the meta charset=utf-8 tag.

The idea is to:

  • Detect the encoding of the HTML source and convert it to UTF-8
  • Load the DOMDocument with the UTF-8 charset
  • Add the meta charset=utf-8 tag to the DOMDocument
  • Do whatever processing is needed
  • Remove the meta charset=utf-8 tag after saving the result

Here's some sample code:

<?php
$htmlContent = file_get_contents('source.html');
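// mb_detect_encoding() returns false when it cannot guess; fall back to UTF-8 in that case
$sourceEncoding = mb_detect_encoding($htmlContent, mb_detect_order(), true) ?: 'UTF-8';
$convertedContent = mb_convert_encoding($htmlContent, 'UTF-8', $sourceEncoding);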

$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadHTML($convertedContent);

// Create the meta tag element
$metaTag = $dom->createElement('meta');
$metaTag->setAttribute('http-equiv', 'Content-Type');
$metaTag->setAttribute('content', 'text/html; charset=utf-8');

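// Append the meta charset tag to the head element, creating <head> if the source has none
$head = $dom->getElementsByTagName('head')->item(0);
if ($head === null) {
    $head = $dom->createElement('head');
    $dom->documentElement->insertBefore($head, $dom->documentElement->firstChild);
}
$head->appendChild($metaTag);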

// Do any stuff here

// save the content without the meta charset tag
$new_content = str_replace('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">', '', $dom->saveHTML());

// save to a destination file
file_put_contents('dest.html', $new_content);
