Consider this example, test.php:

<?php
$mystr = "<p>Hello, με काचं  ça øy jeść</p>";
var_dump($mystr);
$domdoc = new DOMDocument('1.0', 'utf-8'); //DOMDocument();
$domdoc->loadHTML($mystr); // already here corrupt UTF-8?
var_dump($domdoc);
?>

If I run this with PHP 5.5.9 (CLI), I get the following in the terminal:

$ php test.php 
string(50) "<p>Hello, με काचं  ça øy jeść</p>"
object(DOMDocument)#1 (34) {
  ["doctype"]=>
  string(22) "(object value omitted)"
...
  ["actualEncoding"]=>
  NULL
  ["encoding"]=>
  NULL
  ["xmlEncoding"]=>
  NULL
...
  ["textContent"]=>
  string(70) "Hello, με à¤à¤¾à¤à¤  ça øy jeÅÄ"
}

Clearly, the original string is valid UTF-8, but the textContent of the DOMDocument comes out garbled, as if the bytes had been decoded with the wrong encoding.

So, how can I get the content as correct UTF-8 in the DOMDocument?

  • I'm not sure if this string is really utf8 if you put the text in there like that Commented Aug 25, 2016 at 14:40
  • Thanks @aleksv - any suggestion what should I do to get the string to be utf8? Commented Aug 25, 2016 at 14:42
  • maybe this can help stackoverflow.com/questions/2142120/… Commented Aug 25, 2016 at 14:45
  • Thanks, @aleksv - following that link I eventually found the hack php.net/manual/en/domdocument.loadhtml.php#95251 which solves the problem... Commented Aug 25, 2016 at 14:57

2 Answers

The DOM extension is built on libxml2, whose HTML parser was written for HTML 4, where the default encoding is ISO-8859-1. Unless it encounters an appropriate meta tag or XML declaration stating otherwise, loadHTML() will assume the content is ISO-8859-1.
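
To see what that wrong guess does to the bytes, here is a minimal sketch (not part of the original answer) that decodes a UTF-8 string as ISO-8859-1 using mb_convert_encoding(). The sample string is made up, and this only approximates what the parser does internally:

```php
<?php
// Hypothetical sample; any non-ASCII UTF-8 text behaves the same way.
$utf8 = 'ça jeść';

// Treat the UTF-8 bytes as ISO-8859-1 and re-encode the result as UTF-8.
// Every byte above 0x7F is taken for a character of its own.
$garbled = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

var_dump(strlen($utf8));      // byte length of the original
var_dump(strlen($garbled));   // larger: each non-ASCII byte now takes two bytes
var_dump($garbled === $utf8); // the text no longer matches
```

This is the same kind of growth seen in the question, where the 50-byte input ends up as a longer, unreadable textContent.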

Specifying the encoding when creating the DOMDocument, as you have, does not influence what the parser does: loading HTML (or XML) replaces both the XML version and the encoding that you gave the constructor.
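
One way to check this yourself, as a rough sketch: read the encoding property before and after loading. The sample markup and variable names are assumptions; on the asker's PHP 5.5.9 the dump in the question shows encoding as NULL after loadHTML(), and other PHP or libxml2 versions may report something different.

```php
<?php
$doc = new DOMDocument('1.0', 'UTF-8');
$before = $doc->encoding;          // value stored by the constructor

libxml_use_internal_errors(true);  // keep parser warnings out of the output
$doc->loadHTML('<p>plain ASCII</p>'); // hypothetical input without any charset hint
$after = $doc->encoding;           // value after the parser rebuilt the document

var_dump($before, $after);
```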


Workarounds:

Either use mb_convert_encoding() first to translate everything above the ASCII range into its HTML entity equivalent:

$domdoc->loadHTML(mb_convert_encoding($mystr, 'HTML-ENTITIES', 'UTF-8'));

Or hack in a meta tag or XML declaration specifying UTF-8 (either one will do):

$domdoc->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />' . $mystr);
// or
$domdoc->loadHTML('<?xml encoding="UTF-8">' . $mystr);
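
Note that PHP 8.2 deprecated the 'HTML-ENTITIES' pseudo-encoding for mb_convert_encoding(). A possible substitute for the entity workaround, offered as a sketch rather than a drop-in replacement, is mb_encode_numericentity() with a map that covers everything above ASCII. The sample string and the $map values are my own assumptions:

```php
<?php
// Hypothetical sample input.
$html = '<p>Hello, με ça jeść</p>';

// Convert every code point from U+0080 upward to a numeric entity.
// Map format: [first code point, last code point, offset, mask].
$map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
$ascii = mb_encode_numericentity($html, $map, 'UTF-8');

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings out of the output
$doc->loadHTML($ascii);             // input is pure ASCII now, so no guessing is needed

$text = $doc->getElementsByTagName('p')->item(0)->textContent;
var_dump($text === 'Hello, με ça jeść');
```

Unlike 'HTML-ENTITIES', this emits numeric entities only (no named ones such as &eacute;), which the parser decodes just the same.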

1 Comment

Thanks @PaulCrovella - I managed to get it working with the prepending xml declaration hack; posted my solution below... Cheers!

Just wanted to post the OP code with the fixes that work for me:

<?php
$mystr = "<p>Hello, με काचं  ça øy jeść</p>";
var_dump($mystr);
$domdoc = new DOMDocument('1.0', 'UTF-8'); //DOMDocument();
$domdoc->substituteEntities = true; // no effect if hack is done
//~ $domdoc->actualEncoding = 'UTF-8'; // Cannot write property
$domdoc->encoding = 'UTF-8'; // no effect
//~ $domdoc->xmlEncoding = 'UTF-8'; // Cannot write property
//~ $domdoc->loadHTML($mystr); // already here corrupt UTF-8?
//~ $domdoc->loadHTML(utf8_decode($mystr)); // this gets to <p>Hello, ?? ?????  ça øy je??</p>, so not all
//~ $domdoc->loadHTML( mb_convert_encoding($mystr, 'utf-8', mb_detect_encoding($mystr)) ); // no dice
$domdoc->loadHTML('<?xml encoding="UTF-8">' . $mystr); // hack, http://php.net/manual/en/domdocument.loadhtml.php#95251
// dirty fix; iterate over a copy, because childNodes is a live list
foreach (iterator_to_array($domdoc->childNodes) as $item)
    if ($item->nodeType == XML_PI_NODE)
        $domdoc->removeChild($item); // remove hack
$domdoc->encoding = 'UTF-8'; // insert proper (sets all three)
var_dump($domdoc);
print $domdoc->saveXML(); // without ->encoding = 'UTF-8': Hello, &#x3BC;&#x3B5; &#xFEFF;&#x915;&#x93E;&#x91A;&#x902; else OK
//~ print mb_convert_encoding($domdoc->saveXML(), 'UTF-8', 'HTML-ENTITIES'); // if without ->encoding = 'UTF-8', this is then OK: <p>Hello, με काचं  ça øy jeść</p>
?>

This outputs:

$ php test.php 
string(50) "<p>Hello, με काचं  ça øy jeść</p>"
object(DOMDocument)#1 (34) {
  ["doctype"]=>
  string(22) "(object value omitted)"
...
  ["actualEncoding"]=>
  string(5) "UTF-8"
  ["encoding"]=>
  string(5) "UTF-8"
  ["xmlEncoding"]=>
  string(5) "UTF-8"
...
  ["textContent"]=>
  string(43) "Hello, με काचं  ça øy jeść"
}
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>Hello, με काचं  ça øy jeść</p></body></html>

... which is all good now :)
