PHP's DOMDocument will not convert characters to HTML entities if the HTML is properly loaded as UTF-8 and contains the meta charset=utf-8 tag.
The idea is to:
- Detect the encoding of the HTML source and convert it to UTF-8
- Load the HTML into a DOMDocument created with the UTF-8 charset
- Add the meta charset=utf-8 tag to the DOMDocument
- Do whatever processing you need
- Remove the meta charset=utf-8 tag after saving the result
Here's some sample code:
<?php
$htmlContent = file_get_contents('source.html');
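// Detect the source encoding; mb_detect_encoding() returns false on failure,
// so fall back to UTF-8 in that case (the fallback is an assumption)
$sourceEncoding = mb_detect_encoding($htmlContent, mb_detect_order(), true) ?: 'UTF-8';
$convertedContent = mb_convert_encoding($htmlContent, 'UTF-8', $sourceEncoding);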
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadHTML($convertedContent);
// Create the meta tag element
$metaTag = $dom->createElement('meta');
$metaTag->setAttribute('http-equiv', 'Content-Type');
$metaTag->setAttribute('content', 'text/html; charset=utf-8');
// Append the meta charset tag to the head element
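$head = $dom->getElementsByTagName('head')->item(0);
if ($head === null) {
    // loadHTML() does not always create a <head>; add one so the meta tag has a parent
    $head = $dom->createElement('head');
    $dom->documentElement->insertBefore($head, $dom->documentElement->firstChild);
}
$head->appendChild($metaTag);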
// Do any stuff here
// save the content without the meta charset tag
$new_content = str_replace('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">', '', $dom->saveHTML());
// save to a destination file
file_put_contents('dest.html', $new_content);
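If non-ASCII characters still come out garbled, one variation that may help is to declare the charset before parsing rather than after. libxml's HTML parser generally assumes ISO-8859-1 when the input carries no charset declaration, so the sketch below prepends the meta tag to the string passed to loadHTML(). This is a sketch under that assumption, not part of the sample above; the helper name roundTripUtf8Html is hypothetical, and the input is assumed to be UTF-8 already.

```php
<?php
// Hypothetical helper (not from the sample above): declares the charset
// before parsing instead of appending the meta tag afterwards.
// Assumes $html is already UTF-8 encoded.
function roundTripUtf8Html(string $html): string
{
    $meta = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';

    $dom = new DOMDocument();

    // Collect parser warnings internally instead of emitting them
    $previous = libxml_use_internal_errors(true);

    // Prepending the tag lets the parser see the charset while reading the input
    $dom->loadHTML($meta . $html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Do any stuff here

    // Drop the helper tag from the serialized output
    return str_replace($meta, '', $dom->saveHTML());
}

echo roundTripUtf8Html('<html><head><title>Test</title></head><body><p>café</p></body></html>');
```

The str_replace() call relies on saveHTML() serializing the tag exactly as it was written, which is the same assumption the sample above makes. If the source document already contains its own charset meta tag in that exact form, it would be removed too.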