I am downloading HTML files (raw HTML without any !DOCTYPE...) from a government website and then extracting paragraphs to put them into a MySQL database.
I am using DOMDocument, so I do:
$doc = new DOMDocument();
$doc->loadHTMLFile( "../notifs/notif$notif_no.htm" );
The problem comes because certain characters get transformed into something strange: e.g. (one type of) apostrophe becomes ¢€™.
If I then try and save this para to a text field in a table either it is refused by MySQL or it is recorded as these strange characters... depending on the encoding of the text field.
Also, if I call $doc->saveHTMLFile( "test.htm" ); the output file actually contains the strange characters, not the apostrophe.
I know this has something to do with encoding, but several days' googling and much looking at questions on SE have not led to the solution. Firefox tells me that the downloaded HTML files are in utf-8 encoding. I tried changing the php.ini file so the default_charset is "utf-8". No joy.
I am more an application programmer than a website person so I am quite new to encoding. I have tried cracking this one myself but just don't really understand what's going on or what to do.
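For anyone else puzzling over where sequences like this come from: they are what you would expect if the UTF-8 bytes of a curly apostrophe are read as single-byte Windows-1252/Latin-1 text, which is what the HTML parser behind DOMDocument appears to fall back on when the file declares no charset. A minimal sketch reproducing the effect, assuming the mbstring extension is loaded (the variable names are just for illustration):

```php
<?php
// U+2019 RIGHT SINGLE QUOTATION MARK, stored as the three UTF-8 bytes E2 80 99.
$apostrophe = "\u{2019}";

// Treat those three bytes as Windows-1252 text and re-encode them as UTF-8.
// This mimics a parser that guesses the wrong input charset.
$garbled = mb_convert_encoding($apostrophe, 'UTF-8', 'Windows-1252');

echo bin2hex($apostrophe), "\n"; // e28099
echo $garbled, "\n";             // three characters: a-circumflex, euro sign, trademark sign
```

So one character in the source file turns into three, and everything downstream (saveHTMLFile(), MySQL) faithfully stores the three wrong ones.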
later
I have found that by putting
$file = file_get_contents("../notifs/notif$notif_no.htm");
$doc->loadHTML('<?xml encoding="UTF-8">' . $file );
then saveHTMLFile() outputs a correct apostrophe, as does my echo of the SQL INSERT INTO ... (...) VALUES (...) string. However, the text that ends up in the MySQL text field is still wrong (naturally I have tried several different collations). Meanwhile, mb_detect_encoding( $clean_string ) prints "UTF-8" and mb_check_encoding( $clean_string ) returns TRUE.
Another puzzling thing, though: if I do
$doc->loadHTML('<?xml encoding="latin1">' . $file );
I get exactly the same partial success, right down to the "UTF-8" detected encoding. Hmmmm.
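One thing the attempts above do not cover is the character set of the database connection itself, which is a separate setting from the column's collation. If the connection is not talking UTF-8, a correct PHP string can still arrive garbled. The sketch below is only a guess at a fix, assuming PDO is used (the post does not say how it connects); buildUtf8Dsn, the host, the database name and the table in the comments are all hypothetical:

```php
<?php
// Hypothetical helper: builds a PDO DSN that asks MySQL to use utf8mb4 on the connection.
function buildUtf8Dsn(string $host, string $dbName): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=utf8mb4', $host, $dbName);
}

$dsn = buildUtf8Dsn('localhost', 'notifs_db'); // placeholder host and database name
echo $dsn, "\n";

// With real credentials one would then do something like:
// $pdo  = new PDO($dsn, $user, $password, [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]);
// $stmt = $pdo->prepare('INSERT INTO paragraphs (body) VALUES (?)'); // hypothetical table
// $stmt->execute([$clean_string]);
```

With mysqli the rough equivalent would be calling $mysqli->set_charset('utf8mb4') right after connecting. Either way the column itself would also need a utf8mb4 (or utf8) character set for the apostrophe to survive.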
later
$doc = new DOMDocument();
$file = file_get_contents("../notifs/notif$notif_no.htm");
# without this following line adding an explicit encoding for the DOMDocument nothing worked!
$doc->loadHTML('<?xml encoding="UTF-8">' . $file );
and then, when you've extracted some text and cleaned it up a bit, calling it $clean_string
# convert difficult UTF-8 characters into HTML special sequences ("&rsquo;", etc.)
$clean_string = mb_convert_encoding($clean_string, "HTML-ENTITIES", "UTF-8");
After this $clean_string contains sequences like "... wine&rsquo;s worth drinking"... but I, for one, can still be quite confused, because if you simply go
echo ">>> clean string $clean_string<br>";
... the "&rsquo;" sequence will of course be displayed by the browser as ’ (a curly single quote).
This is probably absolutely obvious to most PHPers... but if you want to display an accurate picture of what you have in $clean_string you have to go
$decoded_clean_string = htmlspecialchars( $clean_string, ENT_QUOTES );
echo ">>> decoded string: $decoded_clean_string<br>";
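Putting the pieces together, here is a small self-contained sketch of the whole round trip. It makes a few assumptions: the HTML is an inline stand-in for the downloaded file, the "clean string" is simply the first paragraph's text, and mb_encode_numericentity() is swapped in for the "HTML-ENTITIES" conversion because that pseudo-encoding is deprecated as of PHP 8.2. The swap means the output uses the numeric form of the entity (&#8217;) rather than a named one.

```php
<?php
// Stand-in for file_get_contents("../notifs/notif$notif_no.htm"): raw UTF-8 HTML, no DOCTYPE.
$file = "<html><body><p>This wine\u{2019}s worth drinking</p></body></html>";

libxml_use_internal_errors(true); // keep parser warnings about sloppy HTML quiet

$doc = new DOMDocument();
// The XML-style prefix tells the parser to read the bytes as UTF-8.
$doc->loadHTML('<?xml encoding="UTF-8">' . $file);

// Take the text of the first paragraph as the "clean string".
$clean_string = trim($doc->getElementsByTagName('p')->item(0)->textContent);

// Swapped-in technique: numeric entities for every code point outside ASCII.
$entity_string = mb_encode_numericentity($clean_string, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

// Escape the ampersand so a browser shows the entity itself, not the character.
$visible = htmlspecialchars($entity_string, ENT_QUOTES);

echo $entity_string, "\n"; // This wine&#8217;s worth drinking
echo $visible, "\n";       // This wine&amp;#8217;s worth drinking
```

The first echo is what gets stored; the second is what you would send to the browser if you want to see the stored bytes literally on screen.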