
I am downloading HTML files (raw HTML without any !DOCTYPE...) from a government website and then extracting paragraphs to put them into a MySQL database.

I am using DOMDocument, so I am going

$doc = new DOMDocument();
$doc->loadHTMLFile( "../notifs/notif$notif_no.htm" );

The problem comes because certain characters get transformed into something strange: e.g. (one type of) apostrophe becomes ¢€™.
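A minimal sketch of how this kind of garbling can arise, assuming the file holds a UTF-8 right single quotation mark (U+2019) and that some step in the pipeline reads its bytes as a single-byte encoding. Windows-1252 is used here purely as an illustration; the encoding the parser actually assumed is not confirmed by anything above.

```php
<?php
// The right single quotation mark (U+2019) as its three UTF-8 bytes.
$apostrophe = "\xE2\x80\x99";

// Misread those bytes as Windows-1252 (one character per byte) and
// re-encode the result as UTF-8, as a mislabelled pipeline would.
$mangled = mb_convert_encoding($apostrophe, 'UTF-8', 'Windows-1252');

echo bin2hex($apostrophe), "\n";           // the original bytes: e28099
echo mb_strlen($mangled, 'UTF-8'), "\n";   // one character has become three
echo $mangled, "\n";                       // the three-character garbled form
```

The single apostrophe turns into three separate characters, a sequence of the same kind as the one described above.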

If I then try and save this para to a text field in a table either it is refused by MySQL or it is recorded as these strange characters... depending on the encoding of the text field.

Also, if I go $doc->saveHTMLFile( "test.htm" ); it actually prints out the strange characters, not the apostrophe.

I know this has something to do with encoding, but several days' googling and much looking at questions on SE have not led to the solution. Firefox tells me that the downloaded HTML files are in utf-8 encoding. I tried changing the php.ini file so the default_charset is "utf-8". No joy.

I am more an application programmer than a website person so I am quite new to encoding. I have tried cracking this one myself but just don't really understand what's going on or what to do.

later

have found that by putting

$file = file_get_contents("../notifs/notif$notif_no.htm");
$doc->loadHTML('<?xml encoding="UTF-8">' . $file );

then saveHTMLFile() outputs with a correct apostrophe... as does my echo of the SQL INSERT INTO ... (...) VALUES (...) string. However the text in the MySQL text field obstinately refuses to cooperate. (naturally have tried multiple different collations). Meanwhile, mb_detect_encoding ( $clean_string ) prints "UTF-8" and mb_check_encoding ( $clean_string ) returns TRUE.

Another puzzling thing, though: if I do

$doc->loadHTML('<?xml encoding="latin1">' . $file );

this same partial success stays the same, right down to the "UTF-8" detected encoding. hmmmm

later

$doc = new DOMDocument();
$file = file_get_contents("../notifs/notif$notif_no.htm");
# without this following line adding an explicit encoding for the DOMDocument nothing worked!
$doc->loadHTML('<?xml encoding="UTF-8">' . $file );

and then, when you've extracted some text and cleaned it up a bit, calling it $clean_string

# convert difficult UTF-8 characters into HTML special sequences ("&rsquo;", etc.) 
$clean_string = mb_convert_encoding($clean_string, "HTML-ENTITIES", "UTF-8"); 

After this $clean_string contains sequences like "... wine&rsquo;s worth drinking"... but I, for one, can still be quite confused, because if you simply go

echo ">>> clean string $clean_string<br>";

... the "&rsquo;" sequence will of course be displayed by the browser as ' (single quote).

This is probably absolutely obvious to most PHPers... but if you want to display an accurate picture of what you have in $clean_string you have to go

$decoded_clean_string = htmlspecialchars( $clean_string, ENT_QUOTES );
echo ">>> decoded string: $decoded_clean_string<br>";
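The whole round trip can be sketched in one self-contained script. The sample HTML is a hypothetical stand-in for one downloaded file, and htmlentities() is swapped in for the "HTML-ENTITIES" conversion because mb_convert_encoding() with that target is deprecated as of PHP 8.2. Unlike the mb_convert_encoding() call, htmlentities() also escapes characters such as & and <, so this is an alternative under that assumption rather than a drop-in equivalent.

```php
<?php
// Hypothetical sample input standing in for one downloaded file.
$file = "<p>wine\u{2019}s worth drinking</p>";

libxml_use_internal_errors(true); // keep parser warnings out of the output

$doc = new DOMDocument();
// The encoding hint tells the parser to treat the bytes as UTF-8.
$doc->loadHTML('<?xml encoding="UTF-8">' . $file);

// Text pulled out of the DOM is returned as UTF-8.
$clean_string = $doc->getElementsByTagName('p')->item(0)->textContent;

// Turn non-ASCII characters into named entities such as &rsquo;.
$entities = htmlentities($clean_string, ENT_QUOTES, 'UTF-8');

// Escape once more so a browser shows the entity text itself.
$visible = htmlspecialchars($entities, ENT_QUOTES);

echo $entities, "\n"; // what would be stored
echo $visible, "\n";  // what to echo to inspect it in a browser
```

Viewed in a browser, the first echo renders as an apostrophe, while the second shows the literal "&rsquo;" sequence.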
  • that's a unicode mismatch. e.g. you're grabbing a utf-8 document, but processing it in iso-8859. the same charset has to be maintained throughout the entire rendering pipeline, or converted as appropriate as the "borders". Commented Nov 14, 2012 at 19:17
  • Even as an application programmer you need to know about encodings. See "What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text" and "Handling Unicode Front To Back In A Web App". Commented Nov 14, 2012 at 19:21
  • @Marc B thanks for the reply. Would I be right in thinking it is the DOMDocument loadHTMLFile method which has chosen iso-8859? That sort of thought led me to try to get PHP to use utf-8 as the default_charset. Did you mean "at" the borders... i.e. between one pipeline and another? Commented Nov 14, 2012 at 19:21
  • a border would be, say, php->mysql. a table in mysql can be in utf-8, but unless the db connection was set to be utf-8 as well, the text will be mangled while in flight from php -> mysql. Commented Nov 14, 2012 at 19:22
  • @deceze believe me, I have come across that webpage and read it. It doesn't help me with the particular problem I have here. Can you help me with this particular PHP/MySQL encoding problem? Commented Nov 14, 2012 at 19:23
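The php->mysql border mentioned in the comments can be sketched as below, assuming PDO is used for the connection. The helper names, the database name "notifs", and the table and column names are all hypothetical placeholders; nothing above says how the connection was actually opened.

```php
<?php
// Hypothetical helper: build a PDO DSN that pins the connection charset,
// so text is not re-interpreted in flight between PHP and MySQL.
function utf8_dsn(string $host, string $dbname): string
{
    return "mysql:host={$host};dbname={$dbname};charset=utf8mb4";
}

// Hypothetical helper: table and column names are placeholders.
// A prepared statement passes the string through without manual quoting.
function save_paragraph(PDO $pdo, string $clean_string): void
{
    $stmt = $pdo->prepare('INSERT INTO paragraphs (body) VALUES (?)');
    $stmt->execute([$clean_string]);
}

echo utf8_dsn('localhost', 'notifs'), "\n";

// Usage, assuming real credentials in $user and $pass:
// $pdo = new PDO(utf8_dsn('localhost', 'notifs'), $user, $pass);
// save_paragraph($pdo, $clean_string);
```

With mysqli, the equivalent step would be calling set_charset('utf8mb4') on the connection. In either case the text column itself would also need a UTF-8 character set for the stored bytes to match.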

1 Answer

$doc = new DOMDocument();
$file = file_get_contents("../notifs/notif$notif_no.htm");
$file = mb_convert_encoding($file, "UTF-8");
$doc->loadHTML( $file );

Worth a shot?

or

$file = mb_convert_encoding($file, 'HTML-ENTITIES', 'UTF-8');

3 Comments

thanks... unfortunately neither worked. The second changed the pattern to ’. But there seem to be various encoding functions available in php... e.g. iconv... so thanks for pointing the way, possibly
thanks again... yes, you put me on the right track with HTML-ENTITIES... see my second "later" above
Ha. That's quite the journey of discovery you were on! :) I'm glad you got it figured out.
