I am crawling some information from a website. To do that, I create a new DOMDocument and load the page with loadHTMLFile. The problem is that the site I'm crawling uses accented characters (French accents). I've read that loadHTMLFile doesn't use UTF-8 encoding by default.

So I've tried to set the UTF-8 encoding manually, but it doesn't work. The accented characters still don't display correctly.

For example, the letter ì (i with a grave accent) is shown as %C3%AC. Words without accented characters are shown correctly.

This is the complete code:

header('Content-Type: text/html; charset=utf-8');

foreach (range(0, 50) as $number) {
    $url = 'https://www.xxyyy.com/' . $number;

    // My attempt to force UTF-8 while loading the page
    $dom = new DOMDocument('1.0', 'utf-8');
    $dom->loadHTMLFile(mb_convert_encoding($url, 'HTML-ENTITIES', 'UTF-8'));
    $dom->substituteEntities = true;
    $xpath = new DOMXPath($dom);

    // Select the href attribute of every link whose URL contains "character"
    $content = $xpath->query("//a[contains(@href,'character')]/@href");

    foreach ($content as $node) {
        echo $node->nodeValue;
    }
}

1 Answer

Your problem isn't UTF-8 at all. Non-ASCII characters in a URL are percent-encoded, so when you read the value of an <a href> attribute you get the URL in its encoded form: %C3%AC is just the two UTF-8 bytes of ì written as percent escapes. To see the Unicode characters, decode the URL back to its string form with urldecode():

echo urldecode($node->nodeValue);
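
To make the decoding step concrete, here is a minimal sketch that runs on its own. The href value is made up for illustration (it is not taken from the crawled site), and the comparison with rawurldecode() is an addition that the original code does not use.

```php
<?php
// Hypothetical href value, as it might come back from $node->nodeValue.
$href = '/character/cos%C3%AC';

// Each %XX escape becomes one byte; C3 AC is the UTF-8 encoding of "ì".
$decoded = urldecode($href);
echo $decoded, PHP_EOL;

// urldecode() also turns "+" into a space, which suits query strings.
// rawurldecode() leaves "+" alone, which may be safer for URL paths.
$plusQuery = urldecode('a+b%C3%AC');
$plusPath  = rawurldecode('a+b%C3%AC');
echo $plusQuery, PHP_EOL;
echo $plusPath, PHP_EOL;
```

If the links you collect can contain a literal "+" in the path, rawurldecode() is probably the better choice; otherwise the two functions give the same result for percent escapes.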