I have converted results from a web scrape from DOMNodeLists to strings:

$node = $the_sentence->item(0);
$the_sentence = "{$node->nodeName} - {$node->nodeValue}";

However, when I now print the result, it includes the name of the tag the text had in the page, as well as stray characters where the page had &nbsp; entities:

Before:

"This is the sentence"

Now:

"h2 - This is the Âsentence Â"

Any ideas how I can get rid of these characters? Thanks for any help.

1 Answer

This looks like a character set problem.
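To illustrate the likely mechanism, here is a small sketch. It assumes the page's &nbsp; entities arrive as UTF-8 no-break spaces and that something in the pipeline then reads those bytes as ISO-8859-1; that is a guess, not something confirmed by the question. Under that assumption, the two bytes of one no-break space are treated as two separate characters, the first of which is "Â":

```php
<?php
// A no-break space (U+00A0) encoded as UTF-8 is the two bytes C2 A0.
$nbspUtf8 = "\xC2\xA0";

// Assumption for illustration: the bytes are misread as ISO-8859-1
// and then re-encoded as UTF-8.
$misread = iconv('ISO-8859-1', 'UTF-8', $nbspUtf8);

// The result is "Â" (C3 82) followed by a no-break space (C2 A0).
var_dump($misread === "\xC3\x82\xC2\xA0"); // bool(true)
```

If that matches what you see, the fix is to make every step agree on the charset rather than to strip the "Â" afterwards.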

Have a look at the source page and see what character set it is encoded in. This might be in a Content-Type HTTP header, or it might be in a <meta> tag at the start of the document. Then, when you handle the data, make sure that everything you do handles it in the same format.
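As a rough sketch of that check, the helper below pulls the charset name out of either a Content-Type header value or a <meta> tag. The function name `charset_from_declaration` and the sample strings are made up for illustration, and the regular expression only covers the common forms:

```php
<?php
// Hypothetical helper: extract a declared charset from a Content-Type
// header value or a <meta> tag. Returns null when none is declared.
function charset_from_declaration(string $text): ?string
{
    if (preg_match('/charset\s*=\s*["\']?([\w.:-]+)/i', $text, $m)) {
        return strtolower($m[1]);
    }
    return null;
}

// Sample inputs (invented for this example).
var_dump(charset_from_declaration('text/html; charset=ISO-8859-1')); // string(10) "iso-8859-1"
var_dump(charset_from_declaration('<meta charset="utf-8">'));        // string(5) "utf-8"
var_dump(charset_from_declaration('text/html'));                     // NULL
```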

You probably want to store the data in UTF-8. If you capture data in another charset, it is generally a good idea to convert it to UTF-8 first; that way you can capture from a wide range of sources and store everything in the same database. See iconv in the PHP manual to learn more about charset conversion.
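A minimal conversion sketch with iconv, assuming the captured bytes are ISO-8859-1 (the sample string is invented; substitute whatever charset the source page actually declares):

```php
<?php
// Hypothetical input: "café" as captured from an ISO-8859-1 page.
$latin1 = "caf\xE9";

// Convert from the source charset to UTF-8 before storing or printing.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);

// In UTF-8, "é" is the two bytes C3 A9.
var_dump($utf8 === "caf\xC3\xA9"); // bool(true)
```

iconv returns false when the conversion fails, so it is worth checking the return value before using the result.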

Are you printing the output to console or a browser? If the former, note that some consoles (old versions of Windows in particular) do not handle UTF-8 well at all. If you are echoing to a browser, make sure your character set is set to "UTF-8" in your own HTML.
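For the browser case, one way to declare the charset is in both the HTTP header and the HTML itself. This is only a sketch; `render_page` is a made-up helper and it assumes the text you pass in is already UTF-8:

```php
<?php
// Hypothetical helper: wrap already-UTF-8 text in a minimal HTML page
// that declares its own charset.
function render_page(string $utf8Text): string
{
    $safe = htmlspecialchars($utf8Text, ENT_QUOTES, 'UTF-8');
    return "<!DOCTYPE html>\n<html>\n<head><meta charset=\"UTF-8\"></head>\n"
         . "<body><p>{$safe}</p></body>\n</html>\n";
}

// Declare the charset in the HTTP header too, if headers can still be sent.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}

echo render_page('This is the sentence');
```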

1 Comment

Great help, thanks. The character set was UTF-8 after all. Using iconv, I am able to ignore non-UTF-8 characters.
