2

I'm trying to convert a string from ISO-8859-1 to UTF-8. But when the string contains either of the two characters € and •, the function returns a character that looks like a square with two numbers inside.

How can I solve this issue?

2
  • 1
    Please show some code and example data. Commented Sep 2, 2010 at 14:42
  • 1
    Neither character (€ or •) is contained in ISO 8859-1. So what encoding do you use? Commented Sep 2, 2010 at 14:55

4 Answers

8

I think the encoding you are looking for is Windows code page 1252 (Western European). It is not the same as ISO-8859-1 (or 8859-15 for that matter); the characters in the range 0xA0-0xFF match 8859-1, but cp1252 adds an assortment of extra characters in the range 0x80-0x9F where ISO-8859-1 assigns little-used control codes.

The confusion comes about because when you serve a page as text/html;charset=iso-8859-1, for historical reasons, browsers actually use cp1252 (and will hence submit forms in cp1252 too).

iconv('cp1252', 'utf-8', "\x80 and \x95")
-> "\xe2\x82\xac and \xe2\x80\xa2"
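
To make the difference concrete, here is a minimal sketch that decodes the same two bytes under each assumption. It uses `mb_convert_encoding` rather than the `iconv` call shown in the answer, which assumes the mbstring extension is loaded; the variable names are purely illustrative.

```php
<?php
// The same bytes, decoded under two different assumptions about the source.
$bytes = "\x80 and \x95";

// As ISO-8859-1, 0x80 and 0x95 are C1 control codes (U+0080 and U+0095).
$asLatin1 = mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');

// As Windows-1252, they are the euro sign (U+20AC) and the bullet (U+2022).
$asCp1252 = mb_convert_encoding($bytes, 'UTF-8', 'Windows-1252');

echo bin2hex($asLatin1), "\n"; // c28020616e6420c295
echo bin2hex($asCp1252), "\n"; // e282ac20616e6420e280a2

// In the range 0xA0-0xFF both encodings agree, e.g. 0xE9 is "é" in both.
$e1 = mb_convert_encoding("\xE9", 'UTF-8', 'ISO-8859-1');
$e2 = mb_convert_encoding("\xE9", 'UTF-8', 'Windows-1252');
var_dump($e1 === $e2); // bool(true)
```

The invisible C1 control characters produced by the first conversion are a likely source of the "square with two numbers" glyph described in the question.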

4 Comments

Thank you bobince! Now it works. I have another question: how can I check whether a site that is served as text/html;charset=iso-8859-1 is really in cp1252 (as you explained in the answer)?
If you see a byte in the range 0x80–0x9F, you are almost certainly looking at cp1252 rather than 8859-1, since the ‘C1 control codes’ are very rarely used (almost never, on the web). If the source of the “ISO-8859-1” string is web-based, it almost certainly means it's really cp1252, since that's what browsers use.
I tried mb_detect_encoding($string, 'cp1252'); and then, with the same string, mb_detect_encoding($string, 'ISO-8859-1'); The first returns false; the second reports an ISO-8859-1 string, but it isn't one. How can I check the charset with certainty?
You can't make a certain charset check at all. Absolutely any sequence of bytes is a valid ISO-8859-1 string, and most single-byte encodings also map all or most bytes to valid characters. Only multi-byte encodings like UTF-8, where there are many invalid byte sequences, offer any realistic chance of ruling them out. So really you can only go on balance of probabilities, and the balance of probabilities when pitting cp1252 against ISO-8859-1 for text that's come from the web is always cp1252.
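
The "balance of probabilities" approach from these comments could be sketched roughly as follows. Both function names are hypothetical, and the fallback policy (treat anything that is not valid UTF-8 as cp1252) is an assumption that only makes sense for web-sourced Western European text. The sketch assumes the mbstring extension is available.

```php
<?php
// Hypothetical helper: keep valid UTF-8 as-is, otherwise assume Windows-1252.
function to_utf8_guessing_cp1252(string $bytes): string
{
    // UTF-8 has many invalid byte sequences, so passing this check is
    // meaningful evidence that the input already is UTF-8.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;
    }
    // Assumption: remaining input is cp1252, the likelier candidate
    // than ISO-8859-1 for text that came from the web.
    return mb_convert_encoding($bytes, 'UTF-8', 'Windows-1252');
}

// Hypothetical helper: true if any byte falls in 0x80-0x9F. This is only a
// hint, and only useful once UTF-8 has been ruled out, because UTF-8
// continuation bytes also fall in that range.
function has_c1_range_bytes(string $bytes): bool
{
    return preg_match('/[\x80-\x9F]/', $bytes) === 1;
}

echo bin2hex(to_utf8_guessing_cp1252("\x80 and \x95")), "\n"; // e282ac20616e6420e280a2
var_dump(has_c1_range_bytes("\x80 and \x95"));               // bool(true)
var_dump(has_c1_range_bytes("caf\xE9"));                     // bool(false)
```

A string such as "caf\xE9" gives no hint either way, since 0xE9 means the same thing in both single-byte encodings; converting it as cp1252 is still safe for that reason.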
2

Always check your encoding first! You should never blindly trust your encoding (even if it is from your own website!):

function convert_cp1252_to_utf8($input, $default = '') {
    if ($input === null || $input == '') {
        return $default;
    }

    // https://en.wikipedia.org/wiki/UTF-8
    // https://en.wikipedia.org/wiki/ISO/IEC_8859-1
    // https://en.wikipedia.org/wiki/Windows-1252
    // http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
    $encoding = mb_detect_encoding($input, array('Windows-1252', 'ISO-8859-1'), true);
    if ($encoding == 'ISO-8859-1' || $encoding == 'Windows-1252') {
        /*
         * ISO-8859-1 and CP1252 differ only in 0x80 through 0x9F (C1 control
         * codes in ISO-8859-1, printable characters in CP1252), so always
         * convert from Windows-1252 to UTF-8.
         */
        $input = iconv('Windows-1252', 'UTF-8//IGNORE', $input);
    }
    return $input;
}

0

iso-8859-1 doesn't contain the € sign so your string cannot be interpreted with iso-8859-1 if it contains it. Use iso-8859-15 instead.

1 Comment

Then what about the •? It's Windows-1252, not ISO-8859-15.
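
A small check of the point made in this comment, as a sketch assuming the mbstring extension: ISO-8859-15 does place the euro sign at byte 0xA4, but byte 0x95 is still a control code there, whereas Windows-1252 maps it to the bullet.

```php
<?php
// ISO-8859-15 puts the euro sign at 0xA4 (ISO-8859-1 has U+00A4 there).
$euro15 = mb_convert_encoding("\xA4", 'UTF-8', 'ISO-8859-15');

// 0x95 in ISO-8859-15 is still the C1 control code U+0095, not a bullet.
$byte95In15 = mb_convert_encoding("\x95", 'UTF-8', 'ISO-8859-15');

// Windows-1252 maps 0x95 to the bullet, U+2022.
$byte95In1252 = mb_convert_encoding("\x95", 'UTF-8', 'Windows-1252');

echo bin2hex($euro15), "\n";       // e282ac
echo bin2hex($byte95In15), "\n";   // c295
echo bin2hex($byte95In1252), "\n"; // e280a2
```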
0

Those 2 characters are illegal in iso-8859-1 (did you mean iso-8859-15?)

$ php -r 'echo iconv("utf-8","iso-8859-1//TRANSLIT","ter € and • the");'
ter EUR and o the

3 Comments

ISO-8859-15 does not have a code point for •. It has to be Windows-1252.
Quite probably yes, but 'has to' is a bit strong (there are multiple character sets which have both € and •). The iconv solution stays the same as long as people know their input charset.
Good point! Then I fall back on my previous claim that ISO-8859-1 does not have a bullet.
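
As a sanity check on the thread's conclusion, the following sketch round-trips "€ and •" through Windows-1252. It uses `mb_convert_encoding` instead of `iconv` so that the result does not depend on the installed iconv implementation, and it assumes the mbstring extension is loaded. The UTF-8 input is written with byte escapes so the example does not depend on the encoding of the source file.

```php
<?php
// "€ and •" as UTF-8 bytes.
$utf8 = "\xE2\x82\xAC and \xE2\x80\xA2";

// Windows-1252 has single-byte slots for both characters: 0x80 and 0x95.
$cp1252 = mb_convert_encoding($utf8, 'Windows-1252', 'UTF-8');
echo bin2hex($cp1252), "\n"; // 8020616e642095

// Converting back with the correct source charset restores the original.
$back = mb_convert_encoding($cp1252, 'UTF-8', 'Windows-1252');
var_dump($back === $utf8); // bool(true)
```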
