We have a web app which uses the ISO-8859-1 character set. Occasionally users have 'strange' names which contain characters like &#352; (that is Š, HTML-encoded here for your convenience). We store this in our database, but we can't display it correctly.

What is the best way of dealing with this? I'm thinking I should automatically convert characters outside the character set to their HTML numeric entity encoding (Š to &#352;).

But I'm having problems finding out how to do this automatically (without using a table of all values).

This code works for extended ASCII characters like 'å' (that are present in ISO-8859-1). I would like to do the same with other characters. Is there a pattern in these HTML entity encoding values I can use?

unsigned int c;  
for( int i=0; i < html.GetLength(); i++)  
{  
    c = html[i];  
    if( c > 255 || c < 0 )  
    {  
        CString orig = CString(html[i]);  
        CString encoded = "&#";  
        encoded += CTool::String((byte)c);  
        encoded += ";";  
        html.Replace(orig, encoded);  
    }  
}  

2 Answers

The webpage should instruct the browser to display the response in UTF-8. This usually happens by supplying the charset in the Content-Type response header like text/html;charset=UTF-8.

Response.AppendHeader("Content-Type", "text/html;charset=UTF-8");

The HTML/XML entities are solely there so that you will be able to save the webpage source in an encoding other than UTF-8.

2 Comments

Yes, this works, but I believe we are running ISO-8859-1 for a reason. Hopefully not, though... I'm going to check with the people who should know. Changing the character set on all our servers is a risky operation, though I'd prefer that to coding an unnecessary workaround.
It's not risky as long as you were already using HTML entities for "special characters" outside the 7-bit ASCII range, because ISO-8859-1 and UTF-8 have exactly the same byte representation for ASCII characters.
html appears to be a "Unicode" CString. That means it's UTF-16 encoded. The "&#ddd;" syntax uses the Unicode code point number, written in decimal. Usually, this is quite simple. Š is U+0160, which means it's the single code unit 0x0160 in UTF-16. That's 352 in decimal, so you get &#352;.
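
As a minimal sketch of that rule in standard C++: the function name `EncodeNonLatin1` is hypothetical, and `std::u16string` stands in for the Unicode CString from the question. It assumes the input contains only BMP characters. Note that the question's `(byte)c` cast keeps only the low 8 bits, so 352 would come out as 96; the sketch formats the full 16-bit value instead, and builds a new string rather than replacing in place while iterating.

```cpp
#include <string>

// Hypothetical helper: copy Latin-1 code units through unchanged and
// replace every UTF-16 code unit above 0xFF with a decimal numeric
// character reference such as "&#352;".
// Assumption: the input holds BMP characters only (no surrogate pairs).
std::u16string EncodeNonLatin1(const std::u16string& text)
{
    std::u16string out;
    for (char16_t unit : text)
    {
        if (unit > 0xFF)
        {
            out += u"&#";
            // Format the whole 16-bit value; do not truncate it to a byte.
            for (char digit : std::to_string(static_cast<unsigned>(unit)))
                out += static_cast<char16_t>(digit);
            out += u';';
        }
        else
        {
            out += unit;
        }
    }
    return out;
}
```

With this, a name containing Š (U+0160) would carry `&#352;` in its place, while 'å' (U+00E5) stays as it is because ISO-8859-1 can represent it directly.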

You only have a problem when you encounter a character outside the Basic Multilingual Plane (BMP), that is, past U+FFFF. Such a character no longer fits in 16 bits, and will therefore take TWO code units (a surrogate pair) in your html string. Yet, it should produce only one &#ddddd; value. This is rare enough that you can often ignore it.
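
If you would rather not ignore that case, a surrogate pair can be folded back into one code point before formatting. The following is a sketch under assumptions, not a drop-in for MFC: `NextCodePoint` and `EncodeAsEntities` are hypothetical names, the input is a `std::u16string`, and the output is a byte string meant to be served as ISO-8859-1.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: return the code point that starts at text[i] and
// advance i past it. A high surrogate (0xD800-0xDBFF) followed by a low
// surrogate (0xDC00-0xDFFF) is combined into one code point above U+FFFF.
// An unpaired surrogate is returned unchanged; real code may want to
// reject or replace it instead.
char32_t NextCodePoint(const std::u16string& text, std::size_t& i)
{
    char32_t unit = text[i++];
    if (unit >= 0xD800 && unit <= 0xDBFF && i < text.size())
    {
        char32_t low = text[i];
        if (low >= 0xDC00 && low <= 0xDFFF)
        {
            ++i;
            return 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00);
        }
    }
    return unit;
}

// Hypothetical helper: produce ISO-8859-1 bytes, writing every code point
// above 0xFF as a single decimal numeric character reference.
std::string EncodeAsEntities(const std::u16string& text)
{
    std::string out;
    for (std::size_t i = 0; i < text.size(); )
    {
        char32_t cp = NextCodePoint(text, i);
        if (cp > 0xFF)
            out += "&#" + std::to_string(static_cast<unsigned long>(cp)) + ";";
        else
            out += static_cast<char>(cp);  // fits in one Latin-1 byte
    }
    return out;
}
```

For example, U+1F600 is stored in UTF-16 as the pair 0xD83D 0xDE00, and the sketch emits one reference, `&#128512;`, rather than two.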
