HTML Encoding characters not in the character set

Question

We have a web app which uses the ISO-8859-1 character set. Occasionally users have 'strange' names which contain characters like Š (HTML encoded here for your convenience). ~~We store this in our database, but~~ we can't display it correctly.

What is the best way of dealing with this? I'm thinking I should automatically replace characters outside the character set with their numeric HTML entity encoding (Š to &#352;).

But I'm having problems finding out how to do this automatically (without using a table of all values).

This code works for extended ASCII characters like 'å' (that are present in ISO-8859-1). I would like to do the same with other characters. Is there a pattern in these HTML entity encoding values I can use?

unsigned int c;  
for( int i=0; i < html.GetLength(); i++)  
{  
    c = html[i];  
    if( c > 255 || c < 0 )  
    {  
        CString orig = CString(html[i]);  
        CString encoded = "&#";  
        encoded += CTool::String((byte)c);  
        encoded += ";";  
        html.Replace(orig, encoded);  
    }  
}

BalusC · Accepted Answer · 2010-12-15 14:31:30Z

1

The webpage should instruct the browser to display the response in UTF-8. This usually happens by supplying the charset in the Content-Type response header like text/html;charset=UTF-8.

Response.AppendHeader("Content-Type", "text/html;charset=UTF-8");

The HTML/XML entities are solely there so that you will be able to save the webpage source in an encoding other than UTF-8.

2 Comments

Polymorphix Over a year ago

Yes, this works, but I believe we are running ISO-8859-1 for a reason. Hopefully not, though... I'm going to check with the people who should know. It's a risky operation changing the character set on all our servers, though I'd prefer that to coding an unnecessary workaround.

BalusC Over a year ago

It's not risky as long as you were already using HTML entities for "special characters" outside the 7-bit ASCII range. ISO-8859-1 and UTF-8 have exactly the same byte representation for ASCII characters.

MSalters · Accepted Answer · 2010-12-16 10:40:31Z

0

html appears to be a "Unicode" CString. That means it's UTF-16 encoded. The "&#ddd;" syntax uses the Unicode code point number, written in decimal. Usually, this is quite simple. Š is U+0160, which means it's 0x0160 in UTF-16. That's of course 352 decimal, so you get &#352;.

You only have a problem when you encounter a character outside the Basic Multilingual Plane (BMP), which is past U+FFFF. This no longer fits in 16 bits, and will therefore take TWO characters (a surrogate pair) in your html string. Yet, it should produce only one &#ddddd; value. This is so rare that you can often ignore it.
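As a rough illustration of the rule above, here is a minimal sketch in standard C++ rather than MFC. The function name EncodeForLatin1 is hypothetical, and the use of std::u16string in place of CString and the U+FFFD fallback for an unpaired surrogate are assumptions made for the example, not something taken from the question. The key point is that the full code point is written out in decimal, instead of being truncated to a byte.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not from the original post): converts UTF-16 text to
// ISO-8859-1 bytes, replacing every code point above U+00FF with "&#N;".
std::string EncodeForLatin1(const std::u16string& text)
{
    std::string out;
    for (std::size_t i = 0; i < text.size(); ++i)
    {
        std::uint32_t cp = text[i];

        // A high surrogate followed by a low surrogate forms one code point
        // outside the BMP; combine the two code units.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < text.size()
            && text[i + 1] >= 0xDC00 && text[i + 1] <= 0xDFFF)
        {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (text[i + 1] - 0xDC00);
            ++i; // the low surrogate has been consumed
        }
        else if (cp >= 0xD800 && cp <= 0xDFFF)
        {
            cp = 0xFFFD; // assumption: unpaired surrogate becomes U+FFFD
        }

        if (cp <= 0xFF)
            out += static_cast<char>(cp);           // representable in ISO-8859-1
        else
            out += "&#" + std::to_string(cp) + ";"; // full code point in decimal
    }
    return out;
}
```

Called with u"\u0160", this should return "&#352;". With an MFC Unicode CString the same loop would run over its 16-bit elements; that adaptation is not shown here.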
