
When I have C++ code like this:

std::string narrow( "This is a narrow source string" );
std::string n2( "Win-1252 (that's the encoding we use for source files): ä,ö,ü,ß,€, ..." );

// What encoding should I pass to Win32's `MultiByteToWideChar` function
// to convert these strings to a proper wchar_t (= UTF-16 on Windows)?

Can I always assume Win-1252 if that's the (implicit) encoding of our cpp files? How does the Visual-C++ compiler decide which character encoding the source files are in?

What would happen if, say, a developer uses a machine where "normal" text files default to another single/multibyte encoding?

I assume the encoding is only an issue on the machine used to compile the code? That is, once the executable is built, converting a static string from a fixed narrow encoding to Windows' UTF-16 wchar_t will always yield the same result regardless of the language/locale on the user's PC?

  • If you need string literals encoded as UTF-16, why not just use wide literals instead of narrow literals? Then you won't have this problem. std::wstring wide(L"This is a wide source string"); std::wstring w2(L"ä,ö,ü,ß,€, ..."); L"" equates to wchar_t[], and wchar_t-based strings can hold UTF-16 encoded data on Windows. On other platforms, like Linux, wchar_t is UTF-32, so you would have to convert at runtime, such as with iconv. Otherwise, if you need to support multiple platforms, you should use a cross-platform library, like ICU, to work with Unicode strings in a uniform manner. Commented Nov 29, 2012 at 19:42
  • If you are using C++11, you can use char16_t and u"" to force UTF-16 on all platforms, e.g.: std::basic_string<char16_t> utf16(u"This is a UTF-16 source string"); Commented Nov 29, 2012 at 19:44
  • @Remy - there's a place in the source where the string literal originates, and that place in the source needs to use narrow. Then, after some transporting around, the std::string has to be displayed in a window, and there I need a wide character string. Commented Nov 30, 2012 at 10:06
  • Then you have to perform a run-time conversion using the correct encoding that the narrow literal actually uses. I would suggest making the narrow literal hold UTF-8 bytes if possible. If you are using C++11, you can use u8"" for that (char8_t came later, in C++20), otherwise use escape sequences for non-ASCII bytes. Either way, you can then use the CP_UTF8 codepage with MultiByteToWideChar() when converting to UTF-16. Commented Nov 30, 2012 at 19:56
  • @Remy - "perform a run-time conversion using the correct encoding that the narrow literal" - That's what the question is about! Which encoding is my literal in? I don't have C++11/u8, so I have to use a simple string "..." - which encoding will it be? Commented Dec 2, 2012 at 17:57

2 Answers


Note: Since the below answer was written VC++ has added additional options for source and execution charset encodings. See here.
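For reference, those newer options let you state both encodings explicitly on the command line instead of relying on BOM detection or the system code page. A sketch of the relevant flags (available since Visual Studio 2015 Update 2):

```shell
:: Declare the source file encoding and the narrow execution charset separately:
cl /source-charset:windows-1252 /execution-charset:utf-8 main.cpp

:: Or set both to UTF-8 at once:
cl /utf-8 main.cpp
```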


For wide literals VC++ will always produce UTF-16, and for narrow literals VC++ will always convert from the source encoding to the "encoding for non-Unicode programs" set on the host machine (the system you run the compiler on). So as long as VC++ correctly recognizes the source encoding that's what you'll get, UTF-16 and the encoding for non-Unicode programs.

To determine the source encoding VC++ looks for a byte order mark (BOM). It will recognize UTF-16 and UTF-8 BOMs. If there is no BOM then it assumes that the source is encoded using the system's encoding for non-Unicode programs.

If this results in the wrong encoding being used then any conversions performed by the compiler on character and string literals will result in the wrong values for any characters outside the ASCII range.


Once the program is compiled then yes, the locale will stop mattering as far as these compile-time conversions go, as the data is static.

Encoding may matter for other things though, such as if you print one of these strings to the console. You'll either have to perform an appropriate conversion to whatever the console is using or ensure the console is set to accept the encoding you're using.


Note on #pragma setlocale

#pragma setlocale affects only the conversion to wide literals and it does so neither by setting the source encoding nor by changing the wide execution encoding. What it actually does is, frankly, horrifying. Just as an example the following assertion fails:

#pragma setlocale(".1251")
static_assert(L'Я' != L'ß', "wtf...");

It should definitely be avoided if you use any Unicode encoding for your source.


5 Comments

Cheers for expanding upon your answer! In your static_assert example at the bottom - what is the encoding of your cpp text file? Since you have both 'Я' and 'ß' in there, I assume it's UTF-8|16 ?
@MartinBa Yeah, UTF-8. I know why the thing is happening but, as I said, I'm horrified.
What that pragma should do is make the compiler see that expression as L'РЇ' != L'Гџ' (i.e., the UTF-8 bytes actually in the source should be interpreted as CP1251). I just find what VC++ preferred to do instead incredible.
Actually, I'm not at all sure what's happening :-) The compiler has to read in your UTF-8 file, and it has to translate the UTF-8 code points to UTF-16 code points. Why on earth would it translate two different UTF-8 code points to the same UTF-16 code points based on the locale pragma? I believe you, I'm just stumped through which machinery I have to route a code point to get from U+00DF(ß) and U+042F(Я) to the same UTF-16 code point.
It converts the UTF-8 encoded character to one of the code pages and then interprets that encoding as cp1251. So 'ß' is converted to cp1252, which is 0xDF, and 'Я' is converted to cp1251, which is also 0xDF. Then in both cases 0xDF is interpreted as cp1251, so what the compiler is actually seeing as a result of this insanity is L'Я' != L'Я'.

The language specification merely says that source characters are mapped in an implementation-defined way. You need to consult the documentation for the compiler you are using in order to see what that implementation's definition says. For example, Microsoft Visual C++ uses #pragma setlocale to specify the code page.

