7

From c++2003 2.13

A wide string literal has type “array of n const wchar_t” and has static storage duration, where n is the size of the string as defined below

The size of a wide string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for the terminating L’\0’.

From c++0x 2.14.5

A wide string literal has type “array of n const wchar_t”, where n is the size of the string as defined below

The size of a char32_t or wide string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for the terminating U’\0’ or L’\0’.

The size of a char16_t string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for each character requiring a surrogate pair, plus one for the terminating u’\0’.

The statement in C++2003 is quite vague. But in C++0x, when counting the length of the string, the wide string literal wchar_t shall be treated as same as char32_t, and different from char16_t.

There's a post that states clearly how windows implements wchar_t in https://stackoverflow.com/questions/402283?tab=votes%23tab-top

In short, wchar_t in windows is 16bits and encoded using UTF-16. The statement in standard apparently leaves something conflicting in Windows.

for example,

wchar_t kk[] = L"\U000E0005";

This exceeds 16bits and for UTF-16 it needs two 16 bits to encode it (a surrogate pair).

However, from standard, kk is an array of 2 wchar_t (1 for the universal-name \U000E005, 1 for \0).

But in the internal storage, Windows need 3 16-bit wchar_t objects to store it, 2 wchar_t for the surrogate pair, and 1 wchar_t for the \0. Therefore, from array's definition, kk is an array of 3 wchar_t.

It's apparently conflicting to each other.

I think one simplest solution for Windows is to "ban" anything that requires surrogate pair in wchar_t ("ban" any unicode outside BMP).

Is there anything wrong with my understanding?

Thanks.

2 Answers 2

4

The standard requires that wchar_t be large enough to hold any character in the supported character set. Based on this, I think your premise is correct -- it is wrong for VC++ to represent the single character \U000E0005 using two wchar_t units.

Characters outside the BMP are rarely used, and Windows itself internally uses UTF-16 encoding, so it is simply convenient (even if incorrect) for VC++ to behave this way. However, rather than "banning" such characters, it is likely that the size of wchar_t will increase in the future while char16_t takes its place in the Windows API.

The answer you linked to is somewhat misleading as well:

On Linux, a wchar_t is 4-bytes, while on Windows, it's 2-bytes

The size of wchar_t depends solely on the compiler and has nothing to do with the operating system. It just happens that VC++ uses 2 bytes for wchar_t, but once again, this could very well change in the future.

Sign up to request clarification or add additional context in comments.

6 Comments

thank you. i got it now. some time it's just difficult to understand a new concept, but once you got it, it becomes simpler instantly.
Windows technically uses WCHAR, not wchar_t. It's been typedeffed as unsigned short in the past and could become char16_t in the future. But honestly, I don't see that happening - string literals would break.
@MSalters: Why would string literals break? That's what the TEXT("...") macros are there for -- people were never supposed to use raw L"..." literals. Also, at least on VS2005, WCHAR is a typedef for wchar_t, not unsigned short.
@casablanca: TEXT("") is a TCHAR[] literal, not a WCHAR[] literal. The typedef unsigned short WCHAR was used in VC6 and previous versions.
Today VC++ is incorrect. But the reason is that at the time when the decision was made that Windows NT should be Unicode, the Unicode standard itself was not going beyond 65536, and there was no mechsnicm to go beyond that.
|
2

Windows knows nothing about wchar_t, because wchar_t is a programming concept. Conversely, wchar_t is just storage, and it knows nothing about the semantic value of the data you store in it (that is, it knows nothing about Unicode or ASCII or whatever.)

If a compiler or SDK that targets Windows defines wchar_t to be 16 bits, then that compiler may be in conflict with the C++0x standard. (I don't know whether there are some get-out clauses that allow wchar_t to be 16 bits.) But in any case the compiler could define wchar_t to be 32 bits (to comply with the standard) and provide runtime functions to convert to/from UTF-16 for when you need to pass your wchar_t* to Windows APIs.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.