
Being in a non-English-speaking country, I wanted to run a test with a char array and a non-ASCII character.

I compiled this code with MSVC and MinGW GCC:

#include <iostream>

int main()
{
    constexpr char const* c = "é";
    int i = 0;

    char const* s;

    for (s = c; *s; s++)
    {
        i++;
    }

    std::cout << "Size: " << i << std::endl;

    std::cout << "Char size: " << sizeof(char) << std::endl;
}

Both display Char size: 1, but MSVC displays Size: 1 while MinGW GCC displays Size: 2.

Is this undefined behaviour caused by the non-ASCII character, or is there another reason behind it (GCC encoding in UTF-8 and MSVC in UTF-16, maybe)?

  • use std::u8string if you want to guarantee UTF-8 encoding. Otherwise MSVC uses the text encoding of the file; presumably GCC is converting to UTF-8? Commented Oct 13, 2022 at 10:59
  • Try prefixing the literal with u8 or u Commented Oct 13, 2022 at 11:03
  • What is the character encoding of the source file that you are feeding to the compiler? According to this link, it must be UTF-8 if you are using GCC (which MinGW is based on). If you are unable to answer this question (for example because the text editor you are using does not provide this information), then please provide a hex dump of the source file. Commented Oct 13, 2022 at 11:04
  • If you don't know how to create a hex dump, this question may be useful. It explains how to open a file in binary mode, so that you can use Visual Studio as a hex editor. Commented Oct 13, 2022 at 11:11
  • Nothing is as hard to get right as "plain text". Fun video here: youtube.com/watch?v=_mZBa3sqTrI Commented Oct 13, 2022 at 11:13

1 Answer


The encoding used to map ordinary string literals to a sequence of code units is (mostly) implementation-defined.

GCC defaults to UTF-8, in which the character é uses two code units; my guess is that MSVC uses code page 1252, in which the same character takes up only one code unit. (That encoding uses a single code unit per character anyway.)

Compilers typically have switches to change the ordinary literal and execution character set encoding, e.g. for GCC with the -fexec-charset option.

Also be careful that the source file is encoded in an encoding that the compiler expects. If the file is UTF-8 encoded but the compiler expects it to be something else, then it is going to interpret the bytes in the file corresponding to the intended character é as a different (sequence of) characters. That is however independent of the ordinary literal encoding mentioned above. GCC for example has the -finput-charset option to explicitly choose the source encoding and defaults to UTF-8.

If you intend the literal to be UTF-8 encoded into bytes, then you should use u8-prefixed literals, which are guaranteed to use this encoding:

constexpr auto c = u8"é";

Note that the type auto here will be const char* in C++17, but const char8_t* since C++20; s must be adjusted accordingly. This then guarantees an output of 2 for the length (number of code units). Similarly, there are u and U prefixes for UTF-16 and UTF-32, in both of which only one code unit is used for é, but the code units are 2 or 4 bytes wide (assuming CHAR_BIT == 8), with types char16_t and char32_t respectively.
