
Being in a non-English-speaking country, I wanted to run a test with a char array and a non-ASCII character.

I compiled this code with MSVC and MinGW GCC:

#include <iostream>

int main()
{
    constexpr char const* c = "é";
    int i = 0;

    char const* s;

    for (s = c; *s; s++)
    {
        i++;
    }

    std::cout << "Size: " << i << std::endl;

    std::cout << "Char size: " << sizeof(char) << std::endl;
}

Both display Char size: 1, but MSVC displays Size: 1 while MinGW GCC displays Size: 2.

Is this undefined behaviour caused by the non-ASCII character, or is there another reason behind it (GCC encoding in UTF-8 and MSVC in UTF-16, maybe)?

  • use std::u8string if you want to guarantee UTF-8 encoding. Otherwise MSVC uses the text encoding of the file; presumably GCC is converting to UTF-8? Commented Oct 13, 2022 at 10:59
  • Try prefixing the literal with u8 or u Commented Oct 13, 2022 at 11:03
  • What is the character encoding of the source file that you are feeding to the compiler? According to this link, it must be UTF-8 if you are using GCC (which MinGW is based on). If you are unable to answer this question (for example because the text editor you are using does not provide this information), then please provide a hex dump of the source file. Commented Oct 13, 2022 at 11:04
  • If you don't know how to create a hex dump, this question may be useful. It explains how to open a file in binary mode, so that you can use Visual Studio as a hex editor. Commented Oct 13, 2022 at 11:11
  • Nothing is as hard to get right as "plain text". Fun video here: youtube.com/watch?v=_mZBa3sqTrI Commented Oct 13, 2022 at 11:13

1 Answer


The encoding used to map ordinary string literals to a sequence of code units is (mostly) implementation-defined.

GCC defaults to UTF-8, in which the character é uses two code units; my guess is that MSVC uses code page 1252, in which the same character takes up only one code unit. (That encoding uses a single code unit per character anyway.)

Compilers typically have switches to change the ordinary literal and execution character set encoding, e.g. for GCC with the -fexec-charset option.

Also be careful that the source file is encoded in an encoding that the compiler expects. If the file is UTF-8 encoded but the compiler expects it to be something else, then it is going to interpret the bytes in the file corresponding to the intended character é as a different (sequence of) characters. That is however independent of the ordinary literal encoding mentioned above. GCC for example has the -finput-charset option to explicitly choose the source encoding and defaults to UTF-8.

If you intend the literal to be UTF-8 encoded into bytes, then you should use u8-prefixed literals, which are guaranteed to use this encoding:

constexpr auto c = u8"é";

Note that the type auto here will be const char* in C++17, but const char8_t* since C++20; s must be adjusted accordingly. This then guarantees an output of 2 for the length (number of code units). Similarly, there are u and U prefixes for UTF-16 and UTF-32, in both of which only one code unit is used for é, but the code units are 2 or 4 bytes wide (assuming CHAR_BIT == 8), with types char16_t and char32_t respectively.
