
I compile and run the following C++ source code on 'Windows 10' and on 'Ubuntu' (via 'WSL 2'):

#include <cstring>
#include <iostream>

int main()
{
    char str[] = "Hello, привет, 😎!";

    std::cout << str << "\n\n";

    for (int i = 0; i < std::strlen(str); i++) {
        std::cout << (int) str[i] << ' ';
    } std::cout << "\n\n";

    for (int i = 0; i < std::strlen(str); i++) {
        std::cout << std::hex << (int) str[i] << ' ';
    } std::cout << "\n\n";

    for (int i = 0; i < std::strlen(str); i++) {
        std::cout << std::hex << (str[i] & 0xff) << ' ';
    } std::cout << '\n';

    return 0;
}

I save this source code in a file chars.cpp in UTF-8 encoding without a BOM. On Windows 10, I use the MSVC compiler (cl.exe) from the 'Microsoft C++ Build Tools', invoked from the command line as cl /EHsc /utf-8 "chars.cpp". On 'Ubuntu' (via 'WSL 2') I use the g++ compiler from GCC, invoked as g++ /mnt/c/Users/Илья/source/repos/test/chars.cpp -o chars.

I get the following result (on 'Windows 10' you first need to set the console code page in cmd.exe with the chcp 65001 command; on 'Ubuntu' (via 'WSL 2') this is not necessary):

Hello, привет, 😎!

72 101 108 108 111 44 32 -48 -65 -47 -128 -48 -72 -48 -78 -48 -75 -47 -126 44 32 -16 -97 -104 -114 33

48 65 6c 6c 6f 2c 20 ffffffd0 ffffffbf ffffffd1 ffffff80 ffffffd0 ffffffb8 ffffffd0 ffffffb2 ffffffd0 ffffffb5 ffffffd1 ffffff82 2c 20 fffffff0 ffffff9f ffffff98 ffffff8e 21

48 65 6c 6c 6f 2c 20 d0 bf d1 80 d0 b8 d0 b2 d0 b5 d1 82 2c 20 f0 9f 98 8e 21

I'm curious why negative numbers are used to represent some characters. I tried to find an explanation on cppreference.com and read two articles there:

https://en.cppreference.com/w/cpp/language/types, quote:

char - type for character representation which can be most efficiently processed on the target system (has the same representation and alignment as either signed char or unsigned char, but is always a distinct type). Multibyte characters strings use this type to represent code units. For every value of type unsigned char in range [0, 255], converting the value to char and then back to unsigned char produces the original value. (since C++11) The signedness of char depends on the compiler and the target platform: the defaults for ARM and PowerPC are typically unsigned, the defaults for x86 and x64 are typically signed.

and

https://en.cppreference.com/w/cpp/string/multibyte

But I didn't find a direct explanation there.

My questions: Why are some characters represented by negative numbers? Is this mandated by the standard, or is it system-specific?

Comments:
  • char is signed char in your compiler. Commented Apr 28, 2023 at 23:30
  • For what purpose are some characters represented by negative numbers? 1. Because some old computers did not have unsigned numeric types. 2. Seven positive bits (0..127, 7-bit ASCII) were enough for US IT. Commented Apr 28, 2023 at 23:33
  • The C++ specification says that char can be either signed or unsigned, and that it is up to the implementation (i.e. the compiler) to decide. As mentioned, your compiler has decided that on your system char is signed. This is also why char, signed char, and unsigned char are considered three distinct types, not two as is the case for the other integer types. Commented Apr 28, 2023 at 23:37
  • UTF-8 is an encoding defined in terms of 8-bit bytes; within each byte only the order of bits matters (from the most-significant down to the least-significant), and a byte is not, strictly speaking, treated as a "number". But there is still an order of bits nonetheless. When you look at a byte in almost any programming language, you typically see a number: either an unsigned number (0..255) or a two's-complement-encoded number (-128..127). It's the programmer's choice how to look at a byte. char in C++ is likely to be signed char and use the latter. But UTF-8 itself does not work with "numbers". Commented Apr 28, 2023 at 23:39
  • 1
    Note that C++20 introduced char8_t (similar to, but distinct from, unsigned char) and std::u8string, specifically to deal with UTF-8, bringing it on par with C++11's char16_t/std::u16string for UTF-16 and char32_t/std::u32string for UTF-32. Commented Apr 28, 2023 at 23:40

1 Answer


Thanks to the people for the comments, I think I understood what was going on here. Please correct me if I'm wrong.

As far as I understand, the C++ language standard allows the compiler to interpret char as either signed char or unsigned char.

The MSVC and g++ compilers interpret char as signed char by default (on x86/x64). Thus, the char type in my program can represent values in the range -128..127. Consider the Cyrillic small letter 'п' as an example: in the Unicode table it is U+043F; in UTF-8 encoding, this is 2 bytes, d0 bf (hex) or 208 191 (dec).

Since the values 208 and 191 do not fit into the range -128..127, they wrap around to -48 and -65 (208 − 256 = −48, 191 − 256 = −65). Every byte of the string is processed this way: if the byte value falls into the range 0..127 (the ASCII range), it is unchanged, while bytes 128..255 come out negative.

This behavior of the MSVC and g++ compilers can be changed using special switches (options). There is a /J option for the MSVC compiler:

cl /EHsc /utf-8 /J "chars.cpp"

And for the g++ compiler there is an option -funsigned-char:

g++ /mnt/c/Users/Илья/source/repos/test/chars.cpp -o chars -funsigned-char

As a result, the same source code after compiling and running with new options will give a different result:

Hello, привет, 😎!

72 101 108 108 111 44 32 208 191 209 128 208 184 208 178 208 181 209 130 44 32 240 159 152 142 33

48 65 6c 6c 6f 2c 20 d0 bf d1 80 d0 b8 d0 b2 d0 b5 d1 82 2c 20 f0 9f 98 8e 21

48 65 6c 6c 6f 2c 20 d0 bf d1 80 d0 b8 d0 b2 d0 b5 d1 82 2c 20 f0 9f 98 8e 21

With the new options, compilers interpret char as unsigned char (range 0..255), so there are no negative numbers in the string representation.


2 Comments

You got it exactly right.
To be pedantic, char, signed char, and unsigned char are three different types. So char can behave like either signed char or unsigned char, but it is not the same type as either of the other two. On the off chance that you write overloaded functions for the char types, you will notice the difference.
