I compile and run the following C++ source code on Windows 10 and on Ubuntu (via WSL 2):
#include <cstring>
#include <iostream>

int main()
{
    char str[] = "Hello, привет, 😎!";
    std::cout << str << "\n\n";

    // Each char converted to int and printed in decimal
    for (int i = 0; i < std::strlen(str); i++) {
        std::cout << (int) str[i] << ' ';
    }
    std::cout << "\n\n";

    // Each char converted to int and printed in hexadecimal
    for (int i = 0; i < std::strlen(str); i++) {
        std::cout << std::hex << (int) str[i] << ' ';
    }
    std::cout << "\n\n";

    // Each char masked to its low 8 bits and printed in hexadecimal
    for (int i = 0; i < std::strlen(str); i++) {
        std::cout << std::hex << (str[i] & 0xff) << ' ';
    }
    std::cout << '\n';

    return 0;
}
I save this source code in a file chars.cpp in UTF-8 encoding without BOM. On Windows 10, I use the MSVC compiler (cl.exe) from Microsoft C++ Build Tools, invoked from the command line as cl /EHsc /utf-8 "chars.cpp". On Ubuntu (via WSL 2), I use the g++ compiler from GCC, invoked as g++ /mnt/c/Users/Илья/source/repos/test/chars.cpp -o chars.
I get the following result (on Windows 10, the code page of the cmd.exe console must first be set with the chcp 65001 command; on Ubuntu (via WSL 2) this is not necessary):
Hello, привет, 😎!
72 101 108 108 111 44 32 -48 -65 -47 -128 -48 -72 -48 -78 -48 -75 -47 -126 44 32 -16 -97 -104 -114 33
48 65 6c 6c 6f 2c 20 ffffffd0 ffffffbf ffffffd1 ffffff80 ffffffd0 ffffffb8 ffffffd0 ffffffb2 ffffffd0 ffffffb5 ffffffd1 ffffff82 2c 20 fffffff0 ffffff9f ffffff98 ffffff8e 21
48 65 6c 6c 6f 2c 20 d0 bf d1 80 d0 b8 d0 b2 d0 b5 d1 82 2c 20 f0 9f 98 8e 21
I'm curious why negative numbers are used to represent some characters. I tried to find an explanation on cppreference.com and read two articles there:
https://en.cppreference.com/w/cpp/language/types, quote:
char - type for character representation which can be most efficiently processed on the target system (has the same representation and alignment as either signed char or unsigned char, but is always a distinct type). Multibyte characters strings use this type to represent code units. For every value of type unsigned char in range [0, 255], converting the value to char and then back to unsigned char produces the original value. (since C++11) The signedness of char depends on the compiler and the target platform: the defaults for ARM and PowerPC are typically unsigned, the defaults for x86 and x64 are typically signed.
and
https://en.cppreference.com/w/cpp/string/multibyte
But I didn't find a direct explanation there.
My questions. For what purpose are some characters represented by negative numbers? Is it in the standard or is it system-specific?
char is signed in your compiler.
char can be either signed or unsigned, and it is up to the implementation (i.e. the compiler) to decide. As mentioned, your compiler has decided that char is signed on your system. This is the reason why char, signed char and unsigned char are considered three distinct types, not two as is the case for every other integer type.
char in C++ is likely to be signed char and use the latter. But UTF-8 does not work with "numbers".
C++20 added char8_t (similar to, but distinct from, unsigned char) and std::u8string, specifically to deal with UTF-8, bringing it on par with C++11's char16_t/std::u16string for UTF-16 and char32_t/std::u32string for UTF-32.