I have a question about how string literals are stored in memory in C++. I know that a char is stored according to its ASCII code, but what I am really after is the Unicode character set, because I am trying to deal with locales. Let us assume that what I am trying to do is convert lower-case characters to upper case. This works in the Xcode terminal:
#include <iostream>
#include <string>
#include <cctype>
#include <clocale>
using namespace std;

int main()
{
    wcout.imbue(std::locale("sv_SE.Utf-8"));
    // Fetch the ctype facet that knows the Swedish upper-/lower-case mapping.
    const std::ctype<wchar_t>& f = std::use_facet<std::ctype<wchar_t>>(std::locale("sv_SE.Utf-8"));
    wstring str {L"åäö"}; // Swedish letters
    f.toupper(&str[0], &str[0] + str.size()); // convert in place
    std::wcout << str.length() << std::endl;
    std::wcout << str << std::endl;
}
Output:
3
ÅÄÖ
However, when I try to run it in the OS X terminal I get rubbish:
Output:
3
ÅÄÖ
Further, when I prompt the user for input instead,
#include <iostream>
#include <string>
#include <cctype>
#include <clocale>
using namespace std;

int main()
{
    wcin.imbue(std::locale(""));  // read input with the user's preferred locale
    wcout.imbue(std::locale("sv_SE.Utf-8"));
    const std::ctype<wchar_t>& f = std::use_facet<std::ctype<wchar_t>>(std::locale("sv_SE.Utf-8"));
    //wstring str {L"åäö"};
    wcout << "Write something>> ";
    wstring str;
    getline(wcin, str);
    f.toupper(&str[0], &str[0] + str.size()); // convert in place
    std::wcout << str.length() << std::endl;
    std::wcout << str << std::endl;
}
I get rubbish in the Xcode terminal:
Output:
Write something>> åäö
6
åäö
And the OS X terminal actually hangs when I use these letters. It is possible to make the wcin stream assume the C encoding with wcin.imbue(std::locale());, which still gives the same output in Xcode, but gives the following in the OS X terminal:
Output:
Write something>> åäö
3
ŒŠš
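One more variant I could still try (just a sketch of mine, assuming the terminal's LANG/LC_ALL actually name a UTF-8 locale, which I have not verified): build the locale from the environment, make it global, and imbue both streams from it, so that input and output at least agree on one encoding.

#include <iostream>
#include <locale>
#include <string>
using namespace std;

int main()
{
    locale env{""};        // whatever LANG/LC_ALL select in this terminal
    locale::global(env);   // some implementations consult the global locale too
    wcin.imbue(env);
    wcout.imbue(env);
    const ctype<wchar_t>& f = use_facet<ctype<wchar_t>>(env);
    wcout << "Write something>> ";
    wstring str;
    getline(wcin, str);
    f.toupper(&str[0], &str[0] + str.size()); // convert in place
    wcout << str.length() << endl;
    wcout << str << endl;
}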
So the problem is quite clearly related to encodings, and what I wonder is how string literals are actually stored in memory in C++. This can be split into two different cases.
Case 1: A string literal typed in source code, e.g. wstring str {L"åäö"};.
Case 2: A string entered via the standard input stream (wcin in this case).
These two cases do not necessarily store the strings in the same way. I know that Unicode is a character set and that UTF-8 is an encoding, so what I wonder is rather whether string literals are encoded when they are stored in memory, and if so, how.
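To make the question concrete, here is a sketch of how I imagine one could inspect what is stored (assuming the compiler reads the source as UTF-8, and that wchar_t is 32 bits as on OS X):

#include <cstdio>
#include <string>

int main()
{
    std::wstring wide {L"åäö"};  // Case 1, wide literal
    std::string narrow {"åäö"};  // the same text as a narrow literal

    for (wchar_t c : wide)       // one value per letter?
        std::printf("%#lx ", (unsigned long)c);
    std::printf("\n");
    for (unsigned char c : narrow)  // or several bytes per letter?
        std::printf("%#x ", (unsigned)c);
    std::printf("\n");
}

If the wide literal really holds the code points, this should print 0xe5 0xe4 0xf6, while the narrow literal should show a UTF-8 byte sequence starting 0xc3 0xa5, but these expected values are exactly the kind of assumption I would like to avoid making.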
Further, if anyone knows how to identify the encoding used by the current terminal automatically, that would be great.
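The closest thing I have found for this (a sketch, POSIX-specific) is to ask the C library which codeset the environment selects; whether the terminal actually renders in that codeset is still an assumption:

#include <clocale>
#include <cstdio>
#include <langinfo.h>

int main()
{
    std::setlocale(LC_ALL, "");                 // adopt the locale from LANG/LC_ALL
    std::printf("%s\n", nl_langinfo(CODESET));  // prints e.g. "UTF-8" or "ISO8859-1"
}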
BR Patrik
EDIT
I have received some comments which, even though some of them are good, are not exactly related to the question. This means that the question probably needs some clarification. The question can be seen as a generalization of the fairly ill-formulated question:
"Can I assume that string literals are stored with their unicode pointcode in memory?"
This question is badly formulated for at least two reasons. First, it makes an assumption about how the string literals are stored (with their Unicode code points). This means that the answer must relate to Unicode, even though this relation may be completely pointless. Further, it is a yes-or-no type of question, which gives no help at all in case the answer is no.
I also understand that this can be tested by converting the code points to their integer equivalents and printing them, but that would require testing against the entire Unicode character set (which seems an unreasonable way of doing it).
Some of the comments, for reference:

Comment: "… utf8, you should use string, cout etc. rather than the w-equivalents. åäö does not fit into a single char."
My reply: "This gives me the incorrect length of the string. Do you mean that I should split these problems into two and handle them separately? Further, what is the reason why this is appropriate?"
Comment: "'åäö does not fit into a single char' - that is the point of the utf8 encoding: to fit those letters into multiple chars. It is best to treat length() as 'number of bytes', as it is broken anyway. See utf8everywhere.org and programmers.stackexchange.com/questions/102205/…"
Comment: "… use \xHH or \uHHHH notation, or put such strings into some kind of a resource file, to be loaded at run-time (the latter would also help with localization)."
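Following the last comment, a small sketch of the \uHHHH alternative (assuming a C++11 compiler): writing the letters as universal character names keeps the source file pure ASCII, so the source file encoding stops mattering for Case 1.

#include <string>

int main()
{
    // åäö spelled with their Unicode code points instead of raw bytes
    std::wstring str {L"\u00e5\u00e4\u00f6"};
}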