3

I have a question about how string literals are stored in memory in C++. I know that a char is stored as its ASCII code, but I am interested in the Unicode character set. The reason is that I am trying to deal with some locales. Let us assume that what I want to do is convert lower-case characters to upper case. This works in the Xcode terminal:

#include <iostream>
#include <locale>   // std::locale, std::use_facet, std::ctype
#include <string>

using namespace std;

int main()
{
    wcout.imbue(std::locale("sv_SE.Utf-8"));
    const std::ctype<wchar_t>& f = std::use_facet< std::ctype<wchar_t> >(std::locale("sv_SE.Utf-8"));

    wstring str {L"åäö"}; // Swedish letters

    // Convert the whole string to upper case in place.
    f.toupper(&str[0], &str[0] + str.size());

    std::wcout << str.length() << std::endl;
    std::wcout << str << std::endl;
}

Output:
3
ÅÄÖ

However, when I run it in the OS X terminal I get rubbish:

Output:
3
ÅÄÖ

Further, when I prompt the user for input instead,

#include <iostream>
#include <locale>   // std::locale, std::use_facet, std::ctype
#include <string>

using namespace std;

int main()
{
    wcin.imbue(std::locale(""));   // read input using the user's environment locale
    wcout.imbue(std::locale("sv_SE.Utf-8"));
    const std::ctype<wchar_t>& f = std::use_facet< std::ctype<wchar_t> >(std::locale("sv_SE.Utf-8"));

    //wstring str {L"åäö"};
    wcout << "Write something>> ";
    wstring str;
    getline(wcin, str);

    // Convert the whole string to upper case in place.
    f.toupper(&str[0], &str[0] + str.size());

    std::wcout << str.length() << std::endl;
    std::wcout << str << std::endl;
}

I get rubbish from the Xcode terminal:

Output:
Write something>> åäö
6
åäö

And the OS X terminal actually hangs when I use these letters. It is possible to make the wcin stream assume the C encoding with wcin.imbue(std::locale());, which still gives the same output in Xcode but gives the following in the OS X terminal:

Output:
Write something>> åäö
3
ŒŠš

So the problem is quite clearly related to encodings. What I wonder is how string literals are actually stored in memory in C++. This can be split into two cases.

Case 1: A string literal typed in source code, e.g. wstring str {L"åäö"};.

Case 2: A string entered via standard input stream (wcin in this case).

These two cases do not necessarily store the strings in the same way. I know that Unicode is a character set and that UTF-8 is an encoding, so what I wonder is whether string literals are encoded when stored in memory and, if so, how.

Further, if anyone knows how to automatically identify the encoding used by the current terminal, that would be great.

BR Patrik

EDIT

I got some comments which, even though some of them are good, are not exactly related to the question. This means that the question probably needs some clarification. The question can be seen as a generalization of the fairly ill-formulated question:

"Can I assume that string literals are stored as their Unicode code points in memory?"

This question is badly formulated for at least two reasons. First, it makes an assumption about how string literals are stored (as their Unicode code points). This means that the answer must relate to Unicode, even though this relation may be completely pointless. Further, it is a yes-or-no question, which gives no help if the answer is no.

I also understand that this can be tested by converting each code point to its integer equivalent and printing it, but that would require testing the entire Unicode character set, which seems unreasonable.

  • 2
    If you use UTF-8, you should use string, cout etc. rather than the w- equivalents. Commented Oct 16, 2015 at 6:22
  • @el.pescado That is what I have read. The problem is that the letters åäö do not fit in a single char. This gives me the incorrect length of the string. Do you mean that I should split these problems into two and handle them separately? Further, what is the reason why this is appropriate? Commented Oct 16, 2015 at 6:25
  • 2
    "The problem is that the letters åäö do not fit in a single char" - that's the point of the UTF-8 encoding: to fit those letters into multiple chars. It's best to treat length() as "number of bytes", as it is broken anyway. See utf8everywhere.org and programmers.stackexchange.com/questions/102205/… Commented Oct 16, 2015 at 7:03
  • @el.pescado I understand, but that still does not solve the most immediate problem and now toupper conversion does not work at all. Do you have any input? And how about the other problem, how string literals are stored? Commented Oct 16, 2015 at 10:26
  • 1
    Basically, you don't want non-ASCII characters in your source code - it is difficult to predict how they'll end up in the executable binary. It depends on a) what encoding your text editor saves the source file with, b) what encoding your compiler believes the source file is in, and c) what encoding the compiler believes the executable should use. Specify explicit codepoints, via \xHH or \uHHHH notation, or put such strings into some kind of a resource file, to be loaded at run-time (the latter would also help with localization). Commented Oct 16, 2015 at 20:22

2 Answers

2

First, the way the source file is interpreted as a sequence of characters is implementation-defined. You have to consult your compiler documentation to determine this.

Second, the character set used for the literals in the compiled program is also implementation-defined. So again you have to consult your compiler documentation for this.

What's likely to happen when you insert non-ASCII characters (and possibly even ASCII ones) is that different compilers interpret them differently. You have to check that the compilers you use can actually handle the same source encoding; the encoding most likely to work portably is UTF-8.

In addition, you would probably be better off using UTF-8-encoded text for most of the program (only code near an API that requires wchar_t would need to handle the strings that way).

Bottom line: make sure your compiler stores the string literal verbatim, use ordinary (narrow) strings, and use an editor that saves in UTF-8 encoding.

0

There's good background on this subject in the string literal page on cppreference:

https://en.cppreference.com/w/cpp/language/string_literal

I landed on this question not for the matter of bytes and encoding storage but about where in memory they live, which is in the static memory of the app:

String literals have static storage duration, and thus exist in memory for the life of the program.
