
Here's an Ideone: http://ideone.com/vjByty.

#include <iostream>
#include <string>
using namespace std;

int main() {
    string s = "\u0001\u0001";
    cout << s.length() << endl;
    if (s[0] == s[1]) {
        cout << "equal\n";
    }
    return 0;
}

I'm confused on so many levels.

What does it mean when I type in an escaped Unicode string literal in my C++ program?

Shouldn't it take 4 bytes for 2 characters? (assuming UTF-16)

Why are the first two characters of s (first two bytes) equal?

  • It is probably compiler and operating system specific, and also depends on the version of the C++ standard. BTW, your assumption of UTF-16 is often false.
  • Could it be using UTF-8?
  • @BasileStarynkevitch not often false, always false. Unless you use the leading L on the string literal; then I suppose it's often. But that's not what we have here.
  • @MarkRansom Not necessarily always false. A platform could have a 16-bit char, with UTF-16 as the basic execution character set. (I don't know of any that do, but the standard definitely allows it.)
  • @JamesKanze yes, the standard is flexible enough to allow it. But without a concrete example, I'm sticking by my statement.

2 Answers


So the draft C++11 standard says the following about universal characters in narrow string literals (emphasis mine going forward):

Escape sequences and universal-character-names in non-raw string literals have the same meaning as in character literals (2.14.3), except that the single quote [...] In a narrow string literal, a universal-character-name may map to more than one char element due to multibyte encoding

and includes the following note:

The size of a narrow string literal is the total number of escape sequences and other characters, plus at least one for the multibyte encoding of each universal-character-name, plus one for the terminating '\0'.

Section 2.14.3 referred to above says:

A universal-character-name is translated to the encoding, in the appropriate execution character set, of the character named. If there is no such encoding, the universal-character-name is translated to an implementation-defined encoding.

If I try this example (see it live):

string s = "\u0F01\u0001";

The first universal-character-name does map to more than one char element: with a UTF-8 narrow execution character set, U+0F01 encodes to three bytes, so s.length() reports 4.
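To see this concretely, here is a minimal sketch, assuming a UTF-8 narrow execution character set (which is what GCC on Ideone uses), that also dumps the individual char elements:

#include <iostream>
#include <string>
using namespace std;

int main() {
    // Assuming a UTF-8 narrow execution character set,
    // U+0F01 encodes to three bytes (0xE0 0xBC 0x81) and U+0001 to one,
    // so length() reports 4 rather than 2.
    string s = "\u0F01\u0001";
    cout << s.length() << endl;                     // 4
    for (unsigned char c : s) {
        cout << hex << static_cast<int>(c) << ' ';  // e0 bc 81 1
    }
    cout << endl;
    return 0;
}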



What does it mean when I type in an escaped Unicode string literal in my C++ program?

To quote the standard:

A universal-character-name is translated to the encoding, in the appropriate execution character set, of the character named. If there is no such encoding, the universal-character-name is translated to an implementation-defined encoding.

Typically, the execution character set will be ASCII, which contains a character with value 1. So \u0001 will be translated into a single character with value 1.

If you were to specify non-ASCII characters, like \u263A, you might see more than one byte per character.

Shouldn't it take 4 bytes for 2 characters? (assuming UTF-16)

If it were UTF-16, yes. But string can't be encoded with UTF-16, unless char has 16 bits, which it usually doesn't. UTF-8 is a more likely encoding, in which characters with values up to 127 (that is, the whole ASCII set) are encoded with a single byte.
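As an illustration, here is a minimal sketch assuming a UTF-8 execution character set; the smiley U+263A is just one example of a non-ASCII character, not something the question's platform guarantees:

#include <iostream>
#include <string>
using namespace std;

int main() {
    // Assuming a UTF-8 execution character set:
    string one = "\u0001";     // ASCII range, one byte
    string smiley = "\u263A";  // U+263A, three bytes in UTF-8 (0xE2 0x98 0xBA)
    cout << one.length() << endl;     // 1
    cout << smiley.length() << endl;  // 3
    return 0;
}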

Why are the first two characters of s (first two bytes) equal?

With the above assumptions, they are both the character with value 1.

8 Comments

  • I don't know that UTF-8 is more likely. But even with other encodings (like ISO 8859-1), '\u0001' will translate into a single byte.
  • @JamesKanze: Indeed, if the execution character set includes 1 (which it will if it's ASCII or a superset), then \u0001 must translate to a single byte, as the answer says. Perhaps "more likely" was a poor choice of words, since apparently there are some quaint systems that still use 8-bit encodings. But I've no idea what they might do with arbitrary Unicode points, and don't really want to know.
  • I'm sorry, I made a mistake in the Ideone. Here's one with the string "\u0001\u0000": ideone.com/N6KkHt. Both characters are still treated as identical. They should both be treated as ASCII, which means they should come out as 1 and 0 respectively.
  • @batman: That's a GCC bug; \u0000 is incorrectly translated to 1. gcc.gnu.org/bugzilla/show_bug.cgi?id=53690
  • @MikeSeymour what a coincidence :P
