std::u16string, std::u32string, std::string, length(), size(), codepoints and characters

Question

I'm happy to see the std::u16string and std::u32string in C++11, but I'm wondering why there is no std::u8string to handle the UTF-8 case. I'm under the impression that std::string is intended for UTF-8, but it doesn't seem to do it very well. What I mean is, doesn't std::string.length() still return the size of the string's buffer rather than the number of characters in the string?

So, how is the length() method of the standard strings defined for the new C++11 classes? Do they return the size of the string's buffer, the number of codepoints, or the number of characters (assuming a surrogate pair is 2 code points, but one character. Please correct me if I'm wrong)?

And what about size(); isn't it equal to length()? See http://en.cppreference.com/w/cpp/string/basic_string/length for the source of my confusion.

So, I guess, my fundamental question is how does one use std::string, std::u16string, and std::u32string and properly distinguish between buffer size, number of codepoints, and number of characters? If you use the standard iterators, are you iterating over bytes, codepoints, or characters?

std::string works as well for UTF-8 as u16string does for UTF-16: it handles elements of the corresponding type, and doesn't deal with characters that are represented by a sequence of more than one element.
– Pete Becker, Commented Sep 3, 2012 at 16:30

Nicol Bolas · Accepted Answer · 2012-09-03 16:47:11Z

18

u16string and u32string are not "new C++11 classes". They're just typedefs of std::basic_string for the char16_t and char32_t types.

length is always equal to size for any basic_string. It is the number of T's in the string, where T is the template type for the basic_string.
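To make that concrete, here is a minimal sketch that stores the same code point (U+1F600) in all three string types and reports the element count. The helper names (`code_units`, `smiley_utf8` and so on) are made up for illustration, and the UTF-8 bytes are spelled out with hex escapes so the example does not depend on how `u8""` literals are typed in a given standard version.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns the number of elements (code units) in any
// basic_string. size() and length() always agree; the assert just shows it.
template <class CharT>
std::size_t code_units(const std::basic_string<CharT>& s) {
    assert(s.size() == s.length());
    return s.size();
}

// U+1F600 is a single code point, written here in three encodings.
inline std::string    smiley_utf8()  { return "\xF0\x9F\x98\x80"; }  // four char elements
inline std::u16string smiley_utf16() { return u"\U0001F600"; }       // surrogate pair: two char16_t
inline std::u32string smiley_utf32() { return U"\U0001F600"; }       // one char32_t
```

`code_units` gives 4, 2 and 1 for the three strings respectively, even though each holds exactly one code point.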

basic_string is not Unicode-aware in any way, shape, or form. It has no concept of codepoints, graphemes, Unicode characters, Unicode normalization, or anything of the kind. It is simply an ordered sequence of Ts. The only thing that is Unicode-aware about u16string and u32string is that they use the type returned by u"" and U"" literals. Thus, they can store Unicode-encoded strings, but they do nothing that requires knowledge of said encoding.

Iterators iterate over elements of T, not "bytes, codepoints, or characters". If T is char16_t, then it will iterate over char16_ts. If the string is UTF-16-encoded, then it is iterating over UTF-16 code units, not Unicode codepoints or bytes.
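As a rough illustration of the iterator behaviour described above, the sketch below records what a range-for over a u16string visits. `visited_units` is a hypothetical name, not something from the standard library.

```cpp
#include <cassert>  // only needed if you check the results with assert()
#include <string>
#include <vector>

// Collects what iteration over a u16string actually yields: one char16_t
// (UTF-16 code unit) per step, with no decoding of surrogate pairs.
std::vector<char16_t> visited_units(const std::u16string& s) {
    std::vector<char16_t> out;
    for (char16_t unit : s) {
        out.push_back(unit);
    }
    return out;
}
```

For `u"a\U0001F600"` this visits three elements: `u'a'`, then the high surrogate 0xD83D, then the low surrogate 0xDE00. The string holds two code points, but iteration takes three steps.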

edited Sep 3, 2012 at 16:47

answered Sep 3, 2012 at 16:37

Nicol Bolas



1 Comment

eonil Over a year ago

And code unit != code point. They are two different concepts. Just for later reference because I didn't know that...

Pete Becker · Accepted Answer · 2012-09-03 16:29:09Z

1

All the string types do the same thing: they hold a sequence of elements, each of whose type is the character type for the string. length() and size() both return the number of elements. Iterators iterate over elements. Higher-level analysis, such as figuring out the number of characters, requires much more complex calculations.
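As one example of such higher-level analysis, counting code points (not user-perceived characters) can be sketched by hand. The function names below are hypothetical, and both functions assume well-formed input; they do no validation.

```cpp
#include <cassert>  // only needed if you check the results with assert()
#include <cstddef>
#include <string>

// Counts code points in a UTF-8 string by skipping continuation bytes
// (those of the form 10xxxxxx). Assumes the input is well-formed UTF-8.
std::size_t count_code_points_utf8(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Counts code points in a UTF-16 string by skipping low (trailing)
// surrogates, so each surrogate pair is counted once. Assumes well-formed UTF-16.
std::size_t count_code_points_utf16(const std::u16string& s) {
    std::size_t n = 0;
    for (char16_t unit : s) {
        if (unit < 0xDC00 || unit > 0xDFFF) ++n;
    }
    return n;
}
```

For "a", "é" and U+1F600 encoded as `"a\xC3\xA9\xF0\x9F\x98\x80"`, size() is 7 while `count_code_points_utf8` gives 3. Even this is not a character count: "e" followed by the combining acute accent U+0301 is two code points but displays as one character, and handling that needs grapheme cluster segmentation, which the standard library does not provide.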

answered Sep 3, 2012 at 16:29

Pete Becker



eestrada · Accepted Answer · 2012-11-29 07:32:09Z

0

Currently there is nothing built into the standard to distinguish between code units, codepoints or individual bytes. However, there do seem to be some things in the works to deal with this sort of thing. Depending on what the standards committee decides, it may be part of TR2 or the next standard.

answered Nov 29, 2012 at 7:32

eestrada

