I'm happy to see the std::u16string and std::u32string in C++11, but I'm wondering why there is no std::u8string to handle the UTF-8 case. I'm under the impression that std::string is intended for UTF-8, but it doesn't seem to do it very well. What I mean is, doesn't std::string.length() still return the size of the string's buffer rather than the number of characters in the string?
So, how is the length() method of the standard strings defined for the new C++11 classes? Do they return the size of the string's buffer, the number of codepoints, or the number of characters (assuming a surrogate pair is 2 code points but one character; please correct me if I'm wrong)?
And what about size(); isn't it equal to length()?
See http://en.cppreference.com/w/cpp/string/basic_string/length for the source of my confusion.
So, I guess, my fundamental question is how does one use std::string, std::u16string, and std::u32string and properly distinguish between buffer size, number of codepoints, and number of characters? If you use the standard iterators, are you iterating over bytes, codepoints, or characters?
std::string works as well for UTF-8 as u16string does for UTF-16: it handles elements of the corresponding type, and doesn't deal with characters that are represented by a sequence of more than one element.
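To make that concrete, here is a minimal sketch of what length() and size() report for each string type, next to two helpers that count code points by hand. The helpers count_code_points_utf8 and count_code_points_utf16 and the function demo are hypothetical names made up for this illustration, not standard library functions, and they assume the input is already valid UTF-8 or UTF-16. The UTF-8 bytes are written as escapes so the example doesn't depend on the source file's encoding.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not in the standard library): counts code points in a
// std::string that is ASSUMED to contain valid UTF-8.
std::size_t count_code_points_utf8(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char byte : s) {
        // Continuation bytes have the form 10xxxxxx; any other byte starts a code point.
        if ((byte & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Hypothetical helper: counts code points in a std::u16string that is
// ASSUMED to contain valid UTF-16.
std::size_t count_code_points_utf16(const std::u16string& s) {
    std::size_t n = 0;
    for (char16_t unit : s) {
        // Low (trailing) surrogates 0xDC00..0xDFFF complete a pair; don't count them.
        if (unit < 0xDC00 || unit > 0xDFFF) ++n;
    }
    return n;
}

void demo() {
    const std::string    utf8  = "\xC3\xA9";      // U+00E9 encoded as two UTF-8 bytes
    const std::u16string utf16 = u"\U0001F600";   // U+1F600 encoded as a surrogate pair
    const std::u32string utf32 = U"\U0001F600";   // U+1F600 as a single 32-bit element

    // length() and size() are the same thing: the number of elements (code units).
    assert(utf8.length() == 2 && utf8.size() == 2);
    assert(utf16.length() == 2);
    assert(utf32.length() == 1);

    // Counting code points requires decoding the multi-element sequences.
    assert(count_code_points_utf8(utf8) == 1);
    assert(count_code_points_utf16(utf16) == 1);
}
```

The standard iterators likewise step over elements (char, char16_t, char32_t), not code points. Note that neither helper counts user-perceived characters: a base letter followed by a combining mark is two code points even in a u32string, and handling that correctly needs a Unicode-aware library rather than anything in the standard string classes.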