Any good solutions for C++ string code point and code unit?

Question

In Java, a String has methods:

length()/charAt(), codePointCount()/codePointAt()

C++11 has std::string a = u8"很烫烫的一锅汤";

but a.size() is the length of the underlying char array (the number of UTF-8 code units), so it cannot be used to count or index Unicode characters.

Are there any solutions for Unicode in C++ strings?

Have you checked this answer? stackoverflow.com/a/31475700/58129
– Anthony Kong, Commented Apr 9, 2017 at 2:06
I usually convert UTF-8 to a UTF-32/UCS-2 std::wstring so that each code point is one character. There is code to convert in this answer: stackoverflow.com/questions/42791433/… Otherwise, use a library.
– Galik, Commented Apr 9, 2017 at 2:24
UTF-16 does not have room for all Chinese characters in a single 'character', so a.size() will (I think) be incorrect.
– Rick James, Commented Apr 11, 2017 at 5:22

Galik · Accepted Answer · 2018-12-03 13:37:14Z

I generally convert the UTF-8 string to a wide UTF-32/UCS-2 string before doing character operations. C++ does actually give us functions to do that, but they are not very user friendly, so I have written some nicer conversion functions here:

#include <codecvt>
#include <iostream>
#include <locale>
#include <stdexcept>
#include <string>

// These convert to and from whatever the wide character encoding
// is for the platform (UTF-32 on Linux, UCS-2 on Windows)
std::string ws_to_utf8(std::wstring const& s)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> cnv;
    std::string utf8 = cnv.to_bytes(s);
    if(cnv.converted() < s.size())
        throw std::runtime_error("incomplete conversion");
    return utf8;
}

std::wstring utf8_to_ws(std::string const& utf8)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> cnv;
    std::wstring s = cnv.from_bytes(utf8);
    if(cnv.converted() < utf8.size())
        throw std::runtime_error("incomplete conversion");
    return s;
}

int main()
{
    std::string s = u8"很烫烫的一锅汤";

    auto w = utf8_to_ws(s); // convert to wide (UTF-32/UCS-2)

    // now we can use code-point indexes on the wide string

    std::cout << s << " is " << w.size() << " characters long" << '\n';
}

Output:

很烫烫的一锅汤 is 7 characters long
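If only the number of code points is needed (roughly what Java's codePointCount() reports), it can also be obtained without converting the string at all. The following is a minimal sketch of that alternative, not part of the conversion functions above: the helper name utf8_code_point_count is made up for illustration, and it assumes the input is already well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from the answer above): counts code points in a
// UTF-8 string by counting every byte that is NOT a continuation byte.
// Continuation bytes have the bit pattern 10xxxxxx.
// Assumption: the input is well-formed UTF-8; no validation is done.
std::size_t utf8_code_point_count(std::string const& utf8)
{
    std::size_t count = 0;
    for (unsigned char c : utf8)
        if ((c & 0xC0) != 0x80) // lead byte or ASCII byte
            ++count;
    return count;
}
```

Calling utf8_code_point_count(s) on the string from main() above should give the same count as w.size() on a platform where wchar_t holds UTF-32. It does not give random access by code point index, so the conversion approach is still the simpler choice when indexing is needed.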

If you want to convert to and from UTF-32 regardless of platform then you can use the following (not so well tested) conversion routines:

std::string utf32_to_utf8(std::u32string const& utf32)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> cnv;
    std::string utf8 = cnv.to_bytes(utf32);
    if(cnv.converted() < utf32.size())
        throw std::runtime_error("incomplete conversion");
    return utf8;
}

std::u32string utf8_to_utf32(std::string const& utf8)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> cnv;
    std::u32string utf32 = cnv.from_bytes(utf8);
    if(cnv.converted() < utf8.size())
        throw std::runtime_error("incomplete conversion");
    return utf32;
}

NOTE: As of C++17 std::wstring_convert is deprecated.

However, I still prefer to use it over a third-party library because it is portable, it avoids external dependencies, it won't be removed until a replacement is provided, and in all cases it will be easy to replace the implementations of these functions without having to change all the code that uses them.
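As an illustration of how such a replacement might look, here is a rough sketch of utf8_to_utf32 written by hand, keeping the same signature but not using std::wstring_convert. It is an assumption about what a drop-in replacement could be, not a tested library routine; the validation choices (rejecting overlong forms, surrogates and values above U+10FFFF) are mine.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Sketch of a hand-written UTF-8 to UTF-32 decoder with the same signature
// as the wstring_convert-based version. Throws std::runtime_error on
// malformed input instead of returning a partial result.
std::u32string utf8_to_utf32(std::string const& utf8)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < utf8.size())
    {
        unsigned char lead = static_cast<unsigned char>(utf8[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // continuation bytes expected after the lead
        char32_t min_value = 0; // smallest code point valid for this length

        if (lead < 0x80)                { cp = lead;        extra = 0; min_value = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; min_value = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; min_value = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; min_value = 0x10000; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (utf8.size() - i - 1 < extra)
            throw std::runtime_error("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k)
        {
            unsigned char c = static_cast<unsigned char>(utf8[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }

        // Reject overlong encodings, surrogates and out-of-range values.
        if (cp < min_value || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::runtime_error("invalid code point in UTF-8 input");

        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

With this in place, utf8_to_utf32(s).size() plays the role of Java's codePointCount() and utf8_to_utf32(s)[i] the role of codePointAt(i), and callers do not need to change.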

edited Dec 3, 2018 at 13:37

answered Apr 9, 2017 at 2:37

Galik


10 Comments

linrongbin Over a year ago

Cool, but I have seen some discussions saying that on some platforms wchar_t can be 16 bits wide (like uint16_t) rather than 32 bits (uint32_t). That can cause errors when indexing characters in Unicode strings.

Galik Over a year ago

@zhaochenyou This should convert correctly for each platform. On Windows it will create 2-byte wchar_t characters encoded in UCS-2 and on Linux it will create 4-byte wchar_t characters encoded with UTF-32.

Miles Budnek Over a year ago

This will work well until someone goes and gives you a string with a '💩' character in it. Then you'll get different lengths on different platforms.

Galik Over a year ago

@MilesBudnek I have added code to convert to UTF-32 regardless of platform, which, I assume, should fix any problems a 2-byte encoding may have (your character works fine on Linux; unfortunately I can't test on Windows).

Miles Budnek Over a year ago

Yes, all currently existing Unicode code points will fit into a single UTF-32 unit.
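The difference discussed in these comments can be seen with a character outside the Basic Multilingual Plane such as U+1F4A9: it takes two UTF-16 code units (a surrogate pair) but one UTF-32 unit. Below is a small sketch that counts code points in a UTF-16 string; the helper name utf16_code_point_count is hypothetical and the code assumes well-formed UTF-16.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a UTF-16 string by counting
// every code unit except low (trailing) surrogates, so that each
// high/low surrogate pair is counted once.
// Assumption: the input is well-formed UTF-16 (no unpaired surrogates).
std::size_t utf16_code_point_count(std::u16string const& s)
{
    std::size_t count = 0;
    for (char16_t c : s)
        if (c < 0xDC00 || c > 0xDFFF) // not a low surrogate
            ++count;
    return count;
}
```

For std::u16string poo = u"\U0001F4A9", poo.size() is 2 while utf16_code_point_count(poo) is 1, and std::u32string(U"\U0001F4A9").size() is 1. That is why size() on a 16-bit wchar_t string can differ from the count on a 32-bit one.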
