In Java, a String has methods:

length()/charAt(), codePointCount()/codePointAt()

C++11 has std::string a = u8"很烫烫的一锅汤";

but a.size() is the length of the underlying char array (the number of bytes), so it cannot be used to index Unicode characters.

Is there any solution for handling Unicode in a C++ string?

  • Have you checked this answer?: stackoverflow.com/a/31475700/58129 Commented Apr 9, 2017 at 2:06
  • I usually convert UTF-8 to a UTF-32/UCS-2 std::wstring so that each code point is one character. There is code to convert in this answer here: stackoverflow.com/questions/42791433/… Otherwise, use a library. Commented Apr 9, 2017 at 2:24
  • UCS-2 does not have room for all Chinese characters. Commented Apr 9, 2017 at 3:52
  • @RickJames: Galik likely meant UTF-16 instead. Commented Apr 10, 2017 at 20:42
  • UTF-16 does not have room for all Chinese characters in a single 'character'. So a.size() will (I think) be incorrect. Commented Apr 11, 2017 at 5:22

1 Answer

I generally convert the UTF-8 string to a wide UTF-32/UCS-2 string before doing character operations. C++ does provide functions for this, but they are not very user-friendly, so I have written some nicer conversion functions here:

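#include <codecvt>   // std::codecvt_utf8
#include <iostream>
#include <locale>    // std::wstring_convert
#include <stdexcept> // std::runtime_error
#include <string>

// These convert to and from whatever wide character encoding
// std::codecvt_utf8 produces for the platform's wchar_t
// (UTF-32 on Linux, UCS-2 on Windows)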
std::string ws_to_utf8(std::wstring const& s)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> cnv;
    std::string utf8 = cnv.to_bytes(s);
    if(cnv.converted() < s.size())
        throw std::runtime_error("incomplete conversion");
    return utf8;
}

std::wstring utf8_to_ws(std::string const& utf8)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> cnv;
    std::wstring s = cnv.from_bytes(utf8);
    if(cnv.converted() < utf8.size())
        throw std::runtime_error("incomplete conversion");
    return s;
}

int main()
{
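    // Note: since C++20 a u8 literal is char8_t-based and can no longer
    // initialize a std::string directly; this line targets C++11 to C++17
    std::string s = u8"很烫烫的一锅汤";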

    auto w = utf8_to_ws(s); // convert to wide (UTF-32/UCS-2)

    // now we can use code-point indexes on the wide string

    std::cout << s << " is " << w.size() << " characters long" << '\n';
}

Output:

很烫烫的一锅汤 is 7 characters long
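
If only the number of code points is needed (roughly what Java's codePointCount() gives), the conversion can be skipped. The following is a minimal sketch, not part of the code above: utf8_code_point_count is a name made up for illustration, and it assumes the input is already valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a UTF-8 encoded
// std::string by counting every byte that is not a continuation byte
// (continuation bytes have the bit pattern 10xxxxxx).
// Assumption: the input is valid UTF-8; nothing is validated here.
std::size_t utf8_code_point_count(std::string const& utf8)
{
    std::size_t count = 0;
    for (unsigned char byte : utf8)
        if ((byte & 0xC0) != 0x80) // lead byte or plain ASCII
            ++count;
    return count;
}
```

This only gives a count. Random access by code-point index still needs either a scan from the start of the string or a conversion to a fixed-width encoding as shown above.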

If you want to convert to and from UTF-32 regardless of platform, you can use the following (not so well tested) conversion routines:

std::string utf32_to_utf8(std::u32string const& utf32)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> cnv;
    std::string utf8 = cnv.to_bytes(utf32);
    if(cnv.converted() < utf32.size())
        throw std::runtime_error("incomplete conversion");
    return utf8;
}

std::u32string utf8_to_utf32(std::string const& utf8)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> cnv;
    std::u32string utf32 = cnv.from_bytes(utf8);
    if(cnv.converted() < utf8.size())
        throw std::runtime_error("incomplete conversion");
    return utf32;
}
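
To illustrate why a 16-bit encoding and UTF-32 can disagree about the length of the same text, here is a small sketch. The names utf16_units and utf16_length are made up for this example, and it assumes every element of the input is a valid code point.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: number of UTF-16 code units needed for one code
// point. Code points above U+FFFF are encoded as a surrogate pair.
std::size_t utf16_units(char32_t cp)
{
    return cp > 0xFFFF ? 2 : 1;
}

// Hypothetical helper: length the same text would have as a UTF-16
// string. The UTF-32 length is simply utf32.size(), because UTF-32
// uses exactly one unit per code point.
std::size_t utf16_length(std::u32string const& utf32)
{
    std::size_t units = 0;
    for (char32_t cp : utf32)
        units += utf16_units(cp);
    return units;
}
```

For the seven Chinese characters in the example both lengths are 7, but for a string holding only U+1F4A9 the UTF-32 size is 1 while utf16_length returns 2.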

NOTE: As of C++17 std::wstring_convert is deprecated.

However, I still prefer it over a third-party library: it is portable, it avoids external dependencies, it won't be removed until a replacement is provided, and in any case it is easy to replace the implementations of these functions without changing the code that uses them.
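
As a rough idea of what such a replacement could look like, here is a hand-written UTF-8 decoder that does not use std::wstring_convert. It is a sketch under my own assumptions rather than the code used above: utf8_to_utf32_manual is a made-up name, and the error messages are arbitrary.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for utf8_to_utf32 without std::wstring_convert.
// Throws std::runtime_error on malformed input (bad lead or continuation
// byte, truncated sequence, overlong form, surrogate, or > U+10FFFF).
std::u32string utf8_to_utf32_manual(std::string const& utf8)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < utf8.size())
    {
        unsigned char lead = static_cast<unsigned char>(utf8[i]);
        char32_t cp;        // code point being assembled
        std::size_t len;    // total bytes in this sequence
        char32_t smallest;  // smallest code point allowed for this length

        if (lead < 0x80)                { cp = lead;        len = 1; smallest = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; smallest = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; smallest = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; smallest = 0x10000; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (len > utf8.size() - i)
            throw std::runtime_error("truncated UTF-8 sequence");

        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 1; k < len; ++k)
        {
            unsigned char c = static_cast<unsigned char>(utf8[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < smallest || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::runtime_error("invalid code point in UTF-8 input");

        out.push_back(cp);
        i += len;
    }
    return out;
}
```

Because the signature matches utf8_to_utf32, callers would not need to change if the body were swapped in.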


10 Comments

Cool, but I have seen some discussions saying that on some platforms wchar_t can be 16 bits wide rather than 32. That can cause errors when indexing characters in Unicode strings.
@zhaochenyou This should convert correctly for each platform. On Windows it will create 2-byte wchar_t characters encoded in UCS-2, and on Linux it will create 4-byte wchar_t characters encoded in UTF-32.
This will work well until someone gives you a string with a '💩' character in it. Then you'll get different lengths on different platforms.
@MilesBudnek I have added code to convert to UTF-32 regardless of platform, which, I assume, should fix any problems a 2-byte character encoding may have. (Your character works fine on Linux; unfortunately I can't test on Windows.)
Yes, all currently existing Unicode code points will fit into a single UTF-32 unit.