I'm trying to convert the string "pokémon" from std::string to std::wstring using

std::wstring wsTmp(str.begin(), str.end());

This works on Windows, but on Linux it returns "pok\xffffffc3\xffffffa9mon"

How can I make it work on Linux?

  • C++ isn't great with different character encodings. Getting a dedicated library can be very helpful. Commented Jun 24, 2022 at 3:31
  • "This works on Windows" - no, it doesn't, actually. All that constructor does is copy each char as-is to wchar_t, extending the value from 8 bits to 16 bits on Windows or 32 bits on POSIX. There is no encoding conversion performed. What is the actual encoding of the std::string? ANSI (system locale)? UTF-8? It makes a BIG difference in how the data needs to be converted to std::wstring properly. Commented Jun 24, 2022 at 6:18
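
To see the byte-for-byte copy described in the comment above, here is a minimal sketch. The helper names widen_bytewise and pokemon_utf8 are hypothetical, and the input is assumed to be UTF-8, which the question does not confirm.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: reproduces the conversion used in the question.
std::wstring widen_bytewise(const std::string& s) {
    // The iterator-range constructor converts each char to wchar_t
    // individually; no decoding of multi-byte sequences takes place.
    std::wstring w(s.begin(), s.end());
    assert(w.size() == s.size());  // always one wchar_t per input byte
    return w;
}

// "pokémon" spelled as explicit UTF-8 bytes (é is 0xC3 0xA9), so the
// contents do not depend on how the source file is encoded.
std::string pokemon_utf8() {
    return "pok\xC3\xA9mon";
}
```

With this input, widen_bytewise(pokemon_utf8()) has 8 elements rather than 7, and element 3 is derived from the byte 0xC3, not from the code point U+00E9. On platforms where char is signed, that byte is negative and is sign-extended when widened, which would explain the \xffffffc3 in the question's output.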

2 Answers

This worked for me on POSIX.

#include <codecvt>
#include <string>
#include <locale>

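int main() {
    // Assumes this source file is saved, and compiled, as UTF-8.
    std::string a = "pokémon";

    // Decode the UTF-8 bytes into wide characters. from_bytes throws
    // std::range_error if the input is not valid UTF-8.
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> cv;
    std::wstring wide = cv.from_bytes(a);

    return 0;
}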

The wstring holds the correct string at the end.

Important note by @NathanOliver: std::codecvt_utf8_utf16 was deprecated in C++17 and may be removed from the standard in a future version.

3 Comments

Do note that std::codecvt_utf8_utf16 was deprecated in C++17 and may be removed from the standard in a future version.
I will add this to my answer. Thank you. OP did not specify a C++ version though...
This example will work correctly only if the .cpp file is saved as UTF-8, and the compiler parses the file as UTF-8. Consider using the u8 prefix on the string literal to force it to UTF-8, even if the file is not using UTF-8, e.g.: std::string a = u8"pokémon"; But whatever charset the .cpp is actually encoded in, make sure the compiler is set up to interpret the file in that same charset.
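
One way to sidestep the source-encoding concern is to spell the non-ASCII bytes as escapes. This is a sketch, not part of the original answer: the helper names utf8_to_wide and convert_pokemon are hypothetical, and the code still relies on the deprecated <codecvt> facilities, so it may produce deprecation warnings. Note also that since C++20 a u8 literal has type const char8_t[], so std::string a = u8"pokémon"; no longer compiles there.

```cpp
#include <cassert>
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical wrapper around the conversion from the answer above.
std::wstring utf8_to_wide(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> cv;
    return cv.from_bytes(utf8);  // throws std::range_error on malformed input
}

// Hypothetical demo: builds the input from explicit bytes and converts it.
std::wstring convert_pokemon() {
    // é written as its two UTF-8 bytes, so the result does not depend on
    // the encoding of the .cpp file or on compiler charset flags.
    const std::string a = "pok\xC3\xA9mon";
    assert(a.size() == 8);  // 7 characters, 8 bytes
    return utf8_to_wide(a);
}
```

Here convert_pokemon() yields 7 wide characters, with U+00E9 at index 3.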

The problem you seem to be running into here is that the conversion treats each of the é's two UTF-8 code units as a separate code point. There's no good way to do this with the standard library past C++17, as std::wstring_convert was deprecated without being given a proper replacement. You have several options, none of them great:

  1. Use the deprecated std::wstring_convert and ignore the deprecation warnings and the fact that it may be removed in a future revision of C++.
  2. Implement your own widening conversion routine (you could use icu4c's BreakIterator to assist with this).
  3. Use a heavier library like Boost.Locale to do all the heavy lifting for you.
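
As a rough illustration of option 2, here is a minimal hand-written decoder, assuming the input is UTF-8. The name decode_utf8 is hypothetical. It checks lead and continuation bytes but does not reject overlong encodings, surrogate code points, or values above U+10FFFF, so treat it as a sketch rather than a production routine.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into UTF-32 code points.
std::u32string decode_utf8(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected

        // The high bits of the lead byte give the sequence length.
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (extra > in.size() - i - 1)
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
        i += extra + 1;
    }
    assert(i == in.size());  // every byte was consumed exactly once
    return out;
}
```

Where wchar_t is 32 bits wide, as is typical on Linux, the result can be copied into a std::wstring with std::wstring(out.begin(), out.end()). Where wchar_t is 16 bits, code points above U+FFFF would additionally need to be split into surrogate pairs.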

Also somewhat unrelated, but if you care about consistency across different platforms you should be using std::u16string or std::u32string. std::wstring's character size depends on the size of wchar_t, which varies between different compilers and platforms.
