
I'm getting 16 bits from a struct in memory, and I need to convert them into a string. The 16 bits represent a unicode char:

typedef struct my_struct {
    unsigned    unicode     : 16;
} my_struct;

I started by casting the bits into an unsigned char, which worked for values small enough to fit in one char. However, for characters like '♪', it truncates incorrectly. This is what I have so far:

        char buffer[2] = { 0 };
        wchar_t wc[1] = { 0 };

        wc[0] = page->text[index].unicode;
        std::cout << wc[0] << std::endl; //PRINT LINE 1
        int ret = wcstombs(buffer, wc, sizeof(buffer));
        if(ret < 0)
            printf("SOMETHING WENT WRONG \n");
        std::string my_string(buffer);
        printf("%s \n", my_string.c_str()); //PRINT LINE 2

Print line 1 currently prints: "9834" and print line 2 prints: "" (empty string). I'm trying to get my_string to contain '♪'.

  • You can't fit 16 bits into 8 bits without losing something. Your choices are to convert from (apparently) UTF-16 to UTF-8 (which uses multiple 8-bit characters to hold one 16-bit code unit) or leave it in UTF-16 (e.g., std::wstring holds units of wchar_t, which may be UTF-16). If neither of those works, you could instantiate std::basic_string over your my_struct directly: std::basic_string<my_struct> whatever; Commented Jul 29, 2013 at 18:33
  • You can't put 16 pounds of flour in an 8 pound sack. Commented Jul 29, 2013 at 18:35
  • @Jerry Coffin: a bit pedantic, but std::*string doesn't store (or care about) character encoding. Even if wchar_t is 16-bit, it could be UCS-2. In general, you want either UCS-4 or UTF-8. UTF-16 combines the disadvantages of both with no gain. Commented Jul 29, 2013 at 18:41
  • @DanielKO: I certainly wouldn't recommend UTF-16 as a general rule -- that simply reflects the OP's use of 16 bits. UCS-2 has been obsolete for a long time now. Commented Jul 29, 2013 at 18:46
  • @mirandak: Unless the library is really old (and hasn't been updated within the last decade or so), it's probably UTF-16 rather than UCS-2. Commented Jul 29, 2013 at 18:50

3 Answers


If I've done my conversion correctly, 0x9834 in UTF-16 (16-bit Unicode) translates to the three-byte sequence 0xE9, 0xA0, 0xB4 in UTF-8 (8-bit Unicode). I don't know about other narrow multibyte encodings, but I doubt any would be shorter than 2 bytes. You pass a buffer of two bytes to wcstombs, which means a returned string of at most 1 byte. wcstombs stops translating (without failing!) when there's no more room in the destination buffer. You've also failed to L'\0'-terminate the input buffer. It's not a problem at the moment, because wcstombs will stop translating before it gets there, but you should normally add the extra L'\0'.

So what to do:

First and foremost, when debugging this sort of thing, look at the return value of wcstombs. I'll bet it's 0, because of the lack of space.

Second, I'd give myself a little bit of margin. Legal Unicode can result in up to four bytes in UTF-8, so I'd allocate at least 5 bytes for the output (don't forget the trailing '\0'). Along the same lines, you need a trailing L'\0' for the input. So:

char buffer[ 5 ];
wchar_t wc[] = { page->text[index].unicode, L'\0' };
int ret = wcstombs( buffer, wc, sizeof( buffer ) );
if ( ret < 1 ) {    //  And *not* 0
    std::cerr << "OOPS\n";
}
std::string str( buffer, buffer + ret );
std::cout << str << '\n';

Of course, after all that, there is still the question of what the (final) display device does with UTF-8 (or whatever the multi-byte narrow character encoding is---UTF-8 is almost universal under Unix, but I'm not sure about Windows.) But since you say that displaying "\u9834" seems to work, it should be alright.


7 Comments

The Windows console can display UTf-8 in theory, but getting it to actually do so is tricky.
I know you can't peer into my computer, but with this code wcstombs returns -1 once a char with a value > 127 comes up. edit: err, not a char, but you know what I mean
Think it was a locale issue, because I slapped "setlocale(LC_ALL, "");" in there and it suddenly worked! Now to figure out what locale I actually need... But thanks!!!
The 9834 value from the question appears to be decimal. The music note shown is U+266A (which happens to be hexadecimal for 9834).
@mirandak Yes. wcstombs is locale sensitive, and will probably not translate characters greater than 127 in the default "C" locale. I should have mentioned that. (But the fact that you didn't mention getting an error from it, and that you could display "\9834" led me to believe that you had these aspects sorted out.)

Please read a bit about what "character encoding" means, like this: What is character encoding and why should I bother with it

Then figure out what encoding you are getting in, and what encoding you need to use on the output. That means figuring out what your file format / GUI library / console is expecting.

Then use something reliable like libiconv to convert between them, instead of the so-implementation-defined-that-is-almost-useless wcstombs()+wchar_t.

For example, you might find that your input is UCS-2 and you need to output UTF-8. My system has a 32-bit wchar_t; I wouldn't count on wcstombs converting from UCS-2 to UTF-8.



To convert from UTF-16 to UTF-8, use std::codecvt_utf8_utf16<char16_t> with std::wstring_convert:

#include <iostream>
#include <string>
#include <locale>
#include <codecvt>

int main() {
    char16_t wstr16[2] = {0x266A, 0};
    auto conv = std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t>{};
    auto u8str = std::string{conv.to_bytes(wstr16)};
    std::cout << u8str << '\n';
}

5 Comments

What's the point of auto u8str = std::string{ conv.to_bytes( wstr16 ) };, rather than std::string u8str( conv.to_bytes( wstr16 ) );, except maybe obfuscation?
@JamesKanze it's AAA style: herbsutter.com/2013/06/13/…
@mirandak you're using embedded Unicode codepoints in your comment std::string s("\u266A"); , which are a C++11 feature.
@ecatmur Another anti-pattern. If you don't want to name the types, use Python. But except in a few particular cases, you do want to name the type, so that the reader has some idea of what is going on. AAA is just bad engineering.
@ecatmur They're in my copy of the C++98 standard (and in C90 as well).
