
I'm getting 16 bits from a struct in memory, and I need to convert them into a string. The 16 bits represent a unicode char:

typedef struct my_struct {
    unsigned    unicode     : 16;
} my_struct;

I started by casting the bits into an unsigned char, which worked for values small enough to fit in one char. However, for characters like '♪', it truncates incorrectly. This is what I have so far:

        char buffer[2] = { 0 };
        wchar_t wc[1] = { 0 };

        wc[0] = page->text[index].unicode;
        std::cout << wc[0] << std::endl; //PRINT LINE 1
        int ret = wcstombs(buffer, wc, sizeof(buffer));
        if(ret < 0)
            printf("SOMETHING WENT WRONG \n");
        std::string my_string(buffer);
        printf("%s \n", my_string.c_str()); //PRINT LINE 2

Print line 1 currently prints: "9834" and print line 2 prints: "" (empty string). I'm trying to get my_string to contain '♪'.

  • You can't fit 16 bits into 8 bits without losing something. Your choices are to convert from (apparently) UTF-16 to UTF-8 (which uses multiple 8-bit characters to hold one 16-bit code unit) or leave it in UTF-16 (e.g., std::wstring holds units of wchar_t, which may be UTF-16). If neither of those works, you could instantiate std::basic_string over your my_struct directly: std::basic_string<my_struct> whatever; Commented Jul 29, 2013 at 18:33
  • You can't put 16 pounds of flour in an 8 pound sack. Commented Jul 29, 2013 at 18:35
  • @Jerry Coffin: a bit pedantic, but std::*string doesn't store (or care about) character encoding. Even if wchar_t is 16-bit, it could be UCS-2. In general, you want either UCS-4 or UTF-8. UTF-16 combines the disadvantages of both with no gain. Commented Jul 29, 2013 at 18:41
  • @DanielKO: I certainly wouldn't recommend UTF-16 as a general rule -- that simply reflects the OP's use of 16 bits. UCS-2 has been obsolete for a long time now. Commented Jul 29, 2013 at 18:46
  • @mirandak: Unless the library is really old (and hasn't been updated within the last decade or so), it's probably UTF-16 rather than UCS-2. Commented Jul 29, 2013 at 18:50

3 Answers


If I've done my conversion correctly, 0x9834 in UTF-16 (16-bit Unicode) translates to the three-byte sequence 0xE9, 0xA0, 0xB4 in UTF-8 (8-bit Unicode). I don't know about other narrow multibyte encodings, but I doubt any would be shorter than 2 bytes. You pass a buffer of two bytes to wcstombs, which means a returned string of at most 1 byte. wcstombs stops translating (without failing!) when there's no more room in the destination buffer. You've also failed to L'\0'-terminate the input buffer. It's not a problem at the moment, because wcstombs will stop translating before it gets there, but you should normally add the extra L'\0'.

So what to do:

First and foremost, when debugging this sort of thing, look at the return value of wcstombs. I'll bet it's 0, because of the lack of space.

Second, I'd give myself a little bit of margin. Legal Unicode can result in up to four bytes in UTF-8, so I'd allocate at least 5 bytes for the output (don't forget the trailing '\0'). Along the same lines, you need a trailing L'\0' for the input. So:

char buffer[ 5 ];
wchar_t wc[] = { page->text[index].unicode, L'\0' };
int ret = wcstombs( buffer, wc, sizeof( buffer ) );
if ( ret < 1 ) {    //  And *not* 0
    std::cerr << "OOPS\n";
}
std::string str( buffer, buffer + ret );
std::cout << str << '\n';

Of course, after all that, there is still the question of what the (final) display device does with UTF-8 (or whatever the multi-byte narrow character encoding is---UTF-8 is almost universal under Unix, but I'm not sure about Windows.) But since you say that displaying "\u9834" seems to work, it should be alright.


7 Comments

The Windows console can display UTf-8 in theory, but getting it to actually do so is tricky.
I know you can't peer into my computer, but with this code wcstombs returns -1 once a char with a value > 127 comes up. edit: err, not a char, but you know what I mean
Think it was a locale issue, because I slapped "setlocale(LC_ALL, "");" in there and it suddenly worked! Now to figure out what locale I actually need... But thanks!!!
The 9834 value from the question appears to be decimal. The music note shown is U+266A (which happens to be hexadecimal for 9834).
@mirandak Yes. wcstombs is locale sensitive, and will probably not translate characters greater than 127 in the default "C" locale. I should have mentioned that. (But the fact that you didn't mention getting an error from it, and that you could display "\9834" led me to believe that you had these aspects sorted out.)

Please read a bit about what "character encoding" means, like this: What is character encoding and why should I bother with it

Then figure out what encoding you are getting in, and what encoding you need to use on the output. That means figuring out what your file format / GUI library / console is expecting.

Then use something reliable like libiconv to convert between them, instead of the so-implementation-defined-that-is-almost-useless wcstombs()+wchar_t.

For example, you might find that your input is UCS-2 and you need to output UTF-8. My system has a 32-bit wchar_t; I wouldn't count on wcstombs converting from UCS-2 to UTF-8.



To convert from UTF-16 to UTF-8, use std::codecvt_utf8_utf16<char16_t> with std::wstring_convert:

#include <iostream>
#include <string>
#include <locale>
#include <codecvt>

int main() {
    char16_t wstr16[2] = {0x266A, 0};
    auto conv = std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t>{};
    auto u8str = std::string{conv.to_bytes(wstr16)};
    std::cout << u8str << '\n';
}

5 Comments

What's the point of auto u8str = std::string{ conv.to_bytes( wstr16 ) };, rather than std::string u8str( conv.to_bytes( wstr16 ) );, except maybe obfuscation?
@JamesKanze it's AAA style: herbsutter.com/2013/06/13/…
@mirandak you're using embedded Unicode codepoints in your comment std::string s("\u266A"); , which are a C++11 feature.
@ecatmur Another anti-pattern. If you don't want to name the types, use Python. But except in a few particular cases, you do want to name the type, so that the reader has some idea of what is going on. AAA is just bad engineering.
@ecatmur They're in my copy of the C++98 standard (and in C90 as well).
