Convert std::string to Unicode in Linux

Question

EDIT I modified the question after realizing it was wrong to begin with.

I'm porting part of a C# application to Linux, where I need to get the bytes of a UTF-16 string:

string myString = "ABC";
byte[] bytes = Encoding.Unicode.GetBytes(myString);

So that the bytes array is now:

"65 00 66 00 67 00" (bytes)

How can I achieve the same in C++ on Linux? I have myString defined as a std::string, and it seems that std::wstring on Linux uses 4 bytes per character?

Your conversion is to UTF-16, your example suggests you want UTF-16LE but what are you converting from? (US-ASCII, UTF-8, ISO-8859-*) and have you chosen a unicode library to use?
– Lara Bailey, Commented May 15, 2011 at 10:59
I'm converting from the standard (UTF-16?) .NET string. Also (and sorry, I lack knowledge in this area), I don't use any other libraries other than the standard ones, which one should I use? :)
– Igal Tabachnik, Commented May 15, 2011 at 11:06
I "kinda" hacked it, by declaring an array twice the size of my string, and just setting every [i * 2] character, but that's the LAST thing I want :)
– Igal Tabachnik, Commented May 15, 2011 at 11:07
@hmemcpy: In that case your question is misleading because the string isn't "65 66 67" (bytes) to begin with. I think you need to supply more C++ context for your question seeing as you are looking for a C++ answer. If the string is already in the encoding you want you just need to read the bytes of the string one at a time.
– Lara Bailey, Commented May 15, 2011 at 11:08
@Charles You're right, of course! After re-reading the question, I realize now it's not correct - the .NET string is already UTF-16, meaning 2 bytes for every character. All I need is, actually, the implementation of GetBytes... I'll change my question
– Igal Tabachnik, Commented May 15, 2011 at 11:23

AProgrammer · Accepted Answer · 2011-05-15 16:20:14Z

4

Your question isn't really clear, but I'll try to clear up some confusion.

Introduction

Here is how character sets are handled in C (and, by inheritance, in C++) since the 1995 amendment to the C standard:

the character set used is given by the current locale
wchar_t is meant to store code points
char is meant to store a multibyte encoded form (one constraint, for instance, is that characters in the basic character set must be encoded in one byte)
string literals are encoded in an implementation-defined manner. If they use characters outside of the basic character set, you can't assume they are valid in all locales.

Thus, with a 16-bit wchar_t you are restricted to the BMP. Using UTF-16 surrogates in wchar_t is not compliant, but I think MS and IBM are more or less forced to do this because they believed Unicode when it promised to remain a 16-bit charset forever. Those who delayed their Unicode support tend to use a 32-bit wchar_t.
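To make the BMP restriction concrete, here is a minimal sketch of how UTF-16 splits a code point above U+FFFF into two 16-bit units. The helper name append_utf16 is hypothetical, and the sketch assumes a C++11 compiler for char16_t and std::u16string.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: append one Unicode code point to a UTF-16 string.
// Code points above U+FFFF do not fit in one 16-bit unit and are split
// into a high and a low surrogate.
void append_utf16(std::u16string& out, std::uint32_t cp)
{
    // Surrogate values and anything past U+10FFFF are not valid code points.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        cp -= 0x10000;                                               // 20 bits remain
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));   // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF))); // low surrogate
    }
}
```

A character such as U+1F600 therefore occupies two units (0xD83D, 0xDE00), which is why a 16-bit wchar_t cannot hold it as a single code point.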

Newer standards don't change much. Mostly, they add literals for UTF-8, UTF-16 and UTF-32 encoded strings, and types for 16-bit and 32-bit characters (char16_t and char32_t). There is little or no additional support for Unicode in the standard libraries.
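If the text is already available as UTF-16 (for instance from a u"..." literal in C++11), getting the same bytes as Encoding.Unicode.GetBytes is only a matter of writing each 16-bit unit low byte first. The following is a sketch under that assumption; the function name utf16le_bytes is made up for illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical equivalent of Encoding.Unicode.GetBytes: serialise UTF-16
// code units as little-endian bytes, independent of the host byte order.
std::vector<unsigned char> utf16le_bytes(std::u16string const& s)
{
    std::vector<unsigned char> bytes;
    bytes.reserve(2 * s.size());
    for (char16_t unit : s) {
        bytes.push_back(static_cast<unsigned char>(unit & 0xFF));        // low byte first
        bytes.push_back(static_cast<unsigned char>((unit >> 8) & 0xFF)); // then high byte
    }
    return bytes;
}
```

Calling utf16le_bytes(u"ABC") yields the six bytes 65 0 66 0 67 0 (decimal), matching the C# output in the question. Using shifts rather than a memcpy of the buffer keeps the result little-endian even on a big-endian machine.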

How to do the transformation of one encoding to the other

You have to be in a locale which uses Unicode. Hopefully

std::locale::global(std::locale(""));

will be enough for that. If not, your environment is not properly set up (or it is set up for another charset, and assuming Unicode would be a disservice to your user).
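Constructing std::locale("") throws std::runtime_error when the locale named by the environment is not available on the system, so it may be worth guarding the call. A minimal sketch, with use_environment_locale as a hypothetical helper name:

```cpp
#include <cassert>
#include <locale>
#include <stdexcept>

// Hypothetical helper: try to switch the global locale to the one named by
// the environment (LANG, LC_ALL, ...). std::locale("") throws
// std::runtime_error when that name is not supported by the system.
bool use_environment_locale()
{
    try {
        std::locale::global(std::locale(""));
        return true;
    } catch (std::runtime_error const&) {
        return false;   // the global locale is left unchanged
    }
}
```

A false return only says the environment's locale could not be loaded. A true return does not by itself prove that the locale uses a Unicode encoding.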

C Style

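Use the wcstombs and mbstowcs functions. Here is an example for what you asked.

#include <cassert>
#include <cstdlib>
#include <string>
#include <vector>

std::string narrow(std::wstring const& s)
{
    std::vector<char> result(4*s.size() + 1);
    // c_str() guarantees the terminator that wcstombs stops at.
    std::size_t used = std::wcstombs(&result[0], s.c_str(), result.size());
    // (size_t)-1 signals an unconvertible character and also fails here.
    assert(used < result.size());
    return result.data();
}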
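The opposite direction with mbstowcs can be sketched in the same style. This is an illustration rather than part of the original answer: the name widen_c is hypothetical, and the function throws instead of asserting when the input is not valid in the current locale.

```cpp
#include <cassert>
#include <cstdlib>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical counterpart to narrow(): decode a multibyte string into wide
// characters using the current C locale.
std::wstring widen_c(std::string const& s)
{
    // A multibyte character occupies at least one byte, so s.size() wide
    // characters plus a terminator is always enough room.
    std::vector<wchar_t> result(s.size() + 1);
    std::size_t used = std::mbstowcs(result.data(), s.c_str(), result.size());
    if (used == static_cast<std::size_t>(-1))
        throw std::runtime_error("invalid multibyte sequence for this locale");
    return std::wstring(result.data(), used);
}
```

Like narrow(), this stops at the first null character in the input, so strings with embedded nulls are cut short.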

C++ Style

The codecvt facet of the locale provides the needed functionality. The advantage is that you don't have to change the global locale to use it. The drawback is that the usage is more complex.

#include <cassert>
#include <cwchar>
#include <locale>
#include <string>
#include <vector>

std::string narrow(std::wstring const& s,
                   std::locale loc = std::locale())
{
    std::vector<char> result(4*s.size() + 1);
    wchar_t const* fromNext;
    char* toNext;
    mbstate_t state = {0};
    std::codecvt_base::result convResult
        = std::use_facet<std::codecvt<wchar_t, char, std::mbstate_t> >(loc)
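        // &result[0] + result.size() is the past-the-end pointer;
        // indexing result[result.size()] would be out of bounds.
        .out(state, &s[0], &s[s.size()], fromNext,
             &result[0], &result[0] + result.size(), toNext);

    assert(fromNext == &s[s.size()]);
    assert(toNext != &result[0] + result.size());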
    assert(convResult == std::codecvt_base::ok);
    *toNext = '\0';

    return &result[0];
}

std::wstring widen(std::string const& s,
                   std::locale loc = std::locale())
{
    std::vector<wchar_t> result(s.size() + 1);
    char const* fromNext;
    wchar_t* toNext;
    mbstate_t state = {0};
    std::codecvt_base::result convResult
        = std::use_facet<std::codecvt<wchar_t, char, std::mbstate_t> >(loc)
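        // &result[0] + result.size() is the past-the-end pointer;
        // indexing result[result.size()] would be out of bounds.
        .in(state, &s[0], &s[s.size()], fromNext,
            &result[0], &result[0] + result.size(), toNext);

    assert(fromNext == &s[s.size()]);
    assert(toNext != &result[0] + result.size());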
    assert(convResult == std::codecvt_base::ok);
    *toNext = L'\0';

    return &result[0];
}

You should replace the assertions with proper error handling.

BTW, this is standard C++ and doesn't assume Unicode, except for the computation of the size of result. You can do better by checking convResult, which can indicate a partial conversion.
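One possible way to replace the assertions is to inspect the conversion result and throw. The sketch below is a hypothetical variant named narrow_checked. It sizes the buffer from the facet's max_length() rather than a fixed factor of 4, which is assumed here to be a sufficient bound; encodings with shift sequences may need more room, and the partial case reports that.

```cpp
#include <cassert>
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical variant of narrow() that reports failures through exceptions.
std::string narrow_checked(std::wstring const& s,
                           std::locale const& loc = std::locale())
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> Facet;
    Facet const& facet = std::use_facet<Facet>(loc);

    if (s.empty())
        return std::string();

    // max_length() is the facet's own bound on bytes per wide character,
    // so no Unicode-specific factor is assumed.
    std::size_t perChar = static_cast<std::size_t>(facet.max_length());
    std::vector<char> result(s.size() * (perChar > 0 ? perChar : 1));

    std::mbstate_t state = std::mbstate_t();
    wchar_t const* fromNext = nullptr;
    char* toNext = nullptr;
    std::codecvt_base::result convResult =
        facet.out(state, s.data(), s.data() + s.size(), fromNext,
                  result.data(), result.data() + result.size(), toNext);

    if (convResult == std::codecvt_base::partial)
        throw std::length_error("output buffer too small for conversion");
    if (convResult != std::codecvt_base::ok)
        throw std::runtime_error("character not representable in this locale");

    // Build the string from the exact range written, keeping embedded nulls.
    return std::string(result.data(), toNext);
}
```

A caller would wrap the call in a try block and decide whether to substitute a replacement character, retry with a larger buffer, or report the error to the user.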

edited May 15, 2011 at 16:20

answered May 15, 2011 at 12:46

4 Comments

André Caron Over a year ago

Your C-style narrow() function asserts that the number of characters was "sufficient", but it doesn't truncate the string.

André Caron Over a year ago

Also, in some normalization forms, a single character can consume up to 6 bytes.

davka Over a year ago

typo: wchat_t; tried to save you the trouble by editing it myself but they don't allow 1-letter edits :)

Tyler Liu Over a year ago

If the encoding of the input string is known, then technically speaking, I do NOT "have to be in a locale". Do I?

Nemanja Trifunovic · Accepted Answer · 2015-09-15 22:22:00Z

4

The easiest way is to grab a small library, such as UTF8 CPP, and do something like:

utf8::utf8to16(line.begin(), line.end(), back_inserter(utf16line));

edited Sep 15, 2015 at 22:22

answered May 15, 2011 at 13:59

StackedCrooked · Accepted Answer · 2011-05-15 12:29:17Z

2

I usually use the UnicodeConverter class from the Poco C++ libraries. If you don't want the dependency then you can have a look at the code.

answered May 15, 2011 at 12:29
