
Consider:

STDMETHODIMP CFileSystemAPI::setRRConfig( BSTR config_str, VARIANT* ret )
{
    mReportReaderFactory.reset( new sbis::report_reader::ReportReaderFactory() );

    USES_CONVERSION;
    std::string configuration_str = W2A( config_str ); // W2A converts to the ANSI code page, not UTF-8

But in config_str I receive a UTF-16 string. How can I convert it to UTF-8 in this piece of code?


4 Answers


I implemented two variants of conversion between UTF-8, UTF-16, and UTF-32. The first variant implements all conversions from scratch; the second uses the standard std::codecvt and std::wstring_convert facilities (both deprecated since C++17, but still present, and guaranteed to exist in C++11/C++14).

If you don't like my code, you may instead use the almost-single-header C++ library utfcpp, which is well tested and widely used.

To convert UTF-8 to UTF-16, just call Utf32To16(Utf8To32(str)); to convert UTF-16 to UTF-8, call Utf32To8(Utf16To32(str)). Or you may use my handy helper function: UtfConv<std::wstring>(std::string("abc")) for UTF-8 to UTF-16, or UtfConv<std::string>(std::wstring(L"abc")) for UTF-16 to UTF-8. In fact UtfConv can convert from any UTF-encoded string to any other. See examples of these and other usages inside the Test(cs) macro.
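For instance, a quick sketch using the functions defined below (the non-ASCII literal requires the source-encoding options described later):

std::string    u8   = "Привет";                  // UTF-8
std::u16string u16  = Utf32To16(Utf8To32(u8));   // UTF-8  -> UTF-16
std::string    back = Utf32To8(Utf16To32(u16));  // UTF-16 -> UTF-8
std::wstring   w    = UtfConv<std::wstring>(u8); // UTF-8  -> wchar_t-based string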

Both variants are C++11 compliant. They compile under Clang, GCC, and MSVC (see the "Try it online!" links below) and have been tested on Windows and Linux.

You have to save both of my code snippets in files with UTF-8 encoding, and pass the options -finput-charset=UTF-8 -fexec-charset=UTF-8 to Clang/GCC, or /utf-8 to MSVC. The UTF-8 encoding and these options are needed only because I put string literals with non-ASCII characters into the code, purely for testing. To use the functions themselves you need neither the UTF-8 encoding nor the options.

The inclusions of <windows.h>, <clocale>, and <iostream>, as well as the calls to SetConsoleOutputCP(65001) and std::setlocale(LC_ALL, "en_US.UTF-8"), are needed only for testing, to set up the console for correct UTF-8 output. The conversion functions themselves don't need them.

Part of the code is not strictly necessary: the UtfHelper-related structure and functions are just convenience wrappers, created mainly to handle std::wstring in a cross-platform way, because wchar_t is usually 32-bit on Linux and 16-bit on Windows. The low-level functions Utf8To32, Utf32To8, Utf16To32, and Utf32To16 are the only things really needed for conversion.
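For example (a sketch; which specialization runs depends on the platform's wchar_t width):

std::u32string u32 = UtfTo32(std::wstring(L"abc"));
// With a 16-bit wchar_t (Windows) this dispatches to Utf16To32;
// with a 32-bit wchar_t (Linux) it is a plain pass-through copy.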

Variant 1 was written from the Wikipedia descriptions of the UTF-8 and UTF-16 encodings.

If you find bugs or possible improvements (especially in Variant 1), please tell me and I'll fix them.


Variant 1

Try it online!

#include <string>
#include <iostream>
#include <stdexcept>
#include <type_traits>
#include <cstdint>

#ifdef _WIN32
    #include <windows.h>
#else
    #include <clocale>
#endif

#define ASSERT_MSG(cond, msg) { if (!(cond)) throw std::runtime_error("Assertion (" #cond ") failed at line " + std::to_string(__LINE__) + "! Msg: " + std::string(msg)); }
#define ASSERT(cond) ASSERT_MSG(cond, "")

template <typename U8StrT = std::string>
inline static U8StrT Utf32To8(std::u32string const & s) {
    static_assert(sizeof(typename U8StrT::value_type) == 1, "Char byte-size should be 1 for UTF-8 strings!");
    typedef typename U8StrT::value_type VT;
    typedef uint8_t u8;
    U8StrT r;
    for (auto c: s) {
        // Number of UTF-8 bytes needed for this code point (up to 7, so any 32-bit value round-trips).
        size_t nby = c <= 0x7FU ? 1 : c <= 0x7FFU ? 2 : c <= 0xFFFFU ? 3 : c <= 0x1FFFFFU ? 4 : c <= 0x3FFFFFFU ? 5 : c <= 0x7FFFFFFFU ? 6 : 7;
        // First byte: nby leading 1-bits as the prefix, then the high bits of the code point.
        r.push_back(VT(
            nby <= 1 ? u8(c) : (
                (u8(0xFFU) << (8 - nby)) |
                u8(c >> (6 * (nby - 1)))
            )
        ));
        // Continuation bytes: 10xxxxxx with 6 payload bits each.
        for (size_t i = 1; i < nby; ++i)
            r.push_back(VT(u8(0x80U | (u8(0x3FU) & u8(c >> (6 * (nby - 1 - i)))))));
    }
    return r;
}

template <typename U8StrT>
inline static std::u32string Utf8To32(U8StrT const & s) {
    static_assert(sizeof(typename U8StrT::value_type) == 1, "Char byte-size should be 1 for UTF-8 strings!");
    typedef uint8_t u8;
    std::u32string r;
    auto it = (u8 const *)s.c_str(), end = (u8 const *)(s.c_str() + s.length());
    while (it < end) {
        char32_t c = 0;
        if (*it <= 0x7FU) {
            c = *it;
            ++it;
        } else {
            ASSERT((*it & 0xC0U) == 0xC0U); // Must be a lead byte (11xxxxxx).
            size_t nby = 0;
            // Count the lead byte's leading 1-bits to get the sequence length.
            for (u8 b = *it; (b & 0x80U) != 0; b <<= 1, ++nby) {(void)0;}
            ASSERT(nby <= 7);
            ASSERT(size_t(end - it) >= nby); // Enough input bytes must remain.
            c = *it & (u8(0xFFU) >> (nby + 1));
            for (size_t i = 1; i < nby; ++i) {
                ASSERT((it[i] & 0xC0U) == 0x80U);
                c = (c << 6) | (it[i] & 0x3FU);
            }
            it += nby;
        }
        r.push_back(c);
    }
    return r;
}


template <typename U16StrT = std::u16string>
inline static U16StrT Utf32To16(std::u32string const & s) {
    static_assert(sizeof(typename U16StrT::value_type) == 2, "Char byte-size should be 2 for UTF-16 strings!");
    typedef typename U16StrT::value_type VT;
    typedef uint16_t u16;
    U16StrT r;
    for (auto c: s) {
        if (c <= 0xFFFFU)
            r.push_back(VT(c));
        else {
            ASSERT(c <= 0x10FFFFU); // Code points above U+10FFFF are not representable in UTF-16.
            c -= 0x10000U;
            r.push_back(VT(u16(0xD800U | ((c >> 10) & 0x3FFU)))); // High surrogate.
            r.push_back(VT(u16(0xDC00U | (c & 0x3FFU))));         // Low surrogate.
        }
    }
    return r;
}

template <typename U16StrT>
inline static std::u32string Utf16To32(U16StrT const & s) {
    static_assert(sizeof(typename U16StrT::value_type) == 2, "Char byte-size should be 2 for UTF-16 strings!");
    typedef uint16_t u16;
    std::u32string r;
    auto it = (u16 const *)s.c_str(), end = (u16 const *)(s.c_str() + s.length());
    while (it < end) {
        char32_t c = 0;
        if (*it < 0xD800U || *it > 0xDFFFU) {
            c = *it; // Not a surrogate: the unit is the code point itself.
            ++it;
        } else if (*it >= 0xDC00U) { // A low surrogate may not appear first.
            ASSERT_MSG(false, "Unallowed UTF-16 sequence!");
        } else {
            ASSERT(end - it >= 2);
            c = (*it & 0x3FFU) << 10;
            if ((it[1] < 0xDC00U) || (it[1] > 0xDFFFU)) {
                ASSERT_MSG(false, "Unallowed UTF-16 sequence!");
            } else {
                c |= it[1] & 0x3FFU;
                c += 0x10000U;
            }
            it += 2;
        }
        r.push_back(c);
    }
    return r;
}


template <typename StrT, size_t NumBytes = sizeof(typename StrT::value_type)> struct UtfHelper;
template <typename StrT> struct UtfHelper<StrT, 1> {
    inline static std::u32string UtfTo32(StrT const & s) { return Utf8To32(s); }
    inline static StrT UtfFrom32(std::u32string const & s) { return Utf32To8<StrT>(s); }
};
template <typename StrT> struct UtfHelper<StrT, 2> {
    inline static std::u32string UtfTo32(StrT const & s) { return Utf16To32(s); }
    inline static StrT UtfFrom32(std::u32string const & s) { return Utf32To16<StrT>(s); }
};
template <typename StrT> struct UtfHelper<StrT, 4> {
    inline static std::u32string UtfTo32(StrT const & s) {
        return std::u32string((char32_t const *)(s.c_str()), (char32_t const *)(s.c_str() + s.length()));
    }
    inline static StrT UtfFrom32(std::u32string const & s) {
        return StrT((typename StrT::value_type const *)(s.c_str()),
            (typename StrT::value_type const *)(s.c_str() + s.length()));
    }
};
template <typename StrT> inline static std::u32string UtfTo32(StrT const & s) {
    return UtfHelper<StrT>::UtfTo32(s);
}
template <typename StrT> inline static StrT UtfFrom32(std::u32string const & s) {
    return UtfHelper<StrT>::UtfFrom32(s);
}
template <typename StrToT, typename StrFromT> inline static StrToT UtfConv(StrFromT const & s) {
    return UtfFrom32<StrToT>(UtfTo32(s));
}

#define Test(cs) \
    std::cout << Utf32To8(Utf8To32(std::string(cs))) << ", "; \
    std::cout << Utf32To8(Utf16To32(Utf32To16(Utf8To32(std::string(cs))))) << ", "; \
    std::cout << Utf32To8(Utf16To32(std::u16string(u##cs))) << ", "; \
    std::cout << Utf32To8(std::u32string(U##cs)) << ", "; \
    std::cout << UtfConv<std::string>(UtfConv<std::u16string>(UtfConv<std::u32string>(UtfConv<std::u32string>(UtfConv<std::u16string>(std::string(cs)))))) << ", "; \
    std::cout << UtfConv<std::string>(UtfConv<std::wstring>(UtfConv<std::string>(UtfConv<std::u32string>(UtfConv<std::u32string>(std::string(cs)))))) << ", "; \
    std::cout << UtfFrom32<std::string>(UtfTo32(std::string(cs))) << ", "; \
    std::cout << UtfFrom32<std::string>(UtfTo32(std::u16string(u##cs))) << ", "; \
    std::cout << UtfFrom32<std::string>(UtfTo32(std::wstring(L##cs))) << ", "; \
    std::cout << UtfFrom32<std::string>(UtfTo32(std::u32string(U##cs))) << std::endl; \
    std::cout << "UTF-8 num bytes: " << std::dec << Utf32To8(std::u32string(U##cs)).size() << ", "; \
    std::cout << "UTF-16 num bytes: " << std::dec << (Utf32To16(std::u32string(U##cs)).size() * 2) << std::endl;

int main() {
    #ifdef _WIN32
        SetConsoleOutputCP(65001);
    #else
        std::setlocale(LC_ALL, "en_US.UTF-8");
    #endif
    try {
        Test("World");
        Test("Привет");
        Test("𐐷𤭢");
        Test("𝞹");
        return 0;
    } catch (std::exception const & ex) {
        std::cout << "Exception: " << ex.what() << std::endl;
        return -1;
    }
}

Output:

World, World, World, World, World, World, World, World, World, World
UTF-8 num bytes: 5, UTF-16 num bytes: 10
Привет, Привет, Привет, Привет, Привет, Привет, Привет, Привет, Привет, Привет
UTF-8 num bytes: 12, UTF-16 num bytes: 12
𐐷𤭢, 𐐷𤭢, 𐐷𤭢, 𐐷𤭢, 𐐷𤭢, 𐐷𤭢, 𐐷𤭢, 𐐷𤭢, 𐐷𤭢, 𐐷𤭢
UTF-8 num bytes: 8, UTF-16 num bytes: 8
𝞹, 𝞹, 𝞹, 𝞹, 𝞹, 𝞹, 𝞹, 𝞹, 𝞹, 𝞹
UTF-8 num bytes: 4, UTF-16 num bytes: 4

Variant 2

Try it online!

#include <string>
#include <iostream>
#include <stdexcept>
#include <type_traits>
#include <locale>
#include <codecvt>
#include <cstdint>

#ifdef _WIN32
    #include <windows.h>
#else
    #include <clocale>
#endif

#define ASSERT(cond) { if (!(cond)) throw std::runtime_error("Assertion (" #cond ") failed at line " + std::to_string(__LINE__) + "!"); }

// Workaround for some of MSVC compilers.
#if defined(_MSC_VER) && (!_DLL) && (_MSC_VER >= 1900 /* VS 2015*/) && (_MSC_VER <= 1914 /* VS 2017 */)
std::locale::id std::codecvt<char16_t, char, _Mbstatet>::id;
std::locale::id std::codecvt<char32_t, char, _Mbstatet>::id;
#endif

template <typename U8StrT>
inline static std::u32string Utf8To32(U8StrT const & s) {
    static_assert(sizeof(typename U8StrT::value_type) == 1, "Char byte-size should be 1 for UTF-8 strings!");
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> utf_8_32_conv_;
    return utf_8_32_conv_.from_bytes((char const *)s.c_str(), (char const *)(s.c_str() + s.length()));
}

template <typename U8StrT = std::string>
inline static U8StrT Utf32To8(std::u32string const & s) {
    static_assert(sizeof(typename U8StrT::value_type) == 1, "Char byte-size should be 1 for UTF-8 strings!");
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> utf_8_32_conv_;
    std::string res = utf_8_32_conv_.to_bytes(s.c_str(), s.c_str() + s.length());
    return U8StrT(
        (typename U8StrT::value_type const *)(res.c_str()),
        (typename U8StrT::value_type const *)(res.c_str() + res.length()));
}

template <typename U16StrT>
inline static std::u32string Utf16To32(U16StrT const & s) {
    static_assert(sizeof(typename U16StrT::value_type) == 2, "Char byte-size should be 2 for UTF-16 strings!");
    // The std::little_endian flag means the char16_t buffer's bytes are read as
    // little-endian, so this assumes a little-endian host.
    std::wstring_convert<std::codecvt_utf16<char32_t, 0x10ffff, std::little_endian>, char32_t> utf_16_32_conv_;
    return utf_16_32_conv_.from_bytes((char const *)s.c_str(), (char const *)(s.c_str() + s.length()));
}

template <typename U16StrT = std::u16string>
inline static U16StrT Utf32To16(std::u32string const & s) {
    static_assert(sizeof(typename U16StrT::value_type) == 2, "Char byte-size should be 2 for UTF-16 strings!");
    std::wstring_convert<std::codecvt_utf16<char32_t, 0x10ffff, std::little_endian>, char32_t> utf_16_32_conv_;
    std::string res = utf_16_32_conv_.to_bytes(s.c_str(), s.c_str() + s.length());
    return U16StrT(
        (typename U16StrT::value_type const *)(res.c_str()),
        (typename U16StrT::value_type const *)(res.c_str() + res.length()));
}


template <typename StrT, size_t NumBytes = sizeof(typename StrT::value_type)> struct UtfHelper;
template <typename StrT> struct UtfHelper<StrT, 1> {
    inline static std::u32string UtfTo32(StrT const & s) { return Utf8To32(s); }
    inline static StrT UtfFrom32(std::u32string const & s) { return Utf32To8<StrT>(s); }
};
template <typename StrT> struct UtfHelper<StrT, 2> {
    inline static std::u32string UtfTo32(StrT const & s) { return Utf16To32(s); }
    inline static StrT UtfFrom32(std::u32string const & s) { return Utf32To16<StrT>(s); }
};
template <typename StrT> struct UtfHelper<StrT, 4> {
    inline static std::u32string UtfTo32(StrT const & s) {
        return std::u32string((char32_t const *)(s.c_str()), (char32_t const *)(s.c_str() + s.length()));
    }
    inline static StrT UtfFrom32(std::u32string const & s) {
        return StrT((typename StrT::value_type const *)(s.c_str()),
            (typename StrT::value_type const *)(s.c_str() + s.length()));
    }
};
template <typename StrT> inline static std::u32string UtfTo32(StrT const & s) {
    return UtfHelper<StrT>::UtfTo32(s);
}
template <typename StrT> inline static StrT UtfFrom32(std::u32string const & s) {
    return UtfHelper<StrT>::UtfFrom32(s);
}
template <typename StrToT, typename StrFromT> inline static StrToT UtfConv(StrFromT const & s) {
    return UtfFrom32<StrToT>(UtfTo32(s));
}

#define Test(cs) \
    std::cout << Utf32To8(Utf8To32(std::string(cs))) << ", "; \
    std::cout << Utf32To8(Utf16To32(Utf32To16(Utf8To32(std::string(cs))))) << ", "; \
    std::cout << Utf32To8(Utf16To32(std::u16string(u##cs))) << ", "; \
    std::cout << Utf32To8(std::u32string(U##cs)) << ", "; \
    std::cout << UtfConv<std::string>(UtfConv<std::u16string>(UtfConv<std::u32string>(UtfConv<std::u32string>(UtfConv<std::u16string>(std::string(cs)))))) << ", "; \
    std::cout << UtfConv<std::string>(UtfConv<std::wstring>(UtfConv<std::string>(UtfConv<std::u32string>(UtfConv<std::u32string>(std::string(cs)))))) << ", "; \
    std::cout << UtfFrom32<std::string>(UtfTo32(std::string(cs))) << ", "; \
    std::cout << UtfFrom32<std::string>(UtfTo32(std::u16string(u##cs))) << ", "; \
    std::cout << UtfFrom32<std::string>(UtfTo32(std::wstring(L##cs))) << ", "; \
    std::cout << UtfFrom32<std::string>(UtfTo32(std::u32string(U##cs))) << std::endl; \
    std::cout << "UTF-8 num bytes: " << std::dec << Utf32To8(std::u32string(U##cs)).size() << ", "; \
    std::cout << "UTF-16 num bytes: " << std::dec << (Utf32To16(std::u32string(U##cs)).size() * 2) << std::endl;

int main() {
    #ifdef _WIN32
        SetConsoleOutputCP(65001);
    #else
        std::setlocale(LC_ALL, "en_US.UTF-8");
    #endif
    try {
        Test("World");
        Test("Привет");
        Test("𐐷𤭢");
        Test("𝞹");
        return 0;
    } catch (std::exception const & ex) {
        std::cout << "Exception: " << ex.what() << std::endl;
        return -1;
    }
}

Output:

World, World, World, World, World, World, World, World, World, World
UTF-8 num bytes: 5, UTF-16 num bytes: 10
Привет, Привет, Привет, Привет, Привет, Привет, Привет, Привет, Привет, Привет
UTF-8 num bytes: 12, UTF-16 num bytes: 12
𐐷𤭢, 𐐷𤭢, 𐐷𤭢, 𐐷𤭢, 𐐷𤭢, 𐐷𤭢, 𐐷𤭢, 𐐷𤭢, 𐐷𤭢, 𐐷𤭢
UTF-8 num bytes: 8, UTF-16 num bytes: 8
𝞹, 𝞹, 𝞹, 𝞹, 𝞹, 𝞹, 𝞹, 𝞹, 𝞹, 𝞹
UTF-8 num bytes: 4, UTF-16 num bytes: 4



You can do something like this:

#include <string>
#include <vector>
#include <sstream>
#include <stdexcept>
#include <windows.h>

std::string WstrToUtf8Str(const std::wstring& wstr)
{
  std::string retStr;
  if (!wstr.empty())
  {
    // First call computes the required buffer size in bytes; passing -1 as the
    // source length makes it include the terminating null.
    int sizeRequired = WideCharToMultiByte(CP_UTF8, 0, wstr.c_str(), -1, NULL, 0, NULL, NULL);

    if (sizeRequired > 0)
    {
      std::vector<char> utf8String(sizeRequired);
      int bytesConverted = WideCharToMultiByte(CP_UTF8, 0, wstr.c_str(),
                           -1, &utf8String[0], (int)utf8String.size(), NULL,
                           NULL);
      if (bytesConverted != 0)
      {
        retStr = &utf8String[0];
      }
      else
      {
        // Note: don't stream the wide string into a narrow std::stringstream --
        // that would print the pointer value, not the text.
        std::stringstream err;
        err << __FUNCTION__ << " failed to convert wstring to UTF-8";
        throw std::runtime_error( err.str() );
      }
    }
  }
  return retStr;
}

You can pass your BSTR to the function as a std::wstring.
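For example, in the question's method it could be used like this (a sketch; SysStringLen returns the BSTR's length in characters):

STDMETHODIMP CFileSystemAPI::setRRConfig( BSTR config_str, VARIANT* ret )
{
    // A BSTR points to a length-prefixed wide-character string,
    // so it can be wrapped in a std::wstring directly.
    std::string configuration_str =
        WstrToUtf8Str( std::wstring( config_str, SysStringLen( config_str ) ) );
    // ...
}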


Another option is a small function that UTF-8-encodes one code point at a time:

// Appends the UTF-8 encoding of one code point to buffer, advancing *offset.
// Note: on Windows wchar_t is only 16 bits, so the branches above 0xFFFF
// are reachable only on platforms with a 32-bit wchar_t.
void encode_unicode_character(char* buffer, int* offset, wchar_t ucs_character)
{
    if (ucs_character <= 0x7F)
    {
        // Plain single-byte ASCII.
        buffer[(*offset)++] = (char) ucs_character;
    }
    else if (ucs_character <= 0x7FF)
    {
        // Two bytes.
        buffer[(*offset)++] = (char)(0xC0 | (ucs_character >> 6));
        buffer[(*offset)++] = (char)(0x80 | ((ucs_character >> 0) & 0x3F));
    }
    else if (ucs_character <= 0xFFFF)
    {
        // Three bytes.
        buffer[(*offset)++] = (char)(0xE0 | (ucs_character >> 12));
        buffer[(*offset)++] = (char)(0x80 | ((ucs_character >> 6) & 0x3F));
        buffer[(*offset)++] = (char)(0x80 | ((ucs_character >> 0) & 0x3F));
    }
    else if (ucs_character <= 0x1FFFFF)
    {
        // Four bytes.
        buffer[(*offset)++] = (char)(0xF0 | (ucs_character >> 18));
        buffer[(*offset)++] = (char)(0x80 | ((ucs_character >> 12) & 0x3F));
        buffer[(*offset)++] = (char)(0x80 | ((ucs_character >> 6) & 0x3F));
        buffer[(*offset)++] = (char)(0x80 | ((ucs_character >> 0) & 0x3F));
    }
    else if (ucs_character <= 0x3FFFFFF)
    {
        // Five bytes.
        buffer[(*offset)++] = (char)(0xF8 | (ucs_character >> 24));
        buffer[(*offset)++] = (char)(0x80 | ((ucs_character >> 18) & 0x3F));
        buffer[(*offset)++] = (char)(0x80 | ((ucs_character >> 12) & 0x3F));
        buffer[(*offset)++] = (char)(0x80 | ((ucs_character >> 6) & 0x3F));
        buffer[(*offset)++] = (char)(0x80 | ((ucs_character >> 0) & 0x3F));
    }
    else if (ucs_character <= 0x7FFFFFFF)
    {
        // Six bytes.
        buffer[(*offset)++] = (char)(0xFC | (ucs_character >> 30));
        buffer[(*offset)++] = (char)(0x80 | ((ucs_character >> 24) & 0x3F));
        buffer[(*offset)++] = (char)(0x80 | ((ucs_character >> 18) & 0x3F));
        buffer[(*offset)++] = (char)(0x80 | ((ucs_character >> 12) & 0x3F));
        buffer[(*offset)++] = (char)(0x80 | ((ucs_character >> 6) & 0x3F));
        buffer[(*offset)++] = (char)(0x80 | ((ucs_character >> 0) & 0x3F));
    }
    else
    {
        // Invalid code point; don't encode anything.
    }
}

ISO/IEC 10646:2012 is all you need to understand UCS.
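Note that this function takes a whole code point (effectively UTF-32), not UTF-16 code units; as the comments below point out, to use it with UTF-16 input a surrogate pair must first be combined into one code point. A minimal sketch (the helper name decode_utf16_pair is mine):

#include <stdint.h>

// Combine a UTF-16 surrogate pair into a single code point.
// Assumes 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF.
uint32_t decode_utf16_pair(uint16_t high, uint16_t low)
{
    return 0x10000u + ((uint32_t)(high - 0xD800u) << 10) + (uint32_t)(low - 0xDC00u);
}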

Comments:

UCS is not in the question. And UCS is not UTF-16. Is your code valid for UTF-16?
I did not say that UTF-16 is UCS; rather, it is part of it, like UTF-8 is.
@kw I think you should answer the question that was asked rather than the question that you happen to have some code for. Take another read of the question title and observe that the question concerns conversion from UTF-16 to UTF-8.
We don't understand why you mention UCS when the question is about UTF-16. We also fail to understand why you present code that converts UTF-32 to UTF-8 when the question is about UTF-16. It is also a mistake to assume that wchar_t can be used to hold a UTF-32 character element. On some systems it can, but not all.
@kw No. The other way round. The question asks, and I feel like a stuck record, to convert from UTF-16. That means that the input is UTF-16. Your code converts from UTF-32. So to use it one would need to convert from UTF-16 to UTF-32, and then on to UTF-8.
2

If you are using C++11, you may check this out:

http://www.cplusplus.com/reference/codecvt/codecvt_utf8_utf16/
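For example, here is a minimal sketch using std::wstring_convert with that facet (both deprecated since C++17; the helper name Utf16ToUtf8 is just for illustration):

#include <codecvt>
#include <locale>
#include <string>

std::string Utf16ToUtf8(const std::u16string& s)
{
    // codecvt_utf8_utf16<char16_t> converts between UTF-16 code units and UTF-8 bytes.
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.to_bytes(s); // throws std::range_error on invalid input
}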

Comments:

Could you show me an example? I do not understand how to work with it. The BSTR input parameter is UTF-16LE.
I haven't the time to create one, but the linked page covers that very explicitly. I hope that helps.
@user3252635 there is an example in the linked documentation. There are better examples at cppreference.com. Also, look at std::wstring_convert.
Deprecated in C++17, removed in C++26.
